You’ve tried everything: changing the size of the buffer, page-aligning your code, even waiting extra cycles. Your code is still broken. When you turn on debug mode for the target process, or step through with a debugger, it works perfectly, but that isn’t good enough. Your code doesn’t self-modify, so you shouldn’t have to worry about cache coherency, right?
Since we finished exploring lateral attacks, the research team has taken some time to dig into the shellcoding oddities that puzzled us earlier, and we’d like to share what we've learned.
MIPS: A Short Explanation and Solution
When the TP-Link’s MIPS processor wrote our shellcode to the executable heap, it wrote the shellcode only to the data cache, not to main memory. Modified areas in the data cache are marked for later syncing with main memory. However, although the heap was marked executable, the processor didn’t automatically recognize our bytes as code and never updated the instruction cache with our new values. What’s more, even if the instruction cache had synced with main memory before our code ran, it still wouldn’t have received our values, because they had not yet been written from the data cache to main memory. Before our shellcode could run, it needed to move from the data cache to the instruction cache, by way of main memory, and that wasn’t happening.
This explained the strange crashes. After our stack buffer overflow overwrote the stored return address with our shellcode address, the processor directed execution to the correct location because the return address was data. However, it executed the old instructions that still occupied the instruction cache, rather than the ones we had recently written to the data cache. The buffer had previously been filled mostly with zeros, which MIPS interprets as NOPs. Core dumps showed an apparent “jump” to the middle of our shellcode because the processor loaded our values just before, or while, generating the core dump. The processor hadn’t synced because it assumed that the instructions that had been at that location would still be at that location, a reasonable assumption given that code does not usually change mid-execution. There are legitimate reasons for modifying code (most importantly, every time a new process loads), so chip manufacturers generally provide ways to flush the data and instruction caches.
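The stale-fetch behavior described above can be illustrated with a toy model. The sketch below is not the TP-Link's cache hardware; it is a deliberately simplified C model in which stores land only in a data cache and instruction fetches fill only from main memory. All names (`toy_cpu`, `toy_store`, `toy_fetch`, and so on) are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MEM_WORDS 16

/* Toy machine: one main memory plus separate data and instruction caches. */
typedef struct {
    uint32_t mem[MEM_WORDS];      /* main memory */
    uint32_t dcache[MEM_WORDS];   /* data cache contents */
    uint8_t  d_dirty[MEM_WORDS];  /* 1 = written to the data cache, not yet to memory */
    uint32_t icache[MEM_WORDS];   /* instruction cache contents */
    uint8_t  i_valid[MEM_WORDS];  /* 1 = instruction cache holds a copy */
} toy_cpu;

/* Start with zeroed memory (zero words are NOPs on MIPS) and empty caches. */
void toy_init(toy_cpu *c) { memset(c, 0, sizeof *c); }

/* A store only touches the data cache (write-back policy). */
void toy_store(toy_cpu *c, unsigned addr, uint32_t value) {
    assert(addr < MEM_WORDS);
    c->dcache[addr] = value;
    c->d_dirty[addr] = 1;
}

/* An instruction fetch fills from main memory on a miss, never from the data cache. */
uint32_t toy_fetch(toy_cpu *c, unsigned addr) {
    assert(addr < MEM_WORDS);
    if (!c->i_valid[addr]) {
        c->icache[addr] = c->mem[addr];
        c->i_valid[addr] = 1;
    }
    return c->icache[addr];
}

/* Write every dirty data cache word back to main memory. */
void toy_dcache_writeback(toy_cpu *c) {
    for (unsigned a = 0; a < MEM_WORDS; a++) {
        if (c->d_dirty[a]) {
            c->mem[a] = c->dcache[a];
            c->d_dirty[a] = 0;
        }
    }
}

/* Drop all instruction cache contents so the next fetch misses. */
void toy_icache_invalidate(toy_cpu *c) {
    memset(c->i_valid, 0, sizeof c->i_valid);
}
```

In this model, calling `toy_store()` and then `toy_fetch()` on an address that was fetched earlier still returns the old zero word. The new value only becomes visible to `toy_fetch()` after both `toy_dcache_writeback()` and `toy_icache_invalidate()` have run, which mirrors the two-step path from data cache to main memory to instruction cache.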
One easy way to cause a data cache write to main memory is to call sleep(), a well-known strategy that suspends the calling process for a specified period of time. Originally, our ROP chain consisted of only two addresses: one to calculate the address of the shellcode buffer from two registers we controlled on the stack, and the next to jump to the calculated address.
Read on for more details about syncing the MIPS cache and why calling sleep() works, or scroll down for a discussion of ARM cache coherency problems.
In Depth on MIPS Caching
The data and instruction caches store between 8 and 64 KB of values, depending on the MIPS processor. The instruction cache will sync with main memory in three cases: when the processor encounters a syncing instruction, when execution is directed to a location outside the bounds of what is stored in the instruction cache, and after cache initialization. With a jump to the heap from a library more than a page away, we can be fairly certain that the values there will not be in the instruction cache, but we still need to write the data cache to main memory.
During sleep, a process or thread gives up its allotted time and yields execution to the next scheduled process. However, a context switch on MIPS does not necessitate a cache flush. On older chips it may, but on modern MIPS instruction cache architectures, cached addresses are tagged with an ID corresponding to the process they belong to, resulting in those addresses staying in cache rather than slowing down the context switch process any further. Without these IDs, the processor would have to sync the caches during every context switch, which would make context switching even more expensive. So how did sleep() trigger a data cache write back to main memory?
The two ways data caches are designed to write to main memory are write-back and write-through. Write-through means every memory modification triggers a write out to main memory and the appropriate cache. This ensures data from the cache will not be lost, but greatly slows down processing speed. The other method is write-back, where data is written only to the copy in the cache, and the subsequent write to main memory is postponed for an optimal time. MIPS uses the write-back method (if it didn’t, we wouldn’t have these problems) so we need to wait until the blocks of memory in the cache containing the modified values are written to main memory. This can be triggered a few different ways.
One trigger is any Direct Memory Access (DMA). Because the processor needs to ensure that the correct bytes are in memory before access occurs, it syncs the data cache with main memory to complete any pending writes to the selected memory. Another trigger is when the data cache needs to reuse the cache blocks containing modified values to hold other memory. As noted before, the data cache size is at least 8 KB, large enough that this should rarely happen. However, during a context switch, if the data cache requires enough new memory that it needs in-use blocks, it will trigger a write-back of modified data, moving our shellcode from the data cache to main memory.
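The second trigger, eviction of a modified block, can be sketched with a tiny direct-mapped write-back cache. This is an illustrative model only: the four-line size, the one-word lines, and every name (`wb_cache`, `wb_store`, `wb_load`) are assumptions made for the example, not details of any real MIPS part.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LINES 4          /* toy cache: four one-word lines */
#define MEM_WORDS 32

typedef struct {
    uint32_t mem[MEM_WORDS]; /* main memory */
    uint32_t data[LINES];    /* cached values */
    unsigned tag[LINES];     /* which memory word each line holds */
    uint8_t  valid[LINES];
    uint8_t  dirty[LINES];   /* 1 = modified in cache, memory is stale */
    unsigned writebacks;     /* number of dirty lines written to memory */
} wb_cache;

void wb_init(wb_cache *c) { memset(c, 0, sizeof *c); }

/* Make the line for addr resident, evicting the previous occupant if needed.
 * A dirty occupant is written back to main memory before it is replaced. */
static unsigned wb_fill(wb_cache *c, unsigned addr) {
    assert(addr < MEM_WORDS);
    unsigned idx = addr % LINES;
    if (!c->valid[idx] || c->tag[idx] != addr) {
        if (c->valid[idx] && c->dirty[idx]) {
            c->mem[c->tag[idx]] = c->data[idx];
            c->writebacks++;
        }
        c->data[idx] = c->mem[addr];
        c->tag[idx] = addr;
        c->valid[idx] = 1;
        c->dirty[idx] = 0;
    }
    return idx;
}

/* Write-back store: update the cached copy only and mark it dirty. */
void wb_store(wb_cache *c, unsigned addr, uint32_t v) {
    unsigned idx = wb_fill(c, addr);
    c->data[idx] = v;
    c->dirty[idx] = 1;
}

/* Load through the cache; may evict (and write back) a conflicting line. */
uint32_t wb_load(wb_cache *c, unsigned addr) {
    return c->data[wb_fill(c, addr)];
}
```

Here a `wb_store()` to word 5 leaves `mem[5]` untouched until some other access, such as `wb_load()` of word 9, maps to the same line and forces the dirty value out to memory. In the scenario above, the other process scheduled during sleep() plays the role of that conflicting access.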
As before, when the sleeping process woke, it caused an instruction cache miss when directing execution to our shellcode, because the address of the shellcode was far from where the processor expected to execute next. This time, our shellcode was in main memory, ready to be loaded into the instruction cache and executed.
Wait, Isn't This a Problem on ARM Too?
If you are on ARMv7 or newer and running into odd problems, one solution is to execute data barrier and instruction cache sync instructions after you write but before you execute your new bytes, as shown below.
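As a rough C-level sketch of that write-then-sync-then-execute ordering (not the shellcode used in this research), the function below issues a data barrier, asks the toolchain to synchronize the caches for the range, and then issues an instruction barrier. It assumes GCC or Clang, since `__builtin___clear_cache` is a compiler builtin; the `dsb`/`isb` inline assembly is only compiled for 32-bit ARMv7 and newer targets, and the function name `sync_code_region` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Make freshly written bytes in [buf, buf + len) safe to execute.
 * Assumes a GCC/Clang toolchain; on targets with coherent caches the
 * builtin may compile to nothing. */
void sync_code_region(void *buf, size_t len) {
    assert(buf != NULL);
    char *begin = (char *)buf;

#if defined(__arm__) && defined(__ARM_ARCH) && (__ARM_ARCH >= 7)
    /* Data synchronization barrier: wait for the stores that wrote the
     * new bytes to complete. */
    __asm__ volatile("dsb sy" ::: "memory");
#endif

    /* Compiler builtin that performs whatever data cache clean and
     * instruction cache invalidate the target requires for this range. */
    __builtin___clear_cache(begin, begin + len);

#if defined(__arm__) && defined(__ARM_ARCH) && (__ARM_ARCH >= 7)
    /* Instruction synchronization barrier: discard instructions that were
     * already fetched before the caches were synchronized. */
    __asm__ volatile("isb sy" ::: "memory");
#endif
}
```

The call belongs after the last write to the buffer and before the first jump into it. In raw shellcode there is no compiler builtin to lean on, so the equivalent work has to be done with the barrier instructions themselves or with a library or system call.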
One thing we could do is call mprotect() from libc on the modified shellcode, but an even easier thing is to call sleep() just like we did on MIPS. We ran a series of experiments and determined that calling sleep() caused the caches to sync on ARMv6.
Our shellcode was limited by a filter, so, although we were executing shellcode at this point, we took advantage of functions in libc. We found the address of sleep, but its lower byte was below the threshold of the filter. We added 0x20 (the lowest byte value allowed) to the address to pass it through the filter, then subtracted it again in our shellcode, as shown to the right.
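The arithmetic behind that trick is simple enough to sketch in C. The address below is made up for illustration, and the filter is modeled only as a minimum byte value of 0x20, which is an assumption: the real filter may reject other bytes as well. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define FILTER_MIN 0x20u  /* lowest byte value the filter lets through */

/* Return 1 if every byte of the 32-bit value is at or above the minimum. */
int passes_filter(uint32_t v) {
    for (int i = 0; i < 4; i++) {
        if (((v >> (8 * i)) & 0xffu) < FILTER_MIN) {
            return 0;
        }
    }
    return 1;
}

/* Value placed in the payload: the real address plus the offset. */
uint32_t encode_addr(uint32_t addr) {
    return addr + FILTER_MIN;
}

/* What the shellcode does at run time: subtract the offset again. */
uint32_t decode_addr(uint32_t encoded) {
    assert(encoded >= FILTER_MIN);  /* guard against wrapping below zero */
    return encoded - FILTER_MIN;
}
```

With a hypothetical sleep address of `0x76f2b510`, the low byte `0x10` fails `passes_filter()`. `encode_addr()` yields `0x76f2b530`, which passes, and `decode_addr()` recovers the original address before the call.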
Although context switches don't directly cause cache invalidation, we suspect that the next process to execute often uses enough of the instruction cache that it requires blocks belonging to the sleeping process. The technique worked well on this processor and platform, but if it doesn’t work for you, we recommend using mprotect() for higher certainty.
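For reference, a minimal sketch of the mprotect() approach from ordinary C code might look like the following. It assumes a Linux-like system with `mmap` and `MAP_ANONYMOUS`; the function name `stage_code` is hypothetical, and whether the permission change also performs the cache maintenance you need depends on the kernel and platform.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* expose MAP_ANONYMOUS under strict -std modes */
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Copy code bytes into a fresh page, then switch the page from writable to
 * executable with mprotect(). Returns the mapping, or NULL on failure. */
void *stage_code(const unsigned char *code, size_t len) {
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0 && len <= (size_t)page);

    void *buf = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) {
        return NULL;
    }

    memcpy(buf, code, len);

    /* The protection change is the step the text relies on. */
    if (mprotect(buf, (size_t)page, PROT_READ | PROT_EXEC) != 0) {
        munmap(buf, (size_t)page);
        return NULL;
    }
    return buf;
}
```

The sketch stops after the protection change; actually jumping to the buffer would require valid machine code for the target architecture. Note that mprotect() works on whole pages, so the start address passed to it must be page-aligned.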
We had fun digging into these issues. Diagnosing computer problems reminds us how difficult it can be to diagnose health conditions. Symptoms show up in a different location than their cause, like pain referred from one part of the leg to another, and simply observing the problem can change its behavior. Embedded devices were designed to be black boxes, telling us nothing and quietly going about the one task they were designed to do. With more insight into their behavior, we can begin to solve the security problems that confound us.
Just getting started in security? Check out the recent video series on the fundamentals of device security. Old hand? Try our team's research on lateral attacks, the vulnerability our ARM work was based on, and the MIPS-based router vulnerability.