Tales From The Crash Mines

06 Feb 2014, 07:26 PST

Over at Mike Ash's blog, he's posted my first issue of "Tales From The Crash Mines". I'll be writing this as an intermittent series on the interesting bugs we've tracked down on iOS and Mac OS X, interpreting crash reports, and software failure analysis and crash reporting in general.

The first issue explores a pretty interesting and educational bug, with a focus on the methodologies we use to derive more data from a fairly tricky crash report, and in the process, reconstruct exactly how an application failed.

Check it out! If you prefer PDFs (or dead tree reading), I've also posted a nicely formatted PDF generated from the original LaTeX sources.

Reliable Crash Reporting - v1.1

19 Jan 2013, 12:37 PST

A bit over a year ago, I wrote a blog post on Reliable Crash Reporting, documenting the complexity of reliably generating crash reports and how seemingly innocuous decisions could lead to failure in the crash reporter, or even corruption of user data. This was based on my experience writing PLCrashReporter, a standalone open-source crash reporting library that I've been maintaining (and using in our production applications) since around 2008.

Given that there have been a number of new entrants into the space, including KSCrash and Crashlytics 2.0, I thought it would be fun to revisit the previous post and review the current state of the art.

While I'd suggest reading the original post for the backstory on what makes reliable crash reporting difficult -- and why it matters -- I'll repeat the most pertinent section here:

Implementing a reliable and safe crash reporter on iOS presents a unique challenge: an application may not fork() additional processes, and all handling of the crash must occur in the crashed process. If this sounds like a difficult proposition to you, you're right. Consider the state of a crashed process: the crashed thread abruptly stopped executing, memory might have been corrupted, data structures may be partially updated, and locks may be held by a paused thread. It's within this hostile environment that the crash reporter must reliably produce an accurate crash report.

Today I'll touch on two reliability issues that remain in modern crash reporters -- handling stack overflows, and async-safety. The stack overflow issue is especially frustrating to me, given that it affects PLCrashReporter, too.

Async-Safety

Imagine that the application has just acquired a lock prior to crashing. If the crash reporter attempts to acquire the same lock, it will wait forever: the crashed thread is no longer running, and it will never release the lock. When a deadlock like this occurs on iOS, the application will appear unresponsive for 20+ seconds until the system watchdog terminates the process, or the user force quits the application, and no valid crash report will be written.

In my previous post, I touched on async-safety issues around Objective-C and re-entrantly running the user's code. Most crash reporters have moved away from those APIs, but have introduced new async-safety issues in the process.

One of the common issues I found in all of the new reporters was reliance on the pthread(3) API to fetch thread information, including thread names. These APIs are not async-safe, however, and will acquire a global lock in most cases -- including when fetching a thread's name via pthread_getname_np(3). The result is that if your code crashes while any thread is holding the pthread thread-list lock, the entire crash reporter will deadlock. Since the crash reporters suspend all threads during reporting, this can occur even if the pthread calls themselves do not crash; it is enough that some thread happened to be executing a pthread call at the time the crash occurred.

I put together the following test case to demonstrate this issue. It will cause crash reporters that make use of pthreads to deadlock, either until the user force-quits, or until the iOS watchdog kills the process (after 20 or so seconds).

#import <Foundation/Foundation.h> /* NSLog(); also pulls in signal() and exit() */
#import <pthread.h>
 
static void unsafe_signal_handler (int signo) {
    /* Try to fetch thread names with the pthread API */
    char name[512];
    NSLog(@"Trying to use the pthread API from a signal handler. Is a deadlock coming?");
    pthread_getname_np(pthread_self(), name, sizeof(name));
      
    // We'll never reach this point. The process will stop here until the OS watchdog
    // kills it in 20+ seconds, or the user force quits it. No crash report (or a partial corrupt
    // one) will be written.
    NSLog(@"We'll never reach this point.");
  
    exit(1);
}
  
static void *enable_threading (void *ctx) {
    return NULL;
}
  
int main(int argc, char *argv[]) {
    /* Remove this line to test your own crash reporter */
    signal(SIGSEGV, unsafe_signal_handler);
 
    /* We have to use pthread_create() to enable locking in malloc/pthreads/etc -- this
     * would happen by default in any real application, as the standard frameworks
     * (such as dispatch) will trigger similar calls into the pthread APIs. */
    pthread_t thr;
    pthread_create(&thr, NULL, enable_threading, NULL);
 
    /* This is the actual code that triggers a reproducible deadlock; include this
     * in your own app to test a different crash reporter's behavior.
     *
     * While this is a simple test case to reliably trigger a deadlock, it's not necessary
     * to crash inside of a pthread call to trigger this bug. Any thread sitting inside of
     * pthread() at the time a crash occurs would trigger the same deadlock. */
    pthread_getname_np(pthread_self(), (char *)0x1, 1);     
 
    return 0;
}

This is the primary reason PLCrashReporter does not provide thread names in its crash reports; this requires either calling non-async-safe API, or directly accessing system-private structures that are often changed release-to-release. If there's significant user demand, I may consider adding optional support for fetching thread names by poking around in system-private structures.
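For contrast with the unsafe handler in the test case, here is a minimal sketch of the kind of output code a signal handler can use without taking any locks: it formats an address by hand and emits it with write(2), which POSIX lists as async-signal-safe. The names crash_format_hex and crash_write_all are hypothetical helpers made up for this illustration; they are not part of PLCrashReporter or any other reporter.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical helper: format `value` as "0x..." into `buf` without calling
 * stdio, malloc, or anything else that might acquire a lock.
 * Returns the number of characters written (excluding the NUL), or 0 if
 * `buf` is too small. */
size_t crash_format_hex(uintptr_t value, char *buf, size_t buflen) {
    static const char digits[] = "0123456789abcdef";
    char tmp[sizeof(uintptr_t) * 2];
    size_t n = 0;

    /* Collect digits least-significant first; always emit at least one. */
    do {
        tmp[n++] = digits[value & 0xf];
        value >>= 4;
    } while (value != 0);

    /* Need room for "0x", the digits, and the terminating NUL. */
    if (buflen < n + 3)
        return 0;

    buf[0] = '0';
    buf[1] = 'x';
    for (size_t i = 0; i < n; i++)
        buf[2 + i] = tmp[n - 1 - i];
    buf[2 + n] = '\0';
    return n + 2;
}

/* Hypothetical helper: write all `len` bytes using only write(2), retrying
 * on short writes and EINTR. Returns 0 on success, -1 on error. */
int crash_write_all(int fd, const char *data, size_t len) {
    while (len > 0) {
        ssize_t w = write(fd, data, len);
        if (w < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (w == 0)
            return -1;
        data += w;
        len -= (size_t)w;
    }
    return 0;
}
```

A handler built this way would open its output file before any crash occurs, then call crash_format_hex() on a stack buffer and pass the result to crash_write_all(). It cannot recover thread names, which is exactly the trade-off described above.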

Stack Overflow

When a thread's stack overflows, there is no stack space left over for a signal handler to use, which results in the inability to record the crash.

This can be handled partially with sigaltstack(2), which instructs the kernel to insert an alternative stack for use by the signal handler. This is functional but imperfect, as the API requires registering a custom signal stack for every thread in the process. Despite the sigaltstack(2) man page's implication that the registered stack is process-global, the stack is only enabled for the thread calling sigaltstack(2). The result is that stack overflows can only be handled on the main thread, unless additional threads are manually registered.

On Mac OS X, we can make use of a more capable API -- Mach exception handling -- to fully solve this problem. Since Mach exceptions are handled on a dedicated thread (or out of process entirely), the crash reporter does not depend on the crashed thread's stack at all. Unfortunately, the requisite Mach definitions are private on iOS, and have been since I originally wrote PLCrashReporter. This issue previously arose when Unity had their users' apps rejected in 2009, due to Mono's direct use of the Mach exc_server() API, and they were forced to release an update that avoided the use of the API in question.

Given that the structures and definitions required for a full implementation of Mach exception handling are private (at least, insofar as I've been able to determine), PLCrashReporter has long relied on sigaltstack(2) to provide the ability to report stack overflows on the main thread.

Unfortunately, sigaltstack(2) is broken in later iOS releases. In fact, it simply doesn't do anything at all. I've filed rdar://13002712 (SA_ONSTACK/sigaltstack() ignored on iOS) to report the issue to Apple, but in the meantime, I can see no way to detect stack overflow on iOS using only public API.

I've implemented Mach exception handling in PLCrashReporter for Mac OS X, and it could be used as a work-around on iOS, but I'm uncomfortable with providing something that relies on undocumented and/or private SPI. To make sure I wasn't missing something obvious, I even reviewed the KSCrash and Crashlytics 2.0 implementations to determine how they work around this issue, since both use Mach exceptions. Unfortunately, KSCrash appears to have copied in the private structure definitions from the kernel source, and from what I can tell from disassembling their code, Crashlytics copied the (private on iOS) Mach headers from the Mac or Simulator SDKs.

To confirm, I contacted Apple DTS. Their reply was as follows:

Our engineers have reviewed your request and have determined that this would be best handled as a bug report, which you have already filed. There is no documented way of accomplishing this, nor is there a workaround possible.

This is a frustrating position to be in; it seems the only choices are either to leave stack overflow reporting broken, or make use of seemingly private API. I've filed a radar requesting that the requisite Mach defs/headers be made public. In the meantime, I'm considering providing iOS Mach exception support as a user-configurable feature. At the very least, it could be enabled only for development builds.

Conclusion

Crash reporting is a complex enough topic that you can be reasonably assured that 1) you will always get something wrong, and 2) there is always room to improve. There are always trade-offs and edge cases in engineering, and especially so in crash reporting, in which one operates in an environment with significant reliability restrictions, coupled with the ability to fetch, update, and permute memory and thread state at will.

When it comes to implementing something complex like crash reporting correctly, projects like Google's Breakpad deserve considerable admiration. They've invested years of very smart people's time towards getting crash reporting right, and Breakpad has been deployed on a huge number of desktops via Chrome and Firefox. I'm working to incorporate some of the solid design decisions that have gone into Breakpad -- such as placing guard pages around (or locking outright) the memory that the reporter needs in order to function after a crash.

Going forward, I'll probably be writing more informal (and shorter) posts to explore particular aspects of crash reporting. If you have any questions, or have anything you think would be worth covering here, feel free to drop me a line.