Reliable Crash Reporting

14 Sep 2011, 16:09 PDT

Introduction

PLCrashReporter is a standalone open-source library for generating crash reports on iOS. When I first wrote the library in 2008, it was the only option for automatically generating and gathering crash reports from an iOS application. Apple's iOS crash reports were not available to developers, and existing crash reporters — such as Google's excellent Breakpad — were not supported on iOS (Breakpad still isn't). Since that time, quite a few crash reporters and crash reporting services have appeared: Apple now provides access to App Store crash reports, a number of 3rd-party products and services were built on PLCrashReporter (such as HockeyApp, JIRA Mobile Connect), and some services have chosen to write their own crash reporting library (TestFlight, Airbrake, and others).

Despite this obvious interest in adopting crash reporting from iOS developers, there has remained little understanding of the complexities and difficulties in implementing a reliable and safe crash reporter, and many of the custom crash reporting libraries have been implemented improperly. It's my intention to explore what makes crash reporting difficult (especially on iOS), and provide real-world examples of how an impoperly written crash reporter can fail — sometimes with little fanfare, and sometimes with surprising consequences.

A Hostile Environment

Implementing a reliable and safe crash reporter on iOS presents a unique challenge: an application may not fork() additional processes, and all handling of the crash must occur in the crashed process. If this sounds like a difficult proposition to you, you're right. Consider the state of a crashed process: the crashed thread abruptly stopped executing, memory might have been corrupted, data structures may be partially updated, and locks may be held by a paused thread. It's within this hostile environment that the crash reporter must reliably produce an accurate crash report.

To reliably execute within this hostile environment, code must be "async-safe": it must not rely on external, potentially inconsistent state. More concretely, this means that a crash reporter must avoid APIs that have not been written explicitly to be executed with a signal handler, in a potentially crashed process. This excludes everything from malloc — the heap may have been corrupted, or partially updated — to Objective-C — locks may be held in the runtime, or data structures might be partially initialized. In fact, there's so little that you can do safely within a signal handler, it's much easier to define what you *can* do safely — a minimum number of system calls and APIs are defined to be async-safe, and those are the only APIs you can reliably call.

Thus, to be reliable and safe, a crash reporter must be written with async-safety in mind, eliminating or minimizing the risk inherent in operating within a hostile and corrupt environment. At this point, you might ask what I mean by "reliable and safe" — after all, if the process has already crashed, what's the worst that could happen? It crashes again? In fact, there's quite a bit that can go wrong, largely depending on the non-async-safe APIs that an unreliable crash reporter might rely on.

First, do no harm

The mission of a crash reporter is simple: report crashes, and provide enough information to debug them. As it turns out, if you know how to read the report, just about everything you need to debug nearly all crashes can be found in the backtraces, register state, and a bit of intuition about your own code (look for a future blog post on this subject).

However, a crash reporter should never make things worse — I'll cover three major ways it can do that:

Let's explore these failure cases with some real world examples.

Failure Case: Async-Safety and Deadlocking Objective-C

One of the more likely failure modes when dealing with a non-async-safe APIs is a deadlock. Imagine that the application has just acquired a lock prior to crashing. If the crash reporter's implementation then attempts to acquire the same lock, it will wait forever: the crashed thread is no longer running and will never release the lock. When a deadlock like this occurs on iOS, the application will appear unresponsive for 10-20 seconds until the system watchdog terminates the process, or the user force quits the application.

It is possible to trigger such a deadlock simply by using Objective-C within the signal handler. The Objective-C runtime itself maintains a number of internal locks, and if a thread happens to hold a runtime lock when a crash occurs, any use of Objective-C in the crash reporter itself will trigger a deadlock. This is dependent on timing, however — for the purposes of a providing a simple test case, I've created a contrived example that will reliably demonstrate the deadlock on ARM and x86:

static void unsafe_signal_handler (int signo) {
    /* Attempt to use ObjC to fetch a backtrace. Will trigger deadlock. */
    [NSThread callStackSymbols];
    exit(1);
}
 
int main(int argc, char *argv[]) {
    /* Remove this line to test your own crash reporter */
    signal(SIGSEGV, unsafe_signal_handler);
 
    /* Some random data */
    void *cache[] = {
        NULL, NULL, NULL
    };
 
    void *displayStrings[6] = {
        "This little piggy went to the market",
        "This little piggy stayed at home",
        cache, 
	"This little piggy had roast beef.",
        "This little piggy had none.",
        "And this little piggy went 'Wee! Wee! Wee!' all the way home",
    };
    
    /* A corrupted/under-retained/re-used piece of memory */
    struct {
        void *isa;
    } corruptObj;
    corruptObj.isa = displayStrings;
 
    /* Message an invalid/corrupt object. This will deadlock crash reporters
     * using Objective-C. */
    [(id)&corruptObj class];
     
    return 0;
}

If you run this code on iOS or the simulator, you'll reliably trigger a deadlock in any crash reporter using Objective-C in its signal handler. While this specific example is somewhat contrived for the sake of serving as reliable test case, any use of Objective-C or any other non-async-safe function in a signal handler has the potential to trigger such a deadlock, and should be avoided.

Failure Case: Async-Safety and Data Corruption

The risk of data corruption is a far more potent concern than a deadlock. There are a surprising number of ways that this can occur — but what might be the most likely (and dangerous) mechanism to trigger data corruption is the reentrant running of the application's event loop.

Some crash reporters attempt to submit the crash report over the network immediately upon program termination. This introduces an interesting failure mode: spinning the runloop to handle network traffic may also trigger execution of the application's own code, and the application is then free to attempt to write potentially corrupt user data.

Consider a Core Data-based application, in which a model object is updated, and then saved:

person.name = name;
person.age = age; // a crash occurs here
person.birthday = birthday;
[context save: NULL];

At the time of the crash, the managed object context contains a partially updated record — certainly not something you want saved to the database. However, if the crash reporter then proceeds to reentrantly run the application's runloop, any network connections, timers, or other pending runloop dispatches in your application will also be run. If the application code dispatched from the runloop contains a call to -[NSManagedObjectContext save:], you'll write a partially updated record to the database, corrupting the user's data.

This approach of executing non-reentrant, non-async-safe code from a crash reporter is particularly dangerous. To avoid this, the signal handler can not make use of higher-level networking APIs at crash time, and crash report implementations must not attempt to submit a crash report until the application has started again.

If added to your UIApplicationDelegate, the following code will print a message to the console if your crash reporter spins the runloop after a crash has occurred:

dispatch_async(dispatch_get_main_queue(), ^{
    NSLog(@"APPLICATION CODE IS RUNNING - Crash reporter is spinning runloop");
});
*((int *)NULL) = 5; // trigger a crash

Failure Case: Stack Traces and Stack Overflow

There are two ways to implement backtrace support in a crash reporter (but only one of them is reliable):

The first solution is unreliable for a number of reasons, but the failure mode I'll be addressing here is a stack overflow. In the case of a stack overflow, a crash reporter that makes use of backtrace(3) or similar APIs will be entirely unable to report the crash. Here's a code example that will accidentally trigger an overflow:

/* A small typo can trigger infinite recursion ... */
NSArray *resultMessages = [NSMutableArray arrayWithObject: @"Error message!"];
NSMutableArray *results = [[NSMutableArray alloc] init];
 
for (NSObject *result in resultMessages)
    [results addObject: results]; // Whoops!
 
NSLog(@"Results: %@", results);

If the crash reporter makes use of backtrace(3), this crash will not be reported. The reason is straight-forward: These backtrace APIs must execute on on the crashed thread's stack, but that stack just overflowed. There is is no stack space within which the crash reporter's signal handler can be run.

However, if the crash reporter uses sigaltstack() to correctly execute on a different stack, the backtrace will be empty — the signal handler is running on a new stack!

In PLCrashReporter, this was solved by implementing custom stack walking for the supported platforms. This requires more complexity, but in addition to supporting the generation of reports in the case of stack overflow, also allows the crash reporter to provide stack traces for all running threads.

Conclusion

Implementing a reliable crash reporter is difficult, and these are only a brief overview of the potential pitfalls and complexities involved. I think Mike Ash best described the complexity of signal handlers in Friday Q&A:

There is very little that you can do safely. There is so little that I'm
not even going to discuss how to get anything done, because it's so impractical
to do so, and instead will simply tell you to avoid using signal handlers
unless you really know what you're doing and you enjoy pain.

While I can't claim that PLCrashReporter is perfect, great effort (and pain) have been expended in ensuring its reliability and correctness. If you're considering implementing your own ad-hoc reporter, I'd highly recommend reviewing the design decisions made in both Google Breakpad and PLCrashReporter, both of which are liberally licensed and may be included in any commercial and/or closed-source product.

As a developer considering the use of a crash reporter in your application, I hope this overview will provide a little more insight into their function (and design complexities), as well as providing you with some tools to evaluate the efficacy of the available solutions -- complex failure cases are often the time when you need accurate, reliable crash reporting the most.