With(out) app crashes, please! 

Kacper Kaliński | 12 Feb | 10 min read

Programmers try to avoid crashes in their code. When someone uses their application, it should not break or quit unexpectedly. This is one of the simplest measures of quality: if an app crashes often, it is probably not well made.

App crashes occur when a program is about to do something undefined or harmful, like dividing a value by zero or accessing restricted resources on a machine. A crash can also be triggered explicitly by the programmer who wrote the application. “That will never happen, so I will skip it” is a common and not completely unreasonable line of thinking. There are cases that simply cannot occur, ever, until… they do.

Broken promises

APIs are one of the most common places where we know that something cannot happen. “Backend and frontend have agreed: this is the only response the server can return for this request.” “The maintainers of this library have documented this function’s behavior; it cannot do anything else.” Both ways of thinking are correct, but both can cause problems.

When you are using a library, you can depend on language tools to help you handle all possible cases. If your language lacks any form of type checking or static analysis, you have to take care of that yourself. Still, you can verify this before shipping to production, so it is not a big deal. It can be tough, but you read changelogs before updating your dependencies and write unit tests, right? Whether you use or write a library, the stricter the typing you can provide, the better for your code and for other programmers.

Backend-frontend communication is a bit harder. It is often loosely coupled, so a change on one side can easily be made without knowing how it will affect the other. A backend change can break the assumptions your frontend relies on, and the two are often deployed separately. Sooner or later, this ends badly. We are only human: sometimes we misunderstand the other side, or forget to tell them about that little change. Again, that is not a big deal with proper network handling: decoding the response will fail, and we know how to handle that. Even the best decoding code can be undermined by bad design, though…
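As a minimal sketch of what “proper network handling” can mean, the TypeScript below decodes an untrusted response body and returns an explicit failure instead of throwing. The `Account` shape, its fields and the `decodeAccount` function are hypothetical, made up for illustration. In practice a schema-validation library might do the field checks.

```typescript
// Hypothetical response shape agreed between backend and frontend.
interface Account {
  id: number;
  isActive: boolean;
}

// Either a decoded value or a description of what went wrong.
type DecodeResult<T> = { ok: true; value: T } | { ok: false; error: string };

// Decodes an untrusted JSON string; never throws, even on malformed input.
function decodeAccount(body: string): DecodeResult<Account> {
  let data: unknown;
  try {
    data = JSON.parse(body);
  } catch {
    return { ok: false, error: "response is not valid JSON" };
  }
  if (typeof data !== "object" || data === null) {
    return { ok: false, error: "response is not an object" };
  }
  const record = data as Record<string, unknown>;
  const id = record.id;
  const isActive = record.isActive;
  if (typeof id !== "number") {
    return { ok: false, error: "field 'id' is missing or not a number" };
  }
  if (typeof isActive !== "boolean") {
    return { ok: false, error: "field 'isActive' is missing or not a boolean" };
  }
  return { ok: true, value: { id, isActive } };
}

// A backend change sent a string where a boolean was expected.
const result = decodeAccount('{"id": 7, "isActive": "yes"}');
if (result.ok) {
  console.log("account", result.value.id);
} else {
  console.log("could not decode response:", result.error);
}
```

The caller is forced by the `ok` discriminant to look at the failure branch, so a surprising response becomes an error message rather than a crash deeper in the app.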

Partial functions. Bad design.

“We will have two boolean variables here: ‘isActive’ and ‘canTransfer’. Of course you cannot transfer when it is not active, but that is just a detail.” This is where it begins: a bad design that can hit hard. Now someone writes a function that takes those two arguments and processes some data based on them. The simplest solution is… to just crash on an invalid state. It should never happen, so why care? Sometimes we do care and leave a comment to fix it later or to ask what should happen, but the code can still ship without that task ever being done.

// TypeScript version: two independent booleans allow four states
function doTransfer(isActive: boolean, canTransfer: boolean): void {
  if (isActive && canTransfer) {
    // do something for transfer available
  } else if (!isActive && !canTransfer) {
    // do something for transfer not available
  } else if (isActive && !canTransfer) {
    // do something for transfer not available
  } else {
    // aka (!isActive && canTransfer)
    // there are four possible states
    // this should not happen, transfer should not be available when not active
    throw new Error("Invalid state: transfer available while not active"); // crash()
  }
}

This example might look silly, but you may catch yourself in a similar trap that is harder to spot and resolve. You end up with something called a partial function: a function that is defined only for some of its possible inputs and ignores or crashes on the others. You should always avoid partial functions. Note that in dynamically typed languages most functions can be treated as partial, because nothing restricts which values they receive. If your language cannot enforce correct behavior through type checking and static analysis, such a function might crash later in an unexpected way. Code is constantly evolving, and yesterday’s assumptions might not be valid today.
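To make the distinction concrete, here is a small TypeScript sketch with two hypothetical functions. `firstPartial` is only defined for non-empty arrays and crashes on anything else. `firstTotal` handles every input by making “no element” part of its return type, so the compiler makes callers deal with that case.

```typescript
// Partial: defined only for non-empty arrays, crashes on the rest.
function firstPartial<T>(items: T[]): T {
  if (items.length === 0) {
    throw new Error("firstPartial called with an empty array");
  }
  return items[0] as T;
}

// Total: defined for every input; "no element" is an ordinary result.
function firstTotal<T>(items: T[]): T | undefined {
  return items.length > 0 ? items[0] : undefined;
}

const names: string[] = [];
const first = firstTotal(names);
// The compiler will not let us treat `first` as a string until undefined is handled.
console.log(first === undefined ? "no names yet" : first.toUpperCase());
```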

Fail fast. Fail often.

How can you protect yourself? The best defense is a good offense! There is a nice saying: “Fail fast. Fail often.” But didn’t we just agree that we should avoid app crashes, partial functions and bad design? Erlang/OTP gives programmers an almost mythical advantage: its supervisors restart processes that reach unexpected states, and code can be upgraded while the system is running. Erlang programmers can afford that, but not everyone has this kind of luxury. So why should we fail fast and often?

First of all, to find those unexpected states and behaviours. If you do not check whether your app state is correct, it might lead to even worse results than a crash!

Secondly, to help other programmers collaborate on the same code base. Even if you are alone on a project right now, someone else may come after you, and you yourself might forget some assumptions and requirements. It is rather common to ignore the provided documentation as long as everything seems to work, or not to document internal methods and types at all. In that situation, someone calls one of the available functions with an unexpected but valid value. For example, let’s say we have a ‘wait’ function that takes any integer value and waits for that number of seconds. What if someone passes ‘-17’ to it? If it does not crash immediately, it might cause serious errors and invalid states. Does it wait forever, or not at all?
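Failing fast for the ‘wait’ example could look like this. The sketch assumes a promise-based TypeScript version of the function; its signature is our assumption, not an existing API. A negative or fractional argument is rejected at the call site, so ‘-17’ cannot silently turn into an ambiguous delay.

```typescript
// Hypothetical wait: resolves after `seconds` seconds.
// Not declared `async`, so invalid input throws synchronously at the call site.
function wait(seconds: number): Promise<void> {
  if (!Number.isInteger(seconds) || seconds < 0) {
    throw new RangeError(`wait expects a non-negative integer, got ${seconds}`);
  }
  return new Promise((resolve) => setTimeout(resolve, seconds * 1000));
}

try {
  void wait(-17);
} catch (error) {
  // The invalid call is reported immediately instead of producing a strange delay.
  console.error(error);
}
```

The function is deliberately not `async`. An `async` function would turn the `throw` into a rejected promise, which a caller that never awaits it could easily miss.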

The most important part of intentional crashing is to do it gracefully. If you crash your application, you have to provide enough information to diagnose the problem. That is easy when you are using a debugger, but you should also have a way to report crashes without one. You can use logging systems to persist that information between application launches or to inspect it externally.
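Crashing gracefully mostly means leaving a trail. Below is a TypeScript sketch that assumes a logging system, represented by the hypothetical in-memory `crashLog` array. `fail` and `transfer` are hypothetical names too. Before throwing, `fail` records what went wrong and the values involved, so a report survives without a debugger attached.

```typescript
// Hypothetical stand-in for a logging system that persists entries
// between launches or ships them to an external service.
const crashLog: string[] = [];

// Records a diagnostic entry, then crashes intentionally.
function fail(message: string, context: Record<string, unknown> = {}): never {
  const entry = JSON.stringify({
    time: new Date().toISOString(),
    message,
    context,
  });
  crashLog.push(entry);
  throw new Error(`${message} ${JSON.stringify(context)}`);
}

function transfer(isActive: boolean, canTransfer: boolean): void {
  if (!isActive && canTransfer) {
    fail("transfer available on inactive account", { isActive, canTransfer });
  }
  // ... continue with the valid states
}
```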

The second most important part of intentional crashing is to avoid that in the production environment…

Do not fail. Ever.

You will ship your code eventually. You cannot make it perfect, and it is often too expensive to even consider formal correctness guarantees. However, you should ensure that it won’t misbehave or crash. How can you achieve that when we have just decided to fail fast and often?

An important part of intentional crashing is doing it only in nonproduction environments. You should use assertions that are stripped out of production builds of your application. They help during development and make problems easy to spot without affecting end users. Even in production, though, crashing is sometimes still better than continuing in an invalid application state. How can we avoid both crashes and invalid states if we have already written partial functions?
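Here is a TypeScript sketch of an assertion that only crashes outside production. The `IS_PRODUCTION` constant is an assumption. In a real project your build tooling would set it, and when it is a compile-time constant a minifier can drop the unreachable check entirely. `devAssert` and `canTransferNow` are hypothetical names.

```typescript
// Assumption: your build sets this flag; true in production builds.
const IS_PRODUCTION: boolean = false;

// Crashes on a broken invariant during development and testing;
// does nothing in production builds.
function devAssert(condition: boolean, message: string): void {
  if (!IS_PRODUCTION && !condition) {
    throw new Error(`Assertion failed: ${message}`);
  }
}

function canTransferNow(isActive: boolean, canTransfer: boolean): boolean {
  devAssert(isActive || !canTransfer, "cannot transfer when not active");
  // In production the invalid combination quietly means "no transfer".
  return isActive && canTransfer;
}
```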

Make undefined and invalid states impossible to represent, and fall back to valid ones otherwise. That might sound easy, but it requires a lot of thought and work. However much effort it takes, it is always less than searching for bugs, making temporary fixes and… annoying users. It also automatically makes partial functions less likely to appear.

// TypeScript version: a single enumeration allows only valid states
enum State {
  canTransfer,
  cannotTransfer,
  notActive,
}

function doTransfer(state: State): void {
  switch (state) {
    case State.canTransfer:
      // do something for transfer available
      break;
    case State.cannotTransfer:
      // do something for transfer not available
      break;
    case State.notActive:
      // do something for transfer not available
      break;
    // It is impossible to represent transfer available without being active;
    // there are only three possible states
  }
}
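In TypeScript, the compiler can also check that every state is handled by using a `default` branch that only accepts `never`. This sketch mirrors the three states of the example. `assertNever` and `describeState` are hypothetical helpers. If someone later adds a fourth state without handling it, the call to `assertNever` no longer compiles.

```typescript
enum State {
  canTransfer = "canTransfer",
  cannotTransfer = "cannotTransfer",
  notActive = "notActive",
}

// Only reachable if a case is missing; the `never` parameter makes that a compile error.
function assertNever(value: never): never {
  throw new Error(`Unhandled state: ${String(value)}`);
}

function describeState(state: State): string {
  switch (state) {
    case State.canTransfer:
      return "transfer available";
    case State.cannotTransfer:
      return "transfer not available";
    case State.notActive:
      return "transfer not available (account inactive)";
    default:
      return assertNever(state);
  }
}
```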

How can you make invalid states impossible? Let’s revisit two of the previous examples. For our two boolean variables ‘isActive’ and ‘canTransfer’, we can replace them with a single enumeration that exhaustively represents all possible, valid states. Someone can still send an undefined value, but that is much easier to handle. It is an invalid value that gets rejected before it enters our program, instead of an invalid state that is passed inside and makes everything harder.
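Here is a sketch of rejecting an invalid value at the boundary. It assumes the state arrives from the server as a plain string; the wire values and `parseState` are hypothetical. A string-literal union stands in for the enumeration here, and an `enum` would work the same way. The raw value either becomes one of the three valid states or is rejected before it reaches code like `doTransfer`.

```typescript
// The only valid states; "inactive but can transfer" cannot be expressed.
type State = "canTransfer" | "cannotTransfer" | "notActive";

const VALID_STATES: readonly State[] = ["canTransfer", "cannotTransfer", "notActive"];

// Converts an untrusted value into a State, or returns undefined if it is not one.
function parseState(raw: unknown): State | undefined {
  return VALID_STATES.find((state) => state === raw);
}

const incoming: unknown = "activeButNotReally";
const state = parseState(incoming);
if (state === undefined) {
  console.warn(`Rejected unknown state from server: ${String(incoming)}`);
} else {
  console.log(`Handling state ${state}`);
}
```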

Our wait function can also be improved nicely in strongly typed languages: we can make it accept only unsigned integers. In many such languages that alone fixes the problem, since the compiler rejects negative arguments. But what if your language does not have static types? There are a few possible solutions. First, just crash. This function is undefined for negative numbers, and we won’t do invalid or undefined things. We will have to find invalid uses of it during testing, so unit tests (which we should write anyway) will be really important here. Second, we can fall back to valid values while keeping assertions in nonproduction builds, so invalid states still get fixed where possible. This might be risky, but it can be useful depending on the context. It might not be a good solution for a function like this, but if we take the absolute value of the integer instead, we avoid an app crash. Depending on the language, it might also be a good idea to throw or raise an error or exception instead. Falling back is worth it whenever possible: even if the user sees an error, that is a much better experience than a crash.
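The second option, falling back to a valid value while still asserting during development, might look like this in TypeScript. `IS_PRODUCTION` is an assumed build flag, set to `true` here to show the production path, and `toWaitSeconds` is a hypothetical helper. As noted above, taking the absolute value is a questionable fallback for a delay, so treat it as an illustration of the pattern rather than a recommendation.

```typescript
// Assumption: set by your build tooling; true here to demonstrate the production path.
const IS_PRODUCTION: boolean = true;

// Normalizes a requested delay into a non-negative whole number of seconds.
// Development and test builds crash on invalid input; production falls back.
function toWaitSeconds(requested: number): number {
  const valid = Number.isInteger(requested) && requested >= 0;
  if (!valid && !IS_PRODUCTION) {
    throw new RangeError(`invalid wait duration: ${requested}`);
  }
  if (valid) {
    return requested;
  }
  // Fallback: absolute value rounded down; NaN and Infinity become 0.
  const fallback = Math.floor(Math.abs(requested));
  return Number.isFinite(fallback) ? fallback : 0;
}

console.log(toWaitSeconds(-17)); // 17
```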

Let’s take one more example. If the user data in your frontend application is about to become invalid in some case, it might be better to force a logout and fetch valid data from the server again instead of crashing. The user might be forced to do that anyway, or could get stuck in an endless crash loop. Once again, we should assert and crash in such situations in nonproduction environments, but don’t make your users your external testers.
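Here is a TypeScript sketch of that recovery path. All the names are hypothetical (`SessionData`, `SessionAction`, `resolveSession`), and it relies on an assumed `IS_PRODUCTION` build flag. In production, inconsistent local session data becomes an explicit “log out and refetch” decision, while development builds still crash loudly.

```typescript
// Assumption: set by your build tooling; true in production builds.
const IS_PRODUCTION: boolean = true;

interface SessionData {
  userId: string | null;
  token: string | null;
}

type SessionAction =
  | { kind: "continue"; userId: string; token: string }
  | { kind: "loggedOut" }
  | { kind: "forceLogout"; reason: string };

// Decides what to do with locally stored session data.
function resolveSession(data: SessionData): SessionAction {
  if (data.userId !== null && data.token !== null) {
    return { kind: "continue", userId: data.userId, token: data.token };
  }
  if (data.userId === null && data.token === null) {
    return { kind: "loggedOut" };
  }
  // Exactly one of the two is missing: an invalid state.
  const reason = "inconsistent session data";
  if (!IS_PRODUCTION) {
    throw new Error(reason); // crash loudly while developing and testing
  }
  // Production: log out and fetch valid data from the server again.
  return { kind: "forceLogout", reason };
}
```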

Summary

No one likes crashing, unstable applications; we don’t like making them or using them. Failing fast with assertions that provide useful diagnostics during development and testing will catch a lot of problems early. Falling back to valid states in production will make your app a lot more stable. Making invalid states unrepresentable will eliminate a whole class of problems. Before development, give yourself a little more time to think about how to rule out invalid states and fall back from them. While writing code, give yourself a little more time to include some assertions. You can start making your applications better today!
