Antifragile Software — introduction

Marek Kowalcze
4 min read · Sep 17, 2020

Introduction

This is the first post in the "antifragile software" series. In this series, I'm going to gather and categorize my thoughts on software development in the context of Incerto (by Nassim Nicholas Taleb) and Thinking, Fast and Slow (by Daniel Kahneman). Taleb's books mostly target domains like economics and social science, where he identifies many problems with how we approach risk in complex systems and how we make judgments and predictions under uncertainty. I believe those insights can be applied to software engineering practices as well. In this first, introductory post I'll explain the main ideas briefly.

This, of course, won't be the first attempt to combine the idea of antifragility with software development. There are some older papers on the subject, and you can also find more recent articles related to the topic. This and the following posts are my personal attempt to categorize and describe how we can build software in that spirit. While other sources focus more on the operational aspects of software (for example, fault tolerance or automatic patching techniques), I'll look more at other domains like code design, testing, code review, and process metrics. I'll also cover the subject of cognitive errors, which is not directly related to antifragile software but may help us avoid some of these mistakes.

Incerto dictionary

First, let's go through some basic concepts that will be needed for the rest of the series. Feel free to skip this section if you are already familiar with Taleb's and Kahneman's work.

Black Swan: an unexpected event with three main characteristics: (1) it is highly unlikely and not anticipated by anyone; (2) it has a very serious (usually bad) impact; and (3) it is easily explained AFTER it occurs. In short, black swans are very rare disasters (for the given observers and time span) that no one predicted beforehand but that are easily explained after they happen.

Antifragile Systems: systems that get better, stronger, or otherwise improve as a result of unexpected failures and stressors. Antifragile systems are not merely resistant (robust) to unexpected errors; they actually gain from such events.

Strategy for building “antifragile” software

First, let's briefly go through each point below to understand the big picture of our strategy. Our goal is to build software that can survive for a long time and that is not only robust but can also "gain" from stressors and get better over time. That is the main idea of antifragility.

To move closer to that goal, we can follow the practices listed below.

1. Avoid irreversible disasters

First, let's recognize the single points of failure that can ruin the whole system. Note that these "disasters" can occur on different levels. Think about situations like the application being down for hours (operations failures) or attacks (security incidents). Beyond that, we can experience failures at demos (operations plus sales errors) or simply be unable to fix a major bug for days without breaking other areas (development disasters).

These, among others, are examples of situations that are not easily reversible (or reversible at all) at a given level of our "system". You can't undo a user's bad opinion; you never get a second chance to make a first impression. Some of these events happen in minutes, while others pile up over months. In any case, we should be aware of these risks and remember that these are the most important events to avoid. If we fail at those, the rest doesn't really matter.

In other words, we should try to protect the “system” from the most negative events, even if their probability is very small. Probability doesn’t even matter here if the impact is so enormous that it wipes out everything else.
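
One practical way to limit irreversible damage is to make risky changes reversible by design. Here is a minimal sketch in Python of a kill switch that lets us disable a misbehaving code path in seconds instead of scrambling for an emergency rollback. The flag mechanism and the function names are my own illustrative assumptions, not a specific library:

import os

def flag_enabled(name: str, default: bool = False) -> bool:
    """Read a kill-switch flag. In a real system this would come from a
    config service or feature-flag store rather than the environment."""
    value = os.environ.get(f"FLAG_{name.upper()}")
    if value is None:
        return default
    return value.strip().lower() in ("1", "true", "on")

def legacy_pricing(amount: float) -> float:
    return amount  # proven, boring code path

def new_pricing_engine(amount: float) -> float:
    return amount * 0.95  # risky new logic we may need to disable fast

def price(amount: float) -> float:
    # If the new engine misbehaves in production, flipping one flag
    # reverts to the old path: the risky change stays reversible.
    if flag_enabled("new_pricing_engine"):
        return new_pricing_engine(amount)
    return legacy_pricing(amount)

print(price(100.0))  # 100.0 unless FLAG_NEW_PRICING_ENGINE=1 is set

The point is not this particular mechanism but the general design choice: before shipping something risky, decide in advance how you will turn it off.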

2. Understand common cognitive errors

In the second part, we'll go through some of the thinking errors that we are all prone to while building software. As an industry, we're doing many things in a very clever way, but there are still areas for improvement. We developers are biased like everyone else, and we're exposed to many potential, hard-to-see errors. We're overwhelmed by too much information, and we tend to take shortcuts when making decisions. When we don't have enough information, we extrapolate and make projections or guesses with far too much certainty. We also tend to remember negative things much better, which distorts our risk assessment.

All of this can be improved. First, we need to be aware of these cognitive biases. Second, we need to find examples of them in the various phases of the software development process. While the catalog of known biases is quite large, I think it is enough to understand their main categories rather than go through each one.

3. Expose the system to randomness (take small risks)

Another way of improving how we build software is to actually be open to some small risks. This might seem counter-intuitive after the first piece of advice, but it all comes down to recognizing situations where the potential worst-case scenario is acceptable while, on the other hand, the potential payoff in the best-case scenario is huge.

But why take any risk in the first place? Why make this distinction? The reason is that any complex system needs some kind of controlled stressors to get better over time. It is like exercising, which can end in a small injury but prepares us for a bigger, unplanned effort in the future.
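
In software, one concrete form of a controlled stressor is deliberate fault injection in a test or staging environment, in the spirit of chaos engineering. Here is a minimal sketch in Python; the class, names, and failure rate are illustrative assumptions, not an established tool:

import random

class FaultInjector:
    """Wraps a callable and randomly injects failures: a small,
    controlled stressor that forces callers to handle errors."""

    def __init__(self, func, failure_rate=0.05, enabled=True):
        self.func = func
        self.failure_rate = failure_rate
        self.enabled = enabled  # only enable outside production

    def __call__(self, *args, **kwargs):
        if self.enabled and random.random() < self.failure_rate:
            raise TimeoutError("injected fault: simulated slow dependency")
        return self.func(*args, **kwargs)

def fetch_profile(user_id):
    return {"id": user_id, "name": "example"}

flaky_fetch = FaultInjector(fetch_profile, failure_rate=0.2)

# Callers must now survive small, frequent failures, which is how
# retry and fallback paths get exercised before a real outage hits.
for attempt in range(3):
    try:
        print(flaky_fetch(42))
        break
    except TimeoutError as err:
        print(f"attempt {attempt + 1} failed: {err}")

Because failures become frequent and cheap, the error-handling paths get tested long before a real outage, which is exactly the small-risk, big-payoff trade described above.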

4. Understand metrics and predictions limits

Finally, the last big takeaway from Incerto is to understand what cannot be measured or predicted. The fact that we can draw a chart of some metric and extrapolate the trend doesn't mean that doing so makes sense. Predictions (estimates) based on data with large variation can be dangerous.
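
As a toy illustration of why this is dangerous (all numbers here are made up): if we fit a trend line to a few small samples of the same noisy, essentially flat metric, each sample can produce a very different long-range forecast:

import random

random.seed(1)

def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept for a trend line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

weeks = list(range(8))
for sample in range(3):
    # Same "true" flat metric (say, bugs per week), high-variance noise.
    observed = [20 + random.gauss(0, 8) for _ in weeks]
    slope, intercept = linear_fit(weeks, observed)
    forecast = slope * 26 + intercept  # extrapolate half a year ahead
    print(f"sample {sample}: slope={slope:+.2f}, "
          f"week-26 forecast={forecast:.1f}")

Run it and the three forecasts typically disagree substantially even though the underlying metric never changed; the "trend" is mostly noise, and extrapolating it half a year out amplifies that noise.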

Summary

These ideas are overkill if you just need to code a quick proof of concept over the weekend. You don't need to worry about them if you are setting up a WordPress blog or building a simple app with a single contact form. But if you are part of a more complex project, one that is built over years and whose users rely on it on a daily basis, I'd recommend getting familiar with these ways of dealing with risk.
