Anti-fragile servers

Nassim Taleb defined antifragility as:
“Simply, antifragility is defined as a convex response to a stressor or source of harm (for some range of variation), leading to a positive sensitivity to increase in volatility (or variability, stress, dispersion of outcomes, or uncertainty, what is grouped under the designation “disorder cluster”). Likewise fragility is defined as a concave sensitivity to stressors, leading to a negative sensitivity to increase in volatility. The relation between fragility, convexity, and sensitivity to disorder is mathematical, obtained by theorem, not derived from empirical data mining or some historical narrative. It is a priori”

What does it mean for a deployment to be antifragile?

Let’s start with making our servers self-healing.

In nature, cells can regenerate themselves, and ideally you would like your servers to do the same thing.

At Ahoy.io, we used the kops tool to set up a Kubernetes cluster. It is really easy, but if you have an aversion to the command line, you can just set up a cluster with a few clicks on Google Container Engine.

The cluster consists of:

  • 3 masters
  • N number of nodes
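
In kops, that shape maps to instance groups: one per master (pinned to its availability zone) and one for the worker nodes. A rough sketch of what those definitions can look like (the cluster name, machine types and sizes are made up, and the exact apiVersion depends on your kops version):

```yaml
# Hypothetical kops instance groups: one master per AZ, plus one group of worker nodes.
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  name: master-eu-west-1a
  labels:
    kops.k8s.io/cluster: k8s.example.com   # illustrative cluster name
spec:
  role: Master
  machineType: m4.large
  minSize: 1
  maxSize: 1          # the backing AWS Auto Scaling group keeps exactly one master here
  subnets:
  - eu-west-1a
---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  name: nodes
  labels:
    kops.k8s.io/cluster: k8s.example.com
spec:
  role: Node
  machineType: m4.large
  minSize: 3          # the "N" nodes; pick enough to survive losing some of them
  maxSize: 3
  subnets:
  - eu-west-1a
  - eu-west-1b
  - eu-west-1c
```

(The master group is repeated for eu-west-1b and eu-west-1c.)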

Masters

The masters' responsibility is to look over the nodes and ensure that the state of the cluster reflects our definition: which containers are running, which should be killed and removed.

Each master is deployed in a separate availability zone (a, b, c). In our case, it's eu-west-1 (Ireland), as this is the cheapest one.

In the case of a master failure, the AWS Auto Scaling group will recognise it and automatically recreate the master.

In the meantime, the surviving masters are still operating and keeping our servers tidy.
That way, the cluster can auto-heal its own supervisors.

Nodes

A node is a server whose only responsibility is running containers with your code.

If you tell Kubernetes that you want to have one set of containers for your main API, replicated twice, then Kubernetes masters will make sure that this is the case and the containers are scheduled on random nodes.

Let’s say we have a set of containers (Kubernetes calls it a Pod) for our backend API and frontend web app.

Containers for API:

  • python-django-app with all the code & logic
  • nginx for serving static files (images, stylesheets, fonts)
  • memcache: fast & efficient cache

Containers for Frontend:

  • nginx for serving pre-built frontend web app

Obviously, we want this replicated, so we tell Kubernetes that each Pod should have two copies.
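
In Kubernetes this is expressed as a Deployment with replicas: 2. A minimal sketch of what the API Deployment might look like (the names, labels and image tags are illustrative, not Ahoy's actual manifests):

```yaml
# Hypothetical Deployment for the API Pod described above.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 2                 # two copies of the Pod
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
      - name: django
        image: registry.example.com/ahoy-api:latest   # python-django-app with the code & logic
      - name: nginx
        image: nginx:1.13                             # serves static files
      - name: memcached
        image: memcached:1.5                          # fast & efficient cache
```

The frontend Deployment looks the same, just with a single nginx container serving the pre-built web app.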

What Kubernetes needs to do now is deploy 8 containers (3 for the API, 1 for the frontend; times two) to random nodes.

It also needs to constantly make sure that the containers are still there, running and responsive.

That’s why we have masters, which do not run any of our containers and are just supervising.
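
The "running" part comes from the desired replica count; the "responsive" part comes from liveness and readiness probes, which the kubelet on each node executes and the control plane acts on. A hedged sketch, extending the nginx container from the Deployment above (the /healthz path is an assumed health endpoint):

```yaml
      - name: nginx
        image: nginx:1.13
        livenessProbe:              # restart the container if this keeps failing
          httpGet:
            path: /healthz          # assumed health endpoint
            port: 80
          initialDelaySeconds: 10
          periodSeconds: 15
        readinessProbe:             # stop routing traffic to the Pod while this fails
          httpGet:
            path: /healthz
            port: 80
          periodSeconds: 5
```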

Dying node

Because the masters are constantly looking over the cluster, in the case of a node failure (a server going down), the AWS Auto Scaling group will create another node in our datacenter, and Kubernetes will see that some containers died and reschedule them.

That means that servers can die like flies and, with enough replicas, users won’t even notice. That’s why we need at least 2 replicas: if a node fails, the containers on that node are not accessible.
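
For that to help, the two replicas should not end up on the same node. The scheduler already tries to spread them by default, but you can ask for it explicitly with pod anti-affinity; a sketch, added to the Pod template of the API Deployment above (labels as before, illustrative):

```yaml
    spec:
      affinity:
        podAntiAffinity:
          # Prefer nodes that do not already run an "api" Pod.
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels:
                  app: api
              topologyKey: kubernetes.io/hostname   # "different node" = different hostname
```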

You also need enough servers (nodes) to handle full user traffic even when some of the nodes are dead.

But wait, how can you just do replicas?

Ok, so I haven’t mentioned an important thing about our configuration.

All our containers are stateless.

What does stateless mean? It means that inside the servers and containers we do not keep any important data.

The definition of important data: if you can remove it and no one notices, it was not important.

That’s why we can safely host the cache there, as it is a non-persistent helper to our operations.

Like the tenth scratch pad you carry with you because you lost the previous nine: we can lose another one and just buy a new one.

Where is the state then?

Having self-healing & scalable state is hard.

Like really hard.

We are lazy (read: efficient), so we don’t want to spend time fixing database errors, making backups, and making sure there is no data corruption.

What do you do when you don’t know how to do stuff? You hire an expert!

All of Ahoy’s state is kept in PostgreSQL databases managed by Compose.com (who host them in the same datacenter as our cluster).

For keeping files around (uploaded avatars and generated invoices), we are using B2 Cloud from Backblaze, although AWS S3 is a better option if you are hosted on AWS already (we will be moving there someday).
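
On the cluster side, the stateless containers then only need connection details, which can be injected as environment variables from a Secret. A minimal sketch (the Secret name, keys and values are placeholders, not Ahoy's real configuration):

```yaml
# Hypothetical Secret holding the Compose PostgreSQL URL and B2 credentials.
apiVersion: v1
kind: Secret
metadata:
  name: api-secrets
type: Opaque
stringData:
  DATABASE_URL: "postgres://user:password@example.compose.com:5432/ahoy"  # placeholder
  B2_ACCOUNT_ID: "000000000000"                                           # placeholder
  B2_APPLICATION_KEY: "replace-me"                                        # placeholder
```

The django container in the Deployment above can then pick these up via envFrom with a secretRef, so swapping the database or the object store never requires touching the nodes themselves.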

Is this really self-healing?

If any of our nodes die:

  • users are not affected and do not even notice
  • AWS Auto Scaling will recreate the node
  • Kubernetes will reschedule the missing Pods to another node

Since our data is in another castle, we can create new nodes and kill old ones as we wish, giving us the power to scale both vertically (faster servers) and horizontally (more servers).

Nature is not perfect

A scratch or small cut will heal itself. A lost limb won’t regrow.

In nature, if there is too much damage, then no healing can happen. The same situation is possible here:

  • all masters can die, so there is no one to supervise the nodes
  • if all nodes hosting the same replicas (our API) die at the same time, then the API is not available until new nodes are created and the containers rescheduled
  • too many dying nodes can slow down our servers, as there are fewer resources to handle the traffic

Being anti-fragile is not about being invincible. It’s about adjusting to constant errors happening around us.

If you accept that every 12 hours a server dies, and every hour a random container is removed, then you start thinking about your system differently.

At Ahoy, each separate API assumes that the others can be dead. If the B2 Cloud filesystem does not respond, you can still search for flights, but your invoice won’t have a PDF. Once it’s back, the PDF will be regenerated.

This can be achieved only when you always assume the worst shit happens.
