The AWS conundrum

In the aftermath of the AWS outage, there’s a lot of talk about how the internet has been taken hostage by this specific provider, and how the internet is not decentralized anymore. The claim isn’t obviously wrong - after all, the recent outage, and every AWS outage for that matter, seems to take down a sizeable portion of the internet - but it’s misguided, and it ignores the deeper problems at play here.

The argument goes something like: “I couldn’t use my smart bed while AWS was down, therefore AWS has too much power”. You can swap “smart bed” for any of a litany of other products and apps that people use nowadays, but the complaint boils down to the same thing. And while I’m sure that it’s a pain in the ass that you can’t go to sleep in your $2000 smart bed because us-east-1 has been having trouble, this complaint, like most of the complaints you hear during times like these, misses the point.

It’s to be expected that, at some point, some part of your infrastructure is going to be down. More than that, it’s also to be expected that some part of the network that you don’t control, and that your packets cross to reach their final destination, is going to be down at some point too. We’ve seen slight BGP misconfigurations disconnect whole countries from the internet at large, or send packets down unexpected paths (through foreign, adversarial countries, for example), which introduces latency and weird behaviours.

Building your stack on top of AWS (or any other cloud provider, for that matter) can help you ignore a host of “problems”, but it does introduce a level of helplessness that can’t be ignored. And ultimately that is one of the issues that should be talked about more whenever we have these outages: the cloud doesn’t make site reliability issues go away. It can be argued that it provides flexibility for scaling (both horizontal and vertical), but you’re also exposing yourself to a platform that’s so much more complicated than what you (probably) need, and that has a cost. And while this might be a trade-off that a CTO is willing to make (“AWS is down” is the perfect excuse to dodge responsibility), it’s not a trade-off made with the consumer in mind, and I’d argue the consumer should always be the focus.

When I talk about helplessness, I do mean it. It gives the impression that the only products that can be built nowadays are cloud-only - not even cloud-first. There seems to be a generation of developers who can only develop CRUD apps that depend on a cloud API. And while there isn’t anything necessarily wrong with this pattern, there is a time and a place for it. I’m going to go out on a limb here and say that if you can’t develop a local-first CRUD app, you probably shouldn’t be integrating cloud-first patterns into your software, let alone making your software dependent on the cloud to function at all.
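
To make “local-first CRUD” a bit more concrete, here is a minimal sketch of what I have in mind. It is an illustration under my own assumptions, not a reference design: the `LocalFirstNotes` class, its `outbox` table, and the `push` callback are all hypothetical names, and conflict resolution between devices is left out entirely. Reads and writes only ever touch a local SQLite database, and the cloud is demoted to something you sync with when it happens to be reachable.

```python
import json
import sqlite3


class LocalFirstNotes:
    """Hypothetical local-first store: the cloud is a sync target, not a dependency."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS notes (id TEXT PRIMARY KEY, body TEXT)"
        )
        # Changes that have not reached the remote yet, in the order they happened.
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS outbox "
            "(seq INTEGER PRIMARY KEY AUTOINCREMENT, payload TEXT)"
        )

    def put(self, note_id, body):
        # One transaction: the note and its outbox entry commit together.
        with self.db:
            self.db.execute(
                "INSERT OR REPLACE INTO notes (id, body) VALUES (?, ?)",
                (note_id, body),
            )
            self.db.execute(
                "INSERT INTO outbox (payload) VALUES (?)",
                (json.dumps({"id": note_id, "body": body}),),
            )

    def get(self, note_id):
        row = self.db.execute(
            "SELECT body FROM notes WHERE id = ?", (note_id,)
        ).fetchone()
        return row[0] if row else None

    def pending(self):
        return self.db.execute("SELECT COUNT(*) FROM outbox").fetchone()[0]

    def sync(self, push):
        """Drain the outbox through push(change); stop at the first network error."""
        sent = 0
        rows = self.db.execute(
            "SELECT seq, payload FROM outbox ORDER BY seq"
        ).fetchall()
        for seq, payload in rows:
            try:
                push(json.loads(payload))
            except OSError:  # ConnectionError and TimeoutError are subclasses
                break  # remote is unreachable; keep the rest for later
            with self.db:
                self.db.execute("DELETE FROM outbox WHERE seq = ?", (seq,))
            sent += 1
        return sent
```

If `push` raises because the provider is down, `sync` simply returns early: the note is still readable through `get`, and the change waits in the outbox until the next attempt. The app degrades to “not synced yet” instead of “not working”.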

This is ultimately what prompted me to start this project. I don’t hate the cloud, but I do hate that it came to pretty much eat the world. Not everything needs an always-on connection to a remote computer - in fact, very few projects do. You can have resilience, high availability, scalability, and pretty much all the other niceties that the cloud offers upfront, while developing your architecture with a local-first perspective. This should be the lesson that we take from AWS outages. Oh, and that DNS is usually to blame.