Do Multiple Implementations Make Ethereum More Reliable?

When I think about the reliability of Bitcoin, I’m mainly thinking about two things:

Reliably receive money — I want to be able to know with 100% certainty whether or not a payment is real.
Reliably send money — I want to be able to send my money 100% of the time.

Of those two goals, failure to achieve the second is usually¹ a temporary inconvenience. Failure on the first goal, though, can result in huge losses through double-spend fraud.

So it was no surprise to me that the crypto-currency exchange Kraken responded to the recent Ethereum consensus failure by temporarily halting withdrawals and deposits of ether:

Why multiple implementations doesn't help: any sane person would stop using Ethereum right now anyway. https://t.co/8H9Dns5YsD
— Peter Todd (@petertoddbtc) September 18, 2016

After all, a temporary halt to withdrawals and deposits is just a temporary inconvenience; in Bitcoin’s experience consensus failures can usually be fixed within a few hours. Equally, from the point of view of Kraken, it’s hard to argue that multiple implementations made Ethereum more reliable: they still had to stop using Ethereum for a few hours while the dust settled.

However, that tweet generated, uh, a bit of controversy:

No. multiple implementations is great. That's a disgusting statement, Peter. Also says a lot about you. https://t.co/iotJo85SDG
— Steven D. McKie (@Steven_McKie) September 18, 2016

@petertoddbtc I just hope you feel shame.
— Steven D. McKie (@Steven_McKie) September 18, 2016

So as promised, I thought I’d talk in a bit more detail about why multiple implementations don’t necessarily make a system more reliable. We’ll also ask an important question: How is the Ethereum protocol specified?

What is Reliability Anyway?
Consensus Systems and Redundancy
Voting
How is the “Ethereum Protocol” Defined?
Footnotes

What is Reliability Anyway?

If we’re going to try to make our consensus system better, we have to start with figuring out what we’re trying to achieve in the first place. I used to work in analog electronics, so let’s make an analogy: suppose I’m an electrical engineer, and I’ve been given a budget to work with and asked to design a reliable mains power system for a building. Does that mean I’m just trying to design a system that maximizes the percentage of time the lights stay on, within the budget allowed?

Of course not! “Reliability” isn’t as simple as keeping the lights on; there are at least four things I want my design to do reliably:

Reliably don’t kill people.
Reliably don’t burn the building down.
Reliably don’t allow the power distribution system to be permanently damaged by faults.
Reliably keep the lights on.

Out of those four, just keeping the lights on - the availability of power - is actually the lowest priority! I’d much rather the power go out temporarily than have anything get permanently damaged, I definitely don’t want to burn the building down, and if my design gets anyone killed I probably should find a good lawyer ASAP.

Optimising for Reliability

Circuit breakers are a great example of these priorities in action. Circuit breakers protect against shorts and overloads by cutting the flow of electricity immediately; without them faults would result in permanent damage and even fires.

They’re also single-points-of-failure: in most buildings between your lights and the power grid there are two or three circuit breakers (or fuses) in series. If any one of them fails the lights go off - no backups.

But when we consider our priorities that’s OK: keeping the lights on is the lowest on our list, and for most purposes we can tolerate some downtime in exchange for a safer power system. In short, we’ve sacrificed the availability of power, in exchange for a higher overall reliability for the same total budget.

Optimising for Availability

That’s not always a good trade-off: sometimes the power failing is itself a safety problem. An interesting example is the “fire pump” used in high-rise fire sprinkler systems to supply water to the sprinkler heads:

[Image: Fire pump. CC-BY-2.5 Todd A Stephens]

Here our priorities are very different: if a fire pump is in use, there’s a good chance the building is already on fire. To make a long story short, building codes prohibit the installation of circuit breakers on circuits that supply fire pumps in many circumstances, because it’s better that the pump keep running so the sprinklers can put the fire out, even at the risk of potentially destroying the pump and wires connected to it due to a fault. Fire-pumps sacrifice overall reliability in exchange for higher availability.

Redundancy: High Availability and High Reliability

What if I want the best of both worlds? I could install two fire pumps, both protected by circuit breakers. I’d then have a system where faults are handled safely without damaging equipment, and (hopefully!) if one pump fails in a fire I’ll still be protected by the other.

Why don’t we do this? In the case of fire pumps, money. Building twice as many pumps costs twice as much money, and for various reasons it’s more effective to spend the money on other things like thicker wires that can handle fault currents and higher quality pumps that are less likely to fail in the first place. As often happens with trade-offs, your choices are reliability, availability, and affordability: pick two.

Consensus Systems and Redundancy

But at least you can easily make those fire pumps redundant. For a pump, redundancy is additive: if the left pump turns on and the right pump doesn’t, water is still going to flow. Additional pumps only add to the availability of the system, right?

Actually, even for something as simple as a pump that’s not necessarily true: if two pumps supply the same pipe and one of the pumps fails, the result is often that the other pump wastes most of its output pushing water backwards through the pump that failed; if the failure mode was a leak, the whole system could be totally useless. So you need to add check-valves to the design, which means the cost of two pumps is now a little higher than 2x. And those valves can themselves fail, reducing reliability…
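To put rough numbers on that, here is a minimal Python sketch. The figures (`PUMP`, `VALVE`) are made up purely for illustration, failures are assumed to be independent, and the no-check-valve case uses a deliberately pessimistic model in which a single failed pump disables the whole system; none of this comes from real pump data.

```python
def parallel_availability(branch: float, branches: int = 2) -> float:
    """Probability that at least one of several independent branches works."""
    return 1.0 - (1.0 - branch) ** branches


# Hypothetical figures, chosen only for illustration.
PUMP = 0.99    # chance a single pump runs when called on
VALVE = 0.999  # chance a check-valve behaves correctly

# One pump on its own.
single = PUMP

# Two pumps with perfect (never-failing) check-valves.
ideal_pair = parallel_availability(PUMP)

# Two pumps, each in series with a check-valve that can itself fail.
pair_with_valves = parallel_availability(PUMP * VALVE)

# Pessimistic model of the backflow problem: with no check-valves,
# one failed pump takes the system down, so both pumps must work.
pair_without_valves = PUMP * PUMP

print(f"single pump:              {single:.6f}")
print(f"pair, no check-valves:    {pair_without_valves:.6f}")
print(f"pair, real check-valves:  {pair_with_valves:.6f}")
print(f"pair, ideal check-valves: {ideal_pair:.6f}")
```

Under these assumptions the naive pair is worse than a single pump, and the pair with real check-valves falls slightly short of the ideal pair, which is the sense in which the valves themselves eat into the benefit.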

Consensus systems take this problem to the extreme; it’s really difficult to use redundancy to make a consensus system more reliable. If we have two different implementations of the same system, and one implementation thinks Alice paid Bob while the other thinks Alice paid Charlie, we have a massive problem that must be fixed. Until that problem is fixed the system simply isn’t safe to use: neither Charlie nor Bob can be sure that they’re actually going to get paid. And if Bob and Charlie have both lost large sums of money in contradictory ways… Have fun trying to come to consensus on who should eat the loss.

In a consensus system, naively adding redundancy subtracts from reliability in a particularly bad way: not only do you have twice as much code that can have bugs in it, but previously harmless subtle implementation differences are now serious problems.
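A toy Python sketch of how a subtle difference becomes a consensus failure. Everything here is hypothetical: the two "implementations" differ only in whether a payment may leave the sender with exactly zero, a difference invented for this example and not taken from any real client.

```python
from typing import Callable, Dict, List, Tuple

Balances = Dict[str, int]
Payment = Tuple[str, str, int]  # (sender, recipient, amount)


def impl_a_is_valid(balance: int, amount: int) -> bool:
    """Hypothetical implementation A: the sender may spend down to zero."""
    return balance - amount >= 0


def impl_b_is_valid(balance: int, amount: int) -> bool:
    """Hypothetical implementation B: the sender must keep a non-zero balance."""
    return balance - amount > 0


def apply_payments(
    balances: Balances,
    payments: List[Payment],
    is_valid: Callable[[int, int], bool],
) -> Balances:
    """Apply payments in order, silently skipping any the rule rejects."""
    state = dict(balances)
    for sender, recipient, amount in payments:
        if is_valid(state[sender], amount):
            state[sender] -= amount
            state[recipient] += amount
    return state


start = {"alice": 10, "bob": 0, "charlie": 0}
payments = [("alice", "bob", 10), ("alice", "charlie", 9)]

state_a = apply_payments(start, payments, impl_a_is_valid)
state_b = apply_payments(start, payments, impl_b_is_valid)

print("implementation A:", state_a)
print("implementation B:", state_b)
print("in consensus:", state_a == state_b)
```

On its own, either rule would be a perfectly usable system. Run side by side on the same payments, one concludes that Alice paid Bob and the other that Alice paid Charlie.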

Voting

So why not make three implementations, and have them vote two-out-of-three?

Real-world systems do work this way - the Space Shuttle went as far as to have five different computers, running two independently written versions of the flight software. Obviously this comes at a cost: three implementations is roughly three times as much work, and the Space Shuttle wasn’t exactly a low-cost project.

But using voting in decentralized systems also fails in another subtle way: part of the desire for independent implementations is the perception that they’ll “decentralize development”. In that respect redundancy still fails: the two-of-three solution is itself an implementation, with the choice of which three implementations get a vote being the implementation!
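A minimal Python sketch of two-out-of-three voting makes the point concrete. The three validity rules and the `IMPLEMENTATIONS` tuple are hypothetical stand-ins invented for this example; the thing to notice is that the voter, and the hard-coded list of who gets a vote, is itself consensus-critical code that everyone has to agree on.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(results: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value reported by a strict majority, or None if there is none."""
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    return value if count * 2 > len(results) else None


# Three hypothetical implementations of the same validity rule.
def impl_a(amount: int, balance: int) -> bool:
    return amount <= balance


def impl_b(amount: int, balance: int) -> bool:
    return amount <= balance


def impl_c(amount: int, balance: int) -> bool:
    return amount < balance  # subtle difference on the boundary case


# Which implementations vote is part of the consensus rules:
# change this tuple and you have changed the protocol.
IMPLEMENTATIONS = (impl_a, impl_b, impl_c)


def voted_is_valid(amount: int, balance: int) -> Optional[bool]:
    """Ask every implementation and accept whatever the majority says."""
    return majority_vote([impl(amount, balance) for impl in IMPLEMENTATIONS])


print(voted_is_valid(10, 10))  # impl_c is outvoted on the boundary case
print(voted_is_valid(11, 10))  # all three agree
print(majority_vote(["x", "y", "z"]))  # three-way split: no majority
```

The vote does mask the odd one out, but only because every node runs the same `majority_vote` over the same `IMPLEMENTATIONS`; a node that picked a different set of three could come to a different answer.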

How is the “Ethereum Protocol” Defined?

If we’re going to use redundancy, we need a specification. For something as simple as a pump, that specification doesn’t need to be all that detailed to work:

[Image: Fire Pump Specs]

As terrible as that specification is, if I tried to use it to buy a fire pump I’d get something back that still put water on fires. It wouldn’t be the best pump for the job - and I’d be the laughingstock of the job site - but when you get down to it pumping water just isn’t that complex.

In comparison, here’s an extract from the Ethereum Homestead “Yellow Paper”:

[Image: Ethereum Homestead "Yellow Paper" Extract]

And that’s just one of dozens of pages of densely written notation.

Yet is that sufficient to be a protocol specification? Apparently not! It’s very telling that the DAO bailout hard-fork isn’t a part of that Yellow Paper, the Ethereum wiki hasn’t been updated with the DAO bailout rules, nor was an Ethereum Improvement Proposal written for the DAO bailout. I looked at the codebases for Geth and Parity: both implement the hard-fork, but neither code base² points to a human-readable specification describing what that hard-fork actually was.

I don’t believe a second, compatible implementation of Bitcoin will ever be a good idea. So much of the design depends on all nodes getting exactly identical results in lockstep that a second implementation would be a menace to the network. -Satoshi Nakamoto

As far as I can tell, just like Bitcoin, while the Ethereum protocol is documented by human-readable text, in practice it is defined by executable code. Yet, it’s often claimed otherwise:

It's pretty nice that ETH has a spec and multiple implementations, in cases when there's a bug in a particular implementation.
— Emin Gün Sirer (@el33th4xor) September 19, 2016

I asked Emin and Vitalik to point me to that specification. I haven’t gotten a reply yet, but I’d be very interested in hearing their answer.

Footnotes

¹ Protocols with timeouts such as Lightning do change this, although the timeouts involved should (!) be on the order of a week or two; every Bitcoin fork to date has been resolved in no more than a few hours.
² The pull-req for Parity’s bailout implementation refers to this spec, but Geth doesn’t appear to mention that document at all. In any case, referring to a random document on GitHub in a pull-req is pretty dodgy - there are multiple levels of trusted pointers that can fail there and make it unclear what the specification actually is.
