Sunday, September 17, 2017

Hidden Ethereum: The Stateroot, Part 2: A Stateroot is a Handle


In source code, a database is usually treated as an an idealized abstract concept.

Real databases have tons of features.  In a recent project, we connected to 12 databases with plugable vendors, configurable locations, load balancing, automatic backup, and local caching.  Sometimes our "database" wasn't a single thing, but a distributed or logical entity, or maybe a mysterious cloud based web service (yes, the phrase DaaS is actually a thing).

Without abstraction, dealing with all of this would be madness.  Imagine having to chase data across the physical machines in a distributed filesystem, in the code at the middle of a soup sharing app.  When programming, configurable database features are just nuisance.  A developer just wants to plop a tiny data hobgoblin down on the system that feeds and expels information upon request.

And that is basically what engineers created.

In source code, the oddly named object called the "handle" is a small pointer to an idealized, single-minded data hobgoblin.  At runtime, we can wire up a handle to point to a real database.  A handle is a promise that someday some devops guy is going to set up an actually useful backend, but today we are free to store everything in shoddily configured SQLite.

Handles are black boxes.  Internally, they tend to be pointers to opened network connections or unix sockets, or even something as simple as a unique ID that the backend stores in a lookup table, but mostly, developers neither know nor care what is inside.

A blockchain *itself* is a very abstract database.  It is a logical collection of mutually hostile peers competing to steer the future of the data they hold.  But, it is still a thing that holds data, and the data can be queried.  Might we want to bake a "handle" into our blockchain APIs?



So, how *do* you query a blockchain?  What if you want to ask the simple question "how much money do I have?".

Even this basic task is slightly complicated.  You can find an arbitrary peer in the network to query, but peers often lag in syncing up.  Or worse yet, since blockchain updates are always subject to some wheeling and dealing, peers might temporarily disagree with each other on reality itself.  At any given moment, the state of the database is in a hodgepodge of competing snapshots.  We need a way to label each candidate snapshot, and include this label in any query.  This will give us unambiguous results, independent of the opinions of the peer you happen to be connected to.

Or to put it differently, we need a database handle.  Just like before, we just want a simple object that points to an idealized information hobgoblin (in this case a database snapshot).  Once we have this pointer, we will include it in any query to indicate which database snapshot we would like to query.

The stateroot of a database is statistically unique, and peers already pass them around as a hashes to verify data integrity.  Stateroots therefore are a perfect handle.

Whenever you want to query some data, you select your favorite snapshot, then include its stateroot as a handle in your query.  The peer will know which snapshot it refers to, and can perform the query against it.

This works great in resolving accidental ambiguity, but how do you know that the peer you connected to isn't lying to you?



When we connect to a simple local non-blockchain SQL database that the devops guy has set up, we trust the data it returns.

On the other hand, when we connect to an arbitrary peer in a blockchain network, you have no way to know whether it is run by the Bank of Norway, some kid doing a blockchain experiment for his high school class, or Bernard Madoff.  Most folks are good at heart, but you should just assume that you've connected to an evil buffoon.

Luckily, a stateroot isn't just a handle....  It is a *cryptographic* handle!

As I mentioned in part 1 of my discussion of stateroots, a stateroot is a hash that can verify a database piecewise, even if you don't have all the data in that database, using an extra piece of information called the "Merkle Branch".  You can therefore query a peer, and simultaneously challenge it to *prove* its return value is correct by including a Merkle Branch.  If everything checks out, it doesn't matter who sent you that value, you can be sure that it was in the initial snapshot.

(Of course, deciding whether you should trust a snapshot is the job of mining.... so I won't get into that here).



There is one last oddity that I want to mention.  If you update a database, you can also quickly recalculate its stateroot, even if you don't have any other data in the database, as long as you have the Merkle Branch to the location being modified.

You might think this could extend our discussion of blockchain querying to include updating, but this definitely is not the case.  Remember, a blockchain database is locked down to changes, which can only occur through the ceremonial process of submitting transactions and mining.  There is no technical reason that the API could not include direct updates, but it would be undesirable to allow it.

However, at a lower level, within the software that powers a blockchain node, this is exactly how updates are performed.  This process allows us to view the evolution of a database as succession of stateroots over time.

I personally find this idea very cool, and will cover it in part 3.







Monday, September 11, 2017

"Ground Zero" Memories

When the Twin Towers fell, everyone called the site of the attack "Ground Zero".  I realized recently that this phase has disappeared from usage, probably about the time that the Freedom tower was built.

I moved to California about a year before 9/11, and was lucky not to lose anyone I knew in the attack.  I grew up in Connecticut and the Twin Towers were always about an hour away.  You never prioritize what is nearby, and by early 2001, I had walked through the underground WTC shops, but had never been to the top.

2001 was also the year that N and I were dating pretty seriously, and on March she flew out to meet my folks.  She had never been to NYC, so I planned an introductory tour, including all the tourist traps.  I usually brought newbies to the Empire State Building, but since *I* had never gone, we instead went to the World Trade Center Observation Deck.  This of course was our one and only time.

(I can't find the one picture we took at the top, I may add it here at some time in the future.)

We spent the next few months in California, busy with work, and planning our wedding.  On 9/11 my mother called from Connecticut at 6 in the morning.  She was watching the attack live, and suddenly most of the TV stations blacked out (the WTC used to broadcast from antennas at the roof).  She found comparable Connecticut stations, and we watched together, shocked, as the towers fell.

This is how most of us saw it that day.  Watching that video brings me right back.

As big of a deal it was in California, the effect in NYC was enormous.

N and I returned to the East coast in December, as a married couple, and took a series of trips into Manhattan.  Giant flags and memorials were everywhere.




The city was clumsily trying to readjust to the new reality.  We finally took that trip the the Empire State Building, to see this sign:


I was a bit shocked to already see peddlers aggressively selling tribute books and attack related trinkets.  It felt too soon.  We met with a friend who lived in NYC, who initially also was offended but eventually had grown to like the peddlers, it indicated a return to normalcy.  She also described the dust, smog and vile smell that had permeated the city until recently.

As bad as midtown was, the south was a mess.  Southern Manhattan was basically shut down.  Subways were severed, roads were barricaded, buildings were locked up, dusty and vacant:



Buildings surrounding Ground Zero had their faces ripped to shreds:


That week the City erected a platform at St. Paul's Chapel at the edge of Ground Zero.  I took a series of photos and glued together this panorama:





Here is a closeup:




It is 2017, and my family has now moved to Manhattan.  This is our first September 11th here, so I went down to the World Trade Center and took some pictures to compare it to Ground Zero.





Until retirement, my father worked as a manager in the NYC construction scene.  In the 80s, he actually was assigned World Trade Center Building 7 (which seems now to be the focus point of some odd conspiracy theories).  It must have been very odd for him to watch a building he helped create collapse in front of his eyes.

Here is the new Freedom Tower and the newly rebuilt WTC Building 7:



While viewing the footprint of the old towers, I found myself thinking about all the underground stores I had wandered decades earlier, including a small cake shop.  Then, I realized I was actually looking at that shop itself:


It is 16 years later, and construction still continues at parts of the site.


At the current site, only one object from the old remains:


Friday, September 8, 2017

First World Problems, Blockchain Developer Edition: My Official "Trie" Pronunciation Rant

Here's this thing that happens....  A youngish chap starts discussing his fun blockchain project, and naively steps into a beartrap....  He utters the phrase "Merkle Trie" (Trie as "tree"), then an older guy chortles with a paternal sternness and repronounces it "Merkle Trie" (Trie as "try").

Well angry experienced guy....  You are wrong!

And I have word on this from no less than Edward Kmett.  Or if you aren't impressed by that, how about a little thing I call...........  the collective wisdom of entire world!  Or maaaaaaybe.....  the actual inventor of the trie himself!

I mean, I get it....  over time it has evolved a bit and now has this cute quirky alternative pronunciation that your CS486 TA used in lab class right before uttering the phrase "A review session shall be held to study the essential datum to be considered".

But try this on for size Mr. Hot-Pants:  A trie actually *is* a tree.  I mean, one specific type of tree, but a tree nevertheless, so is it really gonna freak out people's brains if you say "The balance is stored in the trie" and they hear "The balance is stored in the tree"?

And do you know what a trie is definitely not?  It is not a "try", whatever that *possibly* could even mean.

And yeah, you just happen to be in that one in a million case where you have both a tree-that-isn't-a-trie *and* and tree-that-is-a-trie, and boy it is useful that you can verbally distinguish the two without having to actually say one extra word.

So it is great that *your* superior pronunciation exists to avoid all that confusion.  Except...........  You ***literally*** chose to pronounce it as the 127th most common word in the English language (as opposed to "tree", which comes in at 596, you dork), a word that for some bizarre reason happens to be hardwired in *my brain* to only ever be interpreted precisely as a verb.

I mean, why don't you just call it a "the", or an "and", or something equally preposterous.  In fact, why don't we just call all of our internal algorithmic structures the top words on the English frequency list so we can avoid an accidental overlap with any mythological zoological creatures or whatever.  Then we can have incredibly clean sentences like

"Querying the first A in the BUT and adding the OF to the HAVE requires a complexity of O(n^2)"


Thursday, September 7, 2017

Hidden Ethereum: The Stateroot, Part 1: A Stateroot is a Hash

If you glance at an Ethereum block, among all the important looking stuff is an almost afterthought called the "stateroot" (or "root hash", or "root", if it is even mentioned at all).  People who just want to trade a bunch of Ether should just ignore that weirdo tag (and everything else in a blockchain explorer) and head straight to Coinbase.

But, for anyone interested in the technology behind blockchains, this "stateroot" is one of the most important concepts.  It's so important, that I am forgoing an evening in hip Brooklyn to stay in and passionately write about it.

The stateroot is one of those topics that you can relearn at progressively deeper levels, so I'll build up the discussion across a few posts, starting here with its description as an interesting low level piece of plumbing, then go deeper in future posts....



If you push aside all the hype, a blockchain is basically a database.  It happens to be a globally shared database designed to thrive under hostile and paranoid conditions, run by opportunistic volunteers begrudgingly syncing state... but a database nevertheless.

How one coerces hostile adversaries to maintain the state of a database is an interesting topic, but for now, let me focus a simpler sub-question.  How can you even *know* that two databases hold identical data, even in the best of circumstances?

Let's say you and your buddy have two copies of the same evolving database, one in Topeka, the other in Kathmandu.  You *think* that they are in sync, but need to *prove* this.  This database is big-ish, say, 500GB, and the number of updates is small-ish, say, 100 or so per every 15 seconds (a bit like, say, a blockchain?).

How do you show that the Topeka and Kathmandu data are identical, ideally in realtime?

A really dumb way to try to do this would be to just send a copy of the full database from Topeka to Kathmandu over the internet (and forgo any possibility of watching Netflix for the next 24 hours), then do a giant diff.

Right about now, a scowling developer just threw his coffee mug at the computer screen and angrily yelled "You idiot!  This is trivially solved using a hash function".

This is the usual way that we verify that two blobs of data are identical.  Some mysterious crypto function (with a strange name like "SHA" or "Keccak") converts each and every byte of data in a giant dataset into one single line of gibberish.  If the gibberish in Topeka and Kathmandu match, so do the databases (more accurately, if the hashes match, it would be statistically impossible for the databases to differ).



While this is much less "really dumb" than just sending the full dataset across the world, it still doesn't work all that well.  SHA'ing all that data will take about half an hour, way longer than the 15 second modification time.  And the computer is going to be pretty unusable with this nonstop SHAing going on (remember, it still has to act as a functional database during all of this).

Honestly, it would probably take 10 minutes to just output all the data straight to /dev/null, much less run it through a hash algorithm.  So this is a deeper problem than "just get a quicker algorithm" can fix.  We need to rethink the hashing process itself.

The real problem here is that every single update triggers a full recalculation of the hash.  It would be ideal if we could find a hash function that could be updated quickly, then just evolve it alongside the database from some known point in time (perhaps early along when the database was much smaller).  Anticipating changes in a hash almost seems inconsistent with the spirit of hash functions, but as long as those changes themselves are unpredictable, it is not.  In fact, as I am about to show, there is a simple way to modify any pre-existing hash function to have this property.

The standard way to deal with just-too-large of a problem is to break it into many smaller pieces.  For instance, imagine every bank balance in the entire world in a database:

Adeline$10
Archibald$41
Bernard$72
Bert$2
Cecil$17
Clement$18
Clyde$45
Domingo$57
Donald$82
.... and so on


That will take some serious time to hash, for arguments sake, let's say, half an hour (a tiny understatement).

Instead, let's use the first letter of each name to split this into sub-databases (Adeline and Archibald are in the 'A' sub-database, Bernard and Bert in the 'B', etc).  Topeka and Kathmandu now need to verify that 26 hashes are the same... or, better yet, realizing that the list of hashes itself is just data, hash that blob and compare just that one single hash.  (Yes, this is the hash of a hash, and yes, that is kind of weird....  but, let's live a little!).



Now, if Clement earns $7, you just rehash the 'C' database and Kathmandu will know in record time that all is good.  This is definitely a win, as our resync (of a 500GB database) is about a minute, and we still just pass around a single hash.

While better, one minute per hash isn't good enough.  To keep up with all the changes, the CPU would be maxed all the time.  We could start fine tuning the number of divisions (smaller chunks=faster hashing/extra overhead on the secondary hash), but there is a better way.

The magic is....  You can split sub-databases themselves.  All the good stuff that happened when we split our big database into smaller ones repeats when we split those into smaller ones (and so on).  For instance, the 'A' database above can be split into 'AA', 'AB', 'AC', etc.  And those can be split further, until each sub-sub-etc-database contains one single item.  This is much better than just adding subdivisions to the one one level split, because the size of the data you need to hash for a single update remains small (ie- grows logarithmically with the number of accounts).

This process shatters the database into a beautiful tree.  You wander up the tree using paths determined by the letters of the username (A-D-E-L-I-N-E).  Each branchpoint is populated with the hashes of hashes of hashes (="hashception"?), etc going down away from the leaves.  And at the root lies one single hash, called the "root hash", which itself *is* a hash that can be used to verify all the data in the database.

In the special case where your data just happens to hold the full state of a giant globally shared database in a hostile and paranoid environment (AKA- a "blockchain"), we call that particular "root hash" the "stateroot".



(yeah, I mashed together some of the linear parts of the tree in the diagram above in places where it would have been stupid not to....)

This tree is called "the Merkle Patricia Trie", in case you are interested, and it has applications that go beyond just hashing stuff.  You can read more about this here and here.



The stateroot is a hash with some amazing properties:

1. Fully calculating a stateroot is effectively as fast as calculating a SHA, however

2. Updating a stateroot from a change in the database is effectively immediate.  This means that

3. You can verify two databases are in sync in realtime.

These next two points kind of blow me away:

4. A stateroot can be used to verify just a single item in a database, *even if you don't have any other data in that database*.  (You will also need to download a small metadata "proof" called the "Merkle branch" that ties this value to the stateroot).

5. Likewise, you can update a stateroot from a single change in the database *even if you don't have any other data in that database*, as long as you have the Merkle branch to that point.

For now, you can almost think of a stateroot as a hash with a sort-of-kind-of concept of locality.  As described above, you can both update and verify from just some local data.

It is these two points that made me suspect that the concept of the stateroot goes deeper than just hashing....  I will delve into this more in my next stateroot post.










Wednesday, August 23, 2017

Crescent Suns Everywhere

Yup, I saw the total eclipse, in a nondescript town called Gaston, SC.

Nope, pictures will never do it justice, I will not even try to show you one.  If you want to experience a solar eclipse, you need to go and see a solar eclipse.  I can, however, show you a picture of something that I thought was amazing....

Right before the sun disappeared, openings in the tree branches above converted to thousands of pinhole cameras, each projecting a mini crescent sun.  No-one had ever told me that totality would be precipitated by dancing crescents painted upon earth everywhere, this took my breath away.


Here we stand in the quickly dimming sky, awaiting the moment....



Monday, August 21, 2017

Pancake FAX Machine

Our Holiday Inn in Aiken, SC seems to have a pancake FAX machine....  


Press a button, and pancakes come out.

Sunday, August 20, 2017

Never Stare at the Sun

I've always known it's a pretty bad idea to stare at the sun, but it wasn't until I started researching the eclipse that I discovered *how* bad of an idea it is.  We all know that you can "burn your retinas", which, yeah, sounds pretty terrible, but I thought that this was more of a metaphorical "ohhh-this-menthol-patch-kinda-burns" kind of thing, more like a fluff local late night news warning, "common household item can kill your dog, news at 11".

In fact, if anything, it is much worse....

Let me explain using a story we will all be familiar with....

Bad child holds magnifying glass over ant.  Sun beams focus on ant.  Ant incinerates.

That is *exactly* what would happen in your eyeball.  *precisely*, *literally*

The lens of your eyeball is a small magnifying glass, evolved to focus light beams precisely to a point in the middle of your retina.  Point this tiny magnifying glass towards the sun, and you incinerate what is behind it.  Let's rewrite the above story as

Clueless human holds eyeball-lens over delicate rods and cones.  Sun beams focus on rods and cones.  Rods and cones melt.

Sorry to be so graphic, but you would literally cook your retina, and boil that fluidy stuff inside your eyeball whose name nobody knows.

This is pretty serious stuff.  Using a lens, you can ignite foliage.  You can start a fire using dollar store reading glasses.  This guy starts a fire with nothing other than the concave reflective bottom of a pepsi can.  Here is a case where a telescope pointed in the wrong direction may have burned down part of a house.

This guy thought it would be funny to hold a magnifying glass over his buddy.  His buddy didn't agree.

And it is even worse than that.  Your retina doesn't have any pain receptors in it (why would it?), so you don't actually have any sensation to warn you of what is happening.  In fact, according to this guy, if anything, the discomfort of looking at a bright object decreases as you partially blind yourself.  And unlike your skin after sunburn, the retina will never heal.

On the bright side, if you ever find yourself alone and afraid on a deserted island with just a pair of reading glasses and a can of beer, you now know how to start a fire or boil some water....  You're welcome.

One last note: Every last bit of this warning is applicable to cameras also.  Remember this if you decide to film the whole eclipse using your iPhone.


(No ants were harmed in the writing of this post.)






Thursday, August 17, 2017

Statistical Impossibilities


In Ethereum, data is stored by its hash (256 bit Keccak, to be exact).  Blocks point to their parents via hash, contracts are conjured up via hash, even that measly amount of Ether you purchased in a panic because the price was going up quickly back in April is stored in a data structure in a Patricia Merkle Trie on top of a LevelDB that you call up by....  256 bit Keccak hash.

I've trained a bunch of people on the guts of Ethereum over the last couple years, and every once and a while I get the hash collision question:

  -Hashes aren't unique.  Eventually, isn't someone gonna obliterate my Ether??

Yeah, hashes aren't unique.  This is kinda obvious by the fact that the number of hashable things (aka- *all-possible-data-sets*) is larger than the number of 256-bit hashes (ie- infinity > a finite value), so eventually you're gonna start having to reuse those hashes.  So it is technically *possible* that I will wake up one day and my super-secure-banking-smart-contract will have been replaced by a prototype for a soup sharing app....  But it isn't gonna happen.  It is a statistical impossibility.

Banks know this.  They are so confident that you aren't going to stumble upon an arbitrary large number, that they base their cryptography on it.

But, even if unlikely, as a developer, isn't it best practice to account for all possibilities?  Shouldn't you always write some code to (at least) warn that a catastrophe is occurring?

No.     No you should not.

For comparison, here are 10 things that are more likely than a single collision between two 256 bit numbers (a 1 in 1077 likelihood).

1. You will be run over and killed by a car *today*: 1 in 109

2. You will be killed by a lightning bolt: 1 in 107

3. A coconut will fall on your head, and you will die: 1 in 108

4. An asteroid will destroy humanity *this week*: 1 in 108

5. In the *same minute*, a television will fall on you and crush you, a tornado sweeps in from the sky and violently tosses you, then a shark eats you, all killing you.  Also, in that same minute, a coconut falls on your head, killing you again:
(1 in 1016) * (1 in 1012) * (1 in 1016) * (1 in 1016) = 1 in 1060

6. You will be the next president of the USA: 1 in 108
(assuming, of course, you are a citizen of the USA....  The chances go down a wee bit if not, but definitely not all the way down to 1 in 1077)

7. You will win Powerball 4 times in a row, and then you will win Mega Millions 4 times in a row: 1 in 1066

8. One day you will wake up with an uncontrollable Swedish accent, that will forever make it look like you are making fun of Swedish people: 1 in 1010
(it is only 1 in 108 that you will contract foreign accent syndrome, and extra two orders of magnitude that it be Swedish)

9. You have a plant growing inside you right now: (I don't know the probability, but it has happened at least once in our lifetime, so presumably higher than 1 in 7 billion)

10. The 10 top Ethereum developers all die of a heart attack tomorrow, leaving the Ethereum foundation in chaos: 1 in 1053



And remember, 1053 may sound like it's kinda *almost* as big as 1077, but that ain't how exponents work.  You would have to live in 1024 universes where the top 10 Ethereum developers die of a heart attack tomorrow (that is a billion billion million universes) before a single Keccak cracked.



But, you say, on the blockchain we have more than 2 hashes that can collide.  Certainly the chances of a single collision go up if you have, not 2, but many hashes floating around.  This is a very well understood problem, and it turns out that even with a billion hashes the probability of just one single collision just drops to about 1060.

OK, what about a deliberate hack that searches for a collision?  Could a shadowy figure engineer the downfall of the financial system with sufficient computing power?

Well, *maybe* Keccak will be mathematically cracked someday (or already, in a musky underground NSA lab), perhaps using a quantum computer, but unless this happens, I am not going to lose any sleep.  For instance,

1. Consider the brute force search, on a single super fast computer that can perform 1 billion Keccak hashes per second (real numbers are a few orders of magnitude slower).  Crack time would be 3*1060 years = 1050 "the-length-of-the-universe"s

Maybe you could string together a bunch of GPUs and speed this up using parallelization?

2. Consider the brute force search, on a massively parallel computer where we have converted all 1050 atoms on the planet earth (including all that lava) each to an individual, fully functional computing CPU core that can each perform 1 billion Keccaks in a second.  Crack time would be 1010 years = 1 "the-length-of-the-universe"s

or

3. Consider the brute force search, on a massively parallel computer where we have converted all 1068 atoms in the galaxy, each to an individual, fully functional computing CPU core that can each perform 1 billion Keccaks in a second: Crack time would be 0.01 seconds

Hey, 100 per second!  Now were are getting somewhere! (of course you will have to wait on average another 105 years for this information to propagate to you through the now-galaxy-computer at the speed of light).

If these laughable scenarios haven't convinced you of the impossibility of a collision, in case you still feel that it *just-can't-hurt* to add that code to deal with a hash collision, let me leave you with one final statistic:

1. The chances that the (difficult to test and never triggered) code you write to deal with a hash collision will add destabilizing bugs to the code: pretty close to 100%

Wednesday, July 26, 2017

Presents for the boys



Indian Dunkin Donuts


Q: What is the difference between California and India



A: India has Dunkin' Donuts!





Malls of India


I used to think of malls as about the most American thing possible....  Not so anymore.



India (and all of Asia) is filled to the brim with them.  They really change the feel of the cities.




















Jim's Birthday




Visions of Goa























Spice Plantation


We toured a spice plantation while in Goa.








coffee


betelnut tree



banana


vanilla










cinnamon





turmeric



Willy Wonka's chocolate river