tl;dr: You can now use our searchable database to download Bitcoin timestamps for items in the Internet Archive.

While that title sounds like clickbait, the hard work of the Internet Archive made it much more accurate than it sounds. They’re a San Francisco non-profit digital library that provides free public access to collections of digitized materials, including software applications/games, music, movies/videos, moving images, and millions of public-domain books. But they’re perhaps best known for the Wayback Machine, an archive of hundreds of billions of website snapshots, providing a priceless historical record of the evolution of the web.

In short, if it’s on the internet, there’s a pretty good chance the Internet Archive has a copy of it.

But is that copy the right copy?

OpenTimestamps helps answer that question by cryptographically proving data existed in the past, long before an attacker would have had an opportunity or reason to forge or modify that data.

The OpenTimestamps team has timestamped every item in the Internet Archive - about 750,000,000 files in total - and made those timestamps publicly available via a searchable database. This means that right now you can get timestamps for every book, movie, song, computer program, legal document, etc. in the thousands of collections in the archive. In the future we hope to be able to work with the Internet Archive to extend this to timestamping website snapshots, and our infrastructure will continue timestamping new items as they’re added to the archive.

Let’s look at an example attack on the archive, how a timestamp could prevent it, and finally, the tech details behind this effort.

Disclaimer: this is not an official Internet Archive project and was done entirely with publicly accessible APIs (though we did check with them in advance to ensure they had no objections to the project).

Contents

  1. I’m Satoshi Nakamoto
    1. How Timestamps can (and can’t) Protect the Internet Archive
  2. The Tech
    1. Getting the Digests
    2. Generating the Merkle Tree
    3. The User-Interface
    4. Timestamping the Merkle Tree
    5. SHA1 is Good Enough for Timestamps!
  3. What’s Next?
    1. Browser Compatibility
    2. Improved Coverage
    3. Mirrors
    4. Web Captures
  4. Footnotes

I’m Satoshi Nakamoto

…and I’d like you to invest your money in my next project, mChain, which will revolutionize the energy efficiency of proof-of-work with simulated quantum computing. Still don’t believe me? Let’s prove it. First, here’s a PGP signed statement:

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Peter Todd is Bitcoin creator Satoshi Nakamoto.

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQEcBAEBAgAGBQJZIgp7AAoJEMRbeOy+vH8ry20IAIn1jGibaU39n6Z3Mn1MwKlA
AHksriNkxZSTivm0kHN5xjCatujFDXL7WSkQkuP30/TUhVuMfwU5Fiw7qHw9QfFA
f2JrLy+XcEv2xxsziA7IrdvjJPSAIjl39hQODgSrhBpj21+hQxIzTtlm6UaHAnVg
iiSOCOVkl35GFi6bMRU80apHEFMOAcakMEDje+qlv2C/p1J/0lPdigdfEJh9nOnw
7EOTps9aXG5LeCXG2IuGBW1CzqlMuD/KOfmkK2WQxVytC80TaNBmkN9i05xSYbnd
BsByB3rMEKAlNRMo2pHhreOzdww+badEB7/w4Dj1rsLgcyGqjq/ZKeeXR7j7MLI=
=SMVi
-----END PGP SIGNATURE-----

As we all know, the Australian scammer Craig Wright produced a similar PGP message with a fake backdated PGP key. We know that key is a fake for a lot of reasons, including the fact that it doesn’t match the Wayback Machine’s Jan 2011 snapshot of bitcoin.org.

And yes, I signed that message with a different key too. But I have an explanation: you see, I stored all the Satoshi Nakamoto pseudonym stuff on a MicroSD card, which I lost in a tragic house fire right after Gavin visited the CIA. But you see, I actually had to delay the publication of Bitcoin a few months when I realised I needed to add smart contracts to it, and I just found a backup from that attempt. I uploaded it to the Internet Archive a few months prior to releasing Bitcoin:

[Screenshot: fake Satoshi key search result]

Similarly the fingerprint is mentioned in the original whitepaper, also on the Internet Archive:

[Screenshot: fake Satoshi paper search result]

Of course, those screenshots are photoshopped. But if I colluded with - or coerced - an Internet Archive sysadmin I could easily make them all too real. With Craig Wright alone allegedly scamming tens of millions of dollars, it’s easy to see how there can be a lot of incentive to manipulate history.

How Timestamps can (and can’t) Protect the Internet Archive

By consistently timestamping all Internet Archive content, we make attacks like the above easy to detect. The OpenTimestamps proofs we’ve generated are traceable back to the Bitcoin blockchain, a widely witnessed data structure with timestamps that can’t be backdated. Even with a sysadmin’s help, the best the attacker could do is create a modified file that’s very suspiciously missing a timestamp that all other files have.

However, it’s important to note timestamps are not a panacea: they’re just evidence as to when a file existed; by themselves they can’t prove a file is legit. For example, if I had known in 2008 that Satoshi was going to release Bitcoin, I could have generated fake keys and fake Bitcoin papers with 100% real timestamps. While such a scam is much less likely, it’s certainly not impossible [1].

The Tech

The Internet Archive collection is massive, dozens of petabytes in size. It’s so big that when Wayback Machine Director Mark Graham learned that we had timestamped the Internet Archive, he sent me an email asking:

How are you able to do anything with “all” of the Internet Archive? :-)

In fact, due to the excellent API the Internet Archive provides, this was way easier than you might expect!

Getting the Digests

Every item in the archive consists of a set of files and some associated metadata. Here’s an example from an item I’ve uploaded:

<file name="isos/disc-a.iso" source="original">
    <mtime>1479664926</mtime>
    <size>407240704</size>
    <md5>8afba33859360da23c2d354b92ab5c47</md5>
    <crc32>d5b9f56b</crc32>
    <sha1>0f42d040ff8370ff8a3041dd3e659f7e8a3d6c8c</sha1>
    <format>ISO Image</format>
</file>

Importantly, we can get the SHA1 digest without actually downloading the file!
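To make that concrete, here’s a minimal sketch of pulling SHA1 digests out of file metadata like the entry above, using only Python’s standard library. The enclosing `<files>` root element and the helper name `sha1_digests` are assumptions made for illustration; this is not the script used for the project.

```python
import xml.etree.ElementTree as ET

# Sample metadata: the <file> entry is the one shown above; the enclosing
# <files> root element is an assumption made for this sketch.
FILES_XML = """<files>
  <file name="isos/disc-a.iso" source="original">
    <mtime>1479664926</mtime>
    <size>407240704</size>
    <md5>8afba33859360da23c2d354b92ab5c47</md5>
    <crc32>d5b9f56b</crc32>
    <sha1>0f42d040ff8370ff8a3041dd3e659f7e8a3d6c8c</sha1>
    <format>ISO Image</format>
  </file>
</files>"""


def sha1_digests(files_xml: str) -> dict:
    """Map each file name to its raw 20-byte SHA1 digest (hypothetical helper)."""
    root = ET.fromstring(files_xml)
    digests = {}
    for entry in root.iter("file"):
        sha1_hex = entry.findtext("sha1")
        if sha1_hex is None:
            continue  # skip entries that carry no SHA1 digest
        digests[entry.get("name")] = bytes.fromhex(sha1_hex)
    return digests


digests = sha1_digests(FILES_XML)
```

The digests are kept as raw bytes rather than hex strings, since that is the form a merkle tree hashes.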

To get the complete set of digests I use the scraping API to search for all items added on a given day. Using the excellent internetarchive library, the above was implemented in about sixty lines of Python.

Generating the Merkle Tree

There have been a lot of ridiculously inefficient Bitcoin timestamping schemes, using one, or even two Bitcoin transactions per timestamp. The insane thing is these schemes actually get used on a large scale:

Not having $116 million of spare change lying around, I decided to use a merkle tree instead - the advanced moon-math invented by Ralph Merkle in 1979 [2].

To save time I decided to re-use the OpenTimestamps Calendar Server codebase. The way a calendar server normally works is it maintains an append-only journal of commitments (digests) it has promised to timestamp, and a database of timestamps for those commitments.

The actual database is a simple LevelDB key:value mapping, with the keys being the messages, and the values being the commitment operations and notary attestations (namely Bitcoin headers) that comprise the timestamp tree nodes. When a client asks for a timestamp for a given message, the server just walks the tree recursively.
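As a rough illustration of that lookup, the sketch below uses a plain dict as a stand-in for LevelDB. The tuple layout, the fused "concatenate then SHA256" steps, and the names `apply_step` and `walk` are all assumptions made for this example; the real calendar stores serialized OpenTimestamps operations and a message can have more than one child.

```python
import hashlib


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def apply_step(message: bytes, step: tuple) -> bytes:
    """Compute the next message in the tree from one commitment step."""
    kind, arg = step
    if kind == "append":
        return sha256(message + arg)
    if kind == "prepend":
        return sha256(arg + message)
    raise ValueError(f"not a commitment operation: {kind!r}")


def walk(db: dict, message: bytes) -> tuple:
    """Follow steps from `message` until an attestation is reached.

    Returns (path, attestation), where path is the list of steps taken.
    Raises KeyError if the message was never timestamped.
    """
    step = db[message]
    if step[0] == "attest":
        return [], step[1]
    path, attestation = walk(db, apply_step(message, step))
    return [step] + path, attestation


# Tiny hand-built example: two leaves under a single attested root.
leaf_a = hashlib.sha1(b"first file").digest()
leaf_b = hashlib.sha1(b"second file").digest()
root = sha256(leaf_a + leaf_b)
db = {
    leaf_a: ("append", leaf_b),
    leaf_b: ("prepend", leaf_a),
    root: ("attest", "placeholder for a Bitcoin block header"),
}

path, attestation = walk(db, leaf_b)
```

Because every key is itself a message, the server needs no separate index: each lookup yields the next key to look up, until an attestation ends the recursion.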

However, calendars normally only timestamp a few thousand commitments at a time. So the code to generate merkle trees and add them to the database isn’t particularly efficient - it doesn’t have to be - and even worse, it keeps multiple copies of the entire tree in memory until written, one for each time the fees are bumped with a replacement transaction.

I knew that code was going to fall over trying to timestamp hundreds of millions of digests, so I quickly hacked up a more efficient import script for the initial merkle tree that wrote to the database incrementally, level by level, keeping nothing in memory. The idea was that subsequent timestamping could be done with much smaller, per-day trees.
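The level-by-level idea can be sketched in a few lines of Python. This is a simplified illustration rather than the actual import script: a dict stands in for the database, each level is held in a list (a truly memory-free version would stream each level from disk), an unpaired node is carried up unchanged, and the names `build_merkle_tree` and `climb` are made up for the example.

```python
import hashlib


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_merkle_tree(leaves, store: dict) -> bytes:
    """Build a merkle tree one level at a time and return its root.

    As each pair is combined, the step linking each child to its parent is
    written to `store`, so only the current level has to be kept around.
    """
    level = list(leaves)
    if not level:
        raise ValueError("need at least one leaf")
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level) - 1, 2):
            left, right = level[i], level[i + 1]
            store[left] = ("append", right)    # parent = sha256(left + right)
            store[right] = ("prepend", left)   # same parent, seen from the right
            next_level.append(sha256(left + right))
        if len(level) % 2 == 1:
            next_level.append(level[-1])  # odd node out: carry it up unchanged
        level = next_level
    return level[0]


def climb(store: dict, message: bytes, root: bytes) -> int:
    """Follow stored steps from `message` up to `root`; return the step count."""
    steps = 0
    while message != root:
        kind, other = store[message]  # KeyError if message is not in the tree
        message = sha256(message + other) if kind == "append" else sha256(other + message)
        steps += 1
    return steps


# Five stand-in SHA1 digests as leaves.
leaves = [hashlib.sha1(bytes([i])).digest() for i in range(5)]
store = {}
root = build_merkle_tree(leaves, store)
depths = [climb(store, leaf, root) for leaf in leaves]
```

One transaction committing to `root` then timestamps every leaf, and each individual proof is only as long as the tree is deep.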

The User-Interface

In parallel, Riccardo Casatta and Luca Vaccaro of Eternity Wall, and Igor Barinov, were working on the code and graphic design for the database UI. They did 100% of the work on the website - this description is second-hand - but what they’ve told me is the site is essentially a wrapper around the Internet Archive advanced search API that additionally queries the calendar I set up.

Timestamping the Merkle Tree

I gotta admit, I figured I’d get a few small donations; in the end I got dozens of donations totalling 0.218 BTC! Those donations got combined into one output in tx 564d27fc17068e8d4c997a86287fe79b37b07552b3fb5e3c11c1a3d4fd933882, with the actual timestamp being 8465d34ede9e3387cfd7aacae880cfc86a5bdc603d8822e3b0e7c1369f8acfa8.

The final step of adding the transaction to the database was done manually with the python-opentimestamps library, about an hour and a half prior to when I was supposed to give a talk and live demo.

In fact, as I soon found out prepping for the demo, I’d managed to completely miss an entire year in my initial merkle tree, and part of two other years, as I forgot to copy a few directories! For the live demo I wanted to have the audience pick what I’d search for; I was out of time at that point so I crossed my fingers and hoped it’d work the first try, which it fortunately did.

SHA1 is Good Enough for Timestamps!

A limitation of our approach is that we’re restricted to timestamping the digests the Internet Archive API gives us, the strongest of which is SHA1. While it is true that SHA1 has been broken, that break isn’t relevant for timestamping: while it is possible to generate two messages with the same SHA1 digest, both messages have to be generated simultaneously. For the purpose of a timestamp proof this is totally OK! All we care about is preimage attacks that find a message with a specific hash digest. SHA1 is not vulnerable to those attacks; with the one exception of Snefru-2, there are no examples of any modern cryptographic hash function being vulnerable to preimage attacks [3].
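In practice the check a verifier cares about looks like the sketch below: recompute the SHA1 of the file you actually have and compare it to the digest that was timestamped. Passing this check with a different file would take exactly the preimage attack described above. The function names are hypothetical, and a real verification would also check the OpenTimestamps proof that commits to the digest.

```python
import hashlib
import io


def sha1_of_stream(stream, chunk_size: int = 1 << 20) -> bytes:
    """Hash a binary stream in chunks so large files need not fit in memory."""
    hasher = hashlib.sha1()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        hasher.update(chunk)
    return hasher.digest()


def matches_timestamped_digest(stream, timestamped_sha1_hex: str) -> bool:
    """True if the stream's SHA1 equals the digest that was timestamped."""
    return sha1_of_stream(stream) == bytes.fromhex(timestamped_sha1_hex)


# Well-known test vector: SHA1("abc").
ABC_SHA1 = "a9993e364706816aba3e25717850c26c9cd0d89d"
ok = matches_timestamped_digest(io.BytesIO(b"abc"), ABC_SHA1)
tampered = matches_timestamped_digest(io.BytesIO(b"abd"), ABC_SHA1)
```

With a file on disk, the same function can be called on `open(path, "rb")` instead of the in-memory stream.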

What’s Next?

Browser Compatibility

Currently the database UI works fine on Chrome, but has issues with Firefox and Safari; we’re working towards supporting all major browsers.

Fixed!

Improved Coverage

Unfortunately, our initial timestamping effort wasn’t 100% complete - as we later found, some of the items in the archive are entirely missing the “added-on” metadata field that we used to find all items. The Internet Archive is also not a consensus system, so random errors may have left some items without timestamps as well. So we’re doing another pass to make sure we have 100% coverage.

Mirrors

While you can download the raw LevelDB database of Internet Archive timestamps, we could use a better way to mirror this work. In particular, we need a format for which it’s easy to upload the timestamps generated to the Internet Archive. We also need a format that can be downloaded incrementally, even as new timestamps are added. This problem exists for OpenTimestamps in general, so I hope to solve it for both use-cases at once.

Web Captures

Behind the scenes the Wayback Machine uses the Web ARChive (WARC) archive format to archive web crawl data. These archives are stored in the archive as items, with dozens of collections available such as the Common Crawl.

This means that we have timestamped the underlying internet crawl data as part of this effort. However we have two problems:

  1. Much of the crawl data isn’t publicly accessible for various reasons such as embargo agreements. Without that raw data, third parties like you or me can’t actually verify the timestamp proofs.

  2. The granularity of the timestamps isn’t on a per-capture basis, so even if you do get the raw data, it’s inconvenient to verify.

The second problem is obvious: we’ve only timestamped a few hundred million individual files, while the Wayback Machine has captured over half a trillion [4] individual web objects! Scaling up our effort to the entire Wayback Machine is going to be a lot of work - and it’ll need the direct involvement of the Internet Archive - but what I’d like to see in the future is the ability for an advanced user to download a snapshot and all resources referenced by the snapshot as some kind of WARC capture, extended with timestamp proofs.

Footnotes

  1. In fact, as I’m on record as having been discussing crypto-currencies with Adam Back and Hal Finney back in 2001, I guess I’m a possible suspect for such a fraud! 

  2. “Method of providing digital signatures”, Ralph Merkle, US Patent 4,309,569 

  3. “Lessons From The History Of Attacks On Secure Hash Functions”, Zooko Wilcox, accessed 2017-02-24 

  4. “Defining Web pages, Web sites and Web captures”, Vinay Goel, Internet Archive Blogs, Oct 23rd 2016