Cryptography is a massive field; one could easily spend an entire semester on the topic, so as you read this page, keep in mind that we’re just covering a few of the basics.
Hashes #
You may be familiar with the side dish where you chop up some potatoes, fry them in butter, and add salt. The potatoes get hashed, and then browned, so we call this “hash browns.” Note that it is easy to make hash browns, but impossible to turn hash browns back into potatoes.
Just like hashing potatoes, a hash function is a mathematical function that is easy to calculate in one direction, but impossible to reverse. The results of hash function are always the same for any unique input.
Impossible to reverse? #
What does it mean that a hash is impossible to reverse?
In one sense, it just means that if you are given the output of the function, you can’t figure out what input was used to create it.
But there’s a deeper problem, which is that each hash output can be generated by multiple different inputs, all of which vary wildly in their values. This means that even if you could figure out one input that creates a certain output, you still can’t be sure that this was actually the input used where you found the hash.
When you find two inputs that a hash function maps to the same output, that’s called a hash collision. If you can do this for an arbitrary input in a reasonable amount of time, rather than just hunting at random for two matching hashes, we say that the hash is broken.
A few popular hashes and an example #
- CRC-32: “cyclic redundancy check”, 32 bits, 232 = 4.3 x 109
- MD5: “message digest”, 128 bits, 2128 = 3.4 x 1038
- SHA-1: “secure hash algorithm”, 160 bits, 2160 = 1.5 x 1048
- SHA-256: “secure hash algorithm”, 256 bits, 2256 = 1.2 x 1077
- BDH8: “Brandon’s dumb hash”, 8 bits, 28 = 256
The first four are real hashes in actual use around the world. CRC-32 is a checksum that is easily broken, but so fast that’s its still useful for error correction in data transmission. Given the speed of modern PCs, MD5 is now considered broken for password authentication. SHA-1 hash collisions are harder to calculate, but still within the range of state-level adversaries (e.g. the NSA).
Update: Ming Chow tells me that SHA-1 is, in fact, thoroughly broken: http://shattered.io/
You can calculate hashes on your Raspberry Pi pretty easily. Here are a few examples.
pi@raspberrypi:~ $ echo this-is-a-random-test-string | cksum # calculates the CRC-32 checksum
3127080174 29 # the 29 at the end is the number of bytes in the input
pi@raspberrypi:~ $ echo this-is-a-random-test-string | md5sum
a13c14dbabf814f24b60e4ec3de68b22 -
pi@raspberrypi:~ $ echo this-is-a-random-test-string | sha1sum
43ea18961ab2d5740b94ccd015f2ab92aaed22b4 -
pi@raspberrypi:~ $ echo this-is-a-random-test-string | sha256sum
5347838842332698f7c1e58d66621104cc7a3a8e0515d269eb84a333922997d3 -
Brandon’s dumb hash #
The last hash, BDH8, is a function that I’m making up just as an example; I define it as just an XOR of each byte with the next one until you get to the end of whatever you’re hashing. It is trivially broken, but it has the rest of the attributes of a typical hash function. Specifically, it maps all lists of bytes onto a finite output range. The outputs are evenly distributed, and small changes in the input value create large changes in the output value, on average.
(Wait, what does XOR mean? That means the exclusive-or operation. If you have two bits, the output is 1 if one bit OR the other is 1. If the two bits are both 1 or both 0, the output is 0.)
As an example, let’s run BDH8 on the string “fun”. By checking asciitable.com, we can figure out that the characters “f”, “u”, and “n” are represented by the bytes 0x66
, 0x75
, and 0x6E
. In binary, we would represent those as b0110 0110
, b0111 0101
, and b0110 1110
.
If we XOR the first two bytes, we get b0001 0011
. Then XOR that with the last byte, and you get the final hash, b0111 1101
= 0x7D
.
It’s simple to find hash collisions. For example, the string “ven”, represented as 0x76
, 0x65
, 0x6E
, has the same BDH8 hash of 0x7D
.
What is a hash for? #
Verifying file integrity #
Say you download some software, and you want to check that the file is not corrupted or modified by a malicious villain. You can calculate, say, the MD5 hash of the file you downloaded, and then compare it to what the MD5 hash should be. How do you know what the MD5 hash should be? Usually, the people who wrote the software post it on their website. We would call this verifying the integrity of the file.
For example, the Raspberry Pi people uses SHA-256 for OS download checks. The version of Raspberry Pi OS Lite from 2021-01-11 has the hash d49d6fab1b8e533f7efc40416e98ec16019b9c034bc89c59b83d0921c2aefeef
. You could check it by downloading that version of the OS and then running a command like md5sum 2021-01-11-raspios-buster-armhf-lite.zip
.
Similarly, PyPI, the Python software repository, publishes SHA256, MD5, and BLAKE2-256 hashes for its downloads.
Password authentication #
When you pick a password for a website, the hash of the password is generated, and the website stores the hash, but not the password itself. Then, when you want to log in again, you hash the password again, and compare it against the stored hash. Nobody else can figure out what password to use to generate that hash, so nobody can log in as you. Nobody can steal all the passwords from the site, because the website doesn’t have them.
(Note: this is how it should work, but some people still store all the passwords in one big file. They are idiots. If a website ever sends you your password, you should realize that they must have been storing your password instead of a hash, and then you should stop trusting that website.)
File identification #
A company like Dropbox can use hashes to quickly scan through milions of files to see whether you are storing illegal content on their servers. They can also use this for deduplication: if lots of people are storing an MP3 of the latest hit by Ms. Knowles-Carter, they only need to store one copy, which saves them disk space.
(Digression, if you’re interested: filesystem deduplication is a sidechannel.)
Similarly, anti-virus companies can use hashes to identify malware, which is convenient for files that you generally don’t want to open.
Attestation #
Look at this tweet by Patrick McKenzie. He is using the tweet to demonstrate that he had created a certain file by a the date of the tweet. The file contained predictions about the path of covid19 in Japan, but that’s not relevant here. The point is that he was saying, “Hey, world, my predictions aren’t just wisdom in hindsight, because look, I predicted this, and you can be sure of that because I tweeted this hash back in March. You know these are real predictions because I can’t change my past tweet history, and I can’t change my predictions because I can’t find SHA-512 hash collisions.”
We would call this an attestation. Patrick McKenzie has more information about the practice in the computer security community, if you are interested.
Rainbow tables and salts #
If you log in to your Raspberry Pi, you can see password hashes in the file /etc/shadow
.
You’ll see a section that looks like this:
systemd-timesync:*:18494:0:99999:7:::
systemd-network:*:18494:0:99999:7:::
systemd-resolve:*:18494:0:99999:7:::
_apt:*:18494:0:99999:7:::
pi:$6$TcStb3ADs0Vmhl1P$HOz0afl84EFN0Ws4IFhEtgW6iFt3VbmI2u4Q6WsAy9IPeXiJRJ0vOhBPzGRNEOrR.gdI6jWe7Nmy8Ub1ZSRHY/:18494:0:99999:7:::messagebus:*:18494:0:99999:7:::
_rpc:*:18494:0:99999:7:::
statd:*:18494:0:99999:7:::
This is a list of usernames, followed by information about each user. Most of them have *
after the first colon, which means they have no password hash, and thus can’t log in. Those user accounts are just for the operating system to go about its business in an orderly way.
If you look at the user called pi
, you’ll see after the first colon: $6$TcStb3ADs0Vmhl1P$HOz0afl84EFN0Ws4IFhEtgW6iFt3VbmI2u4Q6WsAy9IPeXiJRJ0vOhBPzGRNEOrR.gdI6jWe7Nmy8Ub1ZSRHY/
This is actually 3 fields, separated by the character $
. The first field is the value 6
, which indicates that SHA-512 is used for the hash. The second field, TcStb3ADs0Vmhl1P
, is called the salt. The last field, which begins HOz0af
, is the hash itself.
You can verify that the hash is the correct one using the openssl
command line tool.
pi@raspberrypi:~ $ openssl passwd -6 -salt TcStb3ADs0Vmhl1P
Password: (I typed my password here.)
$6$TcStb3ADs0Vmhl1P$HOz0afl84EFN0Ws4IFhEtgW6iFt3VbmI2u4Q6WsAy9IPeXiJRJ0vOhBPzGRNEOrR.gdI6jWe7Nmy8Ub1ZSRHY/
If an attacker can get access to your shadow file, then they could try to figure out what password generates the right hash. They could do this using a rainbow table, which is a precomputed list of the hashes of all possible passwords of a certain length.
But, we have a counter-measure deployed against that: salts. A salt is a random string that is stored with the hash and added to the password before hashing. With the salt added, the attacker needs a new rainbow table for each possible salt. (You have to assume that the attacker has access to your salts as well. Otherwise, you would just use the salts as passwords. But generally, people fail at keeping valuable lists of random strings secret.)
Bcrypt, Scrypt, and Argon2 #
In our modern era, it turns out that computing a rainbow table of hashes is, computationally speaking, relatively cheap. Now, instead of fast hashes like MD5 or SHA-256, we use variable-difficulty algorithms like bcrypt or scrypt. Each has a work factor that can be increased to change how long the hash takes to calculate.
Bcrypt has been around since 1999. It was created by some OpenBSD contributors and is widely used.
Scrypt, created by Tarsnap founder Colin Percival in 2009, tries to do roughly the same thing as bcrypt, but can increase costs in both time and memory.
Argon2 is the newest competitor, created in 2015. It has parameters that make it more costly in time, memory, and parallelism.