Hash functions for Dummies

Chronicles of DB Indexes - Part 11

Oct 30, 2022

Let's talk about hash functions for a bit. Not only because they form the basis for so many actively-used data structures (e.g. crypto hashes, hash indexes, digital fingerprints, hash maps, etc.) but because they form a basis of Hash Indexing as well, which is the topic that we will talk about next time!

So, what are Hash Functions? If somebody were to wake you up and offer you million dollars to answer that question in one sentence, your answer should be - They are mathematical functions that take arbitrarily long data on input and ALWAYS produce same-sized output! There you go - million dollars 💰 for you!

But is it really THAT simple? Like - that's all they are? Yes, that's actually all they are at their core! And not only that, but this property of producing ALWAYS the exact-same-sized output has plethora of appliances! Like, pretty much any area of CS field you take, it has something to do with Hashes. Be it cryptography, or indexing or just those O(1) lookup tables - it's all hash funcs in the background!

Now I can hear some of you saying - "Umm, yeah Mixa, but mathematical functions work with numbers only, and yet Hash Functions (e.g. MD5, SHA2, etc.) take ANY data - text, binary, image, ... How do you explain THAT?!". Good question! Great question! Hash functions operate on BITS! And bits are 0s and 1s. And they join together to form longer sequences of 0s and 1s. But do you know what those are? They are NUMBERS! 0010 is a number 2 and 0111 is 7! Binaries are also numbers but in a different base (2). Because, the fact that we humans are so used to Base 10 is just a consequence of some old civilization deciding to go with 10 (and it helps that we have ten fingers and toes, apparently), but there's no real ultimate reason to go with base 10!

But I digress. My point is -- all those texts, images, etc. are just bits. And bits are numbers. And hash functions work on numbers. You see where I'm going with this? :)

Now another question could be - so why do we have so many of them? Well, each one has some interesting properties; especially so when we dig into cryptographically-safe hashes, which are Hash Functions on steroids (and I'd urge you to look them up! It's an amazing topic on it's own!).

All in all, having a function that maps variable-sized data to predetermined and fixed-sized one turns to be super useful. Especially if you use those fixed-sized outputs as array keys. And then you store data in those arrays and you have a Hash Table :)

Well, something similar is true for Hash Indexes and, just like any Hashish (pun intended) structure - they provide O(1) lookup speed (in contrast to O(logn) for B+Trees). But that's something we'll discuss in the following article!

Stay tuned!

Bitesized Engineering

Discussion about this post