HashMap, dict, Map, object, whatever your language calls it, you probably use one every day. I definitely do.
I’d known the theory for years. Hashing, buckets, O(1), sure. If you’d asked me to walk through what actually happens when you call map.get("email"), I would’ve hand-waved about hashing and buckets and moved on. I knew it was “O(1)” because that’s what you say in interviews and everyone nods. Could I have built one? No.
So I did. And honestly? It’s not that complicated. It’s just really clever.
This article walks through the whole thing with interactive demos you can poke at. By the end you’ll understand hash maps better than most people who use them daily.
It all starts with arrays
OK so here’s the thing. Arrays have this one superpower that makes everything else possible: instant access by index. When you do arr[3], the CPU doesn’t search for anything. It just calculates base_address + 3 * element_size and jumps straight there. O(1). Done. No questions asked.
That’s the magic trick we want for our hash map.
Our keys aren’t numbers though. They’re strings like "email" and "name" and "age". You can’t just do arr["email"] (well, you can in JavaScript, but that’s a whole different rabbit hole). Arrays need numbers.
So we need something that takes any key and turns it into a number. A number we can use as an array index.
That something is called a hash function.
Turning gibberish into numbers
A hash function takes any input (a string, a number, whatever) and spits out an integer. Same input, same output, every time. Different inputs usually give different outputs.
That “usually” is going to bite us later. Keep it in the back of your mind.
Here’s a real one. It’s called djb2, first reported by Dan Bernstein in comp.lang.c many years ago. Not the best hash function by modern standards, but fast, simple, and perfect for understanding the concept:
unsigned long hash(unsigned char *str) {
    unsigned long hash = 5381;
    int c;

    while (c = *str++)
        hash = ((hash << 5) + hash) + c; /* hash * 33 + c */

    return hash;
}
Each iteration multiplies hash by 33 (that’s what (hash << 5) + hash does) and adds the next character. Dead simple. The interactive demos below use a JavaScript port of the same algorithm.
Why 33? According to oz’s classic hash function page, “the magic of number 33 — why it works better than many other constants, prime or not — has never been adequately explained.” Why 5381? Nobody really knows. Bernstein picked these constants and they just work. Sometimes engineering is just vibes and benchmarks.
The raw hash is a massive number, way bigger than our array. So we squeeze it into range with modulo: hash % capacity. If we have 8 slots, djb2("email") % 8 gives us a number between 0 and 7. That’s our index. (Technically the hash function itself is O(key length), not O(1), but keys are usually short so we treat it as constant.)
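If you want to follow along outside the demos, here’s one way to port djb2 to JavaScript and bolt on the modulo step. This is a sketch (not the demos’ exact code): C’s unsigned long overflow is emulated by wrapping to 32 bits with `>>> 0` after each step, and `bucketIndex` is just a name I made up for the squeeze-into-range step.

```javascript
// djb2, ported to JavaScript. C's unsigned overflow is emulated
// by truncating to 32 bits (>>> 0) after every step.
function djb2(str) {
  let hash = 5381;
  for (let i = 0; i < str.length; i++) {
    hash = (((hash << 5) + hash) + str.charCodeAt(i)) >>> 0; // hash * 33 + c
  }
  return hash;
}

// Squeeze the huge hash into a valid index for a small array.
function bucketIndex(key, capacity) {
  return djb2(key) % capacity;
}

bucketIndex("email", 8); // always the same number between 0 and 7
```

Same input, same output, every time: run it twice and `"email"` lands in the same slot.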
Go ahead, type anything in here and watch it get hashed:
Where stuff actually goes
Cool, so now we have a function that turns any string into a number between 0 and 7. Now what?
We make an array (usually called a bucket array) and use that number to decide which slot each key-value pair lands in. You want to store "name": "Alice"? Hash "name", get a bucket index, drop the entry there. If the key already exists, just update its value. That’s it.
Let me walk you through it. Hit “Next” to step through each insertion one by one:
Three insertions, three different buckets. Each lookup is just: hash the key, jump to the bucket, grab the value. O(1). This is the happy path.
Remember that “usually” from earlier? Yeah. Time to deal with that.
When things go wrong (kind of)
Here’s a fact that might make you uncomfortable: we have 8 buckets and an infinite number of possible strings. There is literally no way to guarantee every key gets its own slot. This is the pigeonhole principle: more pigeons than holes means at least two pigeons are sharing. Math doesn’t care about your feelings.
So what happens when "name" and "city" both hash to bucket 6? Does it just… break?
Nope. The simplest solution is called chaining. Instead of each bucket holding one entry, it holds a list (a chain) of entries. On lookup, you hash to the bucket, then walk the chain comparing keys until you find yours. It’s a tiny linear search, but only within that one bucket.
This is way easier to see than to explain. Step through it:
So a collision doesn’t break anything. It just adds one extra comparison per chained entry. As long as the chains stay short (one or two entries) you barely notice.
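In code, chaining is only a few lines. Here’s a bare-bones sketch of `put` and `get` over an 8-slot bucket array (function names are mine, and djb2 is redefined so the snippet stands alone):

```javascript
// 32-bit djb2, same as before.
function djb2(str) {
  let hash = 5381;
  for (let i = 0; i < str.length; i++) {
    hash = (((hash << 5) + hash) + str.charCodeAt(i)) >>> 0;
  }
  return hash;
}

const CAPACITY = 8;
const buckets = Array.from({ length: CAPACITY }, () => []); // 8 empty chains

function put(key, value) {
  const chain = buckets[djb2(key) % CAPACITY];
  for (const entry of chain) {
    if (entry.key === key) { entry.value = value; return; } // existing key: update
  }
  chain.push({ key, value }); // new key: append to this bucket's chain
}

function get(key) {
  const chain = buckets[djb2(key) % CAPACITY];
  for (const entry of chain) {          // the "tiny linear search"
    if (entry.key === key) return entry.value;
  }
  return undefined; // walked the whole chain, key isn't here
}

put("name", "Alice");
put("city", "Paris"); // collides with "name": both land in bucket 6
get("city");          // found by walking bucket 6's short chain
```

Note that `get` never looks at any bucket except the one the key hashes to; the “linear search” is confined to a chain of one or two entries.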
Wait. What stops the chains from getting long? If I keep cramming entries in, won’t everything pile up into a few long chains and basically become a linked list? At that point our “O(1)” hash map is just an O(n) list with extra steps. So much for constant time.
Turns out the people who designed this thought of that.
When the table says “I’m full”
The load factor is dead simple: entries / capacity. It tells you the average number of entries per bucket. With a good hash function, keeping the load factor bounded keeps chain lengths bounded too. That’s the whole trick.
Java draws the line at 0.75. Other languages choose differently: Python uses ~0.67, Rust ~0.875, C++ defaults to 1.0. Lower means more wasted space, higher means more collisions before a resize.
Cross the threshold, and the hash map doubles its size and rehashes every single entry into the new, larger array.
This blew my mind when I first got it. The old bucket indices can’t be trusted after a resize, because hash % 4 and hash % 8 disagree for many keys. So every entry has to be re-placed. It’s expensive, but it happens rarely enough that the math works out.
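The resize itself is a plain loop over every entry. A minimal sketch, assuming the table stores chains of `{ key, value }` objects and takes its hash function as a parameter (`hashFn` here is my name for it, djb2 in this article):

```javascript
// Double the bucket array and re-place every entry.
function resize(buckets, hashFn) {
  const newBuckets = Array.from({ length: buckets.length * 2 }, () => []);
  for (const chain of buckets) {
    for (const { key, value } of chain) {
      // hash % oldCapacity is meaningless now: recompute against the new size.
      newBuckets[hashFn(key) % newBuckets.length].push({ key, value });
    }
  }
  return newBuckets;
}
```

Nothing is thrown away: every entry survives, it just moves to wherever the new modulo says it belongs.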
Keep pressing “Insert” here and watch the load factor bar creep up. When it crosses the line, watch closely:
See what happened there? The table doubled, every entry found a new home, and the load factor dropped back down. Chains got shorter. Performance restored.
This is why people say hash maps have amortized O(1) performance. Most operations are instant. Every now and then, one unlucky insert triggers a resize that costs O(n). Each resize doubles the capacity, so you need n more cheap inserts before the next expensive one. The total cost of all resizes is n + n/2 + n/4 + … which is less than 2n. Spread across n inserts, that’s O(1) each.
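You don’t have to take the geometric-series argument on faith. This little simulation (simplified so a resize fires exactly when the table is full) counts how many entry moves n inserts cause in total:

```javascript
// Simulate n inserts into a table that doubles whenever it fills up,
// and count the total number of entries moved by all resizes combined.
function totalMoves(n) {
  let capacity = 1;
  let size = 0;
  let moves = 0;
  for (let i = 0; i < n; i++) {
    if (size === capacity) {
      moves += size;   // rehashing copies every existing entry
      capacity *= 2;   // double, so the next resize is twice as far away
    }
    size++;
  }
  return moves;
}

totalMoves(1024); // 1023 total moves for 1024 inserts: under 2n, so O(1) each
```

For 1024 inserts the resizes copy 1 + 2 + 4 + … + 512 = 1023 entries in total, which matches the “less than 2n” bound above.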
Pay a little now, save a lot later.
Go break it
OK, theory’s done. Here’s a full hash map sandbox. You can put, get, and del keys. Try to cause a collision on purpose (hint: try "name" and "city"). Fill it up until it resizes. Look up a key that’s buried deep in a chain and watch it traverse.
Or hit “flood bucket 0” and watch what happens when an attacker crafts keys that all land in the same bucket. Your O(1) map becomes a linked list. This is a real attack vector: web frameworks parse query strings, JSON, and form data into hash maps, so an attacker controls the keys by controlling the HTTP request.
The fix is using a keyed hash function (like SipHash) with a per-process random secret, so an attacker can’t predict which keys collide. Python added hash randomization in 3.3 (2012) after a disclosure in late 2011, then switched to SipHash in 3.4 (2014).
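To see why an unkeyed hash is attackable, here’s a brute-force search for keys that all land in bucket 0 of an 8-slot djb2 table (a sketch of the idea behind the “flood bucket 0” button, not the demo’s actual code). Because djb2 has no secret, an attacker can run exactly this search offline; with a secret-keyed hash, they can’t:

```javascript
// 32-bit djb2: fully public, so collisions are predictable.
function djb2(str) {
  let hash = 5381;
  for (let i = 0; i < str.length; i++) {
    hash = (((hash << 5) + hash) + str.charCodeAt(i)) >>> 0;
  }
  return hash;
}

// Grind through candidate keys, keeping the ones that hit bucket 0.
// Roughly 1 in 8 candidates qualifies, so this finishes quickly.
function floodKeys(count, capacity) {
  const keys = [];
  for (let i = 0; keys.length < count; i++) {
    const key = "k" + i;
    if (djb2(key) % capacity === 0) keys.push(key);
  }
  return keys;
}

floodKeys(100, 8); // 100 keys that all chain into the same bucket
```

Send those as query-string parameters and the server’s hash map degenerates into one long chain.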
The whole thing in one line:
key → hash(key) → index = hash % capacity → bucket[index] → walk chain → value
Everything above is just unpacking this pipeline.
What I left out (the rabbit hole goes deep)
What I showed you is chaining, the simplest collision strategy. Most production hash maps today actually use a different approach, and the optimizations go deep:
Open addressing is what most modern hash maps use instead of chaining. When there’s a collision, you don’t build a linked list, you just look for the next empty slot in the array itself. This has way better cache locality because you’re scanning contiguous memory instead of chasing pointers all over the heap. The tradeoff: deletion requires tombstones and is genuinely tricky, which is part of why chaining persisted as long as it did. Robin Hood hashing is a particularly beautiful variant that “steals from the rich” by letting entries with longer probe distances bump out entries with shorter ones. Cute name, clever algorithm.
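A linear-probing sketch makes the difference from chaining obvious: there are no lists at all, just one flat array, and a collision means “step to the next slot.” This is insert and lookup only; deletion, with its tombstones, is exactly the part I’m glossing over:

```javascript
const CAPACITY = 8;
const slots = new Array(CAPACITY).fill(null); // one flat array, no chains

// 32-bit djb2 again, just to have a hash function.
function hash(key) {
  let h = 5381;
  for (let i = 0; i < key.length; i++) h = (h * 33 + key.charCodeAt(i)) >>> 0;
  return h;
}

function put(key, value) {
  let i = hash(key) % CAPACITY;
  for (let step = 0; step < CAPACITY; step++) {
    if (slots[i] === null || slots[i].key === key) {
      slots[i] = { key, value }; // empty slot or same key: claim it
      return;
    }
    i = (i + 1) % CAPACITY; // collision: probe the next slot, wrapping around
  }
  throw new Error("table full: a real implementation resizes long before this");
}

function get(key) {
  let i = hash(key) % CAPACITY;
  for (let step = 0; step < CAPACITY; step++) {
    if (slots[i] === null) return undefined; // hit a hole: the key was never inserted
    if (slots[i].key === key) return slots[i].value;
    i = (i + 1) % CAPACITY;
  }
  return undefined;
}
```

The probe sequence walks contiguous array slots, which is why this is so much friendlier to CPU caches than pointer-chasing through a chain.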
(Java’s HashMap still uses chaining but converts long chains into red-black trees at length 8, so its worst case is O(log n) not O(n). Clever hybrid.)
SIMD probing is what Google’s Swiss Table design does (implemented in Abseil C++ and, independently, in Rust’s hashbrown which backs its standard HashMap). It uses CPU vector instructions to check 16 slots’ metadata bytes in a single operation, then only does full key comparison on matches. Extremely fast. If you want to go deep, Matt Kulukundis’s CppCon talk is an absolute banger.
Power-of-two sizing is a neat trick: when your capacity is always a power of 2, you can replace the expensive modulo with a bitwise AND: hash & (capacity - 1) does the same thing, way faster. Most modern implementations do this.
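The trick works because capacity − 1 is an all-ones bitmask when capacity is a power of 2, and masking keeps exactly the low bits, which is what modulo-by-a-power-of-two computes. A two-line illustration (this identity only holds for non-negative integers and power-of-two capacities):

```javascript
const capacity = 8;        // 0b1000
const mask = capacity - 1; // 0b0111: keep the low 3 bits

const h = 193490680;       // any non-negative hash value
h % capacity;              // the slow way
h & mask;                  // same answer, one cheap CPU instruction
```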
Better hash functions fall into two camps. SipHash is a keyed PRF designed for adversarial resistance: give it a per-process secret key and an attacker can’t predict collisions. xxHash and wyhash are built for raw speed and better distribution, not security. Different tools for different problems.
The core idea never changes though: hash the key, find the bucket, handle collisions, resize when it gets full.
I wish I’d looked inside sooner.