Jay Taylor's notes
back to listing indexComputer Science for Hackers | Interview Cake
[web search]
Computer Science
for Hackers
Data Structures, from the Ground Up
In order to really understand how data structures like arrays, linked lists and hash maps work, we're going to build them all from scratch starting with bits. It's pretty simple, and really rewarding.
Random Access Memory (RAM)
The functions we write often include variables which hold data like integers, arrays and linked lists. These variables are usually stored in the "working memory" or "random access memory" (RAM) of a computer. We sometimes just call this "memory."
RAM is a set of sequentially addressed buckets, like this:
"Hey, that's kinda like an array," you may be thinking. That's true! And like an array, we can grab the contents of the bucket at a given address "immediately"—in constant (O(1)O(1)O(1)) time. We don't have to "walk through" all of our memory to get to something—we just go straight there. (That's what the "random" in RAM means—any location in memory can be accessed directly.)
What's in each bucket? A set of bits. What's a bit? It's a tiny set of electric circuits with two states—either "on" or "off," which we call 000 and 111.
Each addressed bucket in memory usually has 8 bits. 8 bits is called a byte.
What can we do with the 8 bits (1 byte) in each slot in memory? A whole heck of a lot! To start, we can store an integer.
RAM is not where mp3s and apps get stored. In addition to "memory," your computer has "storage" (sometimes called "persistent storage" or "disc"). While memory is where we keep the variables our functions allocate as they crunch data for us, storage is where we keep files, like mp3s, videos, word documents, even executable programs/apps. Memory (or "RAM") is faster but has less space, while storage (or "disc") is slower but has more space. My laptop has ~250GB of storage but only ~16GB of RAM.
Binary Integers
The way we express integers with bits is called "base 2" or "binary" because each "digit" in our number has two possible values (0 and 1). In contrast, the number system we usually use is called "base 10" or "decimal" because each digit in our number has ten possible values (0, 1, 2, 3, 4, 5, 6, 7, 8, and 9).
So if we take this number:
1 0 1
How do we read that in our base 10 system? Notice that we have two "1"s, but they don't mean the same thing. The leftmost "1" means 100, and the rightmost "1" means 1. That's because the leftmost "1" is in the "hundreds place," while the rightmost "1" is in the "ones place." And the "0" between them is in the "tens place."
So this "101" in base 10 is telling us that we have "1 hundred, 0 tens, and 1 one."
Notice that the rightmost "place" is the "ones place," and 1 is 10010^0100. The second "place" is the "tens place," and 10 is 10110^1101. The third place is the "hundreds place," and 100 is 10210^2102. Our "places" are sequential powers of our base, which is 10 in this case.
So what are the "places" in base 2? 202^020 which is 1, 212^121 which is 2, 222^222 which is 4, etc.
So let's read that same number as base 2.
1 0 1
Reading this from right to left: we have a 1 in the ones place, a 0 in the twos place, and a 1 in the fours place. So our total is 4 + 1 which is 5.
Here are the base-10 numbers 0 through 10 in binary:
# decimal binary
0 0000
1 0001
2 0010
3 0011
4 0100
5 0101
6 0110
7 0111
8 1000
9 1001
10 1010
Okay, so we can encode an integer in binary. Where does that fit into our RAM with its sequentially addressed slots?
Well, we could use a single memory address (a single 8-bit byte) to store an integer.
This works fine. But with only 8 bits, we'll run out of space if we want to store large integers.
Another common numbering system in computers is base 16, also called "hexadecimal" or "hex." For that, our possible values for each digit are 0, 1, 2, 3 ,4, 5, 6, 7, 8, 9, a, b, c, d, e, and f. Hex numbers are often prefixed with "0x" or sometimes "#". You've probably seen hex numbers around before. For example, in CSS we often express colors as hex numbers—Interview Cake's signature blue color is "#5bc0de".
Multi-byte Numbers
How many integers can we express with 1 byte (8 bits)?
Let's start simpler: how many integers can we express with 1 bit? Just 2:
0
1
How many can we express with 2 bits? Well for each possibility for the first bit, we can put a 1 in front of it or we can put a 0 in front of it:
10
11
00
01
So by adding a second bit, we have twice as many possibilities as with just 1 bit. 2 * 2 = 4 possibilities in total.
Same idea with 3 bits—for each of the 4 possibilities with 2 bits, we can put a 1 in front or we can put a 0 in front. That gives us twice as many possibilities as with 2 bits, which is twice as many as with 1 bit, so 2 * 2 * 2 = 8:
110
111
100
101
010
011
000
001
Do you see the pattern? In general, with nnn bits, we can express 2n2^n2n different possible numbers!
For 8 bits (1 byte) we have 28=2562^8=25628=256 possibilities.
If the integer is unsigned we'll usually split our 256 possibilities across even and odd numbers, and we'll have the leftmost bit represent the "sign" of the number (+ or -). But if we want an "unsigned" (always positive) integer, our max possible integer will be twice as high (again, each additional bit gives us twice as many possibilities, and here we're "freeing up" the leftmost bit that used to represent "sign" and using it to extend the actual number).
What happens if we have 256 in an 8-bit unsigned integer (1111 1111 in binary) and we add 1? Our integer might overflow ↴ . Which can cause some nasty bugs where adding 1 to a big number gives us a smaller number.
The 256 possibilities we get with 1 byte are pretty limiting. So most modern computers default to 4 or 8 bytes (32 or 64 bits) for storing integers. 32-bit integers have 2322^{32}232 possible values—more than 4 billion, while 64-bit integers have 2642^{64}264 possible values—more than 101910^{19}1019 or 10 billion billion.
When is 32 bits not enough? When you're counting views on a viral video. YouTube famously ran into trouble when the Gangnam Style video hit over 2322^{32}232 views, forcing them to upgrade their view counts from 32-bit to 64-bit integers.
Whether 8, 16, 32, or 64, the number of bits we use to store an integer is almost always constant, which means integers have an O(1)O(1)O(1) space cost. That's hugely important. It means:
- Integers are "free" in terms of big O space cost. We can add as many individual integer variables to a function as we want and their total space cost will still be O(1)O(1)O(1).
- Most operations on integers take constant time, since they involve tweaking a constant number of bits. +, -, /, *, ==, <, > are all O(1)O(1)O(1) time operations on integers.
These powerful properties come from the constant size of integers. But that constant size also has a cost: the values of our integers are limited (to 2n2^n2n possibilities, where nnn is the number of bits). So we have a tradeoff. As we'll see, that's a trend in data structures—to get a nice property, we'll often have to lose something.
You know what isn't constant space? Arrays.
"How come I've never had to think about how many bits my integers are?" Perhaps you have. Have you ever created a table in SQL? When you specify that a column will hold integers, you must specify how many bits you want: 1 byte (tinyint), 2 bytes (smallint), 4 bytes (int), or 8 bytes (bigint). Have you noticed that sometimes integers are "Longs" and sometimes they're "Integers" (common in Java and C)? That also has to do with how many bits we're using. If we don't specify, we're usually defaulting to 32 or 64 bits.
Arrays
Remember how memory is a sequentially addressed set of buckets?
Computer memory is basically an array already! And remember: we can access the bits at any given memory address in O(1)O(1)O(1) time.
Suppose you want an array of, say, 5 integers. Say they're 8 bit (1 byte) integers. All you have to do is grab a sequential set of 5 free bytes of memory. Then each slot in memory maps directly to a slot in your array:
So if this is our RAM:
Memory slot 2 is currently occupied by another program, so we have to start below it. We'll start from address 3.
So we have an array of 5 single-byte-integers, starting at memory address 3. How can we get the integer at index 4? Simple arithmetic: take the starting address (3), add the index we're looking for (4), and that's the address of the item we're looking for. 3 + 4 = 7. In general, for getting the nnnth item in our array:
address_of_nth_item = address_of_array_start + n
This works out nicely because the addressable memory buckets and the size of each item in the array are both 1 byte. But that's not always the case. As we've discussed, these days we'll usually use 64-bit integers, but our addressed buckets are still 8 bits.
So how do we build an array of 64-bit integers? Just as you might imagine:
Each integer takes up 8 address buckets. So we can still use simple arithmetic to grab the start of the nth item in our array, in O(1)O(1)O(1) time:
address_of_nth_item = address_of_array_start + (n * size_of_each_item_in_bytes)
Here the math is simplified by the fact that the each item in our array is 8 bits and each address slot is also 8 bits. If those sizes weren't the same, we could say:
address_of_nth_item = address_of_array_start + (n * (size_of_each_item_in_bits / size_of_each_address_slot_in_bits))
With this simple arithmetic, we can grab the nnnth item in an array in O(1)O(1)O(1) time. The arithmetic itself takes O(1)O(1)O(1) time (remember: +, *, / , etc are all O(1)O(1)O(1) time operations on integers!) and we use the result to get the address in memory of the array item we're looking for.
This O(1)O(1)O(1)-time lookup capability is the most important property of arrays.
And notice that this property depends on the array being contiguous (uninterrupted) in memory. If there were "gaps" in the array, we wouldn't be able to use simple math to figure out the address of a given item.
Strings
Life isn't all about numbers. If you want to use words, you need strings.
A string is simply a series of characters. We already know one way to store a series of things—arrays! So to store a string, we just need to figure out a way to store characters. Then we can throw those characters into an array.
Easy. Let's just define a mapping between integers and characters. Let's say "a" is 1 (or 0000 0001 in binary), "b" is 2 (or 0000 0010 in binary), etc. Bam, now we have characters! This mapping of numbers to characters is called "character encoding." You may have heard of "ASCII" or "UTF-8"—those are common character encodings.
(or maybe a table of a section of ascii)
Cool, since we can express characters as 8-bit integers, we can express strings as arrays of 8-bit numbers characters.
Pointers
What if we want an array of strings? We need to make an array, where each element is another array. Woah dude.
Our first thought might be, "No problem, let's just choose how long we want our strings to be (same way we chose how many bits we wanted for our integers in our integer arrays) and then make that the length of each item in our outer array."
That'd kinda work. It's not great, since one item in our array of strings might be "Bill" and another might be "Sir William Wigglesworth Davidson the Second, Esquire." These strings have different lengths. We could put each of them in arbitrarily large arrays (say, 100 characters each), and just use a special character to mark the end of the string within each array...
We'd be tying up a lot of unused memory this way. And what if we want to store a string that was more than 100 characters? We'd be out of luck.
There's a better way. Instead of storing the strings right inside our array, let's just put the strings wherever we can fit them in memory. Then we'll have each element in our array hold the address in memory of its corresponding string! Each address is just an integer, so really our outer array is just an array of integers! We can call each of these integers a "pointer," since it "points to" another spot in memory.
Pretty clever, right? This way each string only needs to be as long as the actual number of characters it holds. The other neat advantage of this approach is that we don't need to have enough continuous free memory to store all of our strings—we can place each of them separately, wherever they fit.
Dynamic Arrays
In a low-level language like C or Java, when we allocate an array we have to specify the number of slots ahead of time. There's a reason for this. Recall that arrays get their O(1)O(1)O(1)-time-lookup superpower from the fact that they're contiguous (they're without gaps) in memory. So when we allocate an array, we have to ask the operating system to "reserve" a bunch of sequential buckets. If we didn't say how many buckets, the operating system wouldn't know how many spots to reserve.
But there are lots of cases where having to specify the size of an array ahead of time is tough. What if we wanted an array to hold email addresses of people who have signed up to learn about the private beta for our hip new app (Tinder for cake recipes)? We might get a million signups, or our moms might be the only ones who check it out. So how big should we make our array?
To avoid specifying the size of our array ahead of time, we could just kinda guess how big we want our array and resize it as needed. This is called a dynamic array.
This is what Python, Ruby, and JavaScript use for their default array-like data structures. In Python it's called a "list" (confusingly). In Java we have naked arrays (whose size we have to define ahead of time) but we also have dynamic arrays called ArrayLists.
So with a dynamic array, we just say "I want an array" and our environment says "cool no problem, let me get that for you" and it shrugs and says, "100 indices is probably good, right?" And it makes you your array:
You can start "appending" items to your array, and your environment will internally keep track of the "current_index" so it knows where to put stuff you're appending:
But at some point you'll be at index 99 and you'll want to append but your array won't be big enough!
So you'll say ".append()" and to make it work, your dynamic array will do a few things under the hood:
- Make a new, bigger array (usually twice as big).
- Copy each element from the old array into the new array.
- "Free up" the old array (tell the operating system that part of memory is available for other use).
- Actually append your item.
Notice that we're not just extending the size of our current array (the next 100 contiguous memory slots might not be free!), we're allocating a whole new array. That's why we also have to copy each item from the old array to the new one.
We could call these special appends that require steps 1-3 above "doubling" appends since they require us to make a new array that's (usually) double the size of the old one. Appending an item to an array is usually an O(1)O(1)O(1) time operation, but a single "doubling" append is an O(n)O(n)O(n) time operation since we have to copy all nnn items from our array!
Does that mean that an append operation on a dynamic array is always worst-case O(n)O(n)O(n) time? Yes. So if we spin up an empty dynamic array and append nnn items, that has some crazy time cost like O(n2)O(n^2)O(n2) or O(n!)O(n!)O(n!)?!?! Not quite.
While the time cost of each special, O(n)O(n)O(n)-cost "doubling" append doubles each time, the amount of O(1)O(1)O(1)-time appends you get until the next doubling doubles as well, at the same speed. This "cancels out" and we can say each append has an "average" or "amortized" time cost of O(1)O(1)O(1)
Given this, in industry we usually wave our hands and say dynamic arrays have a time cost of O(1)O(1)O(1) for appends, even though strictly speaking that's only true for the "average case" or the "amortized cost" (the worst-case cost for any given append is O(n)O(n)O(n)).
In an interview, if we were worried about that O(n)O(n)O(n)-time worst case cost of appends, we might try to use a normal, non-dynamic array. The advantage of dynamic arrays over arrays is that you don't have to specify the size ahead of time, but the drawback is that some appends can be more expensive.
But what if we wanted the best of both worlds:
- worst-case O(1)O(1)O(1)-time appends, and
- dynamic sizing
Data structures are full of tradeoffs. Dynamic arrays give us dynamic sizing, but at the cost of worst-case append speed. Our next data structure will give us dynamic sizing without incurring that cost, but it'll incur a different cost. Choosing the best data structure depends entirely on what you'll be doing with it.
Linked Lists
Remember how we used pointers to make our array of strings? What if we took that idea a step further. What if each item in our array was just a two-item array, where the first item was the value, and the second item was a pointer to the next item (in other words, the second item was a number, representing the address in memory of the next item).
We would call each of these two-item arrays a "node" and we'd call this series of nodes a "linked list."
Here's how we'd actually implement it in memory:
The first node of a linked list is called the "head," and the last node is usually called the "tail."
Confusingly, some people prefer to use "tail" to refer to everything after the head of a linked list. In an interview it's fine to use either definition. It's best to briefly say which definition you're using, just to be clear.
The way we "store" the array is usually as a pointer referencing the head of the list. So when we pass a linked list to a function, for example, we'll usually pass that head pointer.
It's usually helpful to keep a pointer to the tail of the list on hand and up to date as well. This allows us to append to our linked list in O(1)O(1)O(1) time—we simply:
- create the new linked list node
- set the "next" pointer of our current "tail" to point to this new node
- update our "tail" to point to this new node
Remember how we said dynamic arrays will keep track of a "current_index," to make append() operations easy? Same idea!
Linked lists' worst-case O(1)O(1)O(1)-time appends are an advantage over dynamic arrays, which have worst-case O(n)O(n)O(n)-time appends.
Linked lists have another advantage—inserting a node at any position (not just the end) is O(1)O(1)O(1) time. Let's look at an example:
Suppose you wanted to store a big set of high scores in a video game, and you wanted to keep them in sorted order so that the highest score was always at the front.
Here's what that might look like as an array
Suppose a player just scored 150 points and we wanted to add that to our list.
We'd walk to through our array until we find the first lower score. That's the spot where our 150 will go.
So how do we insert our 150? We have to scoot over everything after it to make room!
So even when we found the spot where we wanted to put our new score, it cost O(n)O(n)O(n) time to insert it.
But if our set of scores was a linked list...
To insert 150 we'd walk to the spot with the first lower score, just as before.
But now, to insert it all we have to do is tweak some pointers.
So our insertion is just O(1)O(1)O(1) time!
This O(1)O(1)O(1)-time insertion property comes at a cost: grabbing the iiith item in a linked list costs O(i)O(i)O(i) time—we've lost the O(1)O(1)O(1) time lookups we got with arrays. With an array, you can go "directly" to the iiith item because you can do some arithmetic to figure out its exact address in memory. With a linked list, the nodes are scattered randomly in memory. The only way to get the iiith node is to "walk" through the list, from the beginning.
Hash Maps
The O(1)O(1)O(1)-time lookups we get from arrays are immensely powerful. Partly for this reason, we usually use them instead of linked lists.
We can think of arrays as defining a "mapping" between an integer (the index) and an arbitrary value. We could use this to map, say, user ids to favorite ice cream flavors, provided our user ids were always sequential integers starting with 0.
favorite_ice_cream_flavor[4] = 'vanilla'
favorite_ice_cream_flavor[6] = 'chocolate'
Hash maps let us map arbitrary keys to arbitrary values. They're like arrays, except instead of "indices" we have "keys" that don't have to be sequential integers anymore. So we could map usernames to favorite ice cream flavors, for example.
favorite_ice_cream_flavor['parker'] = 'rocky road'
favorite_ice_cream_flavor['noah'] = 'cookie dough'
Hash maps are built on top of arrays. To build one, let's start with an array.
To turn this into a hash map, we just need to figure out some way to "translate" a key like "parker" into an index in our array, like 5.
What do we need out of this "translation" function? We need to ensure that if "Noah" translates to 1, "Noah" always translates to 1. It's almost like encryption, except we don't need to go backwards—we don't need to be able to go from 1 back to "Noah." So it's a sort of 1-way encryption...more like a fingerprint. In computer science, this "fingerprinting" is called "hashing".
Here's a simple example of a hashing function:
- Take our string, "parker".
- It's just an array of characters, right? And each of those characters is really just a number, right?
- So take all those numbers and add them together. Let's say the result is 42.
- Now we have an integer which we can use to index into our array!
- But our integer might end up too big—what if we only have 30 slots in our array?
- Let's take the result of the addition and mod ↴ it (the "%" operator) by the length of our array, in this case 30. this ensures we get a number in the range 0-29.
Bam, that's a hashing function.
The hashing functions we use in modern systems are quite a bit more complicated. But this is a good proof of concept.
So to look something up in our hash map, we:
- take the key,
- run it through our hashing function to get an index in our array,
- then pull out the item at that index.
Similar idea for inserting something.
One problem though—what if "Parker" and "rekraP" ("Parker" backwards) hash to the same index in our array? This is called a hash collision.
We could make our hashing algorithm a bit smarter to avoid this particular type of hash collision, but we do have a fundamental problem here. Since we don't know ahead of time what our keys are, it's impossible to ensure that our hashing function never causes collisions. So we need a way to handle them.
Here's one way: instead of storing the actual values in our array, let's have each array slot hold a pointer to a linked list of all the values for keys that hash to that index:
But if "Parker" and "rekraP" hash to the same index, when we open the linked list at that index we don't know which value is the value for "Parker" and which value is the value for "rekraP".
So let's store the key next to its value in each linked list node.
"But wait!" you may be thinking, "now lookups in our hash map take O(n)O(n)O(n) time in the worst case, since we have to walk down a linked list!" That's true! You could even say that "in the worst case, this kind of hash map degrades to a linked list." In industry though, we'll usually wave our hands and say on average lookups in a hash map are O(1)O(1)O(1) time. There are fancy algorithms that keep collisions low and keep the lengths of our linked lists nice and short.
Summary
Integers
- space: O(1)O(1)O(1)
- most operations (+, -, *, ==, etc): O(1)O(1)O(1) time
- vulnerable to overflow ↴ as they get big
Pointers
- integers representing the address in memory of something else
- can point to an array, a string, a linked list node, etc.
- space: O(1)O(1)O(1) (just like other integers)
Strings
- arrays of characters
- characters are just integers (so they each take up constant space)
- space: O(n)O(n)O(n), where nnn is the number of characters
Arrays
- map indices to values
- lookups: O(1)O(1)O(1)
- appends: O(1)O(1)O(1)
- insertions: O(n)O(n)O(n) (need to "scoot everything over" to make space)
- space: O(n)O(n)O(n), where nnn is the number of items in the array (assuming each item is O(1)O(1)O(1) space)
Dynamic Arrays
- arrays where you don't have to specify the number of elements ahead of time
- dynamically double their size when they're running out of space, taking O(n)O(n)O(n) time for each "doubling"
- lookups: O(1)O(1)O(1)
- appends: O(1)O(1)O(1) amortized/average, O(n)O(n)O(n) worst case
- insertions: O(n)O(n)O(n) (need to "scoot everything over" to make space)
- space: O(n)O(n)O(n), where nnn is the number of items in the array (assuming each item is O(1)O(1)O(1) space)
Hash Maps
- lookups: O(1)O(1)O(1) average, O(n)O(n)O(n) worst case
- insertions: O(1)O(1)O(1) average, O(n)O(n)O(n) worst case
- space: O(n)O(n)O(n), where nnn is the number of keys and values (assuming each key and value is O(1)O(1)O(1) space)
Linked Lists
- lookups: O(n)O(n)O(n) (need to walk there, starting from the head)
- appends: O(1)O(1)O(1) (as long as you have a tail pointer)
- insertions: O(1)O(1)O(1)
- space: O(n)O(n)O(n), where nnn is the number of items in the list (assuming each item is O(1)O(1)O(1) space)
What's next?
If you're ready to start applying these concepts to some problems, check out our mock coding interview questions.
They mimic a real interview by offering hints when you're stuck or you're missing an optimization.
By Parker Phinney
Parker founded Interview Cake to help more people like Alice land the jobs they deserve. He runs coding interview workshops at several coding boot camps and personally tutors engineers with various backgrounds. Catch him wearing a ridiculous Superman onesie around The Mission District of San Francisco.