Use a database index they said. It will make things faster, they said.
But if you’ve got five records in a table, how can it be faster to search an equivalent five index records instead of just searching the table? What if the word I was searching for was “zebra” – wouldn’t it need to search to the end of the index to find it?
Relational databases are, in computing terms, very old technology. They’ve been around since the 1970s and because of that, the actual database engines, the servers that respond to your inserts, updates and reads, are extremely well optimised.
Let’s take a simple example – suppose you had a “people” table; with fields for id (an auto-incrementing primary key), first_name, last_name and email (all varchar/string fields). Let’s also say that the email address has to be unique (so if we were writing a Rails model we’d use a uniqueness validation).
If we had three records in this table – with emails email@example.com, firstname.lastname@example.org and email@example.com – and we tried to insert a fourth record with the email address firstname.lastname@example.org, our Rails model would trigger the uniqueness validation and it would query the table – “are there any records in the people table with a value of ‘email@example.com’ in the ’email’ field?”. The database engine would then scan through the table, examining each record in turn until it found firstname.lastname@example.org. For our three record table, this is a trivial operation. But imagine the same for a ten million record table … suddenly scanning ten million records becomes quite a task for our database engine.
So instead we add a “unique index” to the email field on the people table. This creates a mini-table, for the database engine’s own use, which just stores email addresses and IDs. When records are inserted or updated in the table, the database engine also updates the index at the same time. Which does mean that there is a slight performance hit when adding data to a table with indexes.
But the clever bit comes when searching. If you ask the database “are there any records in the people table with a value of ‘email@example.com’ in the ’email’ field?” instead of scanning through ten million records in the table, it goes straight to the email-index. Any decent database engine attempts to keep the entire index in memory, not on disk, so searching through the index is fast. But on top of that, it can also keep the values in a known order – so instead of just starting a scan at the beginning and ploughing through till it finds a match, it can jump to the halfway point and make intelligent decisions on which way to jump next based upon whether firstname.lastname@example.org is higher or lower than the value in the middle of the index.
UPDATE: Just to be clear, indexes generally aren’t structured as flat files, they are usually B-Trees, which are extremely fast to search through. Thanks to Cristiano and Sean for pointing out my vagueness.
And what about if we need to search on last name as well? In this case we can add another index to the last_name field, but this time it won’t be unique. So instead of just storing the value and the associated record ID, it stores the value and multiple associated record IDs (as there could be hundreds of “Smiths” in the index). The same logic applies though – the database engine tries to hold the index in memory at all times so it’s fast to access, and because it knows the sort order on the index, it can use that to optimise finding the matching records.