Mar 15, 2024 10 min read

The database is an implementation detail

Documents and Folders ... oh my!

"The database is an implementation detail - agree?"

That should get some reactions!!!

And the answer is "sort of".

Some back-story

But first a bit of background.

When I was a kid there were some books or magazines or something that had a series of puzzles that I used to love.

They would describe something totally ordinary, but from the perspective of an alien who had no idea about how life on earth worked. I adored the idea of trying to figure out stuff, that we totally took for granted, but looking at it from another perspective.

Also, when I was a kid, my favourite book was "The Hitch-hiker's Guide to the Galaxy". It's still one of my favourites. On the surface, it's a silly sci-fi tale, but underneath, it's a deep book on philosophy^[1] (and I ended up doing philosophy^[2] at university). My favourite bit was, and still is, about the dolphins.

“For instance, on the planet Earth, man had always assumed that he was more intelligent than dolphins because he had achieved so much - the wheel, New York, wars and so on — whilst all the dolphins had ever done was muck about in the water having a good time. But conversely, the dolphins had always believed that they were far more intelligent than man — for precisely the same reasons.”

The same facts - dolphins muck about in the water while humans do important stuff - but two entirely different conclusions.

Because the same thing looks different depending upon your perspective^[3].

What does this have to do with databases?

So far, we've been building up a series of "feature classes" that describe various bits of logic in our application. But we've pretty much ignored the database. Now we're at the stage where we need to start considering it.

But, from the point of view of the core functionality, the database does not matter. All the business logic cares about is that "we have a user object with a can? method" and "we have a user-group object with a grant_access_to method". These objects could be stored in database tables, they could be stored in flat files, in memory or in Redis. The business logic - what the app is actually for - doesn't care about storage. It only cares whether it can tell the objects it meets to do certain things and provide the responses that it expects.

However, now we're going to have a perspective shift.

Because, in order to get our original "feature" specification to pass, we need to start building the real-life objects that do the work on behalf of this business logic. And that means storage, it means user-interface, it means efficiency and it means performance.

Resets and configurations

The code I was showed you last time didn't work.

I had to do a load of tedious mucking about^[4] because my locator and the events publisher both needed to be cleared out after every test. This was so previous registrations did not interfere with the current test.

I added a global after block that called StandardProcedure.reset!, which in turn removed all event subscribers from StandardProcedure::Events, cleared out all registered objects from the locator, then reloaded all the defaults.

This in turn meant I had to add in a configure option, which ended up being more complicated than I first expected. Calling StandardProcedure.configure yields a block, passing in an instance of the StandardProcedure object. That block populates the locator with the relevant objects - but crucially, it is not called until the StandardProcedure instance is created. Because the locator is configured from multiple places (for example, each of our feature objects handles their own configuration), the configure method actually stores an array of configuration blocks and calls them all, one by one, when required.

This ended up far more work than I was first expecting (about two hours), but sometimes your test suite needs privileged access to your objects so it can do what it needs.

The database structure

Our features mention objects with names like user, user_group, folder, organisation and permission. Now we need to start thinking about what these actually are.

At this point, I've already made some deployment decisions about this system. Those decisions may be wrong, but we need to start somewhere and part of the idea of working this way is to try and make it easy to change our minds later^[5].

I'm going to use sqlite3 for my data storage^[6] because my clients are small organisations and, for the past five years, none of these systems have had to deal with continued heavy load. This means deployment becomes much simpler, as I don't need to worry about managing multiple servers. In fact, I'm hoping I can get away with a single Docker container per client^[7], which using Dokku and Anvil, is incredibly easy to set up and maintain.

The file-system

I also know that this system is going to have some sort of "filesystem" represented within the database. Most businesses are very comfortable with the idea of putting documents into folders - whether they are stored on a network share or in Dropbox. And most projects fail, not for technical reasons, but for people reasons. Meet your audience where they are and gradually guide them forwards.

So, I'm making a design decision right now.

The system is going to have a hierarchy, built from "nodes". Folders will be a subclass^[8] of Node, as will Organisation. I will use the Ancestry gem to organise the nodes into a hierarchy^[9] and each node will have an optional, polymorphic, item reference, which I'll talk about later.

Let me stress again - the business logic we've written so far does not care how the database is structured. So, if I get this wrong, I can change my mind and try something different^[10] without worrying that I will be breaking everything that I've written so far.

So, I'm adding in a Node base class, in a nodes table. There will be a subclass of Node called Folder and a subclass of Folder called Organisation^[11].

class Node < ApplicationRecord
  include HasName
  include IsActive

  enum node_status: {primary: 0, linked: 1}
  has_ancestry cache_depth: true, counter_cache: true
  belongs_to :item, polymorphic: true, optional: true
  has_many :permissions, dependent: :destroy
  has_rich_text :description

  def node
    self
  end

  def folders
    Folder.active.where(id: child_ids)
  end

  def organisations
    Organisation.active.where(id: child_ids)
  end
end

Don't worry about the HasName and IsActive modules - they simply define a to_s method for models with a name attribute and an enum status: {active: 0, inactive: -1} so I have a simple "soft-delete"^[12]. The folders and organisations methods are little helpers to filter the children of this Node (using Ancestry's child_ids method).

We also want something representing a folder that is sym-linked across to another folder, so I'll add in a subclass of Folder called LinkedFolder. The special thing about LinkedFolder is it's item property points to the source Folder and it's children and child_ids methods^[13] returns the children and child_ids from that source folder.

I had a fancy idea to try and mimic how Unix filesystems work using hard-links. In Unix filesystems, there is a hierarchy of directory entries - each of these represents a "file" (which could be a directory, a volume, an actual file or something like a socket or process). The directory entry itself then points to an "inode" which represents the actual contents of that directory entry. So if you want to hard-link a file, say "example.txt", to appear in two separate directories, you would create two directory entries that both point to the same node. This is what the item property on a Node represents - although it's used slightly differently for LinkedFolders^[14].

In the feature specification, we have a setup line @policy = a_file_attachment parent: @policies_folder. The FileAttachment is another ActiveRecord model, in the file_attachments table and the class uses has_one_attached :file to manage the actual file contents.

I created a ruby module called app/models/concerns/has_nodes.rb that looks like this:

module HasNodes
  extend ActiveSupport::Concern

  included do
    has_many :nodes, as: :item, dependent: :destroy

    after_create :create_primary_node
  end

  def node
    @node ||= nodes.primary.first
  end

  def parent_node
    node.parent
  end

  def parent
    @parent ||= parent_node&.item
  end

  def parent= item
    raise ArgumentError.new("#{item} is not part of the node hierarchy") if item.present? && !item&.respond_to?(:node)
    @parent = item
  end

  def child_nodes
    node.children.active
  end

  def children
    child_nodes.collect(&:item)
  end

  private

  def create_primary_node
    nodes.create! name: name, icon: icon, parent: @parent&.node
  end
end

The specification for HasNodes looks like this:

require "rails_helper"

RSpec.describe HasNodes, type: :model do
  # standard:disable Lint/ConstantDefinitionInBlock
  class HasNodes::FakeModel < ApplicationRecord
    include HasNodes
    self.table_name = :fake_models
  end
  # standard:enable Lint/ConstantDefinitionInBlock

  before do
    ActiveRecord::Base.connection.create_table :fake_models, force: true do |t|
      t.string :name, default: ""
      t.string :icon, default: "none"
    end
  end

  context "when inserted into the hierarchy" do
    it "is added to the top of the hierarchy" do
      model = HasNodes::FakeModel.create! name: "My model", parent: nil

      expect(model.parent_node).to be_nil
      expect(model.parent).to be_nil
    end

    it "it added as the child of an existing item" do
      parent_model = HasNodes::FakeModel.create! name: "Parent", parent: nil
      child_model = HasNodes::FakeModel.create! name: "Child", parent: parent_model

      expect(child_model.parent_node).to eq parent_model.node
      expect(child_model.parent).to eq parent_model
    end

    it "it has child items" do
      parent_model = HasNodes::FakeModel.create! name: "Parent", parent: nil

      first_child_model = HasNodes::FakeModel.create! name: "First Child", parent: parent_model
      second_child_model = HasNodes::FakeModel.create! name: "Second Child", parent: parent_model

      expect(parent_model.child_nodes.count).to eq 2
      expect(parent_model.children.count).to eq 2
      expect(parent_model.children).to include first_child_model
      expect(parent_model.children).to include second_child_model
    end

    it "copies its details to the node" do
      model = HasNodes::FakeModel.create! name: "My model", parent: nil

      expect(model.node).to_not be_nil
      expect(model.node.name).to eq model.name
      expect(model.node.icon).to eq model.icon
    end

    it "can only be added as the child of an item that HasNodes itself" do
      non_node_object = Object.new

      expect { HasNodes::FakeModel.create! name: "My model", parent: non_node_object }.to raise_error(ArgumentError)
    end

    it "automatically saves the node to the database when being assigned to an item that has been saved" do
      existing_model = HasNodes::FakeModel.create! name: "My model"
      node = existing_model.node
      expect(node).to_not be_new_record
    end
  end
end

An object that implements HasNodes has a primary Node which is where it is "anchored" within the filesystem. It can also have an arbitrary number of other Nodes, which represent "hard-links" to that object. This means it can appear in the filesystem in multiple places. When you create a HasNodes object, you can pass in a parent object, which is either a Node or another HasNodes^[15]. This tells HasNodes where to this instance belongs within the hierarchy.

What this all means is that there are four classes of model within the system.

There are some, like User and UserGroup that sit outside of the filesystem hierarchy.
There are some, like Folder and Organisation that are descendants of Node and define the filesystem hierarchy.
There are some objects, like FileAttachment that are within the filesystem hierarchy by virtue of being attached to one or more nodes^[16].
And there are "pure" Nodes (not subclasses) which position those objects within the hierarchy.

At the moment, I don't know if this is good design or over-engineering. The next few specifications will tell me - if they're hard to implement I should probably revisit this. But, once again, the business logic does not care, so I'm free to change my mind if it turns out I'm going down the wrong path.

The audit trail

After spending a good few hours sorting the reset! and configuration stuff, writing the migrations and models for the database^[17] and speccing and debugging the HasNodes module, I ran the full specification again. And now it fails on the final two steps:

	When I look at the audit trail
    Then I should see a log of how access to this folder was granted

I've written audit trails like these several times across several systems and I think I know how I need it to work.

But that's a tale for another day.

Like most of Douglas Adams' work. He was a philosopher at heart. ↩︎
Probably the only academic thing I've done that I've really enjoyed. It also turned out to be most valuable thing I've learnt, as it taught me how to manipulate abstract ideas - very important for software developers - and how to communicate those ideas to others - which is essential for pretty much everything you do. ↩︎
This is why discovering object-orientation was also the moment when programming clicked in my head. When you're in Object A, how you see things and what you can do is totally different to when you're in Object B. ↩︎
Another Hitch-Hiker's reference - "The Infinite Improbability Drive was a wonderful new method of crossing interstellar distances in a mere nothing'th of a second, without tedious mucking about in hyperspace." ↩︎
Defer the big decisions till as late as possible, because the start of the project is the time when you have the least understanding. ↩︎
In fact, I'm going to use LiteStack so everything runs off sqlite3 - including ActiveJob and caching - and we should have no external dependencies such as Redis or Memcached. ↩︎
I know lots of people will be screaming REDUNDANCY and HIGH AVAILABILITY at me. If this were Collabor8Online, or one of my larger clients who put tens of thousands of pounds worth of business through the system every day, then there's no way I would make these choices. But I know the StandardProcedure audience very well and they don't need that complexity. They've already been running on a single, small, servers for years and the odd hour of downtime is something they can live with. ↩︎
Using ActiveRecord's Single Table Inheritance ↩︎
The simplest thing with hierarchies is to use a parent_id field, but Ancestry uses a path field that allows you to pull an entire subtree from the database in a single SQL query. And that is really valuable and an example of where our current "real-world" perspective has an impact on the design. ↩︎
At least until I've put it live and my clients have got some real data in the system - at which point changing the database structure becomes a risky activity. ↩︎
I'm actually putting in a further subclass of Organisation called Organisation::Home - if you look back at the specification, the system wants to copy folders from the "home" organisation to the newly created client organisation. The Organisation::Home class handles this, and uses an ActiveRecord validates :type, uniqueness: {case_insensitive: true} constraint to ensure that there is only ever one "home" organisation in the hierarchy. ↩︎
As soon as you start creating audit trails (which we'll get to later) physically deleting records becomes a problem - not least because if something bad happens, you need to be able to see the actual things that caused the issues ↩︎
Which is added by the Ancestry gem and, by default, returns the children of this record from the tree. ↩︎
I read that Unix originally treated directories as inodes. The directory entry would point to an inode representing the directory contents - so the same directory could appear at multiple points in the filesystem. But it ended up causing a load of problems, so the Unix people decided to use soft-links rather than hard-links for directories. Instead of assuming I know better than them, I copied their strategy and "soft-linked" my LinkedFolder instead of "hard-linking" the same Folder object into multiple places. ↩︎
Reading the spec back, I've realised I missed the bit about having a parent Node - it only checks for parent HasNodes. So I'll have to go back and update that. ↩︎
An inode in Unix terms ↩︎
I don't think I wrote a single unit test for the database models. I'm treating these as "dumb data objects" with very minimal behaviour. Any methods that they do have will be covered by a combination of the other sets of specs. This is probably frowned on by "clean code" purists but I suspect that, if I come across a method that needs a test writing, I'll recognise it as I start. ↩︎

Rahoul Baruah

Rubyist since 1.8.6. I like hair, dogs and Kim/Charli/Poppy. Also CTO at Collabor8Online.

Leeds, England