Simple, Fast, Reliable Database

How would one go about designing a persistent database? Such databases exist in many of our operating systems and distributed systems: a system monitoring service, a file service, an authentication service, and so on are all small, performant databases that preserve data across system restarts, concurrent writes, and process failures.

I call them databases because, besides offering persistence, they are memory resident and support a respectable update rate. Moreover, these update operations do not entail disk operations on the client's critical path. Such characteristics allow them to be designed far more simply than general-purpose databases.

The design presents a simplified version of the mini tarsnap[1] or Redis model. To ensure persistence, a strongly typed data structure is maintained in virtual memory. The structure can then be cache-optimized for the use case at hand; mine was a two-step hash compute and key-value datastore for dynamic C compilation in my runtime. Updates are recorded incrementally in a log (either as a program script or in its expanded form in an LSM tree or other tree structure). The log is committed to disk, and a new checkpoint is marked. New processes build on snapshots from previous checkpoints. At any point in time, we maintain two components of the in-memory data structure (the database): a checkpoint of some previous state of the full database, and a log of subsequent updates.

The process maintaining the database forks a child to produce a new checkpoint, while the parent starts a new log for recording subsequent updates. The child creates the new checkpoint from an earlier snapshot and the update log, which ensures consistency; the child's process image is written to disk to form the new checkpoint. If the child fails, the next child uses the earlier snapshot together with two log files (the old log and the one accumulated since), and this is repeated until checkpointing succeeds.

Since the memory-resident data structure houses fused stack/hashtable/object/array types, a thread-safe memory pool table is used for each individual case for efficient memory management (such as buddy or slab allocators, _alloca for stacks, large page allocations with a Fibonacci allotment strategy, mmap with supervised copy-on-write mechanisms, etc.). The current implementation is garbage-collector free.

This mechanism ensures high availability, since update operations are not blocked by the checkpointing mechanism. The technique is effective only for small in-memory databases, however. The big-data case is handled with a vector-processing language runtime (a functorial consuming stack over arrays). That VM is in its early stages of construction, and is a study for later.