Concept

The aim is to build a storage system suitable for continuously running machine learning (ML) systems where new data and new algorithms are arriving all of the time. Note the implication in that last sentence: the algorithms are as important as the data, not second-class citizens as a traditional DB makes them. For this first version the goal is proof of concept, with the right interface/design but a simplified backend that just uses existing DBs, making this an access layer (or federated DB). The aim is still for it to be capable of solving real problems though.

Note that there are a lot of "ML databases" out there that are just a database and a pile of algorithms glued together, as some kind of flailing technical debt monster. That's missing the point. A database is a tool/component for developing software, which includes the ML algorithms themselves.

To summarise, this is what happens if you pour a makefile and an open-world computer game's storage engine into a virtual particle accelerator and smash them together. Alternatively, a change/time-aware data store, in contrast to current databases, which take no responsibility for managing updates or for supporting commits that could take hours to run. It can also be observed that this is an example of a blackboard system, if anyone remembers what those are (I only realised that after the first pass).

Entity component system

The use cases tie in well with the entity-component-system (ECS) model that has become a favourite for open-world computer games, due to how it orients itself around simulation. The same principles make it ideal for ML, which has similar properties to a simulation, though with a much higher data load (it won't fit in RAM, unlike the world state of a computer game, which is expected to) and a much longer run time (proper job management is needed).

This means we have the following structure:

  1. Entities, which are just identifiers (eid).
  2. Components, typed data attached to entities, each stored in some backend.
  3. Systems, processes that find entities with a given set of components, do some work, and write the results back as further components.

To map this back to the web scraper scenario from the use cases: the scraper process (system) finds entities with a URL component but no html component, downloads the web page and puts it into a (new) html component. Another system (process) then parses the web page and writes back that structure as another component. Yet another extracts URLs to be fetched, creating entities with those URLs, such that the web scraper system sees them and does its thing. And another spots when a web page has been parsed and extracts the main text, followed by another that tags it with parts of speech, and so on.
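
As a minimal sketch of that flow, assuming nothing more than a dict-of-dicts in place of the real store (all names here are hypothetical):

    # Toy in-memory ECS: entities are ids, components are per-entity dicts.
    entities = {
        1: {"url": "https://example.com"},                             # has a url, no html yet
        2: {"url": "https://example.org", "html": "<html>...</html>"},
    }

    def entities_with(have, lacking=()):
        """Entities that have every component in `have` and none in `lacking`."""
        return [eid for eid, comps in entities.items()
                if all(c in comps for c in have)
                and not any(c in comps for c in lacking)]

    # The scraper system: fetch pages for entities with a url but no html.
    for eid in entities_with(["url"], lacking=["html"]):
        entities[eid]["html"] = "<html>...fetched...</html>"  # a real download goes here

    # A parser system would then pick up entities with html but no parse tree, etc.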

Being a database, we need the concept of multiple environments in which the above is occurring; we refer to each of them as a World. A security model is needed, with accounts stored in the :global world; see security.

Queries

Queries are simple (and all done within the context of a world): you can ask for

  1. A list of components.
  2. The variables and their type information for any given component.
  3. A list of entities that satisfy a specific component list (including not having a component).
  4. Which components an entity has.
  5. The contents of a component for a specific entity. For files this will require streaming.

This is a long way short of a proper database, but it supports what's needed for an initial version, and full database querying probably isn't needed for the ML part of a complete system anyway. Indices will ultimately be needed, i.e. sets that represent combinations of other sets, such as .+sheep-cats for sheep that are not cats (note that - and + are not allowed in user-provided component names).
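
As a rough illustration of how such index names might be interpreted (the convention is only hinted at above, so the parsing below is an assumption, with the leading . taken to mean the base set of all entities):

    import re

    def parse_index_name(name):
        """Split an index name such as '.+sheep-cats' into (include, exclude) sets.
        Assumes '+' prefixes required components and '-' prefixes excluded ones;
        the leading '.' (the base set) is simply skipped."""
        include, exclude = set(), set()
        for sign, comp in re.findall(r"([+-])([^+-]+)", name):
            (include if sign == "+" else exclude).add(comp)
        return include, exclude

    print(parse_index_name(".+sheep-cats"))  # ({'sheep'}, {'cats'})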

The variety of ML requirements means that having support for multiple component backends is valuable, particularly as each backend is allowed to offer further query possibilities. For instance, you could have a backend that does C structs, which is limiting but keeps it fast. As another example, you could have a backend that does vectors and keeps a kd-tree around, so it can answer nearest neighbour queries. Another that's just a traditional DB, with the usual set of features. Or another designed for large files that interfaces with a distributed file system. And so on!

There will be an extension system, so components can offer interfaces to these extra features, with the ability to query which interfaces a component supports. Simple get/set will be included as an extension, for feature orthogonality.
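
A sketch of how backends and their extension interfaces might look, under the assumption that interfaces are advertised by name; brute-force search stands in for the kd-tree to keep it short, and all class and method names are made up:

    import math

    class Backend:
        """Hypothetical base class: every backend offers get/set, extras are queryable."""
        def interfaces(self):
            return {"get_set"}
        def get(self, eid, name): ...
        def set(self, eid, name, value): ...

    class VectorBackend(Backend):
        """Stores fixed-length vectors and advertises a nearest-neighbour extension.
        A real version would keep a kd-tree; brute force keeps the sketch short."""
        def __init__(self):
            self.data = {}
        def interfaces(self):
            return super().interfaces() | {"nearest_neighbour"}
        def set(self, eid, name, value):
            self.data[eid] = value
        def get(self, eid, name):
            return self.data[eid]
        def nearest(self, query):
            return min(self.data, key=lambda eid: math.dist(self.data[eid], query))

    vectors = VectorBackend()
    vectors.set(1, "embedding", (0.0, 1.0))
    vectors.set(2, "embedding", (3.0, 4.0))
    if "nearest_neighbour" in vectors.interfaces():
        print(vectors.nearest((0.1, 0.9)))  # 1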

Work queues

A work queue is quite complicated in the background, but for a system mostly just involves specifying:

  1. Dependent components, i.e. do this work for each entity that has this set of components.
  2. Output components, i.e. what will be added/replaced.

In the background it keeps track of any work that is ongoing and makes sure not to issue two jobs that intend to write the same output component. This is done via a token that the system must regularly check in with, to renew its reservation; this will happen automatically as long as the code doesn't crash. It also keeps track of update times and the revision numbers of algorithms (systems), so it will automatically update old versions when needed. As a version/revision number could be "how much data the algorithm was trained with", thresholds should be supported, so that it doesn't rerun if the model has only changed a little (this is good for the environment). When dependency structures form it will handle them correctly, not running a job until all of its dependencies have been satisfied. There will also need to be prioritisation, running within the database itself.
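
A sketch of the reservation and versioning logic described above, with made-up names and a fixed timeout standing in for whatever renewal policy is eventually chosen:

    import time

    class Reservation:
        """Hypothetical reservation tokens: one writer per (eid, output component),
        kept alive by regular renewals and reclaimed if the holder goes quiet."""
        TIMEOUT = 60.0  # seconds without renewal before the job is reissued

        def __init__(self):
            self.held = {}  # (eid, component) -> expiry time

        def try_reserve(self, eid, component):
            key, now = (eid, component), time.monotonic()
            if key in self.held and self.held[key] > now:
                return False            # someone else is already working on it
            self.held[key] = now + self.TIMEOUT
            return True

        def renew(self, eid, component):
            self.held[(eid, component)] = time.monotonic() + self.TIMEOUT

    def needs_rerun(stored_version, current_version, threshold=0.0):
        """Rerun only if the system's version has moved by more than `threshold`,
        e.g. when the version is 'amount of training data seen'."""
        return (current_version - stored_version) > threshold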

This would ideally allow someone to write (Python) code that looks something like

    with <db connection>.version(<version info for code>):
        for eid in <db connection>.queue(<query: required components / output components>):
            <get components>
            <do work>
            <write components>

and have it behave exactly as you would expect, including waiting for data when none exists and exiting the loop on an exception in a safe way. At some level the main point of this idea is to make trying out new algorithms on a complex and evolving data set as simple as doing the above.

Version and edit time information will be recorded into a separate (automatic) component, i.e. if you write to the component rabbit then all of this information goes into the rabbit= component. This component will almost certainly have its own specialist backend in the long run, as it's an easy one to optimise (the version information will tend to be repeated, a lot!). Future versions may record further metadata into this component, e.g. legal information.
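
For illustration only, a guess at what one record of such an automatic component might contain (the field set is undecided and these names are not part of the design):

    # Hypothetical contents of the rabbit= record for one entity.
    rabbit_meta = {
        "system": "pos_tagger",           # which system wrote the rabbit component
        "version": "2024.3",              # its revision / training-data counter
        "edited": "2024-05-01T12:34:56Z", # when the write happened
    }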

Note that there will need to be a skip() function: a computer may not have the resources to run a given task, and needs to give it back so a computer that does can do the work. There is also the scenario of a web scraper that has hit a specific server too many times recently and hence should go and annoy other servers instead. For this reason the skip() function should by default mean never giving this exact task back to this loop again, but there should be an optional delay parameter, to indicate that it would be happy to try it again after a certain period of time has passed. This logic should mostly run locally, as we don't want the DB to waste resources recording skips, but equally this implies the API for the work queue needs some "but not that one" functionality.
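
A sketch of the client-side bookkeeping this implies, with hypothetical names; the permanently skipped and delayed tasks together form the "but not that one" list handed to the queue:

    import heapq, time

    class SkipTracker:
        """Client-side record of skipped tasks, so the loop never re-requests them
        (or only after an optional delay); the DB itself never sees these skips."""
        def __init__(self):
            self.never = set()   # tasks never to hand back to this loop
            self.later = []      # heap of (retry_at, task)

        def skip(self, task, delay=None):
            if delay is None:
                self.never.add(task)
            else:
                heapq.heappush(self.later, (time.monotonic() + delay, task))

        def excluded(self):
            """Tasks to pass to the queue as 'but not these ones'."""
            now = time.monotonic()
            while self.later and self.later[0][0] <= now:
                heapq.heappop(self.later)   # delay has passed, allow it again
            return self.never | {task for _, task in self.later}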

Consultants

A consultant is a service (process) that makes itself available to multiple worlds, quite possibly all of them. It can then be attached to specific work queues, which it will work on. This requires a different interface: there needs to be a mapping to the specific components in the world it is being applied to. The :global world contains a list of available consultants, and the local :work component in each world contains the mapping from the local components to the named components of the consultant. Prioritisation will be needed, but its exact form is unclear.
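
For illustration, a guess at what the local :work mapping for one consultant might look like (the field names are assumptions, not part of the design):

    # Hypothetical :work entry in one world, mapping the consultant's component
    # names onto the local component names it should read and write.
    attach_consultant = {
        "consultant": "pos_tagger",            # name registered in :global
        "inputs":  {"text": "article_text"},   # consultant name -> local component
        "outputs": {"tags": "article_pos"},
    }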

Consultants need to be authenticated, and this needs to be precise, to avoid the risk in a multi-user scenario of one user seeing another's data by creating a consultant. Consultants will have names, and an account can only attach consultants to names that it has been authorised to attach to.

Edit stream

The system allows you to subscribe to the sequence of edits, just as a list of eid as they are changed in any way. This is partly because the work queue system needs it anyway, and partly because the handling of jobs that involve groups of entities will need it. Each edit is assigned an always-increasing number, which is also provided, and you can request all edits that have occurred since a given number. The history will always be sparse, because each eid can only appear once, in its most recent position.
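
A minimal sketch of such a sparse edit stream, assuming a single process and in-memory storage:

    import itertools

    class EditStream:
        """Every edit bumps a global counter; an eid only ever keeps its most
        recent position, which is what makes the history sparse."""
        def __init__(self):
            self.counter = itertools.count(1)
            self.latest = {}   # eid -> sequence number of its last edit

        def record(self, eid):
            self.latest[eid] = next(self.counter)

        def since(self, seq):
            """All (seq, eid) edits after `seq`, oldest first."""
            return sorted((s, e) for e, s in self.latest.items() if s > seq)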

Layers

Sharing contexts between worlds is necessary: typically you have deployment and testing servers when coding a website or other online system. That's a problem once ML is involved, as replicating all of the hardware required to train models is rather expensive. Duplicating training would be insane, and not very green. You may also want to conduct experiments, where you try something out in an isolated space and see if it works: an independent system for each experiment is also duplication.

This can be resolved by allowing worlds to be "layered", i.e. a world inherits from another parent world, and can see, but not edit, all of its components. If a component exists in more than one layer the code will work its way up the hierarchy, returning the first value that exists for the requested eid. It should be noted that this can get a little weird: the components at each level can be in different backends and have different structures. This has to be supported, and including the source world in the return may be necessary. Creating the component in each layer remains explicit, so if an inherited component lacks a partner in the current layer it's effectively read only.
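
A sketch of the layered lookup, assuming each world is an object with a parent link and per-component dicts (names are hypothetical):

    class World:
        def __init__(self, parent=None):
            self.parent = parent
            self.components = {}   # name -> {eid: value}

    def read_component(world, eid, name):
        """Return (source_world, value) from the nearest layer that holds this
        component for the eid, walking up towards the root '.' world."""
        layer = world
        while layer is not None:
            store = layer.components.get(name, {})
            if eid in store:
                return layer, store[eid]
            layer = layer.parent
        raise KeyError((eid, name))

    root = World()
    test = World(parent=root)
    root.components["url"] = {1: "https://example.com"}
    print(read_component(test, 1, "url")[1])   # inherited from the parent layer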

Conceptually there is a . world, which is at the top of the hierarchy of layered worlds. Each world has its own set of eid, but can see and assign components to eid that are defined in parent worlds. To avoid identifier clashes a global number stores the largest eid issued, and newly created ones are always larger. There is a security implication here, in that you can potentially observe statistics about eid creation in worlds you don't have access to, but it's slight so I'm ignoring it for now; to minimise it, worlds reserve blocks of eid at a time. There is a fun little detail in this: tombstones are needed, because it must be possible to delete something in the current world when it's actually provided by a parent world, i.e. if you delete something it writes a tombstone to indicate it has been deleted, even though it happily continues to exist in the parent. The version/edit time components (component name with a = on the end) should support this, so backends don't have to deal with this logic. Not necessarily the most efficient solution, but if reading from this DB is your bottleneck you're using it wrong, plus a caching layer can always be added later. Note that if a parent world deletes an entity it vanishes from children as well; this is arguably a bit weird (we have tombstones for deletion, after all) but I'm leaving it as is, as I think it fits typical use cases better.
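
Sketches of the block allocation and the tombstone write, reusing the hypothetical World object from the previous sketch; the block size and the use of a sentinel marker are assumptions:

    class EidAllocator:
        """Hands out eid in blocks per world, so allocation only touches the global
        counter occasionally and reveals less about other worlds' activity."""
        BLOCK = 1024

        def __init__(self):
            self.next_block = 0   # global largest-issued boundary

        def reserve_block(self):
            start = self.next_block
            self.next_block += self.BLOCK
            return iter(range(start, start + self.BLOCK))

    # Tombstones: deleting an inherited component writes a marker into the child's
    # version/edit (name=) component rather than touching the parent's data; the
    # read path would check for this marker before walking up the hierarchy.
    TOMBSTONE = object()

    def delete_component(world, eid, name):
        world.components.setdefault(name + "=", {})[eid] = TOMBSTONE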

Group systems

The handling of ML tasks that revolve around groups of entities is supported, but is a little convoluted and arguably a weakness of this design. The solution is to have two systems:

  1. A grouping system, that subscribes to the edit stream and identifies when a group changes, maintaining an entity that represents the group as a whole. A timestamp should be included so the system below knows when the group has changed.
  2. A normal system, that does the actual work with the group by monitoring the group entities, using the timestamp to decide if it needs to rerun. It may want to do some sleeping to prevent running too often.

It's probable that certain grouping strategies will be sufficiently common that they should be built in, but I'm going to leave that for now (the design does actually include quite a bit to help with this, though it could provide more). Note that dependencies have to be checked explicitly by the grouping system, which is not ideal. Some helpers could easily be provided to support this however.
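
A sketch of the two systems, reusing the hypothetical EditStream from the edit stream section and a dict-of-dicts entity store; the member_of, members and changed component names are made up for illustration:

    import time

    def grouping_system(edit_stream, entities, last_seq=0):
        """Watch the edit stream, rebuild the group entity whenever one of its
        members changes, and stamp it so workers can tell."""
        for seq, eid in edit_stream.since(last_seq):
            last_seq = seq
            if "member_of" in entities.get(eid, {}):
                group_eid = entities[eid]["member_of"]
                group = entities.setdefault(group_eid, {})
                group.setdefault("members", set()).add(eid)
                group["changed"] = time.time()   # the timestamp workers check
        return last_seq

    def group_worker(entities, last_run):
        """The normal system: redo group-level work only if the group has changed
        since this worker last ran on it."""
        for eid, comps in entities.items():
            if "members" in comps and comps.get("changed", 0) > last_run.get(eid, 0):
                # ... do the actual group-level work here ...
                last_run[eid] = comps["changed"]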

Weaknesses

Grouping as discussed above.

I've focused on the slow update problem, and mostly ignored queries. In practice an ML database that allows you to store vectors and do nearest neighbour lookups (for assorted distance functions) efficiently is needed. This of course implies support for nD arrays as variables in the entities. The extension system allows this problem to be solved, but the specifics are yet to be defined.

I have ignored atomicity, joins etc. The usual DB stuff, all of which is still useful. The above design should have a presumption of atomicity on component updates however, which at least avoids the most probable kind of corruption/inconsistency. Supporting atomicity across components is going to be ignored for this version however: it's just too hard with multiple backends. If it's ever added it will probably be for specific backends only, maybe even specific backend combinations. This would imply a need to query what would work, for compatibility testing when loading a set of services into a world.