Database best practices for SDLs and database types (e.g., document, relational, knowledge graphs)

There are various database paradigms, some popular ones being document-based (e.g., MongoDB), relational (e.g., PostgreSQL), and knowledge graph-based (e.g., Neo4j).

Generally, a document-based database is the easiest to get started with (especially given its parallels with Python dictionaries, which most users already know), but it tends to adapt poorly as needs change and schemas evolve.
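For concreteness, here is a minimal sketch of the document style, assuming a local MongoDB instance and pymongo; the database, collection, and field names are just made-up examples:

```python
# A minimal sketch of the document-store style, assuming a local MongoDB
# instance and pymongo; the database/collection/field names are hypothetical.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
samples = client["sdl_demo"]["samples"]  # hypothetical database and collection

# A record is just a Python dict; nested metadata needs no schema up front.
record = {
    "sample_id": "S-0001",
    "composition": {"Ni": 0.6, "Fe": 0.4},
    "processing": {"anneal_C": 450, "time_min": 30},
    "xrd": {"two_theta": [20.0, 20.1, 20.2], "intensity": [12, 15, 11]},
}
samples.insert_one(record)

# Querying on nested fields is straightforward, but renaming or restructuring
# fields later means migrating every document yourself.
for doc in samples.find({"processing.anneal_C": {"$gte": 400}}):
    print(doc["sample_id"])
```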

SQL and other relational databases have a long history, and while the barrier to entry is generally higher, they seem to adapt relatively well as needs change. Data for the physical sciences (characterization results, compositions and processing conditions, and other data and metadata) are inherently relational.
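As a minimal sketch of what “inherently relational” looks like in practice (using Python’s built-in sqlite3 as a stand-in for PostgreSQL; the tables and columns are hypothetical):

```python
# A minimal sketch of the relational idea, with sqlite3 (stdlib) standing in
# for PostgreSQL; the table and column names are hypothetical.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE samples (
    sample_id TEXT PRIMARY KEY,
    composition TEXT,
    anneal_c REAL
);
CREATE TABLE measurements (
    measurement_id INTEGER PRIMARY KEY,
    sample_id TEXT REFERENCES samples(sample_id),
    technique TEXT,
    result_value REAL
);
""")
con.execute("INSERT INTO samples VALUES ('S-0001', 'Ni0.6Fe0.4', 450)")
con.execute(
    "INSERT INTO measurements (sample_id, technique, result_value) "
    "VALUES ('S-0001', 'conductivity', 1.2e6)"
)

# The relationships (sample -> measurements) live in the schema, so joins
# answer questions across tables without restructuring the data.
rows = con.execute("""
    SELECT s.sample_id, s.anneal_c, m.technique, m.result_value
    FROM samples s JOIN measurements m ON m.sample_id = s.sample_id
""").fetchall()
print(rows)
```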

Graph-based databases tend to come packaged with terms that sound nebulous at first, such as ontologies and knowledge graphs. Generally, people who come to understand them agree that they are the “long-term dream”, in some sense taking the relational approach a step further. However, they also seem to have one of the highest learning curves, and the design choices (especially around ontologies) involve a fair bit of ambiguity and careful thought.
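For a flavor of the graph style, here is a small sketch assuming a local Neo4j instance and the official neo4j Python driver; the node labels, relationship types, and credentials are made up on the spot, which is exactly the kind of design choice an ontology is meant to pin down:

```python
# A minimal sketch of recording experiment provenance as a graph, assuming a
# local Neo4j instance and the official neo4j Python driver; labels,
# relationship types, and credentials are hypothetical.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

cypher = """
MERGE (s:Sample {sample_id: $sample_id})
MERGE (p:Process {name: $process, anneal_c: $anneal_c})
MERGE (m:Measurement {technique: $technique, value: $value})
MERGE (s)-[:PROCESSED_BY]->(p)
MERGE (s)-[:MEASURED_BY]->(m)
"""

with driver.session() as session:
    session.run(cypher, sample_id="S-0001", process="anneal",
                anneal_c=450, technique="conductivity", value=1.2e6)
driver.close()
```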

Depending on the discipline, the data types and data formats can vary a lot. It gets more complex when you start thinking about integrations with electronic lab notebooks (ELNs), laboratory information management systems (LIMS), or direct upload of data by devices. How are you approaching this in your labs? What is working well, or not working so well? Even with highly custom implementations, it would help to be aware of the thinking that went into the design decisions.

Cc @willigo09. Also, Sergio and Matthias Popp.


EDIT: At the knowledge graph workshop at Accelerate ‘25, @benjimaruyama answered a question about the value proposition of a knowledge graph for a company that already has a good relational database: “What questions do you want to ask that you haven’t been able to ask already?”


We use a MySQL database. We didn’t implement it from scratch, though; we’re using eLabFTW as both an ELN and a LIMS. It’s a little clunky at times, but I appreciate its flexibility.

I haven’t thought about it too much yet, but I assume the data within our db could be ported to a knowledge graph relatively easily.
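As a rough sketch of what that port might look like (sqlite3 standing in for MySQL here; the database path, tables, and columns are all hypothetical):

```python
# A rough sketch of one common porting route: dump relational rows into node
# and edge CSV files that a graph database can bulk-import. sqlite3 stands in
# for MySQL, and the database path, tables, and columns are hypothetical.
import csv
import sqlite3

con = sqlite3.connect("lab.db")  # hypothetical copy/export of the existing db

with open("nodes.csv", "w", newline="") as nf, open("edges.csv", "w", newline="") as ef:
    nodes = csv.writer(nf)
    edges = csv.writer(ef)
    nodes.writerow(["id", "label", "title"])
    edges.writerow(["source", "target", "type"])

    # Each experiment row becomes a node; each experiment-to-sample link becomes an edge.
    for exp_id, title in con.execute("SELECT id, title FROM experiments"):
        nodes.writerow([f"exp-{exp_id}", "Experiment", title])
    for exp_id, sample_id in con.execute(
        "SELECT experiment_id, sample_id FROM experiment_samples"
    ):
        edges.writerow([f"exp-{exp_id}", f"sample-{sample_id}", "USES_SAMPLE"])
```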


Coming to this thread late, but thought I might contribute.

We’ve been coming to the idea that different tools are useful in different contexts.

  • We have to handle a lot of long time-series data where looking at one datapoint isn’t very helpful, but retrieving the whole time-series is. We’re finding that a document-based approach works well there, because pulling back the whole series as one record is much more performant.
    • I’ve had some early success with Delta Lake, which is a little heavy-duty, but also gives you a full history of your data changes for free. Delta Lake also handles schema changes really effectively (there’s a small sketch of the append/time-travel pattern after this list).
  • We also do a lot of “dimensional” data (e.g. “What project does this belong to?”) and aggregated data (e.g. “What was the overall best yield achieved in this experiment?”), which an RDBMS handles really well, since it’s so well-supported and flexible (see the query sketch after this list).
    • Postgres has been our choice here, again, because it’s been around for a long time and has a lot of support.
  • Finally, we’ve found that as you cross domains the joins get super complicated and performance really slows down in a SQL database, so we’re toying with the idea of a semantic layer on top of everything, built in a graph database, that gets to that “long-term dream” without having to build a perfect ontology of the universe. The idea is that the semantic layer describes the important relationships and otherwise mostly points down to where the actual data lives (see the last sketch after this list), so it’s fast both to find things and to load and analyze them.
    • This is super early, but so far Neo4j has been easy enough to use for prototyping, and seems like it has a lot of community support.
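Here is roughly what the append-and-time-travel pattern from the first bullet looks like, as a minimal sketch assuming the deltalake (delta-rs) and pandas packages; the table path and columns are made up:

```python
# A minimal sketch of appending time-series batches to a Delta table and
# reading back earlier versions; the path and column names are hypothetical.
import pandas as pd
from deltalake import DeltaTable, write_deltalake

path = "./reactor_temperature"  # hypothetical local table path

# Append each new batch of sensor readings; every write becomes a new version.
batch = pd.DataFrame({
    "run_id": ["R-0042"] * 3,
    "timestamp": pd.to_datetime(["2025-01-01 00:00", "2025-01-01 00:01",
                                 "2025-01-01 00:02"]),
    "temperature_c": [25.0, 25.4, 25.9],
})
write_deltalake(path, batch, mode="append")

# The transaction log gives you history and time travel for free.
dt = DeltaTable(path)
print(dt.history())                              # who wrote what, and when
df_now = dt.to_pandas()                          # current state of the table
df_v0 = DeltaTable(path, version=0).to_pandas()  # table as of the first write
```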
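And the kind of dimensional/aggregated query from the second bullet, sketched with Python’s built-in sqlite3 standing in for Postgres (hypothetical table and columns):

```python
# A quick sketch of the aggregated-query case, with sqlite3 (stdlib) standing
# in for Postgres; the yields table and its columns are hypothetical.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE yields (experiment_id TEXT, run_id TEXT, yield_pct REAL)")
con.executemany(
    "INSERT INTO yields VALUES (?, ?, ?)",
    [("E-0042", "R-001", 61.2), ("E-0042", "R-002", 73.5), ("E-0043", "R-001", 55.0)],
)

# "What was the overall best yield achieved in this experiment?"
best = con.execute(
    "SELECT experiment_id, MAX(yield_pct) FROM yields GROUP BY experiment_id"
).fetchall()
print(best)  # e.g. [('E-0042', 73.5), ('E-0043', 55.0)]
```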
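Finally, a rough sketch of the semantic-layer idea from the last bullet, assuming Neo4j and the official Python driver; the labels, relationship types, and data URIs are all hypothetical. The point is that the graph mostly stores pointers to where the heavy data lives rather than the data itself:

```python
# A rough sketch of a semantic layer: graph nodes hold the important
# relationships plus pointers (URIs/paths) to where the actual data lives.
# Labels, relationship types, credentials, and URIs are hypothetical.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

cypher = """
MERGE (p:Project {name: $project})
MERGE (e:Experiment {experiment_id: $experiment_id})-[:PART_OF]->(p)
MERGE (t:TimeSeries {uri: $delta_path})       // points at a Delta table
MERGE (r:ResultTable {uri: $results_table})   // points at a Postgres table
MERGE (e)-[:HAS_RAW_DATA]->(t)
MERGE (e)-[:HAS_RESULTS]->(r)
"""

with driver.session() as session:
    session.run(
        cypher,
        project="electrolyte-screening",
        experiment_id="E-0042",
        delta_path="s3://lab-data/reactor_temperature",
        results_table="postgres://lab/results.yields",
    )
driver.close()
```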

Really important caveat - my work involves building data systems that allow collaboration across research organizations, and this kind of multi-layer data platform has a ton of overhead. Probably crazy overkill for a single lab.

Still, I think the philosophy of not trying to make one tool do everything is a useful one when you don’t have a lot of time.
