03. Offbeat Databases - PART 1

I was interested to know more about different storage mechanisms and decided to explore a few offbeat storages.

Introduction:

I firmly believe in the law of karma. I used to wonder how the law of karma manages the storage of activities and their effects for all the beings of the universe!

That curiosity made me want to know more about different storage mechanisms, so I decided to explore a few wacky storages. The list of references for my research is given in the appendix.

With a little research, I hoped to better understand the following ‘offbeat’ databases:

· In-memory database systems (IMDS) (e.g., SAP HANA, eXtremeDB)

· Graph databases (e.g., Neo4j)

· Hierarchical databases (e.g., Windows Registry)

· Time series databases (e.g., Druid, eXtremeDB, InfluxDB)

· Event stream as database (e.g., Kafka)

In Part 1, we will look at the first two types of databases.

Evaluation Mechanisms:

While analysing the above types of databases, let us understand the special-purpose use cases that they support. We will also try to understand their usual deployment architecture and performance characteristics.

1. In-memory database systems (IMDS):

IMDS system considered:

eXtremeDB

Factors that drove this innovation:

Cheap hardware, declining RAM costs, new chip designs, emergence of data-hungry real-time systems

Some Properties of an IMDS:

1. Faster I/O, because data moves to and from RAM over the DIMM pins rather than to and from a file system.

2. No DB index files.

3. There is no separate ‘cache’ (as seen in a file-based DBMS), so no CPU cycles are spent on managing a page cache.

4. An IMDS performs a single data transfer in each direction, since there are no intermediate copies in a database cache or file system cache.

5. Log-less transactions.

6. Supports multiple OS level processes by placing the DB in shared memory.

7. Size of the transaction is unlimited.

8. A true in-memory database should have just a single memory pool and provide flexibility to use memory as needed.

9. An IMDS should employ memory managers that take advantage of multi-core/multi-CPU systems.

10. When something is deleted from an IMDS, the free space goes back into the general database memory pool and can be reused for any subsequent need, whether it is for a row of a different table, or a page for a tree node, or anything else.

11. I quote from the link the performance of an IMDS compared with a traditional DBMS whose file-based contents are moved to RAM:

In the benchmark, moving the on-disk database to a RAM drive quadrupled read performance, and tripled database write (update) performance. But the same application running on a true in-memory database system delivered much more dramatic performance gains: the IMDS outperformed the RAM-disk database by 4x for database reads and by a startling 420x for database writes.
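Properties 8 and 10 above (a single memory pool whose freed space can be reused for anything) can be illustrated with a toy sketch. This is only an illustration under simple assumptions: the `MemoryPool` class, the fixed-size pages, and the owner names are all hypothetical, and this is not how eXtremeDB or any other IMDS is actually implemented.

```python
class MemoryPool:
    """Toy single pool of fixed-size pages shared by every kind of database object."""

    def __init__(self, page_count):
        self.pages = [None] * page_count        # backing store for all objects
        self.free = list(range(page_count))     # indices of currently unused pages

    def allocate(self, owner, payload):
        """Take any free page, regardless of what kind of object asks for it."""
        if not self.free:
            raise MemoryError("pool exhausted")
        page = self.free.pop()
        self.pages[page] = (owner, payload)
        return page

    def release(self, page):
        """Return a page to the general pool so that anything can reuse it."""
        self.pages[page] = None
        self.free.append(page)


pool = MemoryPool(page_count=2)
row_page = pool.allocate("row:orders", {"id": 1})
idx_page = pool.allocate("tree-node:customers", [10, 20])

pool.release(row_page)                               # deleting the row frees its page
reused = pool.allocate("tree-node:orders", [1])      # a different object type reuses it
```

Here the page that held a row of one table ends up holding a tree node for another, because there is only one pool and no per-table or per-index reservation.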

The architecture diagrams below are sourced from this link.

IMDS Architecture:

Normal DBMS architecture:

2. Graph Database:

A graph database is defined as a specialized, single-purpose platform for creating and manipulating graphs. Graphs contain nodes, edges, and properties, all of which are used to represent and store data in a way that relational databases are not equipped to do.

Please refer to this link for a primer on graph databases. There are two types of graph databases: labelled property graphs (for analytics, focusing on data nodes and their relationships) and RDF graphs (for linking resources as knowledge graphs or as linked data).
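To make the two models a little more concrete, here is a rough sketch of the same small fact ("Asha works at Acme") in both shapes, using plain Python structures. The names, the `ex:` prefixes, and the helper functions are made up for illustration; real systems use their own storage formats and query languages.

```python
# Labelled property graph: nodes and edges both carry a label and properties.
lpg_nodes = {
    "n1": {"label": "Person", "props": {"name": "Asha"}},
    "n2": {"label": "Company", "props": {"name": "Acme"}},
}
lpg_edges = [
    {"from": "n1", "to": "n2", "label": "WORKS_AT", "props": {"since": 2019}},
]

# RDF graph: everything is a (subject, predicate, object) triple.
rdf_triples = {
    ("ex:asha", "rdf:type", "ex:Person"),
    ("ex:asha", "ex:name", "Asha"),
    ("ex:acme", "rdf:type", "ex:Company"),
    ("ex:acme", "ex:name", "Acme"),
    ("ex:asha", "ex:worksAt", "ex:acme"),
}


def employers_lpg(person_id):
    """Follow WORKS_AT edges from a node and return the employer names."""
    return [
        lpg_nodes[edge["to"]]["props"]["name"]
        for edge in lpg_edges
        if edge["from"] == person_id and edge["label"] == "WORKS_AT"
    ]


def employers_rdf(subject):
    """Match ex:worksAt triples, then look up the ex:name of each employer."""
    companies = [o for (s, p, o) in rdf_triples if s == subject and p == "ex:worksAt"]
    return [o for c in companies for (s, p, o) in rdf_triples if s == c and p == "ex:name"]
```

In the property graph the `since` value sits directly on the edge, whereas a plain triple has no slot for it, which is one practical difference between the two models.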

System considered:

Neo4j

Factors that drove this innovation:

The rate of growth in the graph database category makes it clear that most organizations want to take advantage of the connections within their data and are exploring multiple means of doing so. Graph databases reduce the CPU cycles that would otherwise be spent on joining related data.

Some Properties of a Graph Database:

1. Because graph databases explicitly store relationships, queries and algorithms that use the connectivity between vertices can run in under a second rather than in hours or days. Users don’t need to execute countless joins, and the data can more easily be used for analysis and machine learning to discover more about the world around us.

2. Graph databases generally run queries in languages such as Property Graph Query Language (PGQL). The example below shows the same query in PGQL and SQL.

3. The relationships allow data in the store to be linked together directly and, in many cases, retrieved with one operation. Graph databases treat the relationships between data as a priority. Relationships are first-class citizens in a graph database and can be labelled, directed, and given properties.

4. In September 2019 a proposal for a project to create a new standard graph query language (ISO/IEC 39075 Information Technology — Database Languages — GQL) was approved by members of ISO/IEC Joint Technical Committee 1 (ISO/IEC JTC 1). GQL is intended to be a declarative database query language, like SQL.

5. Storage can be based on an RDBMS, a key-value store, a document-oriented database, or a custom engine (as in Neo4j).

6. Index-free adjacency:

Data lookup performance depends on how quickly one node can be reached from another. With index-free adjacency, each node holds direct physical RAM addresses of its adjacent nodes and points to them physically, which results in fast retrieval. Native graph databases use index-free adjacency to process CRUD operations on the stored data.

7. Compared with relational databases, graph databases are often faster for associative data sets and map more directly to the structure of object-oriented applications. They can scale more naturally to large datasets as they do not typically need join operations, which can often be expensive. As they depend less on a rigid schema, they are marketed as more suitable to manage ad hoc and changing data with evolving schemas.

8. For some workloads, graph databases are reported to be 1000x faster than relational databases.

9. A well-architected graph DB like Neo4j (which uses its own design of underlying DB engine) scales to support massive growth in data and users while optimizing costs. Scale-out is done with high throughput while maintaining an elastic infrastructure with Autonomous Clustering. It allows very large graphs to be divided into shards and queried efficiently.

10. Operational trust: Neo4j is the most deployed graph database in the world. ACID compliance guarantees transactional integrity for mission-critical workloads across billions of nodes and trillions of relationships, while keeping query responses to milliseconds. Granular security is built specifically for the graph.
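The idea behind index-free adjacency (property 6 above) can be sketched by comparing two ways of answering "who are the friends of my friends?". This is a minimal illustration with made-up names and data: Python object references stand in for physical pointers, and it does not describe Neo4j's internal storage format.

```python
# Relational style: relationships live in one edge table that is scanned
# (or looked up through an index) once for every hop.
edges_table = [("alice", "bob"), ("bob", "carol"), ("carol", "dave")]


def fof_by_scan(name):
    """Two passes over the edge table, similar in spirit to a self-join."""
    first = {dst for src, dst in edges_table if src == name}
    second = {dst for src, dst in edges_table if src in first}
    return sorted(second - first - {name})


# Graph style: each node holds direct references to its adjacent nodes.
class Node:
    def __init__(self, name):
        self.name = name
        self.neighbours = []    # references to adjacent Node objects


def fof_by_pointer(node):
    """Follow references hop by hop; no table scan and no index lookup."""
    direct = set(node.neighbours)
    second = {n2 for n1 in node.neighbours for n2 in n1.neighbours}
    return sorted(n.name for n in second if n is not node and n not in direct)


# Build the same graph with object references.
nodes = {name: Node(name) for name in ("alice", "bob", "carol", "dave")}
for src, dst in edges_table:
    nodes[src].neighbours.append(nodes[dst])
```

Both functions give the same answer for this data, but the cost of `fof_by_scan` grows with the size of the whole edge table, while `fof_by_pointer` only touches the nodes it actually visits.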

High-level architecture of Neo4j: