On my last two projects I have been dealing with highly hierarchical data that had to be stored into and retrieved from the persistent store at the later stage of the process.
In both cases because of organisational and operational contexts, there was a strong preference - or should I rather say, a constraint - for the Oracle DBMS. As a result we have been left with little choice beyond mapping our hierarchical data into the relational model.
It would probably not be that much of an issue if only we were not to deal with tree-like structures with dimensions not known at design time.
Because other kinds of storage where out of the question, we did not really have a chance to compare how much more performant, if at all, an alternative - say native object/graph - store would prove to be.
My role as an architect is to make technology choices that maximise quality. Sometimes you have no choice but "to do the right thing" and operate within the strategically chosen technology stack, but sometimes you get this fantastic opportunity "to do it right".
I do not feel it would be enough to follow Gartner reports in order to make an informed decision. I need to see with my own eyes in order to believe. Hence I always prove any concept to which I do not have an existing prove.
This time around I'd like to prove that using native graph backend for the tree-like structures will result in significant performance improvements over relational storage even with relatively simple structures.
I've initially chosen OrientDB (NovulaBase) and MySQL (CloudBees) for my test, primarily because they are both available on the cloud under free plans and this makes it possible for everybody to take my test code and run it themselves. After the initial test, which came out quite shocking, I have also added Amazon RDS MySQL to verify that test results against Cloudant MySQL made any sense.
I'd like to avoid starting a war over which database system is greater, however if there are certain advantages of one over another and you know how to prove it in the context of my test and there is publicly and freely available infrastructure, or it could be easily obtained by installing a software locally, please feel free to join the conversation.
My intent is to start simple and possibly evolve and improve the test over time, as long as there is enough interest in the results.
We will start with a basic question: having a relatively simple tree, how long does it take to persist and load the graph? Next, the series of publications will focus on usability aspects of those competing technologies, such as traversal, searching and general engineer friendliness - although the last one might be a bit subjective I guess.
For now let's focus on save and load performance.
Let's state some initial assumptions as they will allows us to put the results into perspective later on.
- For now we will ignore differences in network latency between test machine and the selected SQL and noSQL stores. We will eventually have to incorporate this factor into an overall picture, but first we need to understand the chattiness of both technologies to even consider if network latency differences have any significance.
- No backend type specific optimisations will be performed on the domain model level - both solutions will be tested against structurally indistinguishable sets of classes.
- Object identifiers will be generated by the respective databases using the their default id generation strategies.
- No client side cache enabled for neither reads nor writes.
Source code can be downloaded from github.
The test tree is modelled in the following way:
DOCUMENT <-[0.*]-> CHAPTER <-[0.*]-> PARAGRAPH <-[0.*]-> LINE <-[0.*]-> WORD <-[0.*]-> LETTER
Every node is of class Node, where DOCUMENT, CHAPTER, PARAGRAPH, LINE, WORD and LETTER represent the NodeType property of the node. A tree is fully grown if it contains nodes of all NodeTypes, ie. each subtree extends from the DOCUMENT up to the LETTER. Letter is a terminating node, ie. it does not have any children.
Each database have been subjected to saving and loading of 3 gradually more complex fully-grown tree configurations differing in width, from 1 up to 3. This test defines a width as number of children at each non-terminating node of a tree.
The following table presents metrics gathered from an example test run.
I will need to investigate a bit further, but my guess is that this significant difference can be, at least partially, attributed to the massive chattiness of the ORM technology (Hibernate) when compared to the slick interface of OrientDB's native object API, which rather than inserting each object individually, serialises and transfers the entire graph in one go. I am curious whatever generating identifiers manually and pushing hibernate to perform batch inserts would make a difference. I'm also intending to test relational database based graph persistence against Oracle, to see whatever this makes any difference.
Would be very keen your feedback.