« Code Impressions Episode 1 - Is Spring coming yet? | Main | Do we need architecture? »
Thursday
Jan102013

Storing objects graphs - native vs. relational storage

On my last two projects I have been dealing with highly hierarchical data that had to be stored into and retrieved from the persistent store at the later stage of the process. 

In  both cases because of organisational and operational contexts, there was a strong preference - or should I rather say, a constraint - for the Oracle DBMS. As a result we have been left with little choice beyond mapping our hierarchical data into the relational model.

It would probably not be that much of an issue if only we were not to deal with tree-like structures with dimensions not known at design time.

Because other kinds of storage where out of the question, we did not really have a chance to compare how much more performant, if at all, an alternative - say native object/graph - store would prove to be.

My role as an architect is to make technology choices that maximise quality. Sometimes you have no choice but "to do the right thing" and  operate within the strategically chosen technology stack, but sometimes you get this fantastic opportunity "to do it right".

I do not feel it would be enough to follow Gartner reports in order to make an informed decision. I need to see with my own eyes in order to believe. Hence I always prove any concept to which I do not have an existing prove.

This time around I'd like to prove that using native graph backend for the tree-like structures will result in significant performance improvements over relational storage even with relatively simple structures.

I've initially chosen OrientDB (NovulaBase) and MySQL (CloudBees) for my test, primarily because they are both available on the cloud under free plans and this makes it possible for everybody to take my test code and run it themselves. After the initial test, which came out quite shocking, I have also added Amazon RDS MySQL to verify that test results against Cloudant MySQL made any sense.

I'd like to avoid starting a war over which database system is greater, however if there are certain advantages of one over another and you know how to prove it in the context of my test and there is publicly and freely available infrastructure, or it could be easily obtained by installing a software locally, please feel free to join the conversation.

My intent is to start simple and possibly evolve and improve the test over time, as long as there is enough interest in the results.

We will start with a basic question: having a relatively simple tree, how long does it take to persist and load the graph? Next, the series of publications will focus on usability aspects of those competing technologies, such as traversal, searching and general engineer friendliness - although the last one might be a bit subjective I guess.

For now let's focus on save and load performance.

Let's state some initial assumptions as they will allows us to put the results into perspective later on.

  • For now we will ignore differences in network latency between test machine and the selected SQL and noSQL stores. We will eventually have to incorporate this factor into an overall picture, but first we need to understand the chattiness of both technologies to even consider if network latency differences have any significance.
  • No backend type specific optimisations will be performed on the domain model level - both solutions will be tested against structurally indistinguishable sets of classes.
  • Object identifiers will be generated by the respective databases using the their default id generation strategies.
  • No client side cache enabled for neither reads nor writes.

Source code can be downloaded from github.

The test tree is modelled in the following way:

DOCUMENT <-[0.*]-> CHAPTER <-[0.*]-> PARAGRAPH <-[0.*]-> LINE <-[0.*]-> WORD <-[0.*]-> LETTER

Every node is of class Node, where DOCUMENT, CHAPTER, PARAGRAPH, LINE, WORD and LETTER represent the NodeType property of the node. A tree is fully grown if it contains nodes of all NodeTypes, ie. each subtree extends from the DOCUMENT up to the LETTER. Letter is a terminating node, ie. it does not have any children.

Each database have been subjected to saving and loading of 3 gradually more complex fully-grown tree configurations differing in width, from 1 up to 3. This test defines a width as number of children at each non-terminating node of a tree. 

The following table presents metrics gathered from an example test run.

I will need to investigate a bit further, but my guess is that this significant difference can be, at least partially, attributed to the massive chattiness of the ORM technology (Hibernate) when compared to the slick interface of OrientDB's native object API, which rather than inserting each object individually, serialises and transfers the entire graph in one go. I am curious whatever generating identifiers manually and pushing hibernate to perform batch inserts would make a difference. I'm also intending to test relational database based graph persistence against Oracle, to see whatever this makes any difference.

Would be very keen your feedback. 

PrintView Printer Friendly Version

EmailEmail Article to Friend

References (2)

References allow you to track sources for this article, as well as articles that were written in response to this article.
  • Response
    Response: longchamp taschen
    For this article. I think the authors write very well. Content lively and interesting. Details are as follows:longchamp taschen
  • Response
    Response: Interview Proces
    All Things Integrated - Journal - Storing objects graphs - native vs. relational storage

Reader Comments (1)

Interesting article. I don't find it odd that some of these types of operations are performed better with the NOSQL databases than a traditional relational db, but I'm having trouble understanding the configurations of the backends to make out how much comes from setup vs actual performance difference. Specifically, I'm curious as to what the backend is configured for OrientDB...Not a local In-Memory store, right? Probably most critically, OrientDB does not do synchronous writes by default (see the bottom of https://github.com/nuvolabase/orientdb/wiki/Transactions), which is similar to MongoDB. I believe MySQL does (though perhaps it is dependent on the backing store for it as well).

I would think that the simple fact that the data being stored into mysql has to be flattened before being stored and similarly reconstructed to be read, would generate overhead that doesn't necessarily have to exist in a nosql setup. I don't really know anything about OrientDB in particular, but the sacrificing some reliability to gain some performance is often one of the sacrifices that NoSQL db engines make (and sometimes that is configurable to some extent).

When you take the overhead already being applied to the relational database for dealing with the hierarchical data and combine that with asynchronous writes the nosql db is using, OrientDB should be faster!

Jan 13, 2013 at 7:41 PM | Unregistered CommenterDustin

PostPost a New Comment

Enter your information below to add a new comment.

My response is on my own website »
Author Email (optional):
Author URL (optional):
Post:
 
Some HTML allowed: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <code> <em> <i> <strike> <strong>