Tuesday, January 8, 2019

In Search of an "XML Database" (eXist-db vs BaseX vs Sedna vs MarkLogic)

Is XML completely dead? The answer is no.

If you look around, you will find that:
1) Most RSS feeds are available in XML.
2) The Wikipedia dump is available in XML (I am experimenting with this in my free time).
3) The SOAP API request/response format is XML.
etc. (You can add many more cases; the three above are from my own experience.)

Things get harder when you have to process a large XML file (in my case it is 600+ MB, and in the complete project the XML size will be 15+ GB).

If you have to process data this large, in most cases you will be advised to write code that parses the XML, transforms it, and loads it into an SQL/NoSQL database.
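
As a rough illustration of that "parse and load" route, here is a minimal Python sketch that streams records out of an XML source and inserts them into SQLite. The element names (`page`, `title`, `text`), the table layout, and the `load_records` helper are all assumptions made up for this example, not the layout of my actual file.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET


def load_records(xml_source, conn, record_tag="page"):
    """Stream <record_tag> elements from xml_source into a SQLite table."""
    conn.execute("CREATE TABLE IF NOT EXISTS records (title TEXT, body TEXT)")
    count = 0
    # iterparse hands over each element when its end tag is seen, so the
    # whole file does not have to be parsed into one big tree first.
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == record_tag:
            conn.execute(
                "INSERT INTO records (title, body) VALUES (?, ?)",
                (elem.findtext("title"), elem.findtext("text")),
            )
            count += 1
            elem.clear()  # drop the children of the record just processed
    conn.commit()
    return count


# Tiny in-memory stand-in for a large dump file (hypothetical layout).
sample = io.BytesIO(
    b"<dump>"
    b"<page><title>A</title><text>first</text></page>"
    b"<page><title>B</title><text>second</text></page>"
    b"</dump>"
)
conn = sqlite3.connect(":memory:")
loaded = load_records(sample, conn)
rows = conn.execute("SELECT title, body FROM records ORDER BY title").fetchall()
print(loaded, rows)
```

For a real file you would pass a file path or an open file object instead of the `BytesIO` sample. The catch is that you have to design the table schema yourself, which is exactly the work an XML database is supposed to save.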

But while searching, I found that there are already some databases that store and process XML data and work as XML databases (if I am using the term correctly).

The list of XML databases is not very long (you can find it here: https://en.wikipedia.org/wiki/XML_database). I experimented with the following databases, all of which I was trying for the first time:

1) MarkLogic
2) BaseX
3) Sedna
4) eXist-db
5) Berkeley DB XML

The requirement is to load a 600 MB file and be able to access and search that data. My experience was as follows:

1) MarkLogic
i) The UI is good compared with the other XML databases.
ii) Uploading the XML was challenging; I had to go through different options, and it worked only after I stopped indexing.
iii) I was not able to run queries on the uploaded XML.

2) BaseX
i) UI-based access
ii) Not able to upload the XML (Java heap problem; I increased Xms and Xmx, but with no success)

3) Sedna
i) No UI
ii) A simple XML upload failed

4) eXist-db (this worked for me; all technical details are in the next blog post)
i) UI-based access
ii) Java-based client GUI
iii) CLI-based access
iv) XML upload (needed to change the values of Xmx and Xms accordingly)
a) The UI-based option failed.
b) The CLI-based option worked.
c) To make the upload succeed, I needed to disable the full-text index.
v) REST API for running queries (I was not able to experiment with this option in the other databases)
vi) Can index any XML field and search on it (I experimented with this and it works perfectly).
vii) Full-text indexing is supported and is based on Lucene (experiment pending).
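
To give an idea of what running a query over the REST API looks like, here is a small Python sketch that only builds the request URL. It assumes a default local install reachable at `http://localhost:8080/exist/rest` and uses the `_query`, `_start` and `_howmany` request parameters of the eXist-db REST interface. The collection `/db/wiki`, the XQuery and the `build_query_url` helper are made up for illustration.

```python
from urllib.parse import parse_qs, quote, urlencode, urlsplit


def build_query_url(base_url, collection, xquery, start=1, how_many=10):
    """Build a GET URL that runs an XQuery against a collection."""
    params = urlencode({"_query": xquery, "_start": start, "_howmany": how_many})
    return f"{base_url.rstrip('/')}{quote(collection)}?{params}"


# Assumed server location, collection and query (all hypothetical).
xquery = "//page[title = 'XML']/title"
url = build_query_url("http://localhost:8080/exist/rest", "/db/wiki", xquery)

# Round-trip check: the query string decodes back to the original XQuery.
decoded = parse_qs(urlsplit(url).query)
print(url)
print(decoded["_query"][0])
```

With a running server, the URL can then be fetched with `urllib.request.urlopen(url)` or pasted into a browser.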

5) Berkeley DB XML
Compiled and installed. But after reading its license terms, I did not experiment with it.

Conclusion
After some tweaking, eXist-db works like a charm for me.
MarkLogic is promising, but even after spending the same amount of time on it, I failed miserably.

Working on
So for now I am experimenting with eXist-db:
1) The Python library for eXist-db is quite outdated and did not work for me, so I wrote a wrapper for performing searches and fetching data.
2) Experimentation with Lucene-powered search is pending.
3) The 15+ GB upload and experimentation is pending.
4) Experimentation with the clustering option is pending.
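
This is not my actual wrapper, but the core of such a wrapper can be sketched in a few lines of Python. As far as I understand, eXist-db wraps REST query results in an `exist:result` element carrying an `exist:hits` attribute, so the sketch below parses that wrapper. The sample response and the `parse_result` helper are assumptions for illustration; check them against what your own server returns.

```python
import xml.etree.ElementTree as ET

# Namespace of the result wrapper (assumed; verify against a real response).
EXIST_NS = "http://exist.sourceforge.net/NS/exist"


def parse_result(xml_text):
    """Return (total hits, list of serialized result items)."""
    root = ET.fromstring(xml_text)
    hits = int(root.get(f"{{{EXIST_NS}}}hits", "0"))
    # Each child of the wrapper is one result item; strip trailing whitespace.
    items = [ET.tostring(child, encoding="unicode").strip() for child in root]
    return hits, items


# Hand-written sample standing in for a server response (hypothetical data).
sample = (
    '<exist:result xmlns:exist="http://exist.sourceforge.net/NS/exist" '
    'exist:hits="2" exist:start="1" exist:count="2">\n'
    "  <title>A</title>\n"
    "  <title>B</title>\n"
    "</exist:result>"
)
hits, items = parse_result(sample)
print(hits, items)
```

A full wrapper would fetch the response text over HTTP first and then hand it to a function like this one.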
