MongoDB storage along with MySQL XPath features
I have: a large dataset of complex documents with a tree-like data structure (each document has its own data tree, which may differ from document to document). The backend is implemented using Django 1.3 and MySQL.
I need to:
- store these documents in scalable, fast storage
- filter documents with predefined queries
- search against data within a limited subset of a document's fields
- extra feature: search documents using any possible query and extract any possible information from the data tree. This extra feature is a customer's requirement and should be comfortable for non-technical users to use. At the moment we have an agreement that XPath is enough.
Note: it is not required to change documents frequently; 90% of the time the operations are reads.
Note: I need access to any possible field from the data tree. The data needed 90% of the time is about 10% of the whole data tree; the rest is only needed for the extra feature described above, which in practice isn't a popular feature of the system.
Note: the data tree comes with each document as a representation of a custom format and is not changed. I can pick the needed pieces out of the data tree and convert them to a readable form (as well as write them back using the custom format).
Currently, I use:
- MySQL to store the data tree of each document as XML
- some pre-picked data from the XML in columns of the same table, to speed up search
- all other needed fields are extracted from the XML on the fly during each query, using MySQL's ExtractValue() function (more than 10 ExtractValue() calls in each SQL query)
- all searches and filterings are performed using XPath queries and ExtractValue() against the stored XML
The problem is that this temporary solution is poor from a performance point of view: on a 100k+ record dataset I have to wait minutes to execute a single query with ~10 ExtractValue() calls in it.
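For illustration, a minimal sketch of what one of these queries might look like through Django's raw SQL cursor; the documents table, the xml_data column and the XPath expressions are hypothetical, not the real schema:

from django.db import connection

def search_by_xpath(city, status):
    # Every ExtractValue() call re-parses the XML column for every candidate row,
    # which is why a query with ~10 such calls gets slow on 100k+ records.
    cursor = connection.cursor()
    cursor.execute("""
        SELECT id,
               ExtractValue(xml_data, '/doc/title')  AS title,
               ExtractValue(xml_data, '/doc/author') AS author
        FROM documents
        WHERE ExtractValue(xml_data, '/doc/address/city') = %s
          AND ExtractValue(xml_data, '/doc/status') = %s
    """, [city, status])
    return cursor.fetchall()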
The solutions I see at the moment:
1) Continue using the approach with pre-picked fields in columns. These fields are extracted once, when the document comes into the system (a rough model sketch follows the cons list below).
Pros:
- use of proven technology (MySQL)
- most searches are performed against the pre-picked fields, without using the slow ExtractValue() function
- I have tested XPath query searches with this approach on a 100k+ record dataset, and it seems the performance of one ExtractValue() call isn't that slow (<1s per query) in comparison with 10+ simultaneous ExtractValue() calls in the current temporary approach
Cons:
- because each document has its own data tree, and therefore a different set of pre-picked fields, I have to create a bunch of tables to store these sets of data, and join these tables depending on the document type
- pre-picked fields can be lists, not single values, so each list needs a separate table
- synchronization is required between the pre-picked fields and the XML
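A minimal sketch of how solution 1) could be modelled in Django; the document type ("invoice") and all field names are hypothetical, chosen only to show why each document type and each list-valued pre-picked field ends up in its own table:

from django.db import models

class Document(models.Model):
    # Full data tree kept as XML; only the XPath "extra feature" reads it.
    xml_data = models.TextField()
    doc_type = models.CharField(max_length=50)

class InvoiceFields(models.Model):
    # Scalar pre-picked fields for one document type, extracted once on import
    # and kept in sync with the XML.
    document = models.OneToOneField(Document, on_delete=models.CASCADE)
    customer = models.CharField(max_length=200, db_index=True)
    total = models.DecimalField(max_digits=12, decimal_places=2)

class InvoiceItem(models.Model):
    # A list-valued pre-picked field: one row per list element, hence a separate table.
    invoice = models.ForeignKey(InvoiceFields, on_delete=models.CASCADE)
    sku = models.CharField(max_length=50, db_index=True)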
2) Use MySQL with XPath for custom searches (i.e. the extra feature), and use MongoDB for all other searches. MongoDB stores the needed pre-picked fields, MySQL stores the XML (a rough pymongo sketch follows the cons list below).
Note: I don't think it's worth it for me to store all of the system's data in MongoDB, such as user accounts, sessions etc. Keeping only the documents in MongoDB is enough.
Pros:
- 90% of the needed queries should be fast
- I can store nested data against each document, and data trees can differ from document to document; no need to join a lot of tables
- MongoDB seems to have convenient tools for use with Python
Cons:
- unproven technology (at least for me). I have no experience with MongoDB, but I have consulted programmers using it, and it looks promising
- MongoDB has no XPath-like features (and it seems it will not have them in the near future), so I would have to continue using MySQL with XPath (as in solution 1). As a developer, I want to avoid inventing a new custom query language, so XPath looks like a compromise
- thus, synchronization between MySQL and MongoDB is required
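A minimal pymongo sketch of the split in solution 2); the database and collection names and the 'address.city' field are hypothetical. The pre-picked (possibly nested) fields live in MongoDB under the same id as the XML row in MySQL:

from pymongo import MongoClient

mongo_docs = MongoClient()['myapp']['documents']

def save_picked_fields(document_id, picked):
    # `picked` is the nested dict of pre-picked fields extracted from the XML
    # when the document enters the system; the XML itself stays in MySQL,
    # linked by the shared document_id.
    mongo_docs.replace_one({'_id': document_id}, picked, upsert=True)

def find_by_city(city):
    # The 90% case: query the pre-picked fields directly, no ExtractValue().
    return list(mongo_docs.find({'address.city': city}))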
My questions:
- could there be hidden or unobvious problems with solution 2)?
- I am still worried about the performance of ExtractValue(). From the customer's point of view it would be acceptable to replace the XPath approach with a similar one, but I have no idea what tools are available for MySQL or MongoDB. Are there any similar solutions to XPath?
I will be grateful for any feedback, thoughts and comments on what is written above.
So if I understand the question right, you want to:
- find a given node in the tree, given a path through a portion of the tree to that node plus query expressions
- then get that node and everything below it
With a materialized paths approach you can do the above. The main thing that needs tweaking: if there is a path "a..b..c..d..e" in a document and you want to find documents with the path "..b..c..d.." in them, how do you make that fast? If you are starting from the top it is easy, but here we aren't. It may make sense to use a combination approach where one has the materialized path in the document node plus an array of the node's ancestors, like:
{ path : ",a,b,c,d,e,", ancestor : ['a','b','c','d','e'] }
We index on ancestor to create a multikey index. We then query like the following to find nodes on the path "...b,c,d..." efficiently:
find( { path : /,b,c,d,/, ancestor : 'd', <more_query_expressions_optionally> } )
In the above, the index on ancestor is used and only the docs containing 'd' need to be inspected. The following could also be tried; it might be better depending on how smart the query optimizer is:
find( { path : /,b,c,d,/, ancestor : { $all : ['a','d'] }, ... } )
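A minimal pymongo version of the same pattern, assuming a hypothetical nodes collection; the multikey index on the ancestor array does the narrowing, and the regex on path then checks the exact subpath:

import re
from pymongo import MongoClient

nodes = MongoClient()['myapp']['nodes']

# Indexing an array field gives a multikey index: one index entry per element.
nodes.create_index('ancestor')

nodes.insert_one({'path': ',a,b,c,d,e,', 'ancestor': ['a', 'b', 'c', 'd', 'e']})

subpath = re.compile(',b,c,d,')

# The ancestor index limits the scan to docs containing 'd';
# the regex then verifies the exact ",b,c,d," subpath.
print(list(nodes.find({'path': subpath, 'ancestor': 'd'})))

# Alternative with $all, as in the second query above; whether it is faster
# depends on how the query optimizer uses the index.
print(list(nodes.find({'path': subpath, 'ancestor': {'$all': ['a', 'd']}})))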