How to search HUGE XML with DOM?

Discussion in 'Python' started by Sullivan WxPyQtKinter, Mar 31, 2006.

  1. A relational database has admirable search efficiency when the database is
    very big (several thousand or tens of thousands of records). But my
    current project is based on XML, because its tree-like data structure is
    much more flexible, and on DOM, which can be manipulated just like a
    tree. However, how do I set up such an XML database for searching when it
    contains 10,000 records or more (one record usually contains 10~30
    tags)?

    My search needs:
    1. Search and return every record (an element) with a specific id.
    2. Search and return every record whose child nodes have a specific id
    or attribute.

    The xml.dom.minidom object is too slow when parsing such a big XML file
    into a DOM object, while pulldom would spend quite a long time going
    through the whole database file. How can I improve the search speed?
    Are there existing solutions or algorithms? Thank you for your
    suggestions...
     
    Sullivan WxPyQtKinter, Mar 31, 2006
    #1

  2. > the xml.dom.minidom object is too slow when parsing such a big XML file
    > to a DOM object. while pulldom should spend quite a long time going
    > through the whole database file. How to enhance the searching speed?
    > Are there existing solution or algorithm? Thank you for your
    > suggetion...


    I've told you that before, and I tell you again: RDBMS is the way to go.
    There might be XML-parsers that work faster - I suppose cElementTree can
    gain you some speed - but ultimately the problems are inherent in the
    representation as DOM: no type-information, no indices, no nothing. Just a
    huge pile of nodes in memory.

    So all searches are linear in the number of nodes. Of course you might be
    able to create indices yourself, even devise a clever scheme to make using
    them as declarative as possible. But that would in the end mean nothing but
    re-creating RDBMS technology - why do that, if it's already there?
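
    A minimal sketch of what "creating indices yourself" could look like for
    the two searches in the original question. It uses the standard library's
    xml.etree.ElementTree (the successor of cElementTree); the sample
    document, the tag name "record" and the helper build_indices are
    assumptions made for illustration, not anything taken from the poster's
    data.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample data; tag and attribute names are assumptions.
SAMPLE = """<db>
  <record id="r1"><name id="n1">alpha</name><tag kind="x"/></record>
  <record id="r2"><name id="n2">beta</name><tag kind="y"/></record>
  <record id="r3"><name id="n3">gamma</name><tag kind="x"/></record>
</db>"""


def build_indices(root):
    """Walk the tree once and return two dictionaries used as indices."""
    by_id = {}                          # record id -> record element
    by_child_attr = defaultdict(list)   # (attr name, value) -> [records]
    for record in root.iter("record"):
        by_id[record.get("id")] = record
        # Collect each (attribute, value) pair of the direct children once
        # per record, so a record is never listed twice under one key.
        pairs = {pair for child in record for pair in child.attrib.items()}
        for pair in pairs:
            by_child_attr[pair].append(record)
    return by_id, by_child_attr


root = ET.fromstring(SAMPLE)
by_id, by_child_attr = build_indices(root)

# Search 1: the record with a specific id (one dictionary lookup).
record = by_id["r2"]

# Search 2: records having a child with a specific attribute value.
matches = [r.get("id") for r in by_child_attr[("kind", "x")]]
```

    Building the dictionaries is still one linear pass, and they have to be
    kept in sync by hand whenever the tree changes, which is the maintenance
    burden the post is warning about.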

    Maybe there are frameworks out there that support you in this, but the very
    nature of XML surely makes that a more tedious task than just defining a
    simple SQL schema. If I had to search for XML tools that go beyond
    DOM, I'd start with Uche Ogbuji's 4Suite and work my way down from
    there - maybe Amara is what you need?

    Now having said that: I'm not a SQL-bigot. Just use the right tool for the
    job.

    Regards,

    Diez
     
    Diez B. Roggisch, Mar 31, 2006
    #2

  3. Sullivan WxPyQtKinter wrote:
    > a relation database has admiring search efficiency when the database is
    > very big (several thousands or tens of thousands of records). But my
    > current project is based on XML, for its tree-like data structure has
    > much more flexibility; and DOM, which could be manipulated just like a
    > tree. However, how to establish such a XML data base for search when it
    > contains 10,000 records (One record usually contain 10~30 tags) or
    > more?
    >
    > My search needs:
    > 1. Search and return all the record (an element) with specific id.
    > 2. Search and return all the record whose child nodes has a specific id
    > or attribute.
    >
    > the xml.dom.minidom object is too slow when parsing such a big XML file
    > to a DOM object. while pulldom should spend quite a long time going
    > through the whole database file. How to enhance the searching speed?
    > Are there existing solution or algorithm? Thank you for your
    > suggetion...


    - have a look at cElementTree ?
    - store your XML as persistent objects in a ZODB instance, then use ZODB
    catalog for queries ?
    - index relevant data in a DB (RDBMS, Berkeley, whatever...) ?
    - have a look at 4suite (http://4suite.org/index.xhtml) ?
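
    For the first suggestion, a rough sketch of both searches with
    ElementTree. cElementTree was later folded into the standard library's
    xml.etree.ElementTree, which is what the example imports. The sample
    document and the two helper functions are hypothetical, and both
    searches are plain linear scans over the records.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; tag and attribute names are assumptions.
SAMPLE = """<db>
  <record id="r1"><name>alpha</name><tag kind="x"/></record>
  <record id="r2"><name>beta</name><tag kind="y"/></record>
  <record id="r3"><name>gamma</name><tag kind="x"/></record>
</db>"""

root = ET.fromstring(SAMPLE)


def records_with_id(root, record_id):
    """Records whose own id attribute equals record_id.

    Uses ElementTree's limited XPath support; record_id must not contain
    a single quote, because it is pasted into the path expression.
    """
    return root.findall(f"./record[@id='{record_id}']")


def records_with_child_attr(root, name, value):
    """Records with at least one direct child where child[name] == value."""
    return [
        record
        for record in root.iter("record")
        if any(child.get(name) == value for child in record)
    ]


by_id = [r.get("id") for r in records_with_id(root, "r2")]
by_child = [r.get("id") for r in records_with_child_attr(root, "kind", "x")]
```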

    My 2 cents...
    --
    bruno desthuilliers
    python -c "print '@'.join(['.'.join([w[::-1] for w in p.split('.')]) for
    p in ''.split('@')])"
     
    bruno at modulix, Mar 31, 2006
    #3
  4. Paul Boddie (Guest)

    Diez B. Roggisch wrote:
    > > the xml.dom.minidom object is too slow when parsing such a big XML file
    > > to a DOM object. while pulldom should spend quite a long time going
    > > through the whole database file. How to enhance the searching speed?
    > > Are there existing solution or algorithm? Thank you for your
    > > suggetion...

    >
    > I've told you that before, and I tell you again: RDBMS is the way to go.


    We've lost some context from the original post that may be relevant
    here, but if populating what the original questioner calls "the
    database" is an infrequent operation, then an RDBMS probably is the way
    to go, in general. On the other hand, if a lot of parsing has to happen
    in order to perform a search, such parsing would probably incur a lot
    of overhead from SQL inserts that wouldn't be particularly desirable.

    > There might be XML-parsers that work faster - I suppose cElementTree can
    > gain you some speed - but ultimately the problems are inherent in the
    > representation as DOM: no type-information, no indices, no nothing. Just a
    > huge pile of nodes in memory.


    Well, I would hope that W3C DOM operations like getElementById would be
    supported by some index in the implementation: that would make some of
    the searches mentioned by the questioner fairly rapid, given enough
    memory.
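
    As a hedged illustration with xml.dom.minidom: as far as I can tell,
    getElementById only finds elements whose attribute is ID-typed, either
    through a DTD or by calling setIdAttribute on each element, and it
    caches the ids it has already seen. The sample document below is an
    assumption made for the sketch.

```python
from xml.dom import minidom

# Hypothetical sample data; tag and attribute names are assumptions.
SAMPLE = (
    "<db>"
    "<record id='r1'><name>alpha</name></record>"
    "<record id='r2'><name>beta</name></record>"
    "</db>"
)

doc = minidom.parseString(SAMPLE)

# Without a DTD, minidom does not know that "id" is an ID attribute,
# so mark it explicitly on every record element.
for record in doc.getElementsByTagName("record"):
    record.setIdAttribute("id")

hit = doc.getElementById("r2")
name = hit.getElementsByTagName("name")[0].firstChild.data
missing = doc.getElementById("no-such-id")   # None when nothing matches
```

    Marking the attributes is itself a pass over all records, so this helps
    repeated lookups rather than the initial parsing cost.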

    > So all searches are linear in the number of nodes. Of course you might be
    > able to create indices yourself, even devise a clever scheme to make using
    > them as declarative as possible. But that would in the end mean nothing but
    > re-creating RDBMS technology - why do that, if it's already there?


    I agree that careful usage of RDBMS technology would solve the general
    problems of searching large amounts of data, but the stated queries
    should involve indexes and be fairly quick.

    Paul
     
    Paul Boddie, Mar 31, 2006
    #4
  5. bayerj (Guest)

    Mind that XML documents are not more flexible than an RDBMS.

    You can represent any XML document in an RDBMS. You cannot represent any
    RDBMS in an XML document. RDBMSs are (strictly speaking) relations and XML
    documents are trees. Relations are superior to trees, at least
    mathematically speaking.

    Once you have set up your system in a practicable way (e.g. not needing
    to create a new table via SQL queries for each new type of node, which
    would be a pain), SQL is far superior to XML.

    Anyway, cElementTree seems to be the best way to go for you now. Its
    performance is unsurpassed by any other Python XML library, as far as I
    know.
     
    bayerj, Mar 31, 2006
    #5
  6. On 31-Mar-06, at 11:17 AM, bayerj wrote:

    > Mind, that XML documents are not more flexible than RDBMS.
    >
    > You can represent any XML document in a RDBMS. You cannot represent
    > any
    > RDBMS in an XML document. RDBMS are (strictly spoken) relations and
    > XML
    > documents are trees. Relations are superior to trees, at least
    > mathematically speaking.
    >
    > Once you have set up your system in a practicable way (e.G. not
    > needing
    > to create a new table via SQL Queries for a new type of node, which
    > would be a pain) SQL is far superior to XML.
    >
    > Anyway, cElementTree seems to be the best way to go for you now. Its
    > performance is untopped by any other python xml library, as far as I
    > know.


    If I may hijack this thread for a bit, I'd like to dig deeper into
    this issue :)

    Currently my simulation program produces an XML log file with events
    represented as nodes. Often those files grow to multiple GB in size. I
    like this setup because the format is open and easily parseable with a
    variety of tools. So I have a bunch of scripts that can analyze
    different aspects of the simulation.

    I don't have much of a clue about databases, except that they exist, are
    somewhat complex, and often use proprietary formats for efficiency. So
    any pointers on whether an RDBMS-based setup would be better would be
    greatly appreciated.

    Even trivial aspects, such as whether to write to the RDBMS during the
    simulation or to convert the complete XML log file into one, are not
    entirely clear to me. I gather that an RDBMS would be much better suited
    for analysis, but what about portability? Is a database file a
    separate entity that may be passed around?

    Apologies if this seems like a selfish question, perhaps consider it
    a full disclosure, different set-ups/examples would be appreciated as
    well.

    --
    Cheers, Ivan
     
    Ivan Vinogradov, Mar 31, 2006
    #6
  7. On Fri, 31 Mar 2006 12:00:25 -0500, Ivan Vinogradov
    wrote in comp.lang.python:

    > Even trivial aspects, such as whether to produce RDBM during the
    > simulation, or convert the complete XML log file into one, are not
    > entirely clear to me. I gather that RDBM would be much better suited
    > for analysis, but what about portability ? Is database file a
    > separate entity that may be passed around?
    >

    I'm going to assume you don't want to modify all the code that is
    /generating/ the log data to do database access...

    Question: are you continuously appending to a single log, or do you
    start a new log file while taking the old one for processing? [The first
    option would work better with direct database access, as you'd otherwise
    need some quick way to skip over already processed records to avoid
    duplicate entries in the database.]

    As for the last question above... It depends on the RDBMS... SQLite
    and JET (most call it "Access") are file-server (not quite the right
    term but...) databases. They use single files (with whatever limitation
    the file system has for file length -- JET has a 2GB limit I believe);
    said files can be copied -- but you still need the programs that manage
    them to access the data. MySQL's MyISAM tables and Visual FoxPro use a
    file triple per "table" (MyISAM: definition file, data file, index file;
    VFP: fixed width data file, variable width text/memo file, index file).
    MySQL's InnoDB, SQL Server (I think), Firebird, MaxDB (aka SAP-DB) use
    multiple files but the data tends to be scattered across the files.

    Many have a "backup"/"restore" utility that works by dumping a text
    file that contains SQL statements to recreate the table, followed by SQL
    insert statements for the data.
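
    With SQLite, such a text dump can be sketched from Python through the
    standard sqlite3 module, whose Connection.iterdump() yields the SQL
    statements line by line. The table "event" and its columns are
    hypothetical.

```python
import sqlite3

# Build a small in-memory database; table and columns are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE event (id INTEGER PRIMARY KEY, kind TEXT)")
conn.executemany(
    "INSERT INTO event (id, kind) VALUES (?, ?)",
    [(1, "start"), (2, "stop")],
)
conn.commit()

# "Backup": a plain-text script of CREATE TABLE and INSERT statements.
dump_sql = "\n".join(conn.iterdump())

# "Restore": replay the script into a fresh database.
restored = sqlite3.connect(":memory:")
restored.executescript(dump_sql)
rows = restored.execute("SELECT id, kind FROM event ORDER BY id").fetchall()
```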

    Transport is where you may want to use XML... Extract from the
    database to generate the XML, copy XML to destination, process XML back
    into database for analyses.
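
    The "process XML back into database" step might look roughly like this
    for a large log: stream the file with ElementTree's iterparse so the
    whole document never sits in memory, and insert each event into SQLite.
    The element name "event", its attributes and the function load_events
    are assumptions for illustration; a real log would be opened from disk
    instead of the in-memory sample used here.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical log excerpt; element and attribute names are assumptions.
LOG = (
    b"<log>"
    b"<event id='1' kind='start' t='0.0'/>"
    b"<event id='2' kind='stop' t='1.5'/>"
    b"</log>"
)


def load_events(xml_stream, conn):
    """Stream <event> elements from xml_stream into an indexed table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS event "
        "(id TEXT PRIMARY KEY, kind TEXT, t REAL)"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS event_kind ON event (kind)")
    with conn:  # one transaction for the whole load
        for _, elem in ET.iterparse(xml_stream, events=("end",)):
            if elem.tag == "event":
                conn.execute(
                    "INSERT INTO event (id, kind, t) VALUES (?, ?, ?)",
                    (elem.get("id"), elem.get("kind"), float(elem.get("t"))),
                )
                # Drop the element's attributes and children once stored;
                # the emptied element itself stays attached to the root.
                elem.clear()
    return conn


conn = load_events(io.BytesIO(LOG), sqlite3.connect(":memory:"))
stops = conn.execute(
    "SELECT id, t FROM event WHERE kind = ?", ("stop",)
).fetchall()
```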

    {Personally, my understanding of XML has always led me to believe
    that it is not meant to be a "working" storage format... It is great as
    an intermediate for transforming data from one format to another -- XSLT
    stuff; or for transporting between systems... But any real manipulation
    needs conversion to a more "native" format}
    --
    > ============================================================== <
    > | Wulfraed Dennis Lee Bieber KD6MOG <
    > | Bestiaria Support Staff <
    > ============================================================== <
    > Home Page: <http://www.dm.net/~wulfraed/> <
    > Overflow Page: <http://wlfraed.home.netcom.com/> <
     
    Dennis Lee Bieber, Mar 31, 2006
    #7
  8. Perhaps what you have said is correct. But from my point of view, XML
    is more direct for programmers and readers.

    bayerj wrote:

    > Mind, that XML documents are not more flexible than RDBMS.
    >
    > You can represent any XML document in a RDBMS. You cannot represent any
    > RDBMS in an XML document. RDBMS are (strictly spoken) relations and XML
    > documents are trees. Relations are superior to trees, at least
    > mathematically speaking.
    >
    > Once you have set up your system in a practicable way (e.G. not needing
    > to create a new table via SQL Queries for a new type of node, which
    > would be a pain) SQL is far superior to XML.
    >
    > Anyway, cElementTree seems to be the best way to go for you now. Its
    > performance is untopped by any other python xml library, as far as I
    > know.
     
    Sullivan WxPyQtKinter, Apr 1, 2006
    #8
  9. Magnus Lycka (Guest)

    Ivan Vinogradov wrote:
    > I have not much clue about databases, except that they exist, somewhat
    > complex, and often use proprietary formats for efficiency.


    Proprietary storage format, but a standardized API...

    > So any points on whether RDBM-based setup
    > would be better would be greatly appreciated.


    The typical use case for RDBMS is that you have a number
    of record types (classes/relations/tables) with a regular
    structure, and all data fits into these structures. When
    you want to search for something, you know exactly in what
    field of what table to look (but not which item of course).
    You also typically have multiple users who need to be able
    to update the same database simultaneously without getting
    in each other's way.

    > Even trivial aspects, such as whether to produce RDBM during the
    > simulation, or convert the complete XML log file into one, are not
    > entirely clear to me.


    Most databases are suited to writing data in fairly small chunks,
    although it's typically much faster to write 100 items in a
    transaction, than to write 100 transactions with one item each.
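
    The two write patterns can be sketched with the standard sqlite3 module.
    The table "item" and the helper names are hypothetical, and the sketch
    only shows the shape of the code; it makes no timing claim for any
    particular database.

```python
import sqlite3


def make_db():
    """Fresh in-memory database with a hypothetical 'item' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE item (n INTEGER, label TEXT)")
    return conn


def insert_one_by_one(conn, rows):
    """100 transactions with one item each: commit after every row."""
    for row in rows:
        conn.execute("INSERT INTO item (n, label) VALUES (?, ?)", row)
        conn.commit()


def insert_batched(conn, rows):
    """One transaction with 100 items: a single commit at the end."""
    with conn:  # commits on success, rolls back on an exception
        conn.executemany("INSERT INTO item (n, label) VALUES (?, ?)", rows)


rows = [(i, f"item-{i}") for i in range(100)]
slow, fast = make_db(), make_db()
insert_one_by_one(slow, rows)
insert_batched(fast, rows)

count_slow = slow.execute("SELECT COUNT(*) FROM item").fetchone()[0]
count_fast = fast.execute("SELECT COUNT(*) FROM item").fetchone()[0]
```

    Both functions store the same rows; the difference is only in how many
    commits the database has to perform.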

    > I gather that RDBM would be much better suited for
    > analysis, but what about portability ? Is database file a separate
    > entity that may be passed around?


    Who says that a database needs to reside in a file? Most
    databases reside on disk, but it might well be in raw
    partitions.

    In general, you should see the database as a persistent
    representation of data in a system. It's not a transport
    mechanism.
     
    Magnus Lycka, Apr 7, 2006
    #9
  10. Sullivan WxPyQtKinter wrote:
    > My search needs:
    > 1. Search and return all the record (an element) with specific id.
    > 2. Search and return all the record whose child nodes has a specific id
    > or attribute.


    Try lxml, which is based on the libxml2 library. The current SVN version has
    support for xml:id through the XMLDTDID function. It simply returns an XML
    tree and an ID dictionary.
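
    lxml is a third-party package, so the sketch below uses a stand-in from
    the standard library instead of the lxml function named above:
    xml.etree.ElementTree.XMLID, which likewise returns a tree and an ID
    dictionary, but keys on the plain "id" attribute rather than on
    DTD-declared IDs. The sample document and the parent map are
    assumptions added for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; tag and attribute names are assumptions.
SAMPLE = (
    "<db>"
    "<record id='r1'><name id='n1'>alpha</name></record>"
    "<record id='r2'><name id='n2'>beta</name></record>"
    "</db>"
)

# XMLID parses the text and returns (root element, {id value: element}).
root, ids = ET.XMLID(SAMPLE)

# Search 1: a record with a specific id is a dictionary lookup.
record = ids["r2"]

# Search 2: for a child with a specific id, step up to its record.
# ElementTree elements have no parent pointer, so build a parent map once.
parent_of = {child: parent for parent in root.iter() for child in parent}
owner = parent_of[ids["n1"]]
```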

    http://codespeak.net/lxml

    Stefan
     
    Stefan Behnel, Apr 25, 2006
    #10
