Splitting an XML file on the basis of XML tags

Discussion in 'Python' started by bijeshn@gmail.com, Apr 2, 2008.

  1. Guest

    Hi all,

    i have an XML file with the following structure::

    <r1>
      <r2> ------|
        <r3>     |
          <r4>   |
            ..   |
            ..   | --------------------> constitutes one record.
            ..   |
          </r4>  |
        </r3>    |
      </r2> -----|
      <r2>
        ..   ----|
        ..       |
        ..       | --------------------> there are n records in between....
        ..       |
        ..   ----|
      </r2>
      <r2> ------|
        <r3>     |
          <r4>   |
            ..   |
            ..   | --------------------> constitutes one record.
            ..   |
          </r4>  |
        </r3>    |
      </r2> -----|
    </r1>


    Here <r1> is the root tag of the XML, and each <r2>...</r2>
    constitutes one record. What I would like to do is
    extract everything (XML tags and data) between the nth <r2> tag and
    the (n+k)th <r2> tag. The extracted data is to be
    written to a separate file.

    Thanks...
    , Apr 2, 2008
    #1

  2. Chris Guest



    You could create a generator expression out of it:

    txt = """<r1>
    <r2><r3><r4>1</r4></r3></r2>
    <r2><r3><r4>2</r4></r3></r2>
    <r2><r3><r4>3</r4></r3></r2>
    <r2><r3><r4>4</r4></r3></r2>
    <r2><r3><r4>5</r4></r3></r2>
    </r1>
    """
    l = len(txt.split('r2>'))-1
    a = ('<r2>%sr2>'%i for j,i in enumerate(txt.split('r2>')) if 0 < j < l
    and i.replace('>','').replace('<','').strip())

    Now you have a generator you can iterate through with a.next()
    (spelled next(a) from Python 2.6 onwards, and the only form in Python 3)
    or alternatively you could just create a list out of it by replacing the
    outer parens with square brackets.
    Chris, Apr 2, 2008
    #2

  3. bijeshn Guest



    Hmmm... will look into it.. Thanks

    the XML file is almost a TB in size...

    So SAX will have to be the parser. I'm thinking of doing something
    to split the file using SAX. Any suggestions along those lines? If any
    other parsers are suitable, please suggest them.
    bijeshn, Apr 3, 2008
    #3
  4. Steve Holden Guest

    bijeshn wrote:
    >
    > Hmmm... will look into it.. Thanks
    >
    > the XML file is almost a TB in size...
    >

    Good grief. When will people stop abusing XML this way?

    > so SAX will have to be the parser.... i'm thinking of doing something
    > to split the file using SAX
    > ... Any suggestions on those lines..? If there are any other parsers
    > suitable, please suggest...


    You could try pulldom, but the documentation is disgraceful.

    ElementTree.iterparse *might* help.
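
A minimal sketch of the pulldom route, assuming the <r1>/<r2> layout from the original post. The helper name `iter_records` and the tiny in-memory sample are made up for illustration. It pulls events one at a time and expands only the current <r2> subtree into a small DOM node:

```python
import io
from xml.dom import pulldom


def iter_records(source, tag="r2"):
    """Yield each <tag> element as a small DOM node, one at a time."""
    events = pulldom.parse(source)  # accepts a filename or a binary file object
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == tag:
            events.expandNode(node)  # read just this record's subtree
            yield node


# Hypothetical stand-in for the real (huge) file.
sample = io.BytesIO(
    b"<r1><r2><r3><r4>1</r4></r3></r2><r2><r3><r4>2</r4></r3></r2></r1>"
)
records = [node.toxml() for node in iter_records(sample)]
```

Each yielded node can be serialized with `toxml()` and then dropped, so only one record at a time should need to be held in memory.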

    regards
    Steve

    --
    Steve Holden +1 571 484 6266 +1 800 494 3119
    Holden Web LLC http://www.holdenweb.com/
    Steve Holden, Apr 3, 2008
    #4
  5. Steve Holden wrote:

    >> the XML file is almost a TB in size...
    >>

    > Good grief. When will people stop abusing XML this way?


    Not before somebody writes a clever xmlfs for the linux kernel :-/
    Marco Mariani, Apr 3, 2008
    #5
  6. Marco Mariani, Apr 3, 2008
    #6
  7. Chris Guest

    On Apr 3, 8:51 am, Steve Holden <> wrote:
    > Good grief. When will people stop abusing XML this way?

    I abuse it because I can (and because I don't generally work with XML
    files larger than 20-30 MB) :)
    And the OP never said the XML file was 1 TB in size, which makes things
    different.
    Chris, Apr 3, 2008
    #7
  8. > I abuse it because I can (and because I don't generally work with XML
    > files larger than 20-30meg) :)
    > And the OP never said the XML file for 1TB in size, which makes things
    > different.


    Even with small XML files your advice was not very sound. Yes, it's
    tempting to use regexes to process XML. But usually one falls flat on
    one's face soon, because of whitespace or attribute order or <foo></foo>
    versus <foo/> or .. or .. or.

    Use an XML parser. That's what they are for. And especially with the
    Pythonic ones like ElementTree (and the compatible lxml), it's even more
    straightforward than using regexes.
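
To illustrate the point, here is a small sketch using ElementTree from the standard library. The attribute names `id` and `kind` and the helper `summarize` are invented for the example. The two documents differ in whitespace, attribute order, and empty-element spelling, which would trip up string splitting, yet a parser treats them as the same record:

```python
import xml.etree.ElementTree as ET

# Two spellings of the same record (hypothetical attributes for illustration).
doc_a = '<r2 id="1" kind="x"><r3><r4></r4></r3></r2>'
doc_b = '<r2  kind="x"\n    id="1" ><r3><r4/></r3></r2>'


def summarize(xml_text):
    """Reduce a record to its tag, its attributes, and the text of r3/r4."""
    record = ET.fromstring(xml_text)
    r4 = record.find("r3/r4")
    return record.tag, dict(record.attrib), (r4.text or "")


same = summarize(doc_a) == summarize(doc_b)
```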


    Diez
    Diez B. Roggisch, Apr 3, 2008
    #8
  9. bijeshn Guest



    Yeah, I plan to use SAX, but the thing is: how do you do it with
    that?

    Forget 1 TB for now. How do you split an XML file of something like
    70-80 GB according to my requirement (see the original post)?
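
One possible shape for a SAX-based splitter, offered as a sketch rather than a tested solution for files of that size. The class name `RecordSplitter` and the `open_chunk` / `close_chunk` callbacks are hypothetical. It assumes records are <r2> elements directly under <r1>, ignores namespaces, and does not copy any attributes of the root element. Events inside a record are forwarded to an `XMLGenerator`, and a new output is started every `per_file` records:

```python
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator


class RecordSplitter(xml.sax.ContentHandler):
    """Copy <r2> records into successive chunks of `per_file` records each."""

    def __init__(self, open_chunk, close_chunk, per_file=100000,
                 record_tag="r2", root_tag="r1"):
        super().__init__()
        self.open_chunk = open_chunk    # callable: chunk index -> text stream
        self.close_chunk = close_chunk  # callable: finished text stream -> None
        self.per_file = per_file
        self.record_tag = record_tag
        self.root_tag = root_tag
        self.depth = 0      # nesting depth inside the current record
        self.count = 0      # records written to the current chunk
        self.chunks = 0     # chunks opened so far
        self.stream = None  # output stream of the current chunk
        self.out = None     # XMLGenerator writing to self.stream

    def startElement(self, name, attrs):
        if self.depth == 0 and name != self.record_tag:
            return  # the root element, or anything else outside a record
        if self.out is None:  # first record of a new chunk
            self.stream = self.open_chunk(self.chunks)
            self.chunks += 1
            self.out = XMLGenerator(self.stream, encoding="utf-8")
            self.out.startElement(self.root_tag, {})
        self.depth += 1
        self.out.startElement(name, attrs)

    def characters(self, content):
        if self.depth:  # drop whitespace between records
            self.out.characters(content)

    def endElement(self, name):
        if not self.depth:
            return
        self.out.endElement(name)
        self.depth -= 1
        if self.depth == 0:  # a whole record has been copied
            self.count += 1
            if self.count == self.per_file:
                self._finish_chunk()

    def endDocument(self):
        if self.out is not None:  # last chunk is only partly filled
            self._finish_chunk()

    def _finish_chunk(self):
        self.out.endElement(self.root_tag)
        self.close_chunk(self.stream)
        self.stream = self.out = None
        self.count = 0


chunks = []


def open_chunk(index):
    return io.StringIO()  # a real run might open a numbered file instead


def close_chunk(stream):
    chunks.append(stream.getvalue())
    stream.close()


data = (b"<r1>\n"
        + b"\n".join(b"<r2><r3><r4>%d</r4></r3></r2>" % i for i in (1, 2, 3))
        + b"\n</r1>")
xml.sax.parseString(data, RecordSplitter(open_chunk, close_chunk, per_file=2))
```

For a real file, `xml.sax.parse("thefile.xml", handler)` would replace the `parseString` call, with `open_chunk` returning a file opened for writing.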
    bijeshn, Apr 4, 2008
    #9
  10. wrote:
    > Here <r1> is the main root tag of the XML, and <r2>...</r2>
    > constitutes one record. What I would like to do is
    > to extract everything (xml tags and data) between nth <r2> tag and (n
    > +k)th <r2> tag. The extracted data is to be
    > written down to a separate file.


    What do you mean by "written down to a separate file"? Do you have a specific
    format in mind?

    In general, you can try this:

    >>> from xml.etree import ElementTree as ET  # cElementTree on Python 2
    >>> itercontext = ET.iterparse("thefile.xml", events=("start", "end"))
    >>> event, root = next(itercontext)  # first event: start of the root <r1>
    >>> for event, element in itercontext:
    ...     if event == "end" and element.tag == "r2":
    ...         # write record subtree as XML
    ...         print(ET.tostring(element, encoding="unicode"))
    ...         root.clear()  # one record done, clean up everything

    http://effbot.org/zone/element-iterparse.htm

    You can also do things like

    ...         print(element.findtext("r3/r4"))

    Read the ElementTree tutorial to learn how to extract your data:

    http://effbot.org/zone/element.htm#searching-for-subelements
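
For the original question (everything between the nth and the (n+k)th <r2>), the same iterparse loop can be wrapped in a generator and sliced. This is a sketch under assumptions: `iter_records` and `extract_range` are hypothetical helper names, counting is 1-based and inclusive at both ends, and the sample is a tiny in-memory stand-in for the real file.

```python
import io
from itertools import islice
import xml.etree.ElementTree as ET


def iter_records(source, tag="r2"):
    """Yield each completed <tag> element, freeing it after the caller is done."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # keep a handle on the root so it can be emptied
    for event, element in context:
        if event == "end" and element.tag == tag:
            element.tail = None  # drop whitespace that followed the record
            yield element
            root.clear()  # runs when the caller asks for the next record


def extract_range(source, n, k, tag="r2"):
    """Return records n .. n+k (1-based, inclusive) as XML strings."""
    window = islice(iter_records(source, tag), n - 1, n + k)
    return [ET.tostring(record, encoding="unicode") for record in window]


sample = io.BytesIO(
    b"<r1>\n"
    + b"\n".join(b"<r2><r3><r4>%d</r4></r3></r2>" % i for i in range(1, 6))
    + b"\n</r1>"
)
picked = extract_range(sample, n=2, k=2)
```

Because `islice` stops asking for records once the window is filled, parsing ends there and the rest of the file is never read.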

    Stefan
    Stefan Behnel, Apr 7, 2008
    #10
  11. bijeshn Guest

    >
    > What do you mean by "written down to a separate file"? Do you have a specific
    > format in mind?
    >



    sorry, it should be extracted into separate "files". i.e. if i have an
    XML file containing 10 million records, i need to split the file to
    100 files containing 100,000 records each.

    i hope this is clearer...
    bijeshn, Apr 7, 2008
    #11
  12. bijeshn Guest

    pls disregard the above post....

    On Apr 7, 3:13 pm, bijeshn <> wrote:
    > > What do you mean by "written down to a separate file"? Do you have a specific
    > > format in mind?

    >
    > sorry, it should be extracted into separate " XML files". i.e. if i have an
    > XML file containing 10 million records, i need to split the file to
    > 100 XML files containing 100,000 records each.
    >
    > i hope this is clearer...
    bijeshn, Apr 7, 2008
    #12
  13. bijeshn Guest

    The extracted files are to be XML too. I just need to extract it raw
    (tags and data just as they are in the parent XML file).
    bijeshn, Apr 7, 2008
    #13
  14. bijeshn wrote:
    > the extracted files are to be XML too. ijust need to extract it raw
    > (tags and data just like it is in the parent XML file..)


    Ah, so then replace the "print tostring()" line in my example by

    ET.ElementTree(element).write("outputfile.xml")

    and you're done.
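
Note that writing to a fixed name such as "outputfile.xml" inside the loop would overwrite the same file for every record. A sketch of one way to get N records per output file follows. The function name `split_records` and the `part-NNN.xml` naming are made up for illustration, and the root is written as a bare <r1> without any attributes or namespace declarations the original root may carry:

```python
import io
import tempfile
from pathlib import Path
import xml.etree.ElementTree as ET


def split_records(source, out_dir, per_file=100000,
                  record_tag="r2", root_tag="r1"):
    """Write `per_file` records into each part-NNN.xml; return the paths."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # keep the root so it can be emptied
    paths, out, written = [], None, 0
    for event, element in context:
        if event != "end" or element.tag != record_tag:
            continue
        if out is None:  # start a new output file
            paths.append(Path(out_dir) / ("part-%03d.xml" % len(paths)))
            out = open(paths[-1], "w", encoding="utf-8")
            out.write("<%s>" % root_tag)
        element.tail = None  # drop whitespace that followed the record
        out.write(ET.tostring(element, encoding="unicode"))
        written += 1
        root.clear()  # free the record just written
        if written == per_file:  # chunk is full: close it
            out.write("</%s>" % root_tag)
            out.close()
            out, written = None, 0
    if out is not None:  # close a final, partly filled chunk
        out.write("</%s>" % root_tag)
        out.close()
    return paths


# Hypothetical five-record stand-in for the real file.
sample = io.BytesIO(
    b"<r1>"
    + b"".join(b"<r2><r3><r4>%d</r4></r3></r2>" % i for i in range(1, 6))
    + b"</r1>"
)
with tempfile.TemporaryDirectory() as tmp:
    parts = [p.read_text(encoding="utf-8")
             for p in split_records(sample, tmp, per_file=2)]
```

With 10 million records and `per_file=100000` this would produce 100 files, each a small well-formed XML document with its own <r1> root.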

    Stefan
    Stefan Behnel, Apr 7, 2008
    #14
  15. bijeshn Guest



    thanks a lot, Stefan....
    i haven't tested out your idea yet.
    Will get back as soon as I do it...
    bijeshn, Apr 8, 2008
    #15