Splitting an XML file on the basis of XML tags

bijeshn · Apr 2, 2008

Hi all,

I have an XML file with the following structure:

<r1>
  <r2>          <-- start of one record
    <r3>
      <r4>
        ..
      </r4>
    </r3>
  </r2>         <-- end of that record
  <r2>
    ..          (there are n records in between)
  </r2>
  <r2>          <-- start of one record
    <r3>
      <r4>
        ..
      </r4>
    </r3>
  </r2>         <-- end of that record
</r1>

Here <r1> is the root tag of the XML, and each <r2>...</r2> constitutes
one record. What I would like to do is extract everything (XML tags and
data) between the nth <r2> tag and the (n+k)th <r2> tag. The extracted
data is to be written to a separate file.

Thanks...

Chris · Apr 2, 2008

You could create a generator expression out of it:

txt = """<r1>
<r2><r3><r4>1</r4></r3></r2>
<r2><r3><r4>2</r4></r3></r2>
<r2><r3><r4>3</r4></r3></r2>
<r2><r3><r4>4</r4></r3></r2>
<r2><r3><r4>5</r4></r3></r2>
</r1>
"""
l = len(txt.split('r2>'))-1
a = ('<r2>%sr2>'%i for j,i in enumerate(txt.split('r2>')) if 0 < j < l
and i.replace('>','').replace('<','').strip())

Now you have a generator you can iterate through with next(a) (a.next()
in Python 2), or alternatively you could just create a list out of it by
replacing the outer parens with square brackets.

bijeshn · Apr 3, 2008

Hmm, I will look into it. Thanks.

The XML file is almost a TB in size, so SAX will have to be the parser.
I'm thinking of splitting the file using SAX. Any suggestions along
those lines? If there are any other suitable parsers, please suggest
them.

Steve Holden · Apr 3, 2008

bijeshn said:
Hmmm... will look into it.. Thanks

the XML file is almost a TB in size...

Good grief. When will people stop abusing XML this way?

so SAX will have to be the parser.... i'm thinking of doing something
to split the file using SAX
... Any suggestions on those lines..? If there are any other parsers
suitable, please suggest...

You could try pulldom, but the documentation is disgraceful.

ElementTree.iterparse *might* help.

regards
Steve
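
A minimal sketch of the pulldom route mentioned above, not taken from the thread. It assumes records are <r2> elements as in the original post; the sample data and the helper name iter_records are made up for illustration. Only the record currently being expanded is built as a DOM subtree.

```python
import io
from xml.dom import pulldom

# Made-up sample data in the shape described in the original post.
SAMPLE = """<r1>
<r2><r3><r4>1</r4></r3></r2>
<r2><r3><r4>2</r4></r3></r2>
<r2><r3><r4>3</r4></r3></r2>
</r1>"""

def iter_records(stream, tag="r2"):
    """Yield each <tag> element as an XML string, one record at a time."""
    events = pulldom.parse(stream)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == tag:
            events.expandNode(node)  # build a DOM subtree for this record only
            yield node.toxml()

records = list(iter_records(io.StringIO(SAMPLE)))
print(records[0])
```

For a real file you would pass an open file object (or a filename) instead of the StringIO, and consume the generator lazily rather than calling list() on it.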

Marco Mariani · Apr 3, 2008

Steve said:
Good grief. When will people stop abusing XML this way?

Not before somebody writes a clever xmlfs for the linux kernel :-/

Marco Mariani · Apr 3, 2008

Marco said:
Not before somebody writes a clever xmlfs for the linux kernel :-/

Ok.

I meant it as a joke, but somebody has been there and done that.

Twice.

http://xmlfs.modry.cz/user_documentation/

http://www.haifa.ibm.com/projects/storage/xmlfs/index.html

Chris · Apr 3, 2008

I abuse it because I can (and because I don't generally work with XML
files larger than 20-30 MB).

And the OP never said the XML file was 1 TB in size, which makes things
different.

Diez B. Roggisch · Apr 3, 2008

Even with small XML files your advice was not very sound. Yes, it's
tempting to use regexes to process XML. But usually one falls flat on
one's face soon, because of whitespace or attribute order or <foo></foo>
versus <foo/> or .. or .. or.

Use an XML parser. That's what they are for. And especially with the
Pythonic ones like ElementTree (and the compatible lxml), it's even more
straightforward than using regexes.

Diez
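
To illustrate the point about whitespace, attribute order and <foo></foo> versus <foo/>, here is a small sketch (the three sample strings and the helper summarize are invented for this illustration). The strings differ as text, so string splitting or a regex would have to handle each form, but a parser reports the same structure for all of them.

```python
import xml.etree.ElementTree as ET

# Three spellings of the same record: different attribute order,
# different whitespace, and <r3></r3> versus <r3/>.
variants = [
    '<r2 id="1" kind="a"><r3></r3></r2>',
    '<r2 kind="a" id="1"><r3/></r2>',
    '<r2   id="1"\n     kind="a" ><r3 /></r2 >',
]

def summarize(xml_text):
    """Return the tag, sorted attributes and child tags of the root element."""
    elem = ET.fromstring(xml_text)
    return (elem.tag, sorted(elem.attrib.items()), [child.tag for child in elem])

summaries = [summarize(v) for v in variants]
all_equal = summaries[0] == summaries[1] == summaries[2]
print(all_equal)
```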

bijeshn · Apr 4, 2008

Yeah, I plan to use SAX, but how do you do it with that?

Forget 1 TB for now. How do you split an XML file that is something
like 70-80 GB according to my need (that's the original post)?
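
For the SAX route asked about here, one possible shape is sketched below. It is not from the thread: the class name RecordSplitter, the on_record callback and the sample data are all assumptions made for illustration. The handler forwards the events of each <r2> subtree to an XMLGenerator from the standard library and hands the finished record text to the callback, so only one record is held in memory at a time.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator

# Made-up sample data in the shape described in the original post.
SAMPLE = b"""<r1>
<r2><r3><r4>1</r4></r3></r2>
<r2><r3><r4>2</r4></r3></r2>
<r2><r3><r4>3</r4></r3></r2>
</r1>"""

class RecordSplitter(xml.sax.ContentHandler):
    """Re-serialize every <record_tag> subtree and pass it to on_record."""

    def __init__(self, on_record, record_tag="r2"):
        super().__init__()
        self.on_record = on_record
        self.record_tag = record_tag
        self._buf = None   # text buffer for the record being read
        self._gen = None   # XMLGenerator writing into _buf
        self._depth = 0    # nesting depth inside the current record

    def startElement(self, name, attrs):
        if self._gen is None and name == self.record_tag:
            self._buf = io.StringIO()
            self._gen = XMLGenerator(self._buf, encoding="utf-8")
        if self._gen is not None:
            self._depth += 1
            self._gen.startElement(name, attrs)

    def characters(self, content):
        if self._gen is not None:
            self._gen.characters(content)

    def endElement(self, name):
        if self._gen is None:
            return
        self._gen.endElement(name)
        self._depth -= 1
        if self._depth == 0:  # the record's own end tag was just written
            self.on_record(self._buf.getvalue())
            self._buf = self._gen = None

records = []
xml.sax.parseString(SAMPLE, RecordSplitter(records.append))
print(records[0])
```

With a real file, xml.sax.parse(path, handler) would replace parseString, and on_record would write to the current output file instead of appending to a list.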

Stefan Behnel · Apr 7, 2008

What do you mean by "written down to a separate file"? Do you have a specific
format in mind?

In general, you can try this with ElementTree's iterparse(), inside a
loop over its (event, element) pairs, with root bound to the root
element:

    if event == "end" and element.tag == "r2":
        print(ET.tostring(element))  # write record subtree as XML
        root.clear()                 # one record done, clean up everything

http://effbot.org/zone/element-iterparse.htm

You can also do things like

    print(element.findtext("r3/r4"))

Read the ElementTree tutorial to learn how to extract your data:

http://effbot.org/zone/element.htm#searching-for-subelements

Stefan
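
A self-contained sketch of the iterparse approach, applied to the original question of pulling out records n through n+k. The function name extract_records, its 0-based start/count parameters and the sample data are assumptions made for illustration. It also assumes <r2> elements are never nested inside each other.

```python
import io
import xml.etree.ElementTree as ET

# Made-up sample data in the shape described in the original post.
SAMPLE = b"""<r1>
<r2><r3><r4>1</r4></r3></r2>
<r2><r3><r4>2</r4></r3></r2>
<r2><r3><r4>3</r4></r3></r2>
<r2><r3><r4>4</r4></r3></r2>
<r2><r3><r4>5</r4></r3></r2>
</r1>"""

def extract_records(source, start, count, tag="r2"):
    """Return XML strings for records start .. start+count-1 (0-based)."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)        # first event is the start of the root
    seen = 0
    found = []
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            if seen >= start:
                elem.tail = None   # drop whitespace that follows the record
                found.append(ET.tostring(elem, encoding="unicode"))
            seen += 1
            root.clear()           # free the records processed so far
            if seen >= start + count:
                break              # no need to read the rest of the file
    return found

picked = extract_records(io.BytesIO(SAMPLE), start=1, count=2)
print(picked)
```

For a file on disk, a filename or a file opened in binary mode can be passed as source in place of the BytesIO.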

bijeshn · Apr 7, 2008

What do you mean by "written down to a separate file"? Do you have a specific
format in mind?

Sorry, it should be extracted into separate "files", i.e. if I have an
XML file containing 10 million records, I need to split the file into
100 files containing 100,000 records each.

I hope this is clearer.

bijeshn · Apr 7, 2008

Please disregard the above post.

Sorry, it should be extracted into separate "XML files", i.e. if I have
an XML file containing 10 million records, I need to split the file into
100 XML files containing 100,000 records each.

I hope this is clearer.

bijeshn · Apr 7, 2008

The extracted files are to be XML too. I just need to extract it raw
(tags and data just as they are in the parent XML file).

Stefan Behnel · Apr 7, 2008

bijeshn said:
the extracted files are to be XML too. ijust need to extract it raw
(tags and data just like it is in the parent XML file..)

Ah, so then replace the "print tostring()" line in my example by

ET.ElementTree(element).write("outputfile.xml")

and you're done.

Stefan
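
Note that writing every record to the same "outputfile.xml" would overwrite the file each time. Below is one possible sketch of the stated goal, a fixed number of records per output file, each wrapped in the root tag. The function name split_xml, the part_NNNN.xml naming scheme and the sample data are assumptions for illustration, and root attributes or namespaces are not carried over.

```python
import io
import os
import tempfile
import xml.etree.ElementTree as ET

# Made-up sample data in the shape described in the original post.
SAMPLE = b"""<r1>
<r2><r3><r4>1</r4></r3></r2>
<r2><r3><r4>2</r4></r3></r2>
<r2><r3><r4>3</r4></r3></r2>
<r2><r3><r4>4</r4></r3></r2>
<r2><r3><r4>5</r4></r3></r2>
</r1>"""

def split_xml(source, out_dir, per_file, record_tag="r2"):
    """Write files holding up to per_file records each; return their paths."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)              # first event is the start of the root
    paths, out, in_file = [], None, 0
    for event, elem in context:
        if event != "end" or elem.tag != record_tag:
            continue
        if out is None:                  # open the next chunk file
            path = os.path.join(out_dir, "part_%04d.xml" % len(paths))
            paths.append(path)
            out = open(path, "w", encoding="utf-8")
            out.write("<%s>\n" % root.tag)
        elem.tail = "\n"                 # one record per line in the output
        out.write(ET.tostring(elem, encoding="unicode"))
        in_file += 1
        root.clear()                     # free the records processed so far
        if in_file == per_file:          # chunk is full, close it
            out.write("</%s>\n" % root.tag)
            out.close()
            out, in_file = None, 0
    if out is not None:                  # close a final, partly filled chunk
        out.write("</%s>\n" % root.tag)
        out.close()
    return paths

demo_dir = tempfile.mkdtemp()
paths = split_xml(io.BytesIO(SAMPLE), demo_dir, per_file=2)
counts = [len(ET.parse(p).getroot()) for p in paths]
print(counts)
```

For the 10-million-record case described above, per_file would be 100000 and source would be the path of the big file.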

bijeshn · Apr 8, 2008

Thanks a lot, Stefan.
I haven't tested out your idea yet.
Will get back as soon as I do.
