XML -> Tab-delimited text file (using lxml)


Gibson

I'm attempting to do the following:
A) Read/scan/iterate/etc. through a semi-large XML file (about 135 mb)
B) Grab specific fields and output to a tab-delimited text file

The only problem I'm having is that the tab-delimited text file
requires the values in a different order from the one in which they
appear in the XML file. Example below.

<Title>
  <Item ID="1234abcd">
    <ItemVal ValueID="image" value="image.jpg" />
    <ItemVal ValueID="name" value="My Wonderful Product 1" />
    <ItemVal ValueID="description" value="My Wonderful Product 1 is a wonderful product, indeed." />
  </Item>
  <Item ID="2345bcde">
    <ItemVal ValueID="image" value="image2.jpg" />
    <ItemVal ValueID="name" value="My Wonderful Product 2" />
    <ItemVal ValueID="description" value="My Wonderful Product 2 is a wonderful product, indeed." />
  </Item>
  <Item ID="3456cdef">
    <ItemVal ValueID="image" value="image3.jpg" />
    <ItemVal ValueID="description" value="My Wonderful Product 3 is a wonderful product, indeed." />
    <ItemVal ValueID="name" value="My Wonderful Product 3" />
  </Item>
</Title>

(Note: The last item, "3456cdef", has its description value before the
name, whereas in the previous items it comes after. This is to
simulate the XML data with which I am working.)

The tab-delimited text file should appear as follows (tabs are shown
as 2 spaces here, for the sake of readability):

(ID,name,description,image)
1234abcd  My Wonderful Product 1  My Wonderful Product 1 is a wonderful product, indeed.  image.jpg
2345bcde  My Wonderful Product 2  My Wonderful Product 2 is a wonderful product, indeed.  image2.jpg
3456cdef  My Wonderful Product 3  My Wonderful Product 3 is a wonderful product, indeed.  image3.jpg

Currently, I'm working with the lxml library for iteration and
parsing, though this is proving to be a bit of a challenge for data
that needs to be reorganized (such as mine). Sample below.

''' Start code '''

from lxml import etree

def main():
    # Far too much room would be taken up if I were to paste my
    # real code here, so I will give a smaller example of what
    # I'm doing. Also, I do realize this is a very naive way to do
    # what it is I'm trying to accomplish... besides the fact
    # that it doesn't work as intended in the first place.

    out = open('output.txt','w')
    cat = etree.parse('catalog.xml')
    for el in cat.iter():
        # Search for the first item, make a new line for it
        # and output the ID
        if el.tag == "Item":
            out.write("\n%s\t" % (el.attrib['ID']))
        elif el.tag == "ItemVal":
            if el.attrib['ValueID'] == "name":
                out.write("%s\t" % (el.attrib['value']))
            elif el.attrib['ValueID'] == "description":
                out.write("%s\t" % (el.attrib['value']))
            elif el.attrib['ValueID'] == "image":
                out.write("%s\t" % (el.attrib['value']))
    out.close()

if __name__ == '__main__':
    main()

''' End code '''

I now realize that etree.iter() is meant to be used in an entirely
different fashion, but my brain is stuck on this naive way of coding.
If someone could give me a push in any correct direction I would be
most grateful.
 
Stefan Behnel

Gibson said:
I'm attempting to do the following:
A) Read/scan/iterate/etc. through a semi-large XML file (about 135 mb)
B) Grab specific fields and output to a tab-delimited text file
[...]
out = open('output.txt','w')
cat = etree.parse('catalog.xml')

Use iterparse() instead of parsing the file into memory completely.

untested:

from lxml import etree

for _, item in etree.iterparse('catalog.xml', tag='Item'):
    # do some cleanup to save memory: drop the siblings already handled
    previous_item = item.getprevious()
    while previous_item is not None:
        previous_item.getparent().remove(previous_item)
        previous_item = item.getprevious()

    # now read the data
    item_id = item.get('ID')
    collect = {}
    for child in item:
        if child.tag != 'ItemVal':
            continue
        # the attribute is spelled 'ValueID' in the sample XML
        collect[child.get('ValueID')] = child.get('value')

    print("%s\t%s\t%s\t%s" % ((item_id,) + tuple(
        collect[key] for key in ['name', 'description', 'image'])))

Stefan
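
For anyone who wants the rows written to a file rather than printed, here is a minimal sketch of the same collect-then-reorder idea. To keep it runnable without lxml installed it swaps in the standard library's xml.etree.ElementTree, whose iterparse() has no tag= keyword, so the tag is checked by hand. The function name items_to_tsv and the COLUMNS list are made up for illustration, and filling a missing field with an empty string is an assumption about what the output should look like.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Output order of the ItemVal fields (hypothetical constant for this sketch).
COLUMNS = ["name", "description", "image"]

def items_to_tsv(xml_source, out):
    """Stream <Item> elements from xml_source and write one TSV row per item."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for _, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != "Item":
            continue
        # Gather the values by ValueID so their order in the XML does not matter.
        values = {child.get("ValueID"): child.get("value")
                  for child in elem if child.tag == "ItemVal"}
        # Assumption: a field missing from the XML becomes an empty column.
        writer.writerow([elem.get("ID")] + [values.get(c, "") for c in COLUMNS])
        # Drop the item's children and attributes once the row is written.
        elem.clear()
```

Usage would be something like opening 'output.txt' with newline='' and calling items_to_tsv('catalog.xml', out). Note that clear() only empties each Item; the emptied elements stay attached to the root, so the getprevious()/remove() cleanup shown above (lxml only) frees more memory on very large files.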
 
Gibson

Use iterparse() instead of parsing the file into memory completely.

*stuff*

Stefan

That worked wonders. Thanks a lot, Stefan.

So, iterparse() uses an iterate -> parse method instead of parse() and
iter()'s parse -> iterate method (if that makes any sense)?
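
On that last question, the description seems roughly right: parse() reads the whole document into a tree before iter() ever sees an element, while iterparse() hands elements back while the input is still being read. A small sketch of the difference follows, using the standard library's xml.etree.ElementTree as a stand-in so it runs without lxml (lxml's etree offers the same two calls). The helper names trace_events and tags_after_full_parse are invented for this example.

```python
import io
import xml.etree.ElementTree as ET

DOC = b'<Title><Item ID="1"><ItemVal ValueID="name" value="A" /></Item></Title>'

def trace_events(data):
    """Record (event, tag) pairs in the order iterparse() reports them."""
    events = ET.iterparse(io.BytesIO(data), events=("start", "end"))
    return [(event, elem.tag) for event, elem in events]

def tags_after_full_parse(data):
    """Build the complete tree first, then walk it in document order."""
    root = ET.parse(io.BytesIO(data)).getroot()
    return [elem.tag for elem in root.iter()]
```

An "end" event for an Item arrives only after all of its ItemVal children have been read, which is why the loop in Stefan's reply can safely inspect each Item's children as soon as it receives it.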
 
