NEWB: reverse traversal of xml file

manstey

Hi,

I have an XML file of about 140 MB like this:

<book>
<record>
....
<wordpartWTS>1</wordpartWTS>
</record>
<record>
...
<wordpartWTS>2</wordpartWTS>
</record>
<record>
....
<wordpartWTS>1</wordpartWTS>
</record>
</book>

I want to traverse it from bottom to top and add another field to each
record, <totalWordPart>1</totalWordPart>,
which would give the highest value of wordpartWTS among the records
that belong to the same word

so if wordparts for the first ten records were 1 2 1 1 1 2 3 4 1 2
I want totalWordPart to be 2 2 1 1 4 4 4 4 2 2

I figure the easiest way to do this is to go through the file backwards.

Any ideas how to do this with an xml data file?

Thanks
 
Serge Orlov

manstey said:
I figure the easiest way to do this is to go thru the file backwards.

Any ideas how to do this with an xml data file?

You need to iterate from the beginning and use itertools.groupby:

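from itertools import groupby

def enumerate_words(parts):
    # Tag each part with the number of the word it belongs to.
    # A new word starts whenever the part number does not increase.
    word_num = 0
    prev = 0
    for part in parts:
        if prev >= part:
            word_num += 1
        prev = part
        yield word_num, part


def get_word_num(item):
    return item[0]

parts = 1, 2, 1, 1, 1, 2, 3, 4, 1, 2
for word_num, word in groupby(enumerate_words(parts), get_word_num):
    parts_list = list(word)
    # The last part of a word carries the highest part number.
    max_part = parts_list[-1][1]
    for _, part_num in parts_list:
        print(max_part, part_num)  # Python 3 print function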

prints:

2 1
2 2
1 1
1 1
4 1
4 2
4 3
4 4
2 1
2 2
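To apply the same idea to the XML file itself, something along these lines might work. It is only a sketch under assumptions: the layout is the one shown in the question, every record has a wordpartWTS child, and part numbers count upward within a word. The names flush and add_total_word_part are made up for this example. It reads the file front to back with xml.etree.ElementTree.iterparse and holds only the records of the current word.

```python
import io
import xml.etree.ElementTree as ET


def flush(buffered, out):
    """Write the buffered records of one word, each with <totalWordPart>."""
    if not buffered:
        return
    # Assumption: parts count upward, so the last record holds the maximum.
    total = buffered[-1].findtext("wordpartWTS")
    for rec in buffered:
        child = ET.SubElement(rec, "totalWordPart")
        child.text = total
        child.tail = "\n"
        rec.tail = "\n"
        out.write(ET.tostring(rec, encoding="unicode"))
        rec.clear()  # drop the record's content once it has been written
    del buffered[:]


def add_total_word_part(source, out):
    """Stream <record> elements from source and write them to out."""
    out.write("<book>\n")
    buffered = []  # records of the word currently being read
    prev = 0
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "record":
            continue
        part = int(elem.findtext("wordpartWTS"))
        if prev >= part:  # part number did not increase: a new word starts
            flush(buffered, out)
        buffered.append(elem)
        prev = part
    flush(buffered, out)
    out.write("</book>\n")


# Small in-memory demo using the part numbers from the question.
parts = [1, 2, 1, 1, 1, 2, 3, 4, 1, 2]
sample = "<book>\n" + "".join(
    "<record>\n<wordpartWTS>%d</wordpartWTS>\n</record>\n" % p for p in parts
) + "</book>\n"
out = io.StringIO()
add_total_word_part(io.BytesIO(sample.encode("utf-8")), out)
totals = [int(r.findtext("totalWordPart")) for r in ET.fromstring(out.getvalue())]
print(totals)  # [2, 2, 1, 1, 4, 4, 4, 4, 2, 2]
```

For the real file you would pass the input filename and an opened output file instead of the in-memory objects. Note that the emptied record elements stay attached to the root, so memory still grows a little with the number of records.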
 
manstey

But will this work if I don't know the parts in advance? I only know the
parts by reading through the file, which has 450,000 lines.
 
Serge Orlov

manstey said:
But will this work if I don't know parts in advance.

Yes, it will work as long as the highest part number in the whole file
is not very high. The algorithm only needs to store N records in memory,
where N is the highest part number in the whole file.
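To make the memory bound visible, here is a rough sketch of the same grouping written with an explicit buffer instead of groupby. The name with_totals and the stats dictionary are invented for this illustration. The buffer never holds more than one word, so its peak size equals the longest word.

```python
def with_totals(parts, stats):
    """Yield (total, part) pairs while buffering only the current word."""
    buffered = []
    for part in parts:
        # A part number that does not increase starts a new word,
        # so everything buffered so far can be emitted and dropped.
        if buffered and buffered[-1] >= part:
            for p in buffered:
                yield buffered[-1], p
            buffered = []
        buffered.append(part)
        stats["peak"] = max(stats.get("peak", 0), len(buffered))
    # Emit the last word once the input is exhausted.
    for p in buffered:
        yield buffered[-1], p


stats = {}
pairs = list(with_totals([1, 2, 1, 1, 1, 2, 3, 4, 1, 2], stats))
print([total for total, _ in pairs])  # [2, 2, 1, 1, 4, 4, 4, 4, 2, 2]
print(stats["peak"])                  # 4
```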
manstey said:
I only know parts
by reading through the file, which has 450,000 lines.

Lines or records? I created a sequence of 10,000,000 numbers, which is
equivalent to ten million records, like this:

def many_numbers():
    # 1,000,000 words of 10 parts each, produced lazily
    for n in range(1000000):
        for part in range(10):
            yield part

parts = many_numbers()

and the code processed it in 13 seconds, consuming virtually no memory.
That is the advantage of iterators and generators: you can process long
sequences without allocating a lot of memory.
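If you want to check the memory claim yourself, one possible way is to compare peak allocations with the standard tracemalloc module. This is only a sketch: count_records and peak_bytes are helper names made up here, the input is scaled down to 100,000 numbers, and the exact byte counts will vary by machine and Python version.

```python
import tracemalloc
from itertools import groupby


def many_numbers(words):
    # `words` words of 10 parts each, produced lazily
    for _ in range(words):
        for part in range(10):
            yield part


def enumerate_words(parts):
    word_num = 0
    prev = 0
    for part in parts:
        if prev >= part:
            word_num += 1
        prev = part
        yield word_num, part


def count_records(parts):
    """Run the whole pipeline, keeping only one word in memory at a time."""
    count = 0
    for _, word in groupby(enumerate_words(parts), lambda item: item[0]):
        count += len(list(word))
    return count


def peak_bytes(func):
    """Return the peak number of bytes allocated while func() runs."""
    tracemalloc.start()
    func()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak


# Generator input versus the same numbers materialized as a list first.
streamed = peak_bytes(lambda: count_records(many_numbers(10000)))
materialized = peak_bytes(lambda: count_records(list(many_numbers(10000))))
print(streamed, materialized)
```

The streamed run should report a far smaller peak than the run that builds the full list first.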
 
