NEWB: reverse traversal of xml file

manstey · May 23, 2006

Hi,

I have an XML file of about 140 MB like this:

<book>
<record>
....
<wordpartWTS>1</wordpartWTS>
</record>
<record>
...
<wordpartWTS>2</wordpartWTS>
</record>
<record>
....
<wordpartWTS>1</wordpartWTS>
</record>
</book>

I want to traverse it from bottom to top and add another field to each
record, <totalWordPart>1</totalWordPart>, which would give the highest
value of wordpartWTS among the records that belong to the same word.

So if the wordparts for the first ten records were 1 2 1 1 1 2 3 4 1 2,
I want totalWordPart to be 2 2 1 1 4 4 4 4 2 2.

I figure the easiest way to do this is to go through the file backwards.

Any ideas how to do this with an xml data file?

Thanks

Serge Orlov · May 23, 2006


You need to iterate from the beginning and use itertools.groupby:

from itertools import groupby

def enumerate_words(parts):
    # Number the words: a part that does not increase starts a new word.
    word_num = 0
    prev = 0
    for part in parts:
        if prev >= part:
            word_num += 1
        prev = part
        yield word_num, part

def get_word_num(item):
    return item[0]

parts = 1, 2, 1, 1, 1, 2, 3, 4, 1, 2
for word_num, word in groupby(enumerate_words(parts), get_word_num):
    parts_list = list(word)
    # Parts increase within a word, so the last one is the highest.
    max_part = parts_list[-1][1]
    for _, part_num in parts_list:
        print(max_part, part_num)

prints:

2 1
2 2
1 1
1 1
4 1
4 2
4 3
4 4
2 1
2 2
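
To apply the same idea to the XML file itself, something like the following might work. It is only a sketch: it assumes Python 3's xml.etree.ElementTree, that every <record> has exactly one <wordpartWTS> child holding an integer, and the helper names records_with_word_num and add_totals are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET
from itertools import groupby

def records_with_word_num(source):
    """Yield (word_num, record_element) pairs while streaming the XML."""
    word_num = 0
    prev = 0
    # Only "end" events are requested, so each <record> is complete when seen.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "record":
            continue
        part = int(elem.findtext("wordpartWTS"))
        # A part number that does not increase starts a new word.
        if prev >= part:
            word_num += 1
        prev = part
        yield word_num, elem

def add_totals(source):
    """Yield every record with a <totalWordPart> child appended."""
    pairs = records_with_word_num(source)
    for _word_num, group in groupby(pairs, key=lambda pair: pair[0]):
        records = [elem for _num, elem in group]
        # Parts increase within a word, so the last record holds the highest.
        total = records[-1].findtext("wordpartWTS")
        for elem in records:
            ET.SubElement(elem, "totalWordPart").text = total
            yield elem

# Small in-memory stand-in for the real file (assumed layout).
sample_parts = (1, 2, 1, 1, 1, 2, 3, 4, 1, 2)
sample_xml = "<book>%s</book>" % "".join(
    "<record><wordpartWTS>%d</wordpartWTS></record>" % n for n in sample_parts
)

totals = []
for record in add_totals(io.BytesIO(sample_xml.encode("utf-8"))):
    totals.append(int(record.findtext("totalWordPart")))
    # On a large file, write the record out here, then release its contents.
    record.clear()

print(totals)  # [2, 2, 1, 1, 4, 4, 4, 4, 2, 2]
```

For the real file you would pass the file name instead of the BytesIO object and write each record (for example with ET.tostring) to a new file before clearing it.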

manstey · May 23, 2006

But will this work if I don't know the parts in advance? I only know the
parts by reading through the file, which has 450,000 lines.

Serge Orlov · May 24, 2006

manstey said:
But will this work if I don't know parts in advance?

Yes, it will work as long as the highest part number in the whole file
is not very high. The algorithm needs to store only N records in memory
at a time, where N is the highest part number in the whole file.
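
To see why the memory stays bounded, here is a rough equivalent written without groupby. It is a sketch only, and the name with_word_totals is made up for illustration. The buffer never holds more than the parts of one word, so its length is at most the highest part number.

```python
def with_word_totals(parts):
    """Yield (total, part) pairs, buffering only the current word's parts."""
    buffer = []
    for part in parts:
        # A part number that does not increase starts a new word,
        # so the buffered word is complete and can be flushed.
        if buffer and buffer[-1] >= part:
            total = buffer[-1]
            for buffered in buffer:
                yield total, buffered
            buffer = []
        buffer.append(part)
    # Flush the last word once the input is exhausted.
    if buffer:
        total = buffer[-1]
        for buffered in buffer:
            yield total, buffered

# iter() makes the point that the input does not need to be a list.
result = list(with_word_totals(iter([1, 2, 1, 1, 1, 2, 3, 4, 1, 2])))
print([total for total, _part in result])  # [2, 2, 1, 1, 4, 4, 4, 4, 2, 2]
```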

manstey said:
I only know parts by reading through the file, which has 450,000 lines.

Lines or records? I created a sequence of 10,000,000 numbers, which
stands in for ten million records, like this:

def many_numbers():
    # range is lazy in Python 3 (use xrange in Python 2).
    for n in range(1000000):
        for part in range(10):
            yield part

parts = many_numbers()

The code processed it in 13 seconds while consuming virtually no memory.
That is the advantage of iterators and generators: you can process long
sequences without allocating a lot of memory.
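
A quick way to see the difference, as a sketch assuming Python 3 (the parameters of many_numbers here are added for illustration): a generator object stays tiny however long the sequence is, while a list has to hold a slot for every item.

```python
import sys

def many_numbers(words, parts_per_word):
    """Lazily yield part numbers 0..parts_per_word-1 for each word."""
    for _ in range(words):
        for part in range(parts_per_word):
            yield part

lazy = many_numbers(100000, 10)          # nothing is computed yet
eager = list(many_numbers(100000, 10))   # all one million items in memory

lazy_size = sys.getsizeof(lazy)    # size of the generator object itself
eager_size = sys.getsizeof(eager)  # size of the list's slots, items excluded

# The generator can still be consumed item by item.
count = sum(1 for _ in lazy)
print(lazy_size, eager_size, count)
```

The exact sizes depend on the Python build, but the generator should be a few hundred bytes at most and the list several megabytes.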
