SAX/Python: read an XML file from the end to the top

kepioo · Mar 7, 2006

I currently have an XML input file containing lots of data. My objective
is to write a script that reports, in another XML file, only the data I
am interested in. Doing this is really easy using SAX.

The input file is continuously updated. However, the other XML file
should be updated only on request.

Every time we run the script, we track the new elements in the input
file and report them in the output file.

My idea was to:
- detect in the output file the last event reported
- read the input file from the end
- report all the new events (since the last time the script was run).

Question: Is it possible to read an XML file and process it from the
end to the beginning, using SAX?

Diez B. Roggisch · Mar 7, 2006

kepioo said:
Question: Is it possible to read an XML file and process it from the
end to the beginning, using SAX?

No. Nor with any other XML-related technology I know of.

Generally speaking, I'd say your approach is inherently flawed. XML requires a well-formed document to have exactly one
root element, and that element has to be closed. This makes it unsuitable for e.g. log files, as these have no explicit
"end" except the implicit last log entry. So you will always have something like this:

--- begin ---
<root>
<entry/>
<entry/>
--- end ---

I don't know _what_ you do, but unless you always rewrite the whole XML file from scratch, you can't possibly write that
closing end tag. So you end up with a malformed XML document. Or you _do_ rewrite all the file contents each time -
but then you'd be able to reverse the order of elements so that the last came first. But I doubt the latter, as it
imposes a great performance bottleneck with little gain.

SAX won't puke on you for your file being malformed, as it only learns about that when it is too late. So you might use
it: by the time that happens, you are already finished with your actual task.
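
A minimal sketch of that behaviour, assuming Python's standard xml.sax module and a made-up <entry> element name. The handler (EntryCounter, a hypothetical name) sees every complete entry before the parser complains about the missing closing tag:

```python
import xml.sax


class EntryCounter(xml.sax.ContentHandler):
    """Counts <entry> elements as the parser reports them."""

    def __init__(self):
        super().__init__()
        self.entries = 0

    def startElement(self, name, attrs):
        if name == "entry":
            self.entries += 1


def count_entries(data):
    """Parse bytes that may lack the closing root tag.

    Returns (number of entries seen, whether the document was complete).
    """
    handler = EntryCounter()
    try:
        xml.sax.parseString(data, handler)
        complete = True
    except xml.sax.SAXParseException:
        # The error is raised only once the parser runs out of input,
        # after the entries before that point have been reported.
        complete = False
    return handler.entries, complete
```

Calling count_entries(b"<root><entry/><entry/>") still counts both entries; the second item of the result only tells you that the root element was never closed.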

But you will always have to parse it from the beginning, to catch the document header, and there is no fast-forward
built into SAX.

So - what are your options?

- use separate output files for each entry, each well-formed in itself. Beware if you've got plenty of them
(a few thousand to millions) that some file systems might not deal well with that

- if you can keep the file open for reading all the time (because you are kind of a background process), you can read the
contents, create a buffer and search for start tags in that yourself. Then you can snip out the necessary portions,
complete them with an XML header and feed them separately.

- if you can't keep it open, you can simulate that using the seek function

Both of the last two options are somewhat cumbersome, as you have to do a lot of parsing yourself - which is exactly what
one chose XML to avoid in the first place... From that follows the last piece of advice:

- ditch XML. Either totally, or at least as format for the whole file. Instead, use some protocol like this:

--- begin ---
Chunk-Length: 100
<?xml version="1.0"?>
<root>... ( a 100 byte size xml document)
</root>
Chunk-Length: 200
<?xml version="1.0"?>
<root>... ( a 200 byte size xml document)
</root>
....

Then you can easily read through your document, skip unnecessary entries and extract the ones you want. Or, when keeping
the file open, know exactly what to read for the next chunk.
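
A rough sketch of such a chunked format, assuming a "Chunk-Length: N" header line followed by exactly N bytes of XML. The function names write_chunk and read_chunks are made up for illustration:

```python
import io

HEADER = b"Chunk-Length: "


def write_chunk(stream, document):
    """Append one self-contained XML document (bytes), preceded by its length."""
    stream.write(HEADER + str(len(document)).encode("ascii") + b"\n")
    stream.write(document)


def read_chunks(stream, skip=0):
    """Yield each document in turn, seeking past the first `skip` bodies unread."""
    index = 0
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.startswith(HEADER):
            raise ValueError("expected a Chunk-Length header, got %r" % line)
        length = int(line[len(HEADER):].decode("ascii"))
        if index < skip:
            stream.seek(length, io.SEEK_CUR)  # jump over the body
        else:
            yield stream.read(length)
        index += 1
```

The stream has to be opened in binary mode so that lengths and seek offsets are both counted in bytes. Each yielded chunk is a complete document and can be handed to any XML parser on its own.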

Diez

kepioo · Mar 7, 2006

Hi Diez,

thank you for your answer. Let me give you more background on the
project.

The input XML I am parsing is always well-formed. It comes from
another application that appends to this file. I haven't seen the source
code of the application, but I know that it is not rewriting the whole
file. I think it just removes the closing root tag, adds the new
tags and writes the </root> tag again.

We don't want to create new output files for every entry (each entry
is an event, and we have approximately 5 events per minute). So I
have to stick with this XML input file.

I guess I will parse it until I find the last reported event and update
the output XML from there, reporting only the events I am interested
in. I hope SAX won't take too much time to do all this (let's say 1
event = 10 tags, 5 events per minute, XML file running for a 30-day month -->
about 2,160,000 opening tags).

What do you think?

Diez B. Roggisch · Mar 7, 2006

kepioo said:
We don't want to create new output files for every entry (each entry
is an event, and we have approximately 5 events per minute). So I
have to stick with this XML input file.

Well, the overall amount of data won't change. But I can understand that
decision. However, you might consider using a file per day/week.

kepioo said:
I guess I will parse it until I find the last reported event and update
the output XML from there, reporting only the events I am interested
in. I hope SAX won't take too much time to do all this (let's say 1
event = 10 tags, 5 events per minute, XML file running for a 30-day month -->
about 2,160,000 opening tags).

Use my suggested approach 2 - that boils down to using "seek" and some
hand-written parsing/buffering. A little bit nasty, but better than
consuming all of that file through sax.
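
The seek-based approach might look roughly like this. It is only a sketch, under the assumptions that the root element is literally <root>, that the writer only ever appends entries just before the closing tag, and that the script stores the returned byte offset between runs. The names read_new_entries and parse_fragment are hypothetical:

```python
import xml.sax

OPEN_TAG = b"<root>"    # assumed name of the input file's root element
CLOSE_TAG = b"</root>"


def read_new_entries(path, offset=0):
    """Return (fragment, new_offset) for everything appended after `offset`.

    `offset` is the byte position where the previous run stopped;
    pass 0 on the very first run.
    """
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
    start = 0
    if offset == 0:
        # First run: skip the XML declaration (if any) and the opening root tag.
        start = data.index(OPEN_TAG) + len(OPEN_TAG)
    end = len(data.rstrip())
    if data[:end].endswith(CLOSE_TAG):
        end -= len(CLOSE_TAG)  # the next entries will be written from here
    return data[start:end], offset + end


def parse_fragment(fragment, handler):
    """Wrap the fragment in a synthetic root so it is well-formed, then parse it."""
    xml.sax.parseString(b"<fragment>" + fragment + b"</fragment>", handler)
```

Only the bytes written since the previous run are read and parsed, so the cost per run depends on the number of new events, not on the age of the file.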

Diez

Peter Hansen · Mar 7, 2006

kepioo said:
The input XML I am parsing is always well-formed. It comes from
another application that appends to this file. I haven't seen the source
code of the application, but I know that it is not rewriting the whole
file. I think it just removes the closing root tag, adds the new
tags and writes the </root> tag again.

If the writers had a clue, they probably just seek to the end of the
file minus len('</root>') (or whatever) and then overwrite with the new
entry and another </root> tag. At least, that's what seemed like
the obvious approach when I had to do this once.

Not that this is particularly relevant to the problem. ;-)
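
For illustration only, the writer side described above could be sketched like this. It is a guess at how such an application might append, not its actual code, and it assumes the file ends with exactly '</root>' and no trailing newline. The name append_entry is made up:

```python
import os

CLOSING_TAG = b"</root>"


def append_entry(path, entry):
    """Overwrite the trailing closing tag with `entry`, then write the tag again."""
    with open(path, "r+b") as f:
        f.seek(-len(CLOSING_TAG), os.SEEK_END)
        if f.read() != CLOSING_TAG:
            raise ValueError("file does not end with the closing root tag")
        # Go back to where the closing tag starts and write over it.
        f.seek(-len(CLOSING_TAG), os.SEEK_END)
        f.write(entry + CLOSING_TAG)
```

Because the new data is always longer than the tag it replaces, the file never needs truncating, and it is well-formed again as soon as the write completes.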

kepioo said:
I guess I will parse it until I find the last reported event and update
the output XML from there, reporting only the events I am interested
in. I hope SAX won't take too much time to do all this (let's say 1
event = 10 tags, 5 events per minute, XML file running for a 30-day month -->
about 2,160,000 opening tags).

What do you think?

I think (guessing wildly) you probably have a fairly restricted number
of possibilities being written to this file, possibly as simple as the
somewhat stereotypical '<entry text="blah blah"/>' type of thing which
I've seen lots of times.

If so, you can simply treat this as a text file which you process
manually, in whatever direct and crude fashion works best, such as by
seeking 1000 chars back from the end (assuming new entries are always
less than that length), scanning for the last "<entry" string, and
slicing and dicing till you find the stuff you need.

In other words, screw SAX, just grab the data directly and forget about
all those silly well-formed XML issues etc. Go for the simplest thing
that could possibly work, and if you don't need the complexity of SAX,
don't use it.
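
A crude sketch of that idea, assuming entries start with the literal text "<entry" and that the newest one fits within the last 1000 bytes (bytes rather than characters, since the file is opened in binary mode). The function name last_entry is hypothetical:

```python
import os


def last_entry(path, marker=b"<entry", window=1000):
    """Return the text from the last `marker` within the final `window` bytes
    up to the end of the file, or None if the marker is not found there."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        f.seek(max(0, size - window))  # never seek before the start
        tail = f.read()
    start = tail.rfind(marker)
    if start == -1:
        return None
    return tail[start:]
```

The returned slice still includes whatever follows the entry, such as the closing root tag, so it needs a little more slicing before use.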

-Peter

kepioo · Mar 7, 2006

Thanks Diez for your suggestion. I'll look around to find out more
about the seek function (I learnt Python two weeks ago and I do not have
a programming background, but so far I am doing well).

Peter,

I cannot really follow your advice: the entries are not that
stereotypical. We built a data structure for the XML and we report various
types of events, always in the same format but with different
content types.

The script I am writing aims at picking only special events
(identified by a route tag and an information tag).

Anyway, thank you for your advice!

Peter Hansen · Mar 7, 2006

kepioo said:
Peter,

I cannot really follow your advice: the entries are not that
stereotypical. We built a data structure for the XML and we report various
types of events, always in the same format but with different
content types.

The script I am writing aims at picking only special events
(identified by a route tag and an information tag).

Can you post one or two small examples that show the range of
possibilities? I still have this feeling there will be a simpler
approach than really parsing the XML, but maybe I'm wrong.

-Peter

kepioo · Mar 8, 2006

An example (I changed the content to make it simpler):

################### input file ####################

<root>
<case>
<TimeStamp Date="Mon Feb 20 19:40:28 SGT 2006" >
<Message>fruits</Message>
<Elements>
<Element name="apple">5</Element>
<Element name="banana">10</Element>
<Element name="peach">25</Element>
</Elements>
</TimeStamp>
</case>

<case>
<TimeStamp Date="Mon Feb 20 19:45:28 SGT 2006" >
<Message>names</Message>
<Elements>
<Element name="CEO">vincent</Element>
<Element name="Analysit">Robert</Element>
</Elements>
</TimeStamp>
</case>

<case>
<TimeStamp Date="Mon Feb 20 19:50:28 SGT 2006" >
<Message>open the car</Message>
</TimeStamp>
</case>

<case>
<TimeStamp Date="Mon Feb 20 19:55:28 SGT 2006" >
<Message>fruits</Message>
<Elements>
<Element name="peach">25</Element>
<Element name="apple">8</Element>
<Element name="cherry">120</Element>
</Elements>
</TimeStamp>
</case>
</root>
###################################################

The script I want to write has to track any change in the input
file. What we want to track is set by parameters in the script: here, for
instance, the number of apples and cherries. The output file for this
example would be (we write it as a stream):

################### output file ###################
<track>
<case>
<TimeStamp Date="Mon Feb 20 19:40:28 SGT 2006" >
<Message>fruits</Message>
<Elements>
<Element name="apple">5</Element>
</Elements>
</TimeStamp>
</case>

<case>
<TimeStamp Date="Mon Feb 20 19:55:28 SGT 2006" >
<Message>fruits</Message>
<Elements>
<Element name="apple">8</Element>
</Elements>
</TimeStamp>
</case>

<case>
<TimeStamp Date="Mon Feb 20 19:55:28 SGT 2006" >
<Message>fruits</Message>
<Elements>
<Element name="cherry">120</Element>
</Elements>
</TimeStamp>
</case>
</track>
###################################################
The input file keeps being generated. The output file is generated on
request. Both are stream-based: we append to the end of the file.

Any suggestion?
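
One possible starting point, sketched with the standard xml.sax module: a handler that walks the sample input and keeps only the tracked elements. The tracked names are taken from the example above; the names TrackHandler and track are made up here, and writing the collected rows back out as <track> XML is left out:

```python
import xml.sax

TRACKED = {"apple", "cherry"}  # assumed tracking parameters, from the example


class TrackHandler(xml.sax.ContentHandler):
    """Collects (date, message, name, value) for every tracked <Element>."""

    def __init__(self, tracked):
        super().__init__()
        self.tracked = tracked
        self.rows = []
        self.date = None     # Date attribute of the current <TimeStamp>
        self.message = ""    # text of the current <Message>
        self.name = None     # name of the tracked <Element> being read, if any
        self.text = []       # character data of the element being read

    def startElement(self, tag, attrs):
        self.text = []
        if tag == "TimeStamp":
            self.date = attrs.get("Date")
        elif tag == "Element" and attrs.get("name") in self.tracked:
            self.name = attrs.get("name")

    def characters(self, content):
        # May be called several times for one text node, so accumulate.
        self.text.append(content)

    def endElement(self, tag):
        value = "".join(self.text).strip()
        if tag == "Message":
            self.message = value
        elif tag == "Element" and self.name is not None:
            self.rows.append((self.date, self.message, self.name, value))
            self.name = None
        self.text = []


def track(data, tracked=TRACKED):
    """Parse XML bytes and return the rows for the tracked element names."""
    handler = TrackHandler(tracked)
    xml.sax.parseString(data, handler)
    return handler.rows
```

Each row carries enough information to write one <case> block of the output file, and the date of the last row written can serve as the marker for the last event reported.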
