SAX/Python : read an xml from the end to the top

kepioo

I currently have an XML input file containing lots of data. My objective
is to write a script that reports in another XML file only the data I
am interested in. Doing this is really easy using SAX.

The input file is continuously updated. However, the other XML file
should be updated only on request.

Every time we run the script, we track the new elements in the input
file and report them in the output file.

My idea was to:
_ detect in the output file the last event reported
_ read the input file from the end
_ report all the new events (since the last time the script was run).

Question: is it possible to read an XML file and process it from the
end to the beginning, using SAX?
 
Diez B. Roggisch

kepioo said:
I currently have an XML input file containing lots of data. My objective
is to write a script that reports in another XML file only the data I
am interested in. Doing this is really easy using SAX.

The input file is continuously updated. However, the other XML file
should be updated only on request.

Every time we run the script, we track the new elements in the input
file and report them in the output file.

My idea was to:
_ detect in the output file the last event reported
_ read the input file from the end
_ report all the new events (since the last time the script was run).

Question: is it possible to read an XML file and process it from the
end to the beginning, using SAX?

No. Nor with any other XML-related technology I know of.

Generally speaking, I'd say your approach is inherently flawed. XML as a language requires well-formed documents to have
exactly one root element. This makes it unsuitable for e.g. log files, as these have no explicit "end" - except the
implicit last log entry. So you will always have something like this:

--- begin ---
<root>
<entry/>
<entry/>
--- end ---

I don't know _what_ you do, but unless you always rewrite the whole XML file from scratch, you can't possibly write that
closing end-tag. So you end up with a malformed XML document. Or you _do_ rewrite the whole file each time -
but then you'd be able to reverse the order of elements so that the last comes first. But I doubt the latter, as it
imposes a great performance bottleneck with little gain.

SAX won't puke on you for your file being malformed, as it only learns about that when it is too late. So you might use
it, since by the time that happens you are already finished with your actual task.

But you will always have to parse it from the beginning, to catch the document header, and there is no fast-forward
built into SAX.
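
A minimal sketch of that behaviour, assuming Python's standard xml.sax module. The handler name EntryCounter is made up for illustration. The parser delivers events for both entries of a document that lacks its closing tag, and only complains once the input runs out:

```python
import io
import xml.sax


class EntryCounter(xml.sax.ContentHandler):
    """Hypothetical handler: counts <entry> elements as they stream past."""

    def __init__(self):
        super().__init__()
        self.entries = 0

    def startElement(self, name, attrs):
        if name == "entry":
            self.entries += 1


# The closing </root> is missing, as in a log that is still being written.
truncated = b"<root>\n<entry/>\n<entry/>\n"

handler = EntryCounter()
error = None
try:
    xml.sax.parse(io.BytesIO(truncated), handler)
except xml.sax.SAXParseException as exc:
    # Raised only at end of input, after both entries were already reported.
    error = exc

print(handler.entries, "entries seen before the parser gave up")
```

Note that the parse still starts at the first byte; nothing here skips ahead.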

So - what are your options?

- use separate output files for each entry, each well-formed in itself. Beware if you've got plenty of them
(a few thousand to millions) that some file systems might not deal well with that

- if you can keep the file open for reading all the time (because you are a kind of background process), you can read
the contents, create a buffer and search for start-tags in that yourself. Then you can snip out the necessary portions,
complete them with an XML header and feed them separately.

- if you can't keep it open, you can simulate that using the seek function

Both of the last options are somewhat cumbersome, as you have to do a lot of parsing yourself - exactly the work one
chose XML to avoid in the first place... From that follows the last advice:

- ditch XML. Either totally, or at least as format for the whole file. Instead, use some protocol like this:

--- begin ---
Chunk-Length: 100
<?xml version="1.0"?>
<root>... ( a 100 byte size xml document)
</root>
Chunk-Length: 200
<?xml version="1.0"?>
<root>... ( a 200 byte size xml document)
</root>
....

Then you can easily read through your document, skip unnecessary entries and extract the ones you want. Or, when keeping
the file open, know exactly what to read for the next chunk.
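
A rough sketch of such a chunked file in Python, under the assumption that each header is a single "Chunk-Length: N" line giving the size in bytes and that a newline follows each document. The names write_chunk and iter_chunks are invented for this example:

```python
import io

HEADER = b"Chunk-Length"


def write_chunk(stream, document):
    """Append one self-contained XML document, preceded by its byte length."""
    stream.write(HEADER + b": " + str(len(document)).encode("ascii") + b"\n")
    stream.write(document + b"\n")


def iter_chunks(stream, wanted=lambda index: True):
    """Yield the documents whose index passes `wanted`; seek past the rest."""
    index = 0
    while True:
        line = stream.readline()
        if not line:
            return                          # end of file
        if not line.strip():
            continue                        # blank separator between chunks
        name, _, value = line.partition(b":")
        if name.strip() != HEADER:
            raise ValueError("expected a Chunk-Length header, got %r" % line)
        length = int(value.decode("ascii").strip())
        if wanted(index):
            yield stream.read(length)
        else:
            stream.seek(length, io.SEEK_CUR)  # skip the body without reading it
        index += 1


log = io.BytesIO()
write_chunk(log, b'<?xml version="1.0"?>\n<root><entry n="1"/></root>')
write_chunk(log, b'<?xml version="1.0"?>\n<root><entry n="2"/></root>')
log.seek(0)

# Skip the first document entirely and extract only the second one.
last_only = list(iter_chunks(log, wanted=lambda index: index == 1))
print(last_only)
```

Each yielded chunk is a complete document, so it can be handed to any XML parser on its own.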

Diez
 
kepioo

Hi Diez,

Thank you for your answer. Let me give you more background on the
project.

The input XML I am parsing is always well formed. It comes from
another application that appends to this file. I haven't seen the source
code of the application, but I know that it is not rewriting the whole
file. I think it just removes the closing </root> tag, adds the new
tags and writes the </root> tag again.

We don't want to create new output files for every entry (each entry
is an event, and we have approximately 5 events per minute). So I
have to stick with this XML input file.

I guess I will parse it until I find the last reported event and update
the output XML from there, reporting only the events I am interested
in... I hope SAX won't take too much time to do all this (let's say 1
event = 10 tags, 5 events/minute, XML file running for 1 month -->
about 2,160,000 opening tags)...

What do you think?
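
Working that estimate through, assuming a 30-day month and counting all 10 tags per event as opening tags, gives a figure of roughly two million:

```python
TAGS_PER_EVENT = 10
EVENTS_PER_MINUTE = 5
MINUTES_PER_MONTH = 60 * 24 * 30   # assumption: a 30-day month

events = EVENTS_PER_MINUTE * MINUTES_PER_MONTH
opening_tags = events * TAGS_PER_EVENT

print(events, "events,", opening_tags, "opening tags")
```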
 
Diez B. Roggisch

kepioo said:
We don't want to create new output files for every entry (each entry
is an event, and we have approximately 5 events per minute). So I
have to stick with this XML input file.

Well, the overall amount of data won't change. But I can understand that
decision. However, you might consider using a file per day/week.

kepioo said:
I guess I will parse it until I find the last reported event and update
the output XML from there, reporting only the events I am interested
in... I hope SAX won't take too much time to do all this (let's say 1
event = 10 tags, 5 events/minute, XML file running for 1 month -->
about 2,160,000 opening tags)...

Use my suggested approach 2 - that boils down to using "seek" and some
hand-written parsing/buffering. A little bit nasty, but better than
consuming all of that file through SAX.
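
One way the seek idea could be sketched in Python, not a definitive implementation. It assumes the root tag is literally <root>, that the writer overwrites the closing </root> when it appends, and that the script stores a byte offset between runs. EntryCollector, read_new_entries and the id attribute are hypothetical names:

```python
import os
import tempfile
import xml.sax

OPEN, CLOSE = b"<root>", b"</root>"   # assumed root tag


class EntryCollector(xml.sax.ContentHandler):
    """Hypothetical handler: records the id attribute of every <entry>."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def startElement(self, name, attrs):
        if name == "entry":
            self.ids.append(attrs.get("id"))


def read_new_entries(path, offset):
    """Parse only the bytes after `offset`; return (ids, offset for next run)."""
    with open(path, "rb") as f:
        f.seek(offset)
        chunk = f.read()
    end = chunk.rfind(CLOSE)          # the writer overwrites from here next time
    if end == -1:
        end = len(chunk)
    body = chunk[:end]
    if offset == 0:                   # first run: drop everything up to <root>
        body = body[body.index(OPEN) + len(OPEN):]
    handler = EntryCollector()
    # Re-wrap the fragment so SAX sees a well-formed document.
    xml.sax.parseString(OPEN + body + CLOSE, handler)
    return handler.ids, offset + end


# Demo: simulate the other application appending between two runs.
path = os.path.join(tempfile.mkdtemp(), "events.xml")
with open(path, "wb") as f:
    f.write(b'<root>\n<entry id="1"/>\n<entry id="2"/>\n</root>')
first, offset = read_new_entries(path, 0)

with open(path, "r+b") as f:
    f.seek(offset)
    f.write(b'<entry id="3"/>\n</root>')
second, offset = read_new_entries(path, offset)

print(first, second)
```

The second run touches only the appended bytes. The sketch does not handle a fragment that is caught half-written.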

Diez
 
Peter Hansen

kepioo said:
The input XML I am parsing is always well formed. It comes from
another application that appends to this file. I haven't seen the source
code of the application, but I know that it is not rewriting the whole
file. I think it just removes the closing </root> tag, adds the new
tags and writes the </root> tag again.

If the writers had a clue, they probably just seek to the end of the
file minus len('</root>') (or whatever) and then overwrite with the new
entry and another </root> element. At least, that's what seemed like
the obvious approach when I had to do this once.

Not that this is particularly relevant to the problem. ;-)
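
For reference, that append trick might look like the following in Python. This is only a guess at what such a writer does, assuming the file ends with exactly </root> and no trailing newline; append_entry is an invented name:

```python
import os
import tempfile

CLOSING = b"</root>"


def append_entry(path, entry):
    """Overwrite the trailing </root> with `entry` followed by a new </root>."""
    with open(path, "r+b") as f:
        f.seek(-len(CLOSING), os.SEEK_END)
        if f.read() != CLOSING:
            raise ValueError("file does not end with the closing root tag")
        f.seek(-len(CLOSING), os.SEEK_END)
        f.write(entry + b"\n" + CLOSING)


path = os.path.join(tempfile.mkdtemp(), "log.xml")
with open(path, "wb") as f:
    f.write(b"<root>\n<entry/>\n</root>")

append_entry(path, b"<entry/>")

with open(path, "rb") as f:
    content = f.read()
print(content.decode("ascii"))
```

The file is well formed again after every call, and nothing before the old closing tag is rewritten.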

kepioo said:
I guess I will parse it until I find the last reported event and update
the output XML from there, reporting only the events I am interested
in... I hope SAX won't take too much time to do all this (let's say 1
event = 10 tags, 5 events/minute, XML file running for 1 month -->
about 2,160,000 opening tags)...

What do you think?

I think (guessing wildly) you probably have a fairly restricted number
of possibilities being written to this file, possibly as simple as the
somewhat stereotypical '<entry text="blah blah"/>' type of thing which
I've seen lots of times.

If so, you can simply treat this as a text file which you process
manually, in whatever direct and crude fashion works best, such as by
seeking 1000 chars back from the end (assuming new entries are always
less than that length), scanning for the last "<entry" string, and
slicing and dicing till you find the stuff you need.
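
A crude sketch of that tail scan in Python, assuming single-tag entries that start with "<entry", fit within the window, and contain no ">" inside attribute values. The function name last_entry is made up:

```python
import os
import tempfile


def last_entry(path, marker=b"<entry", window=1000):
    """Return the text of the last tag starting with `marker`, or None."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        f.seek(max(0, size - window))   # never seek before the start
        tail = f.read()
    start = tail.rfind(marker)
    if start == -1:
        return None                     # no entry within the window
    end = tail.find(b">", start)
    if end == -1:
        return None                     # entry is still being written
    return tail[start:end + 1]


path = os.path.join(tempfile.mkdtemp(), "log.xml")
with open(path, "wb") as f:
    f.write(b'<root>\n<entry text="first"/>\n<entry text="last"/>\n</root>')

found = last_entry(path)
print(found)
```

Only the last `window` bytes are ever read, however large the file has grown.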

In other words, screw SAX, just grab the data directly and forget about
all those silly well-formed XML issues etc. Go for the simplest thing
that could possibly work, and if you don't need the complexity of SAX,
don't use it.

-Peter
 
kepioo

Thanks Diez for your suggestion, I'll look around to find out more
about the seek function (I learnt Python 2 weeks ago and I do not have
a programming background, but so far I am doing well).

Peter,

I cannot really follow your advice: the entries are not that
stereotypical... We built a data structure for the XML and we report various
types of events, always in the same format but with different
content types.

The script I am writing aims at picking only special events
(identified by a route tag and an information tag).

Anyway, thank you for your advice!
 
Peter Hansen

kepioo said:
Peter,

I cannot really follow your advice: the entries are not that
stereotypical... We built a data structure for the XML and we report various
types of events, always in the same format but with different
content types.

The script I am writing aims at picking only special events
(identified by a route tag and an information tag).

Can you post one or two small examples that show the range of
possibilities? I still have this feeling there will be a simpler
approach than really parsing the XML, but maybe I'm wrong.

-Peter
 
kepioo

An example (I changed the content to make it easier):

################### input file ####################

<root>
<case>
<TimeStamp Date="Mon Feb 20 19:40:28 SGT 2006" >
<Message>fruits</Message>
<Elements>
<Element name="apple">5</Element>
<Element name="banana">10</Element>
<Element name="peach">25</Element>
</Elements>
</TimeStamp>
</case>

<case>
<TimeStamp Date="Mon Feb 20 19:45:28 SGT 2006" >
<Message>names</Message>
<Elements>
<Element name="CEO">vincent</Element>
<Element name="Analysit">Robert</Element>
</Elements>
</TimeStamp>
</case>

<case>
<TimeStamp Date="Mon Feb 20 19:50:28 SGT 2006" >
<Message>open the car</Message>
</TimeStamp>
</case>


<case>
<TimeStamp Date="Mon Feb 20 19:55:28 SGT 2006" >
<Message>fruits</Message>
<Elements>
<Element name="peach">25</Element>
<Element name="apple">8</Element>
<Element name="cherry">120</Element>
</Elements>
</TimeStamp>
</case>
</root>
##############################################

The script I want to write has to track any change in the input
file (what we want to track is given as parameters in the script; here, for
instance, the number of apples and cherries). The output file for this
example would be (we write it as a stream):

################### Output file #################################
<track>
<case>
<TimeStamp Date="Mon Feb 20 19:40:28 SGT 2006" >
<Message>fruits</Message>
<Elements>
<Element name="apple">5</Element>
</Elements>
</TimeStamp>
</case>

<case>
<TimeStamp Date="Mon Feb 20 19:55:28 SGT 2006" >
<Message>fruits</Message>
<Elements>
<Element name="apple">8</Element>
</Elements>
</TimeStamp>
</case>

<case>
<TimeStamp Date="Mon Feb 20 19:55:28 SGT 2006" >
<Message>fruits</Message>
<Elements>
<Element name="cherry">120</Element>
</Elements>
</TimeStamp>
</case>
</track>
############################################
The input file keeps being generated. The output file is generated on
request. Both are stream-based: we append to the end of the file.
 

Any suggestion?
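
As a starting point, the filtering step itself could be sketched with a small SAX handler. This is an illustration only: the class name ChangeTracker is invented, and it assumes a "change" means that a tracked element's value differs from the last value seen for that name (a first appearance counts), which is what the example output suggests. The sample is a shortened copy of the input above:

```python
import xml.sax

SAMPLE = b"""<root>
<case>
<TimeStamp Date="Mon Feb 20 19:40:28 SGT 2006" >
<Message>fruits</Message>
<Elements>
<Element name="apple">5</Element>
<Element name="banana">10</Element>
<Element name="peach">25</Element>
</Elements>
</TimeStamp>
</case>
<case>
<TimeStamp Date="Mon Feb 20 19:55:28 SGT 2006" >
<Message>fruits</Message>
<Elements>
<Element name="peach">25</Element>
<Element name="apple">8</Element>
<Element name="cherry">120</Element>
</Elements>
</TimeStamp>
</case>
</root>"""


class ChangeTracker(xml.sax.ContentHandler):
    """Hypothetical handler: records tracked elements whose value changed."""

    def __init__(self, tracked):
        super().__init__()
        self.tracked = set(tracked)
        self.last = {}        # element name -> last value seen
        self.changes = []     # (date, message, name, value) tuples
        self.date = None
        self.message = ""
        self.name = None
        self.text = []

    def startElement(self, name, attrs):
        self.text = []
        if name == "TimeStamp":
            self.date = attrs.get("Date")
        elif name == "Element":
            self.name = attrs.get("name")

    def characters(self, content):
        self.text.append(content)

    def endElement(self, name):
        value = "".join(self.text).strip()
        if name == "Message":
            self.message = value
        elif name == "Element" and self.name in self.tracked:
            if self.last.get(self.name) != value:
                self.last[self.name] = value
                self.changes.append((self.date, self.message, self.name, value))
        self.text = []


tracker = ChangeTracker({"apple", "cherry"})
xml.sax.parseString(SAMPLE, tracker)
for change in tracker.changes:
    print(change)
```

Each tuple in tracker.changes corresponds to one <case> block of the output file; writing those blocks out, and combining this with a seek-based read so that only new data is parsed, is left open here.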
 
