SAX/Python : read an xml from the end to the top

Discussion in 'Python' started by kepioo, Mar 7, 2006.

  1. kepioo

    kepioo Guest

    I currently have an xml input file containing lots of data. My objective
    is to write a script that reports in another xml file only the data I
    am interested in. Doing this is really easy using SAX.

    The input file is continuously updated. However, the other xml file
    should be updated only on request.

    Every time we run the script, we track the new elements in the input
    file and report them in the output file.

    My idea was to :
    _ detect in the output file the last event reported
    _ read the input file from the end
    _ report all the new events ( since the last time the script was run).



    Question: Is it possible to read an XML file and process it from the
    end to the beginning, using SAX?
     
    kepioo, Mar 7, 2006
    #1

  2. kepioo wrote:
    > I currently have an xml input file containing lots of data. My objective
    > is to write a script that reports in another xml file only the data I
    > am interested in. Doing this is really easy using SAX.
    >
    > The input file is continuously updated. However, the other xml file
    > should be updated only on request.
    >
    > Every time we run the script, we track the new elements in the input
    > file and report them in the output file.
    >
    > My idea was to :
    > _ detect in the output file the last event reported
    > _ read the input file from the end
    > _ report all the new events ( since the last time the script was run).
    >
    >
    >
    > Question: Is it possible to read an XML file and process it from the
    > end to the beginning, using SAX?


    No. And in no other XML-related technology I know of.

    Generally speaking, I'd say your approach is inherently flawed. XML as a language requires well-formed documents to have
    exactly one root element. This makes it unsuitable
    for e.g. logging-files, as these have no explicit "end" - except the implicit last log-entry. So you will always have
    something like this:

    --- begin ---
    <root>
    <entry/>
    <entry/>
    --- end ---

    I don't know _what_ you do, but unless you always rewrite the whole XML file from scratch, you can't possibly write that
    closing end tag. So you end up with a malformed XML document. Or you _do_ rewrite the whole file each time, but then
    you'd be able to reverse the order of the elements so that the last one comes first. I doubt the latter, though, as it
    imposes a great performance bottleneck for little gain.

    SAX won't puke on you for your file being malformed, as it only learns about that when it is too late. So you might use
    it, since by the time that happens you have already finished your actual task.

    But you will always have to parse it from the beginning, to catch the document header, and there is no fast-forward
    built into SAX.
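
    To illustrate the point about SAX only noticing the missing end tag at the very end, here is a minimal sketch using
    Python's standard xml.sax module. The names EntryCounter and count_entries are made up for this example, and it
    assumes the entries are <entry> elements as in the snippet above.

```python
import io
import xml.sax


class EntryCounter(xml.sax.ContentHandler):
    """Counts <entry> elements as the parser streams through the input."""

    def __init__(self):
        super().__init__()
        self.entries = 0

    def startElement(self, name, attrs):
        if name == "entry":
            self.entries += 1


def count_entries(data):
    """Return (entries seen, whether the document was well-formed)."""
    handler = EntryCounter()
    try:
        xml.sax.parse(io.BytesIO(data), handler)
        return handler.entries, True
    except xml.sax.SAXParseException:
        # For a missing end tag the error is raised only when the parser
        # reaches the end of the input, after the entries were reported.
        return handler.entries, False


# A log-style file whose closing </root> tag has not been written.
truncated = b"<root>\n<entry/>\n<entry/>\n"
print(count_entries(truncated))
```

    Both entries are handed to the handler before the parser complains, which is why the approach can work despite the
    file not being well-formed.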

    So - what are your options?

    - use separate output files for each entry, each well-formed in itself. Beware if you've got plenty of them
    (a few thousand to millions) that some file systems might not deal well with that

    - if you can keep the file open for reading all the time (because you are kind of a background process), you can read
    the contents, create a buffer and search for start tags in that yourself. Then you can snip out the necessary portions,
    complete them with an XML header and feed them to the parser separately.

    - if you can't keep it open, you can simulate that using the file's seek function

    The last two options are somewhat cumbersome, as you have to do a lot of parsing yourself, which is exactly what one
    chose XML to avoid in the first place... From that follows the last piece of advice:

    - ditch XML. Either totally, or at least as format for the whole file. Instead, use some protocol like this:

    --- begin ---
    Chunk-Length: 100
    <?xml version="1.0"?>
    <root>... ( a 100 byte size xml document)
    </root>
    Chunk-Length: 200
    <?xml version="1.0"?>
    <root>... ( a 200 byte size xml document)
    </root>
    ....
    --- end ---

    Then you can easily read through your document, skip unnecessary entries and extract the ones you want. Or, when keeping
    the file open, know exactly what to read for the next chunk.
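
    A rough sketch of how such a chunked file could be written and read in Python. The exact framing (a
    "Chunk-Length: N" line, then N bytes of XML, then a newline) and the names append_chunk and iter_chunks are
    assumptions for illustration, not an established format.

```python
import io

HEADER = b"Chunk-Length: "


def append_chunk(stream, document):
    """Write one length-prefixed XML document to the stream."""
    stream.write(HEADER + str(len(document)).encode("ascii") + b"\n")
    stream.write(document)
    stream.write(b"\n")


def iter_chunks(stream):
    """Yield each complete XML document found in the stream, in order."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.startswith(HEADER):
            raise ValueError("expected a Chunk-Length header")
        length = int(line[len(HEADER):].decode("ascii").strip())
        document = stream.read(length)
        if len(document) < length:
            return  # the writer has not finished this chunk yet
        stream.readline()  # consume the newline that ends the chunk
        yield document


# Demo with an in-memory file standing in for the real log.
log = io.BytesIO()
append_chunk(log, b'<?xml version="1.0"?>\n<root><entry id="1"/></root>')
append_chunk(log, b'<?xml version="1.0"?>\n<root><entry id="2"/></root>')
log.seek(0)
for chunk in iter_chunks(log):
    print(chunk.decode("ascii"))
```

    Each yielded chunk is a complete document, so it can be handed to xml.sax.parseString on its own. To skip a chunk
    without reading it, one could seek forward by its length plus one instead of calling read.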

    Diez
     
    Diez B. Roggisch, Mar 7, 2006
    #2

  3. kepioo

    kepioo Guest

    Hi Diez,

    thank you for your answer. Let me give you more background on the
    project.

    The input xml I am parsing is always well formed. It comes from
    another application that appends to this xml. I haven't seen the source
    code of the application, but I know that it is not rewriting the whole
    xml. I think it just removes the closing root tag, adds the new
    tags and writes the </root> tag again.

    We don't want to create new output files for every entry (each entry
    is an event, and we have approximately 5 events per minute). So I
    have to stick with this xml input file.

    I guess I will parse it until I find the last reported event and update
    the output xml from there, reporting only the events I am interested
    in. I hope SAX won't take too much time to do all this (let's say 1
    event = 10 tags, 5 events/minute, xml file running for 1 month -->
    about 2,160,000 opening tags: 10 x 5 x 60 x 24 x 30)...

    What do you think?
     
    kepioo, Mar 7, 2006
    #3
  4. kepioo wrote:
    > We don't want to create new output files for every entry (each entry
    > is an event, and we have approximately 5 events per minute). So I
    > have to stick with this xml input file.


    Well, the overall amount of data won't change. But I can understand that
    decision. However, you might consider using a file per day/week.

    > I guess I will parse it until I find the last reported event and update
    > the output xml from there, reporting only the events I am interested
    > in. I hope SAX won't take too much time to do all this (let's say 1
    > event = 10 tags, 5 events/minute, xml file running for 1 month -->
    > about 2,160,000 opening tags: 10 x 5 x 60 x 24 x 30)...


    Use my suggested approach 2 - that boils down to using "seek" and some
    hand-written parsing/buffering. A little bit nasty, but better than
    consuming all of that file through sax.
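
    A sketch of what the seek-based approach might look like, under assumptions not confirmed in this thread: the
    writer only ever inserts new <case> elements right before the closing </root> tag, the file is UTF-8, and the
    script remembers the byte offset where it stopped on the previous run. The names CaseTexts, read_new_fragment and
    parse_fragment are invented for the example.

```python
import io
import xml.sax

CLOSING_TAG = b"</root>"


class CaseTexts(xml.sax.ContentHandler):
    """Collects the text content of every <case> element."""

    def __init__(self):
        super().__init__()
        self.texts = []
        self._buffer = None

    def startElement(self, name, attrs):
        if name == "case":
            self._buffer = []

    def characters(self, content):
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "case":
            self.texts.append("".join(self._buffer))
            self._buffer = None


def read_new_fragment(stream, offset):
    """Return (new bytes before the closing tag, offset for the next run)."""
    stream.seek(offset)
    tail = stream.read().rstrip()
    if not tail.endswith(CLOSING_TAG):
        # The writer may be in the middle of an update; try again later.
        return b"", offset
    body = tail[:-len(CLOSING_TAG)]
    return body, offset + len(body)


def parse_fragment(body):
    """Wrap the fragment in a synthetic root so SAX sees a full document."""
    handler = CaseTexts()
    xml.sax.parseString(b"<fragment>" + body + b"</fragment>", handler)
    return handler.texts


# Demo with an in-memory file; a real script would open the log in "rb" mode.
log = io.BytesIO(b"<root>\n<case>a</case>\n</root>\n")
offset = len(b"<root>")  # first run: start right after the opening root tag
body, offset = read_new_fragment(log, offset)
first_run = parse_fragment(body)

# Simulate the other application overwriting </root> with a new entry.
log.seek(offset)
log.write(b"<case>b</case>\n</root>\n")
body, offset = read_new_fragment(log, offset)
second_run = parse_fragment(body)
print(first_run, second_run)
```

    The second run only reads the bytes added since the first one, so the cost depends on the number of new events
    rather than on the size of the whole file. The offset would have to be stored somewhere between runs, for
    instance next to the output file.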

    Diez
     
    Diez B. Roggisch, Mar 7, 2006
    #4
  5. Peter Hansen

    Peter Hansen Guest

    kepioo wrote:
    > The input xml I am parsing is always well formed. It comes from
    > another application that appends to this xml. I haven't seen the source
    > code of the application, but I know that it is not rewriting the whole
    > xml. I think it just removes the closing root tag, adds the new
    > tags and writes the </root> tag again.


    If the writers had a clue, they probably just seek to the end of the
    file minus len('</root>') (or whatever) and then overwrite with the new
    entry and another </root> element. At least, that's what seemed like
    the obvious approach when I had to do this once.
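
    For what it's worth, the seek-and-overwrite trick described above might look roughly like this. It is only a
    sketch: the helper name append_entry is made up, and it assumes the file ends with exactly "</root>" plus a
    newline.

```python
import io
import os

CLOSING_TAG = b"</root>\n"


def append_entry(stream, entry):
    """Overwrite the trailing closing tag with a new entry plus the tag."""
    stream.seek(-len(CLOSING_TAG), os.SEEK_END)
    if stream.read(len(CLOSING_TAG)) != CLOSING_TAG:
        raise ValueError("file does not end with the expected closing tag")
    # Go back to where the closing tag started and write over it.
    stream.seek(-len(CLOSING_TAG), os.SEEK_END)
    stream.write(entry + CLOSING_TAG)


# Demo with an in-memory file; a real file would be opened in "r+b" mode.
log = io.BytesIO(b"<root>\n<entry/>\n</root>\n")
append_entry(log, b'<entry text="new"/>\n')
print(log.getvalue().decode("ascii"))
```

    The file stays well-formed after every call, and nothing before the old closing tag is rewritten.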

    Not that this is particularly relevant to the problem. ;-)

    > I guess I will parse it until I find the last reported event and update
    > the output xml from there, reporting only the events I am interested
    > in. I hope SAX won't take too much time to do all this (let's say 1
    > event = 10 tags, 5 events/minute, xml file running for 1 month -->
    > about 2,160,000 opening tags: 10 x 5 x 60 x 24 x 30)...
    >
    > What do you think?


    I think (guessing wildly) you probably have a fairly restricted number
    of possibilities being written to this file, possibly as simple as the
    somewhat stereotypical '<entry text="blah blah"/>' type of thing which
    I've seen lots of times.

    If so, you can simply treat this as a text file which you process
    manually, in whatever direct and crude fashion works best, such as by
    seeking 1000 chars back from the end (assuming new entries are always
    less than that length), scanning for the last "<entry" string, and
    slicing and dicing till you find the stuff you need.
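
    A crude sketch of that idea, assuming self-closing entries of the '<entry text="blah blah"/>' kind whose
    attribute values never contain "/>". The function name last_entry and the default window of 1000 bytes are
    just illustrative choices.

```python
import io
import os


def last_entry(stream, window=1000):
    """Return the last '<entry .../>' found in the final bytes of the file."""
    stream.seek(0, os.SEEK_END)
    size = stream.tell()
    # Jump back at most `window` bytes from the end, never before the start.
    stream.seek(max(0, size - window))
    tail = stream.read()
    start = tail.rfind(b"<entry")
    if start == -1:
        return None  # no entry begins inside the window
    end = tail.find(b"/>", start)
    if end == -1:
        return None  # the entry is not complete yet
    return tail[start:end + 2]


# Demo with an in-memory file standing in for the real log.
log = io.BytesIO(b'<root>\n<entry text="one"/>\n<entry text="two"/>\n</root>\n')
print(last_entry(log))
```

    No XML parser is involved at all, so this only holds up as long as the entries really are that regular.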

    In other words, screw SAX, just grab the data directly and forget about
    all those silly well-formed XML issues etc. Go for the simplest thing
    that could possibly work, and if you don't need the complexity of SAX,
    don't use it.

    -Peter
     
    Peter Hansen, Mar 7, 2006
    #5
  6. kepioo

    kepioo Guest

    Thanks Diez for your suggestion, I'll look around to find out more
    about the seek function (I learned Python 2 weeks ago and I do not have
    a programming background, but so far I am doing well).

    Peter,

    I cannot really follow your advice: the entries are not that
    stereotypical. We built a data structure for the xml and we report
    various types of events, always in the same format but with different
    content types.

    The script I am writing aims at picking only special events
    (identified by a route tag and an information tag).

    Anyway, thank you for your advice!
     
    kepioo, Mar 7, 2006
    #6
  7. Peter Hansen

    Peter Hansen Guest

    kepioo wrote:
    > Peter,
    >
    > I cannot really follow your advice: the entries are not that
    > stereotypical. We built a data structure for the xml and we report
    > various types of events, always in the same format but with different
    > content types.
    >
    > The script I am writing aims at picking only special events
    > (identified by a route tag and an information tag).


    Can you post one or two small examples that show the range of
    possibilities? I still have this feeling there will be a simpler
    approach than really parsing the XML, but maybe I'm wrong.

    -Peter
     
    Peter Hansen, Mar 7, 2006
    #7
  8. kepioo

    kepioo Guest

    An example (I changed the content to make it simpler):

    ################### input file ####################

    <root>
    <case>
    <TimeStamp Date="Mon Feb 20 19:40:28 SGT 2006" >
    <Message>fruits</Message>
    <Elements>
    <Element name="apple">5</Element>
    <Element name="banana">10</Element>
    <Element name="peach">25</Element>
    </Elements>
    </TimeStamp>
    </case>

    <case>
    <TimeStamp Date="Mon Feb 20 19:45:28 SGT 2006" >
    <Message>names</Message>
    <Elements>
    <Element name="CEO">vincent</Element>
    <Element name="Analysit">Robert</Element>
    </Elements>
    </TimeStamp>
    </case>

    <case>
    <TimeStamp Date="Mon Feb 20 19:50:28 SGT 2006" >
    <Message>open the car</Message>
    </TimeStamp>
    </case>


    <case>
    <TimeStamp Date="Mon Feb 20 19:55:28 SGT 2006" >
    <Message>fruits</Message>
    <Elements>
    <Element name="peach">25</Element>
    <Element name="apple">8</Element>
    <Element name="cherry">120</Element>
    </Elements>
    </TimeStamp>
    </case>
    </root>
    ##############################################

    The script I want to write has to track any change in the input
    file (what we want to track is given as parameters to the script; here,
    for instance, the number of apples and cherries). The output file for
    this example would be (we write it as a stream):

    ################### OutPut file #################################
    <track>
    <case>
    <TimeStamp Date="Mon Feb 20 19:40:28 SGT 2006" >
    <Message>fruits</Message>
    <Elements>
    <Element name="apple">5</Element>
    </Elements>
    </TimeStamp>
    </case>

    <case>
    <TimeStamp Date="Mon Feb 20 19:55:28 SGT 2006" >
    <Message>fruits</Message>
    <Elements>
    <Element name="apple">8</Element>
    </Elements>
    </TimeStamp>
    </case>

    <case>
    <TimeStamp Date="Mon Feb 20 19:55:28 SGT 2006" >
    <Message>fruits</Message>
    <Elements>
    <Element name="cherry">120</Element>
    </Elements>
    </TimeStamp>
    </case>
    </track>
    ############################################

    The input file keeps being generated. The output file is generated on
    request. Both are stream-based: we append to the end of the file.
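
    One way the filtering described above might be sketched with SAX in Python. The names TrackedElements, track and
    format_case are invented for this sketch, and the sample is a shortened copy of the input shown earlier; it is not
    a finished script and ignores the "only new events" part.

```python
import xml.sax
from xml.sax.saxutils import escape, quoteattr


class TrackedElements(xml.sax.ContentHandler):
    """Collects (date, message, name, value) for each tracked <Element>."""

    def __init__(self, tracked):
        super().__init__()
        self.tracked = set(tracked)
        self.matches = []
        self._date = ""
        self._message = ""
        self._name = None   # name of the tracked Element being read, if any
        self._text = None   # text pieces of the Message/Element being read

    def startElement(self, name, attrs):
        if name == "TimeStamp":
            self._date = attrs.get("Date", "")
            self._message = ""
        elif name == "Message":
            self._text = []
        elif name == "Element" and attrs.get("name") in self.tracked:
            self._name = attrs.get("name")
            self._text = []

    def characters(self, content):
        if self._text is not None:
            self._text.append(content)

    def endElement(self, name):
        if name == "Message":
            self._message = "".join(self._text)
            self._text = None
        elif name == "Element" and self._name is not None:
            value = "".join(self._text)
            self.matches.append((self._date, self._message, self._name, value))
            self._name = None
            self._text = None


def track(data, tracked):
    """Parse the input bytes and return the matches for the tracked names."""
    handler = TrackedElements(tracked)
    xml.sax.parseString(data, handler)
    return handler.matches


def format_case(date, message, name, value):
    """Render one match in the layout of the output file shown above."""
    return (
        "<case>\n<TimeStamp Date=%s >\n<Message>%s</Message>\n<Elements>\n"
        "<Element name=%s>%s</Element>\n</Elements>\n</TimeStamp>\n</case>\n"
    ) % (quoteattr(date), escape(message), quoteattr(name), escape(value))


SAMPLE = b"""<root>
<case>
<TimeStamp Date="Mon Feb 20 19:40:28 SGT 2006" >
<Message>fruits</Message>
<Elements>
<Element name="apple">5</Element>
<Element name="banana">10</Element>
</Elements>
</TimeStamp>
</case>
<case>
<TimeStamp Date="Mon Feb 20 19:55:28 SGT 2006" >
<Message>fruits</Message>
<Elements>
<Element name="peach">25</Element>
<Element name="apple">8</Element>
<Element name="cherry">120</Element>
</Elements>
</TimeStamp>
</case>
</root>
"""

matches = track(SAMPLE, {"apple", "cherry"})
print("<track>\n" + "".join(format_case(*m) for m in matches) + "</track>")
```

    Each tracked element ends up in its own <case>, as in the output file above. Restricting this to events newer than
    the last reported one would still need something extra, such as comparing the Date attribute or remembering a file
    offset.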
     
    kepioo, Mar 8, 2006
    #8
  9. kepioo

    kepioo Guest

    Any suggestion?
     
    kepioo, Mar 8, 2006
    #9