File parser

Discussion in 'Python' started by Angelic Devil, Aug 30, 2005.

  1. I'm building a file parser but I have a problem I'm not sure how to
    solve. The files this will parse have the potential to be huge
    (multiple GBs). There are distinct sections of the file that I
    want to read into separate dictionaries to perform different
    operations on. Each section has specific begin and end statements
    like the following:

    KEYWORD
    ..
    ..
    ..
    END KEYWORD

    The very first thing I do is read the entire file contents into a
    string. I then store the contents in a list, splitting on line ends
    as follows:


    file_lines = file_contents.split('\n')


    Next, I build smaller lists from the different sections using the
    begin and end keywords:


    begin_index = file_lines.index(begin_keyword)
    end_index = file_lines.index(end_keyword)
    small_list = file_lines[begin_index + 1 : end_index]


    I then plan on parsing each list to build the different dictionaries.
    The problem is that one begin statement is a substring of another
    begin statement as in the following example:


    BAR
    END BAR

    FOOBAR
    END FOOBAR


    I can't just look for the line in the list that contains BAR because
    FOOBAR might come first in the list. My list would then look like

    [foobar_1, foobar_2, ..., foobar_n, ..., bar_1, bar_2, ..., bar_m]

    I don't really want to use regular expressions, but I don't see a way
    to get around this without doing so. Does anyone have any suggestions
    on how to accomplish this? If regexps are the way to go, is there an
    efficient way to parse the contents of a potentially large list using
    regular expressions?
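One regex-free way around the substring clash, sketched below, is to compare whole stripped lines for equality rather than testing containment; the function name and sample data here are illustrative, not from the original files:

```python
# Exact-match section extraction: comparing whole stripped lines
# means "BAR" never matches inside "FOOBAR".
def extract_section(lines, keyword):
    """Collect the lines between `keyword` and 'END keyword' (exclusive)."""
    section = []
    inside = False
    for line in lines:
        stripped = line.strip()
        if not inside:
            if stripped == keyword:          # exact match, not substring
                inside = True
        elif stripped == 'END ' + keyword:
            return section
        else:
            section.append(stripped)
    return section

sample = ['FOOBAR', 'f1', 'f2', 'END FOOBAR', 'BAR', 'b1', 'END BAR']
# extract_section(sample, 'BAR') skips the FOOBAR block entirely
```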

    Any help is appreciated!

    Thanks,
    Aaron

    --
    "Tis better to be silent and be thought a fool, than to speak and
    remove all doubt."
    -- Abraham Lincoln
    Angelic Devil, Aug 30, 2005
    #1

  2. Angelic Devil

    William Park Guest

    William Park, Aug 30, 2005
    #2

  3. Angelic Devil

    Rune Strand Guest

    It's not clear to me from your posting what possible order the tags may
    be in. Assuming you always END a section before beginning a new one,
    e.g.

    it's always:

    A
    some A-section lines.
    END A

    B
    some B-section lines.
    END B

    etc.

    And never:

    A
    some A-section lines.
    B
    some B-section lines.
    END B
    END A

    etc.

    it should be fairly simple. And if the file is several GB, you ought
    to use a generator in order to overcome the memory problem.

    Something like this:


    def make_tag_lookup(begin_tags):
        # create a dict mapping each begin_tag to its end_tag
        end_tags = [('END ' + begin_tag) for begin_tag in begin_tags]
        return dict(zip(begin_tags, end_tags))


    def return_sections(filepath, lookup):
        # Generator yielding each section as a list of stripped lines.
        # Iterating over the file object directly (rather than calling
        # readlines(), which slurps the whole file into memory) keeps
        # memory use flat even for multi-GB input.
        inside_section = False

        for line in open(filepath, 'r'):
            line = line.strip()
            if not inside_section:
                if line in lookup:
                    inside_section = True
                    data_section = []
                    section_end_tag = lookup[line]
                    section_begin_tag = line
                    data_section.append(line)  # store section start tag
            else:
                if line == section_end_tag:
                    data_section.append(line)  # store section end tag
                    inside_section = False
                    yield data_section  # yield entire section
                else:
                    data_section.append(line)  # store each line within section


    # create the generator yielding each section
    sections = return_sections(datafile,
                               make_tag_lookup(list_of_begin_tags))

    for section in sections:
        for line in section:
            print line
        print '\n'
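Once the generator yields a section, turning it into one of the dictionaries the original poster wants could look roughly like this. The `name = value` line format inside each section is an assumption for illustration; the thread never shows the real section contents:

```python
# Sketch: convert a yielded section (begin tag, body lines, end tag)
# into a (tag, dict) pair, assuming body lines look like "name = value".
def section_to_dict(section):
    tag = section[0]                 # first line is the begin tag
    entries = {}
    for line in section[1:-1]:       # skip the begin/end tag lines
        name, _, value = line.partition('=')
        entries[name.strip()] = value.strip()
    return tag, entries

sample_section = ['BAR', 'x = 1', 'y = 2', 'END BAR']
tag, d = section_to_dict(sample_section)
```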
    Rune Strand, Aug 30, 2005
    #3
  4. Angelic Devil

    MrJean1 Guest

    Take a closer look at SimpleParse/mxTextTools

    <http://www.python.org/pypi/SimpleParse/2.0.1a3>

    We have used these to parse log files of several hundred MB with simple
    and complex grammars of up to 250+ productions. Highly recommended.

    /Jean Brouwers

    PS) For an introduction see also this story
    <http://www-128.ibm.com/developerworks/linux/library/l-simple.html>
    MrJean1, Aug 30, 2005
    #4
  5. Angelic Devil

    infidel Guest

    Angelic Devil wrote:
    > [snip -- original question quoted in full above]


    Some time ago I was toying around with writing a tool in Python to
    parse our VB6 code (the original idea was to write our own .NET
    conversion tool because the Wizard that comes with VS.NET sucks hard on
    some things). I tried various parsing tools and EBNF grammars but VB6
    isn't really an EBNF-esque syntax in all cases, so I needed something
    else. VB6 syntax is similar to what you have, with all kinds of
    different "Begin/End" blocks, and some files can be rather big. Also,
    when you get to conditionals and looping constructs you can have
    seriously nested logic, so the approach I took was to imitate a SAX
    parser. I created a class that reads VB6 source line by line, and
    calls empty "event handler" methods (just like SAX) such as
    self.begin_type or self.begin_procedure and self.end_type or
    self.end_procedure. Then I created a subclass that actually
    implemented those event handlers by building a sort of tree that
    represents the program in a more abstract fashion. I never got to the
    point of writing the tree out in a new language, but I had fun hacking
    on the project for a while. I think a similar approach could work for
    you here.
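That SAX-style approach might be sketched as follows; the handler names, keywords, and counting subclass are illustrative, not taken from infidel's actual tool:

```python
# Sketch of a SAX-like line parser: the base class fires empty
# begin_*/end_* "event handler" methods as it scans lines; a subclass
# overrides the handlers it cares about. Keywords here are illustrative.
class BlockParser(object):
    KEYWORDS = ('TYPE', 'PROCEDURE')

    def parse(self, lines):
        for line in lines:
            word = line.strip()
            for kw in self.KEYWORDS:
                if word == kw:
                    getattr(self, 'begin_' + kw.lower())()
                elif word == 'END ' + kw:
                    getattr(self, 'end_' + kw.lower())()

    # empty event handlers, SAX-style
    def begin_type(self): pass
    def end_type(self): pass
    def begin_procedure(self): pass
    def end_procedure(self): pass

class CountingParser(BlockParser):
    # example subclass: counts how many TYPE blocks begin
    def __init__(self):
        self.types = 0
    def begin_type(self):
        self.types += 1

p = CountingParser()
p.parse(['TYPE', 'x As Integer', 'END TYPE', 'TYPE', 'END TYPE'])
```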
    infidel, Aug 30, 2005
    #5
  6. infidel wrote:

    >Some time ago I was toying around with writing a tool in python to
    >parse our VB6 code (the original idea was to write our own .NET
    >conversion tool because the Wizard that comes with VS.NET sucks hard on
    >some things). I tried various parsing tools and EBNF grammars but VB6
    >isn't really an EBNF-esque syntax in all cases, so I needed something
    >else.

    You may find this project interesting to play with:
    http://vb2py.sourceforge.net/index.html

    Have fun,
    Mike

    --
    ________________________________________________
    Mike C. Fletcher
    Designer, VR Plumber, Coder
    http://www.vrplumber.com
    http://blog.vrplumber.com
    Mike C. Fletcher, Aug 30, 2005
    #6
  7. "Rune Strand" <> writes:


    Thanks. This shows definite promise. I've already tailored it for
    what I need, and it appears to be working.


    --
    "Society in every state is a blessing, but Government, even in its best
    state, is but a necessary evil; in its worst state, an intolerable one."
    -- Thomas Paine
    Angelic Devil, Aug 30, 2005
    #7
