Parsing a file based on differing delimiters

Discussion in 'Python' started by Kylotan, Oct 21, 2003.

  1. Kylotan

    Kylotan Guest

    I have a text file where the fields are delimited in various different
    ways. For example, strings are terminated with a tilde, numbers are
    terminated with whitespace, and some identifiers are terminated with a
    newline. This means I can't effectively use split() except on a small
    scale. For most of the file I can just call one of several functions I
    wrote that read in just as much data as is required from the input
    string, and return the value and modified string. Much of the code
    therefore looks like this:

    filedata = file('whatever').read()
    firstWord, filedata = GetWord(filedata)
    nextNumber, filedata = GetNumber(filedata)

    This works, but is obviously ugly. Is there a cleaner alternative that
    avoids me having to re-assign the data all the time, something that
    will 'consume' the value from the stream? I'm a bit unclear on the
    whole passing by value/reference thing. I'm guessing that while GetWord
    gets a reference to the 'filedata' string, assigning to that will just
    reseat the reference and not change the original string.

    The other problem is that parts of the format are potentially repeated
    an arbitrary number of times and therefore a degree of lookahead is
    required. If I've already extracted a token and then find out I need
    it, putting it back is awkward. Yet there is nowhere near enough
    complexity or repetition in the file format to justify a formal
    grammar or anything like that.

    All in all, in the basic parsing code I am doing a lot more operations
    on the input data than I would like. I can see how I'd encapsulate
    this behind functions if I was willing to iterate through the data
    character by character like I would in C++. But I am hoping that
    Python can, as usual, save me from the majority of this drudgery
    somehow.

    Any help appreciated.

    --
    Ben Sizer
     
    Kylotan, Oct 21, 2003
    #1

  2. Alex Martelli

    Alex Martelli Guest

    Kylotan wrote:

    > I have a text file where the fields are delimited in various different
    > ways. For example, strings are terminated with a tilde, numbers are
    > terminated with whitespace, and some identifiers are terminated with a


    What sadist designed it?-) Anyway...

    I suggest a simple class which holds the filedata and an index into
    it. Your functions such as GetWord(f) examine f.data from f.index
    onwards, and increment f.index before returning the result. To
    "push back", you just decrement f.index again (you may want to
    keep a small stack of values, perhaps just one, for the "undo",
    again in the simple class in question).
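    Concretely, a minimal sketch of such a class might look like the
    following (the Cursor name and the GetWord tokenizer here are
    illustrative, not your actual functions):

```python
class Cursor:
    """Holds the file data plus a current index into it."""
    def __init__(self, data):
        self.data = data
        self.index = 0
        self.marks = []          # small stack of saved positions for "undo"

    def mark(self):
        self.marks.append(self.index)

    def unget(self):
        self.index = self.marks.pop()

def GetWord(f):
    # Skip leading whitespace, then consume up to the next whitespace.
    while f.index < len(f.data) and f.data[f.index].isspace():
        f.index += 1
    f.mark()                     # remember where this token started
    start = f.index
    while f.index < len(f.data) and not f.data[f.index].isspace():
        f.index += 1
    return f.data[start:f.index]

f = Cursor("alpha beta gamma")
first = GetWord(f)   # "alpha"
second = GetWord(f)  # "beta"
f.unget()            # push "beta" back
again = GetWord(f)   # "beta" again
```

    Since the functions mutate the shared Cursor object rather than
    rebinding a string, no re-assignment is needed at the call site.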


    Alex
     
    Alex Martelli, Oct 21, 2003
    #2

  3. Peter Otten

    Peter Otten Guest

    Kylotan wrote:

    > I have a text file where the fields are delimited in various different
    > ways. For example, strings are terminated with a tilde, numbers are
    > terminated with whitespace, and some identifiers are terminated with a
    > newline. This means I can't effectively use split() except on a small
    > scale. For most of the file I can just call one of several functions I
    > wrote that read in just as much data as is required from the input
    > string, and return the value and modified string. Much of the code
    > therefore looks like this:
    >
    > filedata = file('whatever').read()
    > firstWord, filedata = GetWord(filedata)
    > nextNumber, filedata = GetNumber(filedata)
    >
    > This works, but is obviously ugly. Is there a cleaner alternative that
    > can avoid me having to re-assign data all the time that will 'consume'
    > the value from the stream)? I'm a bit unclear on the whole passing by
    > value/reference thing. I'm guessing that while GetWord gets a
    > reference to the 'filedata' string, assigning to that will just reseat
    > the reference and not change the original string.


    The strategy to rebind is to wrap the reference into a mutable object and
    pass that object around instead of the original reference.
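    For example (a one-element list is the simplest mutable wrapper;
    the names here are just for illustration):

```python
def GetWord(box):
    # 'box' is a one-element list holding the remaining data;
    # mutating box[0] is visible to the caller.
    data = box[0]
    word, _, rest = data.partition(" ")
    box[0] = rest
    return word

box = ["alpha beta"]
w = GetWord(box)   # "alpha"; box[0] is now "beta"
```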

    > The other problem is that parts of the format are potentially repeated
    > an arbitrary number of times and therefore a degree of lookahead is
    > required. If I've already extracted a token and then find out I need
    > it, putting it back is awkward. Yet there is nowhere near enough
    > complexity or repetition in the file format to justify a formal
    > grammar or anything like that.
    >
    > All in all, in the basic parsing code I am doing a lot more operations
    > on the input data than I would like. I can see how I'd encapsulate
    > this behind functions if I was willing to iterate through the data
    > character by character like I would in C++. But I am hoping that
    > Python can, as usual, save me from the majority of this drudgery
    > somehow.


    I've made a little Reader class that should do what you want. Of course the
    actual parsing routines will differ, depending on your file format.

    <code>
    class EndOfData(Exception):
        pass

    class Reader:
        def __init__(self, data):
            self.data = data
            self.positions = [0]

        def _getChunk(self, delim):
            start = self.positions[-1]
            if start >= len(self.data):
                raise EndOfData
            end = self.data.find(delim, start)
            if end < 0:
                end = len(self.data)
            self.positions.append(end + 1)
            return self.data[start:end]

        def rest(self):
            return self.data[self.positions[-1]:]

        def rewind(self):
            self.positions = [0]

        def unget(self):
            self.positions.pop()

        def getString(self):
            return self._getChunk("~")

        def getInteger(self):
            chunk = self._getChunk(" ")
            try:
                return int(chunk)
            except ValueError:
                self.unget()
                raise

    # example usage:

    sample = "abc~123 456 rst"
    r = Reader(sample)

    commands = {
        "i": r.getInteger,
        "s": r.getString,
        "u": lambda: r.unget() or "#unget " + r.rest(),
    }

    for key in "ssuiisuuisi":
        try:
            print commands[key]()
        except ValueError:
            print "#error"
    </code>

    Peter
     
    Peter Otten, Oct 22, 2003
    #3
  4. On 21 Oct 2003 15:21:13 -0700, (Kylotan) wrote:

    >I have a text file where the fields are delimited in various different
    >ways. For example, strings are terminated with a tilde, numbers are
    >terminated with whitespace, and some identifiers are terminated with a
    >newline. This means I can't effectively use split() except on a small
    >scale. For most of the file I can just call one of several functions I
    >wrote that read in just as much data as is required from the input
    >string, and return the value and modified string. Much of the code
    >therefore looks like this:
    >
    >filedata = file('whatever').read()
    >firstWord, filedata = GetWord(filedata)
    >nextNumber, filedata = GetNumber(filedata)
    >
    >This works, but is obviously ugly. Is there a cleaner alternative that
    >can avoid me having to re-assign data all the time that will 'consume'
    >the value from the stream)? I'm a bit unclear on the whole passing by
    >value/reference thing. I'm guessing that while GetWord gets a
    >reference to the 'filedata' string, assigning to that will just reseat
    >the reference and not change the original string.
    >
    >The other problem is that parts of the format are potentially repeated
    >an arbitrary number of times and therefore a degree of lookahead is
    >required. If I've already extracted a token and then find out I need

    A generator can look ahead by holding put-back info in its own state
    without yielding a result until it has decided what to do. It can read
    input line-wise and scan lines for patterns and store ambiguous info
    for re-analysis if backup is needed. You can go character by character
    or whip through lines of comments in bigger chunks, and recognize alternative
    patterns with regular expressions. There are lots of options.

    >it, putting it back is awkward. Yet there is nowhere near enough

    A generator wouldn't have to put it back, but if that is a convenient way to
    go, you can define one with a put-back stack or queue by including a mutable
    for that purpose as one of the initial arguments in the initial generator call.
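    A rough sketch of that shape, with the token pattern and the
    (kind, value) tuples purely as assumed examples:

```python
import re

# Hypothetical token pattern for the formats described in the thread:
# a number ends with whitespace, a string ends with '~'.
TOKEN = re.compile(r"(?P<num>\d+)\s|(?P<str>[^~]*)~")

def tokens(data, pushback):
    """Yield (kind, value) tuples; 'pushback' is a caller-supplied mutable
    (here a list) whose contents are yielded first, so the caller can put
    tokens back between next() calls."""
    pos = 0
    while True:
        while pushback:
            yield pushback.pop()
        m = TOKEN.match(data, pos)
        if m is None:
            return
        pos = m.end()
        if m.group("num") is not None:
            yield ("num", int(m.group("num")))
        else:
            yield ("str", m.group("str"))

stack = []
t = tokens("abc~123 456 xyz~", stack)
first = next(t)        # ("str", "abc")
stack.append(first)    # put it back
again = next(t)        # the same token comes out again
```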

    >complexity or repetition in the file format to justify a formal
    >grammar or anything like that.


    Communicating clearly and precisely should be more than enough justification IMO ;-)

    What you've said above sounds like approximately:

    kylotan_file: ( string_text '~' | number WS | some_identifiers NL )*

    If it's not that complicated, why not complete the picture? I'd bet you'll get several
    versions of tokenizers/parsers for it, and questions as to what you want to do with the
    pieces. Maybe a tokenizer as a generator that gives you a sequence of (token_type, token_data)
    tuples would work. If you have nested structures, you can define start-of-nest and end-of-nest
    tokens as operator tokens like ( OP, '(' ) and ( OP, ')' ).

    Look at Andrew Dalke's recent post for a number of ideas and code you might snip and adapt
    to your problem (I think this shortened url will get you there):

    http://groups.google.com/groups?q=rpn.compile group:comp.lang.python.*&hl=en&lr=&ie=UTF-8

    >
    >All in all, in the basic parsing code I am doing a lot more operations
    >on the input data than I would like. I can see how I'd encapsulate
    >this behind functions if I was willing to iterate through the data
    >character by character like I would in C++. But I am hoping that
    >Python can, as usual, save me from the majority of this drudgery
    >somehow.

    I suspect you could recognize bigger chunks with regular expressions, or at
    least split them apart by splitting on a regex of delimiters (which you can
    preserve in the split list by enclosing in parens).
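    For instance, splitting on a group-enclosed delimiter pattern keeps
    the delimiters in the result list (the sample data here is made up to
    match the formats described):

```python
import re

data = "abc~123 456 ident\n789 def~"
# A capturing group in the pattern makes re.split keep each delimiter
# as its own element in the result list.
parts = re.split(r"([~\s])", data)
pieces = [p for p in parts if p]   # drop the empty strings
```

    With the delimiters preserved, a later pass can still tell whether a
    field was terminated by '~', a space, or a newline.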

    >
    >Any help appreciated.
    >

    HTH

    Regards,
    Bengt Richter
     
    Bengt Richter, Oct 22, 2003
    #4
  5. Kylotan

    Kylotan Guest

    (Bengt Richter) wrote in message news:<bn5o45$gru$0@216.39.172.122>...

    > A generator can look ahead by holding put-back info in its own state
    > without yielding a result until it has decided what to do. It can read
    > input line-wise and scan lines for patterns and store ambiguous info
    > for re-analysis if backup is needed. You can go character by character
    > or whip through lines of comments in bigger chunks, and recognize alternative
    > patterns with regular expressions. There are lots of options.


    Sadly none of these options seem obvious to me :) Basically 90% of
    the time, I know exactly what type to expect. Other times, I am gonna
    get one of several things back, where sometimes one of those things is
    actually part of something totally different, so I need to leave it
    there for the next routine. How would that be done with a generator?

    > Communicating clearly and precisely should be more than enough justification IMO ;-)
    >
    > What you've said above sounds like approximately:
    >
    > kylotan_file: ( string_text '~' | number WS | some_identifiers NL )*
    >
    > If it's not that complicated, why not complete the picture?


    Because it would be a fairly flat grammar where each non-terminal
    symbol has a very long rule of almost exclusively terminal symbols
    describing what it contains. There's no recursiveness and very little
    iteration or alternation in here. With all this in mind, I'd rather
    keep all the logic for reading and assigning values in one place
    rather than going through a parser middleman which will complicate the
    code. Traditional tokenizers and lexers are also of little use since
    many of the tokens are context-dependent.

    > Look at Andrew Dalke's recent post for a number of ideas and code you might
    > snip and adapt to your problem


    All I found in a short search was something complex that appeared to
    be an expression parser, which is not really what I need here.

    Thanks,

    Ben Sizer
     
    Kylotan, Oct 23, 2003
    #5
  6. Kylotan

    Kylotan Guest

    Peter,

    Thanks for your reply. I will probably use something similar to this
    in the end. However, I was wondering if there's an obvious
    implementation of multiple delimiters for the _getChunk() function?
    The most obvious and practical example would be the ability to get the
    next chunk up to any sort of whitespace, not just a space.

    --
    Ben Sizer
     
    Kylotan, Oct 23, 2003
    #6
  7. Peter Otten

    Peter Otten Guest

    Kylotan wrote:

    > in the end. However, I was wondering if there's an obvious
    > implementation of multiple delimiters for the _getChunk() function?
    > The most obvious and practical example would be the ability to get the
    > next chunk up to any sort of whitespace, not just a space.


    As far as I know, nothing short of regular expressions will do.

    def _getChunk(self, expr):
        start = self.positions[-1]
        if start >= len(self.data):
            raise EndOfData
        match = expr.search(self.data, start)
        if match:
            end = match.start()
            self.positions.append(match.end())
        else:
            end = len(self.data)
            self.positions.append(end)
        return self.data[start:end]

    This would be called, e.g., with one or more whitespace characters as
    the delimiter:

    whites = re.compile(r"\s+")

    def getString(self):
        return self._getChunk(self.whites)
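    Put together as a self-contained sketch (the class name and sample
    data are my own, the mechanism is as above):

```python
import re

class EndOfData(Exception):
    pass

class RegexReader:
    """Reader variant whose _getChunk takes a compiled regex delimiter."""
    whites = re.compile(r"\s+")

    def __init__(self, data):
        self.data = data
        self.positions = [0]   # stack of positions, enabling unget()

    def _getChunk(self, expr):
        start = self.positions[-1]
        if start >= len(self.data):
            raise EndOfData
        match = expr.search(self.data, start)
        if match:
            end = match.start()
            self.positions.append(match.end())
        else:
            end = len(self.data)
            self.positions.append(end)
        return self.data[start:end]

    def getString(self):
        return self._getChunk(self.whites)

r = RegexReader("abc\tdef\nghi")
a = r.getString()   # "abc"  (a tab counts as whitespace)
b = r.getString()   # "def"  (so does a newline)
```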


    Peter
     
    Peter Otten, Oct 23, 2003
    #7
  8. Andrae Muys

    Andrae Muys Guest

    (Kylotan) wrote in message news:<>...
    > (Bengt Richter) wrote in message news:<bn5o45$gru$0@216.39.172.122>...
    >
    > > Look and Andrew Dalke's recent post for a number of ideas and code you might
    > > snip and adapt to your problem

    >
    > All I found in a short search was something complex that appeared to
    > be an expression parser, which is not really what I need here.
    >


    To me it sounds like you need a parser, so why not just bite the
    bullet and use one?

    From your description of the file format earlier in this thread, it
    sounds like you just need a straightforward LL parser.
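    For a format this flat, "LL parser" can mean nothing fancier than a
    hand-written recursive-descent reader, one method per field type
    (the names and sample input below are illustrative assumptions):

```python
class Parser:
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def string(self):
        # A string runs up to the next '~'.
        end = self.text.index("~", self.pos)
        value = self.text[self.pos:end]
        self.pos = end + 1
        return value

    def number(self):
        # A number runs up to the next whitespace character.
        start = self.pos
        while self.pos < len(self.text) and not self.text[self.pos].isspace():
            self.pos += 1
        value = int(self.text[start:self.pos])
        self.pos += 1   # consume the terminating whitespace
        return value

p = Parser("hello~42 7 ")
s = p.string()   # "hello"
n = p.number()   # 42
m = p.number()   # 7
```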

    Andrae
     
    Andrae Muys, Oct 25, 2003
    #8
