Parsing a file based on differing delimiters

Kylotan

I have a text file where the fields are delimited in various different
ways. For example, strings are terminated with a tilde, numbers are
terminated with whitespace, and some identifiers are terminated with a
newline. This means I can't effectively use split() except on a small
scale. For most of the file I can just call one of several functions I
wrote that read in just as much data as is required from the input
string, and return the value and modified string. Much of the code
therefore looks like this:

filedata = file('whatever').read()
firstWord, filedata = GetWord(filedata)
nextNumber, filedata = GetNumber(filedata)

This works, but is obviously ugly. Is there a cleaner alternative that
avoids having to re-assign the data all the time, something that will
'consume' the value from the stream? I'm a bit unclear on the whole
passing by value/reference thing. I'm guessing that while GetWord gets a
reference to the 'filedata' string, assigning to that will just reseat
the reference and not change the original string.

The other problem is that parts of the format are potentially repeated
an arbitrary number of times and therefore a degree of lookahead is
required. If I've already extracted a token and then find out I need
it, putting it back is awkward. Yet there is nowhere near enough
complexity or repetition in the file format to justify a formal
grammar or anything like that.

All in all, in the basic parsing code I am doing a lot more operations
on the input data than I would like. I can see how I'd encapsulate
this behind functions if I was willing to iterate through the data
character by character like I would in C++. But I am hoping that
Python can, as usual, save me from the majority of this drudgery
somehow.

Any help appreciated.
 
Alex Martelli

Kylotan said:
> I have a text file where the fields are delimited in various different
> ways. For example, strings are terminated with a tilde, numbers are
> terminated with whitespace, and some identifiers are terminated with a

What sadist designed it?-) Anyway...

I suggest a simple class which holds the filedata and an index into
it. Your functions such as GetWord(f) examine f.data from f.index
onwards, and increment f.index before returning the result. To
"pushback", you just decrement f.index again (you may want to
keep a small stack of values -- perhaps just one -- for the "undo",
again in the simple class in question).
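A minimal sketch of that class, with the small undo stack included (the names ParseState and GetWord are illustrative, assuming tilde-terminated strings as in the original post):

```python
class ParseState:
    """Holds the file data plus a cursor; parsing functions advance
    the cursor instead of rebinding the string."""
    def __init__(self, data):
        self.data = data
        self.index = 0
        self._marks = []             # saved indexes for pushback/undo

    def mark(self):
        self._marks.append(self.index)

    def pushback(self):
        self.index = self._marks.pop()

def GetWord(f):
    """Return the next tilde-terminated string, advancing f.index."""
    f.mark()                         # remember where this token began
    end = f.data.find("~", f.index)
    if end < 0:
        end = len(f.data)
    word = f.data[f.index:end]
    f.index = end + 1
    return word

f = ParseState("hello~world~")
first = GetWord(f)       # 'hello'
f.pushback()             # undo: cursor back before 'hello'
again = GetWord(f)       # 'hello' once more
```

GetNumber and friends would follow the same pattern, each saving its start index so a single pushback() can undo the last read.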


Alex
 
Peter Otten

Kylotan said:
> I have a text file where the fields are delimited in various different
> ways. For example, strings are terminated with a tilde, numbers are
> terminated with whitespace, and some identifiers are terminated with a
> newline. This means I can't effectively use split() except on a small
> scale. For most of the file I can just call one of several functions I
> wrote that read in just as much data as is required from the input
> string, and return the value and modified string. Much of the code
> therefore looks like this:
>
> filedata = file('whatever').read()
> firstWord, filedata = GetWord(filedata)
> nextNumber, filedata = GetNumber(filedata)
>
> This works, but is obviously ugly. Is there a cleaner alternative that
> can avoid me having to re-assign data all the time that will 'consume'
> the value from the stream? I'm a bit unclear on the whole passing by
> value/reference thing. I'm guessing that while GetWord gets a
> reference to the 'filedata' string, assigning to that will just reseat
> the reference and not change the original string.

The strategy to rebind is to wrap the reference into a mutable object and
pass that object around instead of the original reference.

> The other problem is that parts of the format are potentially repeated
> an arbitrary number of times and therefore a degree of lookahead is
> required. If I've already extracted a token and then find out I need
> it, putting it back is awkward. Yet there is nowhere near enough
> complexity or repetition in the file format to justify a formal
> grammar or anything like that.
>
> All in all, in the basic parsing code I am doing a lot more operations
> on the input data than I would like. I can see how I'd encapsulate
> this behind functions if I was willing to iterate through the data
> character by character like I would in C++. But I am hoping that
> Python can, as usual, save me from the majority of this drudgery
> somehow.
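A minimal sketch of that wrap-in-a-mutable strategy, using a one-element list as the container (the names here are illustrative):

```python
def GetWord(box):
    """box is a one-element list holding the remaining input;
    rebinding box[0] is visible to the caller, unlike rebinding
    a plain string argument."""
    word, _, rest = box[0].partition("~")
    box[0] = rest                    # "consume" the word from the stream
    return word

box = ["abc~def~"]
w1 = GetWord(box)    # 'abc'; box[0] is now 'def~'
w2 = GetWord(box)    # 'def'; box[0] is now ''
```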

I've made a little Reader class that should do what you want. Of course the
actual parsing routines will differ, depending on your file format.

<code>
class EndOfData(Exception):
    pass

class Reader:
    def __init__(self, data):
        self.data = data
        self.positions = [0]

    def _getChunk(self, delim):
        start = self.positions[-1]
        if start >= len(self.data):
            raise EndOfData
        end = self.data.find(delim, start)
        if end < 0:
            end = len(self.data)
        self.positions.append(end + 1)
        return self.data[start:end]

    def rest(self):
        return self.data[self.positions[-1]:]

    def rewind(self):
        self.positions = [0]

    def unget(self):
        self.positions.pop()

    def getString(self):
        return self._getChunk("~")

    def getInteger(self):
        chunk = self._getChunk(" ")
        try:
            return int(chunk)
        except ValueError:
            self.unget()
            raise

# example usage:

sample = "abc~123 456 rst"
r = Reader(sample)

commands = {
    "i": r.getInteger,
    "s": r.getString,
    "u": lambda: r.unget() or "#unget " + r.rest(),
}

for key in "ssuiisuuisi":
    try:
        print(commands[key]())
    except ValueError:
        print("#error")
    except EndOfData:
        print("#eof")
</code>

Peter
 
Bengt Richter

Kylotan said:
> I have a text file where the fields are delimited in various different
> ways. For example, strings are terminated with a tilde, numbers are
> terminated with whitespace, and some identifiers are terminated with a
> newline. This means I can't effectively use split() except on a small
> scale. For most of the file I can just call one of several functions I
> wrote that read in just as much data as is required from the input
> string, and return the value and modified string. Much of the code
> therefore looks like this:
>
> filedata = file('whatever').read()
> firstWord, filedata = GetWord(filedata)
> nextNumber, filedata = GetNumber(filedata)
>
> This works, but is obviously ugly. Is there a cleaner alternative that
> can avoid me having to re-assign data all the time that will 'consume'
> the value from the stream? I'm a bit unclear on the whole passing by
> value/reference thing. I'm guessing that while GetWord gets a
> reference to the 'filedata' string, assigning to that will just reseat
> the reference and not change the original string.
>
> The other problem is that parts of the format are potentially repeated
> an arbitrary number of times and therefore a degree of lookahead is
> required. If I've already extracted a token and then find out I need
A generator can look ahead by holding put-back info in its own state
without yielding a result until it has decided what to do. It can read
input line-wise and scan lines for patterns and store ambiguous info
for re-analysis if backup is needed. You can go character by character
or whip through lines of comments in bigger chunks, and recognize alternative
patterns with regular expressions. There are lots of options.
> it, putting it back is awkward. Yet there is nowhere near enough
A generator wouldn't have to put it back, but if that is a convenient way to
go, you can define one with a put-back stack or queue by including a mutable
object for that purpose as one of the initial arguments in the initial generator call.
> complexity or repetition in the file format to justify a formal
> grammar or anything like that.

Communicating clearly and precisely should be more than enough justification IMO ;-)

What you've said above sounds like approximately:

kylotan_file: ( string_text '~' | number WS | some_identifiers NL )*

If it's not that complicated, why not complete the picture? I'd bet you'll get several
versions of tokenizers/parsers for it, and questions as to what you want to do with the
pieces. Maybe a tokenizer as a generator that gives you a sequence of (token_type, token_data)
tuples would work. If you have nested structures, you can define start-of-nest and end-of-nest
tokens as operator tokens like (OP, '(') and (OP, ')').

Look at Andrew Dalke's recent post for a number of ideas and code you might snip and adapt
to your problem (I think this shortened url will get you there):

http://groups.google.com/groups?q=rpn.compile+group:comp.lang.python.*&hl=en&lr=&ie=UTF-8
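One rough sketch of such a tokenizer, yielding (token_type, token_data) tuples, with a caller-visible put-back list for lookahead (the token pattern and names here are assumed for illustration, not from the thread):

```python
import re

# One regex with a named alternative per token type: tilde-terminated
# string, whitespace-terminated number, newline-terminated identifier.
TOKEN = re.compile(r"(?P<STR>[^~]*)~|(?P<NUM>\d+)\s+|(?P<ID>\w+)\n")

def tokens(text, pushback):
    """Yield (token_type, token_data) tuples from text; anything the
    caller appends to the mutable `pushback` list is re-delivered
    before more input is consumed."""
    pos = 0
    while True:
        while pushback:                 # drain put-back tokens first
            yield pushback.pop()
        m = TOKEN.match(text, pos)
        if not m:
            return                      # end of input (or unrecognized text)
        pos = m.end()
        yield m.lastgroup, m.group(m.lastgroup)

stack = []
toks = tokens("abc~123 def\n", stack)
first = next(toks)       # ('STR', 'abc')
stack.append(first)      # decide we weren't ready for it after all
again = next(toks)       # ('STR', 'abc') delivered again
```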
> All in all, in the basic parsing code I am doing a lot more operations
> on the input data than I would like. I can see how I'd encapsulate
> this behind functions if I was willing to iterate through the data
> character by character like I would in C++. But I am hoping that
> Python can, as usual, save me from the majority of this drudgery
> somehow.
I suspect you could recognize bigger chunks with regular expressions, or at
least split them apart by splitting on a regex of delimiters (which you can
preserve in the split list by enclosing in parens).
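For instance, splitting on a regex of delimiters with capturing parens keeps the delimiters themselves in the result list (the pattern below is illustrative):

```python
import re

data = "abc~123 456\nxyz~"
# The capturing group means each matched delimiter is kept in the
# output, alternating with the text between delimiters.
parts = re.split(r"([~\s])", data)
# parts == ['abc', '~', '123', ' ', '456', '\n', 'xyz', '~', '']
```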
> Any help appreciated.
HTH

Regards,
Bengt Richter
 
Kylotan

Bengt Richter said:
> A generator can look ahead by holding put-back info in its own state
> without yielding a result until it has decided what to do. It can read
> input line-wise and scan lines for patterns and store ambiguous info
> for re-analysis if backup is needed. You can go character by character
> or whip through lines of comments in bigger chunks, and recognize alternative
> patterns with regular expressions. There are lots of options.

Sadly none of these options seem obvious to me :) Basically 90% of
the time, I know exactly what type to expect. Other times, I am gonna
get one of several things back, where sometimes one of those things is
actually part of something totally different, so I need to leave it
there for the next routine. How would that be done with a generator?
> Communicating clearly and precisely should be more than enough justification IMO ;-)
>
> What you've said above sounds like approximately:
>
> kylotan_file: ( string_text '~' | number WS | some_identifiers NL )*
>
> If it's not that complicated, why not complete the picture?

Because it would be a fairly flat grammar where each non-terminal
symbol has a very long rule of almost exclusively terminal symbols
describing what it contains. There's no recursion and very little
iteration or alternation in here. With all this in mind, I'd rather
keep all the logic for reading and assigning values in one place
rather than going through a parser middleman which will complicate the
code. Traditional tokenizers and lexers are also of little use since
many of the tokens are context-dependent.
> Look at Andrew Dalke's recent post for a number of ideas and code you might
> snip and adapt to your problem

All I found in a short search was something complex that appeared to
be an expression parser, which is not really what I need here.

Thanks,

Ben Sizer
 
Kylotan

Peter,

Thanks for your reply. I will probably use something similar to this
in the end. However, I was wondering if there's an obvious
implementation of multiple delimiters for the _getChunk() function?
The most obvious and practical example would be the ability to get the
next chunk up to any sort of whitespace, not just a space.
 
Peter Otten

Kylotan said:
> in the end. However, I was wondering if there's an obvious
> implementation of multiple delimiters for the _getChunk() function?
> The most obvious and practical example would be the ability to get the
> next chunk up to any sort of whitespace, not just a space.

As far as I know, nothing short of regular expressions will do.

def _getChunk(self, expr):
    start = self.positions[-1]
    if start >= len(self.data):
        raise EndOfData
    match = expr.search(self.data, start)
    if match:
        end = match.start()
        self.positions.append(match.end())
    else:
        end = len(self.data)
        self.positions.append(end)
    return self.data[start:end]

This would be called with, e.g., one or more whitespace characters as the
delimiter:

whites = re.compile(r"\s+")

def getString(self):
    return self._getChunk(self.whites)


Peter
 
Andrae Muys

Kylotan said:
> All I found in a short search was something complex that appeared to
> be an expression parser, which is not really what I need here.

To me it sounds like you need a parser, so why not just bite the bullet
and use one?

From your description of the file format earlier in this thread, it
sounds like you just need a straightforward LL parser.
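For the format sketched earlier in the thread, such a hand-written LL(1) (recursive-descent) parser can stay very small; this sketch assumes just two field kinds, tilde-terminated strings and whitespace-terminated numbers:

```python
class Parser:
    """Tiny recursive-descent parser: decides what to read next by
    peeking one character ahead (LL(1))."""
    def __init__(self, text):
        self.text = text
        self.pos = 0

    def peek(self):
        return self.text[self.pos] if self.pos < len(self.text) else ""

    def _until(self, stops):
        # Scan up to (and consume) the first delimiter in `stops`.
        start = self.pos
        while self.pos < len(self.text) and self.text[self.pos] not in stops:
            self.pos += 1
        chunk = self.text[start:self.pos]
        self.pos += 1                     # step past the delimiter
        return chunk

    def parse(self):
        """( number WS | string '~' )* -- chosen by one-char lookahead."""
        out = []
        while self.pos < len(self.text):
            if self.peek().isdigit():
                out.append(int(self._until(" \t\n")))
            else:
                out.append(self._until("~"))
        return out

p = Parser("abc~42 xyz~7 ")
result = p.parse()       # ['abc', 42, 'xyz', 7]
```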

Andrae
 
