Parsing a file with iterators

Luis Zarrabeitia · Oct 17, 2008

I need to parse a text file. The format is something like this:

TYPE1 metadata
data line 1
data line 2
....
data line N
TYPE2 metadata
data line 1
....
TYPE3 metadata
....

And so on. The type and metadata determine how to parse the following data
lines. When the parser fails to parse one of the lines, the next parser is
chosen (or if there is no 'TYPE metadata' line there, an exception is thrown).

This doesn't work:

===
for line in input:
    parser = parser_from_string(line)
    parser(input)
===

because when the parser iterates over the input, it can't know that it finished
processing the section until it reads the next "TYPE" line (actually, until it
reads the first line that it cannot parse, which if everything went well, should
be the 'TYPE'), but once it reads it, it is no longer available to the outer
loop. I wouldn't like to leak the internals of the parsers to the outside.

What could I do?
(For the curious: the format is a dialect of E00, used in GIS.)

Eddie Corns · Oct 17, 2008

One simple way is to make your "input" iterator support pushing values
back onto the input stream, so that a parser can return a line as soon as it
finds one it can't handle.

See http://code.activestate.com/recipes/502304/ for an example.
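For illustration, here is a minimal sketch of the pushback idea. It is not the linked recipe, it is written for Python 3, and the names PushbackIterator, push, parse_section and parse_all are made up for this example. It also assumes a section parser can recognise a line it cannot handle (here simply a line starting with "TYPE").

```python
class PushbackIterator:
    """Iterator wrapper that lets a consumer return items to the stream."""

    def __init__(self, iterable):
        self._it = iter(iterable)
        self._pushed = []          # items pushed back, served last-in first-out

    def __iter__(self):
        return self

    def __next__(self):
        if self._pushed:
            return self._pushed.pop()
        return next(self._it)

    def push(self, item):
        self._pushed.append(item)


def parse_section(lines):
    """Hypothetical section parser: collect lines until one it cannot handle."""
    rows = []
    for line in lines:
        if line.startswith("TYPE"):
            lines.push(line)       # give the header back to the outer loop
            break
        rows.append(line)
    return rows


def parse_all(raw_lines):
    """Outer loop: read a header, then let a section parser consume its lines."""
    lines = PushbackIterator(raw_lines)
    sections = []
    for header in lines:
        sections.append((header, parse_section(lines)))
    return sections
```

The outer loop and the section parser share one iterator, so the parser's internals stay hidden; the only thing it needs from the outside is the push method.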

Marc 'BlackJack' Rintsch · Oct 17, 2008

Group the lines before processing and feed each group to the right parser:

# Python 2 code (print statement, itertools.imap)
import sys
from itertools import groupby, imap
from operator import itemgetter

def parse_a(metadata, lines):
    print 'parser a', metadata
    for line in lines:
        print 'a', line

def parse_b(metadata, lines):
    print 'parser b', metadata
    for line in lines:
        print 'b', line

def parse_c(metadata, lines):
    print 'parser c', metadata
    for line in lines:
        print 'c', line

def test_for_type(line):
    return line.startswith('TYPE')

def parse(lines):
    def tag():
        # Pair every data line with the TYPE line that precedes it.
        type_line = None
        for line in lines:
            if test_for_type(line):
                type_line = line
            else:
                yield (type_line, line)

    type2parser = {'TYPE1': parse_a,
                   'TYPE2': parse_b,
                   'TYPE3': parse_c}

    # Consecutive pairs that share a TYPE line form one section.
    for type_line, group in groupby(tag(), itemgetter(0)):
        type_id, metadata = type_line.split(' ', 1)
        type2parser[type_id](metadata, imap(itemgetter(1), group))

def main():
    parse(sys.stdin)

if __name__ == '__main__':
    main()
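For readers on Python 3, where itertools.imap no longer exists, the same grouping idea might look roughly like the sketch below. The collect function is a hypothetical stand-in for a real section parser, and returning a list of results is an addition made only so the behaviour is easy to check.

```python
from itertools import groupby
from operator import itemgetter


def tag(lines):
    """Pair every data line with the most recent TYPE header line."""
    type_line = None
    for line in lines:
        if line.startswith("TYPE"):
            type_line = line
        else:
            yield type_line, line


def parse(lines, type2parser):
    """Dispatch each group of data lines to the parser for its TYPE."""
    results = []
    for type_line, group in groupby(tag(lines), key=itemgetter(0)):
        type_id, metadata = type_line.split(" ", 1)
        data = map(itemgetter(1), group)   # lazy in Python 3, like imap in Python 2
        results.append(type2parser[type_id](metadata, data))
    return results


def collect(metadata, lines):
    # Hypothetical stand-in parser: just gather the section's lines.
    return metadata, list(lines)
```

Two caveats of this approach: a TYPE line with no data lines under it never reaches any parser, and two adjacent sections whose TYPE lines are identical are merged into one group. Each parser also has to consume its group before the loop moves on, because groupby invalidates the previous group when it advances.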

Paul McGuire · Oct 17, 2008

Pyparsing will take care of this for you, if you define a set of
alternatives and then parse/search for them. Here is an annotated
example. Note the ability to attach names to different fields of the
parser, and then how those fields are accessed after parsing.

"""
TYPE1 metadata
data line 1
data line 2
....
data line N
TYPE2 metadata
data line 1
....
TYPE3 metadata
....
"""

from pyparsing import *

# define basic element types to be used in data formats
integer = Word(nums)
ident = Word(alphas) | quotedString.setParseAction(removeQuotes)
zipcode = Combine(Word(nums,exact=5) + Optional("-" +
Word(nums,exact=4)))
stateAbbreviation = oneOf("""AA AE AK AL AP AR AS AZ CA CO CT DC DE
FL FM GA GU HI IA ID IL IN KS KY LA MA MD ME MH MI MN MO MP MS
MT NC ND NE NH NJ NM NV NY OH OK OR PA PR PW RI SC SD TN TX UT
VA VI VT WA WI WV WY""".split())

# define data format for each type
DATA = Suppress("data")
type1dataline = Group(DATA + OneOrMore(integer))
type2dataline = Group(DATA + delimitedList(ident))
type3dataline = DATA + countedArray(ident)

# define complete expressions for each type - note different types
# may have different metadata
type1data = "TYPE1" + ident("name") + \
    OneOrMore(type1dataline)("data")
type2data = "TYPE2" + ident("name") + zipcode("zip") + \
    OneOrMore(type2dataline)("data")
type3data = "TYPE3" + ident("name") + stateAbbreviation("state") + \
    OneOrMore(type3dataline)("data")

# expression containing all different type alternatives
data = type1data | type2data | type3data

# search a test input string and dump the matched tokens by name
testInput = """
TYPE1 Abercrombie
data 400 26 42 66
data 1 1 2 3 5 8 13 21
data 1 4 9 16 25 36
data 1 2 4 8 16 32 64
TYPE2 Benjamin 78704
data Larry, Curly, Moe
data Hewey,Dewey ,Louie
data Tom , Dick, Harry, Fred
data Thelma,Louise
TYPE3 Christopher WA
data 3 "Raspberry Red" "Lemon Yellow" "Orange Orange"
data 7 Grumpy Sneezy Happy Dopey Bashful Sleepy Doc
"""
for tokens in data.searchString(testInput):
    print tokens.dump()
    print tokens.name
    if tokens.state: print tokens.state
    for d in tokens.data:
        print " ",d
    print

Prints:

['TYPE1', 'Abercrombie', ['400', '26', '42', '66'], ['1', '1', '2',
'3', '5', '8', '13', '21'], ['1', '4', '9', '16', '25', '36'], ['1',
'2', '4', '8', '16', '32', '64']]
- data: [['400', '26', '42', '66'], ['1', '1', '2', '3', '5', '8',
'13', '21'], ['1', '4', '9', '16', '25', '36'], ['1', '2', '4', '8',
'16', '32', '64']]
- name: Abercrombie
Abercrombie
['400', '26', '42', '66']
['1', '1', '2', '3', '5', '8', '13', '21']
['1', '4', '9', '16', '25', '36']
['1', '2', '4', '8', '16', '32', '64']

['TYPE2', 'Benjamin', '78704', ['Larry', 'Curly', 'Moe'], ['Hewey',
'Dewey', 'Louie'], ['Tom', 'Dick', 'Harry', 'Fred'], ['Thelma',
'Louise']]
- data: [['Larry', 'Curly', 'Moe'], ['Hewey', 'Dewey', 'Louie'],
['Tom', 'Dick', 'Harry', 'Fred'], ['Thelma', 'Louise']]
- name: Benjamin
- zip: 78704
Benjamin
['Larry', 'Curly', 'Moe']
['Hewey', 'Dewey', 'Louie']
['Tom', 'Dick', 'Harry', 'Fred']
['Thelma', 'Louise']

['TYPE3', 'Christopher', 'WA', ['Raspberry Red', 'Lemon Yellow',
'Orange Orange'], ['Grumpy', 'Sneezy', 'Happy', 'Dopey', 'Bashful',
'Sleepy', 'Doc']]
- data: [['Raspberry Red', 'Lemon Yellow', 'Orange Orange'],
['Grumpy', 'Sneezy', 'Happy', 'Dopey', 'Bashful', 'Sleepy', 'Doc']]
- name: Christopher
- state: WA
Christopher
WA
['Raspberry Red', 'Lemon Yellow', 'Orange Orange']
['Grumpy', 'Sneezy', 'Happy', 'Dopey', 'Bashful', 'Sleepy', 'Doc']

More info on pyparsing at http://pyparsing.wikispaces.com.

-- Paul

James Harris · Oct 17, 2008

The main issue seems to be that you need to keep the 'current' line
data when a parser has decided it doesn't understand it so it can
still be used to select the next parser. The for loop in your example
uses the next() method which only returns the next and never the
current line. There are two easy options though:

1. Wrap the input file with your own object.
2. Use the linecache module and maintain a line number.

http://blog.doughellmann.com/2007/04/pymotw-linecache.html
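Option 1 could be sketched as follows, in Python 3. The names LineReader, advance, current, parse_section and parse_all are invented for this example, and the test for a "foreign" line is assumed to be a simple startswith("TYPE") check.

```python
class LineReader:
    """Wraps an iterable of lines and remembers the line most recently read."""

    def __init__(self, iterable):
        self._it = iter(iterable)
        self.current = None        # None before the first read and at end of input

    def advance(self):
        """Move to the next line; return it, or None at end of input."""
        self.current = next(self._it, None)
        return self.current


def parse_section(reader):
    """Hypothetical parser: consume data lines, leave the first foreign line as current."""
    rows = []
    while reader.advance() is not None and not reader.current.startswith("TYPE"):
        rows.append(reader.current)
    return rows


def parse_all(raw_lines):
    """Select a parser from the current line, then let it read on from there."""
    reader = LineReader(raw_lines)
    reader.advance()
    sections = []
    while reader.current is not None:
        header = reader.current
        if not header.startswith("TYPE"):
            raise ValueError("expected a TYPE line, got %r" % header)
        sections.append((header, parse_section(reader)))
    return sections
```

Because the line a parser gave up on is still available as reader.current, the outer loop can use it to pick the next parser, or raise if it is not a TYPE line.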

George Sakkis · Oct 18, 2008

I like groupby and find it very powerful but I think it complicates
things here instead of simplifying them. I would instead create a
parser instance for every section as soon as the TYPE line is read and
then feed it one data line at a time (or if all the data lines must or
should be given at once, append them in a list and feed them all as
soon as the next section is found), something like:

class parse_a(object):
    def __init__(self, metadata):
        print 'parser a', metadata
    def parse(self, line):
        print 'a', line

# similar for parse_b and parse_c
# ...

def parse(lines):
    parse = None
    for line in lines:
        if test_for_type(line):
            type_id, metadata = line.split(' ', 1)
            parse = type2parser[type_id](metadata).parse
        else:
            parse(line)
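The variant mentioned in parentheses above, where the data lines are collected in a list and handed over when the next section starts, might look like this in Python 3. It is only a sketch: collect is a hypothetical stand-in parser, type2parser is passed in as an argument, and raising ValueError for a data line that precedes any TYPE line is an assumption about the desired behaviour.

```python
def parse(lines, type2parser):
    """Buffer each section's data lines and pass them to its parser in one call."""
    results = []
    metadata, handler, buffered = None, None, []
    for line in lines:
        if line.startswith("TYPE"):
            if handler is not None:
                # A new section starts: flush the one collected so far.
                results.append(handler(metadata, buffered))
            type_id, metadata = line.split(" ", 1)
            handler, buffered = type2parser[type_id], []
        elif handler is None:
            raise ValueError("data line before any TYPE line: %r" % line)
        else:
            buffered.append(line)
    if handler is not None:
        results.append(handler(metadata, buffered))   # flush the last section
    return results


def collect(metadata, lines):
    # Hypothetical stand-in parser: return what it was given.
    return metadata, lines
```

With this loop a section that has no data lines still reaches its parser, with an empty list.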

George