A vote for re scanner

Wade Leftwich

Every couple of months I have a use for the experimental 'scanner'
object in the re module, and when I do, as I did this morning, it's
really handy. So if anyone is counting votes for making it a standard
part of the module, here's my vote:

+1

-- Wade Leftwich
Ithaca, NY
 
Jeremy Fincher

While I don't think they're still accepting votes :), you've pointed
me to something I didn't know about until now. What kinds of things
have you been using re.Scanner for?

Jeremy
 
Wade Leftwich

A scanner is constructed from a regex object and a string to be
scanned. Each call to the scanner's search() method returns the next
match object of the regex on the string. So to work on a string that
has multiple matches, it's the bee's roller skates.
 
Dang Griffith

Or in Eric's case, *the* roller skate.
--dang
 
Alex Martelli

Wade Leftwich wrote:
...
A scanner is constructed from a regex object and a string to be
scanned. Each call to the scanner's search() method returns the next
match object of the regex on the string. So to work on a string that
has multiple matches, it's the bee's roller skates.

...if that method's name was 'next' (and an appropriate __iter__
also present) it might be even cooler, though...


Alex
 
Wade Leftwich

Indeed:

...     def __init__(self, regex, s):
...         self.scanner = regex.scanner(s)
...     def next(self):
...         m = self.scanner.search()
...         if m:
...             return m
...         else:
...             raise StopIteration
...     def __iter__(self):
...         while 1:
...             yield self.next()
...
...     print m.group('before'), m.group('after')
...
1 b
2 c
3 d
-- Wade
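
For readers on current Python, here is a minimal sketch of the same wrapper idea. It assumes the scanner() method is still available on compiled patterns (it exists in CPython but is undocumented), and the pattern and input string are hypothetical stand-ins, since the originals did not survive in the post. Note that a generator-based __iter__ that lets StopIteration escape from next() no longer works on Python 3.7 and later (PEP 479 turns it into a RuntimeError), so this version implements __next__ directly.

```python
import re


class ScannerIter:
    """Iterate over successive matches via the undocumented pattern.scanner()."""

    def __init__(self, regex, s):
        self.scanner = regex.scanner(s)

    def __iter__(self):
        # The object is its own iterator.
        return self

    def __next__(self):
        # Each search() call resumes where the previous match ended.
        m = self.scanner.search()
        if m is None:
            raise StopIteration
        return m


# Hypothetical pattern and input, chosen only for illustration.
pairs = re.compile(r"(?P<before>\d)(?P<after>[a-z])")
found = [(m.group("before"), m.group("after"))
         for m in ScannerIter(pairs, "1b 2c 3d")]
print(found)
```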
 
Fredrik Lundh

Alex said:
Wade Leftwich wrote:
...

...if that method's name was 'next' (and an appropriate __iter__
also present) it might be even cooler, though...

re.finditer

</F>
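
A short sketch of what re.finditer offers, written for current Python. The pattern and input string are hypothetical, picked only to show named groups being read from each match object in turn.

```python
import re

# Hypothetical pattern and input, for illustration only.
pattern = re.compile(r"(?P<before>\d)(?P<after>[a-z])")
text = "1b 2c 3d"

# finditer yields one match object per non-overlapping match, left to right.
results = []
for m in pattern.finditer(text):
    results.append((m.group("before"), m.group("after")))

for before, after in results:
    print(before, after)
```

The module-level form re.finditer(pattern_string, text) behaves the same way without compiling the pattern first.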
 
Wade Leftwich

Fredrik Lundh said:
... print m.group('before'), m.group('after')
...
1 b
2 c
3 d

</F>

There I go, reimplementing the wheel again. Guess I didn't pay enough
attention to "What's New In 2.2". Thanks for the pointer. It appears
we don't need that scanner() method after all.

However, from my point of view it was a good exercise, because now I
know how easy it is to make an iterator.

Thanks again

-- Wade
 
allanc

I'm new to Python, so bear with me.

I'm looking for a way to elegantly parse fixed-width text data (as opposed
to CSV) and save the parsed data into a database. The text data comes
from an old ISAM-format table, and each line may have a different record
structure depending on key fields in the line.

RegExp with match and split are of interest, but it's been too long since
I've dabbled with RE to be able to judge whether using it will make the
problem more complex.

Here's a sample of the records I need to parse:

01508390019002 11284361000002SUGARPLUM
015083915549 SHORT ON LAST ORDER
0150839220692 000002EA BMC 15 KG 001400

1st line is a (portion of a) header record.
2nd line is a text instruction record.
3rd line is a transaction line item record.

Each type of record has a different structure, but these sets of lines
all appear in the one table.


Any ideas would be greatly appreciated.

Allan
 
Dang Griffith

Are the key fields in fixed positions? If so, pluck them out and use
them as an index into a dictionary of functions to call. I can't tell
from your example where the keys are, so I'm assuming the first 8 are
simply a line number and the next 4 are the key.

Maybe something along these lines:

def header(x):
    print 'header: %s' % x # process header

def testinstruction(x):
    print 'test instruction: %s' % x # process test instruction

def lineitem(x):
    print 'lineitem: %s' % x # process line item

ptable = {'0190':header, '5549': testinstruction, '2069': lineitem}

for line in file("data.dat"):
    ptable[line[8:12]](line)

--dang
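
A rough sketch of the same dispatch-table idea in current Python, run against the three sample lines instead of a file. It keeps the guess made above that characters 8 through 11 hold the record key; the handler names, the returned tuples, and the error handling for unknown keys are illustrative assumptions, not part of the original suggestion.

```python
def header(line):
    # Hypothetical handler: returns a tag plus the rest of the line.
    return ("header", line[12:].strip())


def text_instruction(line):
    return ("text_instruction", line[12:].strip())


def line_item(line):
    return ("line_item", line[12:].strip())


# Key values taken from the sample data; positions 8..11 are an assumption.
HANDLERS = {"0190": header, "5549": text_instruction, "2069": line_item}


def dispatch(line):
    key = line[8:12]
    handler = HANDLERS.get(key)
    if handler is None:
        raise ValueError("unknown record key: %r" % key)
    return handler(line)


SAMPLE = [
    "01508390019002 11284361000002SUGARPLUM",
    "015083915549 SHORT ON LAST ORDER",
    "0150839220692 000002EA BMC 15 KG 001400",
]

records = [dispatch(line) for line in SAMPLE]
for kind, payload in records:
    print(kind, payload)
```

Using dict.get with an explicit error keeps an unexpected record type from surfacing as a bare KeyError deep inside the loop.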
 
David Goodger

allanc said:
Here's a sample of the records I need to parse:

01508390019002 11284361000002SUGARPLUM
015083915549 SHORT ON LAST ORDER
0150839220692 000002EA BMC 15 KG 001400

1st Line is a (portion of) header record.
2nd Line is an text instruction record.
3rd Line is a Transaction Line Item record.

I've written many programs to parse data very similar to this,
until I generalized the algorithm (a line-oriented state machine)
into a module. You can find the module (internally documented)
at http://docutils.sf.net/docutils/statemachine.py.

Hope it helps!
 
wes weston

allanc,
-slices as in str[0:5] or str[5:] or str[5:-1] - get pieces of a string
-you'll probably want to strip leading/trailing spaces; see strings doc
-you may need to cast/convert
_int = int("55")
_float = float("4.2")
wes
 
wes weston

Allan,
Maybe this will help more:
>>> line = "015083915549 SHORT ON LAST ORDER 0150839220692"
>>> print line[0:10]
0150839155
>>> print line[:10]
0150839155
>>> print line[5:10]
39155
>>> print line[-10:-1]
083922069
>>> print int(line[-10:-1])
83922069
>>> print " xyz ".strip()
xyz

wes
 
Paul McGuire

Allan -

Let me put in a plug for pyparsing. I think your problem is tailor-made for
pyparsing's easy-to-use grammar definitions and execution. There is no special
lex/yacc-like syntax or RE symbology to master; you assemble your grammar
using simply-named classes (such as Literal, OneOrMore, Word(wordchars),
Optional, etc.) and intuitive operators (+ for sequence, | for greedy
alternation, ^ for longest-match alternation, ~ for, um, Not-tion).

A grammar to parse "Hello, World!" might look like:

    helloGrammar = Word(alphas) + "," + Word(alphas) + oneOf(". ! ? !! !!!")

which could then parse any of:

    Hello, World!
    Hello , World !
    Hello,World!
    Yo, Adrian!!!
    Hey, man.
    Whattup, dude?

You can associate field names with specific parse elements, so that the
fields can be extracted from the results such as:

    helloGrammar = Word(alphas).setResultsName("greeting") + "," + \
                   Word(alphas).setResultsName("to") + oneOf(". ! ? !! !!!")
    results = helloGrammar.parseString( greetingstring )
    print results.greeting
    print results.to

You can associate parse actions (a la SAX) to fire when matching parse
elements are matched in the input.

You can find the pyparsing home page at http://pyparsing.sourceforge.net.

-- Paul McGuire
 
Larry Bates

I think one of the easiest ways to do this is to
write a class that knows how to parse each of the
unique lines. As you are reading through the file/table
and encounter a line like the first, create a new
class instance and pass it the line's contents. The
__init__ method of the class can parse the line and
place each of the field values in an attribute of the
class.

Something like (this is pseudocode):

class linetype01:
    #
    # Define a list that contains information about how to
    # parse a single linetype. The info is fieldname,
    # beginning column, fieldlength
    #
    _parsinginfo = [('recnum', 0, 8),
                    ('linetype', 8, 3),
                    ('dataitem2', 11, 3),
                    # ...
                    ]

    def __init__(self, linetext):
        self.linetext = linetext
        for fieldname, begincol, fieldlength in self._parsinginfo:
            # Slice ends are exclusive, so no +1 is needed
            self.__dict__[fieldname] = linetext[begincol:begincol+fieldlength]

You would define a class like this for each unique linetype.

In the main program:

import sys

#
# Insert code to open file/table here
#
for line in table:
    #
    # See which linetype it is
    #
    linetype = line[8:10]
    if linetype == "01":
        pline = linetype01(line)
        #
        # Now you can extract the values by accessing attributes of
        # the class.
        #
        recordnum = pline.recnum
        tlinetype = pline.linetype
        #
        # Do something with the values
        #

    elif linetype == "55":
        pline = linetype55(line)

    elif linetype == "20":
        pline = linetype20(line)

    else:
        print "ERROR-Illegal linetype encountered"
        sys.exit(2)


Just one of many ways to solve this problem.

-Larry
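
A minimal runnable sketch of the field-table approach in current Python, using a plain function and a dict rather than one class per line type. The column positions, field names, and the optional converter in each layout entry are hypothetical; the real layout of the ISAM records is not given in the thread.

```python
def parse_fixed(line, layout):
    """Slice a fixed-width line into a dict according to a layout table.

    Each layout entry is (fieldname, start column, width, converter).
    """
    record = {}
    for name, start, width, convert in layout:
        raw = line[start:start + width].strip()  # slice end is exclusive
        record[name] = convert(raw)
    return record


# Hypothetical layout for the text instruction record.
TEXT_LAYOUT = [
    ("recnum", 0, 8, int),
    ("linetype", 8, 2, str),
    ("subtype", 10, 2, str),
    ("text", 12, 40, str),
]

rec = parse_fixed("015083915549 SHORT ON LAST ORDER", TEXT_LAYOUT)
print(rec)
```

Keeping the layout as data means a new record type needs only a new table, and slicing past the end of a short line simply yields a shorter (possibly empty) string rather than an error.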
 
Josiah Carlson

01508390019002 11284361000002SUGARPLUM
015083915549 SHORT ON LAST ORDER
0150839220692 000002EA BMC 15 KG 001400

Is the above the format of all possible lines (aside from empty lines)?

- Josiah
 
