Regular Expression - old regex module vs. re module

Steve · Jun 29, 2006

Hi All,

I'm having a tough time converting the following regex.compile patterns
into the new re.compile format. There are also differences between
regsub.sub() and re.sub().

Could anyone lend a hand?

import regsub
import regex

import re # << need conversion to this module

.....

"""Convert perl style format symbology to printf tokens.

Take a string and substitute computed printf tokens for perl style
format symbology.

For example:

###.## yields %6.2f
######## yields %8d
<<<<< yields %-5s
"""

exponentPattern = regex.compile('\(^\|[^\\#]\)\(#+\.#+\*\*\*\*\)')
floatPattern = regex.compile('\(^\|[^\\#]\)\(#+\.#+\)')
integerPattern = regex.compile('\(^\|[^\\#]\)\(##+\)')
leftJustifiedStringPattern = regex.compile('\(^\|[^\\<]\)\(<<+\)')
rightJustifiedStringPattern = regex.compile('\(^\|[^\\>]\)\(>>+\)')

while 1: # process all integer fields
    print("Testing Integer")
    if integerPattern.search(s) < 0: break
    print("Integer Match : ", integerPattern.search(s).span() )
    # i1 , i2 = integerPattern.regs[2]
    i1 , i2 = integerPattern.search(s).span()
    width_total = i2 - i1
    f = '%'+`width_total`+'d'
    # s = regsub.sub(integerPattern, '\\1'+f, s)
    s = integerPattern.sub(f, s)

Thanks in advance!

Steve

Jim Segrave · Jun 30, 2006

Perhaps not optimal, but this processes things as requested. Note that
all floats have to be done before any integer patterns are replaced.

==========================
#!/usr/local/bin/python

import re

"""Convert perl style format symbology to printf tokens.
Take a string and substitute computed printf tokens for perl style
format symbology.

For example:

###.## yields %6.2f
######## yields %8d
<<<<< yields %-5s
"""

# handle cases where there's no integer or no fractional chars
floatPattern = re.compile(r'(?<!\\)(#+\.(#*)|\.(#+))')
integerPattern = re.compile(r'(?<![\\.])(#+)(?![.#])')
leftJustifiedStringPattern = re.compile(r'(?<!\\)(<+)')
rightJustifiedStringPattern = re.compile(r'(?<!\\)(>+)')

def float_sub(matchobj):
    # fractional part may be in either groups()[1] or groups()[2]
    if matchobj.groups()[1] is not None:
        return "%%%d.%df" % (len(matchobj.groups()[0]),
                             len(matchobj.groups()[1]))
    else:
        return "%%%d.%df" % (len(matchobj.groups()[0]),
                             len(matchobj.groups()[2]))

def unperl_format(s):
    changed_things = 1
    while changed_things:
        # lather, rinse and repeat until nothing new happens
        changed_things = 0

        mat_obj = leftJustifiedStringPattern.search(s)
        if mat_obj:
            s = re.sub(leftJustifiedStringPattern, "%%-%ds" %
                       len(mat_obj.groups()[0]), s, 1)
            changed_things = 1

        mat_obj = rightJustifiedStringPattern.search(s)
        if mat_obj:
            s = re.sub(rightJustifiedStringPattern, "%%%ds" %
                       len(mat_obj.groups()[0]), s, 1)
            changed_things = 1

        # must do all floats before ints
        mat_obj = floatPattern.search(s)
        if mat_obj:
            s = re.sub(floatPattern, float_sub, s, 1)
            changed_things = 1
            # don't fall through to the int code
            continue

        mat_obj = integerPattern.search(s)
        if mat_obj:
            s = re.sub(integerPattern, "%%%dd" % len(mat_obj.groups()[0]),
                       s, 1)
            changed_things = 1
    return s

if __name__ == '__main__':
    testarray = ["integer: ####, integer # integer at end #",
                 "float ####.## no decimals ###. no int .### at end ###.",
                 "Left string <<<<<< short left string <",
                 "right string >>>>>> short right string >",
                 "escaped chars \\#### \\####.## \\<\\<<<< \\>\\><<<"]

    for s in testarray:
        print("Testing: %s" % s)
        print "Result: %s" % unperl_format(s)
        print

======================

Running this gives

Testing: integer: ####, integer # integer at end #
Result: integer: %4d, integer %1d integer at end %1d

Testing: float ####.## no decimals ###. no int .### at end ###.
Result: float %7.2f no decimals %4.0f no int %4.3f at end %4.0f

Testing: Left string <<<<<< short left string <
Result: Left string %-6s short left string %-1s

Testing: right string >>>>>> short right string >
Result: right string %6s short right string %1s

Testing: escaped chars \#### \####.## \<\<<<< \>\><<<
Result: escaped chars \#%3d \#%6.2f \<\<%-3s \>\>%-3s

Paul McGuire · Jun 30, 2006

Steve said:
Hi All,

I'm having a tough time converting the following regex.compile patterns
into the new re.compile format. There is also a differences in the
regsub.sub() vs. re.sub()

Could anyone lend a hand?

Not an re solution, but pyparsing makes for an easy-to-follow program.
TransformString only needs to scan through the string once - the
"reals-before-ints" testing is factored into the definition of the
formatters variable.

Pyparsing's project wiki is at http://pyparsing.wikispaces.com.

-- Paul

-------------------
from pyparsing import *

"""
read Perl-style formatting placeholders and replace with
proper Python %x string interp formatters

###### -> %6d
##.### -> %6.3f

"""

# set up patterns to be matched - Word objects match character groups
# made up of characters in the Word constructor; Combine forces
# elements to be adjacent with no intervening whitespace
# (note use of results name in realFormat, for easy access to
# decimal places substring)
intFormat = Word("#")
realFormat = Combine(Word("#")+"."+
Word("#").setResultsName("decPlaces"))
leftString = Word("<")
rightString = Word(">")

# define parse actions for each - the matched tokens are the third
# arg to parse actions; parse actions will replace the incoming tokens with
# value returned from the parse action
intFormat.setParseAction( lambda s,l,toks: "%%%dd" % len(toks[0]) )
realFormat.setParseAction( lambda s,l,toks: "%%%d.%df" %
(len(toks[0]),len(toks.decPlaces)) )
leftString.setParseAction( lambda s,l,toks: "%%-%ds" % len(toks[0]) )
rightString.setParseAction( lambda s,l,toks: "%%%ds" % len(toks[0]) )

# collect all formatters into a single "grammar"
# - note reals are checked before ints
formatters = rightString | leftString | realFormat | intFormat

# set up our test string, and use transform string to invoke parse actions
# on any matched tokens
testString = """
This is a string with
ints: #### # ###############
floats: #####.# ###.###### #.#
left-justified strings: <<<<<<<< << <
right-justified strings: >>>>>>>>>> >> >
int at end of sentence: ####.
"""
print formatters.transformString( testString )

-------------------
Prints:

This is a string with
ints: %4d %1d %15d
floats: %7.1f %10.6f %3.1f
left-justified strings: %-8s %-2s %-1s
right-justified strings: %10s %2s %1s
int at end of sentence: %4d.

Jim Segrave · Jun 30, 2006

Paul McGuire said:
Not an re solution, but pyparsing makes for an easy-to-follow program.
TransformString only needs to scan through the string once - the
"reals-before-ints" testing is factored into the definition of the
formatters variable.

Pyparsing's project wiki is at http://pyparsing.wikispaces.com.

It fails for floats specified as ###. or .###: it outputs an integer
format and the decimal point separately. It also ignores \# which
should prevent the '#' from being included in a format.

Paul McGuire · Jun 30, 2006

Jim Segrave said:
If fails for floats specified as ###. or .###, it outputs an integer
format and the decimal point separately. It also ignores \# which
should prevent the '#' from being included in a format.

True. What is the spec for these formatting strings, anyway? I Googled a
while, and it does not appear that this is really a Perl string formatting
technique, despite the OP's comments to the contrary. And I'm afraid my
limited Regex knowledge leaves the OP's example impenetrable to me. I got
lost among the '\'s and parens.

I actually thought that "###." was *not* intended to be floating point, but
instead represented an integer before a sentence-ending period. You do have
to be careful of making *both* leading and trailing digits optional, or else
simple sentence punctuating periods will get converted to "%1f"!

As for *ignoring* "\#", it would seem to me we would rather convert this to
"#", since "#" shouldn't be escaped in normal string interpolation.

The following modified version adds handling for "\#", "\<" and "\>", and
real numbers with no integer part. The resulting program isn't radically
different from the first version. (I've highlighted the changes with "<==="
marks.)

-- Paul

------------------
from pyparsing import Combine,Word,Optional,Regex

"""
read Perl-style formatting placeholders and replace with
proper %x string interp formatters

###### -> %6d
##.### -> %6.3f

"""

# set up patterns to be matched
# (note use of results name in realFormat, for easy access to
# decimal places substring)
intFormat = Word("#")
realFormat = Combine(Optional(Word("#"))+"."+ # <===
Word("#").setResultsName("decPlaces"))
leftString = Word("<")
rightString = Word(">")
escapedChar = Regex(r"\\[#<>]") # <===

# define parse actions for each - the matched tokens are the third
# arg to parse actions; parse actions will replace the incoming tokens with
# value returned from the parse action
intFormat.setParseAction( lambda s,l,toks: "%%%dd" % len(toks[0]) )
realFormat.setParseAction( lambda s,l,toks: "%%%d.%df" %
(len(toks[0]),len(toks.decPlaces)) )
leftString.setParseAction( lambda s,l,toks: "%%-%ds" % len(toks[0]) )
rightString.setParseAction( lambda s,l,toks: "%%%ds" % len(toks[0]) )
escapedChar.setParseAction( lambda s,l,toks: toks[0][1] ) # <===

# collect all formatters into a single "grammar"
# - note reals are checked before ints
formatters = rightString | leftString | realFormat | intFormat | escapedChar
# <===

# set up our test string, and use transform string to invoke parse actions
# on any matched tokens
testString = r"""
This is a string with
ints: #### # ###############
floats: #####.# ###.###### #.# .###
left-justified strings: <<<<<<<< << <
right-justified strings: >>>>>>>>>> >> >
int at end of sentence: ####.
I want \##, please.
"""

print testString
print formatters.transformString( testString )

------------------
Prints:

This is a string with
ints: #### # ###############
floats: #####.# ###.###### #.# .###
left-justified strings: <<<<<<<< << <
right-justified strings: >>>>>>>>>> >> >
int at end of sentence: ####.
I want \##, please.

This is a string with
ints: %4d %1d %15d
floats: %7.1f %10.6f %3.1f %4.3f
left-justified strings: %-8s %-2s %-1s
right-justified strings: %10s %2s %1s
int at end of sentence: %4d.
I want #%1d, please.

Paul McGuire · Jun 30, 2006

Jim Segrave said:
If fails for floats specified as ###. or .###, it outputs an integer
format and the decimal point separately. It also ignores \# which
should prevent the '#' from being included in a format.

Ah! This may be making some sense to me now. Here are the OP's original
re's for matching.

exponentPattern = regex.compile('\(^\|[^\\#]\)\(#+\.#+\*\*\*\*\)')
floatPattern = regex.compile('\(^\|[^\\#]\)\(#+\.#+\)')
integerPattern = regex.compile('\(^\|[^\\#]\)\(##+\)')
leftJustifiedStringPattern = regex.compile('\(^\|[^\\<]\)\(<<+\)')
rightJustifiedStringPattern = regex.compile('\(^\|[^\\>]\)\(>>+\)')

Each re seems to have two parts to it. The leading parts appear to be
guards against escaped #, <, or > characters, yes? The second part of each
re shows the actual pattern to be matched. If so:

It seems that we *don't* want "###." or ".###" to be recognized as floats,
floatPattern requires at least one "#" character on either side of the ".".
Also note that single #, <, and > characters don't seem to be desired, but
at least two or more are required for matching. Pyparsing's Word class
accepts an optional min=2 constructor argument if this really is the case.
And it also seems that the pattern is supposed to be enclosed in ()'s. This
seems especially odd to me, since one of the main points of this funky
format seems to be to set up formatting that preserves column alignment of
text, as if creating a tabular output - enclosing ()'s just junks this up.
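
A note on the syntax, with a small sketch. In the old regex module's default (Emacs-style) syntax, `\(`, `\)` and `\|` are the grouping and alternation operators, while in re the bare `(`, `)` and `|` play that role. Read that way, the first group is the guard against an escaped or adjacent character and the second group is the field itself, so the parentheses are not literal characters. The following is only an assumed translation written for Python 3; the lowercase names and the `replace_integers` helper are made up for illustration and are not the OP's code.

```python
import re

# Assumed translation: old-style \( \) \| become bare ( ) | in re.
# Group 1 is the guard (start of string, or any char except '\' and the
# field character); group 2 is the field itself.
float_pattern = re.compile(r'(^|[^\\#])(#+\.#+)')
integer_pattern = re.compile(r'(^|[^\\#])(##+)')
left_pattern = re.compile(r'(^|[^\\<])(<<+)')

def replace_integers(s):
    """Replace each unescaped run of two or more '#' with a %Nd token."""
    def repl(m):
        # Put the guard character back, like '\\1' + f did with regsub.sub().
        return m.group(1) + '%%%dd' % len(m.group(2))
    return integer_pattern.sub(repl, s)

print(replace_integers('count: #### total: ########'))
print(replace_integers(r'escaped \## stays'))
```

Because the guard consumes one character, the replacement has to put group 1 back, which is what the commented-out `'\\1'+f` in the original regsub.sub() call was doing.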

My example also omitted the exponent pattern. This can be handled with
another expression like realFormat, but with the trailing "****" characters.
Be sure to insert this expression before realFormat in the list of
formatters.
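
The same ordering concern shows up if the fields are matched with a single re alternation instead of a pyparsing grammar. The sketch below is not Paul's pyparsing code; it is a rough Python 3 illustration in which `FIELD` and `to_printf` are hypothetical names, and it assumes the "two or more characters" rule from the OP's integer pattern.

```python
import re

# Most specific alternative first: exponent, then float, then integer.
# If the float alternative came first, '##.###****' would match as a float
# and leave the four asterisks behind.
FIELD = re.compile(r'(?P<exp>#+\.#+\*{4})|(?P<flt>#+\.#+)|(?P<int>##+)')

def to_printf(match):
    """Build a printf token whose width equals the width of the field."""
    text = match.group(0)
    width = len(text)
    if match.lastgroup == 'int':
        return '%%%dd' % width
    if match.lastgroup == 'exp':
        # digits after the '.', not counting the trailing '****'
        decimals = width - 4 - text.index('.') - 1
        return '%%%d.%de' % (width, decimals)
    decimals = width - text.index('.') - 1
    return '%%%d.%df' % (width, decimals)

result = FIELD.sub(to_printf, '###.#### ##.###**** #####')
print(result)
```

The escape guard is left out here to keep the ordering point visible.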

I may be completely off in my re interpretation. Perhaps one of the re
experts here can explain better what the OP's re's are all about. Can
anybody locate/cite the actual spec for this formatting, um, format?

-- Paul

Paul McGuire · Jun 30, 2006

Jim Segrave said:
If fails for floats specified as ###. or .###, it outputs an integer
format and the decimal point separately. It also ignores \# which
should prevent the '#' from being included in a format.

Here's a little more study on this (all tests are using Python 2.4.1):

If floats are specified as "###.", should we generate "%4.0f" as the result?
In fact, to get 3 leading places and a trailing decimal point when 0
decimal places are desired, the field should be formatted with "%3.0f." - we
have to explicitly put in the trailing '.' character.

10.<

But as we see below, if the precision field is not zero, the initial width
consumes one character for the decimal point. If the precision field *is*
zero, then the entire width is used for the integer part of the value, with
no trailing decimal point.
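
A quick way to see the width behaviour described above (a small sketch written for Python 3; the variable names are only for illustration):

```python
value = 10.0

# Precision 0: no decimal point is printed, so the whole width is digits.
no_point = '%4.0f' % value
# A trailing point has to be written literally after the token.
literal_point = '%3.0f.' % value
# Nonzero precision: the decimal point counts toward the total width.
with_decimals = '%6.2f' % value

for text in (no_point, literal_point, with_decimals):
    print('[' + text + ']')
```

The brackets make the padding visible: the three lines come out as `[  10]`, `[ 10.]` and `[ 10.00]`.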

".###" almost makes no sense. There is no floating point format that
suppresses the leading '0' before the decimal point.

0.00<

Using %f with a nonzero precision field will always output at least the
number of decimal places, plus the decimal point and a leading '0' if the
number is less than 1.
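
If a ".###" field really had to print without the leading zero, the zero would need to be stripped after formatting. A minimal sketch for Python 3, where `drop_leading_zero` is a hypothetical helper and not part of any library:

```python
def drop_leading_zero(text):
    """Hypothetical helper: turn '0.500' into '.500' and '-0.500' into '-.500'."""
    if text.startswith('0.'):
        return text[1:]
    if text.startswith('-0.'):
        return '-' + text[2:]
    return text

small = 0.5
plain = '%4.3f' % small                      # width is only a minimum
trimmed = drop_leading_zero('%.3f' % small)  # fits a 4-column '.###' field
print(plain, trimmed)
```

Note that `'%4.3f'` produces five characters for 0.5, one more than the field width, which is the overflow being described.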

This whole discussion so far has also ignored negative values. Again, we
should really look more into the spec for this formatting scheme, rather
than try to read the OP's mind.

-- Paul

Jim Segrave · Jun 30, 2006

Paul McGuire said:
Jim Segrave said:

If fails for floats specified as ###. or .###, it outputs an integer
format and the decimal point separately. It also ignores \# which
should prevent the '#' from being included in a format.

Ah! This may be making some sense to me now. Here are the OP's original
re's for matching.

exponentPattern = regex.compile('\(^\|[^\\#]\)\(#+\.#+\*\*\*\*\)')
floatPattern = regex.compile('\(^\|[^\\#]\)\(#+\.#+\)')
integerPattern = regex.compile('\(^\|[^\\#]\)\(##+\)')
leftJustifiedStringPattern = regex.compile('\(^\|[^\\<]\)\(<<+\)')
rightJustifiedStringPattern = regex.compile('\(^\|[^\\>]\)\(>>+\)')

Each re seems to have two parts to it. The leading parts appear to be
guards against escaped #, <, or > characters, yes? The second part of each
re shows the actual pattern to be matched. If so:

It seems that we *don't* want "###." or ".###" to be recognized as floats,
floatPattern requires at least one "#" character on either side of the ".".
Also note that single #, <, and > characters don't seem to be desired, but
at least two or more are required for matching. Pyparsing's Word class
accepts an optional min=2 constructor argument if this really is the case.
And it also seems that the pattern is supposed to be enclosed in ()'s. This
seems especially odd to me, since one of the main points of this funky
format seems to be to set up formatting that preserves column alignment of
text, as if creating a tabular output - enclosing ()'s just junks this up.

The poster was excluding escaped characters (escaped with a '\' character),
but I've just looked up the Perl format statement and in fact fields always
begin with a '@', and yes, having no digits on one side of the decimal
point is legal. Strings can be left or right justified ('@<<<<', '@>>>>') or
centred ('@||||'); numerics begin with an '@', contain '#' and may contain
a decimal point. Fields beginning with '^' instead of '@' are omitted
if the format is a numeric ('#' with/without decimal). I assumed from
the poster's original patterns that one has to worry about '@', but
that's incorrect: they need to be present for a field to be a format as
opposed to ordinary text, and there appears to be no way to embed a '@'
in a format. It's worth noting that Perl does implicit float to int
coercion, so it treats @### the same for ints and floats (no decimal
printed).

For the grisly details:

http://perl.com/doc/manual/html/pod/perlform.html

Paul McGuire · Jun 30, 2006

Ah, wunderbar! Some further thoughts...

I can see that the OP omitted the concept of "@|||" centering, since the
Python string interpolation forms only support right or left justified
fields, and it seems he is trying to do some form of format->string interp
automation. Adding centering would require not only composing a suitable
string interp format, but also some sort of pad() operation in the arg
passed to the string interp operation. I suspect this also rules out simple
handling of the '^' operator as mentioned in the spec, and likewise for the
trailing ellipsis if a field is not long enough for the formatted value.
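
To illustrate the extra pad() step that centering would need: the % operator has no centering flag, so the argument has to be padded before it is interpolated. A minimal Python 3 sketch, where `centered` is a hypothetical helper name:

```python
def centered(value, width):
    """Hypothetical helper: truncate, then centre, a value for an '@|||' field."""
    return str(value)[:width].center(width)

# The centred value goes in through a plain %s; the padding is already done.
line = 'Name: [%s] Qty: %4d' % (centered('bolt', 10), 42)
print(line)
```

This is why a pure format-string conversion is not enough for "@|||": the converter would also have to rewrite the argument list.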

The '@' itself seems to be part of the field, so "@<<<<" would be a 5
column, left-justified string. A bare '@' seems to be a single string
placeholder (meaningless to ask right or left justified), since this is
used in the doc's hack for including a "@" in the output. (That is, as you
said, the original spec provides no mechanism for escaping in a '@'
character, it has to get hacked in as a value dropped into a single
character field.)

The Perl docs say that fields that are too long are truncated. This does
not happen in Python string interps for numeric values, but it can be done
with strings (using the precision field).
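
A short demonstration of the precision field acting as a maximum width for strings but not for numbers (a sketch for Python 3; the sample values are arbitrary):

```python
long_text = 'ABCDEFGHIJKLMNOP'

truncated = '%-10.10s|' % long_text   # precision caps the string at 10 chars
padded = '%-10.10s|' % 'abc'          # short strings are still padded to 10
overflow = '%3d|' % 123456            # numbers are never truncated

for text in (truncated, padded, overflow):
    print(text)
```

The first line prints `ABCDEFGHIJ|`, while the integer simply overflows its three-column field.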

So if we were to focus on support for "@", "@>>>", "@<<<", "@###" and
"@###.##" (with and without leading or trailing digits about the decimal)
style format fields, this shouldn't be overly difficult, and may even meet
the OP's requirements. (The OP seemed to also want some support for
something like "@##.###****" for scientific notation, again, not a
dealbreaker.)
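
As a rough sketch of what that subset could look like with re (Python 3): the pattern and the `convert` and `perl_to_printf` names below are assumptions made for illustration, not an implementation of the full perlform spec. The '@' is counted as part of the field width, and strings get a precision equal to their width so that over-long values are truncated.

```python
import re

# Assumed subset: '@', '@<<<', '@>>>', '@###', '@##.##', '@###.' and '@.###'.
PERL_FIELD = re.compile(
    r'@(?:(?P<left><+)|(?P<right>>+)|(?P<num>#+(?:\.#*)?|\.#+))?')

def convert(match):
    text = match.group(0)
    width = len(text)                      # the '@' counts toward the width
    if match.group('num'):
        _, dot, frac = text.partition('.')
        if dot:
            return '%%%d.%df' % (width, len(frac))
        return '%%%dd' % width
    if match.group('right'):
        return '%%%d.%ds' % (width, width)
    # '@<<<' and a bare '@' are both treated as left-justified strings
    return '%%-%d.%ds' % (width, width)

def perl_to_printf(template):
    """Convert the assumed subset of '@' fields to printf-style tokens."""
    return PERL_FIELD.sub(convert, template)

print(perl_to_printf('@<<<< @>>> @### @##.## @'))
```

The sketch does not escape literal '%' characters in the template, and it treats "@###." as a float with no decimals, which is one possible reading of the spec.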

-- Paul

Jim Segrave · Jun 30, 2006

One would need a much clearer spec on what the OP really wants to do. Note
that Perl formats have the variable names embedded as part of the
format string, so writing a simple Perl->Python converter isn't going
to work.

I've given him a good start for an re based solution, you've given one
for a pyparsing based one, at this point I'd hope the OP can take it
from there or can come back with more specific questions on how to
deal with some of the awfulness of the formats he's working with.

Steve · Jun 30, 2006

Hi All!

Thanks for your suggestions and comments! I was able to use some of
your code and suggestions and have come up with this new version of
Report.py.

Here's the updated code :

-----------------------------------------------------------------

#!/usr/bin/env python
"""Provides two classes to create formatted reports.

The ReportTemplate class reads a template file or string containing a
fixed format with field tokens and substitutes member values from an
arbitrary python object.

The ColumnReportTemplate class takes a string argument to define a
header and line format for multiple calls with sequence data.

6/30/2006
Steve Reiss (e-mail address removed) - Converted to re module methods

"""

__author__ = "Robin Friedrich (e-mail address removed)"
__version__ = "1.0.0"

import string
import sys
import re

from types import StringType, ListType, TupleType, InstanceType, \
     FileType

#these regex pattern objects are used in the _make_printf function

exponentPattern = re.compile('\(^\|[^\\#]\)|#+\.#+\*\*\*\*')
floatPattern = re.compile('\(^\|[^\\#]\)|#+\.#+')
integerPattern = re.compile("\(^\|[^\\#]\)|\##+")
leftJustifiedStringPattern = re.compile('\(^\|[^\\<]\)|\<<+')
rightJustifiedStringPattern = re.compile('\(^\|[^\\>]\)|\>>+')

###################################################################
# _make_printf #
###################################################################

def _make_printf(s):
    """Convert perl style format symbology to printf tokens.

    Take a string and substitute computed printf tokens for perl style
    format symbology.

    For example:

    ###.## yields %6.2f
    ######## yields %8d
    <<<<< yields %-5s
    """
    # print("Original String = %s\n\n") % (s)

    while 1: # process all sci notation fields
        if exponentPattern.search(s) < 0: break
        i1 , i2 = exponentPattern.search(s).span()
        width_total = i2 - i1
        field = s[i1:i2-4]
        width_mantissa = len( field[string.find(field,'.')+1:] )
        f = '%'+`width_total`+'.'+`width_mantissa`+'e'
        s = exponentPattern.sub(f, s, 1)

    while 1: # process all floating pt fields
        if floatPattern.search(s) < 0: break
        i1 , i2 = floatPattern.search(s).span()
        width_total = i2 - i1
        field = s[i1:i2]
        width_mantissa = len( field[string.find(field,'.')+1:] )
        f = '%'+`width_total`+'.'+`width_mantissa`+'f'
        s = floatPattern.sub(f, s, 1)

    while 1: # process all integer fields
        if integerPattern.search(s) < 0: break
        i1 , i2 = integerPattern.search(s).span()
        width_total = i2 - i1
        f = '%'+`width_total`+'d'
        s = integerPattern.sub(f, s, 1)

    while 1: # process all left justified string fields
        if leftJustifiedStringPattern.search(s) < 0: break
        i1 , i2 = leftJustifiedStringPattern.search(s).span()
        width_total = i2 - i1
        f = '%-'+`width_total`+'s'
        s = leftJustifiedStringPattern.sub(f, s, 1)

    while 1: # process all right justified string fields
        if rightJustifiedStringPattern.search(s) < 0: break
        i1 , i2 = rightJustifiedStringPattern.search(s).span()
        width_total = i2 - i1
        f = '%'+`width_total`+'s'
        s = rightJustifiedStringPattern.sub(f, s, 1)

    s = re.sub('\\\\', ' ', s)
    # print
    # print("printf format = %s") % (s)
    return s

###################################################################
# ReportTemplate #
###################################################################

class ReportTemplate:
    """Provide a print formatting object.

    Defines an object which holds a formatted output template and can
    print values substituted from a data object. The data members from
    another Python object are used to substitute values into the
    template. This template object is initialized from a template
    file or string which employs the formatting technique below. The
    intent is to provide a specification template which preserves
    spacing so that fields can be lined up easily.

    Special symbols are used to identify fields into which values
    are substituted.

    These symbols are:

    #####       for right justified integer

    #.###       for fixed point values rounded mantissa

    #.###****   for scientific notation (four asterisks required)

    <<<<<       for left justified string

    %%          is needed in the template to signify a real percentage
                symbol

    \#          A backslash is used to escape the above ##, <<, >> symbols
                if you need to use them outside a field spec.
                The backslash will be removed upon output.

    The total width of the symbol and its decimal point position is
    used to compute the appropriate printf token; see 'make_printf'
    method. The symbol must have at least two adjacent characters for
    it to be recognized as a field specifier.

    To the right of each line of template body, following a '@@'
    delimiter, is a comma separated list for corresponding variable
    names. Sequence objects are supported. If you place a name of a
    5-tuple for example, there should be five fields specified on the
    left prepared to take those values. Also, individual element or
    slices can be used. The values from these variable names will be
    substituted into their corresponding fields in sequence.

    For example:

    a line of template might look like:
    TGO1 = ####.# VGO = ##.####**** Vehicle: <<<<<<<<<< @@ t_go,v_go, vname
    and would print like:
    TGO1 = 22.4 VGO = -1.1255e+03 Vehicle: Atlantis
    """
    delimiter = '@@'

    def __init__( self, template = ''):
        self.body = []
        self.vars = []

        #read in and parse a format template
        try:
            tpl = open(template, 'r')
            lines = string.split(tpl.read(), '\n')[:-1]
            tpl.close()
        except IOError:
            lines = string.split(template, '\n')

        self.nrows = len(lines)

        for i in range(self.nrows):
            self.body.append([])
            self.vars.append([])

        for i in range(self.nrows):
            splits = string.split(lines[i], self.delimiter)
            body = splits[0] # I don't use tuple unpacking here because
                             # I don't know if there was indeed a @@ on the line

            if len(splits) > 1 :
                vars = splits[1]
            else:
                vars = ''

            #if body[-1] == '\n':
                #self.body[i] = body[:-1]
            #else:
            self.body[i] = body
            varstrlist = string.split(vars, ',')
            #print i, varstrlist

            for item in varstrlist:
                self.vars[i].append(string.strip(item))

            #print self.vars
            if len(self.vars[i]) > 0:
                self.body[i] = _make_printf( self.body[i] )
            else:
                print 'Template formatting error, line', i+1

    def __repr__(self):
        return string.join(self.body, '\n')

    def __call__(self, *dataobjs):
        return self._format(dataobjs[0])

    def _format( self, dataobj ):
        """Return the values of the given data object substituted into
        the template format stored in this object.
        """
        # value[] is a list of lists of values from the dataobj
        # body[] is the list of strings with %tokens to print
        # if value == None just print the string without the % argument
        s = ''
        value = []

        for i in range(self.nrows):
            value.append([])

        for i in range(self.nrows):
            for vname in self.vars[i]:
                try:
                    if string.find(vname, '[') < 0:
                        # this is the nominal case and a simple get
                        # will be faster
                        value[i].append(getattr(dataobj, vname))
                    else:
                        # I use eval so that I can support sequence values
                        # although it's slow.
                        value[i].append(eval('dataobj.'+vname))
                except AttributeError, SyntaxError:
                    value[i].append('')

            if value[i][0] != '':
                try:
                    temp_vals = []
                    for item in value[i]:
                        # items on the list of values for this line
                        # can be either literals or lists
                        if type(item) == ListType:
                            # take each element of the list and tack it
                            # onto the printing list
                            for element in item:
                                temp_vals.append(element)
                        else:
                            temp_vals.append(item)
                    # self.body[i] is the current output line with % tokens
                    # temp_vals contains the values to be inserted into them.
                    s = s + (self.body[i] % tuple(temp_vals)) + '\n'
                except TypeError:
                    print 'Error on this line. The data value(s) could not be formatted as numbers.'
                    print 'Check that you are not placing a string value into a number field.'
            else:
                s = s + self.body[i] + '\n'
        return s

    def writefile(self, file, dataobj):
        """takes either a pathname or an open file object and a data
        object.
        Instantiates the template with values from the data object
        sending output to the open file.
        """
        if type(file) == StringType:
            fileobj = open(file,'w')
        elif type(file) == FileType:
            fileobj = file
        else:
            raise TypeError, '1st argument must be a pathname or an open file object.'
        fileobj.write(self._format(dataobj))
        if type(file) == StringType: fileobj.close()

###################################################################
# isReportTemplate #
###################################################################

def isReportTemplate(obj):
    """Return 1 if obj is an instance of class ReportTemplate.
    """
    if type(obj) == InstanceType and \
       string.find(`obj.__class__` , ' ReportTemplate ') > -1:
        return 1
    else:
        return 0

###################################################################
# ColumnReportTemplate #
###################################################################

class ColumnReportTemplate:
    """This class allows one to specify column oriented output formats.

    The first argument to the constructor is a format string containing
    the header text and a line of field specifier tokens. A line
    containing nothing but dashes, underbars, spaces or tabs is detected
    as the separator between these two sections. For example, a format
    string might look like this:

    '''Page &P Date: &M/D/Y Time: &h:m:s
    Time Event Factor A2 Factor B2
    -------- ------------------- ----------- -------------
    ###.#### <<<<<<<<<<<<<<<<<<< ##.###**** ##.######****'''

    The last line will be treated as the format for output data contained
    in a four-sequence. This line would (for example) be translated to
    '%8.4f %-19s %10.3e %13.6e' for value substitution.
    In the header text portion one may use special variable tokens
    indicating that runtime values should be substituted into the
    header block. These tokens start with a & character and are
    immediately followed by either a P or a time/date format string.
    In the above example the header contains references to page number,
    current date in month/day/year order, and the current time.
    Today it produced 'Page 2 Date: 10/04/96 Time: 15:13:28'
    See doc string for now() function for further details.

    An optional second argument is an output file handle to send
    written output to (default is stdout). Keyword arguments may be
    used to tailor the instance. At this time the 'page_length'
    parameter is the only useful one.

    Instances of this class are then used to print out any number of
    records with the write method. The write method argument must be a
    sequence of elements matching the number and data type implied by
    the field specification tokens.

    At the end of a page, a formfeed is output as well as new copy
    of the header text.
    """
    page_length = 50
    lineno = 1
    pageno = 1
    first_write = 1

def __init__(self, format = '', output = sys.stdout, **kw):
# print("Original format = ", format)
self.output = output

self.header_separator = re.compile('\n[-_\s\t]+\n')
self.header_token = re.compile('&([^ \n\t]+)')

for item, value in kw.items():
setattr(self, item, value)

try: # use try block in case there is NOT a header at all
result = self.header_separator.search(format).start() # NEW separation of header and body from format

# print("result = ", result)
HeaderLine = self.header_separator.search(format).group() # get the header lines that were matched

if result > -1: # separate the header text from the format

# print("split = ", self.header_separator.split(format) )
HeaderPieces = self.header_separator.split(format)
# print("HeaderPiece[0] = ", HeaderPieces[0])
# print("HeaderPiece[1] = ", HeaderPieces[1])

self.header = HeaderPieces[0] + HeaderLine # header text PLUS the matched HeaderLine
self.body = _make_printf(HeaderPieces[1]) # convert the format chars to printf expressions

except :
self.header = '' # fail block of TRY - no headings found - set to blank
self.body = _make_printf(format) # need to process the format

# print("header = ", self.header)
# print("body = ", self.body)

self.header = self.prep_header(self.header) # parse the special chars (&Page &M/D/Y &h:m:s) in header
self.header_len = len(string.split(self.header,'\n'))
self.max_body_len = self.page_length - self.header_len

def prep_header(self, header):
"""Substitute the header tokens with a named string printf
token. """
start = 0
new_header = ''
self.header_values = {}

# print("original header = %s") % (header)
HeaderPieces = self.header_token.split(header) # split up the header w/ the regular expression

HeadCount = 0

for CurrentHeadPiece in HeaderPieces :

if HeadCount % 2 == 1: # matching tokens to the pattern will be in the ODD indexes of Heads[]
# print("Heads %s = %s") % (HeadCount,CurrentHeadPiece)
new_header = new_header + '%(' + CurrentHeadPiece +')s'
self.header_values[CurrentHeadPiece] = 1
else:
new_header = new_header + CurrentHeadPiece

HeadCount = HeadCount + 1

# print("new header = %s") % (new_header)

return new_header

def write(self, seq):
"""Write the given sequence as a record in field format.
Length of sequence must match the number and data type
of the field tokens.
"""
seq = tuple(seq)

if self.lineno > self.max_body_len or self.first_write:
self.new_page()
self.first_write = 0

self.output.write( self.body % seq + '\n' )
self.lineno = self.lineno + 1

def new_page(self):
"""Issue formfeed, substitute current values for header
variables, then print header text.
"""
for key in self.header_values.keys():
if key == 'P':
self.header_values[key] = self.pageno
else:
self.header_values[key] = now(key)

header = self.header % self.header_values
self.output.write('\f'+ header +'\n')
self.lineno = 1
self.pageno = self.pageno + 1

def isColumnReportTemplate(obj):
"""Return 1 if obj is an instance of class ColumnReportTemplate.
"""
if type(obj) == InstanceType and \
string.find(`obj.__class__` , ' ColumnReportTemplate ') > -1:
return 1
else:
return 0

###################################################################
# now - return date and/or time value #
###################################################################

def now(code='M/D/Y'):
"""Function returning a formatted string representing the current
date and/or time. Input arg is a string using code letters to
represent date/time components.

Code Letter Expands to
D Day of month
M Month (two digit)
Y Year (two digit)
h hour (two digit 24-hour clock)
m minutes
s seconds

Other characters such as '/' ':' '_' '-' and ' ' are carried through
as is and can be used as separators.
"""
import time
T = {}
T['year'], T['month'], T['dom'], T['hour'], T['min'], T['sec'], \
T['dow'], T['day'], T['dst'] = time.localtime(time.time())
T['yr'] = repr(T['year'])[-2:]
formatstring = ''

tokens = {'D':'%(dom)02d', 'M':'%(month)02d', 'Y':'%(yr)02s',
'h':'%(hour)02d', 'm':'%(min)02d', 's':'%(sec)02d',
'/':'/', ':':':', '-':'-', ' ':' ' , '_':'_', ';':';',
'^':'^'}

for char in code:
formatstring = formatstring + tokens[char]

return formatstring % T

###################################################################
# test_Rt - Test Report Template #
###################################################################

def test_RT():

template_string = """
--------------------------------------------------
Date <<<<<<<<<<<<<<<<<<<<<<<<<<< Time >>>>>>> @@ date, time

Input File : <<<<<<<<<<<<<<<<<<<<< @@ file[0]
Output File : <<<<<<<<<<<<<<<<<<<<< @@ file[1]
Corr. Coeff : ##.########**** StdDev : ##.### @@ coeff, deviation
Fraction Breakdown : ###.# %% Run :\# ### @@ brkdwn, runno
Passed In Value : ### @@ invalue
--------------------------------------------------
"""
class Data:

def __init__(self, InValue):
# self.date = "September 12, 1998"
self.date = now()
# self.time = "18:22:00"
self.time = now('h:m:s') #datetime.time()

self.file = ['TX2667-AE0.dat', 'TX2667-DL0.dat']
self.coeff = -3.4655102872e-05
self.deviation = 0.4018
self.runno = 56 + InValue
self.brkdwn = 43.11
self.invalue = InValue

Report = ReportTemplate(template_string)

for i in range(2):
D = Data(i)
print Report(D)

###################################################################
# test_Rt_file - Test Report Template from file #
###################################################################

def test_RT_file():

template_string ='ReportFormat1.txt' # filename of report format

class Data:

def __init__(self, InValue):
self.date = now()
self.time = now('h:m:s') #datetime.time()
self.file = ['TX2667-AE0.dat', 'TX2667-DL0.dat']
self.coeff = -3.4655102872e-05
self.deviation = 0.4018
self.runno = 56 + InValue
self.brkdwn = 43.11
self.invalue = InValue

Report = ReportTemplate(template_string)

for i in range(2):
D = Data(i)
print Report(D)

###################################################################
# test_CRT - Test Column Report Template #
###################################################################

def test_CRT():

print
print
print "test_CRT()"
print

format='''
Page &P Date: &M/D/Y Time: &h:m:s
Test Column Report 1

Time Event Factor A2 Factor B2
-------- ------------------- ----------- -------------
####.### <<<<<<<<<<<<<<<<<<< ##.###**** ##.######****'''

data = [12.225, 'Aftershock', 0.5419, 144.8]
report = ColumnReportTemplate( format, page_length=15 )

for i in range(0,200,10):
if i > 0 :
data = [data[0]+i, data[1], data[2]/i*10., data[3]*i/20.]
report.write( data )

###################################################################
# test_CRT2 - Test Column Report Template #
###################################################################

def test_CRT2():

print
print
print "test_CRT2()"
print

format='''
Page &P Date: &M/D/Y Time: &h:m:s
Test Column Report 2

I ID City Factor A2 Factor B2
--- ------ ------------------- ----------- -------------
data = [0, 5, 'Mt. View', 541, 144.2]
report = ColumnReportTemplate( format, page_length=15 )

for i in range(0,201,10):
data = [i, data[1]+i, data[2], data[3] + (i*10), data[4] + (i * 20)]
report.write( data )

###################################################################
# test_CRT3 - Test Column Report Template - no header chars #
###################################################################

def test_CRT3():

print
print
print "test_CRT3()"
print

format='''
Test Column Report 3

I ID City Factor A2 Factor B2
--- ------ ------------------- ----------- -------------
#--- ------ ------------------- ----------- -------------

data = [0, 5, 'Santa Cruz', 541, 144.2]
report = ColumnReportTemplate( format, page_length=15 )

for i in range(0,201,10):
data = [i, data[1]+i, data[2], data[3] + (i*10), data[4] + (i * 20)]
report.write( data )

###################################################################
# test_CRT4 - Test Column Report Template - no header at all #
###################################################################

def test_CRT4():

print
print
print "test_CRT4()"
print

format='''>>> #### <<<<<<<<<<<<<<<<<<< #####.## #####.##'''

data = [0, 5, 'Santa Cruz', 541, 144.2]
report = ColumnReportTemplate( format, page_length=50 )

for i in range(0,201,10):
data = [i, data[1]+i, data[2], data[3] + (i*10), data[4] + (i * 20)]
report.write( data )

###################################################################
############# M A I N ###########################
###################################################################

def Main():

print "\n\nTesting this module.\n\n"

TheHeading = '''
simple heading \#
r-just int fixed point sci-notation left-just string right-just string
##### #.### #.###**** <<<<< >>>>>'''

print
print " Make printf Test : "
print _make_printf(TheHeading)
print
print

test_RT()
test_CRT()

print
test_RT_file()
print

test_CRT2()
test_CRT3()
test_CRT4()

print
print "Current Date & time = ", now('M-D-Y h:m:s')

if __name__ == "__main__":
Main()

Jim Segrave · Jul 1, 2006

Hi All!

Thanks for your suggestions and comments! I was able to use some of
your code and suggestions and have come up with this new version of
Report.py.

Here's the updated code :

exponentPattern = re.compile('\(^\|[^\\#]\)|#+\.#+\*\*\*\*')
floatPattern = re.compile('\(^\|[^\\#]\)|#+\.#+')
integerPattern = re.compile("\(^\|[^\\#]\)|\##+")
leftJustifiedStringPattern = re.compile('\(^\|[^\\<]\)|\<<+')
rightJustifiedStringPattern = re.compile('\(^\|[^\\>]\)|\>>+')

Some comments and suggestions

If you want to include backslashes in a string, either
use raw strings, or use double backslashes (raw strings are much
easier to read). Otherwise, you have an accident waiting to happen -
'\(' _does_ make a two character string as you desired, but '\v' makes
a one character string, which is unlikely to be what you wanted.
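
To make the escaping point concrete, here is a minimal sketch. It is written
for a current Python 3 interpreter, which is an assumption (the thread itself
uses Python 2), and the variable names are only for illustration.

```python
# '\v' is a recognised escape (vertical tab), so the backslash disappears.
plain = '\v'
# A raw string keeps the backslash, giving the two characters \ and v.
raw = r'\v'
# Doubling the backslash is the other safe spelling; it equals the raw form.
doubled = '\\v'

print(len(plain), len(raw), raw == doubled)
```

The same holds for regex fragments: r'\(' and '\\(' are the same two-character
string, but the raw form is much easier to read.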

That's a stylistic point. But more serious is the leading part of all
your regexes - '\(^\|[^\\#]\)|'. I'm not sure what you're trying to
accomplish - presumably to skip over escaped format characters, but
that's not what they do.

\( says to look for an open parenthesis (you've escaped it, so it's
not a grouping character). ^ says look at the start of the line. Since
the start of the line can never come right after an open parenthesis,
this entire term, up to the alternate expression (after the non-escaped
pipe symbol), never matches anything. If you want to ignore escaped
formatting characters before a format, then you should use a negative
lookbehind assertion (see the library reference, 4.2.1, Regular
Expression Syntax):

'(?<!\\)'

This says that the match to the format can't start
immediately after a backslash. You need to make a final
pass over your format to remove the extra backslashes, otherwise they
will appear in the output. You do make such a pass, but you replace the
backslashes with spaces, which may not be the right thing to do - how
could you output a line like '################' as part of a template?
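
As a rough sketch of that idea (not the code from Report.py), the following
combines the lookbehind with a final pass that drops the escaping backslash
instead of turning it into a space. The names integer_field and
convert_integers are made up for this example, and it assumes Python 3.

```python
import re

# Hypothetical pattern: an integer field is two or more '#' characters
# that do not start immediately after a backslash.
integer_field = re.compile(r'(?<!\\)##+')

def convert_integers(template):
    # Replace every unescaped integer field with a %Nd token of the same width.
    converted = integer_field.sub(lambda m: '%%%dd' % len(m.group(0)), template)
    # Final pass: drop each escaping backslash but keep the character after it.
    return re.sub(r'\\(.)', r'\1', converted)

print(convert_integers(r'Run \# ###'))
```

With this approach a literal run of '#' characters can be produced by escaping
each one: r'\#\#\#\#' comes out as '####'. Note that the lookbehind only
protects the character right after the backslash, so r'\###' would still have
its last two '#' converted.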

Other odds and ends comments interleaved here

###################################################################
# _make_printf #
###################################################################

def _make_printf(s):
"""Convert perl style format symbology to printf tokens.

Take a string and substitute computed printf tokens for perl style
format symbology.

For example:

###.## yields %6.2f
######## yields %8d
<<<<< yields %-5s
"""
# print("Original String = %s\n\n") % (s)

while 1: # process all sci notation fields
if exponentPattern.search(s) < 0: break
i1 , i2 = exponentPattern.search(s).span()
width_total = i2 - i1
field = s[i1:i2-4]
width_mantissa = len( field[string.find(field,'.')+1:] )
f = '%'+`width_total`+'.'+`width_mantissa`+'e'
s = exponentPattern.sub(f, s, 1)

There are better ways to examine a match than with span(). Consider
using grouping in your regex to get both the mantissa width and the
total width.

If:

exponentPattern = re.compile(r'(?<!\\)(#+\.(#+)\*\*\*\*)')

then you could do this:

m = exponentPattern.search(s)
if m:
    s = exponentPattern.sub("%%%d.%de" % (len(m.groups()[0]),
                                          len(m.groups()[1])), s, 1)

m.groups()[0] will be the entire '#+\.#+\*\*\*\*' match, in other
words the field width to be printed.

m.groups()[1] will be the string after the decimal point, not
including the '*'s.

In my opinion, building the string with the sprintf-like '%'
formatting operator, rather than adding together a bunch of substrings,
is easier to read and maintain.

Similar use of grouping can be done for the other format string types.
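
Here is one way the grouping idea might be extended to all five field types.
This is a sketch, not the Report.py implementation: it swaps the while/search
loop for re.sub with a replacement function, the pattern and function names
are invented for the example, and it assumes Python 3. Exponent fields are
handled first, then floats, then integers, as noted earlier in the thread.

```python
import re

# Hypothetical patterns; the lookbehind skips fields escaped with a backslash.
# Group 1 is the whole field, group 2 (where present) the digits after the '.'.
EXPONENT = re.compile(r'(?<!\\)(#+\.(#+)\*\*\*\*)')
FLOAT = re.compile(r'(?<!\\)(#+\.(#+))')
INTEGER = re.compile(r'(?<!\\)(##+)')
LEFT = re.compile(r'(?<!\\)(<<+)')
RIGHT = re.compile(r'(?<!\\)(>>+)')

def make_printf_sketch(s):
    """Convert perl style field symbols in s to printf tokens."""
    # Order matters: exponent before float before integer.
    s = EXPONENT.sub(lambda m: '%%%d.%de' % (len(m.group(1)), len(m.group(2))), s)
    s = FLOAT.sub(lambda m: '%%%d.%df' % (len(m.group(1)), len(m.group(2))), s)
    s = INTEGER.sub(lambda m: '%%%dd' % len(m.group(1)), s)
    s = LEFT.sub(lambda m: '%%-%ds' % len(m.group(1)), s)
    s = RIGHT.sub(lambda m: '%%%ds' % len(m.group(1)), s)
    return s

column_line = '###.#### ' + '<' * 19 + ' ##.###**** ##.######****'
print(make_printf_sketch(column_line))
```

Because sub() calls the function once per match, each field gets its own
width without an explicit loop.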

s = re.sub('\\\\', ' ', s)
return s

As I noted, should backslashes be converted to spaces? And again, it's
easier to type and read if it uses raw strings:

s = re.sub(r'\\', ' ', s)

###################################################################
# ReportTemplate #
###################################################################

class ReportTemplate:
"""Provide a print formatting object. [Snip]
The total width of the symbol and its decimal point position is
used to compute the appropriate printf token; see 'make_printf'
method. The symbol must have at least two adjacent characters for

A minor nit - make_printf is better described as a function, not a
method (it's not part of the ReportTemplate class or any other class).

[SNIP]

def __init__( self, template = ''):
self.body = []
self.vars = []

#read in and parse a format template
try:
tpl = open(template, 'r')
lines = string.split(tpl.read(), '\n')[:-1]

You'd be better off to use something like:

lines = []
for l in open(template, 'r').readlines():
    lines.append(l.rstrip())

The use of rstrip() discards any trailing whitespace and deals with
reading Windows-generated files on a Unix box, where lines will end in
CRLF and you'd otherwise strip only the LF.
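
A small self-contained sketch of the CRLF case follows. It assumes Python 3,
where newline='' is needed to see the raw CRLF endings at all, and the helper
name read_template_lines and the sample file contents are invented for the
example.

```python
import os
import tempfile

def read_template_lines(path):
    # newline='' leaves line endings untranslated, so CRLF really reaches us;
    # rstrip() then removes the CR, the LF and any other trailing whitespace.
    with open(path, 'r', newline='') as f:
        return [line.rstrip() for line in f]

# Write a small Windows-style file to demonstrate with.
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as f:
    f.write(b'Date <<<<<  \r\nValue ###.##\r\n')

try:
    lines = read_template_lines(path)
finally:
    os.remove(path)

print(lines)
```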

except IOError:
lines = string.split(template, '\n')

I have my doubts about the advisability of assuming that you are
either passed the name of a file containing a template or a
template itself. A misspelled file name won't raise
an error; it will simply be processed as a fixed output. I would have
passed a flag to say whether the template argument was a file name or a
template, and terminated with an error if it was a nonexistent file.
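
One possible shape for that flag, as a sketch only: the function name
load_template_lines and the is_file parameter are hypothetical and not part
of Report.py.

```python
def load_template_lines(template, is_file=False):
    """Hypothetical helper: return the template as a list of lines."""
    if is_file:
        # A missing or misspelled file name raises IOError/OSError here
        # instead of being silently treated as template text.
        with open(template, 'r') as f:
            return [line.rstrip() for line in f]
    # Otherwise the argument is the template text itself.
    return [line.rstrip() for line in template.split('\n')]

print(load_template_lines('Date <<<<<\nValue ###.##'))
```

The caller now has to say which kind of argument is being passed, so a typo
in a file name fails loudly rather than producing a one-line report.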

[SNIP]

def _format( self, dataobj ):
"""Return the values of the given data object substituted into
the template format stored in this object.
"""
# value[] is a list of lists of values from the dataobj
# body[] is the list of strings with %tokens to print
# if value == None just print the string without the % argument
s = ''
value = []

for i in range(self.nrows):
value.append([])

for i in range(self.nrows):
for vname in self.vars:
try:
if string.find(vname, '[') < 0:
# this is the nominal case and a simple get will be faster
value.append(getattr(dataobj, vname))
else:
# I use eval so that I can support sequence values
# although it's slow.
value.append(eval('dataobj.'+vname))
except (AttributeError, SyntaxError):
value.append('')

There's another way to do this - use getattr to retrieve the sequence,
then use __getitem__ to index it

Something like this would work, again using a regex to separate out
the index (you might want the regex compiled once at __init__
time). The regex looks for a run of characters valid in a python
variable name followed by an optional integer (with or without sign)
index in square brackets. m.groups()[0] is the variable name portion,
m.groups()[1] is the index if there's a subscripting term.

try:
    m = re.match(r'([a-zA-Z_][a-zA-Z0-9._]*)\s*(?:\[\s*([+-]?\d+)\s*\])?\s*',
                 vname)
    if not m: raise SyntaxError
    if m.groups()[1] is None:
        value.append(getattr(dataobj, vname))
    else:
        value.append(getattr(dataobj, m.groups()[0]).\
            __getitem__(int(m.groups()[1])))
except (AttributeError, SyntaxError, IndexError):
    value.append('')

This is a bit ugly, but avoids eval with all the dangers it carries -
a deliberately hacked template file can be used to do a lot of damage,
and a badly written one could be hard to debug.
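
For comparison, here is a standalone sketch of the same lookup written as a
helper function. The names VAR_RE and lookup are invented, the regex is a
slightly stricter variant (no dots in the name, anchored at the end), and the
Data class is a stand-in for the objects passed to the report; it assumes
Python 3.

```python
import re

# A variable name, optionally followed by an integer index in square brackets.
VAR_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)\s*(?:\[\s*([+-]?\d+)\s*\])?\s*$')

def lookup(dataobj, vname):
    """Return dataobj.<name> or dataobj.<name>[index], or '' on any failure."""
    m = VAR_RE.match(vname)
    if not m:
        return ''
    name, index = m.groups()
    try:
        value = getattr(dataobj, name)
        if index is not None:
            value = value[int(index)]
    except (AttributeError, IndexError, TypeError):
        return ''
    return value

class Data(object):
    """Stand-in data object for the example."""
    file = ['TX2667-AE0.dat', 'TX2667-DL0.dat']
    coeff = 0.4018

print(lookup(Data(), 'file[1]'), lookup(Data(), 'coeff'))
```

Anything that is not a plain name or name[index], such as a function call
smuggled into the template, simply yields an empty string instead of being
evaluated.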

if value[0] != '':
try:
temp_vals = []
for item in value:
# items on the list of values for this line
# can be either literals or lists
if type(item) == ListType:

Might you be better off asking if item has a __getitem__? It would
then work with tuples, and extending to dictionaries would be easier.
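
A sketch of such a check is below. One caveat worth stating: strings also
have __getitem__, so they are excluded explicitly here. The function name
is_indexable is made up, and the str test assumes Python 3.

```python
def is_indexable(item):
    # Strings also define __getitem__, so exclude them explicitly; otherwise
    # a literal string value would be treated as a sequence of characters.
    return hasattr(item, '__getitem__') and not isinstance(item, str)

print(is_indexable([1, 2]), is_indexable((1, 2)), is_indexable('abc'))
```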

def isReportTemplate(obj):
"""Return 1 if obj is an instance of class ReportTemplate.
"""
if type(obj) == InstanceType and \
string.find(`obj.__class__` , ' ReportTemplate ') > -1:
return 1
else:
return 0

Why not just use isinstance(obj, classname)?
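
A minimal sketch of that replacement follows. The ReportTemplate class here
is only an empty stand-in so the example runs on its own, and SubTemplate is
invented to show that subclasses are recognised too.

```python
class ReportTemplate(object):
    """Stand-in for the real class, only so the sketch runs on its own."""

class SubTemplate(ReportTemplate):
    """Hypothetical subclass, to show that isinstance() accepts it as well."""

def isReportTemplate(obj):
    # isinstance() checks the class hierarchy directly; no repr() parsing needed.
    return isinstance(obj, ReportTemplate)

print(isReportTemplate(ReportTemplate()), isReportTemplate('text'))
```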
