regular expression

aaron · Mar 26, 2005

dear readers,

given a string, suppose i wanted to do the following:
- replace all periods with colons, except for periods with a digit to
the right and left of it.

for example, given:
'375 mi. south of U.C.B. is 3.4 degrees warmer'

would be changed to:
'375 mi: south of U:C:B: is 3.4 degrees warmer'

i was thinking that a regular expression might do the trick. here's what
i tried:
!----------------------------------------------------------------------!
Python 2.4.1c1
[GCC 3.2.3 20030502 (Red Hat Linux 3.2.3-49)] on linux2

>>> import re
>>> pattern = re.compile(r'(?!\d)[.](?!\d)')
>>> pattern.sub(':', '375 mi. south of U.C.B is 3.4 degrees warmer.')
'375 mi: south of U:C:B is 3.4 degrees warmer:'
!----------------------------------------------------------------------!

so this works, but not in the following case:
!----------------------------------------------------------------------!
>>> pattern.sub(':', '.3')
'.3'
!----------------------------------------------------------------------!

but going the other direction works:
!----------------------------------------------------------------------!
>>> pattern.sub(':', '3.')
'3:'
!----------------------------------------------------------------------!

any thoughts?

thanks,
aaron

Bengt Richter · Mar 26, 2005


Brute force the exceptional case that happens at the start of the line?

>>> import re
>>> pattern = re.compile(r'^[.]|(?!\d)[.](?!\d)')
>>> pattern.sub(':', '375 mi. south of U.C.B is 3.4 degrees warmer.')
'375 mi: south of U:C:B is 3.4 degrees warmer:'
>>> pattern.sub(':', '.3')
':3'
>>> pattern.sub(':', '3.')
'3:'

Seems like an asymmetry in re's handling of (?!\d) after the last char vs before first though.

Regards,
Bengt Richter

Peter Hansen · Mar 26, 2005


Be careful... the OP has assumed something that isn't true,
and Bengt's fix isn't sufficient:

>>> import re
>>> s = 'x.3'
>>> pattern = re.compile(r'^[.]|(?!\d)[.](?!\d)')
>>> pattern.sub(':', '.3')
':3'
>>> pattern.sub(':', s)
'x.3'

So the OP's "this works" comment was wrong.
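A possible explanation, offered as a sketch rather than a definitive diagnosis: `(?!\d)` is a negative lookahead wherever it sits in the pattern, so the copy placed before `[.]` only asserts that the next character (the period itself) is not a digit, which is always true. The pattern would then behave like `[.](?!\d)` and never look at the character to the left. The names `original`, `right_only` and `samples` are illustrative, and the code is written for Python 3 rather than the 2.4 used in this thread.

```python
import re

# The OP's pattern, and the same pattern with the leading lookahead dropped.
original = re.compile(r'(?!\d)[.](?!\d)')
right_only = re.compile(r'[.](?!\d)')

samples = ['.3', '3.', 'x.3', '3.x', '3.3', 'U.C.B.']

for s in samples:
    # Both patterns agree on every sample, which suggests the leading
    # (?!\d) contributes nothing to the match.
    assert original.sub(':', s) == right_only.sub(':', s)
    print('%-8r => %r' % (s, original.sub(':', s)))
```

Looking at the character to the left of the period needs a lookbehind assertion such as `(?<!\d)` instead.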

Suggestion: whip up a variety of automated test cases and
make sure you run them all whenever you make changes to
this code...

(No, I don't have a solution to the continuing problem,
other than to wonder whether the input data really requires
all these edge cases to be handled properly.)
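As a rough illustration of that suggestion, here is a minimal table-driven check, assuming the expected outputs follow the OP's rule (a period stays only when it has a digit on both sides). `CASES` and `failing_cases` are hypothetical names, and the code targets Python 3.

```python
import re

# (input, expected) pairs taken from the cases discussed in this thread.
CASES = [
    ('375 mi. south of U.C.B. is 3.4 degrees warmer',
     '375 mi: south of U:C:B: is 3.4 degrees warmer'),
    ('.3', ':3'),
    ('3.', '3:'),
    ('x.3', 'x:3'),
    ('3.x', '3:x'),
    ('3.3', '3.3'),
]

def failing_cases(pattern, cases=CASES):
    """Return (input, expected, actual) for every case the pattern gets wrong."""
    failures = []
    for text, expected in cases:
        actual = pattern.sub(':', text)
        if actual != expected:
            failures.append((text, expected, actual))
    return failures

# The pattern with the start-of-line special case, as posted above.
candidate = re.compile(r'^[.]|(?!\d)[.](?!\d)')
for text, expected, actual in failing_cases(candidate):
    print('FAIL %r: expected %r, got %r' % (text, expected, actual))
```

Rerunning `failing_cases` after every change to the pattern reports regressions instead of relying on someone reading the output by eye.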

-Peter

Bengt Richter · Mar 26, 2005


Goes to show you ;-/ Do we need more tests than these?

>>> import re
>>> pattern = re.compile(r'[.](?!\d)|(?<!\d)[.]')
>>> print pattern.sub(':', '375 mi. south of U.C.B is 3.4 degrees warmer.')
375 mi: south of U:C:B is 3.4 degrees warmer:
>>> for s,ss in ((s,pattern.sub(':', s)) for s in ('%s%s.%s%s'%(sp1,c1,c2,sp2)
...                 for sp1 in ('', ' ')
...                 for c1 in ('', 'x', '3')
...                 for c2 in ('', 'x', '3')
...                 for sp2 in ('', ' '))):
...     print '%10r => %r' %(s,ss)
...
'.' => ':'
'. ' => ': '
'.x' => ':x'
'.x ' => ':x '
'.3' => ':3'
'.3 ' => ':3 '
'x.' => 'x:'
'x. ' => 'x: '
'x.x' => 'x:x'
'x.x ' => 'x:x '
'x.3' => 'x:3'
'x.3 ' => 'x:3 '
'3.' => '3:'
'3. ' => '3: '
'3.x' => '3:x'
'3.x ' => '3:x '
'3.3' => '3.3'
'3.3 ' => '3.3 '
' .' => ' :'
' . ' => ' : '
' .x' => ' :x'
' .x ' => ' :x '
' .3' => ' :3'
' .3 ' => ' :3 '
' x.' => ' x:'
' x. ' => ' x: '
' x.x' => ' x:x'
' x.x ' => ' x:x '
' x.3' => ' x:3'
' x.3 ' => ' x:3 '
' 3.' => ' 3:'
' 3. ' => ' 3: '
' 3.x' => ' 3:x'
' 3.x ' => ' 3:x '
' 3.3' => ' 3.3'
' 3.3 ' => ' 3.3 '

Regards,
Bengt Richter

Peter Hansen · Mar 26, 2005

Bengt said:
Goes to show you ;-/ Do we need more tests than these?

[snip loads of tests]

Hmm... if I were doing this for real, not only would the
tests actually *tell* me when there was a failure, but
I would also throw in a few more cases involving larger
strings that more closely represent the expected real
inputs (i.e. using some numbers like 3.1415 and using
some strings that have the periods as punctuation such
as in the OP's original first "test case"). That way
if, during maintenance, somebody changes the algorithm
significantly, I'll be confident that it still covers
the broader set of cases, as well as the (exhaustive?)
set you've defined, which appear at first glance to
cover all the possible combinations of x and 3 and .
that might happen...

I'd also probably be generating most of the existing
test cases automatically, just to be sure I've got
100% coverage. Are you sure you didn't leave out one?
And what about, say, ".." or ".x3."?
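One way to generate the cases automatically, sketched under the assumption that a character-by-character function can serve as the reference for the OP's rule. `reference` and `mismatches` are hypothetical helpers, the alphabet and maximum length are arbitrary choices, and the code is Python 3.

```python
import re
from itertools import product

# The lookahead/lookbehind pattern from the previous post.
pattern = re.compile(r'[.](?!\d)|(?<!\d)[.]')
DIGITS = '0123456789'

def reference(text):
    """Replace each period with a colon unless both neighbours are digits."""
    out = []
    for i, ch in enumerate(text):
        if ch == '.':
            left = i > 0 and text[i - 1] in DIGITS
            right = i + 1 < len(text) and text[i + 1] in DIGITS
            if not (left and right):
                ch = ':'
        out.append(ch)
    return ''.join(out)

def mismatches(alphabet='.x3 ', max_len=4):
    """Compare the regex with the reference on every string up to max_len."""
    bad = []
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            text = ''.join(chars)
            if pattern.sub(':', text) != reference(text):
                bad.append(text)
    return bad

# Fails loudly if the regex and the reference ever disagree.
assert mismatches() == []
```

With the default alphabet this covers every string of up to four characters, including '..' and '.x3.', so nothing depends on a hand-written list being complete.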

-Peter

Paul McGuire · Mar 27, 2005

Aaron -

Here's a pyparsing approach (requires latest 1.3 pyparsing version).
It may not be as terse or fast as your regexp, but it may be easier to
maintain.

By defining floatNum ahead of DOT in the scanner definition, you
specify the dot-containing expressions that you do *not* want to have
dots converted to colons.

-- Paul

===================
from pyparsing import Word,Literal,replaceWith, Combine, nums

DOT = Literal(".").setParseAction( replaceWith(":") )
floatNum = Combine( Word(nums) + "." + Word(nums) )

scanner = floatNum | DOT

testdata = "'375 mi. south of U.C.B is 3.4 degrees warmer."

print scanner.transformString( testdata )
===================
prints out:
'375 mi: south of U:C:B is 3.4 degrees warmer:
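The same "match the float first, then any leftover dot" idea can also be sketched with the standard `re` module and a replacement function. This is only an approximation of the approach above, not a port of the pyparsing code; `scanner` and `dots_to_colons` are illustrative names, and the code is Python 3.

```python
import re

# A float (digits, period, digits) is tried first; a lone period is the fallback.
scanner = re.compile(r'\d+\.\d+|\.')

def dots_to_colons(text):
    """Keep float-like matches unchanged and turn any other period into a colon."""
    def replace(match):
        token = match.group()
        return ':' if token == '.' else token
    return scanner.sub(replace, text)

print(dots_to_colons('375 mi. south of U.C.B is 3.4 degrees warmer.'))
```

One difference from the lookaround pattern is worth knowing: on an input like '1.2.3' this version consumes '1.2' as a float and then converts the second period, giving '1.2:3', whereas the lookaround pattern leaves both periods alone.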
