regular expression

aaron

dear readers,

given a string, suppose i wanted to do the following:
- replace all periods with colons, except for periods that have a digit
on both the left and the right.

for example, given:
'375 mi. south of U.C.B. is 3.4 degrees warmer'

would be changed to:
'375 mi: south of U:C:B: is 3.4 degrees warmer'

i was thinking that a regular expression might do the trick. here's what
i tried:
!----------------------------------------------------------------------!
Python 2.4.1c1
[GCC 3.2.3 20030502 (Red Hat Linux 3.2.3-49)] on linux2
>>> import re
>>> pattern = re.compile(r'(?!\d)[.](?!\d)')
>>> pattern.sub(':', '375 mi. south of U.C.B is 3.4 degrees warmer.')
'375 mi: south of U:C:B is 3.4 degrees warmer:'
!----------------------------------------------------------------------!

so this works, but not in the following case:
!----------------------------------------------------------------------!
>>> pattern.sub(':', '.3')
'.3'
!----------------------------------------------------------------------!

but going the other direction works:
!----------------------------------------------------------------------!
>>> pattern.sub(':', '3.')
'3:'
!----------------------------------------------------------------------!

any thoughts?

thanks,
aaron
 
Bengt Richter

aaron said:
any thoughts?

Brute force the exceptional case that happens at the start of the line?
>>> import re
>>> pattern = re.compile(r'^[.]|(?!\d)[.](?!\d)')
>>> pattern.sub(':', '375 mi. south of U.C.B is 3.4 degrees warmer.')
'375 mi: south of U:C:B is 3.4 degrees warmer:'
>>> pattern.sub(':', '.3')
':3'
>>> pattern.sub(':', '3.')
'3:'

Seems like an asymmetry in re's handling of (?!\d) after the last char vs before first though.

Regards,
Bengt Richter
 
Peter Hansen

Bengt said:
Brute force the exceptional case that happens at the start of the line?

Be careful... the OP has assumed something that isn't true,
and Bengt's fix isn't sufficient:
>>> import re
>>> s = 'x.3'
>>> pattern = re.compile(r'^[.]|(?!\d)[.](?!\d)')
>>> pattern.sub(':', '.3')
':3'
>>> pattern.sub(':', s)
'x.3'

So the OP's "this works" comment was wrong.

Suggestion: whip up a variety of automated test cases and
make sure you run them all whenever you make changes to
this code...

(No, I don't have a solution to the continuing problem,
other than to wonder whether the input data really requires
all these edge cases to be handled properly.)

-Peter
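
The 'x.3' failure above can be traced to the first `(?!\d)` in the pattern. A lookahead placed in front of `[.]` inspects the character at the current position, which is the period itself, so it always succeeds; only a lookbehind, `(?<!\d)`, looks to the left. The sketch below is written for Python 3 (an assumption, since the thread uses Python 2.4), and the names `right_only` and `results` are illustrative. It checks that the original pattern behaves exactly like the same pattern with the leading lookahead removed.

```python
import re

op_pattern = re.compile(r'(?!\d)[.](?!\d)')  # pattern from the original post
right_only = re.compile(r'[.](?!\d)')        # same pattern, leading lookahead dropped

samples = ['.3', '3.', 'x.3', '3.x', '3.3', 'U.C.B.']

results = {}
for s in samples:
    # Both patterns give identical output: the leading (?!\d) tests the
    # period itself, never the character to its left.
    assert op_pattern.sub(':', s) == right_only.sub(':', s)
    results[s] = op_pattern.sub(':', s)
    print('%r => %r' % (s, results[s]))
```

Under this reading there is no asymmetry in `re`: '.3' and 'x.3' are both left alone because the period is followed by a digit, regardless of what comes before it.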
 
Bengt Richter

Peter said:
Suggestion: whip up a variety of automated test cases and
make sure you run them all whenever you make changes to
this code...

(No, I don't have a solution to the continuing problem,
other than to wonder whether the input data really requires
all these edge cases to be handled properly.)

Goes to show you ;-/ Do we need more tests than these?
>>> import re
>>> pattern = re.compile(r'[.](?!\d)|(?<!\d)[.]')
>>> print pattern.sub(':', '375 mi. south of U.C.B is 3.4 degrees warmer.')
375 mi: south of U:C:B is 3.4 degrees warmer:
>>> for s,ss in ((s,pattern.sub(':', s)) for s in ('%s%s.%s%s'%(sp1,c1,c2,sp2)
...         for sp1 in ('', ' ')
...         for c1 in ('', 'x', '3')
...         for c2 in ('', 'x', '3')
...         for sp2 in ('', ' '))):
...     print '%10r => %r' %(s,ss)
...
'.' => ':'
'. ' => ': '
'.x' => ':x'
'.x ' => ':x '
'.3' => ':3'
'.3 ' => ':3 '
'x.' => 'x:'
'x. ' => 'x: '
'x.x' => 'x:x'
'x.x ' => 'x:x '
'x.3' => 'x:3'
'x.3 ' => 'x:3 '
'3.' => '3:'
'3. ' => '3: '
'3.x' => '3:x'
'3.x ' => '3:x '
'3.3' => '3.3'
'3.3 ' => '3.3 '
' .' => ' :'
' . ' => ' : '
' .x' => ' :x'
' .x ' => ' :x '
' .3' => ' :3'
' .3 ' => ' :3 '
' x.' => ' x:'
' x. ' => ' x: '
' x.x' => ' x:x'
' x.x ' => ' x:x '
' x.3' => ' x:3'
' x.3 ' => ' x:3 '
' 3.' => ' 3:'
' 3. ' => ' 3: '
' 3.x' => ' 3:x'
' 3.x ' => ' 3:x '
' 3.3' => ' 3.3'
' 3.3 ' => ' 3.3 '

Regards,
Bengt Richter
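
The table above has to be checked by eye. A possible Python 3 variant (the thread itself uses Python 2.4) generates the same 36 strings with `itertools.product` and compares the regex against a character-by-character reference. The `reference` function and the `failures` list are hypothetical helpers written for this sketch; they encode the rule from the original post, that a period survives only when a digit sits on both sides.

```python
import itertools
import re

pattern = re.compile(r'[.](?!\d)|(?<!\d)[.]')  # pattern from the post above

def reference(text):
    """Hypothetical reference: keep a period only when digits sit on both sides."""
    out = []
    for i, ch in enumerate(text):
        left = text[i - 1] if i > 0 else ''
        right = text[i + 1] if i + 1 < len(text) else ''
        if ch == '.' and not (left.isdigit() and right.isdigit()):
            out.append(':')
        else:
            out.append(ch)
    return ''.join(out)

# Same combinations, in the same order, as the generator expression above.
cases = ['%s%s.%s%s' % (sp1, c1, c2, sp2)
         for sp1, c1, c2, sp2 in itertools.product(
             ('', ' '), ('', 'x', '3'), ('', 'x', '3'), ('', ' '))]

failures = [s for s in cases if pattern.sub(':', s) != reference(s)]
print('%d cases, %d failures' % (len(cases), len(failures)))
```

Because the reference does not use `re` at all, agreement between the two is some evidence that the pattern matches the stated rule, at least on these inputs.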
 
Peter Hansen

Bengt said:
Goes to show you ;-/ Do we need more tests than these?
[snip loads of tests]

Hmm... if I were doing this for real, not only would the
tests actually *tell* me when there was a failure, but
I would also throw in a few more cases involving larger
strings that more closely represent the expected real
inputs (i.e. using some numbers like 3.1415 and using
some strings that have the periods as punctuation such
as in the OP's original first "test case"). That way
if, during maintenance, somebody changes the algorithm
significantly, I'll be confident that it still covers
the broader set of cases, as well as the (exhaustive?)
set you've defined, which appear at first glance to
cover all the possible combinations of x and 3 and .
that might happen...

I'd also probably be generating most of the existing
test cases automatically, just to be sure I've got
100% coverage. Are you sure you didn't leave out one?
And what about, say, ".." or ".x3."? :)

-Peter
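
One way the suggestions above might look in practice is a small table of inputs and hand-written expected outputs that reports failures on its own. This is a Python 3 sketch, not code from the thread; `EXPECTED` and `run_cases` are illustrative names, and the inputs beyond the thread's own examples ('pi is 3.1415.', '.5', '5.') are made up for the purpose.

```python
import re

# Bengt's pattern from earlier in the thread.
pattern = re.compile(r'[.](?!\d)|(?<!\d)[.]')

# Hand-written expectations; inputs not quoted in the thread are
# illustrative assumptions, not real data.
EXPECTED = {
    '375 mi. south of U.C.B. is 3.4 degrees warmer':
        '375 mi: south of U:C:B: is 3.4 degrees warmer',
    'pi is 3.1415.': 'pi is 3.1415:',
    '..': '::',
    '.x3.': ':x3:',
    '.5': ':5',
    '5.': '5:',
}

def run_cases(cases=EXPECTED):
    """Return (input, expected, actual) for every case that fails."""
    return [(s, want, pattern.sub(':', s))
            for s, want in cases.items()
            if pattern.sub(':', s) != want]

failures = run_cases()
for s, want, got in failures:
    print('FAIL %r: expected %r, got %r' % (s, want, got))
print('%d of %d cases failed' % (len(failures), len(EXPECTED)))
```

Keeping the expectations in a plain dict makes it cheap to add a case whenever a new edge case turns up.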
 
Paul McGuire

Aaron -

Here's a pyparsing approach (requires latest 1.3 pyparsing version).
It may not be as terse or fast as your regexp, but it may be easier to
maintain.

By defining floatNum ahead of DOT in the scanner definition, you
specify the dot-containing expressions that you do *not* want to have
dots converted to colons.

-- Paul

===================
from pyparsing import Word,Literal,replaceWith, Combine, nums

DOT = Literal(".").setParseAction( replaceWith(":") )
floatNum = Combine( Word(nums) + "." + Word(nums) )

scanner = floatNum | DOT

testdata = "'375 mi. south of U.C.B is 3.4 degrees warmer."

print scanner.transformString( testdata )
===================
prints out:
'375 mi: south of U:C:B is 3.4 degrees warmer:
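
The same "match the numbers first, then the bare dot" idea can be sketched with the standard `re` module alone, for readers without pyparsing installed. This is a Python 3 analogue under stated assumptions, not a translation of the pyparsing code: `TOKEN` and `dots_to_colons` are names invented here, and the float branch is simply digits, a period, digits.

```python
import re

# Alternation order matters: the float branch is tried before the bare
# period, so periods inside numbers are consumed without being replaced.
TOKEN = re.compile(r'\d+\.\d+|\.')

def dots_to_colons(text):
    """Replace each period with a colon unless it is part of a digits.digits token."""
    def repl(match):
        token = match.group(0)
        return ':' if token == '.' else token
    return TOKEN.sub(repl, text)

print(dots_to_colons('375 mi. south of U.C.B is 3.4 degrees warmer.'))
```

This sketch is not identical to the lookaround pattern from earlier in the thread. On '12.5.7' it yields '12.5:7', because '12.5' is consumed as one number and the next period is then a bare dot, whereas `[.](?!\d)|(?<!\d)[.]` leaves both periods alone.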
 