Regular Expression Help

Jean-Claude Neveu

Hello,

I was wondering if someone could tell me where
I'm going wrong with my regular expression. I'm
trying to write a regexp that identifies whether
a string contains a correctly-formatted currency
amount. I want to support dollars, UK pounds and
Euros, but the example below deliberately omits
Euros in case the Euro symbol gets mangled
anywhere in email or listserver processing. I
also want people to be able to omit the currency symbol if they wish.

My regexp that I'm matching against is: "^\$\£?\d{0,10}(\.\d{2})?$"

Here's how I think it should work (but clearly
I'm wrong, because it does not actually work):

^\$\£? Require zero or one instance of $ or £ at the start of the string.
d{0,10} Next, require between zero and ten alpha characters.
(\.\d{2})? Optionally, two characters can
follow. They must be preceded by a decimal point.

Examples of acceptable input should be:

$12.42
$12
£12.42
$12,482.96 (now I think about it, I have not catered for this in my regexp)

And unacceptable input would be:

$12b.42
blah
$blah
etc


Here is my Python script:

#
import re

def is_currency(str):
    rex = "^\$\£?\d{0,10}(\.\d{2})?$"
    if re.match(rex, str):
        return 1
    else:
        return 0

def test_match(str):
    if is_currency(str):
        print str + " is a match"
    else:
        print str + " is not a match"

# All should match except the last two
test_match("$12.47")
test_match("12.47")
test_match("£12.47")
test_match("£12")
test_match("$12")
test_match("$12588.47")
test_match("$12,588.47")
test_match("£12588.47")
test_match("12588.47")
test_match("£12588")
test_match("$12588")
test_match("blah")
test_match("$12b.56")


AND HERE IS THE OUTPUT FROM THE ABOVE SCRIPT:
$12.47 is a match
12.47 is not a match
£12.47 is not a match
£12 is not a match
$12 is a match
$12588.47 is a match
$12,588.47 is not a match
£12588.47 is not a match
12588.47 is not a match
£12588 is not a match
$12588 is a match
blah is not a match
$12b.56 is not a match

Many thanks in advance. Regular expressions are not my strong suit :)

J-C
 
rurpy

My regexp that I'm matching against is: "^\$\£?\d{0,10}(\.\d{2})?$"

Here's how I think it should work (but clearly
I'm wrong, because it does not actually work):

^\$\£? Require zero or one instance of $ or £ at the start of the string.

The "or" in "$ or £" above is a vertical bar. You
want ^(\$|£)? here.
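
A quick way to see the difference, as a sketch only: the rest of the pattern is left exactly as posted, the helper takes the compiled pattern as an extra (hypothetical) argument, and the syntax is Python 3 rather than the Python 2 of the original script.

```python
import re

# Posted prefix `^\$\£?`: a mandatory "$" followed by an optional "£".
ORIGINAL = re.compile(r"^\$\£?\d{0,10}(\.\d{2})?$")
# Suggested prefix `^(\$|£)?`: optionally one of "$" or "£".
SUGGESTED = re.compile(r"^(\$|£)?\d{0,10}(\.\d{2})?$")

def is_currency(pattern, text):
    """Return True if the whole of `text` matches `pattern`."""
    return pattern.match(text) is not None

for sample in ["$12.47", "£12.47", "12.47", "$12b.56"]:
    print(sample, is_currency(ORIGINAL, sample), is_currency(SUGGESTED, sample))
```

With the alternation in place, the "£" and symbol-free samples are accepted as well, while "$12b.56" is still rejected.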
 
John Machin

The "or" in "$ or £" above is a vertical bar.  You
want ^(\$|£)? here.

Best not to use a capturing group (blah) when you don't need to
capture ... use (?:blah) instead.

When the alternatives are all single characters, for greater typing
efficiency and computing efficiency use a character class:

^[\$£]?
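
To make the three spellings concrete, here is a small sketch in Python 3 syntax. The trailing `\d+` is only an assumption added so that there is something to match after the symbol; it is not part of the advice above.

```python
import re

capturing = re.compile(r"^(\$|£)?\d+")        # group is captured
non_capturing = re.compile(r"^(?:\$|£)?\d+")  # same match, nothing captured
char_class = re.compile(r"^[\$£]?\d+")        # single-character alternatives as a class

for pattern in (capturing, non_capturing, char_class):
    match = pattern.match("£12")
    # `groups` is the number of capturing groups in the compiled pattern.
    print(pattern.pattern, match.group(0), pattern.groups)
```

All three accept the same strings; only the first one records the symbol as group 1.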
 
Graham Breed

Jean-Claude Neveu said:
Hello,

I was wondering if someone could tell me where I'm going wrong with my
regular expression. I'm trying to write a regexp that identifies whether
a string contains a correctly-formatted currency amount. I want to
support dollars, UK pounds and Euros, but the example below deliberately
omits Euros in case the Euro symbol gets mangled anywhere in email or
listserver processing. I also want people to be able to omit the
currency symbol if they wish.

If Euro symbols can get mangled, so can Pound signs.
They're both outside ASCII.

My regexp that I'm matching against is: "^\$\£?\d{0,10}(\.\d{2})?$"

Here's how I think it should work (but clearly I'm wrong, because it
does not actually work):

^\$\£? Require zero or one instance of $ or £ at the start of the
string.

^[$£]? is correct. And, as you're using re.match, the ^ is
superfluous. (A previous message suggested ^[\$£]? which
will also work. You generally need to escape a Dollar sign
but not here.)
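
Both points can be checked with a short sketch (Python 3 syntax; the `\d+` tail and the sample strings are assumptions made up for the illustration):

```python
import re

with_caret = re.compile(r"^[$£]?\d+")
without_caret = re.compile(r"[$£]?\d+")

# re.match only tries the pattern at the start of the string,
# so leaving out "^" changes nothing here.
print(with_caret.match("x$12"))     # None
print(without_caret.match("x$12"))  # None

# re.search scans forward, which is where "^" would start to matter.
print(without_caret.search("x$12").group(0))  # $12

# Inside a character class "$" is an ordinary character, escaped or not.
print(re.match(r"[$£]", "$") is not None, re.match(r"[\$£]", "$") is not None)
```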

You should also think about the encoding. In my terminal,
"£" is identical to '\xc2\xa3'. That is, two bytes for a
UTF-8 code point. If you assume this encoding, it's best to
make it explicit. And if you don't assume a specific
encoding it's best to convert to unicode to do the
comparisons, so for 2.x (or portability) your string should
start with u".

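The byte-versus-character difference can be shown with a sketch in Python 3 syntax, where a `bytes` pattern stands in for a Python 2 byte string. The `\d+$` tail is an assumption added for the illustration.

```python
import re

pound = "£"                      # Python 3 str: one Unicode code point
encoded = pound.encode("utf-8")  # the two bytes quoted above
print(encoded == b"\xc2\xa3")    # True

# Byte pattern: the class holds three separate bytes: "$", 0xC2 and 0xA3.
byte_class = re.compile(b"^[$\xc2\xa3]?\\d+$")
# Text pattern: the class holds two characters: "$" and "£".
text_class = re.compile(r"^[$£]?\d+$")

# The class can consume only one byte of the two-byte pound sign.
print(byte_class.match("£12".encode("utf-8")))  # None
print(text_class.match("£12") is not None)      # True
```

So a character class built from UTF-8 bytes does not mean "dollar or pound"; matching on decoded text avoids the problem.
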
d{0,10} Next, require between zero and ten alpha characters.

There's a backslash missing, but not from your original
expression. Digits are not "alpha characters".

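For what the missing backslash would do if it were missing from the code too, a tiny sketch (Python 3 syntax; the sample strings are made up):

```python
import re

no_backslash = re.compile(r"d{0,10}$")     # zero to ten literal letters "d"
with_backslash = re.compile(r"\d{0,10}$")  # zero to ten decimal digits

print(no_backslash.match("ddd") is not None)    # True
print(no_backslash.match("123") is not None)    # False
print(with_backslash.match("123") is not None)  # True
```
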
(\.\d{2})? Optionally, two characters can follow. They must be preceded
by a decimal point.

That works. Of course, \d{2} is longer than the simpler \d\d.

Note that you can comment the original expression like this:

rex = u"""(?x)
    ^[$£]?      # Zero or one instance of $ or £
                # at the start of the string.
    \d{0,10}    # Between zero and ten digits
    (\.\d{2})?  # Optionally, two digits.
                # They must be preceded by a decimal point.
    $           # End of line
"""

Then anybody (including you) who comes to read this in the
future will have some idea what you were trying to do.
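
For reference, a runnable version of the commented pattern, as a sketch in Python 3 syntax: it uses the `re.VERBOSE` flag instead of the inline `(?x)`, a raw string instead of the `u` prefix, and the hypothetical name `REX` for the compiled pattern.

```python
import re

REX = re.compile(r"""
    ^[$£]?        # Zero or one instance of $ or £ at the start of the string.
    \d{0,10}      # Between zero and ten digits.
    (\.\d{2})?    # Optionally, a decimal point followed by two digits.
    $             # End of the string.
""", re.VERBOSE)

def is_currency(text):
    """Return True if `text` matches the commented pattern."""
    return REX.match(text) is not None

for sample in ["$12.47", "12.47", "£12.47", "$12,588.47", "blah", "$12b.56"]:
    print(sample, is_currency(sample))
```

Because every part of the pattern is optional, the empty string also matches, and "$12,588.47" is still rejected.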

Examples of acceptable input should be:
$12.42
$12
£12.42
$12,482.96 (now I think about it, I have not catered for this in my
regexp)

Yes, you need to think about that.
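
One possible way to cater for it, offered as a sketch rather than a definitive answer (Python 3 syntax). The assumptions here are not from the thread: commas are accepted only as separators between groups of exactly three digits, at least one digit is required, and the names `AMOUNT` and `is_currency` are illustrative.

```python
import re

AMOUNT = re.compile(r"""
    ^[$£]?                    # optional currency symbol
    (?:
        \d{1,3}(?:,\d{3})+    # grouped digits such as 12,482
      | \d{1,10}              # or up to ten ungrouped digits
    )
    (?:\.\d{2})?              # optional decimal point and two digits
    $
""", re.VERBOSE)

def is_currency(text):
    """Return True if `text` looks like a currency amount under the assumptions above."""
    return AMOUNT.match(text) is not None

for sample in ["$12,482.96", "$12.42", "£12", "$12,48.96", "$12b.42", "blah"]:
    print(sample, is_currency(sample))
```

Requiring at least one digit also means the empty string and a bare "$" are rejected, which the original \d{0,10} would have allowed.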


Graham
 
