Regular expressions, help?

Sania · Apr 19, 2012

Hi,
So I am trying to get the number of casualties from a text. The number
I need comes right after 'death toll', as you can see in the variable
called text. Here is my code.
I'm pretty sure my regex is correct; I think it's the group part
that's the problem.
I am using NLTK with Python. group() grabs the string in the parentheses
and stores it in deadnum, and I append deadnum to a list.

import re

deaths = []
text = ("accounts put the death toll at 637 and those missing at "
        "653 , but the total number is likely to be much bigger")
dead = re.match(r".*death toll.*(\d[,\d\.]*)", text)
# group(1) is the text captured by the parenthesised group
deadnum = dead.group(1)
deaths.append(deadnum)
print(deaths)

Any help would be appreciated,
Thank you,
Sania

Jussi Piitulainen · Apr 19, 2012

It's the regexp. The .* after "death toll" eats the input as far as it
can without making the whole match fail. The group matches only the
last digit in the text.

You could allow only non-digits before the number. Or you could look
up the variant of * that only matches as much as it must.
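
To make the difference concrete, here is a small sketch comparing the three forms in front of the same group. It assumes Python 3 syntax and the sample text from the question on a single line; the variable names are only illustrative.

```python
import re

# Sample text from the question, joined onto one logical line.
text = ("accounts put the death toll at 637 and those missing at "
        "653 , but the total number is likely to be much bigger")

number = r"(\d[,\d\.]*)"  # the capturing group from the question

# Greedy .* runs to the end, then backs off just far enough to leave one digit.
greedy = re.match(r".*death toll.*" + number, text).group(1)

# \D* cannot step over a digit, so it stops right before the first number.
non_digit = re.match(r".*death toll\D*" + number, text).group(1)

# Lazy .*? grows one character at a time and stops at the first digit.
lazy = re.match(r".*death toll.*?" + number, text).group(1)

print(greedy)     # 3
print(non_digit)  # 637
print(lazy)       # 637
```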

Sania · Apr 19, 2012

Hey Thanks,
So now my regex is

dead=re.match(r".*death toll.{0,20}(\d[,\d\.]*)", text)

But I only find 7, not 637. How is it that the group is only matching
the last digit? The whole thing is in parentheses, not just the last
part.

azrazer · Apr 19, 2012

Hi,
But there, your regex matches:
<something>death toll<anything of length <= 20>, followed by what
you capture (which is made up of at least one digit).
There are at least two issues here:
- the number of characters between "death toll" and the figure may be > 20
- your {0,20} is greedy => .{0,20} matches as many characters as it can
while still leaving one digit for (\d[,\d\.]*), since your group captures a
digit followed (or not) by digits, commas and dots
=====> so " at 63" is swallowed by .{0,20} and (\d[,\d\.]*) matches
the remaining digit "7"

A solution would be to follow what Jussi suggested...
=> dead=re.match(r".*death toll\D*(\d*)", text)
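
One way to see this for yourself is to put a second pair of parentheses around the .{0,20} part, so the match object reports what it swallowed. This is only a diagnostic sketch in Python 3 syntax; the extra group and the variable names are added for illustration and are not part of the original regex.

```python
import re

text = ("accounts put the death toll at 637 and those missing at "
        "653 , but the total number is likely to be much bigger")

# Extra group 1 exposes what the greedy .{0,20} consumed.
m = re.match(r".*death toll(.{0,20})(\d[,\d\.]*)", text)
swallowed = m.group(1)
captured = m.group(2)
print(repr(swallowed))  # ' at 63'
print(captured)         # 7

# With \D* in front, no digit can be consumed before the group.
fixed = re.match(r".*death toll\D*(\d*)", text).group(1)
print(fixed)            # 637
```

Note that (\d*) can also match the empty string, so this last pattern still "succeeds" with an empty group when no number follows.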

Sania said:
But I only find 7, not 637. How is it that the group is only matching
the last digit? The whole thing is in parentheses, not just the last part.

That is the greed of .{0,20}: yes, the whole group is in parentheses, but
only one digit remains by the time your group gets to match.

Good luck understanding regexes, it's a powerful tool!

best,
azra.

Jussi Piitulainen · Apr 19, 2012

It's still consuming the digits in the text that comes _before_ the
parenthesised group: the .{0,20} matches as _much_ as it _can_ without
making the whole regex fail, and the . in it also matches digits.

Try \D{0,20} to limit its matching ability to non-digits.

Try .{0,20}? to limit it to matching as _little_ as it can.

(The variant of * I referred to is *?; {} and {}? are similar.)
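
A short sketch of both bounded forms, in Python 3 syntax. The helper first_number and the second sample sentence are made up for illustration; the point of the second sample is that a bound of 20 makes the match fail outright when the number is further away, so the result of re.match should be checked before calling group().

```python
import re

text = ("accounts put the death toll at 637 and those missing at "
        "653 , but the total number is likely to be much bigger")
number = r"(\d[,\d\.]*)"

def first_number(pattern, s):
    """Return group 1 of a match at the start of s, or None if no match."""
    m = re.match(pattern, s)
    return m.group(1) if m else None

non_digits = first_number(r".*death toll\D{0,20}" + number, text)
lazy = first_number(r".*death toll.{0,20}?" + number, text)

# More than 20 characters separate "death toll" from the number here.
far = first_number(r".*death toll\D{0,20}" + number,
                   "the death toll, according to officials, rose to 637")

print(non_digits)  # 637
print(lazy)        # 637
print(far)         # None
```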

The simplicity of regexen is deceptive. Be careful. Be surprised.
<http://docs.python.org/library/re.html>. Keep them simple. Consider
also other means instead or in addition.

Jon Clements · Apr 19, 2012

Or just don't fully rely on a regex. For the sake of time, and the little sanity I believe I have left, I would just do something like:

death_toll = re.search(r'death toll.*?\d+', text).group().rsplit(' ', 1)[1]
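
Taken apart, the string handling at the end works like this. The sketch assumes the regex matched up to and including the first number, so the matched text is "death toll at 637"; that string is hard-coded here for illustration. The second argument to rsplit is the maximum number of splits, counted from the right, and [1] picks the piece after that single split.

```python
# Assumed result of .group(): the text matched up to the first number.
matched = "death toll at 637"

# Split at most once, starting from the right-hand end.
parts = matched.rsplit(" ", 1)
print(parts)       # ['death toll at', '637']
print(parts[1])    # 637

# For comparison, an unlimited split breaks at every space.
all_parts = matched.split(" ")
print(all_parts)   # ['death', 'toll', 'at', '637']
```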

hth,

Jon.

Sania · Apr 19, 2012

Thank you all so much!

I ended up using Jussi's advice: \D{0,20}
Azrazer, what you suggested works, but I need to make sure that it
catches numbers like 6,370 as well as 637. I tried tweaking the
regex from the one in your reply, but it didn't work
(it probably would have if I were more adept). But thanks!
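
For numbers with thousands separators, one possible tightening of the group is sketched below in Python 3 syntax. It is not what was actually used in this thread: the pattern \d+(?:,\d{3})*, the use of re.search instead of re.match with a leading .*, the function name death_toll, and the second and third sample sentences are all assumptions made for illustration. Unlike \d[,\d\.]*, this group does not pick up a trailing comma or full stop.

```python
import re

# Digits, optionally followed by groups of ",ddd" (e.g. 637 or 6,370).
TOLL = re.compile(r"death toll\D{0,20}(\d+(?:,\d{3})*)")

def death_toll(text):
    """Return the first number after 'death toll' as an int, or None."""
    m = TOLL.search(text)
    if m is None:
        return None
    return int(m.group(1).replace(",", ""))

plain = death_toll("accounts put the death toll at 637 and those missing at 653")
with_comma = death_toll("the death toll rose to 6,370, officials said")
missing = death_toll("no figures were given")

print(plain)       # 637
print(with_comma)  # 6370
print(missing)     # None
```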

Jon, I kind of see what you are doing. In the regex you say that after
"death toll" there can be 0 or more characters followed by 1 or more
digits (although I would need to allow a comma within the digits so it
catches 6,370). I can also see that you are splitting the string, but
I don't understand the 1 in rsplit(' ', 1)[1]. I am not really
familiar with the syntax, I guess.

Thanks again!

Andy · Apr 19, 2012

If you plan on doing more work with regular expressions in the future and you have access to a Windows machine, you may want to consider picking up a copy of RegexBuddy. I don't have any affiliation with the makers, but I have been using the software for a few years and it has saved me a lot of frustration.

Thanks,
-Andy-
