Regular expression fun. Repeated matching of a group Q

matteosartori · Feb 24, 2006

Hi all,

I've spent all morning trying to work this one out:

I've got the following string:

<td>04/01/2006</td><td>Wednesday</td><td> </td><td>09:14</td><td>12:44</td><td>12:50</td><td>17:58</td><td> </td><td> </td><td> </td><td> </td><td>08:14</td>

from which I'm attempting to extract the date and the five times
into a list. Only the very last time is guaranteed to be there, so it
should also work for a line like:

<td>03/01/2006</td><td>Tuesday</td><td>Annual_Holiday</td><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td><td>08:00</td>

My Python regular expression to match that is currently:

digs = re.compile(
r'<td>(\d{2}\/\d{2}\/\d{4})</td>.*?(?:<td>(\d+\:\d+)</td>).*$' )

which first extracts the date into group 1
then matches the tags between the date and the first instance of a time
into group 2
then matches the first instance of a time into group 3
but then group 4 grabs all the remaining string.

I've tried changing the time pattern into

(?:<td>(\d+\:\d+)</td>)+

but that doesn't seem to mean "grab one or more cases of the previous
regexp."
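
For what it's worth, the + does repeat the group; Python's re module just keeps only the text from the last repetition of a capturing group. Below is a minimal sketch of that behaviour, with re.findall as one way to collect every time. It assumes a shortened copy of the row above, and the names row, last_only and all_times are made up for the illustration.

```python
import re

# Shortened sample row (an assumption for this sketch); empty cells hold a space.
row = ('<td>04/01/2006</td><td>Wednesday</td><td> </td>'
       '<td>09:14</td><td>12:44</td><td> </td><td>08:14</td>')

# A capturing group inside a repeat is overwritten on every repetition,
# so only the last repetition's text is left in the group afterwards.
repeated = re.search(r'(?:<td>(\d+:\d+)</td>)+', row)
last_only = repeated.group(1)

# findall scans the whole string and returns the group's text for every
# non-overlapping match, which gives all the times in order.
all_times = re.findall(r'<td>(\d+:\d+)</td>', row)

print(last_only)   # 12:44 (the run 09:14, 12:44 ends at the blank cell)
print(all_times)   # ['09:14', '12:44', '08:14']
```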

Any Python regexp gurus with a hint would be greatly appreciated.

M@

johnzenger · Feb 24, 2006

There's more to re than just sub. How about:

sanesplit = re.split(r"</td><td>|<td>|</td>", text)
date = sanesplit[1]
times = [time for time in sanesplit if re.match(r"\d\d:\d\d", time)]

.... then "date" contains the date at the beginning of the line and
"times" contains all your times.

matteosartori · Feb 24, 2006

Thanks,

The date = sanesplit[1] line complains about the "list index being out
of range", which is probably because not all lines have
a <td> in them, something I didn't explain in the previous post.

I'd need some way of ensuring, as with the pattern I'd concocted, that
a valid line actually starts with a <td> containing a /-separated
date.

As an aside, is it not actually possible to do what I was trying with a
single pattern or is it just not practical?

M@

johnzenger · Feb 24, 2006

You can check len(sanesplit) to see how big your list is. If it is <
2, then there were no <td>'s, so move on to the next line.
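
A rough sketch of that check, combined with the requirement that a valid line starts with a date cell. The function name parse_line and the two helper patterns are assumptions made for this example, not something from the posts above.

```python
import re

DATE_RE = re.compile(r'\d{2}/\d{2}/\d{4}$')
TIME_RE = re.compile(r'\d{2}:\d{2}$')

def parse_line(text):
    """Return (date, times) for a table row, or None for any other line."""
    cells = re.split(r'</td><td>|<td>|</td>', text)
    # A line without any <td> tags splits into a single element, and a
    # row whose first cell is not a date is skipped as well.
    if len(cells) < 2 or not DATE_RE.match(cells[1]):
        return None
    times = [cell for cell in cells if TIME_RE.match(cell)]
    return cells[1], times

print(parse_line('<td>03/01/2006</td><td>Tuesday</td>'
                 '<td>Annual_Holiday</td><td> </td><td>08:00</td>'))
print(parse_line('<tr>'))
```

Lines that return None can simply be skipped by the caller.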

It is probably possible to do the whole thing with a regular
expression. It is probably not wise to do so. Regular expressions are
difficult to read, and, as you discovered, difficult to program and
debug. In many cases, Python code that relies on regular expressions
for lots of program logic runs slower than code that uses normal
Python.

Suppose "words" contains all the words in English. Compare these two
lines:

foobarwords1 = [x for x in words if re.search("foo|bar", x) ]
foobarwords2 = [x for x in words if "foo" in x or "bar" in x ]

I haven't tested this with 2.4, but as of a few years ago it was a safe
bet that foobarwords2 will be calculated much, much faster. Also, I
think you will agree, foobarwords2 is a lot easier to read.
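
If you want to measure it rather than take it on faith, something like the following timeit sketch should do. The small repeated word list is a made-up stand-in for "words", and the numbers will depend on your Python version and machine, so no result is claimed here.

```python
import re
import timeit

# Hypothetical stand-in for "words"; a real test would load a word list.
words = ["food", "bar", "rebar", "python", "regex", "spam"] * 1000

def with_regex():
    return [x for x in words if re.search("foo|bar", x)]

def with_in():
    return [x for x in words if "foo" in x or "bar" in x]

# Both approaches must select the same words before timing means anything.
assert with_regex() == with_in()

# Total seconds for 20 runs of each list comprehension.
regex_seconds = timeit.timeit(with_regex, number=20)
in_seconds = timeit.timeit(with_in, number=20)
print("re.search: %.3fs   in: %.3fs" % (regex_seconds, in_seconds))
```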

matteosartori · Feb 24, 2006

Yes, it's easier to read without a doubt. I just wondered if I was
failing to do what I was trying to do because it couldn't be done or
because I hadn't properly understood what I was doing. Alas, it was
probably the latter.

Thanks for your help,

M@

Paul McGuire · Feb 24, 2006

Here's a (surprise!) pyparsing solution. -- Paul
(Get pyparsing at http://pyparsing.sourceforge.net.)

data = [
"""<td>04/01/2006</td><td>Wednesday</td><td> </td><td>09:14</td><td>12:44</td><td>12:50</td><td>17:58</td><td> </td><td> </td><td> </td><td> </td><td>08:14</td>""",
"""<td>03/01/2006</td><td>Tuesday</td><td>Annual_Holiday</td><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td><td> </td><td>08:00</td>"""
]

from pyparsing import *

startTD,endTD = makeHTMLTags("TD")
startTD = startTD.suppress()
endTD = endTD.suppress()
dayOfWeek = oneOf("Sunday Monday Tuesday Wednesday Thursday Friday Saturday")
nbsp = Literal(" ")
time = Combine(Word(nums,exact=2) + ":" + Word(nums,exact=2))
date = Combine(Word(nums,exact=2) + "/" + Word(nums,exact=2) + "/" +
               Word(nums,exact=4))

entry = ( startTD + date.setResultsName("date") + endTD +
          startTD + dayOfWeek.setResultsName("dayOfWeek") + endTD +
          startTD + ( Suppress(nbsp) |
                      Word(alphanums+"_").setResultsName("name") ) + endTD +
          OneOrMore(startTD + (Suppress(nbsp) | time) + endTD
                    ).setResultsName("dates")
        )

for d in data:
    res = entry.parseString(d)
    print res.date
    print res.dayOfWeek
    print res.name
    print res.dates
    print

Returns:

04/01/2006
Wednesday

['09:14', '12:44', '12:50', '17:58', '08:14']

03/01/2006
Tuesday
Annual_Holiday
['08:00']

plahey · Feb 24, 2006

Doesn't this do what you want?

import re

DATE_TIME_RE = re.compile(r'<td>((\d{2}\/\d{2}\/\d{4})|(\d{2}:\d{2}))<\/td>')

test = '<td>04/01/2006</td>' \
'<td>Wednesday</td>' \
'<td> </td>' \
'<td>09:14</td>' \
'<td>12:44</td>' \
'<td>12:50</td>' \
'<td>17:58</td>' \
'<td> </td>' \
'<td> </td>' \
'<td> </td>' \
'<td> </td>' \
'<td>08:14</td>'

out = [m[0] for m in DATE_TIME_RE.findall(test)]

for m in out:
    print m
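
A possible variation on the findall idea that also checks the line starts with a date cell and keeps the date apart from the times. It works in two steps rather than with one pattern; ROW_RE, TIME_RE and date_and_times are names invented for this sketch.

```python
import re

# Step 1: the line must begin with a date cell; the rest is kept as-is.
ROW_RE = re.compile(r'<td>(\d{2}/\d{2}/\d{4})</td>(.*)$')
# Step 2: every hh:mm cell in the remainder.
TIME_RE = re.compile(r'<td>(\d{2}:\d{2})</td>')

def date_and_times(line):
    """Return (date, [times]) if the line starts with a date cell, else None."""
    row = ROW_RE.match(line)   # match() only succeeds at the start of the line
    if row is None:
        return None
    return row.group(1), TIME_RE.findall(row.group(2))

sample = ('<td>04/01/2006</td><td>Wednesday</td><td> </td>'
          '<td>09:14</td><td>12:44</td><td>12:50</td><td>17:58</td>'
          '<td> </td><td>08:14</td>')
print(date_and_times(sample))
print(date_and_times('<p>not a table row</p>'))
```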

Larry Bates · Feb 24, 2006

This works:

import BeautifulSoup

test = '<td>04/01/2006</td>' \
'<td>Wednesday</td>' \
'<td> </td>' \
'<td>09:14</td>' \
'<td>12:44</td>' \
'<td>12:50</td>' \
'<td>17:58</td>' \
'<td> </td>' \
'<td> </td>' \
'<td> </td>' \
'<td> </td>' \
'<td>08:14</td>'

c=BeautifulSoup.BeautifulSoup(test)
times=[]
for i in c.childGenerator():
    if i.contents[0] == " ": continue
    times.append(i.contents[0])

date=times.pop(0)
day=times.pop(0)

print "date=", date
print "day=", day
print "times=", times

-Larry Bates
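
For anyone without BeautifulSoup installed, roughly the same idea can be sketched with the standard library's html.parser instead. This swaps in a different parser from the one used above, and it is written for Python 3; CellCollector and parse_row are hypothetical names. As with the loop above, a cell such as Annual_Holiday ends up in the remaining list alongside the times.

```python
from html.parser import HTMLParser

class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, skipping blank ones."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._buffer = None          # None while outside a <td> element

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._buffer is not None:
            text = "".join(self._buffer).strip()
            if text:                 # drop cells holding only whitespace or &nbsp;
                self.cells.append(text)
            self._buffer = None

def parse_row(html):
    """Return (date, day, remaining non-blank cells) for one row."""
    collector = CellCollector()
    collector.feed(html)
    collector.close()
    return collector.cells[0], collector.cells[1], collector.cells[2:]

print(parse_row('<td>04/01/2006</td><td>Wednesday</td><td>&nbsp;</td>'
                '<td>09:14</td><td> </td><td>08:14</td>'))
```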
