A better RE?

Magnus Lycka

I want an re that matches strings like "21MAR06 31APR06 1236",
where the last part is day numbers (1-7), i.e it can contain
the numbers 1-7, in order, only one of each, and at least one
digit. I want it as three groups. I was thinking of

r"(\d\d[A-Z]\d\d) (\d\d[A-Z]\d\d) (1?2?3?4?5?6?7?)"

but that will match even if the third group is empty,
right? Does anyone have good and not overly complex RE for
this?

P.S. I know the "now you have two problems reply..."
 
Fredrik Lundh

Magnus said:
I want an re that matches strings like "21MAR06 31APR06 1236",
where the last part is day numbers (1-7), i.e it can contain
the numbers 1-7, in order, only one of each, and at least one
digit. I want it as three groups. I was thinking of

r"(\d\d[A-Z]\d\d) (\d\d[A-Z]\d\d) (1?2?3?4?5?6?7?)"

but that will match even if the third group is empty,
right? Does anyone have good and not overly complex RE for
this?

how about (untested)

r"(\d\d[A-Z]{3}\d\d) (\d\d[A-Z]{3}\d\d) (?=[1234567])(1?2?3?4?5?6?7?)"

where {3} means require three copies of the previous RE part, and
(?=[1234567]) means require at least one of 1-7, but don't move
forward if it matches.
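
A minimal sketch of how this pattern behaves, written in Python 3 syntax. The helper name `parse_entry` and the use of `re.fullmatch` are assumptions added for illustration; they are not part of the suggestion above. The pattern has no end anchor, so `re.match` alone accepts a valid prefix of the day field; `fullmatch` is one way to reject out-of-order input such as 1362.

```python
import re

# The suggested pattern: {3} requires three letters for the month, and the
# lookahead (?=[1-7]) requires at least one day digit without consuming it.
PATTERN = re.compile(
    r"(\d\d[A-Z]{3}\d\d) (\d\d[A-Z]{3}\d\d) (?=[1-7])(1?2?3?4?5?6?7?)"
)


def parse_entry(line):
    """Return the three groups, or None if the whole line does not match."""
    m = PATTERN.fullmatch(line)
    return m.groups() if m else None


print(parse_entry("21MAR06 31APR06 1236"))  # ('21MAR06', '31APR06', '1236')
print(parse_entry("21MAR06 31APR06 "))      # None: the lookahead fails
print(parse_entry("21MAR06 31APR06 1362"))  # None: digits out of order

# Without anchoring, match() stops after the longest ordered prefix.
print(PATTERN.match("21MAR06 31APR06 1362").group(3))  # 136
```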

</F>
 
Schüle Daniel

Magnus said:
I want an re that matches strings like "21MAR06 31APR06 1236",
where the last part is day numbers (1-7), i.e it can contain
the numbers 1-7, in order, only one of each, and at least one
digit. I want it as three groups. I was thinking of

r"(\d\d[A-Z]\d\d) (\d\d[A-Z]\d\d) (1?2?3?4?5?6?7?)"

but that will match even if the third group is empty,
right? Does anyone have good and not overly complex RE for
this?

P.S. I know the "now you have two problems reply..."
# non capturing group :)?)
(?=[1234567])(1?2?3?4?5?6?7?)" % (m,m))
'21MAR06'
'31APR06'
1236
 
bruno at modulix

Magnus said:
I want an re that matches strings like "21MAR06 31APR06 1236",
where the last part is day numbers (1-7), i.e it can contain
the numbers 1-7, in order, only one of each, and at least one
digit. I want it as three groups. I was thinking of

r"(\d\d[A-Z]\d\d) (\d\d[A-Z]\d\d) (1?2?3?4?5?6?7?)"

but that will match even if the third group is empty,
right? Does anyone have good and not overly complex RE for
this?
Simplest:

>>> import re
>>> s = "21MAR06 31APR06 1236"
>>> exp = r"(\d{2}[A-Z]{3}\d{2}) (\d{2}[A-Z]{3}\d{2}) (\d+)"
>>> re.match(exp, s).groups()
('21MAR06', '31APR06', '1236')

But this could give you false positives, depending on the real data.

If you want to be as strict as possible, this becomes a little bit hairy.
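
One way to keep the regular expression simple and still be strict is to match loosely and then check the day field in plain Python. This is only a sketch in Python 3 syntax; the names `LOOSE`, `valid_days` and `parse_strict` are made up for this example.

```python
import re

# Loose pattern: the third group accepts any run of digits.
LOOSE = re.compile(r"(\d{2}[A-Z]{3}\d{2}) (\d{2}[A-Z]{3}\d{2}) (\d+)")


def valid_days(days):
    """True if days is non-empty, uses only 1-7, and is strictly ascending."""
    return (
        days != ""
        and set(days) <= set("1234567")
        # strictly ascending also rules out repeated digits
        and all(a < b for a, b in zip(days, days[1:]))
    )


def parse_strict(line):
    """Return the three groups, or None if the line or day field is invalid."""
    m = LOOSE.fullmatch(line)
    if m is None or not valid_days(m.group(3)):
        return None
    return m.groups()


print(parse_strict("21MAR06 31APR06 1236"))  # ('21MAR06', '31APR06', '1236')
print(parse_strict("21MAR06 31APR06 1362"))  # None (a false positive for LOOSE)
print(parse_strict("21MAR06 31APR06 89"))    # None (digits outside 1-7)
```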
P.S. I know the "now you have two problems reply..."

!-)
 
Eddie Corns

Magnus Lycka said:
I want an re that matches strings like "21MAR06 31APR06 1236",
where the last part is day numbers (1-7), i.e it can contain
the numbers 1-7, in order, only one of each, and at least one
digit. I want it as three groups. I was thinking of

Just a small point - what does "in order" mean here? if it means that eg 1362
is not valid then you're stuck because it's context sensitive and hence not
regular.

I can't see how any of the fancy extensions could help here but maybe I'm just
lacking insight.

Now if "[\1-7]" worked you'd be home and dry.

Eddie
 
Fredrik Lundh

Eddie Corns wrote:

Just a small point - what does "in order" mean here? if it means that eg 1362
is not valid then you're stuck because it's context sensitive and hence not
regular.

I can't see how any of the fancy extensions could help here but maybe I'm
just lacking insight.

import re

p = re.compile("(?=[1234567])(1?2?3?4?5?6?7?)$")

def test(s):
    m = p.match(s)
    print repr(s), "=>", m and m.groups() or "none"

test("")
test("1236")
test("1362")
test("12345678")

prints

'' => none
'1236' => ('1236',)
'1362' => none
'12345678' => none

</F>
 
Jim

Eddie said:
Just a small point - what does "in order" mean here? if it means that eg 1362
is not valid then you're stuck because it's context sensitive and hence not
regular.
I'm not seeing that. Any finite language is regular -- as a last
resort you could list all ascending sequences of 7 or fewer digits (but
perhaps I misunderstood the original poster's requirements).
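
The "list them all" argument can be checked directly. The sketch below (Python 3 syntax, with variable names chosen for this example) enumerates every non-empty ascending selection of the digits 1-7, builds one big alternation from them, and compares it with the compact lookahead pattern from earlier in the thread.

```python
import re
from itertools import combinations, product

DIGITS = "1234567"

# combinations() yields tuples in input order, so each one is ascending
# and free of repeats. There are 2**7 - 1 = 127 non-empty selections.
ascending = ["".join(c) for n in range(1, 8) for c in combinations(DIGITS, n)]

enumerated = re.compile("|".join(ascending))
compact = re.compile(r"(?=[1-7])1?2?3?4?5?6?7?")


def agree(s):
    """True if both patterns accept s as a whole, or both reject it."""
    return bool(enumerated.fullmatch(s)) == bool(compact.fullmatch(s))


print(len(ascending))  # 127

# Compare the two patterns on every string of up to 4 digits from 0-8.
candidates = ["".join(p) for n in range(5) for p in product("012345678", repeat=n)]
print(all(agree(s) for s in candidates))  # True
```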

Jim
 
Eddie Corns

Jim said:
I'm not seeing that. Any finite language is regular -- as a last
resort you could list all ascending sequences of 7 or fewer digits (but
perhaps I misunderstood the original poster's requirements).

No, that's what I did. Just carelessness on my part, time I had a holiday!

Eddie
 
Paul McGuire

Magnus Lycka said:
I want an re that matches strings like "21MAR06 31APR06 1236",
where the last part is day numbers (1-7), i.e it can contain
the numbers 1-7, in order, only one of each, and at least one
digit. I want it as three groups. I was thinking of

r"(\d\d[A-Z]\d\d) (\d\d[A-Z]\d\d) (1?2?3?4?5?6?7?)"

but that will match even if the third group is empty,
right? Does anyone have good and not overly complex RE for
this?

P.S. I know the "now you have two problems reply..."

For the pyparsing-inclined, here are two versions, along with several
examples on how to extract the fields from the returned ParseResults object.
The second version is more rigorous in enforcing the days-of-week rules on
the 3rd field.

Note that the month field is already limited to valid month abbreviations,
and the same technique used to validate the days-of-week field could be used
to ensure that the date fields are valid dates (no 31st of FEB, etc.), that
the second date is after the first, etc.

-- Paul
Download pyparsing at http://pyparsing.sourceforge.net.


data = "21MAR06 31APR06 1236"
data2 = "21MAR06 31APR06 1362"

from pyparsing import *

# define format of an entry
month = oneOf("JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC")
date = Combine( Word(nums,exact=2) + month + Word(nums,exact=2) )
daysOfWeek = Word("1234567")
entry = date.setResultsName("startDate") + \
        date.setResultsName("endDate") + \
        daysOfWeek.setResultsName("weekDays") + \
        lineEnd

# extract entry data
e = entry.parseString(data)

# various ways to access the results
print e.startDate, e.endDate, e.weekDays
print "%(startDate)s : %(endDate)s : %(weekDays)s" % e
print e.asList()
print e
print

# get more rigorous in testing for valid days of week field
def rigorousDayOfWeekTest(s,l,toks):
    # remove duplicates from toks[0], sort, then compare to original
    tmp = "".join(sorted(dict([(ll,0) for ll in toks[0]]).keys()))
    if tmp != toks[0]:
        raise ParseException(s,l,"Invalid days of week field")

daysOfWeek.setParseAction(rigorousDayOfWeekTest)
entry = date.setResultsName("startDate") + \
        date.setResultsName("endDate") + \
        daysOfWeek.setResultsName("weekDays") + \
        lineEnd

print entry.parseString(data)
print entry.parseString(data2) # <-- raises ParseException
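
The date checks mentioned above (no 31st of FEB, second date after the first) can also be sketched with the standard library alone, independent of pyparsing. This is Python 3 syntax; `to_date` and `valid_range` are hypothetical helper names, and mapping a two-digit year YY to 20YY is an assumption the thread never states.

```python
import re
from datetime import date

MONTHS = "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split()
DATE_RE = re.compile(r"(\d\d)([A-Z]{3})(\d\d)")


def to_date(text):
    """Convert 'DDMMMYY' to a datetime.date, or raise ValueError."""
    m = DATE_RE.fullmatch(text)
    if m is None or m.group(2) not in MONTHS:
        raise ValueError("not a DDMMMYY date: %r" % text)
    day = int(m.group(1))
    month = MONTHS.index(m.group(2)) + 1
    year = 2000 + int(m.group(3))  # assumption: YY means 20YY
    # date() itself raises ValueError for impossible days such as 31 APR
    return date(year, month, day)


def valid_range(start, end):
    """True if both dates exist and the second is after the first."""
    try:
        return to_date(start) < to_date(end)
    except ValueError:
        return False


print(valid_range("21MAR06", "30APR06"))  # True
print(valid_range("21MAR06", "31APR06"))  # False: April has 30 days
print(valid_range("30APR06", "21MAR06"))  # False: end before start
```

Note that the sample string used throughout the thread contains 31APR06, which a check like this rejects.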
 
Magnus Lycka

Fredrik said:
Magnus Lycka wrote:
r"(\d\d[A-Z]{3}\d\d) (\d\d[A-Z]{3}\d\d) (?=[1234567])(1?2?3?4?5?6?7?)"

Thanks a lot. (I knew about {3} of course, I was in a hurry
when I posted since I was close to missing my train...)
 
