Regular expression to capture model numbers

krishnapostings · Apr 22, 2009

My quick attempt is below:
obj = re.compile(r'\b[0-9|a-zA-Z]+[\w-]+')
['TestThis', '1234', 'Test123AB-x']

This is not working.

Requirements:
The text must contain a combination of numbers, alphabets and hyphen
with at least two of the three elements present. I can use it to set
min length using ) {}

Thanks,
Krishna.

John Machin · Apr 22, 2009

> My quick attempt is below:
> obj = re.compile(r'\b[0-9|a-zA-Z]+[\w-]+')

1. Provided the remainder of the pattern is greedy and it will be used
only for findall, the \b seems pointless.

2. What is the "|" for? Inside a character class, | has no special
meaning, and will match a literal "|" character (which isn't part of
your stated requirement).

3. \w will match underscore "_" ... not in your requirement.

4. Re [\w-] : manual says "If you want to include a ']' or a '-'
inside a set, precede it with a backslash, or place it as the first
character" which IIRC is the usual advice given with just about any
regex package -- actually, placing it at the end works but relying on
undocumented behaviour when there are alternatives that are as easy to
use and are documented is not a good habit to get into.

5. You have used "+" twice; does this mean a minimum length of 2 is
part of your requirement?
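
Points 2, 3 and 5 are easy to check at the interpreter. A minimal sketch; the test strings are invented for illustration and are not from the original data:

```python
import re

# The pattern exactly as posted.
original = re.compile(r'\b[0-9|a-zA-Z]+[\w-]+')

# Point 2: '|' inside [...] is a literal character, so it is matched.
pipe_match = original.findall('ab|cd')

# Point 3: \w accepts '_', so the underscore is swallowed too.
underscore_match = original.findall('ab_cd')

# Point 5: two '+' quantifiers force at least two characters,
# so the single-character field 'a' is skipped.
short_match = original.findall('a;bc')

print(pipe_match, underscore_match, short_match)
```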

> ['TestThis', '1234', 'Test123AB-x']
>
> This is not working.
>
> Requirements:
> The text must contain a combination of numbers, alphabets and hyphen
> with at least two of the three elements present.

Unfortunately(?), regular expressions can't express complicated
conditions like that.

> I can use it to set
> min length using ) {}

I presume that you mean enforcing a minimum length of (say) 4 by using
{4,} in the pattern ...

You are already faced with the necessity of filtering out unwanted
matches programmatically. You might as well leave the length check
until then.
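
For reference, this is roughly what a {4,} quantifier does on its own. A small sketch, assuming a simple character class and made-up sample text (neither comes from the original post):

```python
import re

# {4,} means "four or more repetitions" of the preceding class.
at_least_four = re.compile(r'[-A-Za-z0-9]{4,}')

# 'AB1' has only three characters, so it is not reported.
hits = at_least_four.findall('AB1;A-12;X9-777')
print(hits)
```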

So: first let's establish what the pattern should be, ignoring the "2
or more out of 3 classes" rule and the length rule.

First character: Digits? Maybe not. Hyphen? Probably not.
Last character: Hyphen? Probably not.
Other characters: Any of (ASCII) letters, digits, hyphen.

So based on my guesses for answers to the above questions, the pattern
should be r"[A-Za-z][-A-Za-z0-9]*[A-Za-z0-9]"

Note: this assumes that your data is impeccably clean, and there isn't
any such data outside textbooks. You may wish to make the pattern less
restrictive, so that you can pick up probable mistakes like "A123-
456" instead of "A123-456".
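
A quick sketch of how that pattern behaves with findall. The sample string extends the one from the original post with the "A123- 456" mistake mentioned above; the variable names are only for illustration:

```python
import re

candidate_re = re.compile(r"[A-Za-z][-A-Za-z0-9]*[A-Za-z0-9]")

sample = "TestThis;1234;Test123AB-x;A123- 456"
found = candidate_re.findall(sample)

# '1234' is skipped because the first character must be a letter.
# 'A123- 456' is silently cut down to 'A123': the last character may
# not be a hyphen, and '456' cannot start a match on its own.
print(found)
```

Note that the damaged entry does not disappear; it comes back as a plausible-looking but wrong candidate, which is the argument for a looser pattern plus explicit checks.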

Checking a candidate returned by findall could be done something like
this:

# initial setup:
import string
alpha_set = set(string.ascii_letters)
digit_set = set('1234567890')
min_len = 4 # for example

# each candidate:
cand_set = set(cand)
ok = len(cand) >= min_len and (
    bool(cand_set & alpha_set)
    + bool(cand_set & digit_set)
    + bool('-' in cand_set)
) >= 2
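
Put together, the two steps might look something like the sketch below. The function names and the default min_len are placeholders chosen for this example, and the pattern is the guessed one from earlier in this post:

```python
import re
import string

ALPHA = set(string.ascii_letters)
DIGITS = set(string.digits)  # same characters as '0123456789'
CANDIDATE_RE = re.compile(r"[A-Za-z][-A-Za-z0-9]*[A-Za-z0-9]")

def is_model_number(cand, min_len=4):
    """Apply the length rule and the 'at least 2 of 3 classes' rule."""
    chars = set(cand)
    # Booleans add up as 0/1, giving the number of classes present.
    kinds = bool(chars & ALPHA) + bool(chars & DIGITS) + ('-' in chars)
    return len(cand) >= min_len and kinds >= 2

def find_model_numbers(text, min_len=4):
    """Loose regex first, then filter the candidates in plain Python."""
    return [c for c in CANDIDATE_RE.findall(text)
            if is_model_number(c, min_len)]

result = find_model_numbers('TestThis;1234;Test123AB-x')
print(result)
```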

HTH,
John

Aahz · Apr 22, 2009

> My quick attempt is below:
> obj = re.compile(r'\b[0-9|a-zA-Z]+[\w-]+')
> ['TestThis', '1234', 'Test123AB-x']
>
> This is not working.

What isn't working? Why not just split() on ";"? You need to define
your problem more precisely if you want us to help.
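
If the input really is a semicolon-separated list (an assumption; the original post never says so), the split() route could be sketched like this, with the helper name invented for the example:

```python
import string

LETTERS = set(string.ascii_letters)
DIGITS = set(string.digits)
ALLOWED = LETTERS | DIGITS | {'-'}

def looks_like_model_number(token):
    """True if token uses only allowed characters and mixes >= 2 kinds."""
    chars = set(token)
    if not chars or not chars <= ALLOWED:
        return False
    kinds = bool(chars & LETTERS) + bool(chars & DIGITS) + ('-' in chars)
    return kinds >= 2

fields = 'TestThis;1234;Test123AB-x'.split(';')
kept = [f for f in fields if looks_like_model_number(f)]
print(kept)
```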

Piet van Oostrum · Apr 23, 2009

John Machin said:
JM> On Apr 23, 8:01 am, (e-mail address removed) wrote:

JM> Unfortunately(?), regular expressions can't express complicated
JM> conditions like that.

Yes, they can but it is not pretty.

The pattern must start with a letter, a digit or a hyphen.

If it starts with a letter, for example, there must be at least a hyphen
or a digit somewhere. So let us concentrate on the first one of these
that occurs in the string. Then the preceding things are only letters
and after it can be any combination of letters, digits and hyphens. So
the pattern for this is (when we write L for letters, and d for digits):

L+[-d][-Ld]*.

Similarly for strings starting with a digit and with a hyphen. Now
replacing L with [A-Za-z] and d with [0-9] or \d and factoring out the
[-Ld]* which is common to all 3 cases you get:

(?:[A-Za-z]+[-0-9]|[0-9]+[-A-Za-z]|-+[0-9A-Za-z])[-0-9A-Za-z]*

obj = re.compile(r'(?:[A-Za-z]+[-0-9]|[0-9]+[-A-Za-z]|-+[0-9A-Za-z])[-0-9A-Za-z]*')
re.findall(obj, 'TestThis;1234;Test123AB-x')

['Test123AB-x']

Or you can use re.I and mention only one case of letters:

obj = re.compile(r'(?:[a-z]+[-0-9]|[0-9]+[-a-z]|-+[0-9a-z])[-0-9a-z]*', re.I)
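
To see which whole strings the three alternatives accept, the re.I version can be run through fullmatch. A small sketch; the test cases beyond the three from the original post are made up:

```python
import re

two_of_three = re.compile(
    r'(?:[a-z]+[-0-9]|[0-9]+[-a-z]|-+[0-9a-z])[-0-9a-z]*', re.I)

cases = ['TestThis', '1234', '----', 'Test123AB-x', '12-34', '-x', 'a1']

# fullmatch requires the entire string to fit the pattern, so strings
# made of a single kind of character are rejected.
accepted = [s for s in cases if two_of_three.fullmatch(s)]
print(accepted)
```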

John Machin · Apr 23, 2009

Understandable and maintainable, I don't think. Suppose that instead
the first character is limited to being alphabetic. You have to go
through the whole process of elaborating the possibilities again, and I
don't consider that process qualifies as "express[ing] complicated
conditions like that".

Piet van Oostrum · Apr 24, 2009

John Machin said:
obj = re.compile(r'(?:[a-z]+[-0-9]|[0-9]+[-a-z]|-+[0-9a-z])[-0-9a-z]*', re.I)

JM> Understandable and maintainable, I don't think. Suppose that instead
JM> the first character is limited to being alphabetic. You have to go
JM> through the whole process of elaborating the possibilities again, and I
JM> don't consider that process qualifies as "express[ing] complicated
JM> conditions like that".

No, I don't think regular expressions are the best tool for this kind
of test. I just wanted to show that it *could* be done. By the way,
your additional hypothetical requirement that the first character should
be alphabetic just makes it easier: only the first alternative remains.
But on the other hand, suppose you had the requirement that the
pattern should not end in a hyphen; then it becomes even uglier. Or if
there should never be two hyphens in a row, I wouldn't even think of
using a re, although theoretically it would be possible.

Translating these requirements into re's is not `composable'.
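
By contrast, separate checks in ordinary code do compose: each extra requirement is one more entry in a list. A sketch of that idea, where the rule list, the function names and the sample text are all invented for illustration:

```python
import re

# Deliberately loose: any run of ASCII letters, digits and hyphens.
TOKEN_RE = re.compile(r'[-0-9A-Za-z]+')

def two_of_three(tok):
    """At least two of: letters, digits, hyphen."""
    kinds = (any(c.isalpha() for c in tok)
             + any(c.isdigit() for c in tok)
             + ('-' in tok))
    return kinds >= 2

RULES = [
    two_of_three,
    lambda tok: not tok.endswith('-'),  # must not end in a hyphen
    lambda tok: '--' not in tok,        # never two hyphens in a row
]

def find_valid(text, rules=RULES):
    """Keep only the tokens that satisfy every rule."""
    return [tok for tok in TOKEN_RE.findall(text)
            if all(rule(tok) for rule in rules)]

valid = find_valid('Test123AB-x;A123-;A--1;1234;B-7')
print(valid)
```

Adding or dropping a requirement means editing RULES only; the regex stays as it is.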
