Regular expression to capture model numbers

krishnapostings

My quick attempt is below:
obj = re.compile(r'\b[0-9|a-zA-Z]+[\w-]+')
['TestThis', '1234', 'Test123AB-x']

This is not working.

Requirements:
The text must contain a combination of numbers, letters and hyphens,
with at least two of the three elements present. I can use it to set
min length using {}

Thanks,
Krishna.
 
John Machin

> My quick attempt is below:
> obj = re.compile(r'\b[0-9|a-zA-Z]+[\w-]+')

1. Provided the remainder of the pattern is greedy and it will be used
only for findall, the \b seems pointless.

2. What is the "|" for? Inside a character class, | has no special
meaning, and will match a literal "|" character (which isn't part of
your stated requirement).

3. \w will match underscore "_" ... not in your requirement.

4. Re [\w-] : manual says "If you want to include a ']' or a '-'
inside a set, precede it with a backslash, or place it as the first
character" which IIRC is the usual advice given with just about any
regex package -- actually, placing it at the end works but relying on
undocumented behaviour when there are alternatives that are as easy to
use and are documented is not a good habit to get into :)

5. You have used "+" twice; does this mean a minimum length of 2 is
part of your requirement?
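
Points 2 and 3 are easy to check at the interpreter. A minimal sketch follows; the sample string 'ab|cd x_y' is made up for illustration and is not from the original data:

```python
import re

# The pattern from the original post, unchanged.
original = re.compile(r'\b[0-9|a-zA-Z]+[\w-]+')

# Inside [...] the '|' is a literal pipe, and \w also accepts '_',
# so both of these unwanted strings are matched.
print(original.findall('ab|cd x_y'))
# ['ab|cd', 'x_y']
```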

> ['TestThis', '1234', 'Test123AB-x']
>
> This is not working.
>
> Requirements:
> The text must contain a combination of numbers, letters and hyphens,
> with at least two of the three elements present.

Unfortunately(?), regular expressions can't express complicated
conditions like that.

> I can use it to set
> min length using {}

I presume that you mean enforcing a minimum length of (say) 4 by using
{4,} in the pattern ...

You are already faced with the necessity of filtering out unwanted
matches programmatically. You might as well leave the length check
until then.

So: first let's establish what the pattern should be, ignoring the "2
or more out of 3 classes" rule and the length rule.

First character: Digits? Maybe not. Hyphen? Probably not.
Last character: Hyphen? Probably not.
Other characters: Any of (ASCII) letters, digits, hyphen.

So based on my guesses for answers to the above questions, the pattern
should be r"[A-Za-z][-A-Za-z0-9]*[A-Za-z0-9]"
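
A quick check of that guessed pattern against the sample input used elsewhere in this thread (the name candidate_re is just for illustration):

```python
import re

# Letter first, then letters/digits/hyphens, ending in a letter or digit.
candidate_re = re.compile(r"[A-Za-z][-A-Za-z0-9]*[A-Za-z0-9]")

print(candidate_re.findall('TestThis;1234;Test123AB-x'))
# ['TestThis', 'Test123AB-x']

# A trailing hyphen is left out of the match.
print(candidate_re.findall('A123-'))
# ['A123']
```

'1234' is dropped because it does not start with a letter; 'TestThis' still gets through, which is why the "2 out of 3" check has to happen afterwards.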

Note: this assumes that your data is impeccably clean, and there isn't
any such data outside textbooks. You may wish to make the pattern less
restrictive, so that you can pick up probable mistakes like "A123- 456"
instead of "A123-456".

Checking a candidate returned by findall could be done something like
this:

# initial setup:
import string
alpha_set = set(string.ascii_letters)
digit_set = set('1234567890')
min_len = 4 # for example

# each candidate (cand is one string returned by findall):
cand_set = set(cand)
ok = len(cand) >= min_len and (
    bool(cand_set & alpha_set)
    + bool(cand_set & digit_set)
    + bool('-' in cand_set)
    ) >= 2
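
Put together, the two steps (findall, then filter) might look like the sketch below. The function names is_model_number and find_model_numbers are invented for this example, and the pattern is the guessed one from above, so adjust both to the real data:

```python
import re
import string

ALPHA = set(string.ascii_letters)
DIGITS = set(string.digits)

# Guessed candidate pattern: letter first, no trailing hyphen.
CANDIDATE = re.compile(r"[A-Za-z][-A-Za-z0-9]*[A-Za-z0-9]")

def is_model_number(cand, min_len=4):
    """True if cand is long enough and uses at least 2 of the 3 classes."""
    chars = set(cand)
    kinds = bool(chars & ALPHA) + bool(chars & DIGITS) + ('-' in chars)
    return len(cand) >= min_len and kinds >= 2

def find_model_numbers(text, min_len=4):
    """Run findall, then filter the candidates programmatically."""
    return [c for c in CANDIDATE.findall(text) if is_model_number(c, min_len)]

print(find_model_numbers('TestThis;1234;Test123AB-x'))
# ['Test123AB-x']
```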

HTH,
John
 
Aahz

> My quick attempt is below:
> obj = re.compile(r'\b[0-9|a-zA-Z]+[\w-]+')
> ['TestThis', '1234', 'Test123AB-x']
>
> This is not working.

What isn't working? Why not just split() on ";"? You need to define
your problem more precisely if you want us to help.
 
Piet van Oostrum

John Machin said:
JM> On Apr 23, 8:01 am, (e-mail address removed) wrote:
JM> Unfortunately(?), regular expressions can't express complicated
JM> conditions like that.

Yes, they can but it is not pretty.

The pattern must start with a letter, a digit or a hyphen.

If it starts with a letter, for example, there must be at least a hyphen
or a digit somewhere. So let us concentrate on the first one of these
that occurs in the string. Then the preceding things are only letters
and after it can be any combination of letters, digits and hyphens. So
the pattern for this is (when we write L for letters, and d for digits):

L+[-d][-Ld]*.

Similarly for strings starting with a digit and with a hyphen. Now
replacing L with [A-Za-z] and d with [0-9] or \d and factoring out the
[-Ld]* which is common to all 3 cases you get:

(?:[A-Za-z]+[-0-9]|[0-9]+[-A-Za-z]|-+[0-9A-Za-z])[-0-9A-Za-z]*

>>> obj = re.compile(r'(?:[A-Za-z]+[-0-9]|[0-9]+[-A-Za-z]|-+[0-9A-Za-z])[-0-9A-Za-z]*')
>>> re.findall(obj, 'TestThis;1234;Test123AB-x')
['Test123AB-x']

Or you can use re.I and mention only one case of letters:

obj = re.compile(r'(?:[a-z]+[-0-9]|[0-9]+[-a-z]|-+[0-9a-z])[-0-9a-z]*', re.I)
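
To see the three alternatives at work, here is a small check using fullmatch on whole strings. The sample list is made up for illustration; only 'TestThis', '1234' and 'Test123AB-x' come from the thread:

```python
import re

# The re.I version of the pattern above.
two_of_three = re.compile(
    r'(?:[a-z]+[-0-9]|[0-9]+[-a-z]|-+[0-9a-z])[-0-9a-z]*', re.I)

samples = ['TestThis', '1234', '----', 'Test123AB-x', '12-34', '-A', 'a1']

# fullmatch requires the whole string to fit the pattern.
accepted = [s for s in samples if two_of_three.fullmatch(s)]
print(accepted)
# ['Test123AB-x', '12-34', '-A', 'a1']
```

Strings made of a single class ('TestThis', '1234', '----') are rejected because each alternative demands a character from a second class right after the initial run.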
 
John Machin

obj = re.compile(r'(?:[a-z]+[-0-9]|[0-9]+[-a-z]|-+[0-9a-z])[-0-9a-z]*', re.I)

Understandable and maintainable, I don't think. Suppose that instead
the first character is limited to being alphabetic. You have to go
through the whole process of elaborating the possibilities again, and I
don't consider that process qualifies as "express[ing] complicated
conditions like that".
 
Piet van Oostrum

John Machin said:
obj = re.compile(r'(?:[a-z]+[-0-9]|[0-9]+[-a-z]|-+[0-9a-z])[-0-9a-z]*', re.I)
JM> Understandable and maintainable, I don't think. Suppose that instead
JM> the first character is limited to being alphabetic. You have to go
JM> through the whole process of elaborating the possibilities again, and I
JM> don't consider that process qualifies as "express[ing] complicated
JM> conditions like that".

No, I don't think regular expressions are the best tool for this kind
of test. I just wanted to show that it *could* be done. By the way,
your additional hypothetical requirement that the first character should
be alphabetic just makes it easier: only the first alternative remains.
But on the other hand, suppose you had the requirement that the
pattern should not end in a hyphen; then it becomes even uglier. Or if
there should never be two hyphens in a row, I wouldn't even think of
using a re, although theoretically it would be possible.

Translating these requirements into re's is not `composable'.
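
By contrast, the same requirements compose easily as plain predicates. A rough sketch, where every function name and the RULES list are hypothetical and only meant to show the idea:

```python
import string

LETTERS = set(string.ascii_letters)
DIGITS = set(string.digits)

def two_of_three(s):
    """At least two of: letters, digits, hyphen."""
    chars = set(s)
    return (bool(chars & LETTERS) + bool(chars & DIGITS) + ('-' in chars)) >= 2

def no_trailing_hyphen(s):
    return not s.endswith('-')

def no_double_hyphen(s):
    return '--' not in s

# Adding or dropping a requirement is a one-line change to this list.
RULES = [two_of_three, no_trailing_hyphen, no_double_hyphen]

def passes_all(s, rules=RULES):
    return all(rule(s) for rule in rules)

print([s for s in ['Test123AB-x', 'A123-', 'A--1', 'TestThis'] if passes_all(s)])
# ['Test123AB-x']
```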
 
