re.compile for names

brad

I am developing a list of 3 character strings like this:

and
bra
cam
dom
emi
mar
smi
....

The goal of the list is to have enough strings to identify files that
may contain the names of people. Missing a name in a file is unacceptable.

For example, the string 'mar' would get marc, mark, mary, maria... 'smi'
would get smith, smiley, smit, etc. False positives are OK (getting
common words instead of people's names is OK).

I may end up with a thousand or so of these 3 character strings. Is that
too much for an re.compile to handle? Also, is this a bad way to
approach this problem? Any ideas for improvement are welcome!

I can provide more info off-list for those who would like.

Thank you for your time,
Brad
 
Marc 'BlackJack' Rintsch

I am developing a list of 3 character strings like this:

and
bra
cam
dom
emi
mar
smi
...

The goal of the list is to have enough strings to identify files that
may contain the names of people. Missing a name in a file is unacceptable.

Then simply return `True` for any file that contains at least two or three
ASCII letters in a row. Easily written as a short re. ;-)
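
A minimal sketch of that "short re", assuming the file contents have already been read into a string. The function name may_contain_name is made up for illustration and is not from the thread:

```python
import re

# Any run of at least two ASCII letters counts as a possible name.
LETTER_RUN = re.compile(r"[A-Za-z]{2,}")

def may_contain_name(text):
    """Return True if text has two or more ASCII letters in a row."""
    return LETTER_RUN.search(text) is not None

print(may_contain_name("invoice 2007-11-30"))  # True
print(may_contain_name("12345 67890"))         # False
```

This never misses an ASCII name of two or more letters, but it also flags nearly every text file, which is the point of the joke.
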
I may end up with a thousand or so of these 3 character strings. Is that
too much for an re.compile to handle? Also, is this a bad way to
approach this problem? Any ideas for improvement are welcome!

Unless you can come up with some restrictions to the names, just follow
the advice above or give up. I saw a documentary about someone with the
name "Scary Guy" in his ID papers recently. What about names with letters
not in the ASCII range?

Ciao,
Marc 'BlackJack' Rintsch
 
Paul McGuire

I am developing a list of 3 character strings like this:

and
bra
cam
dom
emi
mar
smi
...

The goal of the list is to have enough strings to identify files that
may contain the names of people. Missing a name in a file is unacceptable.

For example, the string 'mar' would get marc, mark, mary, maria... 'smi'
would get smith, smiley, smit, etc. False positives are OK (getting
common words instead of people's names is OK).

I may end up with a thousand or so of these 3 character strings. Is that
too much for an re.compile to handle? Also, is this a bad way to
approach this problem? Any ideas for improvement are welcome!

I can provide more info off-list for those who would like.

Thank you for your time,
Brad

There are only 17,576 possible 3-letter strings, so you must keep your
percentage of this number small for this filter to be of any use.
With a list of a dozen or so strings, this may work okay for you. But
the more of these strings that you add, the more the number of false
positives will frustrate your attempts at making any sense of the
results. I suspect that using a thousand or so of these strings will
end up matching 95+% of all files.
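
The arithmetic can be sketched roughly as follows. The match estimate assumes every 3-letter window in a file is an independent, uniformly random string, which real text is not, so treat the result as an illustration only. The helper name p_match and the figure of 100 windows per file are assumptions, not from the thread:

```python
import string

ALPHABET_SIZE = len(string.ascii_lowercase)  # 26 letters
POSSIBLE = ALPHABET_SIZE ** 3                # all 3-letter strings

def p_match(n_patterns, n_windows):
    """Chance that at least one of n_windows random 3-letter windows
    equals one of n_patterns distinct strings (uniformity assumed)."""
    p_single = n_patterns / POSSIBLE
    return 1 - (1 - p_single) ** n_windows

print(POSSIBLE)               # 17576
print(f"{1000 / POSSIBLE:.1%}")  # 5.7%
print(p_match(1000, 100))     # hypothetical file with 100 windows
```

Under these assumptions even a short file is almost certain to match once the list reaches a thousand strings.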

You will also get better results if you constrain the location of the
match, for instance, looking for file names that *start* with
someone's name, instead of just containing them somewhere.

-- Paul
 
brad

Marc said:
What about names with letters not in the ASCII range?

Like Asian names? The names we encounter are spelled out in English...
like Xu, Zu, Li-Cheng, Matsumoto, Wantanabee, etc. So the ASCII approach
would still work. I guess.

My first thought was to spell out names entirely, but that quickly
seemed a bad idea. Doing an re on smith with whitespace boundaries is
more accurate than smi w/o, but the volume of names just makes it
impossible. And the volume of false positives using only smi makes it
somewhat worthless too.
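
The trade-off can be seen in a small sketch. It uses the regex word boundary \b rather than literal whitespace, which is a substitution on my part, and the sample strings are invented:

```python
import re

# Whole-word match: precise, but needs every name spelled out in full.
whole_word = re.compile(r"\bsmith\b", re.IGNORECASE)
# Bare 3-letter substring: broad, but matches many ordinary words.
substring = re.compile(r"smi", re.IGNORECASE)

samples = ["Call John Smith today", "the blacksmith smiled", "Smithson"]

for text in samples:
    print(text, bool(whole_word.search(text)), bool(substring.search(text)))
```

The whole-word pattern matches only the first sample, while the substring pattern matches all three.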

It's tough when a problem needs an accurate yet broad solution. Too
broad and the results are irrelevant as they'll include so many false
positives, too accurate and the results will be missing a few names.
It's a no-win :(

Thanks for the advice.

Brad
 
John Machin

I am developing a list of 3 character strings like this:

and
bra
cam
dom
emi
mar
smi
...

The goal of the list is to have enough strings to identify files that
may contain the names of people. Missing a name in a file is unacceptable.

The constraint that you have been given (no false negatives) is utterly
unrealistic. Given that constraint, forget the 3-letter substring
approach. There are many two-letter names. I have seen a genuine
instance of a one-letter surname ("O"). In jurisdictions which don't
disallow it, people can change their name to a string of digits. These
days you can't even rely on names starting with a capital letter ("i
think paris hilton is ...").

For example, the string 'mar' would get marc, mark, mary, maria... 'smi'
would get smith, smiley, smit, etc. False positives are OK (getting
common words instead of people's names is OK).

I may end up with a thousand or so of these 3 character strings.

If you get a large file of names and take every possible 3-letter
substring that you find, you would expect to get well over a thousand.
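
A rough sketch of that extraction, assuming the names are available as a list of strings. The function name and the three sample names are made up for illustration:

```python
def three_letter_substrings(names):
    """Collect every distinct all-letter 3-character substring of each name."""
    found = set()
    for name in names:
        word = name.lower()
        # Names shorter than 3 characters contribute nothing.
        for i in range(len(word) - 2):
            chunk = word[i:i + 3]
            if chunk.isalpha():
                found.add(chunk)
    return found

names = ["Mark", "Maria", "Smith"]
print(sorted(three_letter_substrings(names)))
```

Three short names already produce seven distinct substrings, so a large name file would be expected to produce far more.
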
Is that
too much for an re.compile to handle?

Suck it and see. I'd guess that re.compile("mar|smi|jon|bro|wil...") is
*NOT* the way to go.
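
For the "see" part, a quick experiment along these lines may help. The list is a hypothetical stand-in (the first 1000 three-letter combinations in alphabetical order), not a real name list, and the sketch only checks that such a pattern compiles and matches, not that it is a sensible design:

```python
import itertools
import re
import string

# Hypothetical stand-in for the real list: "aaa", "aab", ... (1000 items).
combos = itertools.product(string.ascii_lowercase, repeat=3)
prefixes = ["".join(c) for c in itertools.islice(combos, 1000)]

# One big alternation, as in the question.
pattern = re.compile("|".join(prefixes))

print(len(prefixes))
print(bool(pattern.search("xx aab yy")))
print(bool(pattern.search("zzz")))
```
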
Also, is this a bad way to
approach this problem?

Yes. At the very least I'd suggest that you need to break up your file
into "words" and then consider whether each word is part of a "name".
Much depends on context, if you want to cut down on false positives --
"we went 2 paris n staid at the hilton", "the bill from the smith was
too high".
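
A minimal sketch of the word-splitting step, assuming the 3-letter strings are meant as word prefixes and are kept in a set. The function name and the tiny prefix sets are invented for illustration, and no context analysis is attempted:

```python
import re

# Runs of letters only (digits and underscores excluded); with str
# patterns in Python 3 this also accepts non-ASCII letters.
WORD = re.compile(r"[^\W\d_]+")

def candidate_words(text, prefixes):
    """Return the words whose first three letters are in prefixes."""
    return [w for w in WORD.findall(text) if w[:3].lower() in prefixes]

print(candidate_words("we went 2 paris n staid at the hilton", {"par", "hil"}))
print(candidate_words("the bill from the smith was too high", {"smi"}))
```

Both sentences yield candidates even though neither mentions a person, which is why the context matters.
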
Any ideas for improvement are welcome!

1. Get the PHB to come up with a more realistic constraint.
2. http://en.wikipedia.org/wiki/Named_entity_recognition

HTH,
John
 
John Machin

Like Asian names? The names we encounter are spelled out in English...
like Xu, Zu, Li-Cheng, Matsumoto, Wantanabee, etc.

"spelled out in English"? "English" has nothing to do with it.

The first 3 are Chinese, spelled using the Pinyin system, which happens
to use "Roman" letters [including non-ASCII ü]. They may appear adorned
with tone marks [not ASCII] or tone digits. The 4th and 5th [which is
presumably intended to be "Watanabe"] are Japanese, using the Romaji
system, which ... you guess the rest :)
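
To illustrate why the ASCII assumption matters, here is a small sketch. The helper strip_marks is my own illustration, not something proposed in the thread, and dropping the marks loses information (ü becomes plain u), so it is shown only as one possible workaround:

```python
import re
import unicodedata

ASCII_RUN = re.compile(r"[A-Za-z]{2,}")

def strip_marks(text):
    """Decompose characters and drop combining marks (tone marks, diaeresis)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

name = "L\u01da"  # "L" plus u with diaeresis and a tone mark
print(ASCII_RUN.search(name))               # None: only one ASCII letter
print(ASCII_RUN.search(strip_marks(name)))  # a match on "Lu"
```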

Cheers,
John
 
John Machin

Seems to me the OP is looking for people-names inside file-contents, not
inside file-names.

[snip]
You will also get better results if you constrain the location of the
match, for instance, looking for file names that *start* with
someone's name, instead of just containing them somewhere.

YMevidentlyV :)
 
