re.compile for names

brad · May 21, 2007

I am developing a list of 3 character strings like this:

and
bra
cam
dom
emi
mar
smi
....

The goal of the list is to have enough strings to identify files that
may contain the names of people. Missing a name in a file is unacceptable.

For example, the string 'mar' would get marc, mark, mary, maria... 'smi'
would get smith, smiley, smit, etc. False positives are OK (getting
common words instead of people's names is OK).

I may end up with a thousand or so of these 3 character strings. Is that
too much for an re.compile to handle? Also, is this a bad way to
approach this problem? Any ideas for improvement are welcome!
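For reference, a minimal sketch of the approach being asked about: joining the 3-character strings into one alternation and compiling it once. The `prefixes` list and the `may_contain_name` helper are hypothetical names used only for illustration, and the short list stands in for the real one.

```python
import re

# Hypothetical sample; the real list would hold a thousand or so entries.
prefixes = ["and", "bra", "cam", "dom", "emi", "mar", "smi"]

# re.escape guards against entries that happen to contain regex metacharacters.
pattern = re.compile("|".join(re.escape(p) for p in prefixes), re.IGNORECASE)

def may_contain_name(text):
    """Return True if any of the prefixes occurs anywhere in the text."""
    return pattern.search(text) is not None

print(may_contain_name("Report prepared by Mary Smith"))  # True ('mar' in 'Mary')
print(may_contain_name("1234 5678"))                      # False (no letters at all)
```

An alternation of about a thousand short literals should compile without trouble; whether the matches are useful is a separate question.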

I can provide more info off-list for those who would like.

Thank you for your time,
Brad

Marc 'BlackJack' Rintsch · May 21, 2007

I am developing a list of 3 character strings like this:

and
bra
cam
dom
emi
mar
smi
...

The goal of the list is to have enough strings to identify files that
may contain the names of people. Missing a name in a file is unacceptable.

Then simply return `True` for any file that contains at least two or three
ASCII letters in a row. Easily written as a short re. ;-)
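The tongue-in-cheek filter above could be sketched as follows. The helper name `could_contain_name` is made up for this example, and the minimum run length of two is one of the two values suggested.

```python
import re

# Any run of two or more ASCII letters counts as a possible name.
letters_run = re.compile(r"[A-Za-z]{2,}")

def could_contain_name(text):
    """Return True if the text has at least two ASCII letters in a row."""
    return letters_run.search(text) is not None

print(could_contain_name("Li"))          # True
print(could_contain_name("1 2 3 a 4"))   # False: only a single isolated letter
```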

I may end up with a thousand or so of these 3 character strings. Is that
too much for an re.compile to handle? Also, is this a bad way to
approach this problem? Any ideas for improvement are welcome!

Unless you can come up with some restrictions to the names, just follow
the advice above or give up. I saw a documentary about someone with the
name "Scary Guy" in his ID papers recently. What about names with letters
not in the ASCII range?

Ciao,
Marc 'BlackJack' Rintsch

Paul McGuire · May 21, 2007

I am developing a list of 3 character strings like this:

and
bra
cam
dom
emi
mar
smi
...

The goal of the list is to have enough strings to identify files that
may contain the names of people. Missing a name in a file is unacceptable.

For example, the string 'mar' would get marc, mark, mary, maria... 'smi'
would get smith, smiley, smit, etc. False positives are OK (getting
common words instead of people's names is OK).

I may end up with a thousand or so of these 3 character strings. Is that
too much for an re.compile to handle? Also, is this a bad way to
approach this problem? Any ideas for improvement are welcome!

I can provide more info off-list for those who would like.

Thank you for your time,
Brad

There are only 17,576 possible 3-letter strings, so you must keep your
percentage of this number small for this filter to be of any use.
With a list of a dozen or so strings, this may work okay for you. But
the more of these strings that you add, the more the number of false
positives will frustrate your attempts at making any sense of the
results. I suspect that using a thousand or so of these strings will
end up matching 95+% of all files.
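A rough back-of-the-envelope sketch of that arithmetic follows. It assumes every 3-letter window in a file is an independent, uniformly random trigram, which real text is not, so the numbers only indicate the trend. The function name `chance_of_no_match` is hypothetical.

```python
# Number of distinct 3-letter strings over a 26-letter alphabet.
total = 26 ** 3          # 17576
listed = 1000            # size of the proposed list
p_hit = listed / total   # chance that one random trigram is on the list

def chance_of_no_match(n_trigrams, p=p_hit):
    """Probability that none of n random trigrams is on the list
    (crude model: independent, uniformly distributed trigrams)."""
    return (1 - p) ** n_trigrams

print(total)
print(f"fraction of trigram space covered: {p_hit:.1%}")
for n in (10, 100, 1000):
    print(n, f"{chance_of_no_match(n):.4f}")
```

Under this crude model, a file with even a hundred letter trigrams almost never escapes a thousand-entry list, which is in line with the suspicion that nearly every file would match.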

You will also get better results if you constrain the location of the
match, for instance, looking for file names that *start* with
someone's name, instead of just containing them somewhere.
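A small sketch of what constraining the match location can look like, here by requiring the prefix to sit at the start of a word. The prefix list and sample strings are invented for illustration.

```python
import re

prefixes = ["bra", "mar", "smi"]  # hypothetical sample
alternation = "|".join(re.escape(p) for p in prefixes)

# Unconstrained: the prefix may occur anywhere.
anywhere = re.compile(alternation, re.IGNORECASE)
# Constrained: the prefix must begin at a word boundary.
at_word_start = re.compile(r"\b(?:" + alternation + r")", re.IGNORECASE)

print(bool(anywhere.search("summary.txt")))          # True: 'mar' inside 'summary'
print(bool(at_word_start.search("summary.txt")))     # False: not at a word start
print(bool(at_word_start.search("mary_notes.txt")))  # True
```

Note that `\b` treats the underscore as a word character, so a name following an underscore would not count as a word start with this pattern.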

-- Paul

brad · May 21, 2007

Marc said:
What about names with letters not in the ASCII range?

Like Asian names? The names we encounter are spelled out in English...
like Xu, Zu, Li-Cheng, Matsumoto, Wantanabee, etc. So the ASCII approach
would still work. I guess.

My first thought was to spell out names entirely, but that quickly
seemed a bad idea. Doing an re on smith with whitespace boundaries is
more accurate than smi w/o, but the volume of names just makes it
impossible. And the volume of false positives using only smi makes it
somewhat worthless too.

It's tough when a problem needs an accurate yet broad solution. Too
broad and the results are irrelevant as they'll include so many false
positives, too accurate and the results will be missing a few names.
It's a no-win.

Thanks for the advice.

Brad

John Machin · May 21, 2007

I am developing a list of 3 character strings like this:

and
bra
cam
dom
emi
mar
smi
...

The goal of the list is to have enough strings to identify files that
may contain the names of people. Missing a name in a file is unacceptable.

The constraint that you have been given (no false negatives) is utterly
unrealistic. Given that constraint, forget the 3-letter substring
approach. There are many two-letter names. I have seen a genuine
instance of a one-letter surname ("O"). In jurisdictions which don't
disallow it, people can change their name to a string of digits. These
days you can't even rely on names starting with a capital letter ("i
think paris hilton is ...").

For example, the string 'mar' would get marc, mark, mary, maria... 'smi'
would get smith, smiley, smit, etc. False positives are OK (getting
common words instead of people's names is OK).

I may end up with a thousand or so of these 3 character strings.

If you get a large file of names and take every possible 3-letter
substring that you find, you would expect to get well over a thousand.
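To illustrate how quickly the set grows, here is a sketch that collects every 3-letter substring from a list of names. The seven names are just the examples mentioned in this thread, and `trigrams` is a hypothetical helper.

```python
def trigrams(name):
    """Return the set of all 3-character substrings of a name, lowercased."""
    s = name.lower()
    return {s[i:i + 3] for i in range(len(s) - 2)}

names = ["Marc", "Mark", "Mary", "Maria", "Smith", "Smiley", "Smit"]

all_trigrams = set()
for n in names:
    all_trigrams |= trigrams(n)

print(sorted(all_trigrams))
print(len(all_trigrams))
print(trigrams("Li"))  # empty set: names shorter than 3 letters yield nothing
```

Seven names sharing two prefixes already produce a dozen distinct substrings, and names shorter than three letters contribute none at all, so they can never be found this way.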

Is that
too much for an re.compile to handle?

Suck it and see. I'd guess that re.compile("mar|smi|jon|bro|wil....) is
*NOT* the way to go.
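One possible alternative to a giant alternation, offered only as a sketch and not as what was meant above: keep the strings in a set and slide a 3-character window over the text. The function name and sample data are hypothetical.

```python
def contains_any_trigram(text, wanted):
    """Return True if any 3-character window of text is in the wanted set."""
    s = text.lower()
    return any(s[i:i + 3] in wanted for i in range(len(s) - 2))

wanted = {"mar", "smi", "jon", "bro", "wil"}

print(contains_any_trigram("Invoice for J. Brown", wanted))  # True ('bro')
print(contains_any_trigram("No hit here", wanted))           # False
```

Each window costs one set lookup, so the running time does not depend on how many strings are in the list.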

Also, is this a bad way to
approach this problem?

Yes. At the very least I'd suggest that you need to break up your file
into "words" and then consider whether each word is part of a "name".
Much depends on context, if you want to cut down on false positives --
"we went 2 paris n staid at the hilton", "the bill from the smith was
too high".
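A minimal sketch of the word-splitting step, assuming a "word" is simply a run of letters and a candidate is any word whose first three letters are on the list. It deliberately does nothing about context, which is the hard part.

```python
import re

# Runs of letters only (Unicode-aware in Python 3): no digits, no underscores.
word_re = re.compile(r"[^\W\d_]+")

prefixes = {"mar", "smi"}  # hypothetical sample

def candidate_words(text):
    """Return the words whose first three letters are a listed prefix."""
    return [w for w in word_re.findall(text) if w[:3].lower() in prefixes]

print(candidate_words("the bill from the smith was too high"))  # ['smith']
print(candidate_words("Mary went to the market"))               # ['Mary', 'market']
```

Both outputs contain a false positive ("smith" the tradesman, "market"), which shows why context matters if the goal is to cut them down.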

Any ideas for improvement are welcome!

1. Get the PHB to come up with a more realistic constraint.
2. http://en.wikipedia.org/wiki/Named_entity_recognition

HTH,
John

John Machin · May 21, 2007

Like Asian names? The names we encounter are spelled out in English...
like Xu, Zu, Li-Cheng, Matsumoto, Wantanabee, etc.

"spelled out in English"? "English" has nothing to do with it.

The first 3 are Chinese, spelled using the Pinyin system, which happens
to use "Roman" letters [including non-ASCII ü]. They may appear adorned
with tone marks [not ASCII] or tone digits. The 4th and 5th [which is
presumably intended to be "Watanabe"] are Japanese, using the Romaji
system, which ... you guess the rest
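If an ASCII-only filter has to cope with such names, one common workaround is to strip the diacritics before matching. This is a sketch under that assumption, with a hypothetical helper name; it loses information (for instance the difference between u and ü), so it only suits a filter that tolerates false positives.

```python
import unicodedata

def strip_marks(s):
    """Decompose characters (NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

# U+01DA is u with diaeresis and caron, as used in Pinyin with a tone mark.
print(strip_marks("L\u01da"))  # Lu
print(strip_marks("Xu"))       # Xu (plain ASCII is unchanged)
```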

Cheers,
John

John Machin · May 21, 2007

Seems to me the OP is looking for people-names inside file-contents, not
inside file-names.

[snip]

You will also get better results if you constrain the location of the
match, for instance, looking for file names that *start* with
someone's name, instead of just containing them somewhere.

YMevidentlyV
