Finding People's Names in Files

brad

Crazy question, but has anyone attempted this or seen Python code that
does? For example, if a text file contained 'Guido' and/or 'Robert'
and/or 'Susan', then we should return True, otherwise return False.
 
cokofreedom

Crazy question, but has anyone attempted this or seen Python code that
does? For example, if a text file contained 'Guido' and/or 'Robert'
and/or 'Susan', then we should return True, otherwise return False.

Can't you just use the re module's findall() function?
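
A minimal sketch of that idea, assuming the names are known in advance. Note that findall() is provided by the re module rather than by str objects. The names KNOWN_NAMES and contains_known_name are only illustrative and do not come from the thread.

```python
import re

# Hypothetical list of names known in advance
KNOWN_NAMES = ["Guido", "Robert", "Susan"]

# \b keeps 'Susan' from matching inside a longer word such as 'Susanna'
NAME_PATTERN = re.compile(
    r"\b(?:" + "|".join(map(re.escape, KNOWN_NAMES)) + r")\b"
)

def contains_known_name(text):
    """Return True if any of the known names occurs as a whole word."""
    return bool(NAME_PATTERN.findall(text))
```

This only answers the known-names version of the question; it says nothing about names that are not in the list.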
 
Tim Williams

Crazy question, but has anyone attempted this or seen Python code that
does? For example, if a text file contained 'Guido' and/or 'Robert'
and/or 'Susan', then we should return True, otherwise return False.


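# fname is the path of the text file to scan
Text = open(fname).read()

def a_function():
    # Return True as soon as any of the known names appears in the text
    for Name in ['Guido', 'Robert', 'Susan']:
        if Name in Text:
            return True
    return False

if a_function():
    print("A name was found")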

:)
 
Diez B. Roggisch

brad said:
I mean *any* possible person's name... I don't *know* the names
beforehand :)

Erm, now what's a person's name then? Maybe there is a Viagra Cialis living
in ... Caracas. Or so. Or a Woodshed Ribcage in Oregon... who knows?

And what about variable names - is peters_temp a name? It's amazing what lack
of creativity in variable names can result in...

So - unless you come up with a positive list of what you consider a name,
you're pretty much out of luck.

diez
 
cokofreedom

I mean *any* possible person's name... I don't *know* the names
beforehand :)

Well, for something like that you could loop over the words in the file
and test each one with .istitle(), but that will also catch the first
word after a full stop and other random entries.

Otherwise, you could get a HUGE list of all possible names, ideally in
dictionary format, and check the words in the file against it using "in"...

However... how can you know it is a name...
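
The two suggestions above can be combined in a rough sketch: keep only title-cased words, then check them against a name list. KNOWN_NAMES is a made-up, tiny stand-in for the "HUGE list", and a set is used here because it gives the same fast "in" lookup as a dictionary.

```python
import re

# Hypothetical, tiny stand-in for a large list of first names
KNOWN_NAMES = {"guido", "robert", "susan"}

def candidate_names(text):
    """Return title-cased words that also appear in the name set."""
    words = re.findall(r"[A-Za-z]+", text)
    return [w for w in words if w.istitle() and w.lower() in KNOWN_NAMES]
```

Requiring both conditions filters out sentence-initial words such as "The", but it still cannot tell whether a listed word is being used as a name in context.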
 
Francesco Guerrieri

I mean *any* possible person's name... I don't *know* the names
beforehand :)


"I cannot combine some characters

dhcmrlchtdj

which the divine Library has not foreseen and which in one of
its secret tongues do not contain a terrible meaning. No one can
articulate a syllable which is not filled with tenderness and fear,
which is not, in one of these languages, the powerful name of a god."

Jorge Luis Borges, The Library of Babel
 
Dan Stromberg

Crazy question, but has anyone attempted this or seen Python code that
does? For example, if a text file contained 'Guido' and/or 'Robert'
and/or 'Susan', then we should return True, otherwise return False.

It'll be hard to handle the Dweezils and Moon Units of the world (I
believe these are Frank Zappa's kids?), but you could compile a list of
reasonably common names by gaining access to a usenet news spool, and
pulling the names from the headers.

But then this is starting to sound dangerously like a spam campaign - in
which case, "Please don't!".
 
brad

However...how can you know it is a name...

OK, I admitted in my first post that it was a crazy question, but if one
could find an answer, one would be onto something. Maybe it's not a 100%
answerable question, but I would guess that it is an 80% answerable
question... I just don't know how... yet :)

Besides admitting that it's a crazy question, I should stop and explain
how it would be useful to me at least. Is a credit card number itself
valuable? I would think not. One can easily re and luhn check for credit
card numbers located in files with a great degree of accuracy, but a
number without a name is not very useful to me. So, if one could
associate names to luhn checked numbers automatically, then one would be
onto something. Or at least say, "hey, this file has luhn validated CCs
*AND* it seems to have people's names in it as well." Now then, I'd have
less to review or perhaps as much as I have now, but I could push the
files with numbers and names to the top of the list so that they would
be reviewed first.

Brad
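
For reference, the regex-plus-Luhn check mentioned above can be sketched roughly as follows. This is not the poster's actual code: the pattern (13 to 16 digits with optional single space or hyphen separators) and the function names are assumptions made for illustration.

```python
import re

# Assumed shape: 13 to 16 digits, optionally separated by single spaces or hyphens
CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")

def luhn_ok(digits):
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:       # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9       # same as adding the two digits of the product
        total += d
    return total % 10 == 0

def find_card_numbers(text):
    """Return digit strings in text that look like card numbers and pass Luhn."""
    hits = []
    for match in CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_ok(digits):
            hits.append(digits)
    return hits
```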
 
Matimus

OK, I admitted in my first post that it was a crazy question, but if one
could find an answer, one would be onto something. Maybe it's not a 100%
answerable question, but I would guess that it is an 80% answerable
question... I just don't know how... yet :)

Besides admitting that it's a crazy question, I should stop and explain
how it would be useful to me at least. Is a credit card number itself
valuable? I would think not. One can easily re and luhn check for credit
card numbers located in files with a great degree of accuracy, but a
number without a name is not very useful to me. So, if one could
associate names to luhn checked numbers automatically, then one would be
onto something. Or at least say, "hey, this file has luhn validated CCs
*AND* it seems to have people's names in it as well." Now then, I'd have
less to review or perhaps as much as I have now, but I could push the
files with numbers and names to the top of the list so that they would
be reviewed first.

Brad

What the hell are you doing? Your post sounds to me like you have a
huge amount of stolen, or at the very least misappropriated, data. Now
you want to search it for credit card numbers and names so that you
can use them.

I am not cool with this! This is a public forum about a programming
language. What makes you think that anybody in this forum will be cool
with that? Perhaps you aren't doing anything illegal, but it sure is
coming off that way. If you are doing something illegal I hope you get
caught.

At the very least, you might want to clarify why you are looking for
such capability so that you don't get effectively black-listed (well,
by me at least).

Matt
 
byte8bits

What the hell are you doing? Your post sounds to me like you have a
huge amount of stolen, or at the very least misappropriated, data. Now
you want to search it for credit card numbers and names so that you
can use them.

I am not cool with this! This is a public forum about a programming
language. What makes you think that anybody in this forum will be cool
with that? Perhaps you aren't doing anything illegal, but it sure is
coming off that way. If you are doing something illegal I hope you get
caught.

At the very least, you might want to clarify why you are looking for
such capability so that you don't get effectively black-listed (well,
by me at least).

Matt

Go have a beer and calm down a bit :) It's a legitimate purpose,
although it could be (and probably is being) used by bad guys right now.
My intent, as you can see from the links below, is to catch it before
the bad guys do.

http://filebox.vt.edu/users/rtilley/public/find_ccns/
http://filebox.vt.edu/users/rtilley/public/find_ssns/

Brad
 
Matimus

Go have a beer and calm down a bit :) It's a legitimate purpose,
although it could be (and probably is being) used by bad guys right now.
My intent, as you can see from the links below, is to catch it before
the bad guys do.

http://filebox.vt.edu/users/rtilley...ilebox.vt.edu/users/rtilley/public/find_ssns/

Brad

It's just past 10:00 am where I am... I know customs vary, but
generally beer before lunch is frowned upon :). I know the tone of
posts does not carry well over the web, but I was really just trying
to point out that your previous post sounded very shady, and at the
very least some clarification was in order. I wasn't standing on my
desk frothing at the mouth or anything.

On to my suggestion. I think you are going to have to use statistical
analysis. That is, you won't get something that reliably returns a
boolean, but maybe something that says there is a 75% chance that
there are names in a given file. You can't know that a given string is
or isn't a name, you can only know that it is probably a name based
upon how often it is used in that context. Either way this isn't a
simple problem to solve, and it probably involves creating a database
of words that shows what percentage of the time they are used as
names. How such a database is created... that is the hard part. There
may be tools out there for such analysis, but that isn't an area I
have any experience in.

Matt
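
A toy sketch of the statistical idea above: instead of a boolean, return an estimated chance that a file contains at least one name. The NAME_RATE table and its values are invented for illustration, and treating each capitalized word as independent evidence is a simplifying assumption, not something proposed in the thread.

```python
import re

# Hypothetical table: fraction of occurrences in which a word is used as a
# person's name. The values are made up for illustration.
NAME_RATE = {"guido": 0.95, "susan": 0.9, "mark": 0.4, "will": 0.05}

def chance_of_names(text):
    """Estimate the chance that text contains at least one name.

    Each capitalized word is treated as independent evidence, which is a
    simplifying assumption.
    """
    p_no_name = 1.0
    for word in re.findall(r"\b[A-Z][a-z]+\b", text):
        p_no_name *= 1.0 - NAME_RATE.get(word.lower(), 0.0)
    return 1.0 - p_no_name
```

Building a realistic rate table is the hard part mentioned above; this sketch only shows how such a table might be consumed.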
 
John J. Lee

brad said:
Crazy question, but has anyone attempted this or seen Python code that
does? For example, if a text file contained 'Guido' and/or 'Robert'
and/or 'Susan', then we should return True, otherwise return False.

A few ideas:

1. If you don't have a list of names, find a list of words that
doesn't contain proper nouns (there are a few word lists out there,
not sure if any exclude people's names, though). Look for short runs
of two or three "words" (punctuation-separated tokens) in the email
that aren't in the dictionary. Some of them will be people's names.

2. Send the text through Google translate and look for runs of words
that are unchanged. Some of them will be people's names.

3. Search the literature and look for fancy algorithms. Here are some
papers (the last mentions some commercial software to do this):

http://citeseer.ist.psu.edu/bikel99algorithm.html

http://citeseer.ist.psu.edu/618945.html

http://arxiv.org/html/cmp-lg/9706017


John
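
Idea 1 above can be sketched as follows, assuming a word list without proper nouns is available. COMMON_WORDS is a made-up, tiny stand-in for such a list, and the tokenizer simply splits on non-letters, which is cruder than the punctuation-separated tokens described.

```python
import re

# Hypothetical, tiny stand-in for a word list without proper nouns
COMMON_WORDS = {"the", "report", "was", "sent", "to", "by", "on", "monday"}

def unknown_runs(text, min_len=2, max_len=3):
    """Return runs of min_len..max_len consecutive tokens not in the word list."""
    runs, current = [], []
    for token in re.findall(r"[A-Za-z']+", text):
        if token.lower() in COMMON_WORDS:
            # A known word ends the current run; keep it if the length fits
            if min_len <= len(current) <= max_len:
                runs.append(" ".join(current))
            current = []
        else:
            current.append(token)
    if min_len <= len(current) <= max_len:  # flush a run that ends the text
        runs.append(" ".join(current))
    return runs
```

As the post says, only some of the returned runs will be people's names; misspellings and jargon will show up too.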
 
Chris Mellon

Go have a beer and calm down a bit :) It's a legitimate purpose,
although it could be (and probably is being) used by bad guys right now.
My intent, as you can see from the links below, is to catch it before
the bad guys do.

http://filebox.vt.edu/users/rtilley/public/find_ccns/
http://filebox.vt.edu/users/rtilley/public/find_ssns/

Brad

In case you're doing this for PCI validation, be aware that just the
CC number is considered sensitive and you'd get some false negatives
if you filter on anything except that.

Random strings that match CC checksums are really quite rare and false
positives from that alone are unlikely to be a problem. Unless I
deployed this and there was a significant false positive rate I
wouldn't risk the false negatives, personally.
 
brad

Chris said:
In case you're doing this for PCI validation, be aware that just the
CC number is considered sensitive and you'd get some false negatives
if you filter on anything except that.

Random strings that match CC checksums are really quite rare and false
positives from that alone are unlikely to be a problem. Unless I
deployed this and there was a significant false positive rate I
wouldn't risk the false negatives, personally.

Yes, it is for PCI. Our rate of false positives is low, very low. I
wasn't aware that a number alone was a PCI violation. Thank you! On
another note, we're a university (Virginia Tech) and we're subject to
FERPA, HIPAA, GLBA, etc... in addition to PCI. So we do these checks for
U.S. Social Security Numbers too in an effort to prevent or lessen the
chance of ID theft. Unfortunately, there is no Luhn check for SSNs. We
follow the Social Security Administration verification guideline
religiously... here's a web front-end to my logic:

http://black.cirt.vt.edu/public/valid_ssn/index.html

but we still have many false positives on SSNs, so being able to ID *names
and numbers* in files would still be a benefit to us.

Brad
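
Since SSNs have no checksum, validation is limited to structural rules. The sketch below applies only the basic ones (area not 000, 666, or 900-999; group not 00; serial not 0000) and assumes the hyphenated NNN-NN-NNNN form. It is not the logic behind the web front-end linked above, which is not shown in the thread and may check more.

```python
import re

# Assumed format: NNN-NN-NNNN with hyphens
SSN_PATTERN = re.compile(r"\b(\d{3})-(\d{2})-(\d{4})\b")

def plausible_ssns(text):
    """Return SSN-shaped strings that pass basic structural checks."""
    hits = []
    for match in SSN_PATTERN.finditer(text):
        area, group, serial = match.groups()
        if area in ("000", "666") or area >= "900":
            continue  # area numbers that are never issued
        if group == "00" or serial == "0000":
            continue  # an all-zero group or serial is never issued
        hits.append(match.group())
    return hits
```

Because so many nine-digit strings pass these rules, the false positive problem described above remains.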
 
Chris Mellon

Yes, it is for PCI. Our rate of false positives is low, very low. I
wasn't aware that a number alone was a PCI violation. Thank you! On
another note, we're a university (Virginia Tech) and we're subject to
FERPA, HIPAA, GLBA, etc... in addition to PCI. So we do these checks for
U.S. Social Security Numbers too in an effort to prevent or lessen the
chance of ID theft. Unfortunately, there is no Luhn check for SSNs. We
follow the Social Security Administration verification guideline
religiously... here's a web front-end to my logic:

http://black.cirt.vt.edu/public/valid_ssn/index.html

but we still have many false positives on SSNs, so being able to ID *names
and numbers* in files would still be a benefit to us.

Brad

Defining the problem as "given a word, figure out if that word is
likely to be a name", it seems the simplest solution is to get a
corpus of names and then flag them based on edit distance from words
in the name list. Maybe soundex? You're going to need a *massive*
corpus though, and that might be a problem if you distribute this for
people to run instead of doing it centrally.
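
A rough sketch of flagging words that are close to a name corpus. It uses the standard library's difflib.get_close_matches, which ranks by a similarity ratio rather than true edit distance, so it is a stand-in for the approach described. NAME_CORPUS, closest_name, and the 0.8 cutoff are assumptions for illustration.

```python
import difflib

# Hypothetical, tiny stand-in for a large corpus of names
NAME_CORPUS = ["guido", "robert", "susan", "francesco"]

def closest_name(word, cutoff=0.8):
    """Return the corpus name most similar to word, or None if none is close."""
    matches = difflib.get_close_matches(
        word.lower(), NAME_CORPUS, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None
```

With a realistically large corpus this linear scan would be slow, which is part of the distribution problem raised above.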

As a totally off the wall speculation, you might be able to train a
neural net against a large enough corpus (Say, your student and
faculty member databases) and end up with something that can match a
name algorithmically without needing the table. This is a really hard
problem - maybe you can get your CompSci department to make it part of
someone's thesis ;)

Once you've got a way to tell if a word might be a name, and a way to
tell if another word is likely to be a SSN, you just need to match up
hits within the same document, use some sort of distance filter, and
then you'll be "done".
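
The matching step might look something like this sketch, which pairs name hits with SSN-shaped hits that start within a fixed number of characters of each other. The name detector here is just a lookup in a made-up KNOWN_NAMES set, and the 40-character max_gap is an arbitrary assumption.

```python
import re

SSN_SHAPE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# Hypothetical stand-in for whatever name detector is actually used
KNOWN_NAMES = {"guido", "robert", "susan"}

def names_near_numbers(text, max_gap=40):
    """Return (name, number) pairs whose start offsets are within max_gap characters."""
    names = [(m.start(), m.group())
             for m in re.finditer(r"[A-Za-z]+", text)
             if m.group().lower() in KNOWN_NAMES]
    numbers = [(m.start(), m.group()) for m in SSN_SHAPE.finditer(text)]
    return [(name, number)
            for name_pos, name in names
            for num_pos, number in numbers
            if abs(name_pos - num_pos) <= max_gap]
```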

I assume this is intended primarily to catch files that people are
storing accidentally, rather than catching intentional identity theft
in action. It'd be trivial to hide from this sort of scan if you
were actively malicious.
 
