Is Python Suitable for Large Find & Replace Operations?

rbt

Here's the scenario:

You have many hundreds of gigabytes of data... possibly even a terabyte or
two. Within this data, you have private, sensitive information (US
social security numbers) about your company's clients. Your company has
generated its own unique ID numbers to replace the social security numbers.

Now, management would like the IT guys to go through the old data and
replace as many SSNs with the new ID numbers as possible. You have a
tab-delimited text file that maps the SSNs to the new ID numbers. There are
500,000 of these number pairs. What is the most efficient way to
approach this? I have done small-scale find and replace programs before,
but the scale of this is larger than what I'm accustomed to.

Any suggestions on how to approach this are much appreciated.
 
Robert Kern

rbt said:
Here's the scenario:

You have many hundreds of gigabytes of data... possibly even a terabyte or
two. Within this data, you have private, sensitive information (US
social security numbers) about your company's clients. Your company has
generated its own unique ID numbers to replace the social security numbers.

Now, management would like the IT guys to go through the old data and
replace as many SSNs with the new ID numbers as possible. You have a
tab-delimited text file that maps the SSNs to the new ID numbers. There are
500,000 of these number pairs. What is the most efficient way to
approach this? I have done small-scale find and replace programs before,
but the scale of this is larger than what I'm accustomed to.

Any suggestions on how to approach this are much appreciated.

I'm just tossing this out. I don't really have much experience with the
algorithm, but when searching for multiple targets, an Aho-Corasick
automaton might be appropriate.

http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/
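
A rough sketch of the multi-pattern idea, using the pyahocorasick
package rather than the module above (the package choice and all the
names here are my assumptions, not tested against your data):

import ahocorasick  # the pyahocorasick package, assumed installed

def build_automaton(ssn_map):
    # One automaton finds all 500,000 SSNs in a single pass.
    a = ahocorasick.Automaton()
    for ssn, new_id in ssn_map.items():
        a.add_word(ssn, (ssn, new_id))
    a.make_automaton()
    return a

def find_hits(automaton, text):
    # iter() yields (index_of_last_char, value) for each match.
    for end, (ssn, new_id) in automaton.iter(text):
        yield end - len(ssn) + 1, ssn, new_id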

Good luck!

--
Robert Kern
(e-mail address removed)

"In the fields of hell where the grass grows high
Are the graves of dreams allowed to die."
-- Richard Harter
 
Roy Smith

rbt said:
You have many hundreds of gigabytes of data... possibly even a terabyte or
two. Within this data, you have private, sensitive information (US
social security numbers) about your company's clients. Your company has
generated its own unique ID numbers to replace the social security numbers.

Now, management would like the IT guys to go through the old data and
replace as many SSNs with the new ID numbers as possible.

The first thing I would do is a quick sanity check. Write a simple
Python script which copies its input to its output with minimal
processing. Something along the lines of:

import sys

for line in sys.stdin:        # iterate over input lines
    words = line.split()
    print(words)

Now, run this on a gig of your data on your system and time it.
Multiply the timing by 1000 (assuming a terabyte of real data), and
now you've got a rough lower bound on how long the real job will take.
If you find this will take weeks of processing time, you might want to
re-think the plan. Better to discover that now than a few weeks into
the project :)
You have a tab-delimited text file that maps the SSNs to the new ID
numbers. There are
500,000 of these number pairs. What is the most efficient way to
approach this? I have done small-scale find and replace programs before,
but the scale of this is larger than what I'm accustomed to.

The obvious first thing is to read the SSN/ID map file and build a
dictionary where the keys are the SSN's and the values are the ID's.

The next step depends on your data. Is it in text files? An SQL
database? Some other format? One way or another you'll have to read
it in, do something along the lines of "newSSN = ssnMap[oldSSN]", and
write it back out.
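
A minimal sketch of that first step (the file name and exact layout are
assumptions based on the description of the map file):

ssn_map = {}
with open('ssn_to_id.txt') as f:   # hypothetical name for the map file
    for line in f:
        ssn, new_id = line.rstrip('\n').split('\t')
        ssn_map[ssn] = new_id

Half a million pairs is a small dict by modern standards.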
 
Paul Rubin

rbt said:
Now, management would like the IT guys to go through the old data and
replace as many SSNs with the new ID numbers as possible. You have a
tab-delimited text file that maps the SSNs to the new ID numbers. There
are 500,000 of these number pairs. What is the most efficient way to
approach this? I have done small-scale find and replace programs
before, but the scale of this is larger than what I'm accustomed to.

Just use an ordinary Python dict for the map, on a system with enough
RAM (maybe 100 bytes or so for each pair, so the map would be 50 MB).
Then it's simply a matter of scanning through the input files to find
SSNs and look them up in the dict.
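
A minimal sketch of that scan, assuming dashed SSNs and a prebuilt
ssn_map dict (both assumptions on my part):

import re

ssn_re = re.compile(r'\b\d{3}-\d{2}-\d{4}\b')

def replace_ssns(text, ssn_map):
    # Look each hit up in the dict; leave unmapped SSNs untouched.
    return ssn_re.sub(lambda m: ssn_map.get(m.group(0), m.group(0)), text)

re.sub with a callable replacement does the lookup per hit, so one pass
over each file is enough.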
 
Robert Kern

Robert said:
I'm just tossing this out. I don't really have much experience with the
algorithm, but when searching for multiple targets, an Aho-Corasick
automaton might be appropriate.

http://hkn.eecs.berkeley.edu/~dyoo/python/ahocorasick/

More puttering around the web suggests that pytst may be a faster/more
scalable implementation.

http://www.lehuen.com/nicolas/index.php/Pytst

Try them both. If you meet success with one of them, come back and tell
us about it.

--
Robert Kern
(e-mail address removed)

"In the fields of hell where the grass grows high
Are the graves of dreams allowed to die."
-- Richard Harter
 
Bruno Desthuilliers

rbt wrote:
Here's the scenario:

You have many hundreds of gigabytes of data... possibly even a terabyte or
two. Within this data, you have private, sensitive information (US
social security numbers) about your company's clients. Your company has
generated its own unique ID numbers to replace the social security numbers.

Now, management would like the IT guys to go through the old data and
replace as many SSNs with the new ID numbers as possible. You have a
tab-delimited text file that maps the SSNs to the new ID numbers. There are
500,000 of these number pairs. What is the most efficient way to
approach this? I have done small-scale find and replace programs before,
but the scale of this is larger than what I'm accustomed to.

Any suggestions on how to approach this are much appreciated.

I may say something stupid (that would not be the first time), but why
don't you just put it all in a good RDBMS and go with a few SQL queries?
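
For what it's worth, a sketch of what that could look like with the
sqlite3 module once the data and the map are loaded into tables (every
table and column name here is invented for illustration):

import sqlite3

conn = sqlite3.connect('company.db')   # hypothetical database
conn.execute("""
    UPDATE clients
    SET ssn = (SELECT new_id FROM ssn_map
               WHERE ssn_map.ssn = clients.ssn)
    WHERE ssn IN (SELECT ssn FROM ssn_map)
""")
conn.commit()

Getting terabytes of documents *into* tables is the hard part, of course.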
 
John Machin

rbt said:
Here's the scenario:

You have many hundreds of gigabytes of data... possibly even a terabyte or
two. Within this data, you have private, sensitive information (US
social security numbers) about your company's clients. Your company has
generated its own unique ID numbers to replace the social security numbers.

Now, management would like the IT guys to go through the old data and
replace as many SSNs with the new ID numbers as possible.

This question is grossly OT; it's nothing at all to do with Python.
However ....

(0) Is this homework?

(1) What to do with an SSN that's not in the map?

(2) How will a user of the system tell the difference between "new ID
numbers" and SSNs?

(3) Has the company really been using the SSN as a customer ID instead
of an account number, or have they been merely recording the SSN as a
data item? Will the "new ID numbers" be used in communication with the
clients? Will they be advised of the new numbers? How will you handle
the inevitable cases where the advice doesn't get through?

(4) Under what circumstances will it not be possible to replace *ALL*
the SSNs?

(5) For how long can the data be off-line while it's being transformed?

You have a tab-delimited text file that maps the SSNs to the new ID
numbers. There are
500,000 of these number pairs.

And what is the source of the SSNs in this file??? Have they been
extracted from the data? How?
What is the most efficient way to
approach this? I have done small-scale find and replace programs before,
but the scale of this is larger than what I'm accustomed to.

Any suggestions on how to approach this are much appreciated.

A sensible answer will depend on how the data is structured:

1. If it's in a database with tables some of which have a column for
SSN, then there's a solution involving SQL.

2. If it's in semi-free-text files where the SSNs are marked somehow:

---client header---
surname: Doe first: John initial: Q SSN:123456789 blah blah
or
<ssn>123456789</ssn>

then there's another solution which involves finding the markers ...

3. If it's really free text, like
"""
File note: Today John Q. Doe telephoned to advise that his Social
Security # is 123456789 not 987654321 (which is his wife's) and the soc
sec numbers of his kids Bob & Carol are ....
"""
then you might be in some difficulty ... google("TREC")


AND however you do it, you need to be very aware of the possibility
(especially with really free text) of changing some string of digits
that's NOT an SSN.

HTH,
John
 
Brian

Hi Rbt,

To give an example of processing a lot of data, I used Python to read
and process every word in a single text file that contained the entire
King James Bible. It processed the whole thing in about one second --
split the words, etc. It worked quite well.

Hope this helps,

Brian
 
Steve Holden

Brian said:
Hi Rbt,

To give an example of processing a lot of data, I used Python to read
and process every word in a single text file that contained the entire
King James Bible. It processed the whole thing in about one second --
split the words, etc. It worked quite well.

Hope this helps,
Hardly "big" by the OP's standards at only 4.5 MB (and roughly 790,000
words, if anyone's interested).

The problem with processing terabytes is frequently the need to
accommodate the data with techniques that keep the virtual memory from
becoming swamped. The last thing you need to do under such
circumstances is create an in-memory copy of the entire data set, since
on most real computers this will inflict swapping behavior.

regards
Steve
 
Gilles Lenfant

rbt wrote:
Here's the scenario:

You have many hundreds of gigabytes of data... possibly even a terabyte or
two. Within this data, you have private, sensitive information (US
social security numbers) about your company's clients. Your company has
generated its own unique ID numbers to replace the social security numbers.

Now, management would like the IT guys to go through the old data and
replace as many SSNs with the new ID numbers as possible. You have a
tab-delimited text file that maps the SSNs to the new ID numbers. There are
500,000 of these number pairs. What is the most efficient way to
approach this? I have done small-scale find and replace programs before,
but the scale of this is larger than what I'm accustomed to.

Any suggestions on how to approach this are much appreciated.

Is this huge amount of data to search/replace stored in an RDBMS or in
flat files with markup (XML, CSV, ...)?
 
rbt

Gilles Lenfant wrote:

Is this huge amount of data to search/replace stored in an RDBMS or in
flat files with markup (XML, CSV, ...)?

I apologize that it has taken me so long to respond. I had an HDD crash
which I am in the process of recovering from. If emails to
(e-mail address removed) bounced, that is the reason why.


The data is in files, mostly Word documents and Excel spreadsheets. The
SSN map I have is a plain text file that has a format like this:

ssn-xx-xxxx new-id-xxxx
ssn-xx-xxxx new-id-xxxx
etc.

There are a bit more than 500K of these pairs.

Thank you,
rbt
 
rbt

This question is grossly OT; it's nothing at all to do with Python.
However ....

(0) Is this homework?

No, it is not.
(1) What to do with an SSN that's not in the map?

Leave it be.
(2) How will a user of the system tell the difference between "new ID
numbers" and SSNs?

We have documentation. The two sets of numbers (SSNs and new_ids) are
exclusive of each other.
(3) Has the company really been using the SSN as a customer ID instead
of an account number, or have they been merely recording the SSN as a
data item? Will the "new ID numbers" be used in communication with the
clients? Will they be advised of the new numbers? How will you handle
the inevitable cases where the advice doesn't get through?

My task is purely technical.
(4) Under what circumstances will it not be possible to replace *ALL*
the SSNs?

I do not understand this question.
(5) For how long can the data be off-line while it's being transformed?

The data is on file servers that are unused on weekends and nights.
And what is the source of the SSNs in this file??? Have they been
extracted from the data? How?

That is irrelevant.
A sensible answer will depend on how the data is structured:

1. If it's in a database with tables some of which have a column for
SSN, then there's a solution involving SQL.

2. If it's in semi-free-text files where the SSNs are marked somehow:

---client header---
surname: Doe first: John initial: Q SSN:123456789 blah blah
or
<ssn>123456789</ssn>

then there's another solution which involves finding the markers ...

3. If it's really free text, like
"""
File note: Today John Q. Doe telephoned to advise that his Social
Security # is 123456789 not 987654321 (which is his wife's) and the soc
sec numbers of his kids Bob & Carol are ....
"""
then you might be in some difficulty ... google("TREC")


AND however you do it, you need to be very aware of the possibility
(especially with really free text) of changing some string of digits
that's NOT an SSN.

That's possible, but I think it improbable.
 
John Machin

rbt said:
No, it is not.

Leave it be.

We have documentation. The two sets of numbers (SSNs and new_ids) are
exclusive of each other.

My task is purely technical.

OK then, let's ignore the fact that the data is in a collection of Word
& Excel files, and let's ignore the scale for the moment. Let's assume
there are only 100 very plain text files to process, and only 1000 SSNs
in your map, so it doesn't have to be very efficient.

Can you please write a few lines of Python that would define your task
-- assume you have a line of text from an input file, show how you would
determine that it needed to be changed, and how you would change it.
I do not understand this question.

Can you determine from the data, without reference to the map, that a
particular string of characters is an SSN?

If so, and it is not in the map, why can it not be *added* to the map
with a new generated ID?
The data is on file servers that are unused on weekends and nights.

That is irrelevant.

Quite the contrary. If they had been extracted from the data, it would
mean that someone actually might have a vague clue about how those SSNs
were represented in the data and might be able to tell you and you might
be able to tell us so that we could help you.
That's possible, but I think it improbable.

You "think not probably" -- based on what? pious hope? what the customer
has told you? inspection of the data?
 
rbt

OK then, let's ignore the fact that the data is in a collection of Word
& Excel files, and let's ignore the scale for the moment. Let's assume
there are only 100 very plain text files to process, and only 1000 SSNs
in your map, so it doesn't have to be very efficient.

Can you please write a few lines of Python that would define your task
-- assume you have a line of text from an input file, show how you would
determine that it needed to be changed, and how you would change it.

The script is too long to post in its entirety. In short, I open the
files, do a binary read (in 1MB chunks, for ease of memory usage), place
each chunk into a variable and that in turn into a list, to which I then
apply the following re:

ss = re.compile(r'\b\d{3}-\d{2}-\d{4}\b')

like this:

for chunk in whole_file:
    search = ss.findall(chunk)
    if search:
        validate(search)

The validate function makes sure each string found is indeed in the
range of legitimate SSNs. You may read about this range here:

http://www.ssa.gov/history/ssn/geocard.html
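
A rough sketch of such a check, using only the coarse rules I am sure
of (areas 000, 666, and 900-up never issued; group 00 and serial 0000
invalid; the geographic ranges on that page would tighten it further):

def plausible_ssn(ssn):
    area, group, serial = ssn.split('-')
    if area in ('000', '666') or area >= '900':
        return False
    if group == '00' or serial == '0000':
        return False
    return True

The lexical comparison area >= '900' is safe because area is always
exactly three digits.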

That is as far as I have gotten. And I hope you can tell that I have
placed some small amount of thought into the matter. I've tested the
find a lot and it is rather accurate in finding SSNs in files. I have
not yet started replacing anything. I've only posted here for advice
before beginning.
Can you determine from the data, without reference to the map, that a
particular string of characters is an SSN?

See above.
If so, and it is not in the map, why can it not be *added* to the map
with a new generated ID?

It is not my responsibility to do this. I do not have that authority
within the organization. Have you never worked for a real-world business
and dealt with office politics and territory ;)
Quite the contrary. If they had been extracted from the data,

They have not. They are generated by a higher authority and then used by
lower authorities such as me.

Again though, I think this is irrelevant to the task at hand... I have a
map, I have access to the data and that is all I need to have, no? I do
appreciate your input though. If you would like to have a further
exchange of ideas, perhaps we should do so off list?
 
Peter Hansen

rbt said:
The script is too long to post in its entirety. In short, I open the
files, do a binary read (in 1MB chunks, for ease of memory usage), place
each chunk into a variable and that in turn into a list, to which I then
apply the following re:

ss = re.compile(r'\b\d{3}-\d{2}-\d{4}\b')

like this:

for chunk in whole_file:
    search = ss.findall(chunk)
    if search:
        validate(search)

This seems so obvious that I hesitate to ask, but is the above really a
simplification of the real code, which actually handles the case of SSNs
that lie over the boundary between chunks? In other words, what happens
if the first chunk has only the first four digits of the SSN, and the
rest lies in the second chunk?
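
One way to close that hole, sketched (the 16-byte overlap and all names
are my own choices, untested):

import re

SSN_RE = re.compile(rb'\b\d{3}-\d{2}-\d{4}\b')
CHUNK_SIZE = 1 << 20   # 1MB reads, as in your script
OVERLAP = 16           # longer than one dashed SSN (11 bytes)

def find_ssns(path):
    # Yields (absolute_offset, ssn_bytes), surviving chunk boundaries.
    offset = 0
    data = b''
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            data += chunk
            # Don't report matches that unread bytes could still affect.
            limit = len(data) if not chunk else len(data) - OVERLAP
            for m in SSN_RE.finditer(data):
                if m.start() >= limit:
                    break
                yield offset + m.start(), m.group(0)
            if not chunk:
                return
            offset += limit
            data = data[limit:]   # carry the tail into the next window

(The \b at the start of a carried tail is still slightly fuzzy; carrying
one extra guard byte would fix that.)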

-Peter
 
rbt

This seems so obvious that I hesitate to ask, but is the above really a
simplification of the real code, which actually handles the case of SSNs
that lie over the boundary between chunks? In other words, what happens
if the first chunk has only the first four digits of the SSN, and the
rest lies in the second chunk?

-Peter

No, that's a good question. As of now, there is nothing to handle the
scenario that you bring up. I have considered this possibility (rare but
possible). I have not written a solution for it. It's a very good point
though.

Is that not why proper software is engineered? Anyone can build a go-kart
(write a program), but it takes a team of engineers and much testing to
build a car, no? Which would you rather be riding in during a crash? I
wish upper mgt had a better understanding of this ;)
 
John Machin

*ONE* MB? What are the higher-ups constraining you to run this on? My
reference points: (1) An off-the-shelf office desktop PC from
HP/Dell/whoever with 512MB of memory is standard fare. Motherboards with
4 memory slots (supporting up to 4GB) are common. (2) Reading files in
1MB chunks seemed appropriate when I was running an 8MB 486 under Win 3.11.

Are you able to run your script on the network server itself? And what
is the server setup? Windows/Novell/Samba/??? What version of Python are
you running? Have you considered mmap?

In the following code, which is "variable" (chunk?) and which is the
"list" to which you apply the re? The only list I see is the confusingly
named "search" which is the *result* of applying re.findall.

Users never use spaces or other punctuation (or no punctuation) instead
of dashes?

False positives? E.g. """
Dear X, Pls rush us qty 20 heart pacemaker alarm trigger circuits, part
number 123-456-78-9012, as discussed.
"""

You will need the offset to update, so better get used to using
something like finditer now. You may well find that "validate" may want
to check some context on either side, even if it's just one character --
\b may turn out to be inappropriate.
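
Something along these lines, perhaps (the context test and all names
are invented for illustration):

import re

SSN_RE = re.compile(r'\d{3}-\d{2}-\d{4}')   # no \b; check context by hand

def plausible_hits(text):
    for m in SSN_RE.finditer(text):
        before = text[m.start() - 1 : m.start()]
        after = text[m.end() : m.end() + 1]
        # Reject hits embedded in longer digit/dash runs, e.g. part
        # numbers like 123-456-78-9012.
        if before and before in '0123456789-':
            continue
        if after and after in '0123456789-':
            continue
        yield m.start(), m.group(0)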
No, that's a good question. As of now, there is nothing to handle the
scenario that you bring up. I have considered this possibility (rare but
possible). I have not written a solution for it. It's a very good point
though.

How rare? What is the average file size? What is the average
frequency of SSNs per file? What is the maximum file size (you did
mention mostly MS Word and Excel, so they can't be very big)?

N.B. Using mmap just blows this problem away. Using bigger memory chunks
diminishes it.
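
A sketch of the mmap route (Python's re module can search an mmap
directly, since mmaps support the buffer interface; the in-place write
works only because the replacement is exactly the same length -- check
that assumption against your map first):

import mmap, re

SSN_RE = re.compile(rb'\b\d{3}-\d{2}-\d{4}\b')

def replace_in_file(path, ssn_map):   # ssn_map: bytes -> same-length bytes
    with open(path, 'r+b') as f:
        mm = mmap.mmap(f.fileno(), 0)
        spans = [(m.start(), m.group(0)) for m in SSN_RE.finditer(mm)]
        for start, ssn in spans:
            new = ssn_map.get(ssn)
            if new is not None and len(new) == len(ssn):
                mm[start:start + len(ssn)] = new   # in place, no copy
        mm.flush()
        mm.close()

No chunk boundaries to worry about; the OS pages the file in and out.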

And what percentage of files contain no SSNs at all?

Are you updating in-situ, or must your script copy the files, changing
SSNs as it goes? Do you have a choice between the two methods?

Are you allowed to constrain the list of input files so that you don't
update binaries, e.g. .exe files?
Is that not why proper software is engineered? Anyone can build a go-kart
(write a program), but it takes a team of engineers and much testing to
build a car, no? Which would you rather be riding in during a crash? I
wish upper mgt had a better understanding of this ;)

Don't we all so wish? Meanwhile, back in the real world, ...
 
