Best way to replace a set of strings in large files?

Ryan Chan

Hello,

Consider the case:

You have 200 lines of mapping to replace, in a csv format, e.g.

apple,orange
boy,girl
....

You have a 500MB file, you want to replace all 200 lines of mapping,
what would be the most efficient way to do it?

Thanks.
 
cvhLE

If you want to replace the whole line, or you know the column where the
replacement is needed and the line has clear separators, it may be a
lot faster to do it with awk:

cat csv | awk -F"," '$2~/apple/ {$2="orange"; print $1,$2}' ...

Otherwise I don't see a reason not to use the most obvious way:
start at line 1 and run until the end, especially if you don't
know *where* the 200 lines are.

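#!/usr/bin/perl
use strict;
use warnings;
# Map each search string to its replacement.
my %replace = ('apple' => 'orange', 'boy' => 'girl');
# One alternation of all keys: longest first, metacharacters escaped.
my $r = join '|', map { quotemeta($_) } sort { length($b) <=> length($a) } keys %replace;
$r = qr/($r)/;
while (<>) {
    s/$r/$replace{$1}/g;
    print;
}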
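
Since the mapping in the question lives in a CSV file rather than in the script, here is a minimal sketch of loading it into the hash first. The helper names `load_map` and `build_regex` are made up for illustration, and it assumes a simple "from,to" format with no quoting or embedded commas. The demo runs on in-memory data so it can be tried as-is.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a lookup hash from "from,to" lines.
# Assumption: plain CSV, no quoted fields, no commas inside the search string.
sub load_map {
    my ($fh) = @_;
    my %map;
    while (my $line = <$fh>) {
        chomp $line;
        next unless length $line;
        my ($from, $to) = split /,/, $line, 2;
        next unless defined $to;    # skip lines without a comma
        $map{$from} = $to;
    }
    return \%map;
}

# Compile all keys into one alternation, longest key first,
# with regex metacharacters escaped.
sub build_regex {
    my ($map) = @_;
    my $alt = join '|',
        map  { quotemeta($_) }
        sort { length($b) <=> length($a) || $a cmp $b }
        keys %$map;
    return qr/($alt)/;
}

# Demo: the mapping and the text are sample data, not a real 500MB file.
my $csv = "apple,orange\nboy,girl\n";
open my $map_fh, '<', \$csv or die "cannot open in-memory mapping: $!";
my $map = load_map($map_fh);
close $map_fh;

my $re   = build_regex($map);
my $text = "a boy named sue sings a song for apple jack\n";
(my $out = $text) =~ s/$re/$map->{$1}/g;
print $out;
```

For a real run you would open the mapping file instead of the in-memory string and apply the substitution inside a `while (<>)` loop, so the large file is still read once, line by line.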




[08:07:43] cvh@lenny:~$ echo "a boy named sue sings a song for apple jack" | perl repl.pl
a girl named sue sings a song for orange jack
[08:07:45] cvh@lenny:~$ echo "a boy named sue sings a song for apple jack" > test.txt
[08:07:59] cvh@lenny:~$ perl repl.pl test.txt
a girl named sue sings a song for orange jack
[08:08:11] cvh@lenny:~$ perl repl.pl test.txt >test_replace.txt
[08:08:24] cvh@lenny:~$ cat test_replace.txt
a girl named sue sings a song for orange jack
[08:08:40] cvh@lenny:~$
 
sln

I would assume this process would take a long
time.

At a minimum, it would take

500,000,000
x
200
-----------------
100,000,000,000

100 billion character comparisons
if nothing ever matched.
If a word still doesn't match, but its first character
matched before backtracking:

100,000,000,000
x
2
----------------
200,000,000,000

brings the total up to 200 billion character
comparisons.

Since this is all a conservative estimate
I would average (conservatively) 4 comparison
characters per map per byte in the file and say

500,000,000
x
800
-----------------
400,000,000,000

400 billion comparisons.
Add to that the minutiae of backtracking, loading
buffers, writing to disk, and the underpinning layers
Perl has to go through to execute C code, and I would go out
for coffee or take a nap.
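
The estimates above can be reproduced with a few lines of Perl. This is only a sketch of the arithmetic under the naive model used here (every mapping tried at every byte), not a measurement of what Perl's regex engine actually does; the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $file_bytes = 500_000_000;   # 500MB file, from the question
my $mappings   = 200;           # number of mapping lines

# One comparison per mapping per byte, nothing ever matches.
my $no_match   = $file_bytes * $mappings;
# First character matches, second fails: two comparisons each.
my $first_char = $no_match * 2;
# Assumed average of 4 comparisons per mapping per byte.
my $average    = $file_bytes * $mappings * 4;

printf "%-22s %.0f\n", 'no match:',          $no_match;
printf "%-22s %.0f\n", 'first char matches:', $first_char;
printf "%-22s %.0f\n", 'average of 4:',       $average;
```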

-sln
 
