Best way to replace a set of strings in large files?

Ryan Chan · Dec 10, 2009

Hello,

Consider the case:

You have 200 lines of mapping to replace, in a csv format, e.g.

apple,orange
boy,girl
....

You have a 500MB file and want to apply all 200 mappings to it.
What would be the most efficient way to do it?

Thanks.

cvhLE · Dec 11, 2009

If you want to replace the whole line, or you know which column needs
replacing and the line has clear separators, it may be a lot faster
to do it with awk:

cat csv|awk -F"," '$2~/apple/ {$2="orange"; print $1,$2}' ...

Otherwise I don't see a reason not to use the most obvious way:
starting from line 1 and running until the end ... especially if you
don't know *where* the 200 lines are ...

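#!/usr/bin/perl
use strict;
use warnings;

# Map each search string to its replacement.
my %replace = ('apple' => 'orange', 'boy' => 'girl');
# One alternation of all keys; quotemeta escapes regex metacharacters.
my $r = join '|', map { quotemeta($_) } keys %replace;
$r = qr/($r)/;
# Stream the input line by line and substitute every match.
while (<>) {
    s/$r/$replace{$1}/g;
    print;
}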

[08:07:43] cvh@lenny:~$ echo "a boy named sue sings a song for apple jack" | perl repl.pl
a girl named sue sings a song for orange jack
[08:07:45] cvh@lenny:~$ echo "a boy named sue sings a song for apple jack" > test.txt
[08:07:59] cvh@lenny:~$ perl repl.pl test.txt
a girl named sue sings a song for orange jack
[08:08:11] cvh@lenny:~$ perl repl.pl test.txt >test_replace.txt
[08:08:24] cvh@lenny:~$ cat test_replace.txt
a girl named sue sings a song for orange jack
[08:08:40] cvh@lenny:~$
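
The mapping in the question lives in a CSV file rather than in the script, so a variant that reads it at startup might look like the sketch below. It assumes plain two-column lines with no quoting or embedded commas. The names load_map, build_regex and replace_line are made up for this illustration, and the demo reads the mapping from an in-memory string instead of a real file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read "from,to" pairs into a hash.
# Assumption: two plain columns, no quoting, no embedded commas.
sub load_map {
    my ($fh) = @_;
    my %map;
    while (my $line = <$fh>) {
        chomp $line;
        my ($from, $to) = split /,/, $line, 2;
        next unless defined $from && length $from && defined $to;
        $map{$from} = $to;
    }
    return \%map;
}

# Build one alternation from all keys. quotemeta escapes regex
# metacharacters; longest-first ordering lets a longer key win
# over a shorter key that is its prefix.
sub build_regex {
    my ($map) = @_;
    my @keys = sort { length($b) <=> length($a) || $a cmp $b } keys %$map;
    die "empty mapping\n" unless @keys;
    my $alt = join '|', map { quotemeta($_) } @keys;
    return qr/($alt)/;
}

# Apply every mapping to one line in a single pass.
sub replace_line {
    my ($line, $re, $map) = @_;
    $line =~ s/$re/$map->{$1}/g;
    return $line;
}

# Demo: the mapping comes from a string opened as a filehandle.
my $csv = "apple,orange\nboy,girl\n";
open my $fh, '<', \$csv or die "cannot open in-memory mapping: $!";
my $map = load_map($fh);
close $fh;

my $re = build_regex($map);
print replace_line("a boy named sue sings a song for apple jack\n", $re, $map);
```

For the real 500MB file, the demo lines would be swapped for an open on the mapping file and a `while (<>) { print replace_line($_, $re, $map) }` loop, so the input is still streamed line by line.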

sln · Dec 11, 2009

I would assume this process would
take a long time.

At a minimum, it would take

500,000,000
x
200
-----------------
100,000,000,000

100 billion character comparisons
if nothing ever matched.
If a word still never matches, but the first character
matches each time before backtracking, that doubles it:

100,000,000,000
x
2
----------------
200,000,000,000

brings the total up to 200 billion character
comparisons.

Since this is all a conservative estimate,
I would assume an average of (conservatively) 4 character
comparisons per mapping per byte in the file, which is
200 x 4 = 800 comparisons per byte, and say

500,000,000
x
800
-----------------
400,000,000,000

400 billion comparisons.
Add to that the minutiae of backtracking, loading
buffers, writing to disk, and the underlying layers
Perl has to go through to execute C code, and I would
go out for coffee or take a nap.

-sln
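
The three figures above all come from the same back-of-the-envelope model: bytes in the file, times number of mappings, times character comparisons per mapping. A small sketch that reproduces them is below. The names estimate and commify are made up for this illustration, and the model is only the one used in the post; it makes no claim about how many comparisons the regex engine really performs.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Comparison count if every mapping is tried at every byte.
sub estimate {
    my ($bytes, $mappings, $cmp_per_mapping) = @_;
    return $bytes * $mappings * $cmp_per_mapping;
}

# Insert thousands separators for readability.
sub commify {
    my ($n) = @_;
    1 while $n =~ s/^(\d+)(\d{3})/$1,$2/;
    return $n;
}

my ($bytes, $mappings) = (500_000_000, 200);
for my $cmp (1, 2, 4) {
    printf "%d comparison(s) per mapping: %s\n",
        $cmp, commify(estimate($bytes, $mappings, $cmp));
}
# prints:
# 1 comparison(s) per mapping: 100,000,000,000
# 2 comparison(s) per mapping: 200,000,000,000
# 4 comparison(s) per mapping: 400,000,000,000
```

Timing the actual script on a sample of the real file would settle whether the coffee break is needed.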
