Best way to replace a set of strings in large files?

Discussion in 'Perl Misc' started by Ryan Chan, Dec 10, 2009.

  1. Ryan Chan

    Ryan Chan Guest

    Hello,

    Consider the case:

    You have 200 lines of mapping to replace, in a csv format, e.g.

    apple,orange
    boy,girl
    ....

    You have a 500MB file, you want to replace all 200 lines of mapping,
    what would be the most efficient way to do it?

    Thanks.
    Ryan Chan, Dec 10, 2009
    #1

  2. Ryan Chan

    cvhLE Guest

    On Dec 10, 3:21 pm, Ryan Chan <> wrote:
    > Hello,
    >
    > Consider the case:
    >
    > You have 200 lines of mapping to replace, in a csv format, e.g.
    >
    > apple,orange
    > boy,girl
    > ...
    >
    > You have a 500MB file, you want to replace all 200 lines of mapping,
    > what would be the most efficient way to do it?
    >
    > Thanks.


    If you want to replace the whole line, or you know which column
    needs replacing and the lines have clear separators, you may be a
    lot faster if you do it using awk:

    awk -F',' '$2 ~ /apple/ { $2 = "orange"; print $1, $2 }' csv ...

    Otherwise I don't see a reason not to use the most obvious way:
    starting from line 1 and running until the end ... especially if you
    don't know *where* the 200 lines are ...

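    #!/usr/bin/perl
    use strict;
    use warnings;

    # Mapping of search string => replacement string.
    my %replace = ('apple' => 'orange', 'boy' => 'girl');
    # One alternation of all keys: longest first, metacharacters escaped.
    my $alt = join '|', map { quotemeta($_) } sort { length($b) <=> length($a) } keys %replace;
    my $r = qr/($alt)/;
    while (<>) {
        s/$r/$replace{$1}/g;    # one pass over each input line
        print;
    }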




    [08:07:43] cvh@lenny:~$ echo "a boy named sue sings a song for apple jack" | perl repl.pl
    a girl named sue sings a song for orange jack
    [08:07:45] cvh@lenny:~$ echo "a boy named sue sings a song for apple jack" > test.txt
    [08:07:59] cvh@lenny:~$ perl repl.pl test.txt
    a girl named sue sings a song for orange jack
    [08:08:11] cvh@lenny:~$ perl repl.pl test.txt >test_replace.txt
    [08:08:24] cvh@lenny:~$ cat test_replace.txt
    a girl named sue sings a song for orange jack
    [08:08:40] cvh@lenny:~$
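
    The script above hard-codes two pairs. For the 200-line mapping from
    the question, the pairs could be read from the CSV file instead. What
    follows is only a minimal sketch: the helper names load_mapping and
    build_replacer are made up for illustration, the demo data is
    hypothetical, and it assumes plain "from,to" lines with no embedded
    commas or quoting (real CSV would probably want a proper CSV parser).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read "from,to" pairs (one per line) from a filehandle into a hash.
# Assumption: fields contain no embedded commas and no CSV quoting.
sub load_mapping {
    my ($fh) = @_;
    my %map;
    while (my $line = <$fh>) {
        chomp $line;
        next unless length $line;               # skip empty lines
        my ($from, $to) = split /,/, $line, 2;
        next unless defined $to;                # skip lines without a comma
        $map{$from} = $to;
    }
    return %map;
}

# Return a closure that applies every replacement in a single pass.
sub build_replacer {
    my (%map) = @_;
    return sub { $_[0] } unless %map;           # nothing to replace
    # Longest keys first so a longer key wins over one of its prefixes.
    my $alt = join '|', map { quotemeta($_) }
              sort { length($b) <=> length($a) } keys %map;
    my $re = qr/($alt)/;
    return sub {
        my ($text) = @_;
        $text =~ s/$re/$map{$1}/g;
        return $text;
    };
}

# Demo with an in-memory "file" (hypothetical data, not the real mapping).
my $csv = "apple,orange\nboy,girl\n";
open my $fh, '<', \$csv or die "cannot open in-memory file: $!";
my $replace = build_replacer(load_mapping($fh));
close $fh;

print $replace->("a boy named sue sings a song for apple jack"), "\n";
```

    For the real 500MB file, the same closure would be called once per
    line inside a while (<>) loop, so memory use stays at one line at a
    time.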
    cvhLE, Dec 11, 2009
    #2

  3. Ryan Chan

    Guest

    On Thu, 10 Dec 2009 23:09:28 -0800 (PST), cvhLE <> wrote:

    >On Dec 10, 3:21 pm, Ryan Chan <> wrote:
    >> Hello,
    >>
    >> Consider the case:
    >>
    >> You have 200 lines of mapping to replace, in a csv format, e.g.
    >>
    >> apple,orange
    >> boy,girl
    >> ...
    >>
    >> You have a 500MB file, you want to replace all 200 lines of mapping,
    >> what would be the most efficient way to do it?
    >>
    >> Thanks.

    >
    >If you want to replace the whole line, or you know which column
    >needs replacing and the lines have clear separators, you may be a
    >lot faster if you do it using awk:
    >
    >awk -F',' '$2 ~ /apple/ { $2 = "orange"; print $1, $2 }' csv ...
    >
    >Otherwise I don't see a reason not to use the most obvious way:
    >starting from line 1 and running until the end ... especially if you
    >don't know *where* the 200 lines are ...
    >
    >#! /usr/bin/perl -w
    >%replace=('apple'=>'orange','boy'=>'girl');
    >$r="(".join ("|", keys %replace ).")";$r=qr($r);
    >while (<>) {
    >s/$r/$replace{$1}/g;
    >print;
    >}
    >


    I would assume this process would take
    a long time.

    At a minimum, it would take

    500,000,000
    x
    200
    -----------------
    100,000,000,000

    100 billion character comparisons
    if nothing ever matched.
    If still no word matches, but the first character
    does match before backtracking, then

    100,000,000,000
    x
    2
    ----------------
    200,000,000,000

    brings the total up to 200 billion character
    comparisons.

    Since this is all a conservative estimate,
    I would assume (still conservatively) an average of 4 character
    comparisons per mapping per byte in the file and say

    500,000,000
    x
    800
    -----------------
    400,000,000,000

    400 billion comparisons.
    Add to that the minutiae of backtracking, loading
    buffers, writing to disk, and the underpinning layers
    Perl has to go through to execute C code, and I would go out
    for coffee or take a nap.
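
    Rather than estimating, the cost could also be measured on a small
    sample and scaled up. The sketch below is only an illustration under
    stated assumptions: the 200 keys ("key000" to "key199") and the
    filler text are made up, the sample is about 1 MB of synthetic lines
    held in memory, and linear extrapolation to 500 MB ignores disk I/O.
    Whether the engine really does one comparison per mapping per byte
    depends on how it compiles the alternation; Perl 5.10 and later can
    turn an alternation of literal strings into a trie, so the result
    may differ a lot from the figures above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Hypothetical mapping: 200 made-up keys "key000".."key199".
my %replace;
$replace{ sprintf('key%03d', $_) } = sprintf('val%03d', $_) for 0 .. 199;

# Single alternation of all keys, longest first, metacharacters escaped.
my $alt = join '|', map { quotemeta($_) }
          sort { length($b) <=> length($a) } keys %replace;
my $re  = qr/($alt)/;

# Synthetic sample: about 1 MB of filler lines, each containing one key.
my @lines;
my $bytes = 0;
my $i     = 0;
while ($bytes < 1_000_000) {
    my $line = sprintf("some filler text key%03d and more filler text\n", $i++ % 200);
    push @lines, $line;
    $bytes += length $line;
}

# Time the substitution pass only (no file I/O involved).
my $start = time;
my $hits  = 0;
for my $line (@lines) {
    $hits += ($line =~ s/$re/$replace{$1}/g) || 0;   # s///g returns the match count
}
my $elapsed = time - $start;

printf "%d bytes, %d replacements, %.3f s\n", $bytes, $hits, $elapsed;
printf "extrapolated to 500 MB: about %.0f s\n", $elapsed * 500_000_000 / $bytes;
```

    The printed times depend entirely on the machine and the Perl
    version, so they are only a rough guide for the real file.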

    -sln
    , Dec 11, 2009
    #3
