Processing Multiple Large Files

Discussion in 'Perl Misc' started by friend.05@gmail.com, Dec 11, 2008.

  1. Guest

    Hi,

    I am analyzing some network log files. There are around 200-300 files,
    and each file has more than 2 million entries in it.

    Currently my script is reading each file line by line, so it takes a
    lot of time to process all the files.

    Is there any efficient way to do it?

    Maybe multiprocessing or multitasking?


    Thanks.
     
    , Dec 11, 2008
    #1

  2. Tim Greer Guest

    wrote:

    > Hi,
    >
    > I analyzing some netwokr log files. There are around 200-300 files and
    > each file has more than 2 million entries in it.
    >
    > Currently my script is reading each file line by line. So it will take
    > lot of time to process all the files.


    When dealing with a lot of data, you usually do want to read line by
    line. That's the most memory-efficient way to handle large text files.
    If you have a ton of memory to play with, you can try other approaches,
    but even reading line by line, there might be ways to speed things up,
    too, depending on a few variables and your needs.

    No matter how you go about it, if you have to look at every line in the
    file (to use, process, skip, whatever), you're still going to have to
    do that, and line-by-line reading will have the smaller memory
    footprint. Maybe it's how you're going about the task that can be
    improved? Do you have any relevant code snippets?
    --
    Tim Greer, CEO/Founder/CTO, BurlyHost.com, Inc.
    Shared Hosting, Reseller Hosting, Dedicated & Semi-Dedicated servers
    and Custom Hosting. 24/7 support, 30 day guarantee, secure servers.
    Industry's most experienced staff! -- Web Hosting With Muscle!
     
    Tim Greer, Dec 11, 2008
    #2

  3. Guest

    On Dec 11, 3:32 pm, Tim Greer <> wrote:
    > wrote:
    > > Hi,

    >
    > > I analyzing some netwokr log files. There are around 200-300 files and
    > > each file has more than 2 million entries in it.

    >
    > > Currently my script is reading each file line by line. So it will take
    > > lot of time to process all the files.

    >
    > When dealing with a lot of data, you usually want to read line by line,
    > if you can help it.  That's the most efficient way when dealing with
    > large text files.  If you have a ton of memory to play with, you can
    > try other solutions, but even reading line by line, there might be ways
    > to speed that up, too, depending on a few variables and your needs.
    >
    > No matter how you go about it, if you have to look at every line in the
    > file (to use, process, skip, whatever), you're still going to have to
    > do that and it will have the smaller memory footprint.  Maybe it's how
    > you're going about the task that can be improved?  Do you have any
    > relevant code snippets?
    > --
    > Tim Greer, CEO/Founder/CTO, BurlyHost.com, Inc.
    > Shared Hosting, Reseller Hosting, Dedicated & Semi-Dedicated servers
    > and Custom Hosting.  24/7 support, 30 day guarantee, secure servers.
    > Industry's most experienced staff! -- Web Hosting With Muscle!


    Yes, I am reading each file line by line.

    But there are more than 200 files, so is there a way I can process
    some of the files in parallel?

    Or is there any other solution to speed up my task?
     
    , Dec 11, 2008
    #3
  4. Guest

    "" <> wrote:
    > Hi,
    >
    > I analyzing some netwokr log files. There are around 200-300 files and
    > each file has more than 2 million entries in it.
    >
    > Currently my script is reading each file line by line.


    Perl makes it look like you are reading the files line by line,
    but really it is using internal buffering to read the files
    in larger chunks (well, if the lines are short; if the lines
    are long, the chunks may actually be shorter than the lines).

    > So it will take
    > lot of time to process all the files.
    >
    > Is there any efficient way to do it?


    Figure out which parts are inefficient, and improve them.

    > May be Multiprocessing, Multitasking ?


    Do you have several CPUs? Can your I/O system keep up with them?

    There are all kinds of ways to do parallel processing in Perl.
    In this case, maybe Parallel::ForkManager would be best:
    each process can be assigned a specific one of the 300
    files to work on.

    See the docs for Parallel::ForkManager.
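
    A minimal sketch of that idea (the '*.log' glob and the maximum of 4
    children are made up for illustration, not taken from the OP's setup):

    use strict;
    use warnings;
    use Parallel::ForkManager;

    my $pm = Parallel::ForkManager->new(4);   # at most 4 children at once

    for my $file (glob '*.log') {
        $pm->start and next;                  # parent: move on to the next file

        # child: process exactly one file, then exit
        open my $in, '<', $file or die "Cannot open '$file': $!";
        while (my $line = <$in>) {
            # per-line processing goes here
        }
        close $in;

        $pm->finish;
    }

    $pm->wait_all_children;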

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    The costs of publication of this article were defrayed in part by the
    payment of page charges. This article must therefore be hereby marked
    advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
    this fact.
     
    , Dec 11, 2008
    #4
  5. "" <> writes:

    > I analyzing some netwokr log files. There are around 200-300 files and
    > each file has more than 2 million entries in it.
    >
    > Currently my script is reading each file line by line. So it will take
    > lot of time to process all the files.


    It depends on what kind of processing you're doing.

    If you don't need to process lines in order, you might get a speedup by
    starting a couple of processes, each processing its own files. Again,
    depending on the kind of processing, the optimal number of processes
    may vary from the number of CPUs to a couple of times the number of
    CPUs.

    If part of your processing consists of doing DNS lookups, you might be
    able to get a speedup by reading a few lines at a time and using
    asynchronous DNS requests (Net::DNS::Async seems to do it) instead of
    blocking on each and every request.
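
    Roughly along these lines, following the add/await interface in the
    module's docs (an untested sketch; the host names and record type are
    made up, and in practice they would come from the log lines):

    use strict;
    use warnings;
    use Net::DNS::Async;

    my $resolver = Net::DNS::Async->new(QueueSize => 20, Retries => 3);

    my @hosts = qw( www.example.com mail.example.com ns1.example.com );

    for my $host (@hosts) {
        # queue the query; the callback runs when an answer (or failure) arrives
        $resolver->add(\&handle_response, $host, 'A');
    }

    $resolver->await;    # block until every outstanding query has finished

    sub handle_response {
        my $response = shift;              # a Net::DNS::Packet, or undef
        return unless defined $response;
        print $_->string, "\n" for $response->answer;
    }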

    Other optimizations might be possible, but almost everything depends
    on the kind of processing you have to do and if you have to process
    lines in some predetermined order.

    //Makholm
     
    Peter Makholm, Dec 11, 2008
    #5
  6. wrote:
    > Hi,
    >
    > I analyzing some netwokr log files. There are around 200-300 files and
    > each file has more than 2 million entries in it.
    >
    > Currently my script is reading each file line by line. So it will take
    > lot of time to process all the files.
    >
    > Is there any efficient way to do it?
    >
    > May be Multiprocessing, Multitasking ?
    >


    If the 200-300 files are on the same disk, are not especially fragmented
    and your program is already IO-bound, parallel processing might
    conceivably slow things down by increasing the number of head-seeks needed.

    Just a thought.

    --
    RGB
     
    RedGrittyBrick, Dec 11, 2008
    #6
  7. "" <> wrote in news:5f1e2237-
    :

    > I analyzing some netwokr log files. There are around 200-300 files and
    > each file has more than 2 million entries in it.
    >
    > Currently my script is reading each file line by line. So it will take
    > lot of time to process all the files.
    >
    > Is there any efficient way to do it?
    >
    > May be Multiprocessing, Multitasking ?


    Here is one way to do it using Parallel::ForkManager.

    If your system is somewhat typical, you'll probably run into an IO
    bottleneck before you run into a CPU bottleneck.

    For example:

    C:\DOCUME~1\asu1\LOCALS~1\Temp\large> cat create.pl
    #!/usr/bin/perl

    use strict;
    use warnings;

    my $line = join("\t", qw( 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 )) . "\n";

    my $fn_tmpl = 'data_%2.2d.txt';
    my $fn = sprintf $fn_tmpl, 0;

    open my $out, '>', $fn
        or die "Cannot open '$fn': $!";

    for (1 .. 100_000) {
        print $out $line
            or die "Cannot write to '$fn': $!";
    }

    close $out
        or die "Cannot close: '$fn': $!";

    for (1 .. 19) {
        system copy => $fn, sprintf($fn_tmpl, $_);
    }

    C:\DOCUME~1\asu1\LOCALS~1\Temp\large> timethis create
    ....
    TimeThis : Command Line : create
    TimeThis : Start Time : Thu Dec 11 18:14:12 2008
    TimeThis : End Time : Thu Dec 11 18:14:16 2008
    TimeThis : Elapsed Time : 00:00:03.468

    Now, you have 20 input files with 100_000 lines each:

    C:\DOCUME~1\asu1\LOCALS~1\Temp\large> dir
    ....
    2008/12/11 06:14 PM 4,100,000 data_00.txt
    2008/12/11 06:14 PM 4,100,000 data_01.txt
    2008/12/11 06:14 PM 4,100,000 data_02.txt
    2008/12/11 06:14 PM 4,100,000 data_03.txt
    2008/12/11 06:14 PM 4,100,000 data_04.txt
    2008/12/11 06:14 PM 4,100,000 data_05.txt
    2008/12/11 06:14 PM 4,100,000 data_06.txt
    2008/12/11 06:14 PM 4,100,000 data_07.txt
    2008/12/11 06:14 PM 4,100,000 data_08.txt
    2008/12/11 06:14 PM 4,100,000 data_09.txt
    2008/12/11 06:14 PM 4,100,000 data_10.txt
    2008/12/11 06:14 PM 4,100,000 data_11.txt
    2008/12/11 06:14 PM 4,100,000 data_12.txt
    2008/12/11 06:14 PM 4,100,000 data_13.txt
    2008/12/11 06:14 PM 4,100,000 data_14.txt
    2008/12/11 06:14 PM 4,100,000 data_15.txt
    2008/12/11 06:14 PM 4,100,000 data_16.txt
    2008/12/11 06:14 PM 4,100,000 data_17.txt
    2008/12/11 06:14 PM 4,100,000 data_18.txt
    2008/12/11 06:14 PM 4,100,000 data_19.txt

    Here is a simple program to process the data:

    C:\DOCUME~1\asu1\LOCALS~1\Temp\large> cat process.pl
    #!/usr/bin/perl

    use strict;
    use warnings;

    use Parallel::ForkManager;

    my ($instances) = @ARGV;

    my $fn_tmpl = 'data_%2.2d.txt';

    my $pm = Parallel::ForkManager->new($instances);

    for my $i (0 .. 19) {
        $pm->start and next;

        my $input = sprintf $fn_tmpl, $i;

        eval {
            open my $in, '<', $input
                or die "Cannot open '$input': $!";

            while ( my $line = <$in> ) {
                my @data = split /\t/, $line;

                # replace with your own processing code
                # don't try to keep all your data in memory
            }
            close $in
                or die "Cannot close '$input': $!";
        };

        warn $@ if $@;

        $pm->finish;
    }

    $pm->wait_all_children;

    __END__

    First, try without forking to establish a baseline:

    C:\DOCUME~1\asu1\LOCALS~1\Temp\large> timethis process 0

    TimeThis : Command Line : process 0
    TimeThis : Start Time : Thu Dec 11 18:31:50 2008
    TimeThis : End Time : Thu Dec 11 18:32:41 2008
    TimeThis : Elapsed Time : 00:00:51.156

    Let's try a few more:

    TimeThis : Command Line : process 2
    TimeThis : Start Time : Thu Dec 11 18:35:15 2008
    TimeThis : End Time : Thu Dec 11 18:35:58 2008
    TimeThis : Elapsed Time : 00:00:43.578

    TimeThis : Command Line : process 4
    TimeThis : Start Time : Thu Dec 11 18:36:17 2008
    TimeThis : End Time : Thu Dec 11 18:36:59 2008
    TimeThis : Elapsed Time : 00:00:41.921

    TimeThis : Command Line : process 8
    TimeThis : Start Time : Thu Dec 11 18:37:18 2008
    TimeThis : End Time : Thu Dec 11 18:38:00 2008
    TimeThis : Elapsed Time : 00:00:41.328

    TimeThis : Command Line : process 16
    TimeThis : Start Time : Thu Dec 11 18:38:18 2008
    TimeThis : End Time : Thu Dec 11 18:38:58 2008
    TimeThis : Elapsed Time : 00:00:40.734

    TimeThis : Command Line : process 20
    TimeThis : Start Time : Thu Dec 11 18:39:17 2008
    TimeThis : End Time : Thu Dec 11 18:39:58 2008
    TimeThis : Elapsed Time : 00:00:40.578

    Not very impressive. Between no forking and a max of 20 instances, the
    time required to process was reduced by 20%, with most of the gains
    coming from running just 2. That probably has more to do with the
    implementation of fork on Windows than anything else.

    In fact, I should probably have used threads on Windows. Anyway, I'll
    boot into Linux and see if the returns there are greater.

    Try this simple experiment on your system. See how many instances gives
    you the best bang-per-buck.

    Sinan

    --
    A. Sinan Unur <>
    (remove .invalid and reverse each component for email address)

    comp.lang.perl.misc guidelines on the WWW:
    http://www.rehabitation.com/clpmisc/
     
    A. Sinan Unur, Dec 11, 2008
    #7
  8. "A. Sinan Unur" <> wrote in
    news:Xns9B71C0CDC9E12asu1cornelledu@127.0.0.1:

    > "" <> wrote in news:5f1e2237-
    > :
    >
    >> I analyzing some netwokr log files. There are around 200-300 files
    >> and each file has more than 2 million entries in it.
    >>
    >> Currently my script is reading each file line by line. So it will
    >> take lot of time to process all the files.
    >>
    >> Is there any efficient way to do it?
    >>
    >> May be Multiprocessing, Multitasking ?

    >
    > Here is one way to do it using Parallel::Forkmanager.
    >

    ....

    > Not very impressive. Between no forking vs max 20 instances, time
    > required to process was reduced by 20% with most of the gains coming
    > from running 2. That probably has more to do with the implementation
    > of fork on Windows than anything else.
    >
    > In fact, I should probably have used threads on Windows. Anyway, I'll
    > boot into Linux and see if the returns there are greater.


    Hmmm ... I tried it on ArchLinux using perl from the repository on the
    exact same hardware as the Windows tests:

    [sinan@archardy large]$ time perl process.pl 0

    real 0m29.983s
    user 0m29.848s
    sys 0m0.073s

    [sinan@archardy large]$ time perl process.pl 2

    real 0m15.281s
    user 0m29.865s
    sys 0m0.077s

    with no changes going to 4, 8, 16 or 20 max instances. Exact same
    program and data on the same hardware, yet the no fork version was 40%
    faster. Running it in a shell window in xfce4 versus at boot-up on the
    console and running it in an ntfs filesystem versus ext3 file system did
    not make any meaningful difference.

    The wireless connection was up but inactive in all scenarios.

    -- Sinan

    --
    A. Sinan Unur <>
    (remove .invalid and reverse each component for email address)

    comp.lang.perl.misc guidelines on the WWW:
    http://www.rehabitation.com/clpmisc/
     
    A. Sinan Unur, Dec 12, 2008
    #8
  9. cartercc Guest

    On Dec 11, 3:27 pm, "" <>
    wrote:
    > I analyzing some netwokr log files. There are around 200-300 files and
    > each file has more than 2 million entries in it.
    > Currently my script is reading each file line by line. So it will take
    > lot of time to process all the files.


    Your question is really about data. The fact that your data is
    contained in files which have rows and columns is totally irrelevant.
    You would have the same problem if all the data were contained in just
    one file. If you have 200,000,000 items of data, you have that much
    data, and there's absolutely nothing you can do about it.

    > Is there any efficient way to do it?


    This is a good question, and the answer is, 'Maybe.' If you want to
    generate reports from the data, you might want to look into putting it
    into a database and writing queries against the database. That's what
    companies like Wal-mart, Amazon.com, and eBay do. Write a script that
    runs as a cron job at 2:00 am and reads all the data into a database.
    Then write another script that queries the database at 4:00 am and
    spits out the reports you want.
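
    A bare-bones sketch of the loading step (assuming DBD::SQLite via DBI;
    the table layout, the 'logs/*.log' glob and the tab-separated format
    are made up, so adapt the split to whatever your lines really contain):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect('dbi:SQLite:dbname=logs.db', '', '',
                           { RaiseError => 1, AutoCommit => 0 });

    $dbh->do('CREATE TABLE IF NOT EXISTS entries (ts TEXT, host TEXT, msg TEXT)');
    my $sth = $dbh->prepare('INSERT INTO entries (ts, host, msg) VALUES (?, ?, ?)');

    for my $file (glob 'logs/*.log') {
        open my $in, '<', $file or die "Cannot open '$file': $!";
        while (my $line = <$in>) {
            chomp $line;
            my ($ts, $host, $msg) = split /\t/, $line, 3;
            $sth->execute($ts, $host, $msg);
        }
        close $in;
        $dbh->commit;    # one transaction per file keeps the inserts fast
    }

    $dbh->disconnect;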

    >
    > May be Multiprocessing, Multitasking ?


    If you are using an Intel-like processor, it multiprocesses anyway.
    There are only two ways to increase speed: increase the clock speed of
    the processor or increase the number of processors. With respect to the
    latter, take a look at Erlang. I'd bet a lot of money that you could
    write an Erlang script that would increase the speed by several orders
    of magnitude. (On my machine, Erlang generates about 60,000 threads in
    several milliseconds, and I have an old, slow machine.)

    CC
     
    cartercc, Dec 12, 2008
    #9
  10. On 2008-12-12 13:09, A. Sinan Unur <> wrote:
    > "A. Sinan Unur" <> wrote in
    > news:Xns9B71C0CDC9E12asu1cornelledu@127.0.0.1:
    >> "" <> wrote in news:5f1e2237-
    >> :
    >>
    >>> I analyzing some netwokr log files. There are around 200-300 files
    >>> and each file has more than 2 million entries in it.

    [...]
    >>> Is there any efficient way to do it?
    >>>
    >>> May be Multiprocessing, Multitasking ?

    >>
    >> Here is one way to do it using Parallel::Forkmanager.
    >>

    > ...
    >
    >> Not very impressive. Between no forking vs max 20 instances, time
    >> required to process was reduced by 20% with most of the gains coming
    >> from running 2. That probably has more to do with the implementation
    >> of fork on Windows than anything else.
    >>
    >> In fact, I should probably have used threads on Windows. Anyway, I'll
    >> boot into Linux and see if the returns there are greater.

    >
    > Hmmm ... I tried it on ArchLinux using perl from the repository on the
    > exact same hardware as the Windows tests:
    >
    > [sinan@archardy large]$ time perl process.pl 0
    >
    > real 0m29.983s
    > user 0m29.848s
    > sys 0m0.073s
    >
    > [sinan@archardy large]$ time perl process.pl 2
    >
    > real 0m15.281s
    > user 0m29.865s
    > sys 0m0.077s
    >
    > with no changes going to 4, 8, 16 or 20 max instances. Exact same
    > program and data on the same hardware, yet the no fork version was 40%
    > faster.


    Where do you get this 40% figure from? As far as I can see the forking
    version is almost exactly 100% faster (0m15.281s instead of 0m29.983s)
    than the non-forking version.

    This is to be expected. Your small test files fit completely into memory
    even on rather small systems, and if you ran process.pl directly after
    create.pl, they almost certainly were still cached. So the task is
    completely CPU-bound, and if you have at least two cores (as most
    current computers do), two processes should be twice as fast as one.

    Here is what I get for

    for i in `seq 0 25`
    do
        echo -n "$i "
        time ./process $i
    done

    on a dual-core system:

    0 ./process $i 20.85s user 0.10s system 99% cpu 21.024 total
    1 ./process $i 22.03s user 0.06s system 99% cpu 22.146 total
    2 ./process $i 21.86s user 0.04s system 197% cpu 11.093 total
    3 ./process $i 22.63s user 0.09s system 197% cpu 11.505 total
    [...]
    23 ./process $i 21.67s user 0.15s system 199% cpu 10.956 total
    24 ./process $i 22.91s user 0.10s system 199% cpu 11.553 total
    25 ./process $i 22.05s user 0.08s system 199% cpu 11.124 total


    Two processes are twice as fast as one, but adding more processes
    doesn't help (but doesn't hurt either).

    And here's the output for an 8-core system:

    0 ./process $i 10.22s user 0.05s system 99% cpu 10.275 total
    1 ./process $i 10.13s user 0.07s system 100% cpu 10.196 total
    2 ./process $i 10.19s user 0.06s system 199% cpu 5.138 total
    3 ./process $i 10.19s user 0.06s system 284% cpu 3.606 total
    4 ./process $i 10.19s user 0.06s system 395% cpu 2.589 total
    5 ./process $i 10.18s user 0.06s system 472% cpu 2.167 total
    6 ./process $i 10.20s user 0.05s system 495% cpu 2.069 total
    7 ./process $i 10.20s user 0.07s system 650% cpu 1.580 total
    8 ./process $i 10.18s user 0.06s system 652% cpu 1.571 total
    9 ./process $i 10.19s user 0.05s system 659% cpu 1.553 total
    10 ./process $i 10.20s user 0.06s system 667% cpu 1.538 total
    11 ./process $i 10.19s user 0.06s system 666% cpu 1.538 total
    12 ./process $i 10.19s user 0.06s system 706% cpu 1.451 total
    13 ./process $i 10.19s user 0.05s system 662% cpu 1.545 total
    14 ./process $i 10.19s user 0.06s system 689% cpu 1.486 total
    15 ./process $i 10.19s user 0.05s system 708% cpu 1.446 total
    16 ./process $i 10.20s user 0.06s system 755% cpu 1.357 total
    17 ./process $i 10.22s user 0.06s system 756% cpu 1.360 total
    18 ./process $i 10.20s user 0.05s system 741% cpu 1.383 total
    19 ./process $i 10.21s user 0.06s system 729% cpu 1.407 total
    20 ./process $i 10.23s user 0.05s system 726% cpu 1.415 total
    21 ./process $i 10.20s user 0.06s system 749% cpu 1.368 total
    22 ./process $i 10.21s user 0.05s system 726% cpu 1.411 total
    23 ./process $i 10.23s user 0.06s system 739% cpu 1.392 total
    24 ./process $i 10.21s user 0.04s system 712% cpu 1.440 total
    25 ./process $i 10.20s user 0.05s system 739% cpu 1.386 total

    Speed rises almost linearly up to 7 processes (which manage to use 6.5
    cores). Then it still gets a bit faster up to 16 or 17 processes (using
    7.5 cores), and after that it levels off. Not quite what I expected, but
    close enough.

    For the OP's problem, this test is most likely not representative: He
    has a lot more files and each is larger. So they may not fit into the
    cache, and even if they do, they probably aren't in the cache when his
    script runs (depends on how long ago they were last read/written and how
    busy the system is).

    hp
     
    Peter J. Holzer, Dec 13, 2008
    #10
  11. On 2008-12-12 15:02, cartercc <> wrote:
    > On Dec 11, 3:27 pm, "" <>
    > wrote:
    >> I analyzing some netwokr log files. There are around 200-300 files and
    >> each file has more than 2 million entries in it.
    >> Currently my script is reading each file line by line. So it will take
    >> lot of time to process all the files.

    [...]
    >>
    >> May be Multiprocessing, Multitasking ?

    >
    > If you are using an Intel-like processor, it multi processes, anyway.


    No. At least not on a level you notice. From a Perl programmer's view
    (or a Java or C programmer's), each core is a separate CPU. A
    single-threaded program will not become faster just because you have two
    or more cores. You have to program those threads (or processes)
    explicitly to get any speedup. See my results for Sinan's test program
    on a dual-core and an eight-core machine.

    > There are only two ways to increase speed: increase the clocks of the
    > processor or increase the number of processors. With respect to the
    > latter, take a look at Erlang. I'd bet a lot of money that you could
    > write an Erlang script that would increase the speed by several orders
    > of magnitude.


    I doubt that very much. Erlang is inherently multithreaded so you don't
    have to do anything special to use those 2 or 8 cores you have, but it
    doesn't magically make a processor "several orders of magnitude" faster,
    unless you have hundreds or thousands of processors.

    I think Erlang is usually compiled to native code (like C), so it may
    well be a few orders of magnitude faster than perl because of that. But
    that depends very much on the problem and extracting stuff from text
    files is something at which perl is relatively fast.

    > (On my machine, Erlang generates about 60,000 threads in
    > sevaral milliseconds, and I have an old, slow machine.)


    Which says nothing about how long it takes to parse a line in a log
    file.

    hp
     
    Peter J. Holzer, Dec 13, 2008
    #11
  12. Guest

    "Peter J. Holzer" <> wrote:
    > On 2008-12-12 13:09, A. Sinan Unur <> wrote:
    > >
    > >> Not very impressive. Between no forking vs max 20 instances, time
    > >> required to process was reduced by 20% with most of the gains coming
    > >> from running 2. That probably has more to do with the implementation
    > >> of fork on Windows than anything else.
    > >>
    > >> In fact, I should probably have used threads on Windows. Anyway, I'll
    > >> boot into Linux and see if the returns there are greater.

    > >
    > > Hmmm ... I tried it on ArchLinux using perl from the repository on the
    > > exact same hardware as the Windows tests:
    > >
    > > [sinan@archardy large]$ time perl process.pl 0
    > >
    > > real 0m29.983s
    > > user 0m29.848s
    > > sys 0m0.073s
    > >
    > > [sinan@archardy large]$ time perl process.pl 2
    > >
    > > real 0m15.281s
    > > user 0m29.865s
    > > sys 0m0.077s
    > >
    > > with no changes going to 4, 8, 16 or 20 max instances. Exact same
    > > program and data on the same hardware, yet the no fork version was 40%
    > > faster.

    >
    > Where do you get this 40% figure from? As far as I can see the forking
    > version is almost exactly 100% faster (0m15.281s instead of 0m29.983s)
    > than the non-forking version.



    I assumed he was comparing Linux to Windows, not within linux.

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    The costs of publication of this article were defrayed in part by the
    payment of page charges. This article must therefore be hereby marked
    advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
    this fact.
     
    , Dec 14, 2008
    #12
  13. Guest

    On Thu, 11 Dec 2008 12:27:15 -0800 (PST), "" <> wrote:

    >Hi,
    >
    >I analyzing some netwokr log files. There are around 200-300 files and
    >each file has more than 2 million entries in it.
    >
    >Currently my script is reading each file line by line. So it will take
    >lot of time to process all the files.
    >
    >Is there any efficient way to do it?
    >
    >May be Multiprocessing, Multitasking ?
    >
    >
    >Thanks.


    I'm estimating 100 characters per line, 2 million lines per file,
    at 300 files, which will be about 60 gigabytes of data to be read.

    If the files are to be read across a real 1 gigabit network, just
    reading the data will take about 10 minutes (I think, or about 600 seconds).
    Gigabit ethernet can theoretically transmit 100 MB/second, if its cache
    is big enough, but that includes packetizing data and protocol ACKs/NAKs.
    So, in reality, it's about 50 MB/second.

    Some drives can't read/write that fast. It's the upper limit of
    some small RAID systems. So the drives may not actually be able to keep up
    with network read requests if that's all you're doing.
    The CPU will be mostly idle.

    In reality, though, you're not reading huge blocks of data like in a
    disk-to-disk transfer; you're processing line by line.

    In this case, the I/O requests sit incrementally idle between bursts of
    line-by-line file processing on the CPU, AND the CPU sits idle waiting
    for each incremental I/O request.

    So where is the time being lost? Well, it's being lost in both places, on
    the CPU and the I/O.

    The CPU can process equivalent minutiae at about 25 GB/second, RAM
    at about 2-4 GB/second, and the hard drive (being slower than gigabit
    ethernet) at about 25 MB/second.

    For the CPU:
    It would be better to keep the CPU working all the time rather than waiting
    for I/O completion on single requests. The way to do this is to have many
    requests submitted at one time. I would say 25 threads, on 25 different files
    at a time. You're running the same function, just on a different thread; you
    just have to know which file is next.
    And multiple threads beat multiple processes hands down, simply because
    process switching takes much more overhead than thread switching. Another
    reason is that you have 25 buffers waiting for the I/O data instead of just 1.
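
    Roughly along these lines (an untested sketch, not the OP's code; the
    'logs/*.log' glob and the worker count of 8 are made-up placeholders):

    use strict;
    use warnings;
    use threads;
    use Thread::Queue;

    my $n_workers = 8;

    # shared queue of file names; each worker pulls the next unprocessed file
    my $queue = Thread::Queue->new;
    $queue->enqueue(glob 'logs/*.log');
    $queue->enqueue(undef) for 1 .. $n_workers;   # one end-marker per worker

    my @workers = map {
        threads->create(sub {
            while (defined(my $file = $queue->dequeue)) {
                open my $in, '<', $file or die "Cannot open '$file': $!";
                while (my $line = <$in>) {
                    # per-line processing goes here
                }
                close $in;
            }
        });
    } 1 .. $n_workers;

    $_->join for @workers;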

    For the I/O:
    When the I/O always has a request pending (cached), it doesn't have to wait
    for the CPU. Most of the memory transfer will be via the DMA controller, not
    the CPU.

    If you don't do multiple threads at all, there are still ways to speed it up.
    Even if you just buffer a batch of lines at a time before you process them,
    it would be better than no threads at all.
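
    For example, something like this (an untested sketch): read the file in
    roughly 1 MB chunks and split each chunk into lines yourself, instead of
    asking for one line at a time.

    use strict;
    use warnings;

    my $file = 'example.log';                 # hypothetical file name

    open my $in, '<', $file or die "Cannot open '$file': $!";

    my $buf;
    my $leftover = '';
    while (read($in, $buf, 1_048_576)) {      # ~1 MB per read
        $buf = $leftover . $buf;
        my @lines = split /\n/, $buf, -1;
        $leftover = pop @lines;               # possibly a partial last line
        for my $line (@lines) {
            # per-line processing goes here (newlines are already stripped)
        }
    }
    # $leftover holds the final line if the file does not end with a newline

    close $in;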

    Good luck.

    sln
     
    , Dec 14, 2008
    #13
  14. Guest

    On Sun, 14 Dec 2008 01:03:42 GMT, wrote:

    >On Thu, 11 Dec 2008 12:27:15 -0800 (PST), "" <> wrote:
    >
    >

    [...]
    >If you don't do multiple threads at all, there is still ways to speed it up.
    >Even if you create a cache of 10 lines at a time before you process, it would be
    >better than no threads at all.
    >

    Of course, it's a block read, not line by line.

    sln
     
    , Dec 14, 2008
    #14
  15. wrote in news:20081213192853.428$:

    > "Peter J. Holzer" <> wrote:
    >> On 2008-12-12 13:09, A. Sinan Unur <> wrote:
    >> >

    ....
    >> >> In fact, I should probably have used threads on Windows. Anyway,
    >> >> I'll boot into Linux and see if the returns there are greater.
    >> >
    >> > Hmmm ... I tried it on ArchLinux using perl from the repository on
    >> > the exact same hardware as the Windows tests:
    >> >
    >> > [sinan@archardy large]$ time perl process.pl 0
    >> >
    >> > real 0m29.983s
    >> > user 0m29.848s
    >> > sys 0m0.073s
    >> >
    >> > [sinan@archardy large]$ time perl process.pl 2
    >> >
    >> > real 0m15.281s
    >> > user 0m29.865s
    >> > sys 0m0.077s
    >> >
    >> > with no changes going to 4, 8, 16 or 20 max instances. Exact same
    >> > program and data on the same hardware, yet the no fork version was
    >> > 40% faster.

    >>
    >> Where do you get this 40% figure from? As far as I can see the
    >> forking version is almost exactly 100% faster (0m15.281s instead of
    >> 0m29.983s) than the non-forking version.

    >
    >
    > I assumed he was comparing Linux to Windows, not within linux.


    A very astute observation ;-)

    My purpose was to show the OP how to test whether forking etc. could
    provide performance gains. I did not think it would (I did say "you are
    going to run into IO bottlenecks before you run into CPU bottlenecks").

    I was still astonished by the fact that the exact same Perl program,
    with the exact same data, on the exact same hardware, being run under
    the latest available perl binary for each platform, was faster in
    ArchLinux than in Windows XP.

    Sinan

    --
    A. Sinan Unur <>
    (remove .invalid and reverse each component for email address)

    comp.lang.perl.misc guidelines on the WWW:
    http://www.rehabitation.com/clpmisc/
     
    A. Sinan Unur, Dec 14, 2008
    #15
  16. Tim Greer Guest

    wrote:

    > If the files are are to be read across a real 1 Gigabit network, just
    > reading the data will take about 10 minutes (I think, or about 600
    > seconds). A Gigabit ethernet, theoretically can transmit 100
    > MB/second, if its cache is big enough. But that includes packetizing
    > data and protocol ack/nak's. So, in reality, its about 50 MB/second.


    By all accounts, this aspect is probably irrelevant, as there was no
    mention of needing to transfer the data across a network. If that is
    the case, I'm hoping the OP mentions it, along with any other aspect
    that could play a potential role. Still, I'm pretty certain they mean
    they'll process the data on the system it resides on, or else transfer
    it to another system and process it there. Otherwise, there are
    certainly other aspects to consider, to be sure.
    --
    Tim Greer, CEO/Founder/CTO, BurlyHost.com, Inc.
    Shared Hosting, Reseller Hosting, Dedicated & Semi-Dedicated servers
    and Custom Hosting. 24/7 support, 30 day guarantee, secure servers.
    Industry's most experienced staff! -- Web Hosting With Muscle!
     
    Tim Greer, Dec 14, 2008
    #16
  17. Guest

    On Sat, 13 Dec 2008 20:29:46 -0800, Tim Greer <> wrote:

    > wrote:
    >
    >> If the files are are to be read across a real 1 Gigabit network, just
    >> reading the data will take about 10 minutes (I think, or about 600
    >> seconds). A Gigabit ethernet, theoretically can transmit 100
    >> MB/second, if its cache is big enough. But that includes packetizing
    >> data and protocol ack/nak's. So, in reality, its about 50 MB/second.

    >
    >By all accounts, this aspect is probably irrelevant, as there was no
    >mention of needing to transfer the data across a network. If this is
    >the case, I'm hoping the OP mentions it and any other aspect that could
    >play a potential role. Still, I'm pretty certain they mean they'll
    >process the data on the system the data resides on, or else they'll
    >transfer it to another system and then process the data on the system
    >the data is then (now) on. Otherwise, there are certainly other
    >aspects to consider, to be sure.

    OP:
    "Hi,

    I analyzing some netwokr log files. There are around ...
    "

    sln
     
    , Dec 14, 2008
    #17
  18. Guest

    On Sun, 14 Dec 2008 06:34:27 GMT, wrote:

    >On Sat, 13 Dec 2008 20:29:46 -0800, Tim Greer <> wrote:
    >
    >> wrote:
    >>
    >>> If the files are are to be read across a real 1 Gigabit network, just
    >>> reading the data will take about 10 minutes (I think, or about 600
    >>> seconds). A Gigabit ethernet, theoretically can transmit 100
    >>> MB/second, if its cache is big enough. But that includes packetizing
    >>> data and protocol ack/nak's. So, in reality, its about 50 MB/second.

    >>
    >>By all accounts, this aspect is probably irrelevant, as there was no
    >>mention of needing to transfer the data across a network. If this is
    >>the case, I'm hoping the OP mentions it and any other aspect that could
    >>play a potential role. Still, I'm pretty certain they mean they'll
    >>process the data on the system the data resides on, or else they'll
    >>transfer it to another system and then process the data on the system
    >>the data is then (now) on. Otherwise, there are certainly other
    >>aspects to consider, to be sure.

    > OP:
    >"Hi,
    >
    >I analyzing some netwokr log files. There are around ...
    >"
    >
    >sln


    In Chinese, this translates to "I got your US job files,
    no need to keep your workers, fire those bastards and join
    the Communist revolution"

    sln
     
    , Dec 14, 2008
    #18
  19. Tim Greer Guest

    wrote:

    > "Hi,
    >
    > I analyzing some netwokr log files. There are around ...
    > "
    >


    I didn't get the impression that meant the large preexisting logs needed
    to be transferred or read over the network as they were processed, but
    people have done stranger things, I suppose. :)
    --
    Tim Greer, CEO/Founder/CTO, BurlyHost.com, Inc.
    Shared Hosting, Reseller Hosting, Dedicated & Semi-Dedicated servers
    and Custom Hosting. 24/7 support, 30 day guarantee, secure servers.
    Industry's most experienced staff! -- Web Hosting With Muscle!
     
    Tim Greer, Dec 14, 2008
    #19
  20. cartercc Guest

    On Dec 13, 5:26 pm, "Peter J. Holzer" <> wrote:
    > No. At least not on a level you notice. From a perl programmer's view
    > (or a Java or C programmer's), each core is a separate CPU. A
    > single-threaded program will not become faster just because you have two
    > or more cores. You have to program those threads (or processes)
    > explicitely to get any speedup. See my results for Sinan's test program
    > for a dual- and eight core machine.


    I've never written a multithreaded Perl program, but I have written
    multithreaded programs in C and Java, and the big problem IMO is that
    all those threads typically want a piece of your shared object, so you
    have to build a gatekeeper to let those threads into your
    shared object one at a time. I expect that Perl is the same.
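
    For reference, in Perl that gatekeeper is typically lock() on a variable
    marked :shared. A tiny, made-up illustration (not from the thread's code):

    use strict;
    use warnings;
    use threads;
    use threads::shared;

    my $total :shared = 0;        # one object shared by all threads

    my @workers = map {
        threads->create(sub {
            for (1 .. 100_000) {
                lock($total);     # only one thread may update at a time;
                $total++;         # the lock is released at the end of the block
            }
        });
    } 1 .. 4;

    $_->join for @workers;
    print "total = $total\n";     # deterministically 400000 thanks to lock()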

    > I doubt that very much. Erlang is inherently multithreaded so you don't
    > have to do anything special to use those 2 or 8 cores you have, but it
    > doesn't magically make a processor "several orders of magnitude" faster,
    > unless you have hundreds or thousands of processors.


    That's true, but that's not really the point. The point is that Erlang
    is an asynchronous message-passing language, so you don't have to keep
    locking and unlocking shared objects. From my POV as a software guy
    (not a hardware guy), not having to explicitly program mutexes and
    semaphores is great. But yeah, theoretically Erlang program runtime
    speed is proportional to the number of processors you have.

    > I think Erlang is usually compiled to native code (like C), so it may
    > well be a few orders of magnitude faster than perl because of that. But
    > that depends very much on the problem and extracting stuff from text
    > files is something at which perl is relatively fast.


    Yes. You can't really make a priori statements about execution speed,
    and Perl is optimized for data manipulation while Erlang isn't, but
    still, Erlang is optimized for multithreading which might be the
    answer that the OP was looking for.

    > Which says nothing about how long it takes to parse a line in a log
    > file.


    Which was my point to begin with. If you have 2M records, you have 2M
    records and you've got to deal with it.

    BTW, I've never written an Erlang program to do what I use Perl for,
    I'm not sure if it can be done, and I don't know if it would have any
    benefit. However, I've seen what Erlang can do with multithreaded
    apps, and I certainly think that Erlang is a strong competitor for
    those kinds of applications.

    CC
     
    cartercc, Dec 18, 2008
    #20
