out of memory

Discussion in 'Perl Misc' started by friend.05@gmail.com, Oct 31, 2008.

  1. Guest

    Hi,

    I want to parse large log files (in GBs),

    and I am reading 2-3 such files into a hash of arrays.

    But since that makes a very big hash, it is running out of memory.

    What other approaches can I take?


    Example code:

    open ($INFO, '<', $file) or die "Cannot open $file :$!\n";
    while (<$INFO>)
    {
        (undef, undef, undef, $time, $cli_ip, $ser_ip, undef, $id, undef) = split('\|');
        push @{$time_table{"$cli_ip|$id"}}, $time;
    }
    close $INFO;


    In the above code $file is very big (in GBs), so I am running out of
    memory!
     
    , Oct 31, 2008
    #1

  2. "" <> wrote:
    >I want to parse large log file (in GBs)
    >
    >and I am readin 2-3 such files in hash array.
    >
    >But since it will very big hash array it is going out of memory.
    >
    >what are the other approach I can take.


    "Doctor, it hurts when I do this."
    "Well, then don't do it."

    Simple: don't read them into RAM but process them line by line.

    >Example code:
    >
    >open ($INFO, '<', $file) or die "Cannot open $file :$!\n";
    >while (<$INFO>)


    Oh, you are processing them line by line,

    >{
    > (undef, undef, undef, $time, $cli_ip, $ser_ip, undef, $id,
    >undef) = split('\|');
    > push @{$time_table{"$cli_ip|$id"}}, $time;
    >}
    >close $INFO;


    If for whatever reason your requirement (sic!!!) is to create an array
    with all this data, then you need better hardware and probably a 64-bit
    OS and Perl.

    Of course a much better approach would probably be to trade time for
    space and find a different algorithm to solve your original problem
    (which you didn't tell us about) by using less RAM in the first place. I
    personally don't see any need to store more than one data set in RAM for
    "parsing log files", but of course I don't know what kind of log files
    you are talking about and what information you want to compute from
    those log files.

    Another common solution is to use a database to handle large sets of
    data.

    jue
     
    Jürgen Exner, Oct 31, 2008
    #2

  3. Juha Laiho Guest

    "" <> said:
    >I want to parse large log file (in GBs)
    >
    >and I am readin 2-3 such files in hash array.
    >
    >But since it will very big hash array it is going out of memory.


    Do you really need to have the whole file available in order to
    extract the data you're interested in?

    >Example code:
    >
    >open ($INFO, '<', $file) or die "Cannot open $file :$!\n";
    >while (<$INFO>)
    >{
    > (undef, undef, undef, $time, $cli_ip, $ser_ip, undef, $id,
    >undef) = split('\|');
    > push @{$time_table{"$cli_ip|$id"}}, $time;
    >}
    >close $INFO;
    >
    >In above code $file is very big in size(in Gbs); so I am getting out
    >of memory !


    So, you're storing times based on client ip and id, if I read correctly.

    How about not keeping that data in memory, but writing it out as you
    gather it?
    - to a text file, to be processed further in a later stage of the script
    - to a database format file (via the DB_File module, or one of its sister
      modules), so that you can do fast indexed searches on the data
    - to a "real" database in a proper relational structure, to allow
      you to do any kind of relational reporting rather easily

    Also, where $time above apparently is a string containing some kind of
    a timestamp, you could convert that timestamp into something else
    (number of seconds from epoch comes to mind) that takes a lot less
    memory than a string representation such as "2008-10-31 18:33:24".
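
    For example, a rough, untested sketch of that conversion using the core
    Time::Local module (this assumes timestamps shaped exactly like
    "2008-10-31 18:33:24"; adjust the regex to your real format):

    use strict;
    use warnings;
    use Time::Local;

    # Turn "YYYY-MM-DD hh:mm:ss" into epoch seconds (a plain number)
    # instead of keeping the 19-character string around.
    my $stamp = '2008-10-31 18:33:24';
    my ($y, $mon, $d, $h, $min, $s) =
        $stamp =~ /^(\d{4})-(\d{2})-(\d{2}) (\d{2}):(\d{2}):(\d{2})$/
            or die "unparsable timestamp: $stamp\n";
    my $epoch = timelocal($s, $min, $h, $d, $mon - 1, $y);
    print "$epoch\n";
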
    --
    Wolf a.k.a. Juha Laiho Espoo, Finland
    (GC 3.0) GIT d- s+: a C++ ULSH++++$ P++@ L+++ E- W+$@ N++ !K w !O !M V
    PS(+) PE Y+ PGP(+) t- 5 !X R !tv b+ !DI D G e+ h---- r+++ y++++
    "...cancel my subscription to the resurrection!" (Jim Morrison)
     
    Juha Laiho, Oct 31, 2008
    #3
  4. Guest

    On Oct 31, 12:37 pm, Juha Laiho <> wrote:
    > "" <> said:
    >
    > >I want to parse large log file (in GBs)

    >
    > >and I am readin 2-3 such files in hash array.

    >
    > >But since it will very big hash array it is going out of memory.

    >
    > Do you really need to have the whole file available in order to
    > extract the data you're interested in?
    >
    > >Example code:

    >
    > >open ($INFO, '<', $file) or die "Cannot open $file :$!\n";
    > >while (<$INFO>)
    > >{
    > >        (undef, undef, undef, $time, $cli_ip, $ser_ip, undef, $id,
    > >undef) = split('\|');
    > >            push @{$time_table{"$cli_ip|$id"}}, $time;
    > >}
    > >close $INFO;

    >
    > >In above code $file is very big in size(in Gbs); so I am getting out
    > >of memory !

    >
    > So, you're storing times based on client ip and id, if I read correctly.
    >
    > How about not keeping that data in memory, but writing it out as you
    > gather it?
    > - to a text file, to be processed further in a next stage of the script
    > - to a database format file (via DB_File module, or one of its sister
    >   modules), so that you can do fast indexed searches on the data
    > - to a "real" database in a proper relational structure, to allow
    >   you to do any kind of relational reporting rather easily
    >
    > Also, where $time above apparently is a string containing some kind of
    > a timestamp, you could convert that timestamp into something else
    > (number of seconds from epoch comes to mind) that takes a lot less
    > memory than a string representation such as "2008-10-31 18:33:24".
    > --
    > Wolf  a.k.a.  Juha Laiho     Espoo, Finland
    > (GC 3.0) GIT d- s+: a C++ ULSH++++$ P++@ L+++ E- W+$@ N++ !K w !O !M V
    >          PS(+) PE Y+ PGP(+) t- 5 !X R !tv b+ !DI D G e+ h---- r+++ y++++
    > "...cancel my subscription to the resurrection!" (Jim Morrison)


    Thanks.

    If I output to a text file and read it again later, will I be able to
    search based on a key? (I mean, when I read it again, will I be able to
    use it as a hash or not?)
     
    , Oct 31, 2008
    #4
  5. Guest

    "" <> wrote:
    > Hi,
    >
    > I want to parse large log file (in GBs)
    >
    > and I am readin 2-3 such files in hash array.
    >
    > But since it will very big hash array it is going out of memory.
    >
    > what are the other approach I can take.


    The other approaches you can take depend on what you are trying to
    do.

    >
    > Example code:
    >
    > open ($INFO, '<', $file) or die "Cannot open $file :$!\n";
    > while (<$INFO>)
    > {
    > (undef, undef, undef, $time, $cli_ip, $ser_ip, undef, $id,
    > undef) = split('\|');
    > push @{$time_table{"$cli_ip|$id"}}, $time;
    > }
    > close $INFO;


    You could get some improvement by having just a hash rather than a hash of
    arrays. Replace the push with, for example:

    $time_table{"$cli_ip|$id"} .= "$time|";

    Then you would have to split the hash values into a list/array one at a
    time as they are needed.
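
    For example, an untested sketch of that idea (the key is a made-up
    sample):

    use strict;
    use warnings;

    my %time_table;

    # Instead of push @{$time_table{$key}}, $time; append to one flat string
    # per key, which avoids the per-element overhead of an array of scalars.
    sub add_time {
        my ($key, $time) = @_;
        $time_table{$key} .= "$time|";
    }

    # Split an entry back into a list only at the moment it is needed.
    sub times_for {
        my ($key) = @_;
        return split /\|/, ($time_table{$key} || '');
    }

    add_time('10.0.0.1|42', '12:00:01');
    add_time('10.0.0.1|42', '12:00:07');
    print join(', ', times_for('10.0.0.1|42')), "\n";   # 12:00:01, 12:00:07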



    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    The costs of publication of this article were defrayed in part by the
    payment of page charges. This article must therefore be hereby marked
    advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
    this fact.
     
    , Oct 31, 2008
    #5
  6. "" <> wrote:
    >if I output as text file and read it again later on will be able to
    >search based on key. (I mean when read it again I will be able to use
    >it as hash or not )


    That depends upon what you do with the data when reading it in again. Of
    course you can construct a hash, but then you wouldn't have gained
    anything. Why would this hash be any smaller than the one you were
    trying to construct the first time?

    Your current approach (put everything into a hash) and your current
    hardware are incompatible.

    Either get larger hardware (expensive) or rethink your basic approach:
    e.g. use a database system, or compute your desired results on the fly
    while parsing through the file, or write intermediate results to a file
    in a format that can later be processed line by line, or use any other
    of the gazillion ways of preserving RAM. Don't you learn those
    techniques in basic computer science classes any more?

    jue
     
    Jürgen Exner, Oct 31, 2008
    #6
  7. Guest

    On Oct 31, 1:22 pm, Jürgen Exner <> wrote:
    > "" <> wrote:
    > >if I output as text file and read it again later on will be able to
    > >search based on key. (I mean when read it again I will be able to use
    > >it as hash or not )

    >
    > That depends upon what you do with the data when reading it in again. Of
    > course you can construct hash, but then you wouldn't have gained
    > anything. Why would this hash be any smaller than the one you were
    > trying to construct the first time?
    >
    > Your current approach (put everything into a hash) and your current
    > hardware are incompatible.
    >
    > Either get larger hardware (expensive) or rethink your basic approach,
    > e.g. use a database system or compute your desired results on the fly
    > while parsing through the file or write intermediate results to a file
    > in a format that later can be processed line by line or by any other of
    > the gazillions ways of preversing RAM. Don't you learn those techniques
    > in basic computer science classes any more?
    >
    > jue


    Outputting to a file and using it again will take a lot of time. It
    will be very slow.

    Will it help with speed if I use the DB_File module?
     
    , Oct 31, 2008
    #7
  8. Guest

    On Oct 31, 1:41 pm, "" <>
    wrote:
    > On Oct 31, 1:22 pm, Jürgen Exner <> wrote:
    > > "" <> wrote:
    > > >if I output as text file and read it again later on will be able to
    > > >search based on key. (I mean when read it again I will be able to use
    > > >it as hash or not )

    >
    > > That depends upon what you do with the data when reading it in again. Of
    > > course you can construct hash, but then you wouldn't have gained
    > > anything. Why would this hash be any smaller than the one you were
    > > trying to construct the first time?

    >
    > > Your current approach (put everything into a hash) and your current
    > > hardware are incompatible.

    >
    > > Either get larger hardware (expensive) or rethink your basic approach,
    > > e.g. use a database system or compute your desired results on the fly
    > > while parsing through the file or write intermediate results to a file
    > > in a format that later can be processed line by line or by any other of
    > > the gazillions ways of preversing RAM. Don't you learn those techniques
    > > in basic computer science classes any more?

    >
    > > jue

    >
    > output to a file and using it again will take lot of time. It will be
    > very slow.
    >
    > will be helpful in speed if I use DB_FILE module


    Here is what I am trying to do.

    I have two large files. I read one file and check whether each record is
    also present in the second file. I also need to count how many times it
    appears in both files, and I do other processing accordingly.

    So if I process both files line by line against each other, it will be
    like this: e.g. if file1 has 10 lines and file2 has 10 lines, then for
    each line of file1 it will loop 10 times over file2, so 100 iterations
    in total. I am dealing with millions of lines, so this approach will be
    very slow.


    This is my current code. It runs fine with small files.



    open ($INFO, '<', $file) or die "Cannot open $file :$!\n";
    while (<$INFO>)
    {
        (undef, undef, undef, $time, $cli_ip, $ser_ip, undef, $id, undef) = split('\|');
        push @{$time_table{"$cli_ip|$dns_id"}}, $time;
    }


    open ($INFO_PRI, '<', $pri_file) or die "Cannot open $pri_file :$!\n";
    while (<$INFO_PRI>)
    {
        (undef, undef, undef, $pri_time, $pri_cli_ip, undef, undef, $pri_id, undef, $query, undef) = split('\|');
        $pri_ip_id_table{"$pri_cli_ip|$pri_id"}++;
        push @{$pri_time_table{"$pri_cli_ip|$pri_id"}}, $pri_time;
    }

    @pri_ip_id_table_ = keys(%pri_ip_id_table);

    for($i = 0; $i < @pri_ip_id_table_; $i++)    # file 2
    {
        if($time_table{"$pri_ip_dns_table_[$i]"})    # check if it is there in file 1
        {
            # do some processing.
        }
    }



    So for the above example, which approach would be best?


    Thanks for your help.
     
    , Oct 31, 2008
    #8
  9. >>>>> "JE" == Jürgen Exner <> writes:

    JE> Don't you learn those techniques in basic computer science
    JE> classes any more?

    The assumption that someone who is getting paid to program has had -- or
    even has had any interest in -- computer science classes gets less
    tenable with each passing day.

    Charlton


    --
    Charlton Wilbur
     
    Charlton Wilbur, Oct 31, 2008
    #9
  10. J. Gleixner Guest

    wrote:
    > On Oct 31, 1:41 pm, "" <>
    > wrote:
    >> On Oct 31, 1:22 pm, Jürgen Exner <> wrote:
    >>> "" <> wrote:
    >>>> if I output as text file and read it again later on will be able to
    >>>> search based on key. (I mean when read it again I will be able to use
    >>>> it as hash or not )
    >>> That depends upon what you do with the data when reading it in again. Of
    >>> course you can construct hash, but then you wouldn't have gained
    >>> anything. Why would this hash be any smaller than the one you were
    >>> trying to construct the first time?
    >>> Your current approach (put everything into a hash) and your current
    >>> hardware are incompatible.
    >>> Either get larger hardware (expensive) or rethink your basic approach,
    >>> e.g. use a database system or compute your desired results on the fly
    >>> while parsing through the file or write intermediate results to a file
    >>> in a format that later can be processed line by line or by any other of
    >>> the gazillions ways of preversing RAM. Don't you learn those techniques
    >>> in basic computer science classes any more?
    >>> jue

    >> output to a file and using it again will take lot of time. It will be
    >> very slow.
    >>
    >> will be helpful in speed if I use DB_FILE module

    >
    > here is what I am trying to do.
    >
    > I have two large files. I will read one file and see if that is also
    > present in second file. I also need count how many time it is appear
    > in both the file. And according I do other processing.
    >
    > so if I process line by line both the file then it will be like (eg.
    > file1 has 10 line and file2 has 10 line. for each line file1 it will
    > loop 10 times. so total 100 loops.) I am dealing millions of lines so
    > this approach will be very slow.


    Maybe you shouldn't do your own math. It'd be 10 reads for each file,
    so 20.
    >
    >
    > this is my current code. It runs fine with small file.
    >

    use strict;
    use warnings;

    >
    >
    > open ($INFO, '<', $file) or die "Cannot open $file :$!\n";

    open( my $INFO, ...

    > while (<$INFO>)
    > {
    > (undef, undef, undef, $time, $cli_ip, $ser_ip, undef, $id,
    > undef) = split('\|');


    my( $time, $cli_ip, $ser_ip, $id ) = (split( /\|/ ))[3,4,5,7];

    > push @{$time_table{"$cli_ip|$dns_id"}}, $time;
    > }

    close( $INFO );
    >
    >
    > open ($INFO_PRI, '<', $pri_file) or die "Cannot open $pri_file :$!
    > \n";


    open( my $INFO_PRI, ...

    > while (<$INFO_PRI>)
    > {
    > (undef, undef, undef, $pri_time, $pri_cli_ip, undef, undef,
    > $pri_id, undef, $query, undef) = split('\|');


    my( $pri_time, $pri_cli_ip, $pri_id, $query ) = (split( /\|/ ))[3,4,7,9];

    > $pri_ip_id_table{"$pri_cli_ip|$pri_id"}++;
    > push @{$pri_time_table{"$pri_cli_ip|$pri_id"}}, $pri_time;
    > }


    Read one file into memory/hash, if possible. As you're processing
    the second one, store/push some data to process later, or process
    it at that time, if it matches your criteria. There's no need to
    store both in memory.
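
    For instance, a bare-bones, untested sketch of that shape (field
    positions copied from the code above; the "processing" is left as a
    placeholder):

    use strict;
    use warnings;

    my ($file, $pri_file) = @ARGV;

    # Pass 1: keep only what is actually needed from the first file --
    # here, just a count per "cli_ip|id" key.
    my %seen;
    open( my $INFO, '<', $file ) or die "Cannot open $file: $!\n";
    while (<$INFO>) {
        my( $cli_ip, $id ) = (split /\|/)[4,7];
        $seen{"$cli_ip|$id"}++;
    }
    close $INFO;

    # Pass 2: stream the second file and act on each matching line right
    # away, instead of building another large structure in memory.
    open( my $INFO_PRI, '<', $pri_file ) or die "Cannot open $pri_file: $!\n";
    while (<$INFO_PRI>) {
        my( $pri_time, $pri_cli_ip, $pri_id ) = (split /\|/)[3,4,7];
        if ( my $count_in_file1 = $seen{"$pri_cli_ip|$pri_id"} ) {
            # do some processing with $pri_time and $count_in_file1
        }
    }
    close $INFO_PRI;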

    >
    > @pri_ip_id_table_ = keys(%pri_ip_id_table);
    >
    > for($i = 0; $i < @pri_ip_id_table_; $i++) #file 2


    Ugh... the keys for %pri_ip_id_table are 'something|somethingelse';
    how that works with that for loop is probably not what one
    would expect.

    > {
    > if($time_table{"$pri_ip_dns_table_[$i]"}) #chk if it
    > is there in file 1


    Really? Where is pri_ip_dns_table_ defined?

    > so for above example which I approach will be best ?
     
    J. Gleixner, Oct 31, 2008
    #10
  11. smallpond Guest

    On Oct 31, 1:59 pm, "" <>
    wrote:

    >
    > here is what I am trying to do.
    >
    > I have two large files. I will read one file and see if that is also
    > present in second file. I also need count how many time it is appear
    > in both the file. And according I do other processing.
    >
    > so if I process line by line both the file then it will be like (eg.
    > file1 has 10 line and file2 has 10 line. for each line file1 it will
    > loop 10 times. so total 100 loops.) I am dealing millions of lines so
    > this approach will be very slow.
    >



    This problem was solved 50 years ago. You sort the two files and then
    take one pass through both comparing records. Why are you reinventing
    the wheel?
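
    For example (untested, and assuming both inputs have already been
    boiled down to hypothetical sorted key files, one "cli_ip|id" per line,
    duplicates kept):

    use strict;
    use warnings;

    open( my $A, '<', 'sorted_a.txt' ) or die "sorted_a.txt: $!\n";
    open( my $B, '<', 'sorted_b.txt' ) or die "sorted_b.txt: $!\n";

    my $line_a = <$A>;
    my $line_b = <$B>;
    while ( defined $line_a and defined $line_b ) {
        chomp( my $ka = $line_a );
        chomp( my $kb = $line_b );
        if    ( $ka lt $kb ) { $line_a = <$A>; }
        elsif ( $ka gt $kb ) { $line_b = <$B>; }
        else {
            # Key present in both files: count its occurrences on each side.
            my ( $na, $nb ) = ( 0, 0 );
            while ( defined $line_a and $line_a =~ /^\Q$ka\E$/ ) { $na++; $line_a = <$A>; }
            while ( defined $line_b and $line_b =~ /^\Q$ka\E$/ ) { $nb++; $line_b = <$B>; }
            print "$ka: $na in A, $nb in B\n";
        }
    }
    close $A;
    close $B;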

    --S
     
    smallpond, Oct 31, 2008
    #11
  12. Guest

    "" <> wrote:

    > > > Either get larger hardware (expensive) or rethink your basic
    > > > approach, e.g. use a database system or compute your desired results
    > > > on the fly while parsing through the file or write intermediate
    > > > results to a file in a format that later can be processed line by
    > > > line or by any other of the gazillions ways of preversing RAM. Don't
    > > > you learn those techniques in basic computer science classes any
    > > > more?

    > >
    > > > jue

    > >
    > > output to a file and using it again will take lot of time. It will be
    > > very slow.


    That depends on how you do it.

    > >
    > > will be helpful in speed if I use DB_FILE module


    That depends on what you are comparing it to. Compared to an in-memory
    hash, DB_File makes things slower, not faster. Except in the sense that
    something which runs out of memory and dies before completing the job is
    infinitely slow, so preventing that is, in a sense, faster. One exception
    I know of would be if one of the files is constant, so it only needs to
    be turned into a DB_File once, and if only a small fraction of the keys
    are ever probed by the process driven by the other file. Then it could
    be faster.

    Also, DB_File doesn't take nested structures, so you would have to flatten
    your HoA. Once you flatten it, it might fit in memory anyway.
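
    If you do try DB_File, the tie itself is only a few lines; roughly this
    (untested, with a made-up database file name, storing the flattened
    string values mentioned above):

    use strict;
    use warnings;
    use Fcntl;
    use DB_File;

    # Tie a hash to an on-disk Berkeley DB file; values must be flat
    # strings, so the hash-of-arrays becomes "time|time|..." per key.
    my %time_table;
    tie %time_table, 'DB_File', 'times.db', O_RDWR|O_CREAT, 0666, $DB_HASH
        or die "Cannot tie times.db: $!\n";

    $time_table{'10.0.0.1|42'} .= '12:00:01|';

    my @times = split /\|/, $time_table{'10.0.0.1|42'};
    print "@times\n";

    untie %time_table;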

    >
    > here is what I am trying to do.
    >
    > I have two large files. I will read one file and see if that is also
    > present in second file. I also need count how many time it is appear
    > in both the file. And according I do other processing.


    If you *only* need to count, then you don't need the HoA in the first
    place.

    > so if I process line by line both the file then it will be like (eg.
    > file1 has 10 line and file2 has 10 line. for each line file1 it will
    > loop 10 times. so total 100 loops.) I am dealing millions of lines so
    > this approach will be very slow.


    I don't think anyone was recommending that you do a Cartesian join on the
    files. You could break the data up into files by hashing on IP address and
    making a separate file for each hash value. For each hash bucket you would
    have two files, one from each starting file, and they could be processed
    together with your existing script. Or you could reformat the two files
    and then sort them jointly, which would group all the like keys together
    for you for later processing.

    >
    > @pri_ip_id_table_ = keys(%pri_ip_id_table);


    For very large hashes when you have memory issues, you should iterate
    over it with "each" rather than building a list of keys.

    >
    > for($i = 0; $i < @pri_ip_id_table_; $i++) #file 2
    > {
    > if($time_table{"$pri_ip_dns_table_[$i]"})
    > {
    > #do some processing.


    Could you "do some processing" incrementally, as each line from file 2 is
    encountered, rather than having to load all keys of file2 into memory
    at once?

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    The costs of publication of this article were defrayed in part by the
    payment of page charges. This article must therefore be hereby marked
    advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
    this fact.
     
    , Oct 31, 2008
    #12
  13. "" <> wrote:
    >I have two large files. I will read one file and see if that is also
    >present in second file.


    The way you wrote this means you are checking whether file A is a subset
    of file B. However, I have a strong feeling you are talking about the
    records in each file, not the files themselves.

    >I also need count how many time it is appear
    >in both the file. And according I do other processing.


    >so if I process line by line both the file then it will be like (eg.
    >file1 has 10 line and file2 has 10 line. for each line file1 it will
    >loop 10 times. so total 100 loops.) I am dealing millions of lines so
    >this approach will be very slow.


    So you need to pre-process your data.

    One possibility: read only the smaller file into a hash. Then you can
    compare the larger file line by line against this hash. This is a linear
    algorithm. Of course this only works if at least the relevant data from
    the smaller file will fit into RAM.

    Another approach: sort both input files. There are many sorting
    algorithms around, including those that sort completely on disk and
    require very little RAM. They were very popular back when 32kB was a
    lot of memory. Then you can walk through both files line by line in
    parallel, requiring only a tiny little bit of RAM.
    Depending upon the sorting algorithm this would be O(n log n) or
    somewhat worse.

    Yet another option: put your relevant data into a database and use
    database operators to extract the information you want, in your case a
    simple intersection: all records that are in both A and B. Database
    systems are optimized to handle large sets of data efficiently.
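
    As a rough illustration of that route (a sketch only, assuming DBI with
    the DBD::SQLite driver; the database, table and column names are made
    up, and loading the two files into the tables is left out):

    use strict;
    use warnings;
    use DBI;

    my $dbh = DBI->connect( 'dbi:SQLite:dbname=logs.db', '', '',
                            { RaiseError => 1 } );

    # All (cli_ip, id) pairs that occur in both file_a and file_b.
    my $sth = $dbh->prepare(q{
        SELECT cli_ip, id FROM file_a
        INTERSECT
        SELECT cli_ip, id FROM file_b
    });
    $sth->execute;
    while ( my ($ip, $id) = $sth->fetchrow_array ) {
        print "$ip|$id is present in both files\n";
    }
    $dbh->disconnect;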

    >this is my current code. It runs fine with small file.


    Well, that is great. But it seems you still don't believe me when I'm
    saying that your problem cannot be fixed by a little tweak in your
    existing code. Any gain you may get by storing a smaller data item or
    similar will very soon be eaten up by larger data sets.
    THIS IS NOT GOING TO WORK. YOU HAVE TO RETHINK YOUR APPROACH AND CHOOSE
    A DIFFERENT STRATEGY/ALGORITHM!

    jue
     
    Jürgen Exner, Oct 31, 2008
    #13
  14. Jürgen Exner <> wrote:
    >"" <> wrote:
    >>I have two large files. I will read one file and see if that is also
    >>present in second file.

    >
    >The way you wrote this means you are checking if file A is a subset of
    >file B. However I have a strong feeling, you are talking about the
    >records in each file, not the files themself.
    >
    >>I also need count how many time it is appear
    >>in both the file. And according I do other processing.

    >
    >>so if I process line by line both the file then it will be like (eg.
    >>file1 has 10 line and file2 has 10 line. for each line file1 it will
    >>loop 10 times. so total 100 loops.) I am dealing millions of lines so
    >>this approach will be very slow.

    >
    >So you need to pre-process your data.
    >
    >One possibility: read only the smaller file into a hash. Then you can
    >compare the larger file line by line against this hash. This is a linear
    >algorithm. Of course this only works if at least the relevant data from
    >the smaller file will fit into RAM.
    >
    >Another approach: sort both input files. There are many sorting
    >algorithms around, including those that sort completely on disk and
    >require very minimum RAM. They were very popular back when 32kB was a
    >lot of memory. Then you can walk through both files line by line in
    >parallel, requiring only a tiny little bit of RAM.
    >Depending upon the sorting algorithm this would be O(n)log(n) or
    >somewhat worse.
    >
    >Yet another option: put your relevant data into a database and use
    >database operators to extract the information you want, in your case a
    >simple intersection: all records, that are in A and in B. Database
    >systems are optimized to handle large sets of data efficiently.


    Forgot one other common approach: bucketize your data.
    Create buckets of IPs or IDs or whatever criteria works for your case.
    Then sort the data into 20 or 50 or 100 individual buckets (aka files)
    for each of your input files. And then compare bucket x from file A with
    bucket x from file B.
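
    A bare-bones, untested sketch of the bucketing step for one input file
    (file names and the bucket count are arbitrary; field positions follow
    the earlier code):

    use strict;
    use warnings;

    my $buckets = 50;                # pick so that one bucket fits in RAM
    my @out;
    for my $n ( 0 .. $buckets - 1 ) {
        open( $out[$n], '>', "bucket_A.$n" ) or die "bucket_A.$n: $!\n";
    }

    open( my $in, '<', 'fileA.log' ) or die "fileA.log: $!\n";
    while (<$in>) {
        my ( $time, $cli_ip, $id ) = (split /\|/)[3,4,7];
        my $n = unpack( '%32C*', "$cli_ip|$id" ) % $buckets;  # cheap checksum
        print { $out[$n] } "$cli_ip|$id|$time\n";
    }
    close $in;
    close $_ for @out;

    # Do the same for file B, then compare bucket_A.$n against bucket_B.$n
    # one pair at a time with the existing hash-based script.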

    jue
     
    Jürgen Exner, Oct 31, 2008
    #14
  15. Guest

    On Fri, 31 Oct 2008 14:32:05 -0400, Charlton Wilbur <> wrote:

    >>>>>> "JE" == Jürgen Exner <> writes:

    >
    > JE> Don't you learn those techniques in basic computer science
    > JE> classes any more?
    >
    >The assumption that someone who is getting paid to program has had -- or
    >even has had any interest in -- computer science classes gets less
    >tenable with each passing day.
    >
    >Charlton


    Well said.. that should be its own thread.

    sln
     
    , Oct 31, 2008
    #15
  16. Guest

    On Fri, 31 Oct 2008 13:09:23 -0700, Jürgen Exner <> wrote:

    >"" <> wrote:
    >>I have two large files. I will read one file and see if that is also
    >>present in second file.

    >
    >The way you wrote this means you are checking if file A is a subset of
    >file B. However I have a strong feeling, you are talking about the
    >records in each file, not the files themself.
    >
    >>I also need count how many time it is appear
    >>in both the file. And according I do other processing.

    >
    >>so if I process line by line both the file then it will be like (eg.
    >>file1 has 10 line and file2 has 10 line. for each line file1 it will
    >>loop 10 times. so total 100 loops.) I am dealing millions of lines so
    >>this approach will be very slow.

    >
    >So you need to pre-process your data.
    >
    >One possibility: read only the smaller file into a hash. Then you can
    >compare the larger file line by line against this hash. This is a linear
    >algorithm. Of course this only works if at least the relevant data from
    >the smaller file will fit into RAM.
    >
    >Another approach: sort both input files. There are many sorting
    >algorithms around, including those that sort completely on disk and
    >require very minimum RAM. They were very popular back when 32kB was a
    >lot of memory. Then you can walk through both files line by line in
    >parallel, requiring only a tiny little bit of RAM.
    >Depending upon the sorting algorithm this would be O(n)log(n) or
    >somewhat worse.
    >
    >Yet another option: put your relevant data into a database and use
    >database operators to extract the information you want, in your case a
    >simple intersection: all records, that are in A and in B. Database
    >systems are optimized to handle large sets of data efficiently.
    >
    >>this is my current code. It runs fine with small file.

    >
    >Well, that is great. But it seems you still don't believe me when I'm
    >saying that your problem cannot be fixed by a little tweak in your
    >existing code. Any gain you may get by storing a smaller data item or
    >similar will very soon be eaten up by larger data sets.
    >THIS IS NOT GOING TO WORK. YOU HAVE TO RETHINK YOUR APPROACH AND CHOOSE
    >A DIFFERENT STRATEGIE/ALGORITHM!
    >
    >jue


    He cannot get past the idea of 'millions' of lines in a file, even
    though he states items of interest. He won't think of items, just
    the millions of lines.

    In today's large data mining, there are billions of lines to consider.
    Of course the least common denominator reduces that down to billions
    of items.

    Like a hash, it can be separated into alphabetical sequence files,
    matched with available memory, usually 16 gigabytes, then reduced
    exponentially until the desired form is achieved.

    But his outlook is panicky and without resolve. The world is coming
    to an end for him and he would like to share it with the world.

    sln
     
    , Oct 31, 2008
    #16
  17. smallpond wrote:
    > On Oct 31, 1:59 pm, "" <>
    > wrote:
    >
    >> here is what I am trying to do.
    >>
    >> I have two large files. I will read one file and see if that is also
    >> present in second file. I also need count how many time it is appear
    >> in both the file. And according I do other processing.
    >>
    >> so if I process line by line both the file then it will be like (eg.
    >> file1 has 10 line and file2 has 10 line. for each line file1 it will
    >> loop 10 times. so total 100 loops.) I am dealing millions of lines so
    >> this approach will be very slow.
    >>

    >
    >
    > This problem was solved 50 years ago.


    At least 80; the IBM 077 punched-card collator came out in 1937.
    --
    John W. Kennedy
    "Only an idiot fights a war on two fronts. Only the heir to the
    throne of the kingdom of idiots would fight a war on twelve fronts"
    -- J. Michael Straczynski. "Babylon 5", "Ceremonies of Light and Dark"
     
    John W Kennedy, Nov 1, 2008
    #17
  18. John W Kennedy wrote:
    > smallpond wrote:
    >> On Oct 31, 1:59 pm, "" <>
    >> wrote:
    >>
    >>> here is what I am trying to do.
    >>>
    >>> I have two large files. I will read one file and see if that is also
    >>> present in second file. I also need count how many time it is appear
    >>> in both the file. And according I do other processing.
    >>>
    >>> so if I process line by line both the file then it will be like (eg.
    >>> file1 has 10 line and file2 has 10 line. for each line file1 it will
    >>> loop 10 times. so total 100 loops.) I am dealing millions of lines so
    >>> this approach will be very slow.
    >>>

    >>
    >>
    >> This problem was solved 50 years ago.

    >
    > At least 80; the IBM 077 punched-card collator came out in 1937.


    Arrggghhhh! Make that 70 (or 71), of course.

    --
    John W. Kennedy
    "Only an idiot fights a war on two fronts. Only the heir to the
    throne of the kingdom of idiots would fight a war on twelve fronts"
    -- J. Michael Straczynski. "Babylon 5", "Ceremonies of Light and Dark"
     
    John W Kennedy, Nov 1, 2008
    #18
  19. David Combs Guest

    In article <>,
    Jürgen Exner <> wrote:

    >
    >Another approach: sort both input files. There are many sorting
    >algorithms around,


    Question: why not simply use the standard Unix (Linux) "sort" program?

    Doesn't that do all the right things? Quicksort, falling back to
    on-disk merging if it needs to, etc.?

    (And hopefully it has that within-the-last-10-years *massive*
    speedup on (a) already-sorted files and (b) sorting ASCII files
    discovered by that algorithm-book-writing prof at Princeton.)
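
    For what it's worth, the external sort can be driven straight from
    Perl; a sketch (hypothetical file names, keys assumed to be the
    pipe-separated "cli_ip|id" lines discussed earlier; exact sort options
    may vary by platform):

    # Sort both extracted key files on disk with the system sort,
    # then compare them with a single merge pass.
    for my $pair ( [ 'keys_a.txt', 'sorted_a.txt' ],
                   [ 'keys_b.txt', 'sorted_b.txt' ] ) {
        my ( $in, $out ) = @$pair;
        system( 'sort', '-t', '|', '-k', '1,2', '-o', $out, $in ) == 0
            or die "sort of $in failed: $?\n";
    }

    The list form of system() keeps the shell from interpreting the '|'
    delimiter as a pipe.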


    >including those that sort completely on disk and
    >require very minimum RAM. They were very popular back when 32kB was a
    >lot of memory. Then you can walk through both files line by line in
    >parallel, requiring only a tiny little bit of RAM.
    >Depending upon the sorting algorithm this would be O(n)log(n) or
    >somewhat worse.
    >



    Thanks,

    David
     
    David Combs, Dec 1, 2008
    #19
  20. David Combs Guest

    In article <>,
    <> wrote:
    >On Fri, 31 Oct 2008 14:32:05 -0400, Charlton Wilbur <> wrote:
    >
    >>>>>>> "JE" == Jürgen Exner <> writes:

    >>
    >> JE> Don't you learn those techniques in basic computer science
    >> JE> classes any more?
    >>
    >>The assumption that someone who is getting paid to program has had -- or
    >>even has had any interest in -- computer science classes gets less
    >>tenable with each passing day.
    >>
    >>Charlton

    >
    >Well said.. that should be its own thread.
    >
    >sln


    Like hiring surgeons who've never had biology.

    "Look, I can cut, can't I? What ELSE could I possibly need to know?"


    David
     
    David Combs, Dec 1, 2008
    #20