Frequency in large datasets

Discussion in 'Perl Misc' started by Cosmic Cruizer, May 1, 2008.

  1. I've been able to reduce my dataset by 75%, but it still leaves me with a
    file of 47 gigs. I'm trying to find the frequency of each line using:

    open(TEMP, "< $tempfile") || die "cannot open file $tempfile:
    $!";
    foreach (<TEMP>) {
    $seen{$_}++;
    }
    close(TEMP) || die "cannot close file
    $tempfile: $!";

    My program keeps aborting after a few minutes because the computer runs out
    of memory. I have four gigs of RAM and the total paging file is 10 megs,
    but Perl does not appear to be using it.

    How can I find the frequency of each line using such a large dataset? I
    tried to have two output files where I kept moving the data back and forth
    each time I grabbed the next line from TEMP instead of using $seen{$_}++,
    but I did not have much success.
    Cosmic Cruizer, May 1, 2008
    #1

  2. Cosmic Cruizer wrote:
    > I've been able to reduce my dataset by 75%, but it still leaves me with a
    > file of 47 gigs. I'm trying to find the frequency of each line using:
    >
    > open(TEMP, "< $tempfile") || die "cannot open file $tempfile:
    > $!";
    > foreach (<TEMP>) {
    > $seen{$_}++;
    > }
    > close(TEMP) || die "cannot close file
    > $tempfile: $!";
    >
    > My program keeps aborting after a few minutes because the computer runs out
    > of memory.


    This line:

    > foreach (<TEMP>) {


    reads the whole file into memory. You should read the file line by line
    instead by replacing it with:

    while (<TEMP>) {

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
    Gunnar Hjalmarsson, May 1, 2008
    #2

  3. Cosmic Cruizer <> wrote in
    news:Xns9A90C3D86EFCEccruizermydejacom@207.115.17.102:

    > I've been able to reduce my dataset by 75%, but it still leaves me
    > with a file of 47 gigs. I'm trying to find the frequency of each line
    > using:
    >
    > open(TEMP, "< $tempfile") || die "cannot open file
    > $tempfile:
    > $!";
    > foreach (<TEMP>) {


    Well, that is simply silly. You have a huge file yet you try to read all
    of it into memory. Ain't gonna work.

    How long is each line and how many unique lines do you expect?

    If the number of unique lines is small relative to the number of total
    lines, I do not see any difficulty if you get rid of the boneheaded for
    loop.

    > $seen{$_}++;
    > }
    > close(TEMP) || die "cannot close file
    > $tempfile: $!";



    my %seen;

    open my $TEMP, '<', $tempfile
    or die "Cannot open '$tempfile': $!";

    ++ $seen{ $_ } while <$TEMP>;

    close $TEMP
    or die "Cannot close '$tempfile': $!";

    > My program keeps aborting after a few minutes because the computer
    > runs out of memory. I have four gigs of ram and the total paging files
    > is 10 megs, but Perl does not appear to be using it.


    I don't see much point to having a 10 MB swap file. To make the best use
    of 4 GB physical memory, AFAIK, you need to be running a 64 bit OS.

    > How can I find the frequency of each line using such a large dataset?
    > I tried to have two output files where I kept moving the data back and
    > forth each time I grabbed the next line from TEMP instead of using
    > $seen{$_}++, but I did not have much success.


    If the number of unique lines is large, I would periodically store the
    current counts, clear the hash, keep processing the original file. Then,
    when you reach the end of the original data file, go back to the stored
    counts (which will have multiple entries for each unique line) and
    aggregate the information there.
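
    For concreteness, here is a rough sketch of that approach. The scratch
    file name, the flush threshold and the "count\tline" record format are
    only placeholders, not anything from the original post:

    use strict;
    use warnings;

    my $tempfile = 'bigfile.dat';         # stand-in for the real data file
    my $partial  = 'partial_counts.txt';  # scratch file for flushed counts
    my $flush_at = 1_000_000;             # flush %seen once it holds this many keys

    open my $in,  '<', $tempfile or die "cannot open '$tempfile': $!";
    open my $out, '>', $partial  or die "cannot open '$partial': $!";

    my %seen;
    while (my $line = <$in>) {
        chomp $line;
        $seen{$line}++;
        if (keys %seen >= $flush_at) {
            print {$out} "$seen{$_}\t$_\n" for keys %seen;
            %seen = ();                   # release the memory and keep reading
        }
    }
    print {$out} "$seen{$_}\t$_\n" for keys %seen;    # final flush
    close $out or die "cannot close '$partial': $!";
    close $in  or die "cannot close '$tempfile': $!";

    # Second pass: merge the partial counts. This pass fits in memory only
    # if the number of *unique* lines does, which is the assumption here.
    my %total;
    open my $agg, '<', $partial or die "cannot open '$partial': $!";
    while (<$agg>) {
        chomp;
        my ($count, $line) = split /\t/, $_, 2;
        $total{$line} += $count;
    }
    close $agg or die "cannot close '$partial': $!";

    print "$total{$_}\t$_\n" for sort { $total{$b} <=> $total{$a} } keys %total;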

    Sinan

    --
    A. Sinan Unur <>
    (remove .invalid and reverse each component for email address)

    comp.lang.perl.misc guidelines on the WWW:
    http://www.rehabitation.com/clpmisc/
    A. Sinan Unur, May 1, 2008
    #3
  4. Cosmic Cruizer <> wrote:
    > I've been able to reduce my dataset by 75%, but it still leaves me with a
    > file of 47 gigs. I'm trying to find the frequency of each line using:
    >
    > open(TEMP, "< $tempfile") || die "cannot open file
    > $tempfile: $!";
    > foreach (<TEMP>) {
    > $seen{$_}++;
    > }
    > close(TEMP) || die "cannot close file
    > $tempfile: $!";


    If each line shows up a million times on average, that shouldn't
    be a problem. If each line shows up twice on average, then it won't
    work so well with 4G of RAM. We don't know which of those is closer to your
    case.

    > My program keeps aborting after a few minutes because the computer runs
    > out of memory. I have four gigs of ram and the total paging files is 10
    > megs, but Perl does not appear to be using it.


    If the program is killed due to running out of memory, then I would
    say that the program *does* appear to be using the available memory. What
    makes you think it isn't using it?


    > How can I find the frequency of each line using such a large dataset?


    I probably wouldn't use Perl, but rather the OS's utilities. For example
    on linux:

    sort big_file | uniq -c
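
    A rough sketch of pulling those counts back into Perl, if needed (the
    file name is a placeholder). sort(1) does an external merge sort,
    spilling to temporary files on disk, so it can handle input far larger
    than RAM:

    open my $pipe, '-|', 'sort big_file | uniq -c'
        or die "cannot start sort/uniq pipeline: $!";
    while (my $row = <$pipe>) {
        my ($count, $line) = split ' ', $row, 2;   # uniq -c prints "  COUNT line"
        print "$count\t$line";                     # $line still carries its newline
    }
    close $pipe or die "sort/uniq pipeline failed: $!";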


    > I
    > tried to have two output files where I kept moving the data back and forth
    > each time I grabbed the next line from TEMP instead of using $seen{$_}++,
    > but I did not have much success.


    But in line 42.

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    The costs of publication of this article were defrayed in part by the
    payment of page charges. This article must therefore be hereby marked
    advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
    this fact.
    , May 1, 2008
    #4
  5. Gunnar Hjalmarsson <> wrote:
    > Cosmic Cruizer wrote:
    > > I've been able to reduce my dataset by 75%, but it still leaves me with
    > > a file of 47 gigs. I'm trying to find the frequency of each line using:
    > >
    > > open(TEMP, "< $tempfile") || die "cannot open file
    > > $tempfile: $!";
    > > foreach (<TEMP>) {
    > > $seen{$_}++;
    > > }
    > > close(TEMP) || die "cannot close file
    > > $tempfile: $!";
    > >
    > > My program keeps aborting after a few minutes because the computer runs
    > > out of memory.

    >
    > This line:
    >
    > > foreach (<TEMP>) {

    >
    > reads the whole file into memory. You should read the file line by line
    > instead by replacing it with:
    >
    > while (<TEMP>) {


    Duh, I completely overlooked that.

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    The costs of publication of this article were defrayed in part by the
    payment of page charges. This article must therefore be hereby marked
    advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
    this fact.
    , May 1, 2008
    #5
  6. Gunnar Hjalmarsson <> wrote in
    news::

    > Cosmic Cruizer wrote:
    >> I've been able to reduce my dataset by 75%, but it still leaves me
    >> with a file of 47 gigs. I'm trying to find the frequency of each line
    >> using:
    >>
    >> open(TEMP, "< $tempfile") || die "cannot open file
    >> $tempfile:
    >> $!";
    >> foreach (<TEMP>) {
    >> $seen{$_}++;
    >> }
    >> close(TEMP) || die "cannot close file
    >> $tempfile: $!";
    >>
    >> My program keeps aborting after a few minutes because the computer
    >> runs out of memory.

    >
    > This line:
    >
    >> foreach (<TEMP>) {

    >
    > reads the whole file into memory. You should read the file line by
    > line instead by replacing it with:
    >
    > while (<TEMP>) {
    >


    <sigh> As both you and Sinan pointed out... I'm using foreach. Everywhere
    else I used the while statement to get me to this point. This solves the
    problem.

    Thank you.
    Cosmic Cruizer, May 1, 2008
    #6
  7. Cosmic Cruizer <> wrote:
    >I've been able to reduce my dataset by 75%, but it still leaves me with a
    >file of 47 gigs. I'm trying to find the frequency of each line using:
    >
    > open(TEMP, "< $tempfile") || die "cannot open file $tempfile:
    >$!";
    > foreach (<TEMP>) {


    This slurps the whole file (yes, all 47GB) into a list and then iterates
    over that list. Read the file line-by-line instead:

    while (<TEMP>) {

    This should work unless you have a lot of different data points.

    jue
    Jürgen Exner, May 1, 2008
    #7
  8. A. Sinan Unur <> wrote:
    > Cosmic Cruizer <> wrote in
    > news:Xns9A90C3D86EFCEccruizermydejacom@207.115.17.102:
    >
    >> I've been able to reduce my dataset by 75%, but it still leaves me
    >> with a file of 47 gigs. I'm trying to find the frequency of each line
    >> using:
    >>
    >> open(TEMP, "< $tempfile") || die "cannot open file
    >> $tempfile:
    >> $!";
    >> foreach (<TEMP>) {

    >
    > Well, that is simply silly. You have a huge file yet you try to read all
    > of it into memory. Ain't gonna work.


    I'm not sure why it's silly as such - perhaps he didn't know that
    "foreach" would read all the file into memory.


    > If the number of unique lines is small relative to the number of total
    > lines, I do not see any difficulty if you get rid of the boneheaded for
    > loop.


    Again, why is it "boneheaded"? The fact that foreach reads the entire
    file into memory isn't something I'd expect people to know
    automatically.
    Ben Bullock, May 1, 2008
    #8
  9. (Ben Bullock) wrote in
    news:fvbj3s$l7u$:

    > A. Sinan Unur <> wrote:
    >> Cosmic Cruizer <> wrote in
    >> news:Xns9A90C3D86EFCEccruizermydejacom@207.115.17.102:
    >>


    ....

    >>> foreach (<TEMP>) {

    >>
    >> Well, that is simply silly. You have a huge file yet you try to read
    >> all of it into memory. Ain't gonna work.

    >
    > I'm not sure why it's silly as such - perhaps he didn't know that
    > "foreach" would read all the file into memory.


    Well, I assumed he didn't. But this is one of those things, had I found
    myself doing it, after spending hours and hours trying to work out a way
    of processing the file, I would have slapped my forehead and said, "now
    that was just a silly thing to do". Coupled with the "ain't" I assumed
    my meaning was clear. I wasn't calling the OP names, but trying to get a
    message across very strongly.

    >> If the number of unique lines is small relative to the number of
    >> total lines, I do not see any difficulty if you get rid of the
    >> boneheaded for loop.

    >
    > Again, why is it "boneheaded"?


    Because there is no hope of anything working so long as that for loop is
    there.

    > The fact that foreach reads the entire file into memory isn't
    > something I'd expect people to know automatically.


    Maybe this helps:

    From perlfaq3.pod:

    <blockquote>
    * How can I make my Perl program take less memory?

    ....

    Of course, the best way to save memory is to not do anything to waste it
    in the first place. Good programming practices can go a long way toward
    this:

    * Don't slurp!

    Don't read an entire file into memory if you can process it line by
    line. Or more concretely, use a loop like this:
    </blockquote>

    Maybe you would like to read the rest.

    So, calling the for loop boneheaded is a little stronger than "Bad
    Idea", but then what is simply a bad idea with a 200 MB file (things
    will still work but less efficiently) is boneheaded with a 47 GB file
    (there is no chance of the program working).

    There is a reason "Don't slurp!" appears with an exclamation mark and as
    the first recommendation in the FAQ list answer.

    Hope this helps you become more comfortable with the notion that reading
    a 47 GB file is a boneheaded move. It is boneheaded if I do it, if Larry
    Wall does it, if Superman does it ... you get the picture I hope.

    Sinan

    --
    A. Sinan Unur <>
    (remove .invalid and reverse each component for email address)

    comp.lang.perl.misc guidelines on the WWW:
    http://www.rehabitation.com/clpmisc/
    A. Sinan Unur, May 1, 2008
    #9
  10. On May 1, 7:26 am, "A. Sinan Unur" <> wrote:
    > (Ben Bullock) wrote innews:fvbj3s$l7u$:
    >
    > > A. Sinan Unur <> wrote:

    >
    > Hope this helps you become more comfortable with the notion that reading
    > a 47 GB file is a boneheaded move. It is boneheaded if I do it, if Larry
    > Wall does it, if Superman does it ... you get the picture I hope.
    >


    I don't think it would be boneheaded if Superman did it...I mean, he's
    SUPERMAN.
    nolo contendere, May 1, 2008
    #10
  11. nolo contendere <> wrote in
    news::

    > On May 1, 7:26 am, "A. Sinan Unur" <> wrote:
    >> (Ben Bullock) wrote
    >> innews:fvbj3s$l7u$

    > net.ne.jp:
    >>
    >> > A. Sinan Unur <> wrote:

    >>
    >> Hope this helps you become more comfortable with the notion that
    >> reading a 47 GB file is a boneheaded move. It is boneheaded if I do
    >> it, if Larry Wall does it, if Superman does it ... you get the
    >> picture I hope.
    >>

    >
    > I don't think it would be boneheaded if Superman did it...I mean, he's
    > SUPERMAN.


    But attempting to slurp a 47 GB file is the equivalent of having a
    cryptonite slurpee in the morning.

    Not good.

    ;-)

    Sinan

    --
    A. Sinan Unur <>
    (remove .invalid and reverse each component for email address)

    comp.lang.perl.misc guidelines on the WWW:
    http://www.rehabitation.com/clpmisc/
    A. Sinan Unur, May 1, 2008
    #11
  12. >>>>> "ASU" == A Sinan Unur <> writes:

    ASU> nolo contendere <> wrote in
    ASU> news::

    >> On May 1, 7:26 am, "A. Sinan Unur" <> wrote:
    >>> (Ben Bullock) wrote
    >>> innews:fvbj3s$l7u$

    >> net.ne.jp:
    >>>
    >>> > A. Sinan Unur <> wrote:
    >>>
    >>> Hope this helps you become more comfortable with the notion that
    >>> reading a 47 GB file is a boneheaded move. It is boneheaded if I do
    >>> it, if Larry Wall does it, if Superman does it ... you get the
    >>> picture I hope.
    >>>

    >>
    >> I don't think it would be boneheaded if Superman did it...I mean, he's
    >> SUPERMAN.


    ASU> But attempting to slurp a 47 GB files is the equivalent of having a
    ASU> cryptonite slurpee in the morning.

    ASU> Not good.

    ASU> ;-)

    and i wouldn't even recommend file::slurp for that job!! :)

    uri

    --
    Uri Guttman ------ -------- http://www.sysarch.com --
    ----- Perl Code Review , Architecture, Development, Training, Support ------
    --------- Free Perl Training --- http://perlhunter.com/college.html ---------
    --------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------
    Uri Guttman, May 1, 2008
    #12
  13. A. Sinan Unur wrote:
    > nolo contendere <> wrote in
    > news::
    >
    >> On May 1, 7:26 am, "A. Sinan Unur" <> wrote:
    >>> (Ben Bullock) wrote
    >>> innews:fvbj3s$l7u$

    >> net.ne.jp:
    >>>> A. Sinan Unur <> wrote:
    >>> Hope this helps you become more comfortable with the notion that
    >>> reading a 47 GB file is a boneheaded move. It is boneheaded if I do
    >>> it, if Larry Wall does it, if Superman does it ... you get the
    >>> picture I hope.
    >>>

    >> I don't think it would be boneheaded if Superman did it...I mean, he's
    >> SUPERMAN.

    >
    > But attempting to slurp a 47 GB files is the equivalent of having a
    > cryptonite slurpee in the morning.


    s/cryptonite/kryptonite/;


    John
    --
    Perl isn't a toolbox, but a small machine shop where you
    can special-order certain sorts of tools at low cost and
    in short order. -- Larry Wall
    John W. Krahn, May 1, 2008
    #13
  14. >>>>> "JWK" == John W Krahn <> writes:

    JWK> A. Sinan Unur wrote:
    >>> I don't think it would be boneheaded if Superman did it...I mean, he's
    >>> SUPERMAN.

    >> But attempting to slurp a 47 GB files is the equivalent of having a
    >> cryptonite slurpee in the morning.


    JWK> s/cryptonite/kryptonite/;

    what if the 47 GB was enkrypted? (sic :)

    uri

    --
    Uri Guttman ------ -------- http://www.sysarch.com --
    ----- Perl Code Review , Architecture, Development, Training, Support ------
    --------- Free Perl Training --- http://perlhunter.com/college.html ---------
    --------- Gourmet Hot Cocoa Mix ---- http://bestfriendscocoa.com ---------
    Uri Guttman, May 1, 2008
    #14
  15. Cosmic Cruizer <> wrote in
    news:Xns9A90D0E1FD16ccruizermydejacom@207.115.17.102:

    > Gunnar Hjalmarsson <> wrote in
    > news::
    >
    >> Cosmic Cruizer wrote:
    >>> I've been able to reduce my dataset by 75%, but it still leaves me
    >>> with a file of 47 gigs. I'm trying to find the frequency of each line
    >>> using:
    >>>
    >>> open(TEMP, "< $tempfile") || die "cannot open file
    >>> $tempfile:
    >>> $!";
    >>> foreach (<TEMP>) {
    >>> $seen{$_}++;
    >>> }
    >>> close(TEMP) || die "cannot close file
    >>> $tempfile: $!";
    >>>
    >>> My program keeps aborting after a few minutes because the computer
    >>> runs out of memory.

    >>
    >> This line:
    >>
    >>> foreach (<TEMP>) {

    >>
    >> reads the whole file into memory. You should read the file line by
    >> line instead by replacing it with:
    >>
    >> while (<TEMP>) {
    >>

    >
    ><sigh> As both you and Sinan pointed out... I'm using foreach. Everywhere
    > else I used the while statement to get me to this point. This solves the
    > problem.
    >
    > Thank you.


    Well... that did not make any difference at all. I still get up to about
    90% of the physical ram and the job aborts within about the same
    timeframe. From what I can tell, using while did not make any difference
    compared with using foreach. I tried using the two swapfiles idea, but that is not
    a viable solution. I guess the only thing to do is to break the files
    down into smaller chunks of about 5 gigs each. That will give me about 3
    to 4 days worth of data at a time. After that, I can look at what I have
    and decide how I can optimize the data for the next run.
    Cosmic Cruizer, May 2, 2008
    #15
  16. On May 1, 8:26 pm, Cosmic Cruizer <> wrote:
    > Cosmic Cruizer <> wrote innews:Xns9A90D0E1FD16ccruizermydejacom@207.115.17.102:
    >
    >
    >
    >
    >
    > > Gunnar Hjalmarsson <> wrote in
    > >news::

    >
    > >> Cosmic Cruizer wrote:
    > >>> I've been able to reduce my dataset by 75%, but it still leaves me
    > >>> with a file of 47 gigs. I'm trying to find the frequency of each line
    > >>> using:

    >
    > >>> open(TEMP, "< $tempfile") || die "cannot open file
    > >>> $tempfile:
    > >>> $!";
    > >>> foreach (<TEMP>) {
    > >>> $seen{$_}++;
    > >>> }
    > >>> close(TEMP) || die "cannot close file
    > >>> $tempfile: $!";

    >
    > >>> My program keeps aborting after a few minutes because the computer
    > >>> runs out of memory.

    >
    > >> This line:

    >
    > >>> foreach (<TEMP>) {

    >
    > >> reads the whole file into memory. You should read the file line by
    > >> line instead by replacing it with:

    >
    > >> while (<TEMP>) {

    >
    > ><sigh> As both you and Sinan pointed out... I'm using foreach. Everywhere
    > > else I used the while statement to get me to this point. This solves the
    > > problem.

    >
    > > Thank you.

    >
    > Well... that did not make any difference at all. I still get up to about
    > 90% of the physical ram and the job aborts within about the same
    > timeframe. From what I can tell, using while did not make any difference
    > than using foreach. I tried using the two swapfiles idea, but that is not
    > a viable solution. I guess the only thing to do is to break the files
    > down into smaller chunks of about 5 gigs each. That will give me about 3
    > to 4 days worth of data at a time. After that, I can look at what I have
    > and decide how I can optimize the data for the next run.


    While slower, you could use a DBM if %seen is
    overgrowing memory, eg,

    my $db = tie %seen, 'DB_File', [$filename, $flags, $mode, $DB_HASH]
    or die ...
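
    Filled out into a minimal runnable sketch (this assumes the DB_File
    module and Berkeley DB are available; file names are placeholders):

    use strict;
    use warnings;
    use DB_File;
    use Fcntl qw(O_RDWR O_CREAT);

    my $tempfile = 'bigfile.dat';   # stand-in for the real data file
    my $dbfile   = 'seen.db';       # on-disk hash, so %seen need not fit in RAM

    my %seen;
    tie %seen, 'DB_File', $dbfile, O_RDWR|O_CREAT, 0644, $DB_HASH
        or die "cannot tie '$dbfile': $!";

    open my $in, '<', $tempfile or die "cannot open '$tempfile': $!";
    $seen{$_}++ while <$in>;        # same counting loop, but backed by disk
    close $in or die "cannot close '$tempfile': $!";

    while (my ($line, $count) = each %seen) {
        print "$count\t$line";      # keys keep their trailing newlines
    }
    untie %seen;

    The trade-off, as noted, is speed: every increment becomes a Berkeley
    DB lookup and store rather than an in-memory hash operation.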

    --
    Charles DeRykus
    comp.llang.perl.moderated, May 2, 2008
    #16