Perl storing huge data (300MB) in a scalar

Discussion in 'Perl Misc' started by kalpanashtty@gmail.com, Dec 5, 2006.

  1. Guest

    Hello,
    This is regarding an issue we face while storing large data in a
    scalar variable. The problem is as follows:

    We have a log file with 10 lines, and each line is approximately 300MB
    long (continuous, with no line breaks). Using Perl we read each line and
    store it in a scalar variable. This works at first, but as it reads
    these huge lines, memory consumption keeps increasing and after a while
    we get "Out of memory".

    Has anyone faced this problem, and do you know how to handle this kind
    of scenario?

    Kalpana
     
    , Dec 5, 2006
    #1

  2. J.D. Baldwin Guest

    In the previous article, <> wrote:
    > Has anyone faced this problem, and do you know how to handle this
    > kind of scenario?


    I had a similar problem a few months back with huge log data that
    wasn't broken by newlines. perldoc -f getc has what you probably
    need. Something along the lines of:

    my $chunk = '';
    for ( 1 .. $howmanycharsdoyouwantatonce )
    {
        $chunk .= getc FHANDLE;
    }
    --
    _+_ From the catapult of |If anyone disagrees with any statement I make, I
    _|70|___:)=}- J.D. Baldwin |am quite prepared not only to retract it, but also
    \ / |to deny under oath that I ever made it. -T. Lehrer
    ***~~~~-----------------------------------------------------------------------
     
    J.D. Baldwin, Dec 5, 2006
    #2

  3. J.D. Baldwin wrote:
    > In the previous article, <> wrote:
    >>Has anyone faced this problem, and do you know how to handle this
    >>kind of scenario?

    >
    > I had a similar problem a few months back with huge log data that
    > wasn't broken by newlines. perldoc -f getc has what you probably
    > need. Something along the lines of:
    >
    > my $chunk = '';
    > for ( 1..$howmanycharsdoyouwantatonce )
    > {
    > $chunk .= getc FHANDLE;
    > }


    Read one character at a time? Ick!

    read FHANDLE, my $chunk, $howmanycharsdoyouwantatonce;

    Or:

    local $/ = \$howmanycharsdoyouwantatonce;
    my $chunk = <FHANDLE>;
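    Roughly, either version might be used like this to scan a huge unbroken
    "line" in fixed-size pieces; the file name huge.log and the 1MB chunk
    size are just placeholders:

    my $chunk_size = 1_048_576;    # 1MB at a time, not a whole 300MB line
    open my $fh, '<', 'huge.log' or die "Cannot open huge.log: $!";

    while ( read $fh, my $chunk, $chunk_size ) {
        # process $chunk here; it never holds more than $chunk_size bytes,
        # so memory use stays bounded no matter how long the "line" is
    }
    close $fh;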



    John
    --
    Perl isn't a toolbox, but a small machine shop where you can special-order
    certain sorts of tools at low cost and in short order. -- Larry Wall
     
    John W. Krahn, Dec 5, 2006
    #3
  4. Guest

    wrote:
    > Hello,
    > This is regarding an issue we face while storing large data in a
    > scalar variable. The problem is as follows:
    >
    > We have a log file with 10 lines, and each line is approximately 300MB
    > long (continuous, with no line breaks). Using Perl we read each line and
    > store it in a scalar variable. This works at first, but as it reads
    > these huge lines, memory consumption keeps increasing and after a while
    > we get "Out of memory".
    >
    > Has anyone faced this problem, and do you know how to handle this kind
    > of scenario?


    I write code that doesn't have this problem. Since you haven't shown
    us any of your code, I can't tell you which part of your code is the
    problem.

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    Usenet Newsgroup Service $9.95/Month 30GB
     
    , Dec 5, 2006
    #4
  5. J.D. Baldwin Guest

    In the previous article, John W. Krahn <> wrote:
    > Read one character at a time? Ick!


    There seemed to be a good reason at the time. Anyway, performance
    wasn't an issue.

    > local $/ = \$howmanycharsdoyouwantatonce;
    > my $chunk = <FHANDLE>;


    That's a cool trick, thanks.
     
    J.D. Baldwin, Dec 5, 2006
    #5
  6. Guest

    J.D. Baldwin wrote:
    > In the previous article, John W. Krahn <> wrote:
    > > Read one character at a time? Ick!

    >
    > There seemed to be a good reason at the time. Anyway, performance
    > wasn't an issue.
    >
    > > local $/ = \$howmanycharsdoyouwantatonce;
    > > my $chunk = <FHANDLE>;

    >
    > That's a cool trick, thanks.


    You might want to check further into the $/ Perl variable...

    http://perldoc.perl.org/perlvar.html#$RS

    If there is a literal string in your file that you can use as a
    pseudo-EOL, then set $/ to that string and read the file as normal.
    You'll have the advantage of not needing to check whether you read too
    little or too much, and of not having to reconstruct your lines.
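
    A minimal sketch of that idea; the "--END--" marker below is only a
    stand-in for whatever the real separator string turns out to be:

    open my $fh, '<', 'huge.log' or die "Cannot open huge.log: $!";
    local $/ = '--END--';               # pseudo end-of-line marker
    while ( my $record = <$fh> ) {
        chomp $record;                  # chomp strips $/, i.e. the marker
        # process $record here, one manageable record at a time
    }
    close $fh;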

    Perl does well with file I/O, but it will struggle when it has to
    allocate big chunks of memory to read lines like these. If you read up
    on slurping you'll see it's almost never a good idea, and from some
    benchmarking I found I was better off just reading line by line, or in
    reasonably sized fixed blocks. So I would look for some way of
    determining the real end-of-record marker.
     
    , Dec 5, 2006
    #6
  7. Ala Qumsieh Guest

    wrote:

    > Has anyone faced this problem, and do you know how to handle this
    > kind of scenario?


    use a 64-bit compiled version of Perl?

    --Ala
     
    Ala Qumsieh, Dec 6, 2006
    #7
  8. J.D. Baldwin Guest

    In the previous article, <> wrote:
    > If you read about slurp you'll see it's almost never a good idea
    > [...]


    So, a question then:

    I have a very short script that reads the output of wget $URL like so:

    my $wget_out = `/path/to/wget $URL`;

    I am absolutely assured that the output from this URL will be around
    10-15K every time. Furthermore, I need to search for a short string
    that always appears near the end of the output (so there is no
    advantage to cutting off the input after some shorter number of
    characters).

    So now that you have educated me a little, I am doing this:

    $/ = \32000; # much bigger than ever needed, small enough
    # to avoid potential memory problems in the
    # unlikely event of runaway output from wget

    my $wget_out = `/path/to/wget $URL`;

    if ( $wget_out =~ /$string_to_match/ )
    {
        # do "OK" thing
    }
    else
    {
        # do "not so OK" thing
    }

    Performance is important, but not extremely so; this script runs many
    times per hour to validate the output of certain web servers. So if
    there is overhead to the "obvious" line-by-line read-and-match method
    of doing the same thing (which will always have to read about 200
    lines before matching), then doing it that way is wasteful.

    In your opinion, is this an exception to the "almost never a good
    idea," or is this a case for slurping?

    Also, if I can determine the absolute earliest point at which
    $string_to_match could possibly appear, I suppose I can get a big
    efficiency gain out of

    my $earliest_char = 8_000;  # string of interest appears after
                                # AT LEAST 8,000 characters

    if ( substr($wget_out, $earliest_char) =~ /$string_to_match/ )
    {
    ...
    Yes?
     
    J.D. Baldwin, Dec 7, 2006
    #8
  9. Guest

    wrote:
    > In the previous article, <> wrote:
    > > If you read about slurp you'll see it's almost never a good idea
    > > [...]


    I would disagree: slurping is quite often a good idea. Slurping data
    that is, or has the potential to be, very large when doing so is
    utterly unnecessary is rarely a good idea, though.

    >
    > So, a question then:
    >
    > I have a very short script that reads the output of wget $URL like so:
    >
    > my $wget_out = `/path/to/wget $URL`;
    >
    > I am absolutely assured that the output from this URL will be around
    > 10-15K every time.


    So how does this get turned into 300MB?

    > Furthermore, I need to search for a short string
    > that always appears near the end of the output (so there is no
    > advantage to cutting off the input after some shorter number of
    > characters).
    >
    > So now that you have educated me a little, I am doing this:
    >
    > $/ = \32000; # much bigger than ever needed, small enough
    > # to avoid potential memory problems in the
    > # unlikely event of runaway output from wget
    >
    > my $wget_out = `/path/to/wget $URL`;


    Backticks in scalar context are not line oriented, so $/ is irrelevant
    to them. Even in list context, backticks seem to slurp the whole thing
    and only apply $/ to it after slurping.

    If you are really worried about runaway wget, you should either open a pipe
    and read from it yourself:

    open my $fh, "/path/to/wget $URL |" or die $!;
    $/ = \32000;
    my $wget_out = <$fh>;

    or just use system tools to do it and forget about $/ altogether:

    my $wget_out = `/path/to/wget $URL|head -c 32000`;
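
    Putting the pipe version together with the earlier match, a rough
    sketch might look like this, assuming $URL and $string_to_match are set
    as in the earlier snippets; the -q and -O - switches are my addition,
    so the page body goes to stdout rather than to a file:

    # Read at most one 32000-byte "record" from the pipe, then stop;
    # a runaway download can no longer exhaust memory.
    open my $fh, "/path/to/wget -q -O - $URL |" or die "Cannot run wget: $!";
    my $wget_out = do { local $/ = \32000; <$fh> };
    close $fh;    # waits for wget to finish (or fail)

    if ( defined $wget_out and $wget_out =~ /$string_to_match/ ) {
        # do "OK" thing
    }
    else {
        # do "not so OK" thing
    }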

    Xho

     
    , Dec 7, 2006
    #9
  10. J.D. Baldwin <> wrote:

    > my $wget_out = `/path/to/wget $URL`;



    You can make it more portable by doing it in native Perl
    rather than shelling out:

    use LWP::Simple;
    my $wget_out = get $URL;


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
     
    Tad McClellan, Dec 7, 2006
    #10
  11. J.D. Baldwin Guest

    In the previous article, <> wrote, quoting me:
    > > So, a question then:
    > >
    > > I have a very short script that reads the output of wget $URL like so:
    > >
    > > my $wget_out = `/path/to/wget $URL`;
    > >
    > > I am absolutely assured that the output from this URL will be around
    > > 10-15K every time.

    >
    > So how does this get turned into 300MB?


    That 300MB thing was the other guy; I just piggybacked my question
    onto his.

    > > $/ = \32000; # much bigger than ever needed, small enough
    > > # to avoid potential memory problems in the
    > > # unlikely event of runaway output from wget
    > >
    > > my $wget_out = `/path/to/wget $URL`;

    >
    > Backticks in scalar context are not line oriented, so $/ is
    > irrelevant to them. Even in list context, backticks seem to slurp
    > the whole thing and only apply $/ to it after slurping.


    Ah, I knew the IFS didn't matter, but I didn't extrapolate that into a
    realization that $/ wouldn't matter at all.

    > If you are really worried about runaway wget, you should either open
    > a pipe and read from it yourself:
    >
    > open my $fh, "/path/to/wget $URL |" or die $!;
    > $/ = \32000;
    > my $wget_out = <$fh>;


    I was trying to avoid doing an open -- not that it's a big deal -- and
    I'm not 100% sure that pipe trick will work ...

    > or just use system tools to do it and forget about $/ altogether:
    >
    > my $wget_out = `/path/to/wget $URL|head -c 32000`;


    ... because, sadly, I am doing this on a Windows platform, where I
    have no head (ha ha).

    I'll probably just drop back to the open method described above, thanks.
     
    J.D. Baldwin, Dec 7, 2006
    #11
  12. J.D. Baldwin Guest

    In the previous article, Tad McClellan <> wrote:
    > > my $wget_out = `/path/to/wget $URL`;

    >
    >
    > You can make it more portable by doing it in native Perl
    > rather than shelling out:
    >
    > use LWP::Simple;
    > my $wget_out = get $URL;


    That's kind of a Phase II plan ... getting new Perl modules installed
    on these monitoring systems is non-trivial, but for political reasons
    rather than technical ones.
     
    J.D. Baldwin, Dec 7, 2006
    #12
  13. Uri Guttman Guest

    >>>>> "JDB" == J D Baldwin <> writes:

    JDB> In the previous article, <> wrote, quoting me:
    >> > So, a question then:
    >> >
    >> > I have a very short script that reads the output of wget $URL like so:
    >> >
    >> > my $wget_out = `/path/to/wget $URL`;
    >> >
    >> > I am absolutely assured that the output from this URL will be around
    >> > 10-15K every time.

    >>
    >> So how does this get turned into 300MB?


    JDB> That 300MB thing was the other guy; I just piggybacked my question
    JDB> onto his.

    then start a new thread with a new subject.

    uri

    --
    Uri Guttman ------ -------- http://www.stemsystems.com
    --Perl Consulting, Stem Development, Systems Architecture, Design and Coding-
    Search or Offer Perl Jobs ---------------------------- http://jobs.perl.org
     
    Uri Guttman, Dec 7, 2006
    #13
  14. J.D. Baldwin Guest

    In the previous article, Tad McClellan <> wrote:
    > > my $wget_out = `/path/to/wget $URL`;

    >
    >
    > You can make it more portable by doing it in native Perl
    > rather than shelling out:
    >
    > use LWP::Simple;
    > my $wget_out = get $URL;


    A little additional research shows that a) I was wrong about LWP not
    being part of ActivePerl, because it is, and b) LWP::UserAgent allows
    me to specify a max content size (taking care of that problem) and a
    specific proxy server (a part of the problem domain I didn't mention).
    Thanks.
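
    For anyone curious, a sketch of what that might look like with
    LWP::UserAgent, again assuming $URL and $string_to_match are set as in
    the earlier snippets; the 32000-byte cap and the proxy address are
    placeholders:

    use LWP::UserAgent;

    my $ua = LWP::UserAgent->new;
    $ua->max_size(32_000);    # stop reading the body after ~32KB
    $ua->proxy( 'http', 'http://proxy.example.com:8080/' );

    my $response = $ua->get($URL);
    if ( $response->is_success and $response->content =~ /$string_to_match/ ) {
        # do "OK" thing
    }
    else {
        # do "not so OK" thing
    }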
     
    J.D. Baldwin, Dec 7, 2006
    #14
  15. John Bokma Guest

    Tad McClellan <> wrote:

    > J.D. Baldwin <> wrote:
    >
    >> my $wget_out = `/path/to/wget $URL`;

    >
    >
    > You can make it more portable by doing it in native Perl
    > rather than shelling out:
    >
    > use LWP::Simple;
    > my $wget_out = get $URL;


    Or, if you insist on wget, it has been ported to Windows (I use wget in
    some Perl programs similar to what J.D. mentioned).

    --
    John Experienced Perl programmer: http://castleamber.com/

    Perl help, tutorials, and examples: http://johnbokma.com/perl/
     
    John Bokma, Dec 7, 2006
    #15
  16. gf Guest

    J.D. Baldwin wrote:

    > I have a very short script that reads the output of wget $URL like so:
    >
    > my $wget_out = `/path/to/wget $URL`;
    >
    > I am absolutely assured that the output from this URL will be around
    > 10-15K every time. Furthermore, I need to search for a short string
    > that always appears near the end of the output (so there is no
    > advantage to cutting off the input after some shorter number of
    > characters).


    If you are truly "absolutely assured" and you know your machine will
    ALWAYS have enough RAM available to handle the data being read, then
    slurping is fine, except when at the dinner table, unless you're in a
    society that approves of such behavior... which reminds me of testing
    for leap year, only now I digress.....

    When you are always getting the same file, or files, then slurp is
    safer^H^H^H^H^Hbenign. For small config files and small data sets it's
    cool and I use it for those. If you are trying to slurp a file using a
    name that came about dynamically, or as part of user interaction or
    input, then slurp would be a really bad design choice in my opinion. If
    you feel that the app running away, crashing, or taking the host to
    its knees is acceptable... well then, slurp away, just use it with the
    knowledge that it is a very sharp pokey kind of tool and shouldn't be
    waved about wildly in a crowd or carried while running. Again, it'd be
    worth reading the slurp docs and/or Conway's comments in the PBP book.

    Now, regarding using `wget...`, why not just use LWP::Simple instead?
    It works very nicely in a very similar fashion, and skips having to
    shell out just to run. Having written a bunch of iterations of spiders
    for our internal use, using LWP::Simple, LWP::UserAgent, plus some
    stuff needing curl or wget, I still reach for the simple LWP first.

    Just my $0.02.
     
    gf, Dec 7, 2006
    #16
  17. J.D. Baldwin Guest

    In the previous article, gf <> wrote:
    > Now, regarding using `wget...`, why not just use LWP::Simple
    > instead?


    Because perl -MLPW::Simple -e 'print "OK\n";' failed, and (as
    mentioned elsethread) installing new modules is not going to happen
    anytime soon for these hosts.

    Then I tried it again without misspelling "LWP" and it worked. So I
    have already rewritten the whole thing (all twenty-odd lines of it,
    oooh) to use LWP::UserAgent (which was also present). Much more
    robust, and I still avoid writing and then reading a file.
     
    J.D. Baldwin, Dec 8, 2006
    #17
  18. J.D. Baldwin <> wrote:

    > I just piggybacked my question
    > onto his.



    Why?


     
    Tad McClellan, Dec 8, 2006
    #18
  19. J.D. Baldwin Guest

    In the previous article, Tad McClellan <> wrote:
    > J.D. Baldwin <> wrote:
    >
    > > I just piggybacked my question
    > > onto his.

    >
    >
    > Why?


    Because the comments about slurp and the use of $/ led naturally to a
    closely related topic I've been thinking about.
     
    J.D. Baldwin, Dec 8, 2006
    #19
  20. Uri Guttman Guest

    >>>>> "JDB" == J D Baldwin <> writes:

    JDB> In the previous article, Tad McClellan <> wrote:
    >> J.D. Baldwin <> wrote:
    >>
    >> > I just piggybacked my question
    >> > onto his.

    >>
    >>
    >> Why?


    JDB> Because the comments about slurp and the use of $/ led naturally to a
    JDB> closely related topic I've been thinking about.

    i will ask again, why didn't you start a new thread and subject then?

    uri

     
    Uri Guttman, Dec 8, 2006
    #20