Perl storing huge data (300MB) in a scalar


kalpanashtty

Hello,
This is regarding an issue we face while storing large data in a
scalar variable. The problem is explained below:

We have a log file with 10 lines, and each line is approximately 300MB
long (continuous, with no line breaks). Using Perl we read each line and
store it in a scalar variable. This works fine at first, but as these
huge lines are read we eventually see "Out of memory", and memory
consumption keeps increasing.

Has anyone faced this problem, and do you know how to handle this kind
of scenario?

Kalpana
 

J.D. Baldwin

In the previous article, kalpanashtty said:
Has anyone faced this problem, and do you know how to handle this kind
of scenario?

I had a similar problem a few months back with huge log data that
wasn't broken by newlines. perldoc -f getc has what you probably
need. Something along the lines of:

my $chunk = '';
for ( 1..$howmanycharsdoyouwantatonce )
{
    $chunk .= getc FHANDLE;
}
 

John W. Krahn

J.D. Baldwin said:
I had a similar problem a few months back with huge log data that
wasn't broken by newlines. perldoc -f getc has what you probably
need. Something along the lines of:

my $chunk = '';
for ( 1..$howmanycharsdoyouwantatonce )
{
    $chunk .= getc FHANDLE;
}

Read one character at a time? Ick!

read FHANDLE, my $chunk, $howmanycharsdoyouwantatonce;

Or:

local $/ = \$howmanycharsdoyouwantatonce;
my $chunk = <FHANDLE>;
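For the original 300MB-per-line problem, either form keeps memory bounded as long as each chunk is processed and discarded rather than accumulated. A minimal sketch of the $/-as-reference approach, assuming a lexical filehandle, a hypothetical 1MB chunk size, and a stand-in process_chunk() for whatever work is done on each piece:

use strict;
use warnings;

my $chunk_size = 1_048_576;              # hypothetical: 1MB per read
open my $fh, '<', 'huge.log' or die "open: $!";

{
    local $/ = \$chunk_size;             # readline now returns fixed-size blocks
    while ( my $chunk = <$fh> ) {
        process_chunk($chunk);           # stand-in: handle the block, then let it go
    }
}
close $fh;

The scalar $chunk never grows beyond $chunk_size, so the 300MB line is never held in memory all at once.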



John
 

xhoster

kalpanashtty said:
Hello,
This is regarding an issue we face while storing large data in a
scalar variable. The problem is explained below:

We have a log file with 10 lines, and each line is approximately 300MB
long (continuous, with no line breaks). Using Perl we read each line and
store it in a scalar variable. This works fine at first, but as these
huge lines are read we eventually see "Out of memory", and memory
consumption keeps increasing.

Has anyone faced this problem, and do you know how to handle this kind
of scenario?

I write code that doesn't have this problem. Since you haven't shown
us any of your code, I can't tell you which part of your code is the
problem.

Xho
 

J.D. Baldwin

In the previous article, John W. Krahn said:
Read one character at a time? Ick!

There seemed to be a good reason at the time. Anyway, performance
wasn't an issue.
local $/ = \$howmanycharsdoyouwantatonce;
my $chunk = <FHANDLE>;

That's a cool trick, thanks.
 

greg.ferguson

J.D. Baldwin said:
There seemed to be a good reason at the time. Anyway, performance
wasn't an issue.


That's a cool trick, thanks.

You might want to check further into the $/ Perl variable...

http://perldoc.perl.org/perlvar.html#$RS

If there are literal strings in your file that you can use as a
pseudo-eol, then set $/ to that string and read the file as normal.
You'll have the advantage of not needing to check whether you read too
little or too much, and you won't have to reconstruct your lines.
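A minimal sketch of that idea, assuming (purely for illustration) that the log contains a literal marker such as "END-OF-RECORD" between entries:

use strict;
use warnings;

open my $fh, '<', 'huge.log' or die "open: $!";
local $/ = 'END-OF-RECORD';     # hypothetical literal separator present in the data
while ( my $record = <$fh> ) {
    chomp $record;              # chomp strips the current $/, i.e. the marker
    # ... process one logical record at a time
}
close $fh;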

Perl does well with file I/O, but it will grunt when it has to allocate
big chunks of memory to read the lines. If you read about slurp you'll
see it's almost never a good idea. From doing some benchmarking I found
I was ahead just reading line by line, or in reasonably sized fixed
blocks, so I'd go about finding some way of determining the real
end-of-record marker.
 

J.D. Baldwin

In the previous article, greg.ferguson said:
If you read about slurp you'll see it's almost never a good idea
[...]

So, a question then:

I have a very short script that reads the output of wget $URL like so:

my $wget_out = `/path/to/wget $URL`;

I am absolutely assured that the output from this URL will be around
10-15K every time. Furthermore, I need to search for a short string
that always appears near the end of the output (so there is no
advantage to cutting off the input after some shorter number of
characters).

So now that you have educated me a little, I am doing this:

$/ = \32000; # much bigger than ever needed, small enough
# to avoid potential memory problems in the
# unlikely event of runaway output from wget

my $wget_out = `/path/to/wget $URL`;

if ( $wget_out =~ /$string_to_match/ )
{
    # do "OK" thing
}
else
{
    # do "not so OK" thing
}

Performance is important, but not extremely so; this script runs many
times per hour to validate the output of certain web servers. So if
there is overhead to the "obvious" line-by-line read-and-match method
of doing the same thing (which will always have to read about 200
lines before matching), then doing it that way is wasteful.

In your opinion, is this an exception to the "almost never a good
idea," or is this a case for slurping?

Also, if I can determine the absolute earliest $string_to_match could
possibly appear, I suppose I can get a big efficiency gain out of

my $earliest_char = 8_000; # string of interest appears after
# AT LEAST 8,000 characters

if ( substr($wget_out, $earliest_char) =~ /$string_to_match/ )
{
...

Yes?
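For what it's worth, written out in full that offset check might look like the following (names taken from the post above; the 8_000 figure is the assumed minimum offset). Note that substr() in this form copies the tail of the string, so on a 10-15K response the saving over matching the whole thing is probably small either way:

my $earliest_char = 8_000;    # string of interest appears after AT LEAST 8,000 characters

if ( substr($wget_out, $earliest_char) =~ /$string_to_match/ )
{
    # do "OK" thing
}
else
{
    # do "not so OK" thing
}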
 

xhoster

In the previous article it was said:
If you read about slurp you'll see it's almost never a good idea
[...]

I would disagree. Slurp is quite often a good idea. Slurping data that
is, or has the potential to be, very large when doing so is utterly
unnecessary is rarely a good idea, though.
So, a question then:

I have a very short script that reads the output of wget $URL like so:

my $wget_out = `/path/to/wget $URL`;

I am absolutely assured that the output from this URL will be around
10-15K every time.

So how does this get turned into 300MB?
Furthermore, I need to search for a short string
that always appears near the end of the output (so there is no
advantage to cutting off the input after some shorter number of
characters).

So now that you have educated me a little, I am doing this:

$/ = \32000; # much bigger than ever needed, small enough
# to avoid potential memory problems in the
# unlikely event of runaway output from wget

my $wget_out = `/path/to/wget $URL`;

Backticks in a scalar context are not line oriented, and so $/ is irrelevant
to them. Even in a list context, backticks seem to slurp the whole thing,
and only apply $/ to it after slurping.

If you are really worried about runaway wget, you should either open a pipe
and read from it yourself:

open my $fh, "/path/to/wget $URL |" or die $!;
$/=\32000;
my $wget_out=<$fh>;

or just use system tools to do it and forget about $/ altogether:

my $wget_out = `/path/to/wget $URL|head -c 32000`;

Xho
 

Tad McClellan

J.D. Baldwin said:
my $wget_out = `/path/to/wget $URL`;


You can make it more portable by doing it in native Perl
rather than shelling out:

use LWP::Simple;
my $wget_out = get $URL;
 

J.D. Baldwin

In the previous article, xhoster said:
So how does this get turned into 300MB?

That 300MB thing was the other guy; I just piggybacked my question
onto his.
Backticks in a scalar context are not line oriented, and so $/ is
irrelevant to them. Even in a list context, backticks seem to slurp
the whole thing, and only apply $/ to it after slurping.

Ah, I knew the IFS didn't matter, but I didn't extrapolate that into a
realization that $/ wouldn't matter at all.
If you are really worried about runaway wget, you should either open
a pipe and read from it yourself:

open my $fh, "/path/to/wget $URL |" or die $!;
$/=\32000;
my $wget_out=<$fh>;

I was trying to avoid doing an open -- not that it's a big deal -- and
I'm not 100% sure that pipe trick will work ...
or just use system tools to do it and forget about $/ altogether:

my $wget_out = `/path/to/wget $URL|head -c 32000`;

... because, sadly, I am doing this for a Windows platform, where I
have no head (ha ha).

I'll probably just drop back to the open method described above, thanks.
 

J.D. Baldwin

In the previous article, Tad McClellan said:
You can make it more portable by doing it in native Perl
rather than shelling out:

use LWP::Simple;
my $wget_out = get $URL;

That's kind of a Phase II plan ... getting new Perl modules installed
on these monitoring systems is non-trivial, but for political reasons
rather than technical ones.
 

Uri Guttman

JDB> That 300MB thing was the other guy; I just piggybacked my question
JDB> onto his.

then start a new thread with a new subject.

uri
 

J.D. Baldwin

In the previous article, Tad McClellan said:
You can make it more portable by doing it in native Perl
rather than shelling out:

use LWP::Simple;
my $wget_out = get $URL;

A little additional research shows that a) I was wrong about LWP not
being part of ActivePerl, because it is, and b) LWP::UserAgent allows
me to specify a max content size (taking care of that problem) and a
specific proxy server (a part of the problem domain I didn't mention).
Thanks.
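For reference, a minimal sketch of that approach (the URL, the match string, the proxy address and the 32,000-byte cap are placeholders; max_size() tells LWP::UserAgent to stop reading the body once the limit is reached):

use strict;
use warnings;
use LWP::UserAgent;

my $URL             = 'http://server.example/status';    # placeholder URL
my $string_to_match = 'expected marker';                  # placeholder string

my $ua = LWP::UserAgent->new;
$ua->max_size(32_000);                                    # abort the body after ~32K
$ua->proxy('http', 'http://proxy.example:8080');          # placeholder proxy server

my $response = $ua->get($URL);
if ( $response->is_success and $response->decoded_content =~ /$string_to_match/ )
{
    # do "OK" thing
}
else
{
    # do "not so OK" thing
}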
 

John Bokma

Tad McClellan said:
You can make it more portable by doing it in native Perl
rather than shelling out:

use LWP::Simple;
my $wget_out = get $URL;

Or if you insist on wget, it has been ported to Windows (I use wget in
some Perl programs similar to what J.D. mentioned).
 

gf

J.D. Baldwin said:
I have a very short script that reads the output of wget $URL like so:

my $wget_out = `/path/to/wget $URL`;

I am absolutely assured that the output from this URL will be around
10-15K every time. Furthermore, I need to search for a short string
that always appears near the end of the output (so there is no
advantage to cutting off the input after some shorter number of
characters).

If you are truly "absolutely assured" and you know your machine will
ALWAYS have enough RAM available to handle the data being read, then
slurping is fine, except when at the dinner table, unless you're in a
society that approves of such behavior... which reminds me of testing
for leap year, only now I digress.....

When you are always getting the same file, or files, then slurp is
safer^H^H^H^H^Hbenign. For small config files and small data sets it's
cool and I use it for those. If you are trying to slurp a file using a
name that came about dynamically, or as part of user interaction or
input, then slurp would be a really bad design choice in my opinion. If
you feel that the app running away, crashing, or taking the host to
its knees is acceptable... well then, slurp away, just use it with the
knowledge that it is a very sharp pokey kind of tool and shouldn't be
waved about wildly in a crowd or carried while running. Again, it'd be
worth reading the slurp docs and/or Conway's comments in the PBP book.

Now, regarding using `wget...`, why not just use LWP::Simple instead?
It works very nicely in a very similar fashion, and skips having to
shell out just to run. Having written a bunch of iterations of spiders
for our internal use, using LWP::Simple, LWP::UserAgent, plus some
stuff needing curl or wget, I still reach for the simple LWP first.

Just my $0.02.
 

J.D. Baldwin

In the previous article, gf said:
Now, regarding using `wget...`, why not just use LWP::Simple
instead?

Because perl -MLPW::Simple -e 'print "OK\n";' failed, and (as
mentioned elsethread) installing new modules is not going to happen
anytime soon for these hosts.

Then I tried it again without misspelling "LWP" and it worked. So I
have already rewritten the whole thing (all twenty-odd lines of it,
oooh) to use LWP::UserAgent (which was also present). Much more
robust, and I still avoid writing and then reading a file.
 

J.D. Baldwin

In the previous article, Uri Guttman said:
then start a new thread with a new subject.

Because the comments about slurp and the use of $/ led naturally to a
closely related topic I've been thinking about.
 

Uri Guttman

JDB> Because the comments about slurp and the use of $/ led naturally to a
JDB> closely related topic I've been thinking about.

i will ask again, why didn't you start a new thread and subject then?

uri
 
