String Processing Basic Stuff

Vishal G

Hi guys,

Very basic question...

Please don't suggest using another programming language or another data
structure, because I can't.

I read the data from a file, and yes, I have to slurp the whole thing into
memory, because I can use up to 4 GB.

The data in the file is in this format:

30 56 78 34 2 39 87 (50 values per line, 120 million entries in total)

I am reading the file in paragraph mode.

Now I have to remove multiple spaces without using much memory.

This is what I have written (it might be very low-standard code for the gurus
out there).

It works, but it takes 5 minutes and consumes 600-700 MB. If I use substitution
to achieve this, it takes 4-5 GB and around 2-3 minutes.

Could you please suggest a way to process it faster, using as little memory
as possible?

# Process the string $_ to remove leading whitespace and multiple whitespaces,
# and to pad each value to the same size
my $chr = '';
my $str = '';
my $value = '';
my $unitlength = $Alignment::BASEQUALITY_BYTES;
while (length($_) > 0) {
    if (($chr = substr($_, 0, 1, "")) ne " ") {
        $value = $value . $chr;
    } else {
        $str = $str . sprintf("%${unitlength}d", $value) if ($value);
        undef $value;
    }
}

# BQ field
$ace->{'BQ'}->{$name} = $str;

undef $str;
undef $chr;

Thanks in advance

Vishal
 
sln

It's not really clear what you mean by 50 values per
line, or whether you have slurped an 800 MB string into $_.
It looks like you're trying to shrink one string and grow
another.
The way you are doing it seems very granular.

Here are a couple of approaches you could try, if you haven't
tried them already.

sln

##############
# ???.pl
##############

use strict;
use warnings;

my $unitlength = 5; #$Alignment::BASEQUALITY_BYTES;
my $line = '30 56 78 34 2 39 87 ';
my $str = $line;


# If it's 50 values per line
# do substitution
# ------------------------------
$str =~ s/\s*(\d+)/sprintf "%${unitlength}d", $1/ge;
$str =~ s/\s+$//;
print "'$str'\n";


# If it's all on one huge line
# shrink one string, grow another
# (not sure this will save memory)
# ------------------------------------
my $newstr = '';
my $RxNumber = qr/\s*(\d+)/;

while ($str =~ s/$RxNumber//)
{
    $newstr .= (sprintf "%${unitlength}d", $1);
}
print "'$newstr'\n";

__END__

output:

'   30   56   78   34    2   39   87'
'   30   56   78   34    2   39   87'
 
xhoster

Vishal G said:
Hi guys,

Very basic question...

Please don't suggest using another programming language or another data
structure, because I can't.

If you can't use a different structure, at least for intermediates,
then you can't program.

I read the data from a file, and yes, I have to slurp the whole thing into
memory, because I can use up to 4 GB.

Because you can do it, that means you have to? Why can't you read line by
line, processing each line and appending the result to $str before moving
on to the next?
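
A minimal sketch of that line-by-line idea, not taken from the original code. The helper name pad_line, the width of 3 (standing in for $Alignment::BASEQUALITY_BYTES) and the in-memory sample data are all assumptions made for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: pad every whitespace-separated value on one line
# to a fixed width and return the concatenated result.
sub pad_line {
    my ($line, $width) = @_;
    # split ' ' skips leading whitespace and splits on runs of whitespace,
    # so repeated spaces and the trailing newline are handled together.
    return join '', map { sprintf "%${width}d", $_ } split ' ', $line;
}

# Assumed stand-in for $Alignment::BASEQUALITY_BYTES.
my $unitlength = 3;

# In-memory filehandle standing in for the real data file.
my $data = "30  56 78\n 34 2   39\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my $str = '';
while (my $line = <$fh>) {
    # Only one input line is held at a time; $str grows with the result.
    $str .= pad_line($line, $unitlength);
}
close $fh;

print "'$str'\n";
```

With this shape, only the current line and the growing result are in memory at once, rather than the whole input plus the result.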

The data in the file is in this format:

30 56 78 34 2 39 87 (50 values per line, 120 million entries in total)

So then, would this work to make an example file?
perl -le 'foreach (1..2.4e6) {print join " ", map int(rand()*99), 1..50}'

I am reading the file in paragraph mode.

Why read in paragraph mode? From your format description, the data
is not formatted in paragraphs.

Now I have to remove multiple spaces without using much memory.

This is what I have written (it might be very low-standard code for the gurus
out there).

It works, but it takes 5 minutes and consumes 600-700 MB,

When I try it, I get many, many warnings, which suggests that it is not
actually working correctly.

if I use substitution
to achieve this, it takes 4-5 GB and around 2-3 minutes.

How did you use substitution?


Starting your code indented half way across the screen isn't very helpful.
It just leads to messy line wrap problems. I fixed that.
my $chr = '';
my $str = '';
my $value = '';
my $unitlength = $Alignment::BASEQUALITY_BYTES;
while (length($_) > 0) {
    if (($chr = substr($_, 0, 1, "")) ne " ") {
        $value = $value . $chr;
    } else {
        $str = $str . sprintf("%${unitlength}d", $value) if ($value);

I get:
Argument "67\n33" isn't numeric in sprintf....

        undef $value;
    }
}
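
The warning comes from the newline: the loop treats only a single space as a separator, so the last value on one line and the first value on the next are glued together. Note also that the `if ($value)` test is false for a value of 0, so zeros would be dropped. One possible way around both, sketched here with made-up sample data and an assumed width of 3 in place of $Alignment::BASEQUALITY_BYTES, is to walk the buffer with a scalar-context /g match:

```perl
use strict;
use warnings;

# Assumed stand-in for $Alignment::BASEQUALITY_BYTES.
my $unitlength = 3;

# Sample slurped data with newlines, repeated spaces and a zero.
my $buf = "67\n33  0 \n 12";

my $str = '';
# In scalar context, a /g match resumes where the previous one stopped,
# so this visits one number at a time without shrinking or copying $buf
# and without building a list of all the values.
while ($buf =~ /(\d+)/g) {
    $str .= sprintf "%${unitlength}d", $1;
}

print "'$str'\n";
```

Whether this is faster or lighter than the character-by-character loop on the real 120-million-entry file would have to be measured.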

Xho
