String Processing Basic Stuff

Discussion in 'Perl Misc' started by Vishal G, Oct 21, 2008.

  1. Vishal G

    Vishal G Guest

    Hi Guys,

    Very basic question...

    Please don't suggest using another programming language or another data
    structure, because I can't...

    I read data from a file and yes, I have to slurp the whole thing into
    memory, because I can use up to 4GB...

    data in file is in this format

    30 56 78 34 2 39 87 (50 values per line, total of 120 million
    entries)

    reading file in paragraph mode

    Now I have to remove multiple spaces without using much memory

    This is what I have written (might be very low standard code for Gurus
    out there)

    It works but takes 5 mins consuming 600-700 MB; if I use substitution
    to achieve this it takes 4-5 GB and around 2-3 mins...

    Could you please suggest a way to process it faster using as little
    memory as possible...

    # Process the string $_ to remove leading whitespaces,
    # multiple whitespaces
    # and to pad each value to same size
    my $chr = '';
    my $str = '';
    my $value = '';
    my $unitlength = $Alignment::BASEQUALITY_BYTES;
    while (length($_) > 0) {
        if (($chr = substr($_, 0, 1, "")) ne " ") {
            $value = $value . $chr;
        } else {
            $str = $str . sprintf("%${unitlength}d", $value) if ($value);
            undef $value;
        }
    }

    # BQ field
    $ace->{'BQ'}->{$name} = $str;

    undef $str;
    undef $chr;

    Thanks in advance

    Vishal
     
    Vishal G, Oct 21, 2008
    #1

  2. Vishal G

    Vishal G Guest

    Unitlength is 3 in this case
     
    Vishal G, Oct 21, 2008
    #2

  3. Vishal G

    Guest

    On Mon, 20 Oct 2008 23:27:24 -0700 (PDT), Vishal G <> wrote:

    > Hi Guys,
    >
    > Very basic question...
    >
    > Please don't suggest using another programming language or another data
    > structure, because I can't...
    >
    > I read data from a file and yes, I have to slurp the whole thing into
    > memory, because I can use up to 4GB...
    >
    > data in file is in this format
    >
    > 30 56 78 34 2 39 87 (50 values per line, total of 120 million
    > entries)
    >
    > reading file in paragraph mode
    >
    > Now I have to remove multiple spaces without using much memory
    >
    > This is what I have written (might be very low standard code for Gurus
    > out there)
    >
    > It works but takes 5 mins consuming 600-700 MB; if I use substitution
    > to achieve this it takes 4-5 GB and around 2-3 mins...
    >
    > Could you please suggest a way to process it faster using as little
    > memory as possible...
    >
    > # Process the string $_ to remove leading whitespaces,
    > # multiple whitespaces
    > # and to pad each value to same size
    > my $chr = '';
    > my $str = '';
    > my $value = '';
    > my $unitlength = $Alignment::BASEQUALITY_BYTES;
    > while (length($_) > 0) {
    >     if (($chr = substr($_, 0, 1, "")) ne " ") {
    >         $value = $value . $chr;
    >     } else {
    >         $str = $str . sprintf("%${unitlength}d", $value) if ($value);
    >         undef $value;
    >     }
    > }
    >
    > # BQ field
    > $ace->{'BQ'}->{$name} = $str;
    >
    > undef $str;
    > undef $chr;
    >
    > Thanks in advance
    >
    > Vishal


    Not really clear on what you mean by 50 values per
    line, or if you have slurped an 800 MB string into $_.
    Looks like you're trying to shrink one string and grow
    another.
    The way you are doing it seems very granular.

    Here are a couple of approaches you could try if you
    haven't tried them already.

    sln

    ##############
    # ???.pl
    ##############

    use strict;
    use warnings;

    my $unitlength = 5; #$Alignment::BASEQUALITY_BYTES;
    my $line = '30 56 78 34 2 39 87 ';
    my $str = $line;


    # If it's 50 values per line
    # do substitution
    # ------------------------------
    $str =~ s/\s*(\d+)/sprintf "%${unitlength}d", $1/ge;
    $str =~ s/\s+$//;
    print "'$str'\n";


    # If it's all on one huge line
    # shrink one string, grow another
    # (not sure this will save memory)
    # ------------------------------------
    my $newstr = '';
    my $RxNumber = qr/\s*(\d+)/;

    while ($str =~ s/$RxNumber//)
    {
        $newstr .= (sprintf "%${unitlength}d", $1);
    }
    print "'$newstr'\n";

    __END__

    output:

    '   30   56   78   34    2   39   87'
    '   30   56   78   34    2   39   87'
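
    A third variation on the approaches above, sketched under the assumption
    that the input holds only non-negative integers separated by whitespace:
    scan the source string with a global match instead of cutting pieces off
    it, so the source is neither modified nor rewritten on each step. The
    name pad_values and the sample data are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: pad every integer found in $$src_ref to $width columns.
# The source string is passed by reference and only scanned, never modified.
sub pad_values {
    my ($src_ref, $width) = @_;
    my $out = '';
    # In scalar context, m//g resumes from where the previous match ended.
    while ($$src_ref =~ /(\d+)/g) {
        # '%*d' takes the field width from the argument list.
        $out .= sprintf('%*d', $width, $1);
    }
    return $out;
}

my $line = "30 56 78  34 2\n39 87 ";
print "'", pad_values(\$line, 5), "'\n";
```

    Whether this actually lowers peak memory on the full data set would need
    measuring; the output string still has to hold every padded value.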
     
    , Oct 21, 2008
    #3
  4. Vishal G

    Guest

    Vishal G <> wrote:
    > Hi Guys,
    >
    > Very basic question...
    >
    > Please don't suggest using another programming language or another data
    > structure, because I can't...


    If you can't use a different structure, at least for intermediates,
    then you can't program.


    > I read data from a file and yes, I have to slurp the whole thing into
    > memory, because I can use up to 4GB...


    Because you can do it, that means you have to? Why can't you read line by
    line, processing each line and appending the result to $str before moving
    to the next?
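
    A minimal sketch of that line-by-line idea, assuming the default input
    record separator (one line per read) and whitespace-separated integers.
    The helper name pad_from_handle and the in-memory sample data are
    hypothetical; a real run would open the actual data file instead.

```perl
use strict;
use warnings;

# Hypothetical helper: read whitespace-separated integers from a filehandle
# one line at a time and return them, each padded to $width columns.
sub pad_from_handle {
    my ($fh, $width) = @_;
    my $out = '';
    while (my $line = <$fh>) {
        # split ' ' skips leading whitespace and splits on runs of
        # whitespace, so the trailing newline never reaches sprintf.
        $out .= sprintf('%*d', $width, $_) for split ' ', $line;
    }
    return $out;
}

# Demonstration on an in-memory filehandle.
my $data = "30 56 78\n 34  2 39\n";
open my $fh, '<', \$data or die "Cannot open in-memory handle: $!";
print "'", pad_from_handle($fh, 3), "'\n";
close $fh;
```

    Only one input line is held at a time, so the memory use should be
    dominated by the padded result string rather than by the raw file.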

    >
    > data in file is in this format
    >
    > 30 56 78 34 2 39 87 (50 values per line, total of 120 million
    > entries)


    So then, would this work to make an example file?
    perl -le 'foreach (1..2.4e6) {print join " ", map int(rand()*99), 1..50}'


    >
    > reading file in paragraph mode


    Why reading in paragraph mode? From your format description, the data
    is not formatted in paragraphs.

    >
    > Now I have to remove multiple spaces without using much memory
    >
    > This is what I have written (might be very low standard code for Gurus
    > out there)
    >
    > It works but takes 5 mins consuming 600-700 MB,


    When I try it, I get many many warnings which suggests that it is not
    actually working correctly.


    > if I use substitution
    > to achieve this it takes 4-5 GB and around 2-3 mins...


    How did you use substitution?


    Starting your code indented half way across the screen isn't very helpful.
    It just leads to messy line wrap problems. I fixed that.

    > my $chr = '';
    > my $str = '';
    > my $value = '';
    > my $unitlength = $Alignment::BASEQUALITY_BYTES;
    > while (length($_) > 0) {
    >     if (($chr = substr($_, 0, 1, "")) ne " ") {
    >         $value = $value . $chr;
    >     } else {
    >         $str = $str . sprintf("%${unitlength}d", $value) if ($value);


    I get:
    Argument "67\n33" isn't numeric in sprintf....

    >         undef $value;
    >     }
    > }
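
    The "67\n33" in that warning comes from the loop treating only a literal
    space as a separator, so the newline between two lines ends up inside
    $value. Below is a rough sketch of the same character-by-character loop
    with that addressed, assuming any whitespace should end a value. The
    names pad_by_char and $width are hypothetical ($width stands in for
    $Alignment::BASEQUALITY_BYTES), and the sub works on a copy of its
    argument, which the original code avoids by consuming $_ directly.

```perl
use strict;
use warnings;

# Hypothetical rewrite of the quoted loop, for illustration only.
sub pad_by_char {
    my ($input, $width) = @_;    # note: copies the input string
    my ($str, $value) = ('', '');
    while (length $input) {
        my $chr = substr($input, 0, 1, '');
        if ($chr !~ /\s/) {
            $value .= $chr;
            next;
        }
        # Any whitespace, including "\n", ends the current value.
        # Testing length (not truth) keeps a value of "0".
        $str .= sprintf('%*d', $width, $value) if length $value;
        $value = '';             # reset to '' rather than undef
    }
    # Flush a final value that is not followed by whitespace.
    $str .= sprintf('%*d', $width, $value) if length $value;
    return $str;
}

print "'", pad_by_char("67\n33 0 5", 3), "'\n";
```

    This only removes the warnings; it is still one substr call per
    character, so it is unlikely to be any faster than the original.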


    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    The costs of publication of this article were defrayed in part by the
    payment of page charges. This article must therefore be hereby marked
    advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
    this fact.
     
    , Oct 21, 2008
    #4