Extract Numeric values from string

Discussion in 'Perl Misc' started by Vishal G, Sep 11, 2008.

  1. Vishal G

    Vishal G Guest

    Hi there,

    I have searched the whole group looking for a solution to my problem.

    Actually, I don't understand Perl regular expressions properly...
    working on it...

    Here is the problem..

    I have a string which contains numbers...

    $str = "30 574 454 67 59 298928 74 4875 8 934"; # the actual string has 112 million values

    I would like to extract numeric values from a specific position up to
    some position using a regular expression.. I don't want to use split because
    it uses a lot of memory..

    for example:

    offset = 3; length = 4;

    so the result string should be $str = "454 67 59 298928";

    Thanks in advance

    Vishal
     
    Vishal G, Sep 11, 2008
    #1

  2. Vishal G <> writes:

    > I have string which contain numbers...
    >
    > $str = "30 574 454 67 59 298928 74 4875 8 934"; # in actual string
    > there are 112 million values
    >
    > I would like to extract numveric values from specific position till
    > some position using regular expression.. I dont want to use split caue
    > it uses lot of memory..
    >
    > for example:
    >
    > offset = 3; length = 4;
    >
    > so the result string should be $str = "454 67 59 298928";


    Well, you could always do something like:

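    # offset is 1-based in the example, so skip one number fewer
    my ($skip, $more) = ($offset - 1, $length - 1);
    my $regex =
    qr/
    ^
    (?:\d+\s+){$skip}          # skip the numbers before the offset
    (\d+(?:\s+\d+){$more})     # capture the wanted numbers, no trailing space
    /x;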

    my ($result) = $str =~ /$regex/;


    --
    T.
     
    Tomislav Novak, Sep 11, 2008
    #2

  3. Vishal G

    Ben Morrow Guest

    Quoth Tomislav Novak <>:
    > Vishal G <> writes:
    >
    > > I have string which contain numbers...
    > >
    > > $str = "30 574 454 67 59 298928 74 4875 8 934"; # in actual string
    > > there are 112 million values
    > >
    > > I would like to extract numveric values from specific position till
    > > some position using regular expression.. I dont want to use split caue
    > > it uses lot of memory..
    > >
    > > for example:
    > >
    > > offset = 3; length = 4;
    > >
    > > so the result string should be $str = "454 67 59 298928";

    >
    > Well, you could always do something like:
    >
    > my $regex =
    > qr/
    > ^
    > (?:\d+\s*) {$offset}
    > ((?:\d+\s*){$length})
    > /x;


    The string apparently contains 112M values. {} quantifiers in Perl cannot
    be larger than 32766.

    I would suggest running through the string using substr to check one
    character at a time. Count the number of spaces, and collect up the
    digits as needed. This will be slow, but will avoid copying the string.
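
    A minimal sketch of that idea, offered as a variation rather than exactly what is described above: it uses index() to jump from space to space instead of testing every character with substr. The name extract_range is made up for illustration, the offset is taken to be 1-based as in the original example, and the numbers are assumed to be separated by exactly one space.

```perl
use strict;
use warnings;

# Return $length numbers starting at the 1-based position $offset.
# Assumes the numbers in $$str_ref are separated by single spaces.
# The string is passed by reference so it is never copied.
sub extract_range {
    my ($str_ref, $offset, $length) = @_;

    # Skip the first $offset - 1 numbers by hopping over their spaces.
    my $start = 0;
    for (1 .. $offset - 1) {
        my $sp = index($$str_ref, ' ', $start);
        return '' if $sp < 0;          # fewer numbers than $offset
        $start = $sp + 1;
    }

    # Walk over the next $length numbers to find where the slice ends.
    my $end = $start;
    for (1 .. $length) {
        my $sp = index($$str_ref, ' ', $end);
        if ($sp < 0) {                 # ran off the end of the string
            $end = length($$str_ref) + 1;
            last;
        }
        $end = $sp + 1;
    }

    # $end - 1 is the position of the space after the last wanted number.
    return substr($$str_ref, $start, $end - 1 - $start);
}

my $str = "30 574 454 67 59 298928 74 4875 8 934";
print extract_range(\$str, 3, 4), "\n";    # prints "454 67 59 298928"
```

    Only the returned slice is copied; the cost is a linear scan up to the offset.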

    In general, perl has a policy of trading memory for speed. If you are
    short of memory, I would suggest using a different language with more
    appropriate tradeoffs.

    Ben

    --
    Raise your hand if you're invulnerable.
    []
     
    Ben Morrow, Sep 11, 2008
    #3
  4. Vishal G

    Dr.Ruud Guest

    Vishal G schreef:

    > I have string which contain numbers...
    >
    > $str = "30 574 454 67 59 298928 74 4875 8 934"; # in actual string
    > there are 112 million values
    >
    > I would like to extract numveric values from specific position till
    > some position using regular expression.. I dont want to use split caue
    > it uses lot of memory..
    >
    > for example:
    >
    > offset = 3; length = 4;
    >
    > so the result string should be $str = "454 67 59 298928";



    Maybe you are looking for something like this:

    $ perl -Mstrict -Mwarnings -le '
    print scalar localtime;
    my $s; $s .= "$_ " for 1..10_000_000;
    print scalar localtime;

    my $offset = 9_999_903;
    my $count = 4;

    while ($s =~ m/([0-9]+)/g) {
    $count or last;
    --$offset > 0 and next;
    $count-- and print $1;
    }
    print scalar localtime;
    '
    Thu Sep 11 14:42:40 2008
    Thu Sep 11 14:42:47 2008
    9999903
    9999904
    9999905
    9999906
    Thu Sep 11 14:42:53 2008


    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Sep 11, 2008
    #4
  5. Vishal G

    cartercc Guest

    On Sep 11, 4:46 am, Vishal G <> wrote:
    > I have string which contain numbers...
    >
    > $str = "30 574 454 67 59 298928 74 4875 8 934"; # in actual string
    > there are 112 million values
    >
    > I would like to extract numveric values from specific position till
    > some position using regular expression.. I dont want to use split caue
    > it uses lot of memory..
    >
    > for example:
    >
    > offset = 3; length = 4;
    >
    > so the result string should be $str = "454 67 59 298928";


    Is your string already in memory, or does it come from storage? If the
    latter, you might consider replacing the spaces with new lines and
    then using a counter to iterate through the file with something like
    this:

    while (<INFILE>)
    { $counter++;
    if ($counter < $offset) { next; }                        # before the range
    elsif ($counter < $offset + $length) { print OUTFILE; }  # inside the range
    else { last; }                                           # past the range
    }

    If your string is already in memory, I would use the C trick of getc()
    and test each character, again using a counter for the white space.
    Using inline C would probably be faster and you could discard all the
    characters you don't need.

    int c; /* int, not char, so EOF can be detected; fp is an open FILE * */
    while ((c = getc(fp)) != EOF)
    { /* test c, count whitespace, and save what you need */
    }

    CC
     
    cartercc, Sep 11, 2008
    #5
  6. Vishal G

    Dr.Ruud Guest

    Dr.Ruud schreef:
    > Vishal G:


    >> I have string which contain numbers...
    >>
    >> $str = "30 574 454 67 59 298928 74 4875 8 934"; # in actual string
    >> there are 112 million values
    >>
    >> I would like to extract numveric values from specific position till
    >> some position using regular expression.. I dont want to use split
    >> caue it uses lot of memory..
    >>
    >> for example:
    >>
    >> offset = 3; length = 4;
    >>
    >> so the result string should be $str = "454 67 59 298928";

    >
    >
    > Maybe you are looking for something like this:
    >
    > $ perl -Mstrict -Mwarnings -le '
    > print scalar localtime;
    > my $s; $s .= "$_ " for 1..10_000_000;
    > print scalar localtime;
    >
    > my $offset = 9_999_903;
    > my $count = 4;
    >
    > while ($s =~ m/([0-9]+)/g) {
    > $count or last;
    > --$offset > 0 and next;
    > $count-- and print $1;
    > }
    > print scalar localtime;
    > '
    > Thu Sep 11 14:42:40 2008
    > Thu Sep 11 14:42:47 2008
    > 9999903
    > 9999904
    > 9999905
    > 9999906
    > Thu Sep 11 14:42:53 2008


    Which means that the while(regexp) skips about 2 million numbers per
    second.
    So with $offset = 100_000_000 it may take about a minute.

    --
    Affijn, Ruud

    "Gewoon is een tijger."
     
    Dr.Ruud, Sep 11, 2008
    #6
  7. Vishal G

    Ben Morrow Guest

    Quoth cartercc <>:
    > On Sep 11, 4:46 am, Vishal G <> wrote:
    > > I have string which contain numbers...
    > >
    > > $str = "30 574 454 67 59 298928 74 4875 8 934"; # in actual string
    > > there are 112 million values
    > >
    > > I would like to extract numveric values from specific position till
    > > some position using regular expression.. I dont want to use split caue
    > > it uses lot of memory..
    > >
    > > for example:
    > >
    > > offset = 3; length = 4;
    > >
    > > so the result string should be $str = "454 67 59 298928";

    >
    > Is your string already in memory, or does it come from storage? If the
    > latter, you might consider replacing the spaces with new lines and
    > then using a counter to iterate through the file with something like
    > this:
    >
    > while (<INFILE>)


    No need to replace the spaces. $/ = " " will work just fine.

    <snip>
    > If your string is already in memory, I would use the C trick of getc()


    getc reads from a file, not from memory.

    Ben

    --
    You poor take courage, you rich take care:
    The Earth was made a common treasury for everyone to share
    All things in common, all people one.
    'We come in peace'---the order came to cut them down. []
     
    Ben Morrow, Sep 11, 2008
    #7
  8. On Thu, 11 Sep 2008 01:46:08 -0700, Vishal G wrote:

    > Hi there,
    >
    > I have searched the whole group looking for solution to my problem.
    >
    > Actually, I dont understand the perl regular expression properly...
    > working on it...
    >
    > Here is the problem..
    >
    > I have string which contain numbers...
    >
    > $str = "30 574 454 67 59 298928 74 4875 8 934"; # in actual string there
    > are 112 million values
    >
    > I would like to extract numveric values from specific position till some
    > position using regular expression.. I dont want to use split caue it
    > uses lot of memory..
    >
    > for example:
    >
    > offset = 3; length = 4;
    >
    > so the result string should be $str = "454 67 59 298928";
    >
    > Thanks in advance
    >
    > Vishal


    Why do you store that in a free-format string? I can think of a number
    of better ways to store it. You could store it in a binary array (like in
    C) and then access it using vec(). Tie::Array::Packed may also be an
    interesting approach. By storing your data smarter, you can make an O(N)
    algorithm O(1).
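
    A rough sketch of the vec() idea, under a few assumptions: every value fits in an unsigned 32-bit integer, the offset is 1-based as in the original example, and the requested range lies inside the data (vec() returns 0 when reading past the end). The names pack_numbers and packed_range are made up for illustration.

```perl
use strict;
use warnings;

# Convert the free-format string into a binary string with one
# 32-bit slot per number. This is a one-time O(N) pass.
sub pack_numbers {
    my ($str_ref) = @_;
    my $packed = '';
    my $i = 0;
    while ($$str_ref =~ /(\d+)/g) {
        vec($packed, $i++, 32) = $1;   # assumes each value is < 2**32
    }
    return $packed;
}

# Fetch $length values starting at the 1-based position $offset.
# Each lookup is a direct index into the packed string.
sub packed_range {
    my ($packed_ref, $offset, $length) = @_;
    return join ' ',
        map { vec($$packed_ref, $_, 32) }
        $offset - 1 .. $offset + $length - 2;
}

my $str    = "30 574 454 67 59 298928 74 4875 8 934";
my $packed = pack_numbers(\$str);
print packed_range(\$packed, 3, 4), "\n";
```

    At four bytes per value, 112 million values would take roughly 450 MB, which is in the same range as the text form but allows direct indexing.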

    Regards,

    Leon Timmermans
     
    Leon Timmermans, Sep 11, 2008
    #8
  9. Vishal G

    Ted Zlatanov Guest

    On Thu, 11 Sep 2008 01:46:08 -0700 (PDT) Vishal G <> wrote:

    VG> Here is the problem..

    VG> I have string which contain numbers...

    VG> $str = "30 574 454 67 59 298928 74 4875 8 934"; # in actual string
    VG> there are 112 million values

    VG> I would like to extract numveric values from specific position till
    VG> some position using regular expression.. I dont want to use split caue
    VG> it uses lot of memory..

    This works for me. To avoid dealing with edge cases, I surround the
    input with spaces. The assumption is that only digits and spaces are in
    your data; the algorithm uses that to find the next space or the next
    digit. Note also that slow_extract() is there as a reference to check
    the algorithm works OK. It's very possible it has bugs: I wrote it to
    show you the general idea of iterating through the string, and tests are
    what you see in __DATA__ which is minimal.

    You should consider keeping large data sets like this in a database,
    e.g. SQLite. Then operating on it from Perl or other languages is much
    easier, especially if you index your columns appropriately.

    Ted

    #!/usr/bin/perl

    use warnings;
    use strict;
    use Data::Dumper;
    use List::Util qw/min/;

    my $str = <DATA>; # we keep it global so it's not passed around
    chomp $str;
    $str = " $str ";


    while (<DATA>)
    {
    my ($pos, $offset) = m/(\d+)\D+(\d+)/;
    my $slow_result = slow_extract($pos, $offset);
    my $fast_result = fast_extract($pos, $offset);
    my $ok = $slow_result eq $fast_result;
    print "position $pos, offset $offset: $slow_result / $fast_result / OK=$ok\n";
    }

    sub slow_extract
    {
    my $logical_pos = shift @_;
    my $n = shift @_;

    my @numbers = split ' ', $str;
    return join ' ', grep { defined } @numbers[$logical_pos .. $logical_pos+$n-1];
    }

    sub fast_get_number
    {
    my $start_pos = shift @_;

    my @matches = grep { defined && $_ > 0 } map { index($str, $_, $start_pos) } 0..9;

    return unless scalar @matches;

    my $start = min(@matches);
    my $end = index($str, ' ', $start);
    return ($end, substr($str, $start, $end-$start));
    }

    sub fast_extract
    {
    my $logical_pos = shift @_;
    my $n = shift @_;

    my $at = 0;
    my $current_logical_pos = 0;

    my @numbers;
    while (1)
    {
    my @next = fast_get_number($at);
    last unless scalar @next;
    last if $next[0] < 0;
    if ($current_logical_pos >= $logical_pos)
    {
    push @numbers, $next[1];
    }
    $at = $next[0];
    last if scalar @numbers == $n;
    $current_logical_pos++;
    }

    return join ' ', @numbers;
    }

    __DATA__
    93430 574 454 67 59 298928 74 4875 8 93430
    3 4
    5 6
    7 8
    10 2
     
    Ted Zlatanov, Sep 11, 2008
    #9
  10. Vishal G

    cartercc Guest

    This is why I read this group, always learning things at the (small)
    cost of exhibiting my own ignorance. It always amazes me the depth of
    knowledge that some people have, and a little bit depressing as to my
    own lack of knowledge.

    I have several friends who are medical doctors, and know several of
    their children who are in various stages of the medical education
    process, and I've always liked that approach: two years in the
    classroom and four (or more) in the field. In a job you get stuck in a
    rut where you might have the same experience thousands of times,
    unlike a forum like c.l.p.m. where you can broaden your knowledge by
    way of specific, limited example.

    All this as a rather wordy 'Thanks'.

    CC

    On Sep 11, 9:57 am, Ben Morrow <> wrote:
    > Quoth cartercc <>:
    >
    >
    >
    > > On Sep 11, 4:46 am, Vishal G <> wrote:
    > > > I have string which contain numbers...

    >
    > > > $str = "30 574 454 67 59 298928 74 4875 8 934"; # in actual string
    > > > there are 112 million values

    >
    > > > I would like to extract numveric values from specific position till
    > > > some position using regular expression.. I dont want to use split caue
    > > > it uses lot of memory..

    >
    > > > for example:

    >
    > > > offset = 3; length = 4;

    >
    > > > so the result string should be $str = "454 67 59 298928";

    >
    > > Is your string already in memory, or does it come from storage? If the
    > > latter, you might consider replacing the spaces with new lines and
    > > then using a counter to iterate through the file with something like
    > > this:

    >
    > > while (<INFILE>)

    >
    > No need to replace the spaces. $/ = " " will work just fine.
    >
    > <snip>
    >
    > > If your string is already in memory, I would use the C trick of getc()

    >
    > getc reads from a file, not from memory.
    >
    > Ben
    >
    > --
    > You poor take courage, you rich take care:
    > The Earth was made a common treasury for everyone to share
    > All things in common, all people one.
    > 'We come in peace'---the order came to cut them down.       []
     
    cartercc, Sep 11, 2008
    #10
  11. Vishal G <> wrote:
    >I have string which contain numbers...
    >
    >$str = "30 574 454 67 59 298928 74 4875 8 934"; # in actual string
    >there are 112 million values


    Wow! A single string of maybe half a gigabyte length? That sounds like
    an awfully poor datastructure.

    >I would like to extract numveric values from specific position till
    >some position using regular expression.. I dont want to use split caue
    >it uses lot of memory..


    I cannot imagine that REs will be any more efficient than split(), which
    uses REs, BTW, too.

    >for example:
    >
    >offset = 3; length = 4;
    >
    >so the result string should be $str = "454 67 59 298928";


    I would put that data into a more suitable data structure.
    Maybe write the string to a file and then read it back into an array
    using the space character as the line separator?

    Or loop through the string character by character and note all positions
    of space characters in an array. Then you can use substr() to extract
    the desired substring directly.

    jue
     
    Jürgen Exner, Sep 11, 2008
    #11
  12. Vishal G

    Guest

    Jürgen Exner <> wrote:
    > Vishal G <> wrote:
    > >I have string which contain numbers...
    > >
    > >$str = "30 574 454 67 59 298928 74 4875 8 934"; # in actual string
    > >there are 112 million values

    >
    > Wow! A single string of maybe half a gigabyte length? That sounds like
    > an awfully poor datastructure.


    Yes. But Perl is often used as a glue language. As such, it often has
    to deal with poor datastructures. If the other programs could be easily
    changed to do the right thing in the first place, we wouldn't need the
    glue.

    ....
    >
    > I would put that data into a more suitable data structure.
    > Maybe write the string to a file and then read it back into an array
    > using the space character as the line separator?


    That would use at least half as much memory as splitting, and so would
    probably be memory prohibitive.

    > Or loop through the string character by character and note all positions
    > of space characters in an array. Then you can use substr() to extract
    > the desired substring directly.


    If this only has to be done once per execution, then I would just leave it
    in the original structure and step through it with /(\d+)/g. If I was going
    to do several extractions, I would convert the string so that each element
    is fixed size (either by padding the numbers with 0 to the max length, or
    by using pack with the appropriate template) then use substr to get the
    desired chunk.

    while ($str=~/(\d+)/g) {$y.=pack "i", $1};
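
    Spelled out a little further, the pack-then-substr approach might look like the sketch below. The sub names and the $WIDTH variable are made up for illustration; it assumes every value fits in a native signed int, a 1-based offset as in the original example, and a range that lies inside the data.

```perl
use strict;
use warnings;

# Size in bytes of one packed native int on this platform.
my $WIDTH = length(pack('i', 0));

# One-time conversion: every number becomes a fixed-size record.
sub to_fixed_width {
    my ($str_ref) = @_;
    my $y = '';
    while ($$str_ref =~ /(\d+)/g) {
        $y .= pack 'i', $1;
    }
    return $y;
}

# Because records are fixed size, the wanted chunk is a single substr,
# and only that chunk is unpacked back into numbers.
sub fixed_width_range {
    my ($y_ref, $offset, $length) = @_;
    my $chunk = substr($$y_ref, ($offset - 1) * $WIDTH, $length * $WIDTH);
    return join ' ', unpack 'i*', $chunk;
}

my $str = "30 574 454 67 59 298928 74 4875 8 934";
my $y   = to_fixed_width(\$str);
print fixed_width_range(\$y, 3, 4), "\n";
```

    The conversion costs one full pass over the string, so it only pays off when several extractions follow.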

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    The costs of publication of this article were defrayed in part by the
    payment of page charges. This article must therefore be hereby marked
    advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
    this fact.
     
    , Sep 11, 2008
    #12
  13. Vishal G

    Vishal G Guest

    On Sep 12, 3:44 am, wrote:
    > Jürgen Exner <> wrote:
    > >VishalG<> wrote:
    > > >I have string which contain numbers...

    >
    > > >$str = "30 574 454 67 59 298928 74 4875 8 934"; # in actual string
    > > >there are 112 million values

    >
    > > Wow! A single string of maybe half a gigabyte length? That sounds like
    > > an awfully poor datastructure.

    >
    > Yes.  But Perl is often used as a glue language.  As such, it often has
    > to deal with poor datastructures.  If the other programs could be easily
    > changed to do the right thing in the first place, we wouldn't need the
    > glue.
    >
    > ...
    >
    >
    >
    > > I would put that data into a more suitable data structure.
    > > Maybe write the string to a file and then read it back into an array
    > > using the space character as the line separator?

    >
    > That would use at least half as much memory as splitting, and so would
    > probably be memory prohibitive.
    >
    > > Or loop through the string character by character and note all positions
    > > of space characters in an array. Then you can use substr() to extract
    > > the desired substring directly.

    >
    > If this only has to be done once per execution, then I would just leave it
    > in the original structure and step though it with /(\d+)/g.  If I was going
    > to do several extractions, I would convert the string so that each element
    > is fixed size (either by padding the numbers with 0 to the max length, or
    > by using pack with the appropriate template) then use substr to get the
    > desired chunk.
    >
    > while ($str=~/(\d+)/g) {$y.=pack "i", $1};
    >
    > Xho
    >
    > --
    > --------------------http://NewsReader.Com/--------------------
    > The costs of publication of this article were defrayed in part by the
    > payment of page charges. This article must therefore be hereby marked
    > advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
    > this fact.


    Thanks a lot for all these insightful ideas.

    > Wow! A single string of maybe half a gigabyte length? That sounds
    > like an awfully poor data structure.

    Actually, I am changing Perl scripts written by someone else, and
    changing the data structure is not an option because other modules
    depend on it.

    It's an ACE (assembly) file which contains DNA and a quality value for
    each base. So, if the DNA is 220 million bases long, we end up with
    one string containing 220 million numeric values, which is cumbersome
    to manage when you have to add and extract information from this string.

    The information is in the file, as I said earlier, and is read into this
    data structure. I am trying to split the assembly into parts of
    variable length. That's why I am trying to split the string, but if I
    use the split function to get 1 million records, it uses 3.0 GB of
    memory, which is ridiculous.
     
    Vishal G, Sep 12, 2008
    #13
  14. Vishal G <> wrote:
    >On Sep 12, 3:44 am, wrote:
    >> Jürgen Exner <> wrote:
    >> >VishalG<> wrote:
    >> > >I have string which contain numbers...

    >>
    >> > >$str = "30 574 454 67 59 298928 74 4875 8 934"; # in actual string
    >> > >there are 112 million values

    >>
    >> > Wow! A single string of maybe half a gigabyte length? That sounds like
    >> > an awfully poor datastructure.


    >Actually, I am changing Perl scripts written by someone else and
    >changing the data structure is not an option cause other modules
    >depends on it


    Well, I guess sometimes you are stuck with whatever you are stuck with.

    >The information is in the file as I said earlier and read in to this
    >data structure. I am trying to split the assembly into parts of
    >variable length.


    You don't seem to be very familiar with Perl, so let me restate what has
    been said earlier:
    Perl has a very flexible concept of what constitutes a 'line' in a file.
    In particular _YOU_ as a programmer can define, which character is
    considered the end-of-line separator/terminator.

    Now, if you set the INPUT RECORD SEPARATOR $/ to the space character,
    then as far as Perl is concerned each number becomes its own line.

    Now you can read your file line by line (i.e. number by number) and Perl
    conveniently even keeps a record of which line you just read in the
    variable INPUT_LINE_NUMBER $. .

    To e.g. print $n numbers, starting with number $start becomes something
    like (sketch only, untested):

    $. = ' ';
    while ($. < $start) {
    $dummy = <IN>; #read line (=number) and throw it away
    }
    for (1..$n) {
    print scalar <IN>;
    }

    The largest piece of data in this code snippet is the list (1..$n)
    and even that can be replaced with a while loop, reducing the memory
    footprint to a few bytes for just one line (=number) at a time.
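
    For reference, a self-contained variant of the same idea, with the record separator set through $/ and the record counter read from $. . It is only a sketch: the name read_range is made up, $start is treated as the 1-based number of the first record to keep, the filehandle is assumed to be freshly opened, and the numbers are assumed to be separated by single spaces. The demonstration reads from an in-memory filehandle so it runs without a data file.

```perl
use strict;
use warnings;

# Return $n numbers starting with the 1-based record $start,
# reading one space-terminated record at a time from $fh.
sub read_range {
    my ($fh, $start, $n) = @_;
    local $/ = ' ';                  # a "line" now ends at each space
    my @out;
    while (my $rec = <$fh>) {
        next if $. < $start;         # $. counts the records read so far
        $rec =~ s/\s+\z//;           # drop the trailing space or newline
        push @out, $rec;
        last if @out == $n;
    }
    return join ' ', @out;
}

# Demonstration on an in-memory filehandle instead of a real file.
my $data = "30 574 454 67 59 298928 74 4875 8 934\n";
open my $in, '<', \$data or die "cannot open in-memory handle: $!";
print read_range($in, 3, 4), "\n";
close $in;
```

    Only one record is held at a time while skipping, so memory use stays small no matter how large the file is.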

    jue
     
    Jürgen Exner, Sep 12, 2008
    #14
  15. Glenn Jackman <> wrote:
    >At 2008-09-12 12:43AM, "Jürgen Exner" wrote:
    >> Vishal G <> wrote:

    >[...]
    >> Now, if you set the INPUT RECORD SEPARATOR $/ to the space character,
    >> then as far as Perl is concerned each number becomes its own line.
    >>
    >> Now you can read your file line by line (i.e. number by number) and Perl
    >> conveniently even keeps a record of which line you just read in the
    >> variable INPUT_LINE_NUMBER $. .
    >>
    >> To e.g. print $n numbers, starting with number $start becomes something
    >> like (sketch only, untested):
    >>
    >> $. = ' ';

    >
    >Above should be: $/ = ' ';


    Thanks!

    jue
     
    Jürgen Exner, Sep 12, 2008
    #15
