Extract Numeric values from string

Discussion in 'Perl Misc' started by Vishal G, Sep 11, 2008.

  1. Vishal G

    Vishal G Guest

    Hi there,

    I have searched the whole group looking for a solution to my problem.

    Actually, I don't understand Perl regular expressions properly yet...
    working on it...

    Here is the problem:

    I have a string which contains numbers:

    $str = "30 574 454 67 59 298928 74 4875 8 934"; # the actual string has 112 million values

    I would like to extract the numeric values from a specific position up
    to some later position using a regular expression. I don't want to use
    split because it uses a lot of memory.

    For example:

    offset = 3; length = 4;

    so the resulting string should be $str = "454 67 59 298928";

    Thanks in advance

    Vishal
     
    Vishal G, Sep 11, 2008
    #1

  2. Well, you could always do something like:

    my $regex =
    qr/
    ^
    (?:\d+\s*){$offset}      # skip the leading numbers (offset - 1 if counting from 1)
    ((?:\d+\s*){$length})    # capture the wanted run of numbers
    /x;

    my ($result) = $str =~ /$regex/;
     
    Tomislav Novak, Sep 11, 2008
    #2

  3. Vishal G

    Ben Morrow Guest

    The string apparently contains 112M values. {} quantifiers in Perl cannot
    be larger than 32766.

    I would suggest running through the string using substr to check each
    character at a time. Count the number of spaces, and collect up the
    digits as needed. This will be slow, but will avoid copying the string.
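
    A minimal sketch of that character-by-character scan might look like the
    following. The helper name extract_by_scan is made up for illustration,
    and it assumes the offset is 1-based (as in the original example) and
    that the string contains only digits and single spaces.

```perl
use strict;
use warnings;

# Hypothetical helper: return $length numbers starting at the 1-based
# position $offset, looking at one character of $$str_ref at a time.
# The string is passed by reference so it is never copied.
sub extract_by_scan {
    my ($str_ref, $offset, $length) = @_;
    my ($index, $in_number, $out) = (0, 0, '');
    my $last = $offset + $length - 1;          # index of the last wanted number

    for my $i (0 .. length($$str_ref) - 1) {
        my $ch = substr($$str_ref, $i, 1);
        if ($ch eq ' ') {
            $in_number = 0;
            last if $index >= $last;           # wanted range is complete
            next;
        }
        if (!$in_number) {                     # first digit of a new number
            $in_number = 1;
            $index++;
            $out .= ' ' if $index > $offset && $index <= $last;
        }
        $out .= $ch if $index >= $offset && $index <= $last;
    }
    return $out;
}

my $str = "30 574 454 67 59 298928 74 4875 8 934";
print extract_by_scan(\$str, 3, 4), "\n";
```

    The loop stops as soon as the last wanted number is complete, so only
    the part of the string up to that point is visited.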

    In general, perl has a policy of trading memory for speed. If you are
    short of memory, I would suggest using a different language with more
    appropriate tradeoffs.

    Ben
     
    Ben Morrow, Sep 11, 2008
    #3
  4. Vishal G

    Dr.Ruud Guest

    Vishal G wrote:

    Maybe you are looking for something like this:

    $ perl -Mstrict -Mwarnings -le '
    print scalar localtime;
    my $s; $s .= "$_ " for 1..10_000_000;
    print scalar localtime;

    my $offset = 9_999_903;
    my $count = 4;

    while ($s =~ m/([0-9]+)/g) {
    $count or last;
    --$offset > 0 and next;
    $count-- and print $1;
    }
    print scalar localtime;
    '
    Thu Sep 11 14:42:40 2008
    Thu Sep 11 14:42:47 2008
    9999903
    9999904
    9999905
    9999906
    Thu Sep 11 14:42:53 2008
     
    Dr.Ruud, Sep 11, 2008
    #4
  5. Vishal G

    cartercc Guest

    Is your string already in memory, or does it come from storage? If the
    latter, you might consider replacing the spaces with new lines and
    then using a counter to iterate through the file with something like
    this:

    while (<INFILE>)
    { $counter++;
    if ($counter < $offset) { next; }
    elsif ($counter < $offset + $length) { print OUTFILE; }
    else { last; }
    }

    If your string is already in memory, I would use the C trick of getc()
    and test each character, again using a counter for the white space.
    Using inline C would probably be faster and you could discard all the
    characters you don't need.

    int c;
    while ((c = getc(stdin)) != EOF)
    { /* test c, count whitespace, and save what you need */
    }

    CC
     
    cartercc, Sep 11, 2008
    #5
  6. Vishal G

    Dr.Ruud Guest

    Dr.Ruud wrote:
    Which means that the while(regexp) skips about 2 million numbers per
    second.
    So with $offset = 100_000_000 it may take about a minute.
     
    Dr.Ruud, Sep 11, 2008
    #6
  7. Vishal G

    Ben Morrow Guest

    No need to replace the spaces. $/ = " " will work just fine.

    getc reads from a file, not from memory.
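
    The $/ = " " idea can be sketched as below. To keep the example
    self-contained it opens an in-memory filehandle on the string, which is
    an assumption of this sketch and not something proposed in the thread;
    the same loop should work on a real file. The helper name
    extract_by_records is hypothetical, the offset is taken to be 1-based,
    and the data is assumed to have no trailing newline.

```perl
use strict;
use warnings;

# Hypothetical helper: read space-terminated records and keep $length of
# them, starting at the 1-based record number $offset.
sub extract_by_records {
    my ($str_ref, $offset, $length) = @_;
    open my $fh, '<', $str_ref or die "cannot open in-memory handle: $!";
    local $/ = ' ';                    # each number is now one "line"

    my ($seen, @wanted) = (0);
    while (defined(my $num = <$fh>)) {
        chomp $num;                    # strips the trailing separator
        next unless length $num;       # skip empty records from repeated spaces
        $seen++;
        next if $seen < $offset;
        push @wanted, $num;
        last if @wanted == $length;
    }
    close $fh;
    return join ' ', @wanted;
}

my $str = "30 574 454 67 59 298928 74 4875 8 934";
print extract_by_records(\$str, 3, 4), "\n";
```

    Only one record is held at a time, apart from the numbers being
    collected.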

    Ben
     
    Ben Morrow, Sep 11, 2008
    #7
  8. Why do you store that in a free-format string? I can think of a number
    of better ways to store it. You could store it in a binary array (like in
    C) and then access it using vec(). Tie::Array::Packed may also be an
    interesting approach. By storing your data in a smarter way, you can turn
    an O(N) lookup into an O(1) one.
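
    As a rough illustration of the vec() approach, the sketch below packs
    every number into a 32-bit slot once and then reads a slice by index.
    The 32-bit width and the helper name packed_slice are assumptions made
    for this example; the width has to be large enough for the largest
    value, and 112 million values would need about 448 MB at this width.

```perl
use strict;
use warnings;

my $str = "30 574 454 67 59 298928 74 4875 8 934";

# One pass over the string: store each number in a fixed 32-bit slot.
my $packed = '';
my $i = 0;
while ($str =~ /(\d+)/g) {
    vec($packed, $i++, 32) = $1;
}

# Hypothetical helper: $offset is 1-based, as in the question.
# Each element is read directly by index, with no scanning.
sub packed_slice {
    my ($buf_ref, $offset, $length) = @_;
    return join ' ',
        map { vec($$buf_ref, $_, 32) } $offset - 1 .. $offset + $length - 2;
}

print packed_slice(\$packed, 3, 4), "\n";
```

    Note that vec() returns 0 for slots past the end of the buffer, so a
    caller would need to check the range itself.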

    Regards,

    Leon Timmermans
     
    Leon Timmermans, Sep 11, 2008
    #8
  9. Vishal G

    Ted Zlatanov Guest

    VG> Here is the problem:

    VG> I have a string which contains numbers:

    VG> $str = "30 574 454 67 59 298928 74 4875 8 934"; # the actual string has 112 million values

    VG> I would like to extract the numeric values from a specific position up
    VG> to some later position using a regular expression. I don't want to use
    VG> split because it uses a lot of memory.

    This works for me. To avoid dealing with edge cases, I surround the
    input with spaces. The assumption is that only digits and spaces are in
    your data; the algorithm uses that to find the next space or the next
    digit. Note also that slow_extract() is there as a reference to check
    the algorithm works OK. It's very possible it has bugs: I wrote it to
    show you the general idea of iterating through the string, and tests are
    what you see in __DATA__ which is minimal.

    You should consider keeping large data sets like this in a database,
    e.g. SQLite. Then operating on it from Perl or other languages is much
    easier, especially if you index your columns appropriately.

    Ted

    #!/usr/bin/perl

    use warnings;
    use strict;
    use Data::Dumper;
    use List::Util qw/min/;

    my $str = <DATA>; # we keep it global so it's not passed around
    chomp $str;
    $str = " $str ";


    while (<DATA>)
    {
        my ($pos, $offset) = m/(\d+)\D+(\d+)/;
        my $slow_result = slow_extract($pos, $offset);
        my $fast_result = fast_extract($pos, $offset);
        my $ok = $slow_result eq $fast_result;
        print "position $pos, offset $offset: $slow_result / $fast_result / OK=$ok\n";
    }

    sub slow_extract
    {
        my $logical_pos = shift @_;
        my $n = shift @_;

        my @numbers = split ' ', $str;
        return join ' ', grep { defined } @numbers[$logical_pos .. $logical_pos+$n-1];
    }

    sub fast_get_number
    {
        my $start_pos = shift @_;

        my @matches = grep { defined && $_ > 0 } map { index($str, $_, $start_pos) } 0..9;

        return unless scalar @matches;

        my $start = min(@matches);
        my $end = index($str, ' ', $start);
        return ($end, substr($str, $start, $end-$start));
    }

    sub fast_extract
    {
        my $logical_pos = shift @_;
        my $n = shift @_;

        my $at = 0;
        my $current_logical_pos = 0;

        my @numbers;
        while (1)
        {
            my @next = fast_get_number($at);
            print Dumper \@next;
            last unless scalar @next;
            last if $next[0] < 0;
            if ($current_logical_pos >= $logical_pos)
            {
                push @numbers, $next[1];
            }
            $at = $next[0];
            last if scalar @numbers == $n;
            $current_logical_pos++;
        }

        return join ' ', @numbers;
    }

    __DATA__
    93430 574 454 67 59 298928 74 4875 8 93430
    3 4
    5 6
    7 8
    10 2
     
    Ted Zlatanov, Sep 11, 2008
    #9
  10. Vishal G

    cartercc Guest

    This is why I read this group, always learning things at the (small)
    cost of exhibiting my own ignorance. It always amazes me the depth of
    knowledge that some people have, and a little bit depressing as to my
    own lack of knowledge.

    I have several friends who are medical doctors, and know several of
    their children who are in various stages of the medical education
    process, and I've always liked that approach: two years in the
    classroom and four (or more) in the field. In a job you get stuck in a
    rut where you might have the same experience thousands of times,
    unlike a forum like c.l.p.m. where you can broaden your knowledge by
    way of specific, limited example.

    All this as a rather wordy 'Thanks'.

    CC

     
    cartercc, Sep 11, 2008
    #10
  11. Wow! A single string of maybe half a gigabyte length? That sounds like
    an awfully poor datastructure.
    I cannot imagine that REs will be any more efficient than split(), which
    uses REs, BTW, too.
    I would put that data into a more suitable data structure.
    Maybe write the string to a file and then read it back into an array
    using the space character as the line separator?

    Or loop through the string character by character and note all positions
    of space characters in an array. Then you can use substr() to extract
    the desired substring directly.
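
    A variant of this idea, sketched below, avoids storing every space
    position: it uses index() to hop from separator to separator and keeps
    only the start and end of the wanted range before calling substr() once.
    This is an illustration and not code from the thread; the helper name
    extract_by_index is made up, and it assumes a 1-based offset and exactly
    one space between numbers.

```perl
use strict;
use warnings;

# Hypothetical helper: locate the wanted range with index() and copy it
# out with a single substr(). Assumes single-space separators.
sub extract_by_index {
    my ($str_ref, $offset, $length) = @_;
    return '' if $length < 1;

    # Hop over the first $offset - 1 separators to find the start.
    my $start = 0;
    for (2 .. $offset) {
        my $sp = index($$str_ref, ' ', $start);
        return '' if $sp < 0;              # fewer numbers than $offset
        $start = $sp + 1;
    }

    # Hop over $length more numbers to find the end of the range.
    my $end = $start;
    for (1 .. $length) {
        my $sp = index($$str_ref, ' ', $end);
        if ($sp < 0) {                     # ran off the end of the string
            $end = length($$str_ref) + 1;
            last;
        }
        $end = $sp + 1;
    }
    return substr($$str_ref, $start, $end - 1 - $start);
}

my $str = "30 574 454 67 59 298928 74 4875 8 934";
print extract_by_index(\$str, 3, 4), "\n";
```

    The string is passed by reference, so the only copy made is the
    extracted substring itself.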

    jue
     
    Jürgen Exner, Sep 11, 2008
    #11
  12. Vishal G

    xhoster Guest

    Yes. But Perl is often used as a glue language. As such, it often has
    to deal with poor datastructures. If the other programs could be easily
    changed to do the right thing in the first place, we wouldn't need the
    glue.

    ....
    That would use at least half as much memory as splitting, and so would
    probably be memory prohibitive.
    If this only has to be done once per execution, then I would just leave it
    in the original structure and step though it with /(\d+)/g. If I was going
    to do several extractions, I would convert the string so that each element
    is fixed size (either by padding the numbers with 0 to the max length, or
    by using pack with the appropriate template) then use substr to get the
    desired chunk.

    while ($str=~/(\d+)/g) {$y.=pack "i", $1};
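
    To show how the packed string could then be read back, here is a small
    sketch that builds on the line above. The helper name fixed_slice is
    hypothetical, the offset is taken to be 1-based, and the requested range
    is assumed to lie inside the data.

```perl
use strict;
use warnings;

my $str = "30 574 454 67 59 298928 74 4875 8 934";

# Convert once: every number becomes one fixed-width native integer.
my $packed = '';
while ($str =~ /(\d+)/g) { $packed .= pack 'i', $1 }

# Hypothetical helper: cut the wanted chunk out with substr and unpack it.
sub fixed_slice {
    my ($buf_ref, $offset, $length) = @_;
    my $width = length(pack('i', 0));      # bytes per element on this platform
    my $chunk = substr($$buf_ref, ($offset - 1) * $width, $length * $width);
    return join ' ', unpack 'i*', $chunk;
}

print fixed_slice(\$packed, 3, 4), "\n";
```

    Because every element has the same width, the position of any number
    follows from simple arithmetic, and no scanning is needed after the
    one-time conversion.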

    Xho

     
    xhoster, Sep 11, 2008
    #12
  13. Vishal G

    Vishal G Guest

    Thanks a lot for all these insightful ideas.

    -Wow! A single string of maybe half a gigabyte length? That sounds
    like an awfully poor data structure.

    Actually, I am changing Perl scripts written by someone else, and
    changing the data structure is not an option because other modules
    depend on it.

    It's an ACE (assembly) file which contains DNA and a quality value for
    each base. So, if the DNA is 220 million bases long, we end up with
    one string containing 220 million numeric values, which is cumbersome
    to manage when you have to add and extract information from this string.

    The information is in the file, as I said earlier, and is read into this
    data structure. I am trying to split the assembly into parts of
    variable length. That's why I am trying to split the string, but if I
    use the split function to get the 1 million records, it uses 3.0 GB of
    memory, which is ridiculous.
     
    Vishal G, Sep 12, 2008
    #13
  14. Well, I guess sometimes you are stuck with whatever you are stuck with.
    You don't seem to be very familiar with Perl, so let me restate what has
    been said earlier:
    Perl has a very flexible concept of what constitutes a 'line' in a file.
    In particular _YOU_ as a programmer can define, which character is
    considered the end-of-line separator/terminator.

    Now, if you set the INPUT RECORD SEPARATOR $/ to the space character,
    then as far as Perl is concerned each number becomes its own line.

    Now you can read your file line by line (i.e. number by number) and Perl
    conveniently even keeps a record of which line you just read in the
    variable INPUT_LINE_NUMBER $. .

    To e.g. print $n numbers, starting with number $start becomes something
    like (sketch only, untested):

    $/ = ' ';  # input record separator: each number becomes one "line"
    while ($. < $start) {
    $dummy = <IN>; #read line (=number) and throw it away
    }
    for (1..$n) {
    print scalar <IN>;
    }

    The largest piece of data in this code snippet is the list (1..$n)
    and even that can be replaced with a while loop, reducing the memory
    footprint to a few bytes for just one line (=number) at a time.

    jue
     
    Jürgen Exner, Sep 12, 2008
    #14
  15. Thanks!

    jue
     
    Jürgen Exner, Sep 12, 2008
    #15
