Extract Numeric values from string


Vishal G

Hi there,

I have searched the whole group looking for a solution to my problem.

Actually, I don't understand Perl regular expressions properly...
working on it...

Here is the problem.

I have a string which contains numbers:

$str = "30 574 454 67 59 298928 74 4875 8 934"; # in the actual string there are 112 million values

I would like to extract numeric values from a specific position up to
some other position using a regular expression. I don't want to use split
because it uses a lot of memory.

For example:

$offset = 3; $length = 4;

so the resulting string should be $str = "454 67 59 298928";

Thanks in advance

Vishal
 
Tomislav Novak

Vishal G said:
I have a string which contains numbers:

$str = "30 574 454 67 59 298928 74 4875 8 934"; # in the actual string there are 112 million values

I would like to extract numeric values from a specific position up to
some other position using a regular expression. I don't want to use split
because it uses a lot of memory.

For example:

$offset = 3; $length = 4;

so the resulting string should be $str = "454 67 59 298928";

Well, you could always do something like:

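my $skip = $offset - 1;    # the offset in the example is 1-based
my $regex =
    qr/
        ^
        (?:\d+\s*){$skip}          # skip the numbers before the offset
        ((?:\d+\s*){$length})      # capture the next $length numbers
    /x;

my ($result) = $str =~ /$regex/;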
 
Ben Morrow

Quoth Tomislav Novak said:
Well, you could always do something like:

my $regex =
qr/
^
(?:\d+\s*) {$offset}
((?:\d+\s*){$length})
/x;

The string apparently contains 112M values. {} quantifiers in Perl cannot
be larger than 32766.

I would suggest running through the string using substr to check each
character at a time. Count the number of spaces, and collect up the
digits as needed. This will be slow, but will avoid copying the string.
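
A minimal sketch of the character-by-character scan described above. The helper name extract_by_scan is hypothetical, and it assumes a 1-based offset (as in the original example) and input made up only of digits and whitespace. The string is passed by reference so it is not copied.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return $length numbers from
# $$str_ref, starting at the 1-based position $offset, looking at one
# character at a time with substr.
sub extract_by_scan {
    my ($str_ref, $offset, $length) = @_;
    my $seen   = 0;     # how many numbers have started so far
    my $in_num = 0;     # true while inside a run of digits
    my $result = '';
    my $end    = length $$str_ref;

    for (my $i = 0; $i < $end; $i++) {
        my $c = substr $$str_ref, $i, 1;
        if ($c =~ /[0-9]/) {
            if (!$in_num) {
                $in_num = 1;
                $seen++;
                last if $seen >= $offset + $length;    # past the window: stop early
                $result .= ' ' if $seen > $offset;     # separator between kept numbers
            }
            $result .= $c if $seen >= $offset;         # keep digits inside the window
        }
        else {
            $in_num = 0;                               # whitespace ends the current number
        }
    }
    return $result;
}

my $str = "30 574 454 67 59 298928 74 4875 8 934";
print extract_by_scan(\$str, 3, 4), "\n";
```

The only extra memory used is the result string, at the cost of one substr call per character.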

In general, perl has a policy of trading memory for speed. If you are
short of memory, I would suggest using a different language with more
appropriate tradeoffs.

Ben
 
Dr.Ruud

Vishal G wrote:
I have a string which contains numbers:

$str = "30 574 454 67 59 298928 74 4875 8 934"; # in the actual string there are 112 million values

I would like to extract numeric values from a specific position up to
some other position using a regular expression. I don't want to use split
because it uses a lot of memory.

For example:

$offset = 3; $length = 4;

so the resulting string should be $str = "454 67 59 298928";


Maybe you are looking for something like this:

$ perl -Mstrict -Mwarnings -le '
print scalar localtime;
my $s; $s .= "$_ " for 1..10_000_000;
print scalar localtime;

my $offset = 9_999_903;
my $count = 4;

while ($s =~ m/([0-9]+)/g) {
    $count or last;
    --$offset > 0 and next;
    $count-- and print $1;
}
print scalar localtime;
'
Thu Sep 11 14:42:40 2008
Thu Sep 11 14:42:47 2008
9999903
9999904
9999905
9999906
Thu Sep 11 14:42:53 2008
 
cartercc

I have a string which contains numbers:

$str = "30 574 454 67 59 298928 74 4875 8 934"; # in the actual string there are 112 million values

I would like to extract numeric values from a specific position up to
some other position using a regular expression. I don't want to use split
because it uses a lot of memory.

For example:

$offset = 3; $length = 4;

so the resulting string should be $str = "454 67 59 298928";

Is your string already in memory, or does it come from storage? If the
latter, you might consider replacing the spaces with newlines and
then using a counter to iterate through the file with something like
this:

while (<INFILE>) {
    $counter++;
    if    ($counter < $offset)           { next; }           # not at the window yet
    elsif ($counter < $offset + $length) { print OUTFILE; }  # inside the window
    else                                 { last; }           # past the window, stop reading
}

If your string is already in memory, I would use the C trick of getc()
and test each character, again using a counter for the white space.
Using inline C would probably be faster and you could discard all the
characters you don't need.

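int c;  /* int rather than char, so that EOF can be told apart from data */
while ((c = getc(fp)) != EOF)  /* fp: an open FILE *, a placeholder here */
{ /* test c, count whitespace, and save what you need */
}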

CC
 
Dr.Ruud

Dr.Ruud wrote:
Vishal G:
I have a string which contains numbers:

$str = "30 574 454 67 59 298928 74 4875 8 934"; # in the actual string there are 112 million values

I would like to extract numeric values from a specific position up to
some other position using a regular expression. I don't want to use split
because it uses a lot of memory.

For example:

$offset = 3; $length = 4;

so the resulting string should be $str = "454 67 59 298928";


Maybe you are looking for something like this:

$ perl -Mstrict -Mwarnings -le '
print scalar localtime;
my $s; $s .= "$_ " for 1..10_000_000;
print scalar localtime;

my $offset = 9_999_903;
my $count = 4;

while ($s =~ m/([0-9]+)/g) {
    $count or last;
    --$offset > 0 and next;
    $count-- and print $1;
}
print scalar localtime;
'
Thu Sep 11 14:42:40 2008
Thu Sep 11 14:42:47 2008
9999903
9999904
9999905
9999906
Thu Sep 11 14:42:53 2008

Which means that the while(regexp) skips about 2 million numbers per
second.
So with $offset = 100_000_000 it may take about a minute.
 
Ben Morrow

Quoth cartercc said:
Is your string already in memory, or does it come from storage? If the
latter, you might consider replacing the spaces with new lines and
then using a counter to iterate through the file with something like
this:

while (<INFILE>)

No need to replace the spaces. $/ = " " will work just fine.
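
A sketch of what setting $/ to a space could look like. The helper name extract_records is hypothetical, the offset is assumed to be 1-based, and the numbers are assumed to be separated by exactly one space each. For demonstration it reads from an in-memory filehandle opened on the string; a handle on a real file would be used the same way.

```perl
use strict;
use warnings;

# Hypothetical helper: read $length numbers, starting at the 1-based
# position $offset, from a filehandle whose records end in a single space.
sub extract_records {
    my ($fh, $offset, $length) = @_;
    local $/ = ' ';                  # each number becomes its own "line"
    my @wanted;
    while (my $num = <$fh>) {
        next if $. < $offset;        # $. counts the records read so far
        chomp $num;                  # chomp strips the current $/, i.e. the space
        push @wanted, $num;
        last if @wanted == $length;  # stop reading as soon as the window is full
    }
    return join ' ', @wanted;
}

my $str = "30 574 454 67 59 298928 74 4875 8 934";
open my $fh, '<', \$str or die "cannot open in-memory handle: $!";
print extract_records($fh, 3, 4), "\n";
close $fh;
```

Only one record and the collected window are held at a time. Runs of several spaces would produce empty records, so real data may need an extra check.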

If your string is already in memory, I would use the C trick of getc()

getc reads from a file, not from memory.

Ben
 
Leon Timmermans

Hi there,

I have searched the whole group looking for a solution to my problem.

Actually, I don't understand Perl regular expressions properly...
working on it...

Here is the problem.

I have a string which contains numbers:

$str = "30 574 454 67 59 298928 74 4875 8 934"; # in the actual string there are 112 million values

I would like to extract numeric values from a specific position up to
some other position using a regular expression. I don't want to use split
because it uses a lot of memory.

For example:

$offset = 3; $length = 4;

so the resulting string should be $str = "454 67 59 298928";

Thanks in advance

Vishal

Why do you store that in a free-format string? I can think of a number
of better ways to store it. You could store it in a binary array (like in
C) and then access it using vec(). Tie::Array::Packed may also be an
interesting approach. By storing your data smarter, you can turn an O(N)
lookup into an O(1) one.
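
As a rough illustration of the vec() idea, here is a small sketch. It assumes every value fits in an unsigned 32-bit integer, uses a 1-based offset as in the question, and the helper name slice_packed is made up for this example.

```perl
use strict;
use warnings;

my $str = "30 574 454 67 59 298928 74 4875 8 934";

# Store each value in a fixed-width 32-bit slot of one binary string.
# Assumption: every value fits in an unsigned 32-bit integer.
my $packed = '';
my $i = 0;
while ($str =~ /([0-9]+)/g) {
    vec($packed, $i++, 32) = $1;
}

# Hypothetical helper: each element is read directly by index, so no
# scanning of the earlier elements is needed.
sub slice_packed {
    my ($buf_ref, $offset, $length) = @_;
    return join ' ',
        map { vec($$buf_ref, $_, 32) } $offset - 1 .. $offset + $length - 2;
}

print slice_packed(\$packed, 3, 4), "\n";
```

The conversion still costs one pass over the string, so this pays off mainly when several extractions are needed.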

Regards,

Leon Timmermans
 
Ted Zlatanov

VG> Here is the problem..

VG> I have a string which contains numbers:

VG> $str = "30 574 454 67 59 298928 74 4875 8 934"; # in the actual string there are 112 million values

VG> I would like to extract numeric values from a specific position up to
VG> some other position using a regular expression. I don't want to use split
VG> because it uses a lot of memory.

This works for me. To avoid dealing with edge cases, I surround the
input with spaces. The assumption is that only digits and spaces are in
your data; the algorithm uses that to find the next space or the next
digit. Note also that slow_extract() is there as a reference to check
the algorithm works OK. It's very possible it has bugs: I wrote it to
show you the general idea of iterating through the string, and tests are
what you see in __DATA__ which is minimal.

You should consider keeping large data sets like this in a database,
e.g. SQLite. Then operating on it from Perl or other languages is much
easier, especially if you index your columns appropriately.

Ted

#!/usr/bin/perl

use warnings;
use strict;
use Data::Dumper;
use List::Util qw/min/;

my $str = <DATA>; # we keep it global so it's not passed around
chomp $str;
$str = " $str ";


while (<DATA>)
{
    my ($pos, $offset) = m/(\d+)\D+(\d+)/;
    my $slow_result = slow_extract($pos, $offset);
    my $fast_result = fast_extract($pos, $offset);
    my $ok = $slow_result eq $fast_result;
    print "position $pos, offset $offset: $slow_result / $fast_result / OK=$ok\n";
}

sub slow_extract
{
    my $logical_pos = shift @_;
    my $n = shift @_;

    my @numbers = split ' ', $str;
    return join ' ', grep { defined } @numbers[$logical_pos .. $logical_pos+$n-1];
}

sub fast_get_number
{
    my $start_pos = shift @_;

    my @matches = grep { defined && $_ > 0 } map { index($str, $_, $start_pos) } 0..9;

    return unless scalar @matches;

    my $start = min(@matches);
    my $end = index($str, ' ', $start);
    return ($end, substr($str, $start, $end-$start));
}

sub fast_extract
{
    my $logical_pos = shift @_;
    my $n = shift @_;

    my $at = 0;
    my $current_logical_pos = 0;

    my @numbers;
    while (1)
    {
        my @next = fast_get_number($at);
        print Dumper \@next;
        last unless scalar @next;
        last if $next[0] < 0;
        if ($current_logical_pos >= $logical_pos)
        {
            push @numbers, $next[1];
        }
        $at = $next[0];
        last if scalar @numbers == $n;
        $current_logical_pos++;
    }

    return join ' ', @numbers;
}

__DATA__
93430 574 454 67 59 298928 74 4875 8 93430
3 4
5 6
7 8
10 2
 
cartercc

This is why I read this group, always learning things at the (small)
cost of exhibiting my own ignorance. It always amazes me the depth of
knowledge that some people have, and a little bit depressing as to my
own lack of knowledge.

I have several friends who are medical doctors, and know several of
their children who are in various stages of the medical education
process, and I've always liked that approach: two years in the
classroom and four (or more) in the field. In a job you get stuck in a
rut where you might have the same experience thousands of times,
unlike a forum like c.l.p.m. where you can broaden your knowledge by
way of specific, limited example.

All this as a rather wordy 'Thanks'.

CC

 
Jürgen Exner

Vishal G said:
I have a string which contains numbers:

$str = "30 574 454 67 59 298928 74 4875 8 934"; # in the actual string there are 112 million values

Wow! A single string of maybe half a gigabyte length? That sounds like
an awfully poor data structure.

Vishal G said:
I would like to extract numeric values from a specific position up to
some other position using a regular expression. I don't want to use split
because it uses a lot of memory.

I cannot imagine that REs will be any more efficient than split(), which
uses REs, BTW, too.

Vishal G said:
For example:

$offset = 3; $length = 4;

so the resulting string should be $str = "454 67 59 298928";

I would put that data into a more suitable data structure.
Maybe write the string to a file and then read it back into an array
using the space character as the line separator?

Or loop through the string character by character and note all positions
of space characters in an array. Then you can use substr() to extract
the desired substring directly.

jue
 
xhoster

Jürgen Exner said:
Wow! A single string of maybe half a gigabyte length? That sounds like
an awfully poor datastructure.

Yes. But Perl is often used as a glue language. As such, it often has
to deal with poor datastructures. If the other programs could be easily
changed to do the right thing in the first place, we wouldn't need the
glue.

....

Jürgen Exner said:
I would put that data into a more suitable data structure.
Maybe write the string to a file and then read it back into an array
using the space character as the line separator?

That would use at least half as much memory as splitting, and so would
probably be memory prohibitive.

Jürgen Exner said:
Or loop through the string character by character and note all positions
of space characters in an array. Then you can use substr() to extract
the desired substring directly.

If this only has to be done once per execution, then I would just leave it
in the original structure and step through it with /(\d+)/g. If I were going
to do several extractions, I would convert the string so that each element
is a fixed size (either by padding the numbers with 0 to the max length, or
by using pack with the appropriate template) and then use substr to get the
desired chunk.

while ($str=~/(\d+)/g) {$y.=pack "i", $1};
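
To illustrate the second half of that idea, here is a sketch of pulling a chunk back out of the packed string with substr and unpack. The helper name chunk is hypothetical, the offset is taken as 1-based as in the question, and the values are assumed to fit in a native signed int.

```perl
use strict;
use warnings;

my $str = "30 574 454 67 59 298928 74 4875 8 934";

# One pass to convert; afterwards every element occupies the same number of bytes.
my $y = '';
while ($str =~ /(\d+)/g) { $y .= pack "i", $1 }

my $width = length pack "i", 0;    # bytes per element on this platform

# Hypothetical helper: jump straight to the wanted bytes and unpack only those.
sub chunk {
    my ($packed_ref, $offset, $length) = @_;
    my $bytes = substr $$packed_ref, ($offset - 1) * $width, $length * $width;
    return join ' ', unpack "i*", $bytes;
}

print chunk(\$y, 3, 4), "\n";
```

Because each element has a fixed width, the byte position of element N is simple arithmetic and no scanning is needed after the initial conversion.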

Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.
 
Vishal G

Thanks a lot for all these insightful ideas.

Jürgen Exner said:
Wow! A single string of maybe half a gigabyte length? That sounds
like an awfully poor data structure.

Actually, I am changing Perl scripts written by someone else, and
changing the data structure is not an option because other modules
depend on it.

It's an ACE (assembly) file which contains DNA and a quality value for
each base. So, if the DNA is 220 million bases long, we end up with
one string containing 220 million numeric values, which is cumbersome
to manage when you have to add and extract information from this string.

The information is in the file, as I said earlier, and is read into this
data structure. I am trying to split the assembly into parts of
variable length. That's why I am trying to split the string, but if I
use the split function to get 1 million records, it uses 3.0 GB of
memory, which is ridiculous.
 
Jürgen Exner

Vishal G said:
Actually, I am changing Perl scripts written by someone else and
changing the data structure is not an option because other modules
depend on it.

Well, I guess sometimes you are stuck with whatever you are stuck with.

Vishal G said:
The information is in the file as I said earlier and read in to this
data structure. I am trying to split the assembly into parts of
variable length.

You don't seem to be very familiar with Perl, so let me restate what has
been said earlier:
Perl has a very flexible concept of what constitutes a 'line' in a file.
In particular _YOU_ as a programmer can define, which character is
considered the end-of-line separator/terminator.

Now, if you set the INPUT RECORD SEPARATOR $/ to the space character,
then as far as Perl is concerned each number becomes its own line.

Now you can read your file line by line (i.e. number by number) and Perl
conveniently even keeps a record of which line you just read in the
variable INPUT_LINE_NUMBER $. .

To e.g. print $n numbers, starting with number $start becomes something
like (sketch only, untested):

$/ = ' ';    # input record separator: each number is now its own "line"
while (($. || 0) < $start - 1) {
    defined(my $dummy = <IN>) or last;    # read line (=number) and throw it away
}
for (1..$n) {
    print scalar <IN>;
}

The largest piece of data in this code snippet is the list (1..$n)
and even that can be replaced with a while loop, reducing the memory
footprint to a few bytes for just one line (=number) at a time.

jue
 