Scalable method for searching in relatively big files

KaZ

Hi,

I want to do this:

I have a file A (up to around 40 MB, which I know isn't that big), and
I have to read it line by line (I use a while loop for this) and
possibly print something into file B, depending on what is in the
line. In each iteration, I have to search file A for a pattern, and
search file B to check what has already been written to it.

Right now, I do the searches with arrays and grep. I put the whole of
file A in an array (@a); file B starts out as an empty array (@b), and
at the end I write @b into file B. For the while loop I read each line
directly from file A, because I feared it would interfere with the
search if I also used @a for the loop.

File A is only 4 MB for the moment, and the script takes 10 minutes
to complete. So it isn't a scalable solution in my opinion.

I was told to use a database or possibly tied hashes in order to get
more scalability.

Can anybody tell me a bit more about these two methods? I have read what a
hash is, but fail to see how it could help me.

Greetings,
 
David Squire

Show us your script, and someone here will be able to help. Without
knowing exactly what kind of search (exact match? regex? whole line?)
you are doing, it is impossible to recommend much.


DS
 
Brian Wakem

As David has said, you need to show us some code.

Usually replacing regexes with index (where it can be used) will give a big
performance increase. I doubt it needs to take 10 minutes, or even 10
seconds.
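
For a fixed substring, the kind of test meant here could look like the sketch below. The sample line and the value of `$nb` are made up for illustration; they are not from the original script.

```perl
use strict;
use warnings;

# Hypothetical sample data: one tab-separated line and a value to look for.
my $line = "a\tb\t42\tc\n";
my $nb   = 42;

# index returns the position of the substring, or -1 if it is absent.
my $found_by_index = index($line, "\t$nb\t") >= 0 ? 1 : 0;

# The equivalent regex; \Q...\E keeps any metacharacters in $nb literal.
my $found_by_regex = $line =~ /\t\Q$nb\E\t/ ? 1 : 0;

print "index: $found_by_index, regex: $found_by_regex\n";
```

Either form still has to scan the string on every call, so this only trims the cost of each search; it does not reduce how many searches are made.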
 
KaZ

David said:
Show us your script, and someone here will be able to help. Without
knowing exactly what kind of search (exact match? regex? whole line?)
you are doing, it is impossible to recommend much.


DS

Hi,

it looks like this:

--------------------------------------------------------------------------------------------------------
open (FILEA, '<', "./filea.txt")

while (defined($line = <FILEA>)) {
@line = split '\t', $line;
$var0 = @line[0];
......
$var7 = @line[7];
$nb = @line[8];

if ( $var3 eq "blah") {
if ((not grep { /string=$nb;/ } @b) && (not grep { /\t$nb\t/ } @a))
{
push @b, "blah blah blah string=$nb\n";
}
}
elsif ( $var3 eq "other") {
# something similar to the preceding if....
}
elsif ($var6 eq "something") {
# something similar to preceding elsif ....
}

}

open (FILEB, '>', "./path-to-the-file.txt") or die "Can't open path-to-the-file: $!";
print FILEB @b;
close FILEB;
--------------------------------------------------------------------------------------------------------

I hope this is enough. I can post the complete script, but I would have
to "anonymize" it a bit first, sorry.

Thanks for the answer,
kaz
 
KaZ

Sorry, I made a mistake:
each "if ($var eq "some_string")" should really be a call to a sub that
searches an Excel list of about 400 rows, using
Spreadsheet::ParseExcel.

I have already used this sub in other scripts, and it was slow but still
took less than a minute, so I thought it was not the reason for the
slowness here. But if you think Perl can normally process such a script
much faster, then I will have to make a text version of it.
 
David Squire

KaZ said:
Hi,

it looks like this:

Thanks for trying, but if you really want help here, you need to follow
the posting guidelines, and post a small but *complete* script. This is
not a complete script. Where does @a get its values, for example?

Also, you are unlikely to get help until you post something that reports
no problems when both 'use strict' and 'use warnings' are in place.

Still, I am feeling generous this morning...
while (defined($line = <FILEA>)) {

while (my $line = <FILEA>) {

@line = split '\t', $line;
$var0 = @line[0];
.....
$var7 = @line[7];
$nb = @line[8];

my ($var0, ..., $var7, $nb) = split /\t/, $line; # what happens if the number of fields is wrong?

if ( $var3 eq "blah") {
if ((not grep { /string=$nb;/ } @b) && (not grep { /\t$nb\t/ } @a))

index would be faster than regexes here, but I strongly suspect that
hashes would be even better... but we don't know what @a is...

I am thinking along the lines of:

my %OutLinesHash; # replacing your @b
....
# in loop

$OutLinesHash{$nb} = "blah blah blah string=$nb\n";

Then you can replace your grep with "if (exists $OutLinesHash{$nb})". I
can't tell about @a, as you haven't told us about it.

If you need to write out the contents of %OutLinesHash in the order you
read them, you can maintain a parallel list of keys, e.g.

my %OutLinesHash; # replacing your @b
my @KeysInOrder;

....

# in loop

if (! exists $OutLinesHash{$nb}) {
$OutLinesHash{$nb} = "blah blah blah string=$nb\n";
push @KeysInOrder, $nb;
}

# to write it out
foreach my $Key (@KeysInOrder) {
print SOMEWHERE $OutLinesHash{$nb};
}


HTH


DS
 
David Squire

KaZ said:
Thanks for the help

Thank who for what? Please quote context, retaining attributions, when
you reply.

Please read the posting guidelines, posted here several times a week,
and start following them if you wish to continue to receive help from
this group.


DS
 
David Squire

David Squire wrote:

[snip]
# to write it out
foreach my $Key (@KeysInOrder) {
print SOMEWHERE $OutLinesHash{$nb};
}

Whoops. That should be:

# to write it out
foreach my $Key (@KeysInOrder) {
print SOMEWHERE $OutLinesHash{$Key};
}
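
Putting the corrected pieces together, a minimal runnable sketch of the idea might look like this. The list in `@numbers` is invented sample data standing in for the `$nb` values read from file A, and the output goes to STDOUT rather than to a real file B.

```perl
use strict;
use warnings;

# Hypothetical values of $nb as they might come up while reading file A.
my @numbers = (17, 4, 17, 23, 4);

my %OutLinesHash;   # replaces the @b array
my @KeysInOrder;    # remembers the order in which keys were first seen

for my $nb (@numbers) {
    # A hash lookup takes roughly constant time, however many lines are
    # already stored; grep had to scan all of @b on every iteration.
    next if exists $OutLinesHash{$nb};
    $OutLinesHash{$nb} = "blah blah blah string=$nb\n";
    push @KeysInOrder, $nb;
}

# Write the lines out in the order they were first seen.
my $output = join '', map { $OutLinesHash{$_} } @KeysInOrder;
print $output;
```

The duplicates (the second 17 and the second 4) are skipped, and the three remaining lines come out in first-seen order.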
 
Peter J. Holzer

KaZ said:
KaZ said:
--------------------------------------------------------------------------------------------------------
open (FILEA, '<', "./filea.txt")

while (defined($line = <FILEA>)) { [...]
if ( $var3 eq "blah") { [...]
}
--------------------------------------------------------------------------------------------------------

Sorry, I made a mistake:
each "if ($var eq "some_string")" is to be replaced by a sub which
search in an excel list, of about 400 rows, using
Spreadsheet::parseExcel.

I already used this sub in other scripts, and it was slow but still
below 1 minute, so I thought, it was not the reason for the slowness
here.

If you parse an Excel file several times for each line of a 4 MB file,
you are probably parsing it about a hundred thousand times. No wonder
this is slow. You should parse the Excel file once at startup, extract
the information you need and store it in an appropriate Perl data
structure (most likely a hash). Then you can replace parsing your Excel
sheet with a simple hash lookup.

But if you think perl is able to process such a script much
faster normally, then I have to make a text version of it.

Just avoid doing the same thing over and over again if you already know
the result.

hp
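
A rough sketch of the "parse once, then look up" pattern follows. Since the real sub and spreadsheet were never posted, `load_rows` is a hypothetical stand-in for the slow Excel-reading step, and the two-column rows are invented; only the overall shape is the point.

```perl
use strict;
use warnings;

my $parse_count = 0;

# Hypothetical stand-in for the slow step: in the real script this would
# read the roughly 400 rows out of the Excel sheet.
sub load_rows {
    $parse_count++;
    return (
        [ 'blah',  'category 1' ],
        [ 'other', 'category 2' ],
    );
}

# Parse once at startup and index the rows by their first column.
my %category_for;
for my $row (load_rows()) {
    my ($key, $category) = @$row;
    $category_for{$key} = $category;
}

# Inside the per-line loop, every check is now a plain hash lookup.
my @results;
for my $var3 (qw(blah other blah unknown)) {
    push @results,
        exists $category_for{$var3} ? $category_for{$var3} : 'no match';
}

print join(', ', @results), "\n";
print "sheet parsed $parse_count time(s)\n";
```

However many lines the loop processes, the expensive step runs exactly once.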
 
Tad McClellan

KaZ said:
open (FILEA, '<', "./filea.txt")


You should always, yes *always*, check the return value from open():

open (FILEA, '<', './filea.txt') or die "could not open './filea.txt' $!";

@line = split '\t', $line;


A pattern match should *look like* a pattern match:

@line = split /\t/, $line;

$var0 = @line[0];


You should always enable warnings when developing Perl code!
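
With warnings enabled, perl reports "Scalar value @line[0] better written as $line[0]" for that last line. A small sketch that applies the three points above is shown here; the in-memory sample data and the variable names are assumptions made so the example runs without a real filea.txt.

```perl
use strict;
use warnings;

# Hypothetical two-line sample standing in for filea.txt; opening a
# reference to a scalar reads from memory instead of from disk.
my $sample = "a\tb\tc\nd\te\tf\n";

# Always check the return value of open.
open(my $fh, '<', \$sample) or die "could not open in-memory sample: $!";

my @first_fields;
while (my $line = <$fh>) {
    chomp $line;
    my @fields = split /\t/, $line;   # the pattern looks like a pattern
    push @first_fields, $fields[0];   # a single element: $fields[0], not @fields[0]
}
close $fh;

print "@first_fields\n";
```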
 
it_says_BALLS_on_your forehead

A pattern match should *look like* a pattern match:

@line = split /\t/, $line;

I agree with you, Tad. However, From the Programming Perl 3rd ed. book
pg. 63...

PICK YOUR OWN QUOTES

"You can use whichever nonalphanumeric, nonwhitespace delimiter you
like in place of '/'."

An interesting thing about the single quote - it's not supposed to
interpolate, and if the pattern is a variable, it won't. But in the
case of at least tab characters '\t', it does. This behavior is not
consistent with how tabs behave between single quotes with the print
function.
 
Tad McClellan

[ the snipped OP's code was: @line = split '\t', $line; ]

I agree with you, Tad. However, From the Programming Perl 3rd ed. book
pg. 63...

PICK YOUR OWN QUOTES

"You can use whichever nonalphanumeric, nonwhitespace delimiter you
like in place of '/'."


Please don't cite a resource with limited distribution when there
is a widely available resource that says the same thing (perlop.pod).

If "/" is the delimiter then the initial C<m> is optional.
With the C<m> you can use any pair of non-alphanumeric,
non-whitespace characters as delimiters.


But that doesn't apply to the OP's code, because it does not have the C<m>:

@line = split m'\t', $line;

The OP is not supplying a pattern as split's first arg, he is
supplying a string instead (which will then be forced into a pattern
by the DWIMer).

IMHO, the DWIMer is being rather too helpful in the OP's case, which
is why I made my comment in the first place.

The OP's code acts like a pattern match but does not look like a pattern match.

An interesting thing about the single quote - it's not supposed to
interpolate, and if the pattern is a variable, it won't. But in the
case of at least tab characters '\t', it does. This behavior is not
consistent with how tabs behave between single quotes with the print
function.


Yet another reason to make the pattern *look like* a pattern then, yes?
 
it_says_BALLS_on_your forehead

Tad said:
[ the snipped OP's code was: @line = split '\t', $line; ]

I agree with you, Tad. However, From the Programming Perl 3rd ed. book
pg. 63...

PICK YOUR OWN QUOTES

"You can use whichever nonalphanumeric, nonwhitespace delimiter you
like in place of '/'."


Please don't cite a resource with limited distribution when there
is a widely available resource that says the same thing (perlop.pod).

If "/" is the delimiter then the initial C<m> is optional.
With the C<m> you can use any pair of non-alphanumeric,
non-whitespace characters as delimiters.

I see nothing wrong with citing one of the definitive Perl reference
books when I provide the quote.

But that doesn't apply to the OP's code, because it does not have the C<m>:

@line = split m'\t', $line;

The OP is not supplying a pattern as split's first arg, he is
supplying a string instead (which will then be forced into a pattern
by the DWIMer).

That I did not know; interesting.

IMHO, the DWIMer is being rather too helpful in the OP's case, which
is why I made my comment in the first place.

The OP's code acts like a pattern match but does not look like a pattern match.




Yet another reason to make the pattern *look like* a pattern then, yes?

As I stated before, I agree with you.
 
xhoster

it_says_BALLS_on_your forehead said:
I agree with you, Tad. However, From the Programming Perl 3rd ed. book
pg. 63...

PICK YOUR OWN QUOTES

"You can use whichever nonalphanumeric, nonwhitespace delimiter you
like in place of '/'."

Can doesn't mean should. And without a compelling reason, you shouldn't.

An interesting thing about the single quote - it's not supposed to
interpolate, and if the pattern is a variable, it won't. But in the
case of at least tab characters '\t', it does.

No, the single quotes send a literal '\t' into the regex. The regex engine
does the interpolation of \t into a tab.

This behavior is not
consistent with how tabs behave between single quotes with the print
function.

That is because print doesn't interpret strings, it prints them. The regex
engine interprets them.

Xho
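
The distinction can be checked with a short script. This is only an illustrative sketch; the variable names and the sample string are made up.

```perl
use strict;
use warnings;

# In single quotes, '\t' is two characters: a backslash and a 't'.
my $single = '\t';
# In double quotes, "\t" is one character: a tab.
my $double = "\t";
print length($single), " ", length($double), "\n";

# split treats its first argument as a pattern, so the regex engine
# receives backslash-t and reads it as "match a tab".
my @fields = split '\t', "a\tb\tc";
print scalar(@fields), " fields: @fields\n";
```

The string lengths differ (2 versus 1), yet the split still breaks the sample into three fields, because it is the regex engine, not the quoting, that turns `\t` into a tab.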
 
Tad McClellan

I see nothing wrong with citing one of the definitive Perl reference
books when I provide the quote.


I see at least 2 reasons.

The primary reason is that everybody has the docs that come with perl
and not everybody has bought the Camel book, so more people can
participate freely, which seems desirable and open-sourcey.

The 2nd reason is that those who haven't paid O'Reilly cannot go
see the context of the quote. There is nothing in the quote that
indicates that it is talking about m// delimiters, it might be
talking about qq// delimiters for all we can tell.
 
it_says_BALLS_on_your_forehead

Tad said:
I see at least 2 reasons.

The primary reason is that everybody has the docs that come with perl
and not everybody has bought the Camel book, so more people can
participate freely, which seems desirable and open-sourcey.

The 2nd reason is that those who haven't paid O'Reilly cannot go
see the context of the quote. There is nothing in the quote that
indicates that it is talking about m// delimiters, it might be
talking about qq// delimiters for all we can tell.

You're certainly tenacious. Very well, I acquiesce :).
 
KaZ

Brian said:
As David has said, you need to show us some code.

Usually replacing regexes with index (where it can be used) will give a big
performance increase. I doubt it needs to take 10 minutes, or even 10
seconds.

Hello,

I changed grep for index, but it didn't speed things up (it actually takes
exactly the same amount of time). Maybe because they weren't real regexes,
only strings like /\t$var\t/

Using a hash instead of reading the Excel file every time made a
noticeable speed increase, but it still takes more than 5 minutes.

Anyway, thanks a lot for this advice.
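
The remaining time may well come from the greps over @a and @b, which rescan both arrays for every input line. The hash idea suggested earlier in the thread for @b could plausibly be extended to @a, as in the sketch below. It rests on an assumption the thread never confirms: that /\t$nb\t/ is meant to ask whether $nb occurs as a complete field with a tab on both sides somewhere in file A. The sample lines and the name `%interior_field` are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical sample standing in for the lines of file A.
my @a = (
    "x\t10\ty\n",
    "x\t20\ty\n",
);

# One pass over file A: record every field that has a tab on both sides.
# Assumption: this is what the /\t$nb\t/ search was meant to test.
my %interior_field;
for my $line (@a) {
    chomp(my $copy = $line);
    my @fields = split /\t/, $copy, -1;
    # Skip the first and last fields, which have a tab on only one side.
    $interior_field{$_} = 1 for @fields[1 .. $#fields - 1];
}

# Each later check is a single hash lookup instead of a grep over @a.
for my $nb (10, 30) {
    my $status = exists $interior_field{$nb} ? 'present' : 'absent';
    print "$nb: $status\n";
}
```

Building the index costs one pass over the file; after that, each check takes about the same time no matter how large file A grows.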
 
