Scalable method for searching in relatively big files

KaZ · Jun 28, 2006

Hi,

I want to do this:

I have a file A (the file size can be up to around 40MB, I know it
isn't that big), and I have to read it one line after the other (I use
a while loop for this), and possibly print something into file B,
depending on what is in the line. In each iteration, I have to search
file A for a pattern and file B to check what has already been
written to it.

Right now, I do the searches with arrays and grep. I put the whole
file A in an array (@a), file B is an empty array (@b) at
the beginning, and at the end I write the array @b into file B. For the
while loop I read each line directly from file A, because I feared
it would interfere with the search if I also used the array @a for the
while loop.

File A is for the moment only 4MB, and the script takes 10 minutes
to complete. So it isn't a scalable solution in my opinion.

I was told to use a database or possibly tied hashes, in order to get
more scalability.

Can anybody tell me a bit more about these two methods? I read what a
hash is, but fail to see how it could help me.

Greetings,

David Squire · Jun 28, 2006

Show us your script, and someone here will be able to help. Without
knowing exactly what kind of search (exact match? regex? whole line?)
you are doing, it is impossible to recommend much.

DS

Brian Wakem · Jun 28, 2006

As David has said, you need to show us some code.

Usually replacing regexes with index (where it can be used) will give a big
performance increase. I doubt it needs to take 10 minutes, or even 10
seconds.
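
To make the index suggestion concrete, here is a minimal sketch of the two ways to test for a fixed, tab-delimited substring. The helper names and the sample line are made up for illustration; they are not from the original script.

```perl
use strict;
use warnings;

# Hypothetical helpers: both report whether $needle occurs in $line as a
# field with a tab on each side.
sub has_field_regex {
    my ($line, $needle) = @_;
    # \Q...\E stops regex metacharacters in $needle from being interpreted.
    return $line =~ /\t\Q$needle\E\t/ ? 1 : 0;
}

sub has_field_index {
    my ($line, $needle) = @_;
    # index returns -1 when the substring is absent.
    return index($line, "\t$needle\t") >= 0 ? 1 : 0;
}

my $line = "a\tb\t42\tc\n";
print has_field_regex($line, '42'), has_field_index($line, '42'), "\n";
```

Both versions still scan the string from the start, so neither changes how many lines have to be examined per search; only the per-line cost differs.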

KaZ · Jun 28, 2006

David said:
Show us your script, and someone here will be able to help. Without
knowing exactly what kind of search (exact match? regex? whole line?)
you are doing, it is impossible to recommend much.

DS

Hi,

it looks like this:

--------------------------------------------------------------------------------------------------------
open (FILEA, '<', "./filea.txt")

while (defined($line = <FILEA>)) {
@line = split '\t', $line;
$var0 = @line[0];
......
$var7 = @line[7];
$nb = @line[8];

if ( $var3 eq "blah") {
if ((not grep { /string=$nb;/ } @b) && (not grep { /\t$nb\t/ } @a))
{
push @b, "blah blah blah string=$nb\n";
}
}
elsif ( $var3 eq "other") {
# something similar to the preceding if....
}
elsif ($var6 eq "something") {
# something similar to preceding elsif ....
}

}

open (FILEB, '>', "./path-to-the-file.txt") or die "Can't open path-to-the-file: $!";
print FILEB @b;
close FILEB;
--------------------------------------------------------------------------------------------------------

I hope it is enough. I can post the complete script, but I have to
"anonymize" it a bit before, I'm sorry.

Thanks for the answer,
kaz

KaZ · Jun 28, 2006

Sorry, I made a mistake:
each "if ($var eq "some_string")" is to be replaced by a sub which
searches an Excel list of about 400 rows, using
Spreadsheet::ParseExcel.

I already used this sub in other scripts, and it was slow but still
below 1 minute, so I thought it was not the reason for the slowness
here. But if you think perl is able to process such a script much
faster normally, then I have to make a text version of it.

David Squire · Jun 28, 2006

KaZ said:
Hi,

it looks like this:

Thanks for trying, but if you really want help here, you need to follow
the posting guidelines, and post a small but *complete* script. This is
not a complete script. Where does @a get its values, for example?

Also, you are unlikely to get help until you post something that reports
no problems when both 'use strict' and 'use warnings' are in place.

Still, I am feeling generous this morning...

while (defined($line = <FILEA>)) {

while (my $line = <FILEA>) {
@line = split '\t', $line;
$var0 = @line[0];
.....
$var7 = @line[7];
$nb = @line[8];

my ($var0, ..., $var7, $nb) = split /\t/, $line;
# what happens if the number of fields is wrong?

if ( $var3 eq "blah") {
if ((not grep { /string=$nb;/ } @b) && (not grep { /\t$nb\t/ } @a))

index would be faster than regexes here, but I strongly suspect that
hashes would be even better... but we don't know what @a is...

I am thinking along the lines of:

my %OutLinesHash; # replacing your @b
....
# in loop

$OutLinesHash{$nb} = "blah blah blah string=$nb\n";

Then you can replace your grep with "if (exists $OutLinesHash{$nb})". I
can't tell about @a, as you haven't told us about it.

If you need to write out the contents of %OutLinesHash in the order you
read them, you can maintain a parallel list of keys, e.g.

my %OutLinesHash; # replacing your @b
my @KeysInOrder;

....

# in loop

if (! exists $OutLinesHash{$nb}) {
$OutLinesHash{$nb} = "blah blah blah string=$nb\n";
push @KeysInOrder, $nb;
}

# to write it out
foreach my $Key (@KeysInOrder) {
print SOMEWHERE $OutLinesHash{$nb};
}

HTH

DS

KaZ · Jun 28, 2006

Thanks for the help, I understood the way I can use the hash.

Regards,

David Squire · Jun 28, 2006

KaZ said:
Thanks for the help

Thank who for what? Please quote context, retaining attributions, when
you reply.

Please read the posting guidelines, posted here several times a week,
and start following them if you wish to continue to receive help from
this group.

DS

David Squire · Jun 28, 2006

David Squire wrote:

[snip]

# to write it out
foreach my $Key (@KeysInOrder) {
print SOMEWHERE $OutLinesHash{$nb};
}

Whoops. That should be:

# to write it out
foreach my $Key (@KeysInOrder) {
print SOMEWHERE $OutLinesHash{$Key};
}
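
Putting the corrected pieces together, a self-contained sketch of the hash-plus-ordered-keys idea might look like this. The sub name build_output and the sample numbers are hypothetical; the hash and array names follow the snippets above.

```perl
use strict;
use warnings;

# Hypothetical helper: takes the $nb values in the order they are read and
# returns the output lines, one per distinct $nb, in first-seen order.
sub build_output {
    my @numbers = @_;
    my %OutLinesHash;   # replaces the grep over @b
    my @KeysInOrder;    # remembers the order in which keys were first seen

    foreach my $nb (@numbers) {
        next if exists $OutLinesHash{$nb};   # hash lookup instead of a scan
        $OutLinesHash{$nb} = "blah blah blah string=$nb\n";
        push @KeysInOrder, $nb;
    }
    return map { $OutLinesHash{$_} } @KeysInOrder;
}

print build_output(7, 3, 7, 9, 3);
```

The point of the design is that the membership test no longer walks everything written so far, so the cost per input line stays roughly constant as the output grows.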

Peter J. Holzer · Jun 28, 2006

KaZ said:
KaZ said:

--------------------------------------------------------------------------------------------------------
open (FILEA, '<', "./filea.txt")

while (defined($line = <FILEA>)) { [...]
if ( $var3 eq "blah") { [...]
}
--------------------------------------------------------------------------------------------------------

Sorry, I made a mistake:
each "if ($var eq "some_string")" is to be replaced by a sub which
searches an Excel list of about 400 rows, using
Spreadsheet::ParseExcel.

I already used this sub in other scripts, and it was slow but still
below 1 minute, so I thought, it was not the reason for the slowness
here.

If you parse an Excel file several times for each line of a 4 MB file,
you are probably parsing it about a hundred thousand times. No wonder
this is slow. You should parse the excel file once at startup, extract
the information you need and store it in an appropriate perl data
structure (most likely a hash). Then you can replace parsing your excel
sheet with a simple hash lookup.

But if you think perl is able to process such a script much
faster normally, then I have to make a text version of it.

Just avoid doing the same thing over and over again if you already know
the result.

hp
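
A rough sketch of the "parse once, look up many times" pattern follows. The sub parse_rows is only a stand-in with made-up rows so the example runs without a spreadsheet; in the real script that step would be the single Spreadsheet::ParseExcel pass at startup, and the column layout is an assumption.

```perl
use strict;
use warnings;

my $parse_count = 0;   # counts how often the expensive step runs

# Stand-in for the expensive spreadsheet parse (hypothetical data; a real
# script would read these rows from the Excel file, once).
sub parse_rows {
    $parse_count++;
    return ( [ 'blah', 'type1' ], [ 'other', 'type2' ] );
}

# Build the lookup hash once at startup: first column => second column.
my %lookup;
for my $row (parse_rows()) {
    $lookup{ $row->[0] } = $row->[1];
}

# Inside the per-line loop, a lookup no longer touches the spreadsheet.
sub classify {
    my ($value) = @_;
    return $lookup{$value};   # undef if the value is not listed
}

for my $value (qw(blah other blah missing)) {
    my $type = classify($value);
    print "$value => ", defined $type ? $type : '(not listed)', "\n";
}
print "parsed $parse_count time(s)\n";
```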

Tad McClellan · Jun 28, 2006

KaZ said:
open (FILEA, '<', "./filea.txt")

You should always, yes *always*, check the return value from open():

open (FILEA, '<', './filea.txt') or die "could not open './filea.txt' $!";

@line = split '\t', $line;

A pattern match should *look like* a pattern match:

@line = split /\t/, $line;

$var0 = @line[0];

You should always enable warnings when developing Perl code!
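
For illustration, the three points above can be combined into one small sketch. The sample file contents and the sub name first_fields are made up; a lexical filehandle is used here instead of the bareword FILEA, which is a stylistic choice and not something the original code did.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small tab-separated sample file (hypothetical data) so the
# sketch runs on its own.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
print {$tmp_fh} "x0\tx1\tx2\n";
close $tmp_fh or die "could not close '$path': $!";

# Return the first tab-separated field of every line in $file.
sub first_fields {
    my ($file) = @_;
    open my $fh, '<', $file or die "could not open '$file': $!";
    my @firsts;
    while (my $line = <$fh>) {
        chomp $line;
        my @fields = split /\t/, $line;   # a pattern that looks like one
        push @firsts, $fields[0];         # $fields[0], not @fields[0]
    }
    close $fh;
    return @firsts;
}

print join(',', first_fields($path)), "\n";
```

With warnings enabled, the original "$var0 = @line[0];" draws a "Scalar value @line[0] better written as $line[0]" warning, which is the hint Tad is pointing at.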

it_says_BALLS_on_your forehead · Jun 28, 2006

A pattern match should *look like* a pattern match:

@line = split /\t/, $line;

I agree with you, Tad. However, from the Programming Perl 3rd ed. book
pg. 63...

PICK YOUR OWN QUOTES

"You can use whichever nonalphanumeric, nonwhitespace delimiter you
like in place of '/'."

An interesting thing about the single quote - it's not supposed to
interpolate, and if the pattern is a variable, it won't. But in the
case of at least tab characters '\t', it does. This behavior is not
consistent with how tabs behave between single quotes with the print
function.

Tad McClellan · Jun 28, 2006

[ the snipped OP's code was: @line = split '\t', $line; ]

I agree with you, Tad. However, From the Programming Perl 3rd ed. book
pg. 63...

PICK YOUR OWN QUOTES

"You can use whichever nonalphanumeric, nonwhitespace delimiter you
like in place of '/'."

Please don't cite a resource with limited distribution when there
is a widely available resource that says the same thing (perlop.pod).

If "/" is the delimiter then the initial C<m> is optional.
With the C<m> you can use any pair of non-alphanumeric,
non-whitespace characters as delimiters.

But that doesn't apply to the OP's code, because it does not have the C<m>:

@line = split m'\t', $line;

The OP is not supplying a pattern as split's first arg, he is
supplying a string instead (which will then be forced into a pattern
by the DWIMer).

IMHO, the DWIMer is being rather too helpful in the OP's case, which
is why I made my comment in the first place.

The OP's code acts like a pattern match but does not look like a pattern match.

An interesting thing about the single quote - it's not supposed to
interpolate, and if the pattern is a variable, it won't. But in the
case of at least tab characters '\t', it does. This behavior is not
consistent with how tabs behave between single quotes with the print
function.

Yet another reason to make the pattern *look like* a pattern then, yes?

it_says_BALLS_on_your forehead · Jun 28, 2006

Tad said:
[ the snipped OP's code was: @line = split '\t', $line; ]

I agree with you, Tad. However, From the Programming Perl 3rd ed. book
pg. 63...

PICK YOUR OWN QUOTES

"You can use whichever nonalphanumeric, nonwhitespace delimiter you
like in place of '/'."

Please don't cite a resource with limited distribution when there
is a widely available resource that says the same thing (perlop.pod).

If "/" is the delimiter then the initial C<m> is optional.
With the C<m> you can use any pair of non-alphanumeric,
non-whitespace characters as delimiters.

I see nothing wrong with citing one of the definitive Perl reference
books when I provide the quote.

But that doesn't apply to the OP's code, because it does not have the C<m>:

@line = split m'\t', $line;

The OP is not supplying a pattern as split's first arg, he is
supplying a string instead (which will then be forced into a pattern
by the DWIMer).

That I did not know; interesting.

IMHO, the DWIMer is being rather too helpful in the OP's case, which
is why I made my comment in the first place.

The OP's code acts like a pattern match but does not look like a pattern match.

Yet another reason to make the pattern *look like* a pattern then, yes?

As I stated before, I agree with you.

xhoster · Jun 28, 2006

it_says_BALLS_on_your forehead said:
I agree with you, Tad. However, From the Programming Perl 3rd ed. book
pg. 63...

PICK YOUR OWN QUOTES

"You can use whichever nonalphanumeric, nonwhitespace delimiter you
like in place of '/'."

Can doesn't mean should. And without a compelling reason, you shouldn't.

An interesting thing about the single quote - it's not supposed to
interpolate, and if the pattern is a variable, it won't. But in the
case of at least tab characters '\t', it does.

No, the single quotes send a literal '\t' into the regex. The regex engine
does the interpolation of \t into a tab.

This behavior is not
consistent with how tabs behave between single quotes with the print
function.

That is because print doesn't interpret strings, it prints them. The regex
engine interprets them.

Xho
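
A short sketch that shows the distinction Xho describes: the single-quoted string really is two characters, and it only becomes a tab once split hands it to the regex engine. The variable names are arbitrary.

```perl
use strict;
use warnings;

my $pattern = '\t';                 # single quotes: a backslash and a 't'
print length($pattern), "\n";       # two characters, not one tab

# split compiles the string as a regex, and the regex engine reads \t as a tab.
my @fields = split $pattern, "a\tb\tc";
print scalar(@fields), "\n";        # the string was split on real tabs

# print does no such interpretation: the two characters come out as they are.
print '\t', "\n";
```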

Tad McClellan · Jun 28, 2006

I see nothing wrong with citing one of the definitive Perl reference
books when I provide the quote.

I see at least 2 reasons.

The primary reason is that everybody has the docs that come with perl
and not everybody has bought the Camel book, so more people can
participate freely, which seems desirable and open-sourcey.

The 2nd reason is that those who haven't paid O'Reilly cannot go
see the context of the quote. There is nothing in the quote that
indicates that it is talking about m// delimiters, it might be
talking about qq// delimiters for all we can tell.

it_says_BALLS_on_your_forehead · Jun 29, 2006

Tad said:
I see at least 2 reasons.

The primary reason is that everybody has the docs that come with perl
and not everybody has bought the Camel book, so more people can
participate freely, which seems desirable and open-sourcey.

The 2nd reason is that those who haven't paid O'Reilly cannot go
see the context of the quote. There is nothing in the quote that
indicates that it is talking about m// delimiters, it might be
talking about qq// delimiters for all we can tell.

You're certainly tenacious. Very well, I acquiesce.

KaZ · Jun 29, 2006

Brian said:
As David has said, you need to show us some code.

Usually replacing regexs with index (where it can be used) will give a big
performance increase. I doubt it needs to take 10 minutes, or even 10
seconds.

Hello,

I changed grep for index, but it didn't speed things up (it is actually
taking exactly the same amount of time). Maybe because they weren't real
regexes, only strings like /\t$var\t/

Using a hash instead of reading the Excel file every time made a
noticeable speed increase, but it still takes more than 5 min.

Anyway, thanks a lot for this advice.
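
One thing the thread leaves open is the remaining search over @a, which still scans all of file A for every line read. A possible sketch, under the assumption that the test /\t$nb\t/ only ever needs exact matches on fields that have a tab on both sides, is to index those fields once before the main loop. The sub name and sample lines are hypothetical, and whether this matches what the real script needs cannot be told from the posted fragment.

```perl
use strict;
use warnings;

# Hypothetical helper: build a hash of every field that has a tab on both
# sides, i.e. every field except the first and the last of each line.
sub index_interior_fields {
    my @lines = @_;
    my %seen;
    for my $line (@lines) {
        chomp(my $copy = $line);
        my @fields = split /\t/, $copy, -1;   # -1 keeps trailing empty fields
        # Interior fields only: /\t$nb\t/ needs a tab before and after.
        $seen{ $fields[$_] }++ for 1 .. $#fields - 1;
    }
    return \%seen;
}

my $seen = index_interior_fields("a\tb\tc\n", "d\te\n");
print exists $seen->{b} ? "b is an interior field\n" : "b not found\n";
```

Inside the loop, "not grep { /\t$nb\t/ } @a" would then become "not exists $seen->{$nb}", which is a single lookup instead of a pass over the whole array.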
