Reading huge *.txt files?

Math55

Hi, is there a possibility of reading large (>1 MB) *.txt files in a fast
way? Every time I do that, my program freezes or takes very long to
finish. Anyone have an idea?

THANKS:)
 
Helgi Briem

Hi, is there a possibility of reading large (>1 MB) *.txt files in a fast
way? Every time I do that, my program freezes or takes very long to
finish. Anyone have an idea?

1 MB is large? What are you using, a BBC Commodore?

Anyway,

#!perl
use warnings;
use strict;
my $in = '/full/path/to/bigtextfile';
open IN, $in or die "Cannot open $in:$!\n";
while (<IN>)
{
    # do stuff with the current line ($_)
}
close IN;
__END__
 
Jesper

I have the same problem.
I have two txt files (txta and txtb), each with 15,000-20,000 lines. Each line
contains a username + info. Now what I want to do is take a line from txta,
retrieve the username from this line, search for the username in txtb and
return the info that's related to the username in txtb. My code looks like:

open (txta, "txta");
@txta= <txta>;
close(txta);

open (txtb, "txtb");
@txtb= <txtb>;
close(txtb);

foreach $txta_line (@txta)
{
# starts by getting the userid
# ... (code that sets $userid) ...

--------------------------------------
foreach $txtb_line (@txtb)
{
my (@array) = $txtb_line =~ m!(.*);(.*)!;
if (@array[0]=~m!^$userid$!) { $userinfo = @array[1]; }
}
--------------------------------------
$userinfo = '';

#uses the $userinfo
}


The code between the lines takes a really long time (15,000 lines are a lot
of lines :)) - is there any way to optimize the code?

Regards,

Jesper

BTW, I'm using ActiveState Perl 5.6.1.
 
Tad McClellan

Math55 said:
Hi, is there a possibility of reading large (>1 MB) *.txt files in a fast
way? Every time I do that, my program freezes or takes very long to
finish. Anyone have an idea?


You have an error on line 17.
 
Peter Hickman

Jesper said:
--------------------------------------
foreach $txtb_line (@txtb)
{
my (@array) = $txtb_line =~ m!(.*);(.*)!;
if (@array[0]=~m!^$userid$!) { $userinfo = @array[1]; }
}
--------------------------------------
$userinfo = '';

You do realise that this line has just lost all the information you put in
$userinfo three lines previously?
#uses the $userinfo
}


The code between the lines takes a really long time (15,000 lines are a lot
of lines :)) - is there any way to optimize the code?

The brain-dead fix is this...

foreach $txtb_line (@txtb)
{
    my (@array) = $txtb_line =~ m!(.*);(.*)!;
    if (@array[0]=~m!^$userid$!) {
        $userinfo = @array[1];
        last;
    }
}

That will quit the inner loop once a match is found.

However this is still not the way, even if TIMTOWTDI.

Before you read txta, read txtb into a hash:

foreach $txtb_line (@txtb)
{
    my (@array) = $txtb_line =~ m!(.*);(.*)!;
    $lookup{$array[0]} = $array[1];
}

Then, where you had the inner loop:

if ($lookup{$userid}) {
    $userinfo = $lookup{$userid};
} else {
    $userinfo = '';
}

It doesn't get much faster than that.
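
Assembled, the whole approach looks something like this (a sketch, continuing
from Jesper's already-slurped @txta and @txtb; the userid extraction pattern
is a placeholder, and exists is used so an info value of '0' still counts as
found):

my %lookup;

# build the lookup table once from @txtb
foreach my $txtb_line (@txtb)
{
    my ($user, $info) = $txtb_line =~ m!(.*);(.*)!;
    $lookup{$user} = $info if defined $user;
}

# each lookup is now a single hash access instead of a 15000-line scan
foreach my $txta_line (@txta)
{
    # placeholder: however you extract the userid from the line
    my ($userid) = $txta_line =~ m!^([^;]+)!;

    my $userinfo = (defined $userid && exists $lookup{$userid})
                   ? $lookup{$userid}
                   : '';

    # use $userinfo here
}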
 
Dominik Seelow

Jesper wrote:

Hello Jesper.
I have the same problem.
I have two txt files (txta and txtb), each with 15,000-20,000 lines. Each line
contains a username + info. Now what I want to do is take a line from txta,
retrieve the username from this line, search for the username in txtb and
return the info that's related to the username in txtb. My code looks like:

open (txta, "txta");
@txta= <txta>;
close(txta);

Reading the whole file at once is not a good idea, especially if the
files are large.

open (txtb, "txtb");
@txtb= <txtb>;
close(txtb);

foreach $txta_line (@txta)
{
# starts by getting the userid
# ... (code that sets $userid) ...

--------------------------------------
foreach $txtb_line (@txtb)
{
my (@array) = $txtb_line =~ m!(.*);(.*)!;
if (@array[0]=~m!^$userid$!) { $userinfo = @array[1]; }
}
--------------------------------------
$userinfo = '';

#uses the $userinfo
}


The code between the lines takes a really long time (15,000 lines are a lot
of lines :)) - is there any way to optimize the code?

Read the files line-wise. If you need to join them, read the first file
(apparently 'txtb') line by line and put everything you need into a hash.
Then read the second file (apparently 'txta') and use the hash to join the
information. Try to use other names for your files, too. A sketch follows
below.


HTH,
Dominik
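
A line-wise version of that join might look something like this (a sketch:
the file names and the 'username;info' format come from Jesper's post, the
userid extraction is a placeholder, and lexical filehandles avoid reusing
the file names as handle names):

#!perl
use strict;
use warnings;

my %info_for;

# first pass: build the username => info hash from txtb, one line at a time
open my $fh_b, '<', 'txtb' or die "Cannot open txtb: $!\n";
while (my $line = <$fh_b>)
{
    chomp $line;
    my ($user, $info) = $line =~ m!(.*);(.*)!;
    $info_for{$user} = $info if defined $user;
}
close $fh_b;

# second pass: walk txta and join via the hash
open my $fh_a, '<', 'txta' or die "Cannot open txta: $!\n";
while (my $line = <$fh_a>)
{
    chomp $line;
    my ($userid) = $line =~ m!^([^;]+)!;   # placeholder
    next unless defined $userid;
    my $userinfo = exists $info_for{$userid} ? $info_for{$userid} : '';
    # use $userinfo here
}
close $fh_a;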
 
Tulan W. Hu

Math55 said:
Hi, is there a possibility of reading large (>1 MB) *.txt files in a fast
way? Every time I do that, my program freezes or takes very long to
finish. Anyone have an idea?

THANKS:)

Upgrade your Perl to 5.8.1 and use Tie::File.
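
For the original question, that might look like this (a sketch; the file
path is a placeholder):

#!perl
use strict;
use warnings;
use Tie::File;

# each element of @lines is one line of the file, fetched from disk
# on demand instead of slurped all at once (Tie::File chomps the
# newlines off by default)
tie my @lines, 'Tie::File', '/full/path/to/bigtextfile'
    or die "Cannot tie file: $!\n";

for my $line (@lines)
{
    # do stuff with $line
}

untie @lines;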
 
Peter Hickman

Tulan said:
Upgrade your Perl to 5.8.1 and use Tie::File.

To be honest, this is not good advice. His code is grossly inefficient;
most of the improvement will come from a better design, not from using a
module to implement a bad design.
 
Jesper

However this is still not the way, even if TIMTOWTDI.
Before you read txta, read txtb into a hash:

foreach $txtb_line (@txtb)
{
    my (@array) = $txtb_line =~ m!(.*);(.*)!;
    $lookup{$array[0]} = $array[1];
}

Then, where you had the inner loop:

if ($lookup{$userid}) {
    $userinfo = $lookup{$userid};
} else {
    $userinfo = '';
}

It doesn't get much faster than that.

Thanks, that works pretty darn nicely - it takes about 6 secs to do
240,000,000 comparisons :)




Regards

Jesper
 
Eric J. Roode


I have the same problem.
I have two txt files (txta and txtb), each with 15,000-20,000 lines. Each
line contains a username + info. Now what I want to do is take a line
from txta, retrieve the username from this line, search for the
username in txtb and return the info that's related to the username in
txtb. My code looks like:

open (txta, "txta");
@txta= <txta>;
close(txta);

open (txtb, "txtb");
@txtb= <txtb>;
close(txtb);

foreach $txta_line (@txta)
{

[Eric gets out the Holy Baseball Bat of Education]

Do NOT [whap!] read [whap!] an entire *file* [whap!] into an *array*
[whap!] just so you can [whap!] loop [whap!] over it!!! [whap-whap-whap!]

[Eric puts the baseball bat away]

Okay, now that we're in the proper frame of mind for some education:

One: You read an entire file (txta) into an array, then apparently do
nothing more than loop over that array. This is Bad. [don't make me get
the baseball bat out again.] It's a waste of memory, and it doesn't buy
you anything. It's foolish. It means that you copied a bad habit from
some bad book or bad instructor.

Two: You read another entire file (txtb) into memory, and repeatedly loop
over it in search of user ids. In this case, it's not entirely hopeless
that you slurped the entire file into memory, because you're referencing
it frequently, but there's a better way. Read through the txtb file, put
the userids and their info into a hash, and reference that hash while
going through txta. That way you only have to scan file txtb once.

Something like this:

my %uid_lookup;

while (<txtb>)
{
    chomp;   # (?)
    $uid_lookup{$1} = $2 if /(.*);(.*)/;
}

while (<txta>)
{
    # ... (code that sets $userid) ...
    $userinfo = $uid_lookup{$userid};
    ...
}


There. One pass per file. No huge arrays in memory (although, who knows
how large %uid_lookup can get?).

--
Eric
$_ = reverse sort $ /. r , qw p ekca lre uJ reh
ts p , map $ _. $ " , qw e p h tona e and print

 
Chris Mattern

Math55 said:
Hi, is there a possibility of reading large (>1 MB) *.txt files in a fast
way? Every time I do that, my program freezes or takes very long to
finish. Anyone have an idea?

Specifics. Specifics are good. Specifics want to be your friend.
How fast is "in a fast way"? How long is "very long"? And most of
all--what "that" are you doing?

Chris Mattern
 
Chris Mattern

Peter said:
To be honest, this is not good advice. His code is grossly inefficient;
most of the improvement will come from a better design, not from using a
module to implement a bad design.

Actually, we don't know if it's good advice or not. The OP hasn't
given us any code to look at. The code we've seen in this thread
came from another newbie who chimed in about having the same problem,
except he actually provided people with code to fix.

Chris Mattern
 
Tintin

Jesper said:
I have the same problem.
I have two txt files (txta and txtb), each with 15,000-20,000 lines. Each line
contains a username + info. Now what I want to do is take a line from txta,
retrieve the username from this line, search for the username in txtb and
return the info that's related to the username in txtb. My code looks like:

open (txta, "txta");
@txta= <txta>;
close(txta);

open (txtb, "txtb");
@txtb= <txtb>;
close(txtb);

[snipped rest of code]

I don't want to be critical of your code (Eric did a good, and justified,
job of that).

What I really want to know (and there was a long thread about this not too
long ago), is what thought processes led you to think that slurping a large
file into an array was a good thing to do (for this particular case)?
 
