Reading huge *.txt files?

Math55

Hi, is there a possibility of reading large (>1 MB) *.txt files in a fast
way? Every time I do that, my program freezes or takes very long to
finish. Anyone have an idea?

THANKS:)
 
Helgi Briem

Hi, is there a possibility of reading large (>1 MB) *.txt files in a fast
way? Every time I do that, my program freezes or takes very long to
finish. Anyone have an idea?

1 MB is large? What are you using, a BBC Commodore?

Anyway,

#!perl
use warnings;
use strict;
my $in = '/full/path/to/bigtextfile';
open IN, $in or die "Cannot open $in:$!\n";
while (<IN>)
{
    # do stuff with the current line ($_)
}
close IN;
__END__
 
Jesper

I have the same problem.
I have two txt files (txta and txtb), each with 15,000-20,000 lines. Each line
contains a username + info. Now what I want to do is take a line from txta,
retrieve the username from this line, search for the username in txtb and
return the info that's related to the username in txtb. My code looks like:

open (txta, "txta");
@txta= <txta>;
close(txta);

open (txtb, "txtb");
@txtb= <txtb>;
close(txtb);

foreach $txta_line (@txta)
{
# starts by getting the userid
# ... (code that sets $userid) ...

--------------------------------------
foreach $txtb_line (@txtb)
{
my (@array) = $txtb_line =~ m!(.*);(.*)!;
if (@array[0]=~m!^$userid$!) { $userinfo = @array[1]; }
}
--------------------------------------
$userinfo = '';

#uses the $userinfo
}


The code between the lines takes a really long time (15,000 lines are a lot
of lines :)) - is there any way to optimize the code?

Regards,

Jesper

BTW, I'm using ActiveState Perl 5.6.1.
 
Tad McClellan

Math55 said:
Hi, is there a possibility of reading large (>1 MB) *.txt files in a fast
way? Every time I do that, my program freezes or takes very long to
finish. Anyone have an idea?


You have an error on line 17.
 
Peter Hickman

Jesper said:
--------------------------------------
foreach $txtb_line (@txtb)
{
my (@array) = $txtb_line =~ m!(.*);(.*)!;
if (@array[0]=~m!^$userid$!) { $userinfo = @array[1]; }
}
--------------------------------------
$userinfo = '';

You do realise that this line has just lost all the information you put in
$userinfo three lines previously?
#uses the $userinfo
}


The code between the lines takes a really long time (15,000 lines are a lot
of lines :)) - is there any way to optimize the code?

The brain-dead fix is this...

foreach $txtb_line (@txtb)
{
    my (@array) = $txtb_line =~ m!(.*);(.*)!;
    if (@array[0]=~m!^$userid$!) {
        $userinfo = @array[1];
        last;
    }
}

That will quit the inner loop once a match is found.

However this is still not the way, even if TIMTOWTDI.

Before you read txta, read txtb into a hash:

foreach $txtb_line (@txtb)
{
    my (@array) = $txtb_line =~ m!(.*);(.*)!;
    $lookup{$array[0]} = $array[1];
}

Then, where you had the inner loop:

if ($lookup{$userid}) {
    $userinfo = $lookup{$userid};
} else {
    $userinfo = '';
}

It doesn't get much faster than that.
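
Assembled, the whole approach looks something like this (a sketch, continuing
from Jesper's already-slurped @txta and @txtb; the userid extraction pattern
is a placeholder, and exists is used so an info value of '0' still counts as
found):

my %lookup;

# build the lookup table once from @txtb
foreach my $txtb_line (@txtb)
{
    my ($user, $info) = $txtb_line =~ m!(.*);(.*)!;
    $lookup{$user} = $info if defined $user;
}

# each lookup is now a single hash access instead of a 15000-line scan
foreach my $txta_line (@txta)
{
    # placeholder: however you extract the userid from the line
    my ($userid) = $txta_line =~ m!^([^;]+)!;

    my $userinfo = (defined $userid && exists $lookup{$userid})
                   ? $lookup{$userid}
                   : '';

    # use $userinfo here
}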
 
Dominik Seelow

Jesper wrote:

Hello Jesper.
I have the same problem.
I have two txt files (txta and txtb), each with 15,000-20,000 lines. Each line
contains a username + info. Now what I want to do is take a line from txta,
retrieve the username from this line, search for the username in txtb and
return the info that's related to the username in txtb. My code looks like:

open (txta, "txta");
@txta= <txta>;
close(txta);

Reading the whole file at once is not a good idea, especially if the
files are large.

open (txtb, "txtb");
@txtb= <txtb>;
close(txtb);

foreach $txta_line (@txta)
{
# starts by getting the userid
# ... (code that sets $userid) ...

--------------------------------------
foreach $txtb_line (@txtb)
{
my (@array) = $txtb_line =~ m!(.*);(.*)!;
if (@array[0]=~m!^$userid$!) { $userinfo = @array[1]; }
}
--------------------------------------
$userinfo = '';

#uses the $userinfo
}


The code between the lines takes a really long time (15,000 lines are a lot
of lines :)) - is there any way to optimize the code?

Read the files line-wise. If you need to join them, read the first file
(apparently 'txtb') line by line and put everything you need into a hash.
Then read the second file (apparently 'txta') and use the hash to join the
information. Try to use other names for your files, too. A sketch follows
below.


HTH,
Dominik
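
A line-wise version of that join might look something like this (a sketch:
the file names and the 'username;info' format come from Jesper's post, the
userid extraction is a placeholder, and lexical filehandles avoid reusing
the file names as handle names):

#!perl
use strict;
use warnings;

my %info_for;

# first pass: build the username => info hash from txtb, one line at a time
open my $fh_b, '<', 'txtb' or die "Cannot open txtb: $!\n";
while (my $line = <$fh_b>)
{
    chomp $line;
    my ($user, $info) = $line =~ m!(.*);(.*)!;
    $info_for{$user} = $info if defined $user;
}
close $fh_b;

# second pass: walk txta and join via the hash
open my $fh_a, '<', 'txta' or die "Cannot open txta: $!\n";
while (my $line = <$fh_a>)
{
    chomp $line;
    my ($userid) = $line =~ m!^([^;]+)!;   # placeholder
    next unless defined $userid;
    my $userinfo = exists $info_for{$userid} ? $info_for{$userid} : '';
    # use $userinfo here
}
close $fh_a;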
 
Tulan W. Hu

Math55 said:
Hi, is there a possibility of reading large (>1 MB) *.txt files in a fast
way? Every time I do that, my program freezes or takes very long to
finish. Anyone have an idea?

THANKS:)

Upgrade your Perl to 5.8.1 and use Tie::File.
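
For the original question, that might look like this (a sketch; the file
path is a placeholder):

#!perl
use strict;
use warnings;
use Tie::File;

# each element of @lines is one line of the file, fetched from disk
# on demand instead of slurped all at once (Tie::File chomps the
# newlines off by default)
tie my @lines, 'Tie::File', '/full/path/to/bigtextfile'
    or die "Cannot tie file: $!\n";

for my $line (@lines)
{
    # do stuff with $line
}

untie @lines;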
 
Peter Hickman

Tulan said:
Upgrade your Perl to 5.8.1 and use Tie::File.

To be honest, this is not good advice. His code is grossly inefficient;
most of the improvement will come from a better design, not from using a
module to implement a bad design.
 
Jesper

However this is still not the way, even if TIMTOWTDI.
Before you read txta, read txtb into a hash:

foreach $txtb_line (@txtb)
{
    my (@array) = $txtb_line =~ m!(.*);(.*)!;
    $lookup{$array[0]} = $array[1];
}

Then, where you had the inner loop:

if ($lookup{$userid}) {
    $userinfo = $lookup{$userid};
} else {
    $userinfo = '';
}

It doesn't get much faster than that.

Thanks, that works pretty darn nicely - it takes about 6 secs to do
240,000,000 comparisons :)




Regards

Jesper
 
Eric J. Roode


I have the same problem.
I have two txt files (txta and txtb), each with 15,000-20,000 lines. Each
line contains a username + info. Now what I want to do is take a line
from txta, retrieve the username from this line, search for the
username in txtb and return the info that's related to the username in
txtb. My code looks like:

open (txta, "txta");
@txta= <txta>;
close(txta);

open (txtb, "txtb");
@txtb= <txtb>;
close(txtb);

foreach $txta_line (@txta)
{

[Eric gets out the Holy Baseball Bat of Education]

Do NOT [whap!] read [whap!] an entire *file* [whap!] into an *array*
[whap!] just so you can [whap!] loop [whap!] over it!!! [whap-whap-whap!]

[Eric puts the baseball bat away]

Okay, now that we're in the proper frame of mind for some education:

One: You read an entire file (txta) into an array, then apparently do
nothing more than loop over that array. This is Bad. [don't make me get
the baseball bat out again.] It's a waste of memory, and it doesn't buy
you anything. It's foolish. It means that you copied a bad habit from
some bad book or bad instructor.

Two: You read another entire file (txtb) into memory, and repeatedly loop
over it in search of user ids. In this case, it's not entirely hopeless
that you slurped the entire file into memory, because you're referencing
it frequently, but there's a better way. Read through the txtb file, put
the userids and their info into a hash, and reference that hash while
going through txta. That way you only have to scan file txtb once.

Something like this:

my %uid_lookup;

while (<txtb>)
{
    chomp;   # (?)
    $uid_lookup{$1} = $2 if /(.*);(.*)/;
}

while (<txta>)
{
    # ... (code that sets $userid) ...
    $userinfo = $uid_lookup{$userid};
    ...
}


There. One pass per file. No huge arrays in memory (although, who knows
how large %uid_lookup can get?).

--
Eric
$_ = reverse sort $ /. r , qw p ekca lre uJ reh
ts p , map $ _. $ " , qw e p h tona e and print

 
Chris Mattern

Math55 said:
Hi, is there a possibility of reading large (>1 MB) *.txt files in a fast
way? Every time I do that, my program freezes or takes very long to
finish. Anyone have an idea?

Specifics. Specifics are good. Specifics want to be your friend.
How fast is "in a fast way"? How long is "very long"? And most of
all--what "that" are you doing?

Chris Mattern
 
Chris Mattern

Peter said:
To be honest, this is not good advice. His code is grossly inefficient;
most of the improvement will come from a better design, not from using a
module to implement a bad design.

Actually, we don't know if it's good advice or not. The OP hasn't
given us any code to look at. The code we've seen in this thread
came from another newbie who chimed in about having the same problem,
except he actually provided people with code to fix.

Chris Mattern
 
Tintin

Jesper said:
I have the same problem.
I have two txt files (txta and txtb), each with 15,000-20,000 lines. Each line
contains a username + info. Now what I want to do is take a line from txta,
retrieve the username from this line, search for the username in txtb and
return the info that's related to the username in txtb. My code looks like:

open (txta, "txta");
@txta= <txta>;
close(txta);

open (txtb, "txtb");
@txtb= <txtb>;
close(txtb);

[snipped rest of code]

I don't want to be critical of your code (Eric did a good, and justified,
job of that).

What I really want to know (and there was a long thread about this not too
long ago), is what thought processes led you to think that slurping a large
file into an array was a good thing to do (for this particular case)?
 
