filehandle, read lines

roy.schultheiss

Hello all,

I am looking for a way to open a file in Perl and start reading at,
e.g., line 1,000,000.
One possibility is:

open (FILE, "...");

# move handle to line 1,000,000
for ($i=0; $i<1000000;$i++)
{ $data = <FILE>; }

# something magic ...

close (FILE);

This is not very efficient. So, is there another "better" way to
move the position of the handle through the lines of a file?

Thank you very much,

roy
 
anno4000

Hello all,

I am looking for a way to open a file in Perl and start reading at,
e.g., line 1,000,000.
One possibility is:

open (FILE, "...");

# move handle to line 1,000,000
for ($i=0; $i<1000000;$i++)
{ $data = <FILE>; }

# something magic ...

close (FILE);

This is not very efficient. So, is there another "better" way to
move the position of the handle through the lines of a file?

No, there isn't. It's a file full of bytes; the n-th line starts after
the (n-1)-th occurrence of a line feed. So you must inspect all bytes
before the line in question.

There are modules that hide the ugliness. Tie::File represents the
file as an array of lines, so you can access the 1_000_000-th line
"directly" as $tied_array[ 1_000_000]. Internally it, too, must
check all preceding lines.
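
For completeness, a minimal Tie::File sketch (the file name is just a
placeholder, and note the tied array is 0-based, so line 1,000,000 sits
at index 999_999):

use strict;
use warnings;
use Tie::File;

# Tie the file to an array of lines; fetching an element still has to
# scan the file up to that record the first time it is read.
tie my @lines, 'Tie::File', 'orders.txt'
    or die "Can't tie orders.txt: $!";

print $lines[999_999], "\n";    # the 1,000,000th line
untie @lines;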

Anno
 
Jens Thoms Toerring

I am looking for a way to open a file in Perl and start reading at,
e.g., line 1,000,000.
One possibility is:
open (FILE, "...");
# move handle to line 1,000,000
for ($i=0; $i<1000000;$i++)
{ $data = <FILE>; }
# something magic ...
close (FILE);
This is not very efficient. So, is there another "better" way to
move the position of the handle through the lines of a file?

Unless you have some extra knowledge about the file (e.g. that
all lines have the same length) there can't be any solution to
that problem that is not more or less equivalent to the one you
have already given above. It simply boils down to finding the
1,000,000th occurrence of "\n" (which can be, depending on the
system you're on, a single character or a group of characters)
and positioning the internally maintained "pointer" into the file
just behind it. This, in turn, requires reading everything in the
file up to that place, with the only exception being the case
where you have some prior knowledge of where exactly that position
is, in which case you can use the seek() function (or sysseek()
if you intend to use sysread() or syswrite()) to "jump" to that
position directly.

You may be able to speed things up a bit by using system-specific
functions (e.g. something like mmap) that allow you to map the file
into memory, which you can then search for the nth "\n" perhaps a
bit faster than with the normal open() and <>.
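
As a plain-Perl illustration of the idea (no mmap module, just large
read()s; the file name and line number are placeholders, and Unix-style
"\n" line endings are assumed), something like this finds the byte
offset at which a given line starts:

use strict;
use warnings;

my ($file, $line_no) = ('orders.txt', 1_000_000);    # placeholders
my $skip = $line_no - 1;         # newlines to pass before that line starts

open my $fh, '<', $file or die "Can't open $file: $!";

my ($seen, $offset, $buf) = (0, 0, '');
while (my $got = read $fh, $buf, 1 << 20) {           # 1 MB blocks
    my $nl = $buf =~ tr/\n//;                         # newlines in this block
    if ($seen + $nl < $skip) {                        # target is further on
        $seen   += $nl;
        $offset += $got;
        next;
    }
    my $pos = -1;                                     # target is in this block
    while ($seen < $skip) {
        $pos = index $buf, "\n", $pos + 1;
        $seen++;
    }
    $offset += $pos + 1;                              # first byte of the wanted line
    last;
}

seek $fh, $offset, 0;            # position the handle at line $line_no
my $line = <$fh>;                # ... and read from there as usual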

Regards, Jens
 
roy.schultheiss

Thank you both for your answers.

I'm trying to split a file into smaller parts so that multiple processes
can work on it. So the 1st process works on lines 1 - 10_000_000, the
2nd process on lines 10_000_001 to 20_000_000, etc.

I'll use seek to set the position of the handle.

Regards, Roy
 
anno4000

Thank you both for your answers.

I'm trying to split a file into smaller parts so that multiple processes
can work on it. So the 1st process works on lines 1 - 10_000_000, the
2nd process on lines 10_000_001 to 20_000_000, etc.

I'll use seek to set the position of the handle.

Hmm... Seek finds positions by byte, not by line.

To split a file (handle $fh) into $n chunks of roughly equal length
that all start at a new line, you could do this:

my $size = (-s $fh)/$n;
my @chunk_pos = ( 0 );
for ( 1 .. $n ) {
    seek $fh, $size*$_, 0;
    <$fh>;                       # this will (in general) read an incomplete line
    push @chunk_pos, tell $fh;   # save start of next line
}

# process each chunk
for my $chunk ( 1 .. $n ) {
    seek $fh, $chunk_pos[ $chunk - 1 ], 0;
    while ( <$fh> ) {
        # handle line in $_
        last if tell( $fh ) >= $chunk_pos[ $chunk ];
    }
}

You won't know in advance the line numbers in each chunk, nor how many
lines each chunk holds exactly. They will be of roughly equal size if
the distribution of line lengths isn't wildly irregular.

Anno
 
xhoster

Thank you both for your answers.

I'm trying to split a file into smaller parts so that multiple processes
can work on it. So the 1st process works on lines 1 - 10_000_000, the
2nd process on lines 10_000_001 to 20_000_000, etc.

I generally prefer to "stripe" rather than "split" files, when I can
get away with it. So if I wanted to run 4 jobs, I would start 4 jobs,
each one given a $task from 0 to 3.

while (<>) {
    next unless $. % 4 == $task;
    # ....
}


This way, you don't need to pre-compute anything based on the size of the
file. If you have an IO bottleneck, this could be either better or worse,
IO-wise, than the splitting method depending on the exact details of your
IO system.
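
For what it's worth, a minimal sketch of wiring that up, assuming $task
and the file name come from the command line (script name and file name
are made up):

#!/usr/bin/perl
use strict;
use warnings;

my $stripes = 4;
my $task    = shift @ARGV;    # 0 .. 3, one process per value

# usage:  stripe.pl 0 bigfile &  stripe.pl 1 bigfile &  ...
while (<>) {
    next unless $. % $stripes == $task;
    # ... process the line in $_ ...
}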

I'll use seek to set the position of the handle.

Be aware that seeking will likely put you into the middle of a line. So
you must burn that line before you start on the next "real" one. And you
need to arrange for this burned partial line to get processed correctly
by one of the other tasks. That is not terribly hard to do, but it is
also easy to screw up.
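
In code, the burn step itself is just this (a sketch; $fh and $from are
assumed to be the open handle and this task's starting byte offset):

seek $fh, $from, 0;   # jump to the raw byte position
<$fh>;                # burn the (probably partial) line we landed in
# from here on, <$fh> returns complete lines; the burned line is the
# responsibility of the task that owns the preceding byte range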

Xho
 
roy

I receive an XML file of up to 1 GB, full of orders, every day. I have to
split out the orders and load them into a database for further processing.
I spread this job across multiple processes. This runs properly now.

Here is a short excerpt from the code:

------------------------------ 8< ------------------------------

use IO::File;                  # needed for "new IO::File" below
use Fcntl qw(SEEK_CUR);        # needed for seek() in set_handle()
use Proc::Simple;
use constant MAX_PROCESSES => 10;

$filesize = -s "... file";
$step = int($filesize/MAX_PROCESSES+1);

for (my $i=0;$i<MAX_PROCESSES;$i++) {
    $procs[$i] = Proc::Simple->new();
    $procs[$i]->start(\&insert_orders, $filename, $i*$step, ($i+1)*$step);
}

....

sub insert_orders {
    my ($filename, $from, $to) = @_;

    my $xml = new IO::File;
    open ($xml, "< $filename");

    if ($xml = set_handle ($xml, $from)) {
        while (defined ($_dat = <$xml>)) {
            $_temp = "\U$_dat\E";   # Convert into capital letters
            $_temp =~ s/\s+//g;     # Remove whitespace

            if ($_temp eq '<ORDER>') {
                $_mode  = 'order';
                $_order = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
            }

            $_order .= $_dat if $_mode eq 'order';

            if ($_temp eq '</ORDER>') {
                # load $_order into the database ...

                $_order = '';
                $_mode  = '';

                last if ($to <= tell ($xml));
            }
        }
    }

    ...

    close ($xml);
    return 1;
}


sub set_handle {
    my ($handle, $pos) = @_;

    seek($handle,$pos,SEEK_CUR);

    if (defined (<$handle>))    # skip to the start of the next line
        { return $handle; }
    else
        { return; }
}

------------------------------ 8< ------------------------------

It must be guaranteed that each process starts at a different order,
otherwise an order would be inserted more than once.

So, thank you all for your answers. Let me know if you want more
information about the code.

Regards,

roy
 
Klaus

Instead of

    for (my $i=0;$i<MAX_PROCESSES;$i++) {

I would write

    for my $i (0..MAX_PROCESSES - 1) {

and instead of

    my $xml = new IO::File;
    open ($xml, "< $filename");

a three-argument open with an error check:

    open my $xml, '<', $filename or die "Error: $!";

And in

    while (defined ($_dat = <$xml>)) {
        $_temp = "\U$_dat\E";

variables starting with "$_..." (such as "$_dat" or "$_temp") look
very strange (at least to me) and are too easily confused with "$_".
 
xhoster

roy said:
I receive an XML file of up to 1 GB, full of orders, every day. I have to
split out the orders and load them into a database for further processing.
I spread this job across multiple processes. This runs properly now.

This seems dangerous to me. Generally XML should be parsed by an XML
parser, not by something that happens to parse some restricted subset of
XML with a particular whitespace pattern that works for one example. If
the person who generates the file changes the white-space, for example,
then your program would break, while they could correctly claim that the
file they produced is valid XML and your program shouldn't have broken on
it. OTOH, if you have assurances the file you receive will always be in a
particular subset of XML that matches the expected white space, etc., then
perhaps the trade-off you are making is acceptable.

If parsing is not the bottleneck, then I'd just use something like
XML::Twig to read this one XML file and parcel it out to 10 XML files as a
preprocessing step. Then run each of those 10 files separately. Of
course, if parsing is the bottleneck, then this would defeat the purpose of
parallelizing.
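
For the "parcel out" step, a rough XML::Twig sketch might look like this
(the element name 'order', the output file names and the <orders> wrapper
element are assumptions about the format, not something taken from your
code):

use strict;
use warnings;
use XML::Twig;

my $n_files = 10;
my @out;
for my $i (0 .. $n_files - 1) {
    open $out[$i], '>', "orders_part_$i.xml" or die "open: $!";
    print { $out[$i] } qq{<?xml version="1.0" encoding="UTF-8"?>\n<orders>\n};
}

my $count = 0;
my $twig  = XML::Twig->new(
    twig_handlers => {
        order => sub {
            my ($t, $order) = @_;
            my $fh = $out[ $count++ % $n_files ];   # deal orders out round-robin
            print {$fh} $order->sprint, "\n";
            $t->purge;                              # keep memory use flat
        },
    },
);
$twig->parsefile('orders.xml');                     # the big input file

for my $fh (@out) {
    print {$fh} "</orders>\n";
    close $fh or die "close: $!";
}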

Also, you should probably adapt your code to support "use strict;"
sub insert_orders {
    my ($filename, $from, $to) = @_;

    my $xml = new IO::File;
    open ($xml, "< $filename");

    if ($xml = set_handle ($xml, $from)) {
        while (defined ($_dat = <$xml>)) {
            $_temp = "\U$_dat\E";   # Convert into capital letters
            $_temp =~ s/\s+//g;     # Remove whitespace

            if ($_temp eq '<ORDER>') {
                $_mode  = 'order';
                $_order = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
            }

If you start fractionally through an order, this code above "burns" lines
until you get to the start of the first "full" order. Yet your set_handle
code also burns the initial (potentially partial) line. It would be
cleaner if the code to burn data were all in one place.

            $_order .= $_dat if $_mode eq 'order';

            if ($_temp eq '</ORDER>') {
                # load $_order into the database ...

This tries to load $_order into the database even when there is no
order to load, i.e. when you started out in the middle of a previous order.
$_order will be empty, but you try to load it anyway. Is that a problem?
You want to load $_order only when $_mode eq 'order'.
                $_order = '';
                $_mode  = '';

                last if ($to <= tell ($xml));

This has the potential to lose orders. Let's say that $to is 1000, and
an order starts exactly at position 1000. This job will not process that
order, because $to<=1000 is true. The next-higher job, whose $from is
1000, also will not process this order, as the "partial" first line it
burned just happened to be a true full line, and that order therefore
gets forgotten. (I've verified that this does in fact happen in a test
case.)

You probably want

last if ($to < tell($xml));

(Or change the way you burn data, as suggested above, so it all happens
in only one place.)

....
sub set_handle {
    my ($handle, $pos) = @_;

    seek($handle,$pos,SEEK_CUR);

    if (defined (<$handle>))    # skip to the start of the next line

You probably only want to burn a line when $pos>0. When $pos==0, you
know the first line you read will be complete, so there is no reason
to burn it. Generally the burned line at the start of each chunk gets
processed as part of the "previous" chunk, but when $pos==0 there was no
previous chunk. This will not actually be a problem unless the first
line of your XML file contains something you actually need (presumably
it is just the <?xml ... ?> declaration).
        { return $handle; }
    else
        { return; }

I don't understand the above. If reading from $handle returned an
undefined value this time, won't it do so next time as well? (I think
the only time this isn't true is when $handle is an alias for ARGV.)
So why not just return $handle regardless?
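
Something like this, for instance (a sketch of set_handle with those two
changes applied; SEEK_CUR is kept from the original, where it behaves like
an absolute seek because the handle has just been opened):

use Fcntl qw(SEEK_CUR);

sub set_handle {
    my ($handle, $pos) = @_;

    seek($handle, $pos, SEEK_CUR);   # freshly opened handle, so effectively absolute
    if ($pos > 0) {
        <$handle>;                   # burn the (possibly partial) first line
    }
    return $handle;                  # always return the handle
}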

Xho
 
J. Gleixner

roy said:
I receive an XML file of up to 1 GB, full of orders, every day. I have to
split out the orders and load them into a database for further processing.
I spread this job across multiple processes. This runs properly now.

This seems dangerous to me. Generally XML should be parsed by an XML
parser, not by something that happens to parse some restricted subset of [...]
If parsing is not the bottleneck, then I'd just use something like
XML::Twig to read this one XML file and parcel it out to 10 XML files as a
preprocessing step. Then run each of those 10 files separately. Of
course, if parsing is the bottleneck, then this would defeat the purpose of
parallelizing.

Also, see xml_split, which is one of the tools included with XML::Twig.

http://search.cpan.org/~mirod/XML-Twig-3.29/tools/xml_split/xml_split
 
xhoster

I receive an XML file of up to 1 GB, full of orders, every day. I have to
split out the orders and load them into a database for further processing.
I spread this job across multiple processes. This runs properly now.

Here is a short excerpt from the code:

------------------------------ 8< ------------------------------

use Proc::Simple;
use constant MAX_PROCESSES => 10;

$filesize = -s "... file";
$step = int($filesize/MAX_PROCESSES+1);

for (my $i=0;$i<MAX_PROCESSES;$i++) {
    $procs[$i] = Proc::Simple->new();
    $procs[$i]->start(\&insert_orders, $filename, $i*$step, ($i+1)*$step);
}

...

Hey, this is hardcoded stuff, multiples of 10.
Lucky for you the format is the same every time.

What do you do when the file is corrupt due to fragmentation
errors?

You can't be serious. This is XML, man. What you're doing is worse
than reading a serial port stream.

To quote someone from a different thread, who is probably a troll
trying to resurrect himself under a new name:

: Yeah, can you describe whats wrong with it, or you just blowing smoke?
: Sln

Well, are you?

Xho
 
