filehandle, read lines

roy.schultheiss

Hello all,

I am looking for a way to open a file in Perl and start reading at,
e.g., line 1,000,000.
One possibility is:

open (FILE, "...");

# move handle to line 1,000,000
for ($i=0; $i<1000000;$i++)
{ $data = <FILE>; }

# something magic ...

close (FILE);

This is not very efficient. So, is there another "better" way to
move the position of the handle through the lines of a file?

Thank you very much,

roy
 
anno4000

Hello all,

I am looking for a way to open a file in Perl and start reading at,
e.g., line 1,000,000.
One possibility is:

open (FILE, "...");

# move handle to line 1,000,000
for ($i=0; $i<1000000;$i++)
{ $data = <FILE>; }

# something magic ...

close (FILE);

This is not very efficient. So, is there another "better" way to
move the position of the handle through the lines of a file?

No, there isn't. It's a file full of bytes; the n-th line starts after
the (n-1)-th occurrence of a line feed. So you must inspect all bytes
before the line in question.

There are modules that hide the ugliness. Tie::File represents the
file as an array of lines, so you can access the 1_000_000-th line
"directly" as $tied_array[ 1_000_000]. Internally it, too, must
check all preceding lines.
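
For completeness, a minimal Tie::File sketch (the file name is just a
placeholder, and note the tied array is 0-based, so line 1,000,000 sits
at index 999_999):

use strict;
use warnings;
use Tie::File;

# Tie the file to an array of lines; fetching an element still has to
# scan the file up to that record the first time it is read.
tie my @lines, 'Tie::File', 'orders.txt'
    or die "Can't tie orders.txt: $!";

print $lines[999_999], "\n";    # the 1,000,000th line
untie @lines;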

Anno
 
Jens Thoms Toerring

I am looking for a way to open a file in Perl and start reading at,
e.g., line 1,000,000.
One possibility is:
open (FILE, "...");
# move handle to line 1,000,000
for ($i=0; $i<1000000;$i++)
{ $data = <FILE>; }
# something magic ...
close (FILE);
This is not very efficient. So, is there another "better" way to
move the position of the handle through the lines of a file?

Unless you have some extra knowledge about the file (e.g. that
all lines have the same length) there can't be any solution to
that problem that is not more or less equivalent to the one you
have already given above. It simply boils down to finding the
1,000,000th occurrence of "\n" (which can be, depending on the
system you're on, a single character or a group of characters)
and positioning the internally maintained "pointer" into the file
just behind it. This, in turn, requires reading everything in the
file up to that place, with the only exception being the case
where you have some prior knowledge of where exactly that position
is, in which case you can use the seek() function (or sysseek()
if you intend to use sysread() or syswrite()) to "jump" to that
position directly.

You may be able to speed things up a bit by using system-specific
functions (e.g. something like mmap) that allow you to map the file
into memory, which you can then search for the nth "\n" perhaps a
bit faster than with the normal open() and <>.
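
As a plain-Perl illustration of the idea (no mmap module, just large
read()s; the file name and line number are placeholders, and Unix-style
"\n" line endings are assumed), something like this finds the byte
offset at which a given line starts:

use strict;
use warnings;

my ($file, $line_no) = ('orders.txt', 1_000_000);    # placeholders
my $skip = $line_no - 1;         # newlines to pass before that line starts

open my $fh, '<', $file or die "Can't open $file: $!";

my ($seen, $offset, $buf) = (0, 0, '');
while (my $got = read $fh, $buf, 1 << 20) {           # 1 MB blocks
    my $nl = $buf =~ tr/\n//;                         # newlines in this block
    if ($seen + $nl < $skip) {                        # target is further on
        $seen   += $nl;
        $offset += $got;
        next;
    }
    my $pos = -1;                                     # target is in this block
    while ($seen < $skip) {
        $pos = index $buf, "\n", $pos + 1;
        $seen++;
    }
    $offset += $pos + 1;                              # first byte of the wanted line
    last;
}

seek $fh, $offset, 0;            # position the handle at line $line_no
my $line = <$fh>;                # ... and read from there as usual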

Regards, Jens
 
roy.schultheiss

Thank you both for your answers.

I'm trying to split a file into smaller parts so that multiple processes
can work on it. So the 1st process works on lines 1 - 10_000_000, the
2nd process on lines 10_000_001 to 20_000_000, etc.

I'll use seek to set the position of the handle.

Regards, Roy
 
anno4000

Thank you both for your answers.

I'm trying to split a file into smaller parts so that multiple processes
can work on it. So the 1st process works on lines 1 - 10_000_000, the
2nd process on lines 10_000_001 to 20_000_000, etc.

I'll use seek to set the position of the handle.

Hmm... Seek finds positions by byte, not by line.

To split a file (handle $fh) into $n chunks of roughly equal length
that all start at a new line, you could do this:

my $size = (-s $fh)/$n;
my @chunk_pos = ( 0 );
for ( 1 .. $n ) {
    seek $fh, $size*$_, 0;
    <$fh>;                       # this will (in general) read an incomplete line
    push @chunk_pos, tell $fh;   # save start of next line
}

# process each chunk
for my $chunk ( 1 .. $n ) {
    seek $fh, $chunk_pos[ $chunk - 1 ], 0;
    while ( <$fh> ) {
        # handle line in $_
        last if tell( $fh ) >= $chunk_pos[ $chunk ];
    }
}

You won't know in advance the line numbers in each chunk, nor how many
lines each chunk holds exactly. They will be of roughly equal size if
the distribution of line lengths isn't wildly irregular.

Anno
 
xhoster

Thank you both for your answers.

I'm trying to split a file into smaller parts so that multiple processes
can work on it. So the 1st process works on lines 1 - 10_000_000, the
2nd process on lines 10_000_001 to 20_000_000, etc.

I generally prefer to "stripe" rather than "split" files, when I can
get away with it. So if I wanted to run 4 jobs, I would start 4 jobs,
each one given a $task from 0 to 3.

while (<>) {
    next unless $. % 4 == $task;
    # ....
}


This way, you don't need to pre-compute anything based on the size of the
file. If you have an IO bottleneck, this could be either better or worse,
IO-wise, than the splitting method depending on the exact details of your
IO system.
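
For what it's worth, a minimal sketch of wiring that up, assuming $task
and the file name come from the command line (script name and file name
are made up):

#!/usr/bin/perl
use strict;
use warnings;

my $stripes = 4;
my $task    = shift @ARGV;    # 0 .. 3, one process per value

# usage:  stripe.pl 0 bigfile &  stripe.pl 1 bigfile &  ...
while (<>) {
    next unless $. % $stripes == $task;
    # ... process the line in $_ ...
}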

I'll use seek to set the position of the handle.

Be aware that seeking will likely put you into the middle of a line. So
you must burn that line before you start on the next "real" one. And you
need to arrange for this burned partial line to get processed correctly
by one of the other tasks. That is not terribly hard to do, but it is
also easy to screw up.
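
In code, the burn step itself is just this (a sketch; $fh and $from are
assumed to be the open handle and this task's starting byte offset):

seek $fh, $from, 0;   # jump to the raw byte position
<$fh>;                # burn the (probably partial) line we landed in
# from here on, <$fh> returns complete lines; the burned line is the
# responsibility of the task that owns the preceding byte range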

Xho
 
roy

I receive an XML file of up to 1 GB, full of orders, every day. I have to
split out the orders and load them into a database for further processing.
I spread this job across multiple processes. This runs properly now.

Here is a short excerpt from the code:

------------------------------ 8< ------------------------------

use IO::File;                  # needed for "new IO::File" below
use Fcntl qw(SEEK_CUR);        # needed for seek() in set_handle()
use Proc::Simple;
use constant MAX_PROCESSES => 10;

$filesize = -s "... file";
$step = int($filesize/MAX_PROCESSES+1);

for (my $i=0;$i<MAX_PROCESSES;$i++) {
    $procs[$i] = Proc::Simple->new();
    $procs[$i]->start(\&insert_orders, $filename, $i*$step, ($i+1)*$step);
}

....

sub insert_orders {
    my ($filename, $from, $to) = @_;

    my $xml = new IO::File;
    open ($xml, "< $filename");

    if ($xml = set_handle ($xml, $from)) {
        while (defined ($_dat = <$xml>)) {
            $_temp = "\U$_dat\E";   # Convert into capital letters
            $_temp =~ s/\s+//g;     # Remove whitespace

            if ($_temp eq '<ORDER>') {
                $_mode  = 'order';
                $_order = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
            }

            $_order .= $_dat if $_mode eq 'order';

            if ($_temp eq '</ORDER>') {
                # load $_order into the database ...

                $_order = '';
                $_mode  = '';

                last if ($to <= tell ($xml));
            }
        }
    }

    ...

    close ($xml);
    return 1;
}


sub set_handle {
    my ($handle, $pos) = @_;

    seek($handle,$pos,SEEK_CUR);

    if (defined (<$handle>))    # skip to the start of the next line
        { return $handle; }
    else
        { return; }
}

------------------------------ 8< ------------------------------

It must be guaranteed that each process starts at a different order,
otherwise an order would be inserted more than once.

So, thank you all for your answers. Let me know if you want more
information about the code.

Regards,

roy
 
Klaus

Instead of

    for (my $i=0;$i<MAX_PROCESSES;$i++) {

I would write

    for my $i (0..MAX_PROCESSES - 1) {

and instead of

    my $xml = new IO::File;
    open ($xml, "< $filename");

a three-argument open with an error check:

    open my $xml, '<', $filename or die "Error: $!";

And in

    while (defined ($_dat = <$xml>)) {
        $_temp = "\U$_dat\E";

variables starting with "$_..." (such as "$_dat" or "$_temp") look
very strange (at least to me) and are too easily confused with "$_".
 
xhoster

roy said:
I receive an XML file of up to 1 GB, full of orders, every day. I have to
split out the orders and load them into a database for further processing.
I spread this job across multiple processes. This runs properly now.

This seems dangerous to me. Generally XML should be parsed by an XML
parser, not by something that happens to parse some restricted subset of
XML with a particular whitespace pattern that works for one example. If
the person who generates the file changes the white-space, for example,
then your program would break, while they could correctly claim that the
file they produced is valid XML and your program shouldn't have broken on
it. OTOH, if you have assurances the file you receive will always be in a
particular subset of XML that matches the expected white space, etc., then
perhaps the trade-off you are making is acceptable.

If parsing is not the bottleneck, then I'd just use something like
XML::Twig to read this one XML file and parcel it out to 10 XML files as a
preprocessing step. Then run each of those 10 files separately. Of
course, if parsing is the bottleneck, then this would defeat the purpose of
parallelizing.
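
For the "parcel out" step, a rough XML::Twig sketch might look like this
(the element name 'order', the output file names and the <orders> wrapper
element are assumptions about the format, not something taken from your
code):

use strict;
use warnings;
use XML::Twig;

my $n_files = 10;
my @out;
for my $i (0 .. $n_files - 1) {
    open $out[$i], '>', "orders_part_$i.xml" or die "open: $!";
    print { $out[$i] } qq{<?xml version="1.0" encoding="UTF-8"?>\n<orders>\n};
}

my $count = 0;
my $twig  = XML::Twig->new(
    twig_handlers => {
        order => sub {
            my ($t, $order) = @_;
            my $fh = $out[ $count++ % $n_files ];   # deal orders out round-robin
            print {$fh} $order->sprint, "\n";
            $t->purge;                              # keep memory use flat
        },
    },
);
$twig->parsefile('orders.xml');                     # the big input file

for my $fh (@out) {
    print {$fh} "</orders>\n";
    close $fh or die "close: $!";
}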

Also, you should probably adapt your code to support "use strict;"
sub insert_orders {
    my ($filename, $from, $to) = @_;

    my $xml = new IO::File;
    open ($xml, "< $filename");

    if ($xml = set_handle ($xml, $from)) {
        while (defined ($_dat = <$xml>)) {
            $_temp = "\U$_dat\E";   # Convert into capital letters
            $_temp =~ s/\s+//g;     # Remove whitespace

            if ($_temp eq '<ORDER>') {
                $_mode  = 'order';
                $_order = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
            }

If you start fractionally through an order, this code above "burns" lines
until you get to the start of the first "full" order. Yet your set_handle
code also burns the initial (potentially partial) line. It would be
cleaner if the code to burn data were all in one place.

            $_order .= $_dat if $_mode eq 'order';

            if ($_temp eq '</ORDER>') {
                # load $_order into the database ...

This tries to load $_order into the database even when there is no
order to load, i.e. when you started out in the middle of a previous order.
$_order will be empty, but you try to load it anyway. Is that a problem?
You want to load $_order only when $_mode eq 'order'.
                $_order = '';
                $_mode  = '';

                last if ($to <= tell ($xml));

This has the potential to lose orders. Let's say that $to is 1000, and
an order starts exactly at position 1000. This job will not process that
order, because $to<=1000 is true. The next-higher job, whose $from is
1000, also will not process this order, as the "partial" first line it
burned just happened to be a true full line, and that order therefore
gets forgotten. (I've verified that this does in fact happen in a test
case.)

You probably want

last if ($to < tell($xml));

(Or change the way you burn data, as suggested above, so it all happens
in only one place.)

....
sub set_handle {
    my ($handle, $pos) = @_;

    seek($handle,$pos,SEEK_CUR);

    if (defined (<$handle>))    # skip to the start of the next line

You probably only want to burn a line when $pos>0. When $pos==0, you
know the first line you read will be complete, so there is no reason
to burn it. Generally the burned line at the start of each chunk gets
processed as part of the "previous" chunk, but when $pos==0 there was no
previous chunk. This will not actually be a problem unless the first
line of your XML file contains something you actually need (presumably
it is just the <?xml ... ?> declaration).
        { return $handle; }
    else
        { return; }

I don't understand the above. If reading from $handle returned an
undefined value this time, won't it do so next time as well? (I think
the only time this isn't true is when $handle is an alias for ARGV.)
So why not just return $handle regardless?
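
Something like this, for instance (a sketch of set_handle with those two
changes applied; SEEK_CUR is kept from the original, where it behaves like
an absolute seek because the handle has just been opened):

use Fcntl qw(SEEK_CUR);

sub set_handle {
    my ($handle, $pos) = @_;

    seek($handle, $pos, SEEK_CUR);   # freshly opened handle, so effectively absolute
    if ($pos > 0) {
        <$handle>;                   # burn the (possibly partial) first line
    }
    return $handle;                  # always return the handle
}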

Xho
 
J. Gleixner

roy said:
I receive an XML file of up to 1 GB, full of orders, every day. I have to
split out the orders and load them into a database for further processing.
I spread this job across multiple processes. This runs properly now.

This seems dangerous to me. Generally XML should be parsed by an XML
parser, not by something that happens to parse some restricted subset of [...]
If parsing is not the bottleneck, then I'd just use something like
XML::Twig to read this one XML file and parcel it out to 10 XML files as a
preprocessing step. Then run each of those 10 files separately. Of
course, if parsing is the bottleneck, then this would defeat the purpose of
parallelizing.

Also, see xml_split, which is one of the tools included with XML::Twig.

http://search.cpan.org/~mirod/XML-Twig-3.29/tools/xml_split/xml_split
 
xhoster

I receive an XML file of up to 1 GB, full of orders, every day. I have to
split out the orders and load them into a database for further processing.
I spread this job across multiple processes. This runs properly now.

Here is a short excerpt from the code:

------------------------------ 8< ------------------------------

use Proc::Simple;
use constant MAX_PROCESSES => 10;

$filesize = -s "... file";
$step = int($filesize/MAX_PROCESSES+1);

for (my $i=0;$i<MAX_PROCESSES;$i++) {
    $procs[$i] = Proc::Simple->new();
    $procs[$i]->start(\&insert_orders, $filename, $i*$step, ($i+1)*$step);
}

...

Hey, this is hardcoded stuff, multiples of 10.
Lucky for you the format is the same every time.

What do you do when the file is corrupt due to fragmentation
errors?

You can't be serious. This is XML, man. What you're doing is worse
than reading a serial port stream.

To quote someone from a different thread, who is probably a troll
trying to resurrect himself under a new name:

: Yeah, can you describe whats wrong with it, or you just blowing smoke?
: Sln

Well, are you?

Xho
 
