filehandle, read lines

Discussion in 'Perl Misc' started by roy.schultheiss@t-online.de, Jul 24, 2007.

  1. Guest

    Hello all,

    I am looking for a way to open a file in Perl and start reading e.g.
    on line 1,000,000.
    One possibility is:

    open (FILE, "...");

    # move handle to line 1,000,000
    for ($i=0; $i<1000000;$i++)
    { $data = <FILE>; }

    # something magic ...

    close (FILE);

    This is not very efficient. So, is there another "better" way to
    move the position of the handle through the lines of a file?

    Thank you very much,

    roy
     
    , Jul 24, 2007
    #1

  2. -berlin.de Guest

    <> wrote in comp.lang.perl.misc:
    > Hello all,
    >
    > I am looking for a way to open a file in Perl and start reading e.g.
    > on line 1,000,000.
    > One possibility is:
    >
    > open (FILE, "...");
    >
    > # move handle to line 1,000,000
    > for ($i=0; $i<1000000;$i++)
    > { $data = <FILE>; }
    >
    > # something magic ...
    >
    > close (FILE);
    >
    > This is not very efficient. So, is there another "better" way to
    > move the position of the handle through the lines of a file?


    No, there isn't. It's a file full of bytes; the n-th line starts after
    the (n-1)-th occurrence of a line feed. So you must inspect all bytes
    before the line in question.

    There are modules that hide the ugliness. Tie::File represents the
    file as an array of lines, so you can access the 1,000,000th line
    "directly" as $tied_array[999_999] (the tied array is zero-based).
    Internally it, too, must check all preceding lines.
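
    For completeness, a minimal sketch of that (big.txt is just a
    placeholder file name; note the zero-based index):

    use strict;
    use warnings;
    use Fcntl 'O_RDONLY';
    use Tie::File;

    # Present the file as a read-only array of lines.
    tie my @lines, 'Tie::File', 'big.txt', mode => O_RDONLY
        or die "Cannot tie big.txt: $!";

    # Index 999_999 is line 1,000,000 (the array is zero-based).
    print $lines[999_999], "\n";

    untie @lines;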

    Anno
     
    -berlin.de, Jul 24, 2007
    #2

  3. wrote:
    > I am looking for a way to open a file in Perl and start reading e.g.
    > on line 1,000,000.
    > One possibility is:


    > open (FILE, "...");


    > # move handle to line 1,000,000
    > for ($i=0; $i<1000000;$i++)
    > { $data = <FILE>; }


    > # something magic ...


    > close (FILE);


    > This is not very efficient. So, is there another "better" way to
    > move the position of the handle through the lines of a file?


    Unless you have some extra knowledge about the file (e.g. that
    all lines have the same length) there can't be any solution to
    that problem that is not more or less equivalent to the one you
    have already given above. It simply boils down to finding the
    1,000,000th occurrence of "\n" (which can be, depending on the
    system you're on, a single character or a group of characters)
    and positioning the internally maintained "pointer" into the
    file just behind it. This, in turn, requires that you read
    everything from the file up to that place. The only exception
    is the case where you have some prior knowledge of where that
    position exactly is, in which case you can use the seek()
    function (or sysseek() if you intend to use sysread() or
    syswrite()) to "jump" to that position directly.

    You may be able to speed things up a bit by using system-specific
    functions (e.g. something like mmap) that may allow you to map the
    file into memory, which you can then search for the n-th "\n",
    perhaps a bit faster than with the normal open() and <>.
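
    One CPAN option along those lines is File::Map; a rough sketch
    (big.txt again a placeholder, and the module itself an assumption
    about what is available on your system):

    use strict;
    use warnings;
    use File::Map 'map_file';

    # Map the file into memory and scan for the 999,999th "\n";
    # line 1,000,000 starts right after it.
    map_file my $map, 'big.txt', '<';

    my $pos = 0;
    for (1 .. 999_999) {
        $pos = 1 + index $map, "\n", $pos;
        die "file has fewer than 1,000,000 lines\n" if $pos == 0;
    }
    my $end = index $map, "\n", $pos;
    $end    = length $map if $end == -1;
    print substr($map, $pos, $end - $pos), "\n";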

    Regards, Jens
    --
    \ Jens Thoms Toerring ___
    \__________________________ http://toerring.de
     
    Jens Thoms Toerring, Jul 24, 2007
    #3
  4. Guest

    Thank you both for your answers.

    I am trying to split the file into smaller parts so that multiple
    processes can work on it. So the 1st process handles lines 1 -
    10_000_000, the 2nd process lines 10_000_001 to 20_000_000, etc.

    I'll use seek to set the position of the handle.

    Regards, Roy
     
    , Jul 25, 2007
    #4
  5. -berlin.de Guest

    <> wrote in comp.lang.perl.misc:
    > Thank you both for your answers.
    >
    > I am trying to split the file into smaller parts so that multiple
    > processes can work on it. So the 1st process handles lines 1 -
    > 10_000_000, the 2nd process lines 10_000_001 to 20_000_000, etc.
    >
    > I'll use seek to set the position of the handle.


    Hmm... Seek finds positions by byte, not by line.

    To split a file (handle $fh) into $n chunks of roughly equal length
    that all start at a new line, you could do this:

    my $size = (-s $fh)/$n;
    my @chunk_pos = ( 0 );
    for ( 1 .. $n ) {
        seek $fh, $size*$_, 0;
        <$fh>;                       # this will (in general) read an incomplete line
        push @chunk_pos, tell $fh;   # save start of next line
    }

    # process each chunk
    for my $chunk ( 1 .. $n ) {
        seek $fh, $chunk_pos[ $chunk - 1 ], 0;
        while ( <$fh> ) {
            # handle line in $_
            last if tell( $fh ) >= $chunk_pos[ $chunk ];
        }
    }

    You won't know in advance the line numbers in each chunk, nor how many
    lines each chunk holds exactly. They will be of roughly equal size if
    the distribution of line lengths isn't wildly irregular.
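
    For reference, the setup assumed by the snippet above would be
    something like this, with big.txt and 4 chunks as placeholder values:

    use strict;
    use warnings;

    my $n = 4;                      # number of chunks
    open my $fh, '<', 'big.txt' or die "Cannot open big.txt: $!";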

    Anno
     
    -berlin.de, Jul 25, 2007
    #5
  6. Guest

    wrote:
    > Thank you both for your answers.
    >
    > I am trying to split the file into smaller parts so that multiple
    > processes can work on it. So the 1st process handles lines 1 -
    > 10_000_000, the 2nd process lines 10_000_001 to 20_000_000, etc.


    I generally prefer to "stripe" rather than "split" files, when I can
    get away with it. So if I wanted to run 4 jobs, I would start 4 jobs,
    each one given a $task from 0 to 3.

    while (<>) {
        next unless $. % 4 == $task;
        # ....
    }


    This way, you don't need to pre-compute anything based on the size of the
    file. If you have an IO bottleneck, this could be either better or worse,
    IO-wise, than the splitting method depending on the exact details of your
    IO system.
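
    A self-contained version of such a striped worker might look like this
    sketch (worker.pl, the command-line arguments, and the hard-coded 4
    jobs are illustrative assumptions):

    #!/usr/bin/perl
    # Hypothetical striped worker: run as  perl worker.pl TASK FILE
    # with TASK = 0..3; this instance handles every 4th line.
    use strict;
    use warnings;

    my ($task, $file) = @ARGV;
    my $jobs = 4;

    open my $fh, '<', $file or die "Cannot open $file: $!";
    while (<$fh>) {
        next unless $. % $jobs == $task;
        # ... process the line in $_ ...
    }
    close $fh;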


    > I'll use seek to set the position of the handle.


    Be aware that seeking will likely put you in the middle of a line. So
    you must burn that line before you start on the next "real" one. And
    you need to arrange for that burned partial line to be processed
    correctly by one of the other tasks. That is not terribly hard to do,
    but it is also easy to screw up.
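
    A minimal sketch of that burn step ($fh and $start stand in for
    whatever your chunking code provides):

    seek $fh, $start, 0 or die "Cannot seek: $!";
    <$fh> if $start > 0;   # discard the (probably partial) line we landed in;
                           # another task has to pick it up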

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    Usenet Newsgroup Service $9.95/Month 30GB
     
    , Jul 25, 2007
    #6
  7. roy Guest

    I receive an XML file of up to 1 GB, full of orders, every day. I have
    to split the orders and load them into a database for further
    processing. I share this job among multiple processes. This runs
    properly now.

    Here is a small excerpt from the code:

    ------------------------------ 8< ------------------------------

    use Proc::Simple;
    use IO::File;
    use Fcntl qw(:seek);
    use constant MAX_PROCESSES => 10;

    $filesize = -s "... file";
    $step     = int($filesize/MAX_PROCESSES + 1);

    for (my $i = 0; $i < MAX_PROCESSES; $i++) {
        $procs[$i] = Proc::Simple->new();
        $procs[$i]->start(\&insert_orders, $filename, $i*$step, ($i+1)*$step);
    }

    ....

    sub insert_orders {
        my ($filename, $from, $to) = @_;

        my $xml = new IO::File;
        open ($xml, "< $filename");

        if ($xml = set_handle ($xml, $from)) {
            while (defined ($_dat = <$xml>)) {
                $_temp = "\U$_dat\E";   # convert to capital letters
                $_temp =~ s/\s+//g;     # remove whitespace

                if ($_temp eq '<ORDER>') {
                    $_mode  = 'order';
                    $_order = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
                }

                $_order .= $_dat if $_mode eq 'order';

                if ($_temp eq '</ORDER>') {
                    # load $_order into the database ...

                    $_order = '';
                    $_mode  = '';

                    last if ($to <= tell ($xml));
                }
            }
        }

        ...

        close ($xml);
        return 1;
    }

    sub set_handle {
        my ($handle, $pos) = @_;

        seek($handle, $pos, SEEK_CUR);

        if (defined (<$handle>))    # start at a new line
            { return $handle; }
        else
            { return; }
    }

    ------------------------------ 8< ------------------------------

    It must be guaranteed that each process starts on a different order,
    otherwise an order would be inserted more than once.

    So, thank you all for your answers. Let me know if you want more
    information about the code.

    Regards,

    roy
     
    roy, Jul 26, 2007
    #7
  8. Klaus Guest

    On Jul 26, 2:50 pm, roy <> wrote:
    > for (my $i=0;$i<MAX_PROCESSES;$i++) {


    for my $i (0..MAX_PROCESSES - 1) {

    > my $xml = new IO::File;
    > open ($xml, "< $filename");


    open my $xml, '<', $filename or die "Error: $!";

    > while (defined ($_dat = <$xml>)) {
    > $_temp = "\U$_dat\E";


    Variables starting with "$_..." (such as "$_dat" or "$_temp") look
    very strange (at least to me) and are too easily confused with "$_".

    --
    Klaus
     
    Klaus, Jul 26, 2007
    #8
  9. Guest

    roy <> wrote:
    > I receive an XML file of up to 1 GB, full of orders, every day. I have
    > to split the orders and load them into a database for further
    > processing. I share this job among multiple processes. This runs
    > properly now.


    This seems dangerous to me. Generally XML should be parsed by an XML
    parser, not by something that happens to parse some restricted subset of
    XML with a particular whitespace pattern that works for one example. If
    the person who generates the file changes the white-space, for example,
    then your program would break, while they could correctly claim that the
    file they produced is valid XML and your program shouldn't have broken on
    it. OTOH, if you have assurances the file you receive will always be in a
    particular subset of XML that matches the expected white space, etc., then
    perhaps the trade-off you are making is acceptable.

    If parsing is not the bottleneck, then I'd just use something like
    XML::Twig to read this one XML file and parcel it out to 10 XML files as a
    preprocessing step. Then run each of those 10 files separately. Of
    course, if parsing is the bottleneck, then this would defeat the purpose of
    parallelizing.
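
    A rough sketch of that preprocessing step (the file names, the
    <orders> wrapper element and the lowercase <order> element name are
    assumptions to adjust to the real document):

    use strict;
    use warnings;
    use XML::Twig;

    my $parts = 10;
    my @out;
    for my $i (0 .. $parts - 1) {
        open $out[$i], '>', "orders_part$i.xml"
            or die "Cannot write orders_part$i.xml: $!";
        print { $out[$i] } qq{<?xml version="1.0" encoding="UTF-8"?>\n<orders>\n};
    }

    my $count = 0;
    XML::Twig->new(
        twig_handlers => {
            order => sub {
                my ($twig, $order) = @_;
                my $fh = $out[ $count++ % $parts ];   # distribute round-robin
                print {$fh} $order->sprint, "\n";
                $twig->purge;                         # keep memory use bounded
            },
        },
    )->parsefile('orders.xml');

    for my $fh (@out) {
        print {$fh} "</orders>\n";
        close $fh or die "Cannot close part file: $!";
    }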

    Also, you should probably adapt your code to support "use strict;"

    >
    > sub insert_orders {
    >     my ($filename, $from, $to) = @_;
    >
    >     my $xml = new IO::File;
    >     open ($xml, "< $filename");
    >
    >     if ($xml = set_handle ($xml, $from)) {
    >         while (defined ($_dat = <$xml>)) {
    >             $_temp = "\U$_dat\E";   # convert to capital letters
    >             $_temp =~ s/\s+//g;     # remove whitespace
    >
    >             if ($_temp eq '<ORDER>') {
    >                 $_mode  = 'order';
    >                 $_order = '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
    >             }


    If you start fractionally through an order, the code above "burns"
    lines until you get to the start of the first full order. Yet your
    set_handle code also burns the initial (potentially partial) line. It
    would be cleaner if the code to burn data were all in one place.


    >
    >             $_order .= $_dat if $_mode eq 'order';
    >
    >             if ($_temp eq '</ORDER>') {
    >                 # load $_order into the database ...


    This tries to load $_order into the database even when there is no
    order to load, i.e. when you started out in the middle of a previous order.
    $_order will be empty, but you try to load it anyway. Is that a problem?
    You want to load $_order only when
    $_temp eq '</ORDER>' and $_mode eq 'order'
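
    That is, something like:

    if ($_temp eq '</ORDER>' and $_mode eq 'order') {
        # load $_order into the database ...
    }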


    >
    >                 $_order = '';
    >                 $_mode  = '';
    >
    >                 last if ($to <= tell ($xml));


    This has the potential to lose orders. Let's say that $to is 1000, and
    an order starts exactly at position 1000. This job will not process
    that order, because $to <= 1000 is true. The next-higher job, whose
    $from is 1000, also will not process this order, as the "partial"
    first line it burned just happened to be a true full line, and that
    order therefore gets forgotten. (I've verified this does in fact
    happen in a test case.)

    last if ($to < tell($xml));

    (Or change the way you burn data, as suggested above, so it all happens
    in only one place.)

    ....

    >
    > sub set_handle {
    >     my ($handle, $pos) = @_;
    >
    >     seek($handle, $pos, SEEK_CUR);
    >
    >     if (defined (<$handle>))    # start at a new line


    You probably only want to burn a line when $pos > 0. When $pos == 0,
    you know the first line you read will be complete, so there is no
    reason to burn it. Generally the burned line at the start of each
    chunk will be processed in the "previous" chunk, but when $pos == 0
    there is no previous chunk. This will not actually be a problem unless
    the first line of your XML file is "<ORDER>", which does not seem
    likely for "real" XML.

    >         { return $handle; }
    >     else
    >         { return; }


    I don't understand the above. If $handle returned an undefined
    value this time you read from it, won't it do so next time as well?
    (I think the only time this isn't true is when $handle is an alias for
    ARGV). So why not just return $handle regardless?
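
    Putting those two points together, set_handle could be reduced to
    something like this sketch (it keeps the original SEEK_CUR, which acts
    like SEEK_SET here because the handle has just been opened):

    sub set_handle {
        my ($handle, $pos) = @_;

        seek($handle, $pos, SEEK_CUR);
        <$handle> if $pos > 0;    # burn the partial line, except for the first chunk
        return $handle;
    }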

    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    Usenet Newsgroup Service $9.95/Month 30GB
     
    , Jul 26, 2007
    #9
  10. J. Gleixner Guest

    wrote:
    > roy <> wrote:
    >> I receive an XML file of up to 1 GB, full of orders, every day. I have
    >> to split the orders and load them into a database for further
    >> processing. I share this job among multiple processes. This runs
    >> properly now.

    >
    > This seems dangerous to me. Generally XML should be parsed by an XML
    > parser, not by something that happens to parse some restricted subset of

    [...]
    > If parsing is not the bottleneck, then I'd just use something like
    > XML::Twig to read this one XML file and parcel it out to 10 XML files as a
    > preprocessing step. Then run each of those 10 files separately. Of
    > course, if parsing is the bottleneck, then this would defeat the purpose of
    > parallelizing.


    Also, see xml_split, which is one of the tools included with XML::Twig.

    http://search.cpan.org/~mirod/XML-Twig-3.29/tools/xml_split/xml_split
     
    J. Gleixner, Jul 27, 2007
    #10
  11. Guest

    wrote:
    > On Thu, 26 Jul 2007 12:50:58 -0000, roy <>
    > wrote:
    >
    > >I receive an XML file of up to 1 GB, full of orders, every day. I have
    > >to split the orders and load them into a database for further
    > >processing. I share this job among multiple processes. This runs
    > >properly now.
    > >
    > >Here a little impress from the code:
    > >
    > >------------------------------ 8< ------------------------------
    > >
    > >use Proc::Simple;
    > >use constant MAX_PROCESSES => 10;
    > >
    > >$filesize = -s "... file";
    > >$step = int($filesize/MAX_PROCESSES+1);
    > >
    > >for (my $i=0;$i<MAX_PROCESSES;$i++) {
    > > $procs[$i] = Proc::Simple->new();
    > > $procs[$i]->start(\&insert_orders, $filename, $i*$step, ($i+1)*
    > >$step);
    > >}
    > >
    > >...

    >
    > Hey, this is hardcoded stuff, multiples of 10.
    > Lucky for you the format is the same everytime.
    >
    > What do you do when the file is corrupt due to fragmentation
    > errors?
    >
    > You can't be serious. This is XML man. What your doing is worse
    > than reading a serial port stream.


    To quote someone from a different thread, who is probably a troll
    trying to resurrect himself under a new name:

    : Yeah, can you describe whats wrong with it, or you just blowing smoke?
    : Sln

    Well, are you?

    >
    > Sln


    Xho

    --
    -------------------- http://NewsReader.Com/ --------------------
    Usenet Newsgroup Service $9.95/Month 30GB
     
    , Aug 3, 2007
    #11