File Position

Discussion in 'Perl Misc' started by mud_saisem, Feb 18, 2010.

  1. mud_saisem

    mud_saisem Guest

    Hi There,

    Does anybody know how to read through a file searching for a word and
    printing the file position of that word ?

    Thanks.
    mud_saisem, Feb 18, 2010
    #1

  2. mud_saisem <> writes:

    > Does anybody know how to read through a file searching for a word and
    > printing the file position of that word ?


    If your file contains plain ascii, iso-8859, or another 8bit charset
    it should be easy. The tell() function gives you the current location
    in the file, pos() gives you the location of regexp match, and
    index() directly gives you the location.

    So this should work (untested though)

    my $offset = 0;
    while (<$fh>) {
        if (/word/) {
            say "Found 'word' at location ", $offset + pos();
        }
        $offset = tell $fh;
    }

    If your file contains a variable-width Unicode encoding (like UTF-8), it
    gets a lot harder.


    //Makholm
    Peter Makholm, Feb 18, 2010
    #2

  3. mud_saisem

    mud_saisem Guest

    On Feb 18, 4:22 pm, Peter Makholm <> wrote:
    > mud_saisem <> writes:
    > > Does anybody know how to read through a file searching for a word and
    > > printing the file position of that word ?

    >
    > If your file contains plain ascii, iso-8859, or another 8bit charset
    > it should be easy. The tell() function gives you the current location
    > in the file, pos() gives you the location of regexp match, and
    > index() directly gives you the location.
    >
    > So this should work (untested though)
    >
    >   my $offset = 0;
    >   while (<$fh>) {
    >       if (/word/) {
    >           say "Found 'word' at location ", $offset + pos();
    >       }
    >       $offset = tell $fh;
    >   }
    >
    > If you file contains a variable width uniode encoding (like utf-8) it
    > gets a lot harder.
    >
    > //Makholm


    Very nice, thanks for the help!
    mud_saisem, Feb 18, 2010
    #3
  4. mud_saisem <> wrote:
    >Does anybody know how to read through a file searching for a word and
    >printing the file position of that word ?


    Please define 'position': are you talking about characters or bytes?

    Just slurp the whole file into a string and then use index() to get the
    position of the desired word in that string.
    This is very straightforward and, unless you are dealing with
    exceptionally large files (GB size) or an unusual distribution of your
    'word' (almost always very early in the file), probably also faster than
    looping line by line or chunk by chunk.

    jue
    Jürgen Exner, Feb 18, 2010
    #4
  5. mud_saisem

    mud_saisem Guest

    On Feb 18, 4:56 pm, Jürgen Exner <> wrote:
    > mud_saisem <> wrote:
    > >Does anybody know how to read through a file searching for a word and
    > >printing the file position of that word ?

    >
    > Please define 'position': are you talking about characters or bytes?
    >
    > Just slurp the whole file into a string and then use index() to get the
    > position of the desired word in that string.
    > This is very straight-forward and unless you are dealing with
    > exceptionally large files (GB size) or unusual distribution of your
    > 'word' (almost always very early in the file) probably also faster than
    > any looping line by line or chunk by chunk.
    >
    > jue


    The log files that I will be scanning range from 500 MB to 5 GB,
    so reading the whole file into memory is not an option.

    What I meant by position was: if I am looking for a word like
    "slurp" (from your paragraph), it should tell me where in the file the
    word is, so that I can use the seek function and jump directly to the
    position in the file where the word "slurp" is.
    mud_saisem, Feb 18, 2010
    #5
  6. mud_saisem <> wrote:
    >On Feb 18, 4:56 pm, Jürgen Exner <> wrote:
    >> mud_saisem <> wrote:
    >> >Does anybody know how to read through a file searching for a word and
    >> >printing the file position of that word ?

    >>
    >> Please define 'position': are you talking about characters or bytes?

    [...]
    >What I meant about position was, if i am looking for a word like
    >"slurp" (from your paragraph), it should tell me where in the file the
    >word is,


    That is not any more specific than your first request. It could still
    be bytes or characters.

    >so that I can use the seek function


    Now, that is the critical clue. seek() is based on bytes, so you need a
    position in bytes in order to use seek().
    Position in characters would do you no good and therefore my suggestion
    with index() wouldn't do you any good, either, because it returns the
    position in characters. As does the suggestion from Peter Makholm. His
    regular expression search is character-based, too, therefore it will not
    return the byte-based position that you need for seek().
    That is unless your file is in a single-byte character set, of course,
    but you didn't say.
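
To make the difference concrete, here is a small sketch with made-up sample text and a hypothetical helper read_at. It searches the same UTF-8 text once as characters and once as bytes, then seeks to both positions.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The same text as UTF-8 bytes (what is on disk) and as characters.
my $bytes = encode('UTF-8', "na\x{EF}ve caf\x{E9}: slurp");
my $chars = decode('UTF-8', $bytes);

my $char_offset = index($chars, 'slurp');   # counts characters
my $byte_offset = index($bytes, 'slurp');   # counts bytes

# Read $len bytes starting at $offset from an in-memory "file".
sub read_at {
    my ($data_ref, $offset, $len) = @_;
    open my $fh, '<', $data_ref or die "Can't open memory file: $!";
    binmode $fh;
    seek($fh, $offset, 0) or die "seek failed: $!";
    read($fh, my $buf, $len);
    close $fh;
    return $buf;
}

print "character offset: $char_offset\n";
print "byte offset:      $byte_offset\n";
print "at byte offset:      '", read_at(\$bytes, $byte_offset, 5), "'\n";
print "at character offset: '", read_at(\$bytes, $char_offset, 5), "'\n";
```

The two accented letters take two bytes each in UTF-8, so the character offset is 12 and the byte offset is 14. Seeking to 12 lands two bytes early and reads ': slu' instead of 'slurp'.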

    jue
    Jürgen Exner, Feb 18, 2010
    #6
  7. >>>>> "Jürgen" == Jürgen Exner <> writes:

    Jürgen> Now, that is the critical clue. seek() is based on bytes, so you need a
    Jürgen> position in bytes in order to use seek().

    Historical fact: fseek(3) was originally based on ftell(3)-"cookies", where
    the stdio lib didn't promise to be able to return to any position that it
    hadn't originally handed you from a tell. As it turns out, those "cookies"
    were always byte positions on every operating system *I* saw stdio implemented
    on.

    print "Just another Perl hacker,";

    --
    Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
    <> <URL:http://www.stonehenge.com/merlyn/>
    Smalltalk/Perl/Unix consulting, Technical writing, Comedy, etc. etc.
    See http://methodsandmessages.vox.com/ for Smalltalk and Seaside discussion
    Randal L. Schwartz, Feb 18, 2010
    #7
  8. mud_saisem

    Guest

    On Thu, 18 Feb 2010 06:22:51 +0100, Peter Makholm <> wrote:

    >mud_saisem <> writes:
    >
    >> Does anybody know how to read through a file searching for a word and
    >> printing the file position of that word ?

    >
    >If your file contains plain ascii, iso-8859, or another 8bit charset
    >it should be easy. The tell() function gives you the current location
    >in the file, pos() gives you the location of regexp match, and
    >index() directly gives you the location.
    >
    >So this should work (untested though)
    >
    > my $offset = 0;
    > while (<$fh>) {
    > if (/word/) {
    > say "Found 'word' at location ", $offset + pos();
    > }
    > $offset = tell $fh;
    > }
    >
    >If you file contains a variable width uniode encoding (like utf-8) it
    >gets a lot harder.

    ^^^
    But probably not impossible.

    -sln

    ------------------------
    use strict;
    use warnings;
    use Encode;

    binmode(STDOUT, ':encoding(UTF-8)');

    my $word = "wo\x{2100}rd";
    my $octet_search = encode('UTF-8', $word);
    my @FileLocations = ();

    my $filedata = encode ('UTF-8', "
    This $word \x{2100} is a $word puzzle
    It is not in this line,
    but $word is in this one.
    End.
    ");

    open my $fh, '<', \$filedata or die "can't open memory file: $!";

    my $linelength = 0;
    print "\n";

    while (<$fh>)
    {
        my $octet_dataline = $_;
        while ( /($octet_search)/g )
        {
            my ($byte_offset, $byte_len) = (
                $linelength + pos() - length($octet_search),
                length $1
            );
            print "Found $word at $byte_offset\n";
            print "Byte length is $byte_len, byte string is '$1'\n";
            push @FileLocations, $byte_offset, $byte_len;
        }
        $linelength += length ($octet_dataline);
    }
    close $fh;

    # To reconstitute,
    # seek to the offsets, and read length bytes
    #
    print "\nFile offset/length's:\n";
    while (my ($offset,$len) = splice(@FileLocations, 0,2)) {
        print "$offset, $len\n";
    }

    __END__


    Found woGäÇrd at 7
    Byte length is 7, byte string is 'wo+ó-ä-Çrd'
    Found woGäÇrd at 24
    Byte length is 7, byte string is 'wo+ó-ä-Çrd'
    Found woGäÇrd at 69
    Byte length is 7, byte string is 'wo+ó-ä-Çrd'

    File offset/length's:
    7, 7
    24, 7
    69, 7
    , Feb 18, 2010
    #8
  9. mud_saisem

    Ted Zlatanov Guest

    On Wed, 17 Feb 2010 20:21:19 -0800 (PST) mud_saisem <> wrote:

    ms> Does anybody know how to read through a file searching for a word and
    ms> printing the file position of that word ?

    Besides the great Perl solutions posted here, you may want to consider
    `grep -b' which will print the byte offset of each match, depending on
    your needs of course.

    Ted
    Ted Zlatanov, Feb 18, 2010
    #9
  10. mud_saisem

    Guest

    On Thu, 18 Feb 2010 08:13:22 -0800, wrote:

    >On Thu, 18 Feb 2010 06:22:51 +0100, Peter Makholm <> wrote:
    >
    >>mud_saisem <> writes:
    >>
    >>> Does anybody know how to read through a file searching for a word and
    >>> printing the file position of that word ?

    >>
    >>If your file contains plain ascii, iso-8859, or another 8bit charset
    >>it should be easy. The tell() function gives you the current location
    >>in the file, pos() gives you the location of regexp match, and
    >>index() directly gives you the location.
    >>
    >>So this should work (untested though)
    >>
    >> my $offset = 0;
    >> while (<$fh>) {
    >> if (/word/) {
    >> say "Found 'word' at location ", $offset + pos();
    >> }
    >> $offset = tell $fh;
    >> }
    >>
    >>If you file contains a variable width uniode encoding (like utf-8) it
    >>gets a lot harder.

    > ^^^
    >But probably not impossible.
    >
    >-sln
    >

    I guess I'll keep this around as a curiosity,
    not knowing the particulars of how/if Perl auto-promotes
    byte strings to utf8 in the regex process.

    If I try it out on different encodings, it seems to work.
    The only problem is with a BOM (byte order mark), as this would
    require adjusting the offset because of the BOM/seek bug.

    Depending on the OS, some byte orders won't map correctly to UTF-8.
    For this reason I left out the 16/32 LE variants, because the script
    prints to STDOUT, which is binmode'd to UTF-8. Otherwise, all the
    byte orders work as far as getting offsets goes.

    Same real estate, different code.
    Btw, this may be a much faster way to run regexes on Unicode data.
    Running regular expressions over a very large file opened in UTF-8
    mode slows the regex engine down significantly (by several orders
    of magnitude).

    -sln
    --------------------
    # Rx_Bytes_Unicode_misc1.pl
    # -sln, 2/10
    use strict;
    use warnings;
    use Encode;

    binmode(STDOUT, ':encoding(UTF-8)');

    ## Try some encodings
    #
    for my $UTF ('ascii', 'UTF-8', 'UTF-16BE', 'UTF-32BE')
    {
        ## Create pattern in encoded bytes
        #
        my $word = "wo\x{2100}rd";
        my $octet_pattern = encode($UTF, $word."|End|one");

        print "\n",'-'x20,"\nEncoding: $UTF\nPattern: '$octet_pattern'\n";

        ## Create file data in encoded bytes
        #
        my $filedata = encode ($UTF,
            "This $word \x{2100} is a $word puzzle
    It is not in this line,
    but $word is in this one.
    The End."
        );

        ## Open a memory buffer in byte mode
        #
        open my $fh, '<', \$filedata
            or die "Can't open memory buffer for read: $!";
        print "\n";

        ## Process file data
        #
        my @FileLocations = ();
        my ($filepos, $line_count, $byte_offset, $byte_len) = (0,0);

        while (<$fh>)
        {
            ++$line_count;
            while ( /($octet_pattern)/g )
            {
                $byte_len = length $1;
                $byte_offset = $filepos + pos() - $byte_len;

                print "(line $line_count) Found '",decode($UTF,$1),
                      "' (fpos= $byte_offset), byte string ",
                      "(len= $byte_len) is '$1'\n";
                # save offset/length of matched item
                push @FileLocations, $byte_offset, $byte_len;
            }
            # $filepos += length;
            # or ->
            $filepos = tell ($fh);
        }

        ## Reconstitute file data.
        ## Seek to offsets, read length bytes
        #
        if ( @FileLocations ) {
            print "\nFile offset/length:\n";
            my $buf = '';
            while (my ($offset,$len) = splice(@FileLocations, 0,2)) {
                seek ($fh, $offset, 0);
                read ($fh, $buf, $len);
                print "$offset, $len, ",
                      "$UTF: '$buf', UTF-8 string: '",
                      decode($UTF, $buf), "'\n";
            }
        }
        close $fh;
    }
    __END__
    --------------------
    Encoding: ascii
    Pattern: 'wo?rd|End|one'

    (line 3) Found 'one' (fpos= 96), byte string (len= 3) is 'one'
    (line 4) Found 'End' (fpos= 115), byte string (len= 3) is 'End'

    File offset/length:
    96, 3, ascii: 'one', UTF-8 string: 'one'
    115, 3, ascii: 'End', UTF-8 string: 'End'

    --------------------
    Encoding: UTF-8
    Pattern: 'wo+ó-ä-Çrd|End|one'

    (line 1) Found 'woGäÇrd' (fpos= 5), byte string (len= 7) is 'wo+ó-ä-Çrd'
    (line 1) Found 'woGäÇrd' (fpos= 22), byte string (len= 7) is 'wo+ó-ä-Çrd'
    (line 3) Found 'woGäÇrd' (fpos= 85), byte string (len= 7) is 'wo+ó-ä-Çrd'
    (line 3) Found 'one' (fpos= 104), byte string (len= 3) is 'one'
    (line 4) Found 'End' (fpos= 123), byte string (len= 3) is 'End'

    File offset/length:
    5, 7, UTF-8: 'wo+ó-ä-Çrd', UTF-8 string: 'woGäÇrd'
    22, 7, UTF-8: 'wo+ó-ä-Çrd', UTF-8 string: 'woGäÇrd'
    85, 7, UTF-8: 'wo+ó-ä-Çrd', UTF-8 string: 'woGäÇrd'
    104, 3, UTF-8: 'one', UTF-8 string: 'one'
    123, 3, UTF-8: 'End', UTF-8 string: 'End'

    --------------------
    Encoding: UTF-16BE
    Pattern: ' w o! r d | E n d | o n e'

    (line 1) Found 'woGäÇrd' (fpos= 10), byte string (len= 11) is ' w o! r d '
    (line 1) Found 'woGäÇrd' (fpos= 36), byte string (len= 11) is ' w o! r d '
    (line 3) Found 'woGäÇrd' (fpos= 158), byte string (len= 11) is ' w o! r d '
    (line 3) Found 'one' (fpos= 192), byte string (len= 6) is ' o n e'
    (line 4) Found 'End' (fpos= 230), byte string (len= 7) is ' E n d '

    File offset/length:
    10, 11, UTF-16BE: ' w o! r d ', UTF-8 string: 'woGäÇrd'
    36, 11, UTF-16BE: ' w o! r d ', UTF-8 string: 'woGäÇrd'
    158, 11, UTF-16BE: ' w o! r d ', UTF-8 string: 'woGäÇrd'
    192, 6, UTF-16BE: ' o n e', UTF-8 string: 'one'
    230, 7, UTF-16BE: ' E n d ', UTF-8 string: 'End'

    --------------------
    Encoding: UTF-32BE
    Pattern: ' w o ! r d | E n d | o n e'

    (line 1) Found 'woGäÇrd' (fpos= 20), byte string (len= 23) is ' w o ! r d '
    (line 1) Found 'woGäÇrd' (fpos= 72), byte string (len= 23) is ' w o ! r d '
    (line 3) Found 'woGäÇrd' (fpos= 316), byte string (len= 23) is ' w o ! r d '
    (line 3) Found 'one' (fpos= 384), byte string (len= 12) is ' o n e'
    (line 4) Found 'End' (fpos= 460), byte string (len= 15) is ' E n d '

    File offset/length:
    20, 23, UTF-32BE: ' w o ! r d ', UTF-8 string: 'woGäÇrd'
    72, 23, UTF-32BE: ' w o ! r d ', UTF-8 string: 'woGäÇrd'
    316, 23, UTF-32BE: ' w o ! r d ', UTF-8 string: 'woGäÇrd'
    384, 12, UTF-32BE: ' o n e', UTF-8 string: 'one'
    460, 15, UTF-32BE: ' E n d ', UTF-8 string: 'End'
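
A note on the odd lengths (11, 7 and 6 for UTF-16BE) in the output above. Encoding the whole string "word|End|one" also encodes each '|', so the alternatives are cut in the middle of a code unit and pick up stray NUL bytes. The sketch below is one possible fix, not the poster's code. The names octet_pattern and find_matches are made up. It encodes each alternative on its own and escapes it with quotemeta, so that bytes which happen to look like regex metacharacters are matched literally.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Build a byte-level regex from character-string alternatives.
# Each word is encoded separately and escaped, so the '|' separators
# stay plain regex syntax and never become part of the encoded data.
sub octet_pattern {
    my ($encoding, @words) = @_;
    my $alternation = join '|',
        map { quotemeta(encode($encoding, $_)) } @words;
    return qr/$alternation/;
}

# Return [offset, length] pairs for every match in a byte string.
sub find_matches {
    my ($data, $re) = @_;
    my @hits;
    while ($data =~ /($re)/g) {
        push @hits, [ $-[1], length $1 ];   # $-[1] is the start of group 1
    }
    return @hits;
}

my $data = encode('UTF-16BE', "This wo\x{2100}rd is the End");
my $re   = octet_pattern('UTF-16BE', "wo\x{2100}rd", 'End', 'one');
printf "offset %d, length %d\n", @$_ for find_matches($data, $re);
```

With this pattern the sample reports offset 10, length 10 and offset 36, length 6, which are whole 16-bit code units. A byte-level match in UTF-16 or UTF-32 can in principle still start in the middle of a code unit, so checking that the offset is a multiple of the unit size is a sensible extra guard.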
    , Feb 18, 2010
    #10
  11. >>>>> "Ben" == Ben Morrow <> writes:

    Ben> IIRC Win32's stdio in 'text' mode (the default) uses this mechanism to
    Ben> get around the CRLF->LF translation.

    Nice to know. I guess I'm lucky in that I've never had to use Windows
    except in internet cafes, where the first step is "download putty"
    so I can ssh to a real box.

    Randal L. Schwartz, Feb 19, 2010
    #11
  12. mud_saisem

    C.DeRykus Guest

    On Feb 17, 9:22 pm, Peter Makholm <> wrote:
    > mud_saisem <> writes:
    > > Does anybody know how to read through a file searching for a word and
    > > printing the file position of that word ?

    >
    > If your file contains plain ascii, iso-8859, or another 8bit charset
    > it should be easy. The tell() function gives you the current location
    > in the file, pos() gives you the location of regexp match, and
    > index() directly gives you the location.
    >
    > So this should work (untested though)
    >
    >   my $offset = 0;
    >   while (<$fh>) {
    >       if (/word/) {

    ^^^^^^^^^^^^

    if ( /word/g ) {


    Maybe the OP assumed it was correct because of
    the 'tell' addition.



    --
    Charles DeRykus
    C.DeRykus, Feb 19, 2010
    #12
  13. mud_saisem

    C.DeRykus Guest

    On Feb 18, 4:55 pm, "C.DeRykus" <> wrote:
    > On Feb 17, 9:22 pm, Peter Makholm <> wrote:> mud_saisem <> writes:
    > > > Does anybody know how to read through a file searching for a word and
    > > > printing the file position of that word ?

    >
    > > If your file contains plain ascii, iso-8859, or another 8bit charset
    > > it should be easy. The tell() function gives you the current location
    > > in the file, pos() gives you the location of regexp match, and
    > > index() directly gives you the location.

    >
    > > So this should work (untested though)

    >
    > >   my $offset = 0;
    > >   while (<$fh>) {
    > >       if (/word/) {

    >
    >         ^^^^^^^^^^^^
    >
    >         if ( /word/g ) {
    >
    > Maybe the OP assumed it was correct because of
    > the 'tell' addition.
    >


    You may need to loop to pick up multiple hits
    per line too if that was the goal.
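
Putting both corrections together, a hedged sketch for single-byte or binary data (the sub name find_offsets_by_line is made up). pos() is only set after a /g match and reports where the match ended. The sketch uses $-[0], the start of the last match, and loops so that several hits on one line are all reported.

```perl
use strict;
use warnings;

# Byte offsets of every match of $re, reading $fh line by line.
sub find_offsets_by_line {
    my ($fh, $re) = @_;
    binmode $fh;                       # work in bytes so offsets suit seek()
    my @offsets;
    my $line_start = tell $fh;         # file offset where the current line begins
    while (my $line = <$fh>) {
        while ($line =~ /$re/g) {      # /g in a loop: every hit on the line
            push @offsets, $line_start + $-[0];
        }
        $line_start = tell $fh;
    }
    return @offsets;
}

# Hypothetical sample data in an in-memory file.
my $log = "a word, another word\nnothing\nlast word\n";
open my $fh, '<', \$log or die "Can't open memory file: $!";
print join(', ', find_offsets_by_line($fh, qr/word/)), "\n";
close $fh;
```

For the sample this prints 2, 16, 34. Because only one line is held in memory at a time, this also suits very large files, though a match that spans a line break will not be found.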

    --
    Charles DeRykus
    C.DeRykus, Feb 19, 2010
    #13
  14. On 2010-02-18 08:08, Jürgen Exner <> wrote:
    > mud_saisem <> wrote:
    >>so that I can use the seek function

    >
    > Now, that is the critical clue. seek() is based on bytes, so you need a
    > position in bytes in order to use seek().
    > Position in characters would do you no good and therefore my suggestion
    > with index() wouldn't do you any good, either, because it returns the
    > position in characters.


    Only if you use index() on character strings - if you use index on byte
    strings it returns a byte position. So just read the file in binary,
    convert your search string to the same encoding and invoke index().

    Caveat: Some encodings are ambiguous: The same character sequence may be
    represented by different byte sequences. For those encodings, index
    won't work.
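
For files too large to slurp, the same idea can be applied chunk by chunk. This is only a sketch under assumptions. The sub name find_offsets_chunked, the default chunk size and the sample data are all made up, and the caveat about ambiguous encodings still applies. The last length-minus-one bytes of each chunk are carried over so that a hit straddling a chunk boundary is not missed.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Byte offsets of $word (a character string) in a file read in binary
# mode, where the file is assumed to be stored in $encoding.
sub find_offsets_chunked {
    my ($fh, $word, $encoding, $chunk_size) = @_;
    $chunk_size ||= 1 << 20;                 # 1 MiB, an arbitrary default
    my $needle = encode($encoding, $word);   # search string as bytes
    my $nlen   = length $needle;
    die "empty search string\n" unless $nlen;
    binmode $fh;

    my @offsets;
    my $buf       = '';
    my $buf_start = tell $fh;                # file offset of $buf's first byte
    while (1) {
        my $n = read($fh, $buf, $chunk_size, length $buf);   # append a chunk
        die "read failed: $!" unless defined $n;
        last if $n == 0;                     # end of file

        my $pos = 0;
        while (($pos = index($buf, $needle, $pos)) != -1) {
            push @offsets, $buf_start + $pos;
            $pos++;
        }

        # Keep only the tail that could still begin a boundary-spanning hit.
        my $keep = $nlen - 1;
        $keep = length $buf if $keep > length $buf;
        my $drop = length($buf) - $keep;
        $buf_start += $drop;
        $buf = substr($buf, $drop);
    }
    return @offsets;
}

# Tiny chunk size on purpose, to exercise the boundary handling.
my $file_bytes = encode('UTF-8', "caf\x{E9} slurp " x 3);
open my $in, '<', \$file_bytes or die "Can't open memory file: $!";
print join(', ', find_offsets_chunked($in, "caf\x{E9}", 'UTF-8', 4)), "\n";
close $in;
```

The sample prints 0, 12, 24. Each offset can be passed straight to seek().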

    hp
    Peter J. Holzer, Feb 19, 2010
    #14