File Position

mud_saisem · Feb 18, 2010

Hi There,

Does anybody know how to read through a file, search for a word, and
print the file position of that word?

Thanks.

Peter Makholm · Feb 18, 2010

mud_saisem said:
Does anybody know how to read through a file, search for a word, and
print the file position of that word?

If your file contains plain ASCII, ISO-8859, or another 8-bit charset
it should be easy. The tell() function gives you the current location
in the file, pos() gives you the location of a regexp match, and
index() directly gives you the location.

So this should work (untested though)

my $offset = 0;
while (<$fh>) {
    if (/word/) {
        say "Found 'word' at location ", $offset + pos();
    }
    $offset = tell $fh;
}

If your file contains a variable-width Unicode encoding (like UTF-8) it
gets a lot harder.

//Makholm

mud_saisem · Feb 18, 2010

Very nice, thanks for the help!

Jürgen Exner · Feb 18, 2010

mud_saisem said:
Does anybody know how to read through a file, search for a word, and
print the file position of that word?

Please define 'position': are you talking about characters or bytes?

Just slurp the whole file into a string and then use index() to get the
position of the desired word in that string.
This is very straight-forward and unless you are dealing with
exceptionally large files (GB size) or unusual distribution of your
'word' (almost always very early in the file) probably also faster than
any looping line by line or chunk by chunk.

jue
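
A minimal sketch of the slurp-and-index approach described above, for files small enough to hold in memory. The helper name all_positions and the sample text are assumptions made for illustration; the data is read from an in-memory handle so the example runs on its own.

```perl
use strict;
use warnings;

# Return every offset at which $word occurs in $text.
# (all_positions is a hypothetical helper, not part of any module.)
sub all_positions {
    my ($text, $word) = @_;
    my @positions;
    my $pos = 0;
    while (($pos = index($text, $word, $pos)) != -1) {
        push @positions, $pos;
        $pos += length $word;    # continue after this hit (non-overlapping)
    }
    return @positions;
}

# Slurp from an in-memory "file" so the sketch is self-contained.
my $data = "Just slurp the file, then slurp again.\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my $content = do { local $/; <$fh> };    # undefined $/ reads everything at once
close $fh;

print "Found 'slurp' at $_\n" for all_positions($content, 'slurp');
# prints:
# Found 'slurp' at 5
# Found 'slurp' at 26
```

Whether these numbers are byte or character offsets depends on how the handle was opened: read as raw bytes they are byte offsets, read through a decoding layer they are character offsets.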

mud_saisem · Feb 18, 2010

The log files that I will be scanning range from 500 MB to 5 GB, so
reading the whole file into memory is not an option.

What I meant by position was: if I am looking for a word like
"slurp" (from your paragraph), it should tell me where in the file the
word is, so that I can use the seek function and jump directly to the
position in the file where the word "slurp" is.

Jürgen Exner · Feb 18, 2010

mud_saisem said:
Please define 'position': are you talking about characters or bytes?

[...]
What I meant by position was: if I am looking for a word like
"slurp" (from your paragraph), it should tell me where in the file the
word is,

That is not any more specific than your first request. It could still
be bytes or characters.

so that I can use the seek function

Now, that is the critical clue. seek() is based on bytes, so you need a
position in bytes in order to use seek().
Position in characters would do you no good and therefore my suggestion
with index() wouldn't do you any good, either, because it returns the
position in characters. As does the suggestion from Peter Makholm. His
regular expression search is character-based, too, therefore it will not
return the byte-based position that you need for seek().
That is unless your file is in a single-byte character set, of course,
but you didn't say.

jue
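
To make the byte-versus-character distinction concrete, here is a small sketch. The sample text "café slurp" is an assumption chosen for illustration: it puts one two-byte UTF-8 character in front of the word, so the two offsets differ by one.

```perl
use strict;
use warnings;
use Encode qw(encode);

# A character string with one non-ASCII character before the word.
my $chars = "caf\x{e9} slurp";          # "café slurp" as characters
my $bytes = encode('UTF-8', $chars);    # the same text as it would sit on disk

my $char_pos = index($chars, 'slurp');                    # counts characters
my $byte_pos = index($bytes, encode('UTF-8', 'slurp'));   # counts bytes

print "character position: $char_pos\n";   # 5
print "byte position:      $byte_pos\n";   # 6, because "é" is 2 bytes in UTF-8
```

Only the second number is usable as an argument to seek() on the UTF-8 file.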

Randal L. Schwartz · Feb 18, 2010

Jürgen> Now, that is the critical clue. seek() is based on bytes, so you need a
Jürgen> position in bytes in order to use seek().

Historical fact: fseek(3) was originally based on ftell(3)-"cookies", where
the stdio lib didn't promise to be able to return to any position that it
hadn't originally handed you from a tell. As it turns out, those "cookies"
were always byte positions on every operating system *I* saw stdio implemented
on.

print "Just another Perl hacker,";

sln · Feb 18, 2010

If your file contains a variable-width Unicode encoding (like UTF-8) it
gets a lot harder.

^^^
But probably not impossible.

-sln

------------------------
use strict;
use warnings;
use Encode;

binmode(STDOUT, ':encoding(UTF-8)');

my $word = "wo\x{2100}rd";
my $octet_search = encode('UTF-8', $word);
my @FileLocations = ();

my $filedata = encode ('UTF-8', "
This $word \x{2100} is a $word puzzle
It is not in this line,
but $word is in this one.
End.
");

open my $fh, '<', \$filedata or die "can't open memory file: $!";

my $linelength = 0;
print "\n";

while (<$fh>)
{
    my $octet_dataline = $_;
    while ( /($octet_search)/g )
    {
        my ($byte_offset, $byte_len) = (
            $linelength + pos() - length($octet_search),
            length $1
        );
        print "Found $word at $byte_offset\n";
        print "Byte length is $byte_len, byte string is '$1'\n";
        push @FileLocations, $byte_offset, $byte_len;
    }
    $linelength += length ($octet_dataline);
}
close $fh;

# To reconstitute,
# seek to the offsets, and read length bytes
#
print "\nFile offset/length's:\n";
while (my ($offset,$len) = splice(@FileLocations, 0,2)) {
    print "$offset, $len\n";
}

__END__

Found woGäÇrd at 7
Byte length is 7, byte string is 'wo+ó-ä-Çrd'
Found woGäÇrd at 24
Byte length is 7, byte string is 'wo+ó-ä-Çrd'
Found woGäÇrd at 69
Byte length is 7, byte string is 'wo+ó-ä-Çrd'

File offset/length's:
7, 7
24, 7
69, 7

Ted Zlatanov · Feb 18, 2010

ms> Does anybody know how to read through a file, search for a word, and
ms> print the file position of that word?

Besides the great Perl solutions posted here, you may want to consider
`grep -b' which will print the byte offset of each match, depending on
your needs of course.

Ted

sln · Feb 18, 2010

^^^
But probably not impossible.

-sln

I guess I'll keep this around as a curiosity,
not knowing the particulars of how, or whether, Perl auto-promotes
byte strings to UTF-8 during regex matching.

If I try it out on different encodings, it seems to work.
The only problem is a BOM (byte order mark), which would
require adjusting the offset because of the bom/seek bug.

Depending on the OS, one of the byte orders won't map correctly to UTF-8.
For this reason I left out UTF-16LE and UTF-32LE, because the script
prints to STDOUT, which is set to UTF-8 with binmode. Otherwise, both
byte orders work as far as getting offsets.

Same real estate, different code.
By the way, this may be a much faster way to run a regex on
Unicode data. Matching regular expressions against a very large file
opened in UTF-8 mode significantly slows down the regex engine
(by several orders of magnitude).

-sln
--------------------
# Rx_Bytes_Unicode_misc1.pl
# -sln, 2/10
use strict;
use warnings;
use Encode;

binmode(STDOUT, ':encoding(UTF-8)');

## Try some encodings
#
for my $UTF ('ascii', 'UTF-8', 'UTF-16BE', 'UTF-32BE')
{
    ## Create pattern in encoded bytes
    #
    my $word = "wo\x{2100}rd";
    my $octet_pattern = encode($UTF, $word."|End|one");

    print "\n",'-'x20,"\nEncoding: $UTF\nPattern: '$octet_pattern'\n";

    ## Create file data in encoded bytes
    #
    my $filedata = encode ($UTF,
        "This $word \x{2100} is a $word puzzle
It is not in this line,
but $word is in this one.
The End."
    );

    ## Open a memory buffer in byte mode
    #
    open my $fh, '<', \$filedata
        or die "Can't open memory buffer for read: $!";
    print "\n";

    ## Process file data
    #
    my @FileLocations = ();
    my ($filepos, $line_count, $byte_offset, $byte_len) = (0,0);

    while (<$fh>)
    {
        ++$line_count;
        while ( /($octet_pattern)/g )
        {
            $byte_len = length $1;
            $byte_offset = $filepos + pos() - $byte_len;

            print "(line $line_count) Found '",decode($UTF,$1),
                "' (fpos= $byte_offset), byte string ",
                "(len= $byte_len) is '$1'\n";
            # save offset/length of matched item
            push @FileLocations, $byte_offset, $byte_len;
        }
        # $filepos += length;
        # or ->
        $filepos = tell ($fh);
    }

    ## Reconstitute file data.
    ## Seek to offsets, read length bytes
    #
    if ( @FileLocations ) {
        print "\nFile offset/length:\n";
        my $buf = '';
        while (my ($offset,$len) = splice(@FileLocations, 0,2)) {
            seek ($fh, $offset, 0);
            read ($fh, $buf, $len);
            print "$offset, $len, ",
                "$UTF: '$buf', UTF-8 string: '",
                decode($UTF, $buf), "'\n";
        }
    }
    close $fh;
}
__END__
--------------------
Encoding: ascii
Pattern: 'wo?rd|End|one'

(line 3) Found 'one' (fpos= 96), byte string (len= 3) is 'one'
(line 4) Found 'End' (fpos= 115), byte string (len= 3) is 'End'

File offset/length:
96, 3, ascii: 'one', UTF-8 string: 'one'
115, 3, ascii: 'End', UTF-8 string: 'End'

--------------------
Encoding: UTF-8
Pattern: 'wo+ó-ä-Çrd|End|one'

(line 1) Found 'woGäÇrd' (fpos= 5), byte string (len= 7) is 'wo+ó-ä-Çrd'
(line 1) Found 'woGäÇrd' (fpos= 22), byte string (len= 7) is 'wo+ó-ä-Çrd'
(line 3) Found 'woGäÇrd' (fpos= 85), byte string (len= 7) is 'wo+ó-ä-Çrd'
(line 3) Found 'one' (fpos= 104), byte string (len= 3) is 'one'
(line 4) Found 'End' (fpos= 123), byte string (len= 3) is 'End'

File offset/length:
5, 7, UTF-8: 'wo+ó-ä-Çrd', UTF-8 string: 'woGäÇrd'
22, 7, UTF-8: 'wo+ó-ä-Çrd', UTF-8 string: 'woGäÇrd'
85, 7, UTF-8: 'wo+ó-ä-Çrd', UTF-8 string: 'woGäÇrd'
104, 3, UTF-8: 'one', UTF-8 string: 'one'
123, 3, UTF-8: 'End', UTF-8 string: 'End'

--------------------
Encoding: UTF-16BE
Pattern: ' w o! r d | E n d | o n e'

(line 1) Found 'woGäÇrd' (fpos= 10), byte string (len= 11) is ' w o! r d '
(line 1) Found 'woGäÇrd' (fpos= 36), byte string (len= 11) is ' w o! r d '
(line 3) Found 'woGäÇrd' (fpos= 158), byte string (len= 11) is ' w o! r d '
(line 3) Found 'one' (fpos= 192), byte string (len= 6) is ' o n e'
(line 4) Found 'End' (fpos= 230), byte string (len= 7) is ' E n d '

File offset/length:
10, 11, UTF-16BE: ' w o! r d ', UTF-8 string: 'woGäÇrd'
36, 11, UTF-16BE: ' w o! r d ', UTF-8 string: 'woGäÇrd'
158, 11, UTF-16BE: ' w o! r d ', UTF-8 string: 'woGäÇrd'
192, 6, UTF-16BE: ' o n e', UTF-8 string: 'one'
230, 7, UTF-16BE: ' E n d ', UTF-8 string: 'End'

--------------------
Encoding: UTF-32BE
Pattern: ' w o ! r d | E n d | o n e'

(line 1) Found 'woGäÇrd' (fpos= 20), byte string (len= 23) is ' w o ! r d '
(line 1) Found 'woGäÇrd' (fpos= 72), byte string (len= 23) is ' w o ! r d '
(line 3) Found 'woGäÇrd' (fpos= 316), byte string (len= 23) is ' w o ! r d '
(line 3) Found 'one' (fpos= 384), byte string (len= 12) is ' o n e'
(line 4) Found 'End' (fpos= 460), byte string (len= 15) is ' E n d '

File offset/length:
20, 23, UTF-32BE: ' w o ! r d ', UTF-8 string: 'woGäÇrd'
72, 23, UTF-32BE: ' w o ! r d ', UTF-8 string: 'woGäÇrd'
316, 23, UTF-32BE: ' w o ! r d ', UTF-8 string: 'woGäÇrd'
384, 12, UTF-32BE: ' o n e', UTF-8 string: 'one'
460, 15, UTF-32BE: ' E n d ', UTF-8 string: 'End'

Randal L. Schwartz · Feb 19, 2010

Ben> IIRC Win32's stdio in 'text' mode (the default) uses this mechanism to
Ben> get around the CRLF->LF translation.

Nice to know. I guess I'm lucky in that I've never had to use Windows
except in internet cafes, where the first step is "download putty"
so I can ssh to a real box.

C.DeRykus · Feb 19, 2010

If your file contains plain ASCII, ISO-8859, or another 8-bit charset
it should be easy. The tell() function gives you the current location
in the file, pos() gives you the location of a regexp match, and
index() directly gives you the location.

So this should work (untested though)

my $offset = 0;
while (<$fh>) {
if (/word/) {

^^^^^^^^^^^^

if ( /word/g ) {

Maybe the OP assumed it was correct because of
the 'tell' addition.

C.DeRykus · Feb 19, 2010

^^^^^^^^^^^^

if ( /word/g ) {

Maybe the OP assumed it was correct because of
the 'tell' addition.

You may need to loop to pick up multiple hits
per line too if that was the goal.
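
Putting the two corrections together, a line-by-line version might look like the sketch below. It assumes a single-byte encoding (or a handle read as raw bytes), as in the original suggestion; the helper name match_offsets and the sample data are made up for illustration. Note that pos() reports where the match ended, so the sketch uses $-[0], which holds the offset where the match started.

```perl
use strict;
use warnings;

# Collect the file offset of every match of $re in the lines read from $fh.
# (match_offsets is a hypothetical helper name.)
sub match_offsets {
    my ($fh, $re) = @_;
    my @offsets;
    my $line_start = tell $fh;       # offset of the line about to be read
    while (my $line = <$fh>) {
        while ($line =~ /$re/g) {    # /g in a loop: every hit on the line
            push @offsets, $line_start + $-[0];   # $-[0] is the match start
        }
        $line_start = tell $fh;
    }
    return @offsets;
}

my $data = "a word and another word\nno match here\nlast word\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
print "Found 'word' at $_\n" for match_offsets($fh, qr/word/);
close $fh;
# prints:
# Found 'word' at 2
# Found 'word' at 19
# Found 'word' at 43
```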

Peter J. Holzer · Feb 19, 2010

Now, that is the critical clue. seek() is based on bytes, so you need a
position in bytes in order to use seek().
Position in characters would do you no good and therefore my suggestion
with index() wouldn't do you any good, either, because it returns the
position in characters.

Only if you use index() on character strings - if you use index on byte
strings it returns a byte position. So just read the file in binary,
convert your search string to the same encoding and invoke index().

Caveat: Some encodings are ambiguous: The same character sequence may be
represented by different byte sequences. For those encodings, index
won't work.

hp
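
For multi-gigabyte logs, the byte-string index() idea can be combined with chunked reads so the whole file never has to sit in memory. The following is only a sketch under stated assumptions: the helper name byte_offsets, the default chunk size, and the sample data are invented for illustration, and the file is assumed to use an encoding in which the search word has exactly one byte representation (see the caveat above). A tail of length(needle) - 1 bytes is carried over between chunks so that a hit spanning a chunk boundary is still found.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Scan $fh in fixed-size chunks and return the byte offset of every
# occurrence of the byte string $needle. (Hypothetical helper.)
sub byte_offsets {
    my ($fh, $needle, $chunk_size) = @_;
    die "empty search string" unless length $needle;
    $chunk_size ||= 1 << 20;                 # 1 MiB default, an arbitrary choice
    my $keep = length($needle) - 1;          # tail carried into the next chunk
    my ($buf, $buf_start, @offsets) = ('', 0);
    while (read($fh, my $chunk, $chunk_size)) {
        $buf .= $chunk;
        my $pos = 0;
        while (($pos = index($buf, $needle, $pos)) != -1) {
            push @offsets, $buf_start + $pos;
            $pos++;
        }
        # Drop everything except the last $keep bytes before the next read.
        my $drop = length($buf) - $keep;
        if ($drop > 0) {
            substr($buf, 0, $drop, '');
            $buf_start += $drop;
        }
    }
    return @offsets;
}

my $word   = "wo\x{2100}rd";
my $needle = encode('UTF-8', $word);         # search string as bytes
my $data   = encode('UTF-8', "This $word is a $word puzzle\n");

open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
binmode $fh;                                 # read raw bytes
my @hits = byte_offsets($fh, $needle, 8);    # tiny chunks to exercise the overlap

# Check each offset by seeking to it and reading the bytes back.
for my $offset (@hits) {
    seek($fh, $offset, 0) or die "seek failed: $!";
    read($fh, my $got, length $needle);
    print "offset $offset: ", ($got eq $needle ? "ok" : "mismatch"), "\n";
}
close $fh;
# prints:
# offset 5: ok
# offset 18: ok
```

On a real log file the in-memory handle would be replaced by an ordinary open of the file followed by binmode.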
