File Position


mud_saisem

Hi There,

Does anybody know how to read through a file searching for a word and
print the file position of that word?

Thanks.
 

Peter Makholm

mud_saisem said:
Does anybody know how to read through a file searching for a word and
printing the file position of that word ?

If your file contains plain ASCII, ISO-8859, or another 8-bit charset,
it should be easy. The tell() function gives you the current location
in the file, pos() gives you the location of a regexp match, and
index() directly gives you the location.

So this should work (untested, though):

  my $offset = 0;
  while (<$fh>) {
      if (/word/) {
          say "Found 'word' at location ", $offset + pos();
      }
      $offset = tell $fh;
  }

If your file contains a variable-width Unicode encoding (like UTF-8), it
gets a lot harder.


//Makholm
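
A minimal sketch of the line-by-line idea above, using index() rather than pos() so that the start of the match is reported. It assumes the handle is read as raw bytes and that the word is given as bytes in the file's encoding; the helper name first_offset and the sample text are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: byte offset of the first occurrence of $word, or -1.
sub first_offset {
    my ($fh, $word) = @_;
    binmode $fh;                      # raw bytes, no decoding layer
    my $line_start = tell $fh;        # byte offset where the next line begins
    while (my $line = <$fh>) {
        my $i = index $line, $word;   # offset of $word inside this line, or -1
        return $line_start + $i if $i >= 0;
        $line_start = tell $fh;       # remember where the following line starts
    }
    return -1;
}

# Demo on an in-memory file (sample text is an assumption for illustration).
my $data = "first line\nsecond line has word in it\n";
open my $fh, '<', \$data or die "can't open in-memory file: $!";
my $offset = first_offset($fh, 'word');
print "Found 'word' at byte offset $offset\n";
close $fh;
```

The returned value can be handed straight to seek(), since both tell() and index() on a raw line count bytes.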
 

mud_saisem

Peter Makholm said:
If your file contains plain ASCII, ISO-8859, or another 8-bit charset,
it should be easy. The tell() function gives you the current location
in the file, pos() gives you the location of a regexp match, and
index() directly gives you the location.

So this should work (untested, though):

  my $offset = 0;
  while (<$fh>) {
      if (/word/) {
          say "Found 'word' at location ", $offset + pos();
      }
      $offset = tell $fh;
  }

If your file contains a variable-width Unicode encoding (like UTF-8), it
gets a lot harder.

//Makholm

Very nice, thanks for the help!
 

Jürgen Exner

mud_saisem said:
Does anybody know how to read through a file searching for a word and
printing the file position of that word ?

Please define 'position': are you talking about characters or bytes?

Just slurp the whole file into a string and then use index() to get the
position of the desired word in that string.
This is very straightforward and, unless you are dealing with
exceptionally large files (GB size) or an unusual distribution of your
'word' (almost always very early in the file), probably also faster than
any looping line by line or chunk by chunk.

jue
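
A minimal sketch of the slurp-and-index() approach, assuming the file fits comfortably in memory. The sample data is made up, and an in-memory filehandle stands in for a real file.

```perl
use strict;
use warnings;

# Sample data standing in for a real file (an assumption for illustration).
my $data = "one word, two words\nno match here\nlast word\n";
open my $fh, '<', \$data or die "can't open in-memory file: $!";
binmode $fh;

# Slurp: with $/ undefined, <$fh> returns the whole file as one string.
my $content = do { local $/; <$fh> };
close $fh;

# Collect every position of 'word', resuming the search after each hit.
my @positions;
my $from = 0;
while ((my $i = index($content, 'word', $from)) >= 0) {
    push @positions, $i;
    $from = $i + 1;
}
print "Found 'word' at: @positions\n";
```

Because the string holds raw bytes here, the positions are byte offsets; on a decoded string they would be character offsets instead.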
 

mud_saisem

Jürgen Exner said:
Please define 'position': are you talking about characters or bytes?

Just slurp the whole file into a string and then use index() to get the
position of the desired word in that string.
This is very straightforward and, unless you are dealing with
exceptionally large files (GB size) or an unusual distribution of your
'word' (almost always very early in the file), probably also faster than
any looping line by line or chunk by chunk.

jue

The log files that I will be scanning range from 500 MB to 5 GB,
so loading the content of a file into memory is not an option.

What I meant about position was, if I am looking for a word like
"slurp" (from your paragraph), it should tell me where in the file the
word is, so that I can use the seek function and jump directly to the
position in the file where the word "slurp" is.
 

Jürgen Exner

mud_saisem said:
Please define 'position': are you talking about characters or bytes?
[...]
What I meant about position was, if i am looking for a word like
"slurp" (from your paragraph), it should tell me where in the file the
word is,

That is not any more specific than your first request. It could still
be bytes or characters.

mud_saisem said:
so that I can use the seek function

Now, that is the critical clue. seek() is based on bytes, so you need a
position in bytes in order to use seek().
Position in characters would do you no good and therefore my suggestion
with index() wouldn't do you any good, either, because it returns the
position in characters. As does the suggestion from Peter Makholm. His
regular expression search is character-based, too, therefore it will not
return the byte-based position that you need for seek().
That is unless your file is in a single-byte character set, of course,
but you didn't say.

jue
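
A small sketch of why the distinction matters: the same index() call gives different answers on a decoded character string and on the UTF-8 bytes as they sit on disk. The sample text is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Made-up sample: two non-ASCII characters precede the word.
my $chars = "\x{e9}t\x{e9} slurp";        # character string, 9 characters
my $bytes = encode('UTF-8', $chars);      # same text as UTF-8, 11 bytes

my $char_pos = index $chars, 'slurp';     # counts characters
my $byte_pos = index $bytes, 'slurp';     # counts bytes

print "character position: $char_pos\n";  # prints 4
print "byte position:      $byte_pos\n";  # prints 6
```

Only the second number is usable with seek() on the UTF-8 file.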
 

Randal L. Schwartz

Jürgen> Now, that is the critical clue. seek() is based on bytes, so you need a
Jürgen> position in bytes in order to use seek().

Historical fact: fseek(3) was originally based on ftell(3)-"cookies", where
the stdio lib didn't promise to be able to return to any position that it
hadn't originally handed you from a tell. As it turns out, those "cookies"
were always byte positions on every operating system *I* saw stdio implemented
on.

print "Just another Perl hacker,";
 

sln

Peter Makholm said:
If your file contains plain ASCII, ISO-8859, or another 8-bit charset,
it should be easy. The tell() function gives you the current location
in the file, pos() gives you the location of a regexp match, and
index() directly gives you the location.

So this should work (untested, though):

  my $offset = 0;
  while (<$fh>) {
      if (/word/) {
          say "Found 'word' at location ", $offset + pos();
      }
      $offset = tell $fh;
  }

If your file contains a variable-width Unicode encoding (like UTF-8), it
gets a lot harder.
^^^
But probably not impossible.

-sln

------------------------
use strict;
use warnings;
use Encode;

binmode(STDOUT, ':encoding(UTF-8)');

my $word = "wo\x{2100}rd";
my $octet_search = encode('UTF-8', $word);
my @FileLocations = ();

my $filedata = encode ('UTF-8', "
This $word \x{2100} is a $word puzzle
It is not in this line,
but $word is in this one.
End.
");

open my $fh, '<', \$filedata or die "can't open memory file: $!";

my $linelength = 0;
print "\n";

while (<$fh>)
{
    my $octet_dataline = $_;
    while ( /($octet_search)/g )
    {
        my ($byte_offset, $byte_len) = (
            $linelength + pos() - length($octet_search),
            length $1
        );
        print "Found $word at $byte_offset\n";
        print "Byte length is $byte_len, byte string is '$1'\n";
        push @FileLocations, $byte_offset, $byte_len;
    }
    $linelength += length ($octet_dataline);
}
close $fh;

# To reconstitute,
# seek to the offsets, and read length bytes
#
print "\nFile offset/length's:\n";
while (my ($offset,$len) = splice(@FileLocations, 0,2)) {
    print "$offset, $len\n";
}

__END__


Found wo℀rd at 7
Byte length is 7, byte string is 'wo+ó-ä-Çrd'
Found wo℀rd at 24
Byte length is 7, byte string is 'wo+ó-ä-Çrd'
Found wo℀rd at 69
Byte length is 7, byte string is 'wo+ó-ä-Çrd'

File offset/length's:
7, 7
24, 7
69, 7
 

Ted Zlatanov

ms> Does anybody know how to read through a file searching for a word and
ms> printing the file position of that word ?

Besides the great Perl solutions posted here, you may want to consider
`grep -b', which prints the byte offset of each matching line (with GNU
grep, add `-o' to get the offset of the match itself), depending on
your needs of course.

Ted
 

sln

sln said:
But probably not impossible.

I guess I'll keep this around as a curiosity,
not knowing the particulars of how (or whether) Perl auto-promotes
byte strings to UTF-8 in the regex process.

If I try it out on different encodings, it seems to work.
The only problem is with a BOM (byte order mark), as this would
require adjusting the offset because of the BOM/seek bug.

Depending on the OS, one of the endiannesses won't map correctly to
UTF-8. For this reason, I left out the 16/32 LEs, because it prints to
STDOUT, which is binmode'd to UTF-8. But otherwise, all the endiannesses
work as far as getting offsets.

Same real estate, different code.
By the way, this may be a much faster way to do regex matching on
Unicode. Running regular expressions on a very large file opened
in UTF-8 mode significantly slows down the regex engine
(by several orders of magnitude).

-sln
--------------------
# Rx_Bytes_Unicode_misc1.pl
# -sln, 2/10
use strict;
use warnings;
use Encode;

binmode(STDOUT, ':encoding(UTF-8)');

## Try some encodings
#
for my $UTF ('ascii', 'UTF-8', 'UTF-16BE', 'UTF-32BE')
{
    ## Create pattern in encoded bytes
    ## (the '|' separators are encoded too, so in UTF-16/32 their leading
    ## NUL bytes end up at the tail of the preceding alternative)
    #
    my $word = "wo\x{2100}rd";
    my $octet_pattern = encode($UTF, $word."|End|one");

    print "\n",'-'x20,"\nEncoding: $UTF\nPattern: '$octet_pattern'\n";

    ## Create file data in encoded bytes
    #
    my $filedata = encode ($UTF,
"This $word \x{2100} is a $word puzzle
It is not in this line,
but $word is in this one.
The End."
    );

    ## Open a memory buffer in byte mode
    #
    open my $fh, '<', \$filedata
        or die "Can't open memory buffer for read: $!";
    print "\n";

    ## Process file data
    #
    my @FileLocations = ();
    my ($filepos, $line_count, $byte_offset, $byte_len) = (0,0);

    while (<$fh>)
    {
        ++$line_count;
        while ( /($octet_pattern)/g )
        {
            $byte_len = length $1;
            $byte_offset = $filepos + pos() - $byte_len;

            print "(line $line_count) Found '",decode($UTF,$1),
                "' (fpos= $byte_offset), byte string ",
                "(len= $byte_len) is '$1'\n";
            # save offset/length of matched item
            push @FileLocations, $byte_offset, $byte_len;
        }
        # $filepos += length;
        # or ->
        $filepos = tell ($fh);
    }

    ## Reconstitute file data.
    ## Seek to offsets, read length bytes
    #
    if ( @FileLocations ) {
        print "\nFile offset/length:\n";
        my $buf = '';
        while (my ($offset,$len) = splice(@FileLocations, 0,2)) {
            seek ($fh, $offset, 0);
            read ($fh, $buf, $len);
            print "$offset, $len, ",
                "$UTF: '$buf', UTF-8 string: '",
                decode($UTF, $buf), "'\n";
        }
    }
    close $fh;
}
__END__
--------------------
Encoding: ascii
Pattern: 'wo?rd|End|one'

(line 3) Found 'one' (fpos= 96), byte string (len= 3) is 'one'
(line 4) Found 'End' (fpos= 115), byte string (len= 3) is 'End'

File offset/length:
96, 3, ascii: 'one', UTF-8 string: 'one'
115, 3, ascii: 'End', UTF-8 string: 'End'

--------------------
Encoding: UTF-8
Pattern: 'wo+ó-ä-Çrd|End|one'

(line 1) Found 'wo℀rd' (fpos= 5), byte string (len= 7) is 'wo+ó-ä-Çrd'
(line 1) Found 'wo℀rd' (fpos= 22), byte string (len= 7) is 'wo+ó-ä-Çrd'
(line 3) Found 'wo℀rd' (fpos= 85), byte string (len= 7) is 'wo+ó-ä-Çrd'
(line 3) Found 'one' (fpos= 104), byte string (len= 3) is 'one'
(line 4) Found 'End' (fpos= 123), byte string (len= 3) is 'End'

File offset/length:
5, 7, UTF-8: 'wo+ó-ä-Çrd', UTF-8 string: 'wo℀rd'
22, 7, UTF-8: 'wo+ó-ä-Çrd', UTF-8 string: 'wo℀rd'
85, 7, UTF-8: 'wo+ó-ä-Çrd', UTF-8 string: 'wo℀rd'
104, 3, UTF-8: 'one', UTF-8 string: 'one'
123, 3, UTF-8: 'End', UTF-8 string: 'End'

--------------------
Encoding: UTF-16BE
Pattern: ' w o! r d | E n d | o n e'

(line 1) Found 'wo℀rd' (fpos= 10), byte string (len= 11) is ' w o! r d '
(line 1) Found 'wo℀rd' (fpos= 36), byte string (len= 11) is ' w o! r d '
(line 3) Found 'wo℀rd' (fpos= 158), byte string (len= 11) is ' w o! r d '
(line 3) Found 'one' (fpos= 192), byte string (len= 6) is ' o n e'
(line 4) Found 'End' (fpos= 230), byte string (len= 7) is ' E n d '

File offset/length:
10, 11, UTF-16BE: ' w o! r d ', UTF-8 string: 'wo℀rd'
36, 11, UTF-16BE: ' w o! r d ', UTF-8 string: 'wo℀rd'
158, 11, UTF-16BE: ' w o! r d ', UTF-8 string: 'wo℀rd'
192, 6, UTF-16BE: ' o n e', UTF-8 string: 'one'
230, 7, UTF-16BE: ' E n d ', UTF-8 string: 'End'

--------------------
Encoding: UTF-32BE
Pattern: ' w o ! r d | E n d | o n e'

(line 1) Found 'wo℀rd' (fpos= 20), byte string (len= 23) is ' w o ! r d '
(line 1) Found 'wo℀rd' (fpos= 72), byte string (len= 23) is ' w o ! r d '
(line 3) Found 'wo℀rd' (fpos= 316), byte string (len= 23) is ' w o ! r d '
(line 3) Found 'one' (fpos= 384), byte string (len= 12) is ' o n e'
(line 4) Found 'End' (fpos= 460), byte string (len= 15) is ' E n d '

File offset/length:
20, 23, UTF-32BE: ' w o ! r d ', UTF-8 string: 'wo℀rd'
72, 23, UTF-32BE: ' w o ! r d ', UTF-8 string: 'wo℀rd'
316, 23, UTF-32BE: ' w o ! r d ', UTF-8 string: 'wo℀rd'
384, 12, UTF-32BE: ' o n e', UTF-8 string: 'one'
460, 15, UTF-32BE: ' E n d ', UTF-8 string: 'End'
 

Randal L. Schwartz

Ben> IIRC Win32's stdio in 'text' mode (the default) uses this mechanism to
Ben> get around the CRLF->LF translation.

Nice to know. I guess I'm lucky in that I've never had to use Windows
except in internet cafes, where the first step is "download putty"
so I can ssh to a real box.
 

C.DeRykus

Peter Makholm said:
If your file contains plain ASCII, ISO-8859, or another 8-bit charset,
it should be easy. The tell() function gives you the current location
in the file, pos() gives you the location of a regexp match, and
index() directly gives you the location.

So this should work (untested, though):

  my $offset = 0;
  while (<$fh>) {
      if (/word/) {
      ^^^^^^^^^^^^

pos() is only set after a match with the /g modifier, so that line
needs to be:

      if ( /word/g ) {

Maybe the OP assumed it was correct because of
the 'tell' addition.
 

C.DeRykus

C.DeRykus said:
      if ( /word/g ) {

Maybe the OP assumed it was correct because of
the 'tell' addition.

You may need to loop to pick up multiple hits
per line too, if that was the goal.
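
A minimal sketch of such a loop, assuming a single-byte charset (or a handle read as raw bytes). It uses $-[0], the start of the last match, because pos() reports the offset just past the end of the match. The sample data is made up for illustration.

```perl
use strict;
use warnings;

# Made-up sample with two hits on the first line and one on the last.
my $data = "word and word again\nnothing\nthe last word\n";
open my $fh, '<', \$data or die "can't open in-memory file: $!";
binmode $fh;

my @hits;
my $line_start = 0;                   # byte offset of the current line
while (my $line = <$fh>) {
    # In scalar context, //g resumes where the previous match left off,
    # so the inner loop visits every hit on the line.
    while ($line =~ /word/g) {
        push @hits, $line_start + $-[0];
    }
    $line_start = tell $fh;
}
close $fh;

print "Found 'word' at byte offset $_\n" for @hits;
```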
 

Peter J. Holzer

Jürgen Exner said:
Now, that is the critical clue. seek() is based on bytes, so you need a
position in bytes in order to use seek().
Position in characters would do you no good and therefore my suggestion
with index() wouldn't do you any good, either, because it returns the
position in characters.

Only if you use index() on character strings - if you use index on byte
strings it returns a byte position. So just read the file in binary,
convert your search string to the same encoding and invoke index().

Caveat: Some encodings are ambiguous: The same character sequence may be
represented by different byte sequences. For those encodings, index
won't work.

hp
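
A rough sketch of that idea for files too large to slurp: read in binary, encode the search string once, and run index() over fixed-size chunks while keeping a short tail so matches that straddle a chunk boundary are still found. The helper name byte_offsets, the tiny chunk size, and the sample data are assumptions for illustration; it also assumes an encoding in which the word has exactly one byte representation.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: byte offsets of every occurrence of $needle (bytes),
# read chunk by chunk so the whole file never has to fit in memory.
sub byte_offsets {
    my ($fh, $needle, $chunk_size) = @_;
    die "empty search string" unless length $needle;
    binmode $fh;
    my $keep = length($needle) - 1;   # tail bytes that may begin a match
    my ($buf, $buf_start, @found) = ('', 0);
    while (read($fh, my $chunk, $chunk_size)) {
        $buf .= $chunk;
        my $from = 0;
        while ((my $i = index($buf, $needle, $from)) >= 0) {
            push @found, $buf_start + $i;
            $from = $i + 1;
        }
        # Drop everything except the last $keep bytes of the buffer.
        my $drop = length($buf) - $keep;
        if ($drop > 0) {
            substr($buf, 0, $drop, '');
            $buf_start += $drop;
        }
    }
    return @found;
}

# Demo on an in-memory file holding UTF-8 bytes.
my $needle = encode('UTF-8', "wo\x{2100}rd");
my $data   = encode('UTF-8', "This wo\x{2100}rd is a wo\x{2100}rd puzzle\n");
open my $fh, '<', \$data or die "can't open in-memory file: $!";

# A chunk size of 4 forces matches to straddle chunk boundaries.
my @offsets = byte_offsets($fh, $needle, 4);

# Check each offset by seeking to it (0 = from start) and reading back.
for my $offset (@offsets) {
    seek($fh, $offset, 0) or die "seek failed: $!";
    read($fh, my $got, length $needle);
    print "offset $offset: ", ($got eq $needle ? "ok" : "mismatch"), "\n";
}
close $fh;
```

For real log files a chunk size in the megabyte range would be more sensible; the small value here only exercises the boundary handling.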
 
