^^^
But probably not impossible.
-sln
I guess I'll keep this around as a curiosity, since I don't know
the particulars of how (or if) Perl auto-promotes byte strings
to utf8 during regex matching.
When I try it out on different encodings, it seems to work.
The only problem is a BOM (byte order mark), if present, as this
would require adjusting the offsets because of the BOM/seek bug.
Depending on the OS, some byte orders won't map correctly to utf8.
For this reason I left out the 16/32 LE's, because the script prints
to STDOUT, which is binmode'd to UTF-8. Otherwise, all the byte
orders work as far as getting offsets goes.
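For the BOM adjustment, here is a minimal sketch of how the size of a leading BOM might be measured on raw bytes. The helper name bom_length is made up for illustration and is not part of the script below; it only recognises the standard UTF-8, UTF-16 and UTF-32 marks.

```perl
use strict;
use warnings;

# Hypothetical helper: return the number of leading bytes taken up by
# a BOM, or 0 if none is recognised. The 4-byte UTF-32 marks are tested
# before the 2-byte UTF-16 marks because the UTF-32LE BOM begins with
# the UTF-16LE BOM.
sub bom_length {
    my ($bytes) = @_;
    return 4 if $bytes =~ /\A(?:\x00\x00\xFE\xFF|\xFF\xFE\x00\x00)/;
    return 3 if $bytes =~ /\A\xEF\xBB\xBF/;
    return 2 if $bytes =~ /\A(?:\xFE\xFF|\xFF\xFE)/;
    return 0;
}

# Offsets measured on data with the BOM stripped would have to be
# shifted by this many bytes to get true file positions.
my $data = "\xEF\xBB\xBFword";
my $skip = bom_length($data);
print "BOM bytes to skip: $skip\n";   # prints "BOM bytes to skip: 3"
```

Note that a UTF-16LE file whose first character after the BOM is U+0000 would be mistaken for UTF-32LE by this check, so it is only a heuristic.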
Same real estate, different code.
Btw, this may be a much faster way to run regexes on Unicode
data. Running regular expressions over a very large file that was
opened in utf-8 mode slows the regex engine down significantly
(by several orders of magnitude).
-sln
--------------------
# Rx_Bytes_Unicode_misc1.pl
# -sln, 2/10
use strict;
use warnings;
use Encode;

binmode(STDOUT, ':encoding(UTF-8)');

## Try some encodings
#
for my $UTF ('ascii', 'UTF-8', 'UTF-16BE', 'UTF-32BE')
{
    ## Create pattern in encoded bytes.
    ## Note: the whole alternation is encoded in one go, so the '|'
    ## separators get encoded too (e.g. "\0|" in UTF-16BE), and bytes
    ## that happen to be regex metacharacters (such as the '?' the
    ## ascii encoder substitutes for \x{2100}) are not escaped.
    #
    my $word = "wo\x{2100}rd";
    my $octet_pattern = encode($UTF, $word."|End|one");
    print "\n",'-'x20,"\nEncoding: $UTF\nPattern: '$octet_pattern'\n";

    ## Create file data in encoded bytes.
    ## The 10 leading spaces on the continuation lines are part of
    ## the data; the offsets in the output depend on them.
    #
    my $filedata = encode ($UTF,
         "This $word \x{2100} is a $word puzzle
          It is not in this line,
          but $word is in this one.
          The End."
    );

    ## Open a memory buffer in byte mode
    #
    open my $fh, '<', \$filedata
        or die "Can't open memory buffer for read: $!";
    print "\n";

    ## Process file data
    #
    my @FileLocations = ();
    my ($filepos, $line_count, $byte_offset, $byte_len) = (0,0);
    while (<$fh>)
    {
        ++$line_count;
        while ( /($octet_pattern)/g )
        {
            $byte_len = length $1;
            $byte_offset = $filepos + pos() - $byte_len;
            print "(line $line_count) Found '",decode($UTF,$1),
                  "' (fpos= $byte_offset), byte string ",
                  "(len= $byte_len) is '$1'\n";
            # save offset/length of matched item
            push @FileLocations, $byte_offset, $byte_len;
        }
        # $filepos += length;
        # or ->
        $filepos = tell ($fh);
    }

    ## Reconstitute file data.
    ## Seek to offsets, read length bytes
    #
    if ( @FileLocations ) {
        print "\nFile offset/length:\n";
        my $buf = '';
        while (my ($offset,$len) = splice(@FileLocations, 0,2)) {
            seek ($fh, $offset, 0);
            read ($fh, $buf, $len);
            print "$offset, $len, ",
                  "$UTF: '$buf', UTF-8 string: '",
                  decode($UTF, $buf), "'\n";
        }
    }
    close $fh;
}
__END__
--------------------
Encoding: ascii
Pattern: 'wo?rd|End|one'
(line 3) Found 'one' (fpos= 96), byte string (len= 3) is 'one'
(line 4) Found 'End' (fpos= 115), byte string (len= 3) is 'End'
File offset/length:
96, 3, ascii: 'one', UTF-8 string: 'one'
115, 3, ascii: 'End', UTF-8 string: 'End'
--------------------
Encoding: UTF-8
Pattern: 'wo+ó-ä-Çrd|End|one'
(line 1) Found 'woGäÇrd' (fpos= 5), byte string (len= 7) is 'wo+ó-ä-Çrd'
(line 1) Found 'woGäÇrd' (fpos= 22), byte string (len= 7) is 'wo+ó-ä-Çrd'
(line 3) Found 'woGäÇrd' (fpos= 85), byte string (len= 7) is 'wo+ó-ä-Çrd'
(line 3) Found 'one' (fpos= 104), byte string (len= 3) is 'one'
(line 4) Found 'End' (fpos= 123), byte string (len= 3) is 'End'
File offset/length:
5, 7, UTF-8: 'wo+ó-ä-Çrd', UTF-8 string: 'woGäÇrd'
22, 7, UTF-8: 'wo+ó-ä-Çrd', UTF-8 string: 'woGäÇrd'
85, 7, UTF-8: 'wo+ó-ä-Çrd', UTF-8 string: 'woGäÇrd'
104, 3, UTF-8: 'one', UTF-8 string: 'one'
123, 3, UTF-8: 'End', UTF-8 string: 'End'
--------------------
Encoding: UTF-16BE
Pattern: ' w o! r d | E n d | o n e'
(line 1) Found 'woGäÇrd' (fpos= 10), byte string (len= 11) is ' w o! r d '
(line 1) Found 'woGäÇrd' (fpos= 36), byte string (len= 11) is ' w o! r d '
(line 3) Found 'woGäÇrd' (fpos= 158), byte string (len= 11) is ' w o! r d '
(line 3) Found 'one' (fpos= 192), byte string (len= 6) is ' o n e'
(line 4) Found 'End' (fpos= 230), byte string (len= 7) is ' E n d '
File offset/length:
10, 11, UTF-16BE: ' w o! r d ', UTF-8 string: 'woGäÇrd'
36, 11, UTF-16BE: ' w o! r d ', UTF-8 string: 'woGäÇrd'
158, 11, UTF-16BE: ' w o! r d ', UTF-8 string: 'woGäÇrd'
192, 6, UTF-16BE: ' o n e', UTF-8 string: 'one'
230, 7, UTF-16BE: ' E n d ', UTF-8 string: 'End'
--------------------
Encoding: UTF-32BE
Pattern: ' w o ! r d | E n d | o n e'
(line 1) Found 'woGäÇrd' (fpos= 20), byte string (len= 23) is ' w o ! r d '
(line 1) Found 'woGäÇrd' (fpos= 72), byte string (len= 23) is ' w o ! r d '
(line 3) Found 'woGäÇrd' (fpos= 316), byte string (len= 23) is ' w o ! r d '
(line 3) Found 'one' (fpos= 384), byte string (len= 12) is ' o n e'
(line 4) Found 'End' (fpos= 460), byte string (len= 15) is ' E n d '
File offset/length:
20, 23, UTF-32BE: ' w o ! r d ', UTF-8 string: 'woGäÇrd'
72, 23, UTF-32BE: ' w o ! r d ', UTF-8 string: 'woGäÇrd'
316, 23, UTF-32BE: ' w o ! r d ', UTF-8 string: 'woGäÇrd'
384, 12, UTF-32BE: ' o n e', UTF-8 string: 'one'
460, 15, UTF-32BE: ' E n d ', UTF-8 string: 'End'
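--------------------
A note on the lengths above: in UTF-16BE the word is reported as 11 bytes and 'End' as 7, because the script encodes the whole alternation at once, so each '|' becomes "\0|" and its leading NUL byte ends up on the alternative before it (three NULs in UTF-32BE, hence 23 and 15). The ascii run finds no 'wo?rd' for a similar reason: the '?' substituted for \x{2100} is read as a quantifier. Below is a sketch of one way around both, assuming the search words are literals. The helper names octet_alternation and find_byte_matches are made up for illustration; the alignment filter is an extra assumption, not something the original script does.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: build one byte-level alternation from literal
# words. Each word is encoded on its own and run through quotemeta, so
# the '|' separators stay single bytes and no encoded byte is treated
# as a regex metacharacter.
sub octet_alternation {
    my ($encoding, @words) = @_;
    return join '|', map { quotemeta encode($encoding, $_) } @words;
}

# Hypothetical helper: return [byte_offset, byte_length] pairs for the
# matches in a byte string. Hits that do not start on a code unit
# boundary (2 bytes for UTF-16, 4 for UTF-32, 1 otherwise) are dropped.
sub find_byte_matches {
    my ($bytes, $pattern, $unit) = @_;
    my @hits;
    while ($bytes =~ /($pattern)/g) {
        my $len    = length $1;
        my $offset = pos($bytes) - $len;
        push @hits, [$offset, $len] if $offset % $unit == 0;
    }
    return @hits;
}

my $enc  = 'UTF-16BE';
my $data = encode($enc, "This wo\x{2100}rd is the End.");
my $rx   = octet_alternation($enc, "wo\x{2100}rd", 'End');

# prints "offset=10 len=10" and "offset=36 len=6"
printf "offset=%d len=%d\n", @$_ for find_byte_matches($data, $rx, 2);
```

With the separators kept out of the encoded bytes, the lengths are whole multiples of the code unit size (10 and 6 here instead of 11 and 7), so the matched bytes decode cleanly.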