UTF - SEEK_SET workaround for BOM encoding(utf-16/32) layer Bug

sln · Aug 5, 2009

* warning, tired person, large source code ahead ..
------------------------------------------------------

I don't guess they fixed this, and probably won't.
The failure of Encode to seek past the BOM on subsequent reads
then SEEK_SET 0 isin't something unique to Perl. Python has the same bug.

Here is a way to do it by analysing the BOM in the character semantics
of the encoding layer on the open handle without touching any layers
with binmode. The result is a conversion of a BOM into byte string
that can be compared to the octets of the Unicode family of encodings.
These octets of all the encoded BOM's are then simply read as keys to
a hash.

BOM's are unique in thier encoding's and file position so there can
be no ambiguity when analysing them in any particular read encoding
character semantic. Whatever the encoding, enough characters are
read (1 at a time, up to the Max_Bom_Width octet count) to convert
its ordinal value into the first Max_Bom_Width bytes of the characters.
The width is gleaned withe each read from a file position differential.
The bytes from each character ordinal are taken off in a loop with
a bitwise & 0xff then a shift >>= 8 on the ordinal until its zero.
0x0 is padded until the byte count equals the width.
Get the next character, then do it again until the total bytes
in the conversion adds up to Max_Bom_Width bytes (octets).

I threw this together from another project where I needed to do
this. I didn't really want to have to do this but was forced to
because of other requirements.

For better or worse, here it is. It works.
I thought I would throw it up here incase anybody else needs this
solution besides me.

-sln

--------------------------
# File: bomtest.pl
#
# CheckUTF() -
# UTF-8/16/32 check, Will add a guessed utf-16/32 layer
# if there is no current encoding() layer.
# Opened filehandle should be in byte context.
# This is more for automatically adding a utf layer on
# unknown files opened in byte mode.
# Only guesses utf-8/16/32 encodings, BOMb'd or not.
# Used standalone or optionally with Get_BOM().
#
# Get_BOM() -
# SEEK_SET workaround for BOMb'd Encoding(utf-) layer Bug
# and a general way to get the BOM and offset for subsequent
# seeks.
# Fixes SEEK_SET if need to itterate reading of BOMb'd file.
# Once called, subsequent seek calls are positionalble past
# the BOM from the offset returned. Only needs to be called
# once and preferably after a call to optional CheckUTF().
# Does not alter the PerlIO layers in any way.
# The BOM is determined with character semantics
# (that was the hard part!), a file opened in byte semantics
# is just more gravy.
# This will also detect a mismatch in width between the
# detected BOM and the current opened encoding layer.
# Used standalone or optionally with CheckUTF().
#
# test_surrogate_pair() -
# Just a utility function to print hi/lo surrogate pairs
# given a unicode test character (> 0x10000).
#
#
# Modify the top variables to test various configurations.
# It basically creates a test file in an encoding, reads it
# back in that encoding. Then reads it back in byte mode.
# The write flag can be set so that just reading occurs.
#
#
# -sln, 8/4/09
# --------------------------------------------------------

use strict;
use warnings;

binmode (STDOUT, ':utf8');

my $fname = 'dummy.txt';
my $Unichar = "\x{21000}";

my $data = "\x{1}\x{2}$Unichar";
my $enc = 'UTF-16'; # blank or UTF-xxLE/BE or :bytes, :utf8, :raw, etc ...
my $enc_string = ($enc =~ /:/) ? "$enc" : (length($enc) ? "encoding($enc)" : '');
my $write_file = 1;
my $test_surr = 0;

my ($fh, $line);
my ($width,$bom,$offset);

# Test surrogate pairs ..
#
test_surrogate_pair($Unichar) if $test_surr;

# Create a test file with encoding ..
#
if ($write_file) {
print "\nWriting sample data as '$enc' to $fname\n",'-'x20,"\n";
open $fh, ">$enc_string", $fname or die "can't open $fname for write $!";

print $fh $data;
close $fh;
}

# Read test file with encoding ..
#
print "\nReading $fname as '$enc'\n",'-'x20,"\n";
open $fh, "<$enc_string", $fname or die "can't open $fname for read $!";

CheckUTF($fh,'size'=> 100, 'verbose'=> 1);

($width,$bom,$offset) = Get_BOM( $fh, 'v' ); # list context, verbose prints
# $offset = Get_BOM( $fh ); # scalar context, no prints

print "\nGet_BOM returned $bom, $width, offset = $offset\n";

for (1 .. 3) {
print "Pass $_: ";
seek($fh, $offset, 0); # SEEK_SET
my $fpos = tell($fh);
my $ok = ($fpos == $offset) ? 'ok' : 'bad';
$line = <$fh>;
print "seek $offset/$fpos $ok, data (".length($line).") = ";
for (map {ord $_} split //, $line) {
printf ("%x ",$_);
}
print "\n";
}
close $fh;

# Read test file in byte mode ..
#
print "\nReading $fname as bytes\n",'-'x20,"\n";
open $fh, '<', $fname or die "can't open $fname for read $!";

$offset = 0;

for (1 .. 3) {
print "Pass $_: ";
seek($fh, $offset, 0); # SEEK_SET
my $fpos = tell($fh);
my $ok = ($fpos == $offset) ? 'ok' : 'bad';
$line = <$fh>;
print "seek $offset/$fpos $ok, data (".length($line).") = ";
for (map {ord $_} split //, $line) {
printf ("%x ",$_);
}
print "\n";
}
close $fh;

exit 0;

##
## End of program ...
##

sub Get_BOM
{
my ($fh, $verbose) = @_;
$verbose = 0 unless (defined $verbose and $verbose);
use Encode;

my %bom2enc = (
map { encode($_, "\x{feff}") => $_ } qw(
UTF-8
UTF-16BE
UTF-16LE
UTF-32BE
UTF-32LE
)
);
my %enc2bom = (
reverse(%bom2enc),
map { $_ => encode($_, "\x{feff}") } qw(
UCS-2
iso-10646-1
utf8
)
);
my @bombs = sort { length $b <=> length $a } keys %bom2enc;
my $MAX_BOM_LENGTH = length $bombs[0];
my $bomstr = join '|', @bombs;
my $bom_re = qr/^($bomstr)/o;
my $layers = "@{[PerlIO::get_layers($fh)]}";

print "\nGet_BOM() - Layers = $layers\n" if $verbose;

if (defined($fh) && (ref($fh) eq 'GLOB' || ref(\$fh) eq 'GLOB'))
{
# Note - Must SEEK_SET/read twice to
# mitigate UTF-16/32 BOM seek bug
# -----------------------------------
if (seek($fh, 0, 0))
{
my $sample = '';

read ($fh, $sample, 1);
seek($fh, 0, 0); # SEEK_SET
my $fpos = tell($fh);
my @samparray = ();

## Read in $MAX_BOM_LENGTH 'characters'
## in whatever encoding the file was opened with.
## Store each 'character' and bytes/char - based
## on the differential file position per read.
## A BOM can be sucessfully read un-alterred in
## 1,2,4 or 8 byte sequences.
## -------------------------------------------
for (1 .. $MAX_BOM_LENGTH)
{
my $count = read ($fh, my $buff, 1);
last if (!defined $count or $count != 1);
push @samparray, {'char'=> $buff, 'width'=> (tell($fh)-$fpos)};
$fpos = tell($fh);
}
$sample = '';

print ">> Reading up to $MAX_BOM_LENGTH characters\n" if $verbose;

for my $aref (@samparray) {
$sample .= $aref->{'char'};
printf (" - char %x , bytes = %d\n", ord($aref->{'char'}), $aref->{'width'}) if $verbose;
}
seek($fh, 0, 0); # SEEK_SET, last seek, set to position 0

if ($verbose) {
print " - Read ".length($sample)." characters. Position = $fpos\n";
print " - Sample (".length($sample).") = ";
for (map {ord $_} split //, $sample) {
printf ("%02x ", $_);
}
print "\n>> Converting to bytes\n";
}
## Convert the 'characters' to bytes.
## Only process $MAX_BOM_LENGTH bytes (octets).
## We only care about the BOM, which will match
## the encoding bytes we already have in %bom2enc.
## -----------------------------------------------

my ($octets, $offset) = ('',0);
my $numb_octets = 0;

for my $aref (@samparray)
{
my @ar = ();
my ($width, $ordchar) = (
$aref->{'width'},
ord( $aref->{'char'})
);
last if (($numb_octets + $width) > $MAX_BOM_LENGTH);
$numb_octets += $width;

while ($ordchar > 0) {
push @ar, $ordchar & 0xff;
$ordchar >>= 8;
--$width;
}
push (@ar,0) while ($width-- > 0);
for (reverse @ar) {
vec ($octets, $offset++, 8) = $_;
}
}
if ($verbose) {
print " BOMS avail = ";
for my $bom (@bombs) {
print '( ';
for (map {ord $_} split //, $bom) {
printf ("%02x",$_);
}
print ' ) ';
}
print "\n - Bom octets from sample (".length($octets).") = ";
for (map {ord $_} split //, $octets) {
printf ("%02x ", $_);
}
print "\n";
}
# end convert

if (my ($found) = $octets =~ /$bom_re/) {
my ($offset, $bomstring) = (
length($found),
$bom2enc{$found}
);
if ($verbose) {
print ">> Found $bomstring, BOM = ";
for (map {ord $_} split //, $found) {
printf ("%02x", $_);
}
print "\n";
}

my $noendian = $bomstring;
$noendian =~ s/(?:LE|BE)$//i;
my $width = ($layers =~ /$noendian/i) ? 'ok' : 'mismatch';
print " - Does not match width of current encoding layer\n" if ($verbose and $width ne 'ok');

return ($width, $bomstring,$offset);
}
print ">> BOM Not found\n" if $verbose;
return ('','',0);
}
# can't seek here, fall through
}
# There was an error ..
print "\nGet_BOM() - Layers = $layers\n" if (!$verbose);
print ">> Invalid filehandle or file is unseekable\n - Utf BOMs available (max length = $MAX_BOM_LENGTH):\n";
while (my ($key,$val) = each %enc2bom)
{
my $valstring = '';
for (split //, $val) {
$valstring .= sprintf ("%x ",ord $_);
}
print " $key = $bom2enc{$val} = $valstring\n";
#print " $key = $bom2enc{$val} = '$val' = '$enc2bom{$key}' = $valstring\n";
}
return ('','',-1);
}

sub CheckUTF
{
# UTF-8/16/32 check
# Will add a layer if there is no current
# encoding() layer. So open filehandle should
# be in byte context.
# ----------------------------------------
# Result value:
# -1 - not filehandle (or undef'ed)
# 0 - read error or empty file
# 1 - already have encoding layer (could force change later)
# 2 - guess_encoder() error message
# 3 - do nothing unless utf-16/32(be/le)
# 4 - added a layer

my ($fh,@args) = @_;
my %parm = ('verbose'=> 0, 'size'=> 60);
while (my ($name, $val) = splice (@args, 0, 2)) {
$name =~ s/^\s+|\s+$//; next if not defined $val;
if (lc($name) eq 'verbose' || (lc($name) eq 'size' && $val > 4))
{ $parm{$name} = $val; }
}
my ($Res, $Utfmsg, $Layers) = (0, 'UTF Check', '');

if (!defined($fh) || (ref($fh) ne 'GLOB' && ref(\$fh) ne 'GLOB')) {
$Utfmsg .= ", not a filehandle ..";
print "$Utfmsg\n" if $parm{'verbose'};
return wantarray ? ($Res,$Utfmsg,'') : $Utfmsg;
}
$Layers = ':'.join (':', PerlIO::get_layers($fh)).':';

if ($Layers =~ /:encoding/) {
$Utfmsg .= ", already have encoding layer";
$Res = 1;
} else {
my ($count, $sample);
my $utf8layer = $Layers =~ /:utf8/;

binmode ($fh,":bytes");
seek ($fh, 0, 0);

# Try to read a large sample
if (!defined($count = read ($fh,$sample,$parm{'size'},0))) {
$Utfmsg .= ". $!"; # read error
} elsif ($count <= 0) {
$Utfmsg .= ". File is empty";
} else {
seek ($fh, 0, 0); # SEEK_SET
use Encode::Guess;

my $decoder = guess_encoding ($sample); # ascii/utf8/BOMed UTF
$Res = 2;

if (ref($decoder)) {
my $name = $decoder->name;
$decoder = 'Do nothing';
$Res = 3;
$Utfmsg .= ", guess($count read): $name";
if ($name =~ /UTF.*?(?:16|32)/i) {
# $name =~ s/(?:LE|BE)$//i;
$decoder = "Adding layer";
$Res = 4;
binmode ($fh, ":encoding($name)");
}
}
$Utfmsg .= ". $decoder" if (defined $decoder); # guess error
}
binmode ($fh,":utf8") if $utf8layer;
}
$Utfmsg .= ' ..';
$Layers = "@{[PerlIO::get_layers($fh)]}";
print "@{[$Utfmsg,$Layers]}\n" if $parm{'verbose'};
return wantarray ? ($Res,$Utfmsg,$Layers) : "@{[$Utfmsg,$Layers]}";
}

sub test_surrogate_pair
{
my $test = shift;
print "\nTesting surrogate pairs\n",'-'x20,"\n";
if (ord($test) < 0x10000) {
print "nothing to test < 0x10000\n";
} else {
my $hi = (ord($test) - 0x10000) / 0x400 + 0xD800;
my $lo = (ord($test) - 0x10000) % 0x400 + 0xDC00;
my $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
printf "test uni = %x\n", ord($test);
printf "hi surrogate = %x\n", $hi;
printf "lo surrogate = %x\n", $lo;
printf "uni = %x\n", $uni;
}
}

__END__

C:\temp>perl bomtest.pl

Writing sample data as 'UTF-16' to dummy.txt
--------------------

Reading dummy.txt as 'UTF-16'
--------------------
UTF Check, already have encoding layer .. unix crlf encoding(UTF-16) utf8

Get_BOM() - Layers = unix crlf encoding(UTF-16) utf8- char feff , bytes = 2
- char 1 , bytes = 2
- char 2 , bytes = 2
- char 21000 , bytes = 4
- Read 4 characters. Position = 10
- Sample (4) = feff 01 02 21000BOMS avail = ( fffe0000 ) ( 0000feff ) ( efbbbf ) ( fffe ) ( feff )
- Bom octets from sample (4) = fe ff 00 01
Get_BOM returned UTF-16BE, ok, offset = 2
Pass 1: seek 2/2 ok, data (3) = 1 2 21000
Pass 2: seek 2/2 ok, data (3) = 1 2 21000
Pass 3: seek 2/2 ok, data (3) = 1 2 21000

Reading dummy.txt as bytes
--------------------
Pass 1: seek 0/0 ok, data (10) = fe ff 0 1 0 2 d8 44 dc 0
Pass 2: seek 0/0 ok, data (10) = fe ff 0 1 0 2 d8 44 dc 0
Pass 3: seek 0/0 ok, data (10) = fe ff 0 1 0 2 d8 44 dc 0

C:\temp>

sln · Aug 14, 2009

On Wed, 05 Aug 2009 01:23:43 -0700, (e-mail address removed) wrote:

<snip>

This is last post on this.
Get_BOM() must procede with bom analysis in byte mode,
ie: no character semantics. It was a nice try but there is
really no other way. The passed in handle can have any encoding,
is dupped, the dup set to ':raw'. SEEK_SET is still fixed.

CheckUTF() can be passed arguments to interface (and pass args)
with Get_BOM(), thus combining 2 steps into 1.

Parameters have been cleaned up, they are new. I've finally conquerred
this bitch.

The whole thing is extremely verbose, the guts of which could be boiled down
to a fifth of the lines, which I am doing now when dropping it into something
else. But needed verbose to check every possible situation.

Handles to pipes won't work or anything non-seekable, exits gracefully (mostly).
Using handles to memory are tricky. I wouldn't load up a buffer with other-than-
utf8 encodings and expect Perl to be gracefull seeking and reading.
If you have to, open a file, CheckUTF() it, read it into a buffer (not yet handle taken
on it) as utf8, the default. Open a handle on the memory AS UTF8, then read it back.
There won't be a bom on it so SEEK_SET is just 0 (no offset).

Usage examples on top, a couple of output captures on bottom (from different configurations).
I might try adding more guess suspects later as I get to try different ones.

Filters on, run for the hills!!

-sln

=====================================================

# File: bomtest.pl
#
# CheckUTF() -
# UTF-8/16/32 check, Will add a guessed utf-16/32 layer
# if there is no current encoding() layer.
# Opened filehandle should be in byte context.
# This is more for automatically adding a utf layer on
# unknown files opened in byte mode.
# Guesses utf-8/16/32 encodings, BOMb'd or not.
# Used standalone or optionally with Get_BOM().
#
# Get_BOM() -
# SEEK_SET workaround for BOMb'd Encoding(utf-) layer Bug
# and a general way to get the BOM and offset for subsequent
# seeks.
# The file can be opened using any encoding. The handle passed
# is duped, that handle set to ':raw'. No character semantics
# are used when analysing the BOM.
# This will also detect a mismatch in width between the
# detected BOM and the current opened encoding layer.
# Used standalone or optionally with CheckUTF().
# CheckUTF() has an option to get the bom on the callers
# behalf, combining steps.
#
# test_surrogate_pair() -
# Just a utility function to print hi/lo surrogate pairs
# given a unicode test character (> 0x10000).
#
#
# Modify the top variables to test various configurations.
#
# -sln, 8/13/09
# --------------------------------------------------------

use strict;
use warnings;

binmode (STDOUT, ':utf8');

{
#my $fname = 'dummy.txt';
my $fname = 'c:\temp\XML\CollectedData_1804.xml';

my $Unichar = "\x{21000}";

my $data = "\x{1}\x{2}$Unichar";
my $enc = ''; # blank or UTF-xxLE/BE or :bytes, :utf8, :raw, etc ...
my $enc_string = ($enc =~ /:/) ? "$enc" : (length($enc) ? "encoding($enc)" : '');
my $write_file = 0;
unless ($write_file eq 'dummy.txt') { $write_file = 0; }
my $test_surr = 0;

my ($fh, $line);
my ($width,$encfound,$bom,$offset);

# Test surrogate pairs ..
##
test_surrogate_pair($Unichar) if $test_surr;

# Create a test file with encoding ..
##
if ($write_file) {
print "\nWriting sample data as '$enc' to $fname\n",'-'x20,"\n";
open $fh, ">$enc_string", $fname or die "can't open $fname for write $!";

print $fh $data;

close $fh;
} else {
print "\nNOT WRITING to $fname !!\n",'-'x20,"\n";
}

# Read test file with encoding ..
##
print "\nReading $fname as '$enc'\n",'-'x20,"\n";
open $fh, "<$enc_string", $fname or die "can't open $fname for read $!";

$offset = CheckUTF($fh, 'size'=> 100, 'verbose'=> 1, 'getbom_v'=> 1);
print "\n";

for (1 .. 3) {
print "Pass $_: ";
seek($fh, $offset, 0); # SEEK_SET
my $fpos = tell($fh);
my $ok = ($fpos == $offset) ? 'ok' : 'bad';
$line = <$fh>;
print "seek $offset/$fpos $ok, data (".length($line).") = ".ordsplit($line)."\n";
print "$line\n";
}
close $fh;

# Read test file with encoding into scalar, open handle to scalar as :utf8 ..
##
print "\nBuffering $fname as '$enc', open/read buffer as ':utf8'\n",'-'x20,"\n";
open $fh, "<$enc_string", $fname or die "can't open $fname for read $!";

$offset = CheckUTF($fh, 'size'=> 100, 'verbose'=> 1, 'getbom_v'=> 1);
seek($fh, $offset, 0); # SEEK_SET

my $membuf = <$fh>;
close $fh;

open $fh, "<:utf8", \$membuf or die "can't open \$membuf for read $!";

$offset = 0;
print "\n";

for (1 .. 3) {
print "Pass $_: ";
seek($fh, $offset, 0); # SEEK_SET
my $fpos = tell($fh);
my $ok = ($fpos == $offset) ? 'ok' : 'bad';
$line = <$fh>;
print "seek $offset/$fpos $ok, data (".length($line).") = ".ordsplit($line)."\n";
}
close $fh;

# Read test file in byte mode ..
##
print "\nReading $fname as bytes\n",'-'x20,"\n";
open $fh, '<', $fname or die "can't open $fname for read $!";

$offset = 0;

for (1 .. 3) {
print "Pass $_: ";
seek($fh, $offset, 0); # SEEK_SET
my $fpos = tell($fh);
my $ok = ($fpos == $offset) ? 'ok' : 'bad';
$line = <$fh>;
print "seek $offset/$fpos $ok, data (".length($line).") = ";
for (map {ord $_} split //, $line) {
printf ("%x ",$_);
}
print "\n";
}
close $fh;

}
exit 0;

##
## End of program ...
##

sub ordsplit
{
my $string = shift;
my $buf = '';
for (map {ord $_} split //, $string) {
$buf.= sprintf ("%x ",$_);
}
return $buf;
}

sub Get_BOM
{
my ($fh_orig, $verbose) = @_;
$verbose = 0 unless (defined $verbose and lc($verbose) eq 'v');
use Encode;

my %bom2enc = (
map { encode($_, "\x{feff}") => $_ } qw(
UTF-8
UTF-16BE
UTF-16LE
UTF-32BE
UTF-32LE
)
);
my %enc2bom = (
reverse(%bom2enc),
map { $_ => encode($_, "\x{feff}") } qw(
UCS-2
iso-10646-1
utf8
)
);
my @bombs = sort { length $b <=> length $a } keys %bom2enc;
my $MAX_BOM_LENGTH = length $bombs[0];
my $bomstr = join '|', @bombs;
my $bom_re = qr/^($bomstr)/o;
my @arlays = PerlIO::get_layers($fh_orig);
my $layers = "@arlays";

print "Get_BOM() - Layers = $layers\n" if $verbose;

if (defined($fh_orig) && (ref($fh_orig) eq 'GLOB' || ref(\$fh_orig) eq 'GLOB'))
{
# Dup the passed in handle, binmode it to raw
# We will ONLY do byte semantics !!
# --------------------------------------------------
open my $fh_dup, "<&", $fh_orig or die "can't dup filehandle: $!";
binmode ($fh_dup, ":raw");

# Note - Must SEEK_SET/read twice to
# mitigate UTF-16/32 BOM seek bug
# -----------------------------------
if (seek($fh_dup, 0, 0)) # SEEK_SET
{
my $sample = '';
seek($fh_dup, 0, 0); # SEEK_SET

## Read in $MAX_BOM_LENGTH characters.
## As a check, store character/bytes read based
## on the differential file position per read.
## Since our dup handle is in byte mode, the bytes
## per character will always be 1.
## This is a hold over when there could be encoded
## characters and is not necessary.
## -----------------------------------------------
my $fpos = tell($fh_dup);
my @samparray = ();
print " * Reading up to $MAX_BOM_LENGTH characters\n" if $verbose;

for (1 .. $MAX_BOM_LENGTH)
{
my $count = read ($fh_dup, my $buff, 1);
last if (!defined $count or $count != 1);
push @samparray, {'char'=> $buff, 'width'=> (tell($fh_dup)-$fpos)};
$fpos = tell($fh_dup);
}
$sample = '';
for my $href (@samparray) {
$sample .= $href->{'char'};
printf (" - char %x , bytes = %d\n", ord($href->{'char'}), $href->{'width'}) if $verbose;
}
if ($verbose) {
print " - Read ".length($sample)." characters. Position = $fpos\n";
print " - Sample (".length($sample).") = ";
for (map {ord $_} split //, $sample) {
printf ("%02x ", $_);
}
print "\n * Analysing bytes\n";
}
## Convert the 'characters' to bytes.
## Only process $MAX_BOM_LENGTH bytes.
## We only care about the BOM, which will match
## the encoding bytes we already have in %bom2enc.
## -----------------------------------------------
my ($octets, $offset) = ('',0);
my $numb_octets = 0;

for my $href (@samparray)
{
my @ar = ();
my ($width, $ordchar) = (
$href->{'width'},
ord( $href->{'char'})
);
last if (($numb_octets + $width) > $MAX_BOM_LENGTH);
$numb_octets += $width;

while ($ordchar > 0) {
push @ar, $ordchar & 0xff;
$ordchar >>= 8;
--$width;
}
push (@ar,0) while ($width-- > 0);
for (reverse @ar) {
vec ($octets, $offset++, 8) = $_;
}
}
if ($verbose) {
print " BOMS avail = ";
for my $bom (@bombs) {
print '( ';
for (map {ord $_} split //, $bom) {
printf ("%02x",$_);
}
print ' ) ';
}
print "\n - Bom bytes from sample (".length($octets).") = ";
for (map {ord $_} split //, $octets) {
printf ("%02x ", $_);
}
print "\n";
}
# end convert

if (my ($found) = $octets =~ /$bom_re/) {

my ($bomenc, $bom, $bomwidth, $bomoffset) = (
$bom2enc{$found},
'','',
length($found)
);
for (map {ord $_} split //, $found) {
$bom .= sprintf ("%02x", $_);
}
print " * Found $bomenc, BOM = $bom\n" if ($verbose);

my $noendian = $bomenc;
$noendian =~ s/(?:LE|BE)$//i;
$bomwidth = ($layers =~ /$noendian/i) ? 'ok' : 'mismatch';
print " - Does not match width of current encoding layer\n" if ($verbose and $bomwidth ne 'ok');

close ($fh_dup);
seek($fh_orig, 0, 0); # SEEK_SET
read ($fh_orig, $sample, 1); # read
seek($fh_orig, 0, 0); # SEEK_SET last. Caller to seek past BOM.

return ($bomenc, $bom, $bomwidth, $bomoffset);
}
print " * BOM Not found\n" if $verbose;

close ($fh_dup);
seek($fh_orig, 0, 0); # SEEK_SET
read ($fh_orig, $sample, 1); # read
seek($fh_orig, 0, 0); # SEEK_SET last. Caller to seek past BOM.

return ('','','',0);
}
# seek failed, fall through
}
# There was an error ..
if ($verbose) {
print " * Invalid filehandle or file is unseekable\n - Utf BOMs available (max length = $MAX_BOM_LENGTH):\n";
while (my ($key,$val) = each %enc2bom)
{
my $valstring = '';
for (split //, $val) {
$valstring .= sprintf ("%x ",ord $_);
}
print " $key = $bom2enc{$val} = $valstring\n";
#print " $key = $bom2enc{$val} = '$val' = '$enc2bom{$key}' = $valstring\n";
}
}
return ('','','',-1);
}

sub CheckUTF
{
# UTF-8/16/32 check
# Will add a layer if there is no current
# encoding() layer. So open filehandle should
# be in byte context.
# ----------------------------------------
# Result value:
# -1 - not filehandle (or undef'ed)
# 0 - read error or empty file
# 1 - already have encoding layer (could force change later)
# 2 - guess_encoder() error message
# 3 - do nothing unless utf-16/32(be/le)
# 4 - added a layer

my ($fh_check,@args) = @_;
my %parm = ('verbose'=>0, 'size'=>60, 'getbom' =>0, 'bomopts'=>'');
while (my ($name, $val) = splice (@args, 0, 2))
{
$name =~ s/^\s+|\s+$//;
$name = lc $name;
next if not defined $val;

if ( $name =~ /^getbom(?:_(\w*))?/ and $val) {
$parm{'gbom'} = 1;
$parm{'bomopts'} = $1 if defined($1);
}
elsif ($name eq 'verbose' || $name eq 'size' && $val > 4) {
$parm{$name} = $val;
}
}
# 0 1 2 3 4 5 6
my ($Res, $Utfmsg, $layers, $bomenc, $bom, $bomwidth, $bomoffset) = (
0, 'UTF Check', '', '', '', '', 0);
$layers = ':'.join (':', PerlIO::get_layers($fh_check)).':';

if (!defined($fh_check) || (ref($fh_check) ne 'GLOB' && ref(\$fh_check) ne 'GLOB')) {
$Utfmsg .= ", not a filehandle";
print "$Utfmsg\n" if $parm{'verbose'};
$Res = -1;
}
elsif ($layers =~ /:encoding/) {
$Utfmsg .= ", already have encoding layer";
$Res = 1;
}
else {
my ($count, $sample);
my $utf8layer = $layers =~ /:utf8/;

binmode ($fh_check,":bytes");
seek ($fh_check, 0, 0); # SEEK_SET

# Try to read a large sample
if (!defined($count = read ($fh_check,$sample,$parm{'size'},0))) {
$Utfmsg .= ". $!"; # read error
} elsif ($count <= 0) {
$Utfmsg .= ". File is empty";
} else {
seek ($fh_check, 0, 0); # SEEK_SET
use Encode::Guess;

my $decoder = guess_encoding ($sample); # ascii/utf8/BOMed UTF
$Res = 2;

if (ref($decoder)) {
my $name = $decoder->name;
$decoder = 'Do nothing';
$Res = 3;
$Utfmsg .= ", guess($count): $name";
if ($name =~ /UTF.*?(?:16|32)/i) {
# $name =~ s/(?:LE|BE)$//i;
$decoder = "Adding layer";
$Res = 4;
binmode ($fh_check, ":encoding($name)");
}
}
$Utfmsg .= ". $decoder" if (defined $decoder); # guess error
}
binmode ($fh_check,":utf8") if $utf8layer;
}
$Utfmsg .= ' ..';
$layers = "@{[PerlIO::get_layers($fh_check)]}";

print "@{[$Utfmsg,$layers]}\n" if $parm{'verbose'};

$bomoffset = $Res if $Res == -1;
if ($Res != -1 && $parm{'gbom'}) {
($bomenc, $bom, $bomwidth, $bomoffset) = Get_BOM ($fh_check, $parm{'bomopts'});
$Utfmsg .= " bom($bomoffset) = $bom ..";
print "Get_BOM returned $bomenc, $bom, $bomwidth, $bomoffset\n" if $parm{'verbose'};
}
return ($Res, $Utfmsg, $layers, $bomenc, $bom, $bomwidth, $bomoffset);
}

sub test_surrogate_pair
{
my $test = shift;
print "\nTesting surrogate pairs\n",'-'x20,"\n";
if (ord($test) < 0x10000) {
print "nothing to test < 0x10000\n";
} else {
my $hi = (ord($test) - 0x10000) / 0x400 + 0xD800;
my $lo = (ord($test) - 0x10000) % 0x400 + 0xDC00;
my $uni = 0x10000 + ($hi - 0xD800) * 0x400 + ($lo - 0xDC00);
printf "test uni = %x\n", ord($test);
printf "hi surrogate = %x\n", $hi;
printf "lo surrogate = %x\n", $lo;
printf "uni = %x\n", $uni;
}
}

__END__

C:\temp>perl bomtest.pl

NOT WRITING to c:\temp\XML\CollectedData_1804.xml !!
--------------------

Reading c:\temp\XML\CollectedData_1804.xml as ''
--------------------
UTF Check, guess(100): UTF-16. Adding layer .. unix crlf encoding(UTF-16) utf8
Get_BOM() - Layers = unix crlf encoding(UTF-16) utf8
* Reading up to 4 characters
- char ff , bytes = 1
- char fe , bytes = 1
- char 3c , bytes = 1
- char 0 , bytes = 1
- Read 4 characters. Position = 4
- Sample (4) = ff fe 3c 00
* Analysing bytes
BOMS avail = ( fffe0000 ) ( 0000feff ) ( efbbbf ) ( fffe ) ( feff )
- Bom bytes from sample (4) = ff fe 3c 00
* Found UTF-16LE, BOM = fffe
Get_BOM returned UTF-16LE, fffe, ok, 2

Pass 1: seek 2/2 ok, data (42) = 3c 3f 78 6d 6c 20 76 65 72 73 69 6f
6e 3d 22 31 2e 30 22 20 65 6e 63 6f 64 69 6e 67 3d 22 75 6e
69 63 6f 64 65 22 3f 3e d a
<?xml version="1.0" encoding="unicode"?>

Pass 2: seek 2/2 ok, data (42) = 3c 3f 78 6d 6c 20 76 65 72 73 69 6f
6e 3d 22 31 2e 30 22 20 65 6e 63 6f 64 69 6e 67 3d 22 75 6e
69 63 6f 64 65 22 3f 3e d a
<?xml version="1.0" encoding="unicode"?>

Pass 3: seek 2/2 ok, data (42) = 3c 3f 78 6d 6c 20 76 65 72 73 69 6f
6e 3d 22 31 2e 30 22 20 65 6e 63 6f 64 69 6e 67 3d 22 75 6e
69 63 6f 64 65 22 3f 3e d a
<?xml version="1.0" encoding="unicode"?>

Buffering c:\temp\XML\CollectedData_1804.xml as '', open/read buffer as ':utf8'
--------------------
UTF Check, guess(100): UTF-16. Adding layer .. unix crlf encoding(UTF-16) utf8
Get_BOM() - Layers = unix crlf encoding(UTF-16) utf8
* Reading up to 4 characters
- char ff , bytes = 1
- char fe , bytes = 1
- char 3c , bytes = 1
- char 0 , bytes = 1
- Read 4 characters. Position = 4
- Sample (4) = ff fe 3c 00
* Analysing bytes
BOMS avail = ( fffe0000 ) ( 0000feff ) ( efbbbf ) ( fffe ) ( feff )
- Bom bytes from sample (4) = ff fe 3c 00
* Found UTF-16LE, BOM = fffe
Get_BOM returned UTF-16LE, fffe, ok, 2

Pass 1: seek 0/0 ok, data (42) = 3c 3f 78 6d 6c 20 76 65 72 73 69 6f
6e 3d 22 31 2e 30 22 20 65 6e 63 6f 64 69 6e 67 3d 22 75 6e
69 63 6f 64 65 22 3f 3e d a
Pass 2: seek 0/0 ok, data (42) = 3c 3f 78 6d 6c 20 76 65 72 73 69 6f
6e 3d 22 31 2e 30 22 20 65 6e 63 6f 64 69 6e 67 3d 22 75 6e
69 63 6f 64 65 22 3f 3e d a
Pass 3: seek 0/0 ok, data (42) = 3c 3f 78 6d 6c 20 76 65 72 73 69 6f
6e 3d 22 31 2e 30 22 20 65 6e 63 6f 64 69 6e 67 3d 22 75 6e
69 63 6f 64 65 22 3f 3e d a

Reading c:\temp\XML\CollectedData_1804.xml as bytes
--------------------
Pass 1: seek 0/0 ok, data (85) = ff fe 3c 0 3f 0 78 0 6d 0 6c 0 2
0 0 76 0 65 0 72 0 73 0 69 0 6f 0 6e 0 3d 0 22 0 31 0 2e
0 30 0 22 0 20 0 65 0 6e 0 63 0 6f 0 64 0 69 0 6e 0 67 0
3d 0 22 0 75 0 6e 0 69 0 63 0 6f 0 64 0 65 0 22 0 3f 0 3e
0 d 0 a
Pass 2: seek 0/0 ok, data (85) = ff fe 3c 0 3f 0 78 0 6d 0 6c 0 2
0 0 76 0 65 0 72 0 73 0 69 0 6f 0 6e 0 3d 0 22 0 31 0 2e
0 30 0 22 0 20 0 65 0 6e 0 63 0 6f 0 64 0 69 0 6e 0 67 0
3d 0 22 0 75 0 6e 0 69 0 63 0 6f 0 64 0 65 0 22 0 3f 0 3e
0 d 0 a
Pass 3: seek 0/0 ok, data (85) = ff fe 3c 0 3f 0 78 0 6d 0 6c 0 2
0 0 76 0 65 0 72 0 73 0 69 0 6f 0 6e 0 3d 0 22 0 31 0 2e
0 30 0 22 0 20 0 65 0 6e 0 63 0 6f 0 64 0 69 0 6e 0 67 0
3d 0 22 0 75 0 6e 0 69 0 63 0 6f 0 64 0 65 0 22 0 3f 0 3e
0 d 0 a

C:\temp>

===================================================================

C:\temp>perl bomtest.pl

NOT WRITING to c:\temp\XML\CollectedData_1804.xml !!
--------------------

Reading c:\temp\XML\CollectedData_1804.xml as 'UTF-16LE'
--------------------
UTF Check, already have encoding layer .. unix crlf encoding(UTF-16LE) utf8
Get_BOM() - Layers = unix crlf encoding(UTF-16LE) utf8
* Reading up to 4 characters
- char ff , bytes = 1
- char fe , bytes = 1
- char 3c , bytes = 1
- char 0 , bytes = 1
- Read 4 characters. Position = 4
- Sample (4) = ff fe 3c 00
* Analysing bytes
BOMS avail = ( fffe0000 ) ( 0000feff ) ( efbbbf ) ( fffe ) ( feff )
- Bom bytes from sample (4) = ff fe 3c 00
* Found UTF-16LE, BOM = fffe
Get_BOM returned UTF-16LE, fffe, ok, 2

Pass 1: seek 2/2 ok, data (42) = 3c 3f 78 6d 6c 20 76 65 72 73 69 6f
6e 3d 22 31 2e 30 22 20 65 6e 63 6f 64 69 6e 67 3d 22 75 6e
69 63 6f 64 65 22 3f 3e d a
<?xml version="1.0" encoding="unicode"?>

Pass 2: seek 2/2 ok, data (42) = 3c 3f 78 6d 6c 20 76 65 72 73 69 6f
6e 3d 22 31 2e 30 22 20 65 6e 63 6f 64 69 6e 67 3d 22 75 6e
69 63 6f 64 65 22 3f 3e d a
<?xml version="1.0" encoding="unicode"?>

Pass 3: seek 2/2 ok, data (42) = 3c 3f 78 6d 6c 20 76 65 72 73 69 6f
6e 3d 22 31 2e 30 22 20 65 6e 63 6f 64 69 6e 67 3d 22 75 6e
69 63 6f 64 65 22 3f 3e d a
<?xml version="1.0" encoding="unicode"?>

Buffering c:\temp\XML\CollectedData_1804.xml as 'UTF-16LE', open/read buffer as
':utf8'
--------------------
UTF Check, already have encoding layer .. unix crlf encoding(UTF-16LE) utf8
Get_BOM() - Layers = unix crlf encoding(UTF-16LE) utf8
* Reading up to 4 characters
- char ff , bytes = 1
- char fe , bytes = 1
- char 3c , bytes = 1
- char 0 , bytes = 1
- Read 4 characters. Position = 4
- Sample (4) = ff fe 3c 00
* Analysing bytes
BOMS avail = ( fffe0000 ) ( 0000feff ) ( efbbbf ) ( fffe ) ( feff )
- Bom bytes from sample (4) = ff fe 3c 00
* Found UTF-16LE, BOM = fffe
Get_BOM returned UTF-16LE, fffe, ok, 2

Pass 1: seek 0/0 ok, data (42) = 3c 3f 78 6d 6c 20 76 65 72 73 69 6f
6e 3d 22 31 2e 30 22 20 65 6e 63 6f 64 69 6e 67 3d 22 75 6e
69 63 6f 64 65 22 3f 3e d a
Pass 2: seek 0/0 ok, data (42) = 3c 3f 78 6d 6c 20 76 65 72 73 69 6f
6e 3d 22 31 2e 30 22 20 65 6e 63 6f 64 69 6e 67 3d 22 75 6e
69 63 6f 64 65 22 3f 3e d a
Pass 3: seek 0/0 ok, data (42) = 3c 3f 78 6d 6c 20 76 65 72 73 69 6f
6e 3d 22 31 2e 30 22 20 65 6e 63 6f 64 69 6e 67 3d 22 75 6e
69 63 6f 64 65 22 3f 3e d a

Reading c:\temp\XML\CollectedData_1804.xml as bytes
--------------------
Pass 1: seek 0/0 ok, data (85) = ff fe 3c 0 3f 0 78 0 6d 0 6c 0 2
0 0 76 0 65 0 72 0 73 0 69 0 6f 0 6e 0 3d 0 22 0 31 0 2e
0 30 0 22 0 20 0 65 0 6e 0 63 0 6f 0 64 0 69 0 6e 0 67 0
3d 0 22 0 75 0 6e 0 69 0 63 0 6f 0 64 0 65 0 22 0 3f 0 3e
0 d 0 a
Pass 2: seek 0/0 ok, data (85) = ff fe 3c 0 3f 0 78 0 6d 0 6c 0 2
0 0 76 0 65 0 72 0 73 0 69 0 6f 0 6e 0 3d 0 22 0 31 0 2e
0 30 0 22 0 20 0 65 0 6e 0 63 0 6f 0 64 0 69 0 6e 0 67 0
3d 0 22 0 75 0 6e 0 69 0 63 0 6f 0 64 0 65 0 22 0 3f 0 3e
0 d 0 a
Pass 3: seek 0/0 ok, data (85) = ff fe 3c 0 3f 0 78 0 6d 0 6c 0 2
0 0 76 0 65 0 72 0 73 0 69 0 6f 0 6e 0 3d 0 22 0 31 0 2e
0 30 0 22 0 20 0 65 0 6e 0 63 0 6f 0 64 0 69 0 6e 0 67 0
3d 0 22 0 75 0 6e 0 69 0 63 0 6f 0 64 0 65 0 22 0 3f 0 3e
0 d 0 a

C:\temp>

Nathan Keel · Aug 14, 2009

<snip>

Stop posting huge amounts of retarded shit already in an effort to
impress yourself (because it's garbage). Seriously, please stop doing
that crap!

Guessing Encodings and the PerlIO layer	2	Jul 27, 2009
Writing UTF-8 file under Windows	1	Jan 5, 2007
Translater + module + tkinter	1	Feb 16, 2023
Unicode: Strings marked 'utf8'. Can they be converted to 'byte' without going the vec() route?	0	Aug 3, 2009
Could someone help me with this source code?	5	Jan 20, 2007
Regex testing and UTF8 awarenes or Regex and numeric pattern matching	2	Mar 10, 2009
python-dev Summary for 2006-02-16 through 2006-02-28	1	Apr 29, 2006
Accessing large >2GB file succeeds fails with open/read	6	Feb 15, 2005

UTF - SEEK_SET workaround for BOM encoding(utf-16/32) layer Bug

sln

sln

Nathan Keel

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads