sln
Hi, this subject has probably been hashed over already.
Maybe someone can steer me in the right direction.
If I could get the unaltered first 4 bytes of a file, I could check
this myself (UTF encodings and the BOM are all I care about).
I can get byte data as long as the file handle hasn't had an encoding
translation added to its PerlIO layers. It seems that even sysread
returns decoded data once a layer has been specified (at open time
or via binmode).
I've got a module that receives a file handle and starts to work on the
data. If there is no encoding layer associated with the file handle,
byte data is returned (the Perl default). If there is an encoding layer,
PerlIO converts the data on each read to Perl's internal utf8 form
via the Encode layer.
I thought file handles were now PerlIO objects, but I can't find
any documentation on methods that could help me find out which
(en/de)coding layers are attached, and possibly manipulate
(temporarily) how the layers interact.
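For the "which layers are attached" part, the core function PerlIO::get_layers may be enough. Below is a minimal sketch of how it could be used; encoding_layers is a hypothetical helper name of mine, and the in-memory file handles are only there so the example runs without a real file.

```perl
use strict;
use warnings;

# Hypothetical helper: list any encoding-related layers on a file handle.
# PerlIO::get_layers returns the layer names from the bottom of the stack
# to the top, with names such as "encoding(...)" and "utf8" for a handle
# opened with an :encoding layer.
sub encoding_layers {
    my ($fh) = @_;
    return grep { /^(?:encoding\(|utf8\z)/ } PerlIO::get_layers($fh);
}

# In-memory files, used here only to keep the sketch self-contained.
my $buf = '';
open my $plain, '<', \$buf
    or die "can't open in-memory file: $!";
open my $utf16, '<:encoding(UTF-16)', \$buf
    or die "can't open in-memory file: $!";

print "plain: ", join(' ', encoding_layers($plain)) || '(none)', "\n";
print "utf16: ", join(' ', encoding_layers($utf16)) || '(none)', "\n";
```

A module could call something like this on the handle it receives and only attempt its own BOM detection when the list comes back empty.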
It's probable that many files will be opened and passed to my module.
It's also most likely that they are NOT opened with any particular encoding
layer specified. As a result, some files will more than likely be UTF-16/32
and will bomb as my module processes the data.
I really only care about files in UTF encodings. All others
are exotic, and it is left up to the caller to specify a layer.
I tried a lot of stuff but had to settle on Encode::Guess. Apparently,
Encode and its flavors control the conversions, and ANY encoding specified
in the IO layer will convert to utf8. That is OK, but it's pretty tough
to match a regex containing the FF FE bytes against file data that is already utf8.
The problem is that the file's BOM is converted to utf8 by the encoding layer
and then no longer matches the FF FE bytes in the regex. I would have to convert
both to a common encoding, UTF-32 for example:
$fileBOMdata = decode("UTF-32", pack 'L*', (BOM32, map {ord $_} split //, $fileBOMdata));
$regexLEdata = decode("UTF-32", pack 'L*', (BOM32, map {ord $_} split //, "\x{ff}\x{fe}"));
$regexBEdata = decode("UTF-32", pack 'L*', (BOM32, map {ord $_} split //, "\x{fe}\x{ff}"));
$fileBOMdata =~ /^$regexLEdata(\x{0}\x{0}|)?/ or $fileBOMdata =~ /^(\x{0}\x{0}|)?$regexBEdata/
etc ...
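For comparison, when the first bytes are available undecoded, the BOM test needs no re-encoding at all. This is only a sketch: bom_encoding is a hypothetical name of mine, and it assumes the bytes were read from a handle with no encoding layer (for example one opened with "<:raw").

```perl
use strict;
use warnings;

# Hypothetical helper: map the leading bytes of a file to a BOM-based
# encoding name, or return undef when no BOM is present.
# The UTF-32 tests come first because the UTF-32LE BOM (FF FE 00 00)
# begins with the UTF-16LE BOM (FF FE). A UTF-16LE file whose first
# character is U+0000 would be misreported; this sketch accepts that.
sub bom_encoding {
    my ($bytes) = @_;
    return 'UTF-32BE' if $bytes =~ /\A\x00\x00\xFE\xFF/;
    return 'UTF-32LE' if $bytes =~ /\A\xFF\xFE\x00\x00/;
    return 'UTF-8'    if $bytes =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16BE' if $bytes =~ /\A\xFE\xFF/;
    return 'UTF-16LE' if $bytes =~ /\A\xFF\xFE/;
    return;
}

# Example with literal byte strings standing in for the first 4 bytes
# of a file.
for my $head ("\xFF\xFE\x00\x00", "\xFE\xFF\x00\x41", "\x41\x42\x43\x44") {
    my $name = bom_encoding($head);
    print defined $name ? "$name\n" : "no BOM\n";
}
```

The returned name could then be passed to binmode in the same way as the name from Encode::Guess.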
But I think Encode gets to work on byte data, thus avoiding all this stuff.
So, here is what I got working. It's a snippet, verbose and not trimmed yet.
If there is another way, please let me know.
-sln
#############
open my $fh, "<:encoding(UTF-16)", $fname or die "can't open $fname...";
# open my $fh, $fname or die "can't open $fname...";
# binmode ($fh, ":encoding(UTF-16)");
seek ($fh, 0, 0);

# UTF-8/16/32 check
print STDERR "UTF Check: ";
my $bomdata = '';
my $number_read = sysread ($fh, $bomdata, 4, 0);
seek ($fh, 0, 0);

if (defined $number_read and $number_read > 0)
{
    use Encode::Guess;

    # test 'guess' behavior ..
    #$bomdata = "\x{ff}\x{fe}\x{0}\x{0}";
    #$bomdata = "\x{fe}\x{ff}\x{0}\x{0}";
    #$bomdata = "\x{0}\x{0}\x{ff}\x{ff}";
    #$bomdata = "\x{4f}";
    #$bomdata = "\x{8f}";

    my $decoder = guess_encoding ( $bomdata ); # ascii/utf8/BOMed UTF
    if (ref($decoder)) {
        my $name = $decoder->name;
        print STDERR "guess $name";
        if ($name =~ /UTF.*?(?:16|32)/i) {
            print STDERR " (not utf8). Adding this layer.\n";
            binmode ($fh, ":encoding($name)");
        } else {
            print STDERR ". Not adding this layer.\n";
        }
    } else {
        print STDERR "$decoder\n";
    }
} else {
    print STDERR "utf8 or file is empty\n";
}
#############