Guessing Encodings and the PerlIO layer


sln

Hi, this subject has probably been hashed over.
Maybe someone can steer me in the right direction.

If I could get the unaltered first 4 bytes of a file, I could check
this myself (UTF encodings and the BOM are all I care about).

I can get byte data as long as the file hasn't had an encoding
translation added to its PerlIO layers. Once a layer is specified
(at open time or via binmode), it seems that even sysread returns
decoded data rather than the raw bytes.
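
A minimal sketch of the hand-rolled check described above, assuming the file can be opened without any encoding layer. The helper names bom_encoding and sniff_bom are hypothetical and only for illustration. Note that the bytes FF FE 00 00 are ambiguous: they are read here as a UTF-32LE BOM, but could also be a UTF-16LE BOM followed by a NUL character.

```perl
use strict;
use warnings;

# Hypothetical helper: map the leading raw bytes of a file to a UTF
# encoding name. Returns undef when no BOM is present.
sub bom_encoding {
    my ($bytes) = @_;
    # Test the 4-byte UTF-32 marks first: the UTF-32LE BOM starts with
    # the same two bytes as the UTF-16LE BOM.
    return 'UTF-32LE' if $bytes =~ /\A\xFF\xFE\x00\x00/;
    return 'UTF-32BE' if $bytes =~ /\A\x00\x00\xFE\xFF/;
    return 'UTF-8'    if $bytes =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16LE' if $bytes =~ /\A\xFF\xFE/;
    return 'UTF-16BE' if $bytes =~ /\A\xFE\xFF/;
    return undef;
}

# Hypothetical helper: read up to 4 raw bytes through a handle that has
# no encoding layer (:raw), so nothing is decoded on the way in.
sub sniff_bom {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "can't open $path: $!";
    my $count = read($fh, my $head, 4);
    die "read failed on $path: $!" unless defined $count;
    close $fh;
    return bom_encoding($head);
}
```

This only covers the case where the code opens the file itself. A handle passed in by a caller may already carry layers, which is the harder part of the question.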

I've got a module that receives a file handle and starts to work on
the data. If there is no encoding layer associated with the file handle,
byte data is returned (the Perl default). If there is an encoding,
PerlIO converts the data (for example, on a read) to Perl's utf8
via the Encode layer.

I thought file handles were now PerlIO objects, but I can't find
any documentation on methods that could help me find out which
(en/de)coding layers are attached, and possibly manipulate
(temporarily) the layer interactions.

It's probable that many files are opened and passed to my module.
It's also most likely that they are NOT opened with any particular encoding
layer specified. As a result, some files are more than likely UTF-16/32
and will bomb as my module processes the data.

I really only care about files that are of UTF encodings. All others
are exotic and left up to the caller to specify a layer.

I tried a lot of stuff but had to settle on Encode::Guess. Apparently,
Encode and its flavors control the conversions, and ANY encoding specified
in the IO layer converts the data to utf8. That is OK, but it's pretty tough
to match a regex containing raw FF FE bytes against file data that is already utf8.
The problem is that the file's BOM has been converted to utf8 by the encoding
layer, so it no longer matches the FF FE bytes in the regex. I would have to
convert both to a common encoding, UTF-32 for example:

$fileBOMdata = decode("UTF-32", pack 'L*', (BOM32, map {ord $_} split //, $fileBOMdata));
$regexLEdata = decode("UTF-32", pack 'L*', (BOM32, map {ord $_} split //, "\x{ff}\x{fe}"));
$regexBEdata = decode("UTF-32", pack 'L*', (BOM32, map {ord $_} split //, "\x{fe}\x{ff}"));
$fileBOMdata =~ /^$regexLEdata(\x{0}\x{0}|)?/ or $fileBOMdata =~ /^(\x{0}\x{0}|)?$regexBEdata/
etc ...

But I think Encode gets to use byte data thus avoiding all this stuff.
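
For comparison, here is a small sketch of what a BOM looks like once the data has been decoded, which may make the regex side simpler. It assumes the data was decoded with Encode's UTF-16LE or UTF-16 decoders. The sample bytes are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $raw = "\xFF\xFEh\x00i\x00";   # UTF-16LE BOM, then "hi"

# Endian-specific decoder: the BOM survives as the single character U+FEFF.
my $kept = decode('UTF-16LE', $raw);

# Generic decoder: the BOM selects the byte order and is dropped.
my $stripped = decode('UTF-16', $raw);

# Show the code points of each result.
for my $pair (['UTF-16LE', $kept], ['UTF-16', $stripped]) {
    my ($label, $text) = @$pair;
    printf "%-8s %s\n", $label,
        join ' ', map { sprintf 'U+%04X', ord } split //, $text;
}

# On decoded text the BOM is one character, so a single pattern covers it.
(my $clean = $kept) =~ s/\A\x{FEFF}//;
```

So on decoded data a leading BOM, if it is still there at all, can be matched as \x{FEFF} rather than as FF FE bytes.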

So, here is what I got working. A snippet, verbose and not trimmed yet.
If there is another way please let me know.
-sln

#############
open my $fh, "<:encoding(UTF-16)", $fname or die "can't open $fname: $!";
# open my $fh, $fname or die "can't open $fname...";
# binmode ($fh, ":encoding(UTF-16)");

seek ($fh, 0, 0);

# UTF-8/16/32 check
print STDERR "UTF Check: ";

# Sample the first 4 bytes, then rewind so later reads start at the top
my $bomdata = '';
my $number_read = sysread ($fh, $bomdata, 4, 0);
seek ($fh, 0, 0);

if (defined $number_read and $number_read > 0)
{
    use Encode::Guess;

    # test 'guess' behavior ..
    #$bomdata = "\x{ff}\x{fe}\x{0}\x{0}";
    #$bomdata = "\x{fe}\x{ff}\x{0}\x{0}";
    #$bomdata = "\x{0}\x{0}\x{ff}\x{ff}";
    #$bomdata = "\x{4f}";
    #$bomdata = "\x{8f}";

    my $decoder = guess_encoding ( $bomdata ); # ascii/utf8/BOMed UTF

    if (ref($decoder)) {
        # A reference means the guess succeeded and we have an encoding object
        my $name = $decoder->name;
        print STDERR "guess $name";
        if ($name =~ /UTF.*?(?:16|32)/i) {
            print STDERR " (not utf8). Adding this layer.\n";
            binmode ($fh, ":encoding($name)");
        } else {
            print STDERR ". Not adding this layer.\n";
        }
    } else {
        # Otherwise guess_encoding returned an error message string
        print STDERR "$decoder\n";
    }
} else {
    print STDERR "read failed or file is empty\n";
}
#############
 
John W. Krahn

sln wrote:
> I thought file handles were now PerlIO objects but I can't find
> any documentation on methods, that maybe could help me to find out
> info on attached (en/de)coding layers.

perldoc PerlIO
[ SNIP ]
Querying the layers of filehandles

The following returns the names of the PerlIO layers on a
filehandle.

my @layers = PerlIO::get_layers($fh); # Or FH, *FH, "FH".
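
As a rough illustration of the call above, the sketch below checks whether a handle already carries an :encoding() layer. The in-memory file and the sample bytes are assumptions made to keep the example self-contained. The exact layer names printed depend on the platform and on how the handle was opened.

```perl
use strict;
use warnings;

# An in-memory file stands in for a real one.
my $bytes = "\xFF\xFEh\x00i\x00";   # UTF-16LE BOM followed by "hi"
open my $fh, '<:encoding(UTF-16)', \$bytes
    or die "can't open in-memory file: $!";

# get_layers returns one name per layer, bottom of the stack first.
my @layers = PerlIO::get_layers($fh);

# An encoding layer is reported as "encoding(NAME)".
my $has_encoding = grep { /^encoding\(/ } @layers;

print "layers: @layers\n";
print $has_encoding ? "encoding layer present\n" : "no encoding layer\n";
```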

> And maybe possibly manipulate (temporarily) the layer interactions.

perldoc PerlIO
[ SNIP ]
:pop
A pseudo layer that removes the top-most layer. Gives perl
code a way to manipulate the layer stack.
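
A small sketch of :pop in use, assuming an encoding layer was pushed with binmode and should be taken off again. The in-memory file is only there to make the example self-contained. Popping a layer while data is still buffered may lose or garble that data, so this is best done before any reads.

```perl
use strict;
use warnings;

my $bytes = "plain ascii\n";
open my $fh, '<', \$bytes or die "can't open in-memory file: $!";

my $before = join ' ', PerlIO::get_layers($fh);

# Push an encoding layer on top of the stack ...
binmode($fh, ':encoding(UTF-16LE)') or die "can't push layer: $!";
my $during = join ' ', PerlIO::get_layers($fh);

# ... and remove the top-most layer again.
binmode($fh, ':pop');
my $after = join ' ', PerlIO::get_layers($fh);

print "before: $before\n";
print "during: $during\n";
print "after:  $after\n";
```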




John
 
sln

Thanks John. I went all through the PerlIO docs today and perliol
yesterday. Didn't really want to, but Encode led me there. I was
somehow stuck in the Encode::PerlIO docs and had to go further down the
left pane to find PerlIO.

I did massive tests to learn how the stack works (it is more like a list).
There is not really much you can do. I played with :raw, :pop and :bytes, and
added multiple :encoding() layers to see how the stack behaves
(I think it is a mistake that this is actually allowed), debug printing
the layer list all the while. And :via() was interesting.

I am not 100% sure about the accuracy of the :encoding() layer's filtering.
Write a file out as UTF-16/32 (LE/BE) and it's not read back in the same.
With UTF-32 it doesn't even write out the FF FE sequence (I looked at
it with a hex editor). It's probably my OS (Windows) or my perl config.

I settled on the code below, which works rock-solid. Guess has more
experience than me with BOMs (heuristics).

-sln

###################
open my $fh, '<', $fname or die "can't open $fname: $!";
#binmode ($fh, ":encoding(UTF-16)");
#binmode ($fh, ":utf8");

# UTF-8/16/32 check
# ------------------
# Wrap the layer names in colons so each one can be matched as ":name"
my ($UtfMsg, $Layers) = (
    'UTF Check: ',
    ':'.join (':', PerlIO::get_layers($fh)).':'
);

if ($Layers =~ /:encoding/) {
    # The caller already chose an encoding; leave the handle alone
    $UtfMsg .= "Already have encoding layer";
} else {
    my ($count, $sample);
    my $utf8layer = $Layers =~ /:utf8/;

    # Switch to bytes so the sample is not decoded, then read from the start
    binmode ($fh, ":bytes");
    seek ($fh, 0, 0);

    if (defined($count = sysread ($fh,$sample,4,0)) && $count > 0)
    {
        seek ($fh, 0, 0);
        use Encode::Guess;

        my $decoder = guess_encoding ($sample); # ascii/utf8/BOMed UTF

        if (ref($decoder)) {
            my $name = $decoder->name;
            $decoder = '. Do nothing';
            $UtfMsg .= "guess $name";
            if ($name =~ /UTF.*?(?:16|32)/i) {
                # $name =~ s/(?:LE|BE)$//i;
                $decoder = ". Adding $name layer";
                binmode ($fh, ":encoding($name)");
            }
        }
        # Either the status text set above or the error string from guess_encoding
        $UtfMsg .= $decoder if (defined $decoder);
    }
    # Restore the :utf8 flag if the handle had it before the check
    binmode ($fh, ":utf8") if ($utf8layer);
}

print STDERR "\n$UtfMsg ..\n";
#############
 
