Guessing Encodings and the PerlIO layer

Discussion in 'Perl Misc' started by sln@netherlands.com, Jul 27, 2009.

  1. Guest

    Hi, this subject has probably been hashed over.
    Maybe someone can steer me in the right direction.

    If I could actually get unaltered data from the first
    4 bytes of a file, I could check this myself
    (UTF and BOM are all I care about).

    And I can get byte data if the file hasn't had an encoding
    translation added to the PerlIO layer. It seems that even
    bytes read with sysread are decoded in that case (if a layer
    is specified at open time, or via binmode).
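
    A minimal sketch of that manual check, assuming the file can be opened
    with :raw so that no layer decodes the first bytes. The names sniff_bom
    and bom_of_file are hypothetical helpers, not part of any module:

```perl
use strict;
use warnings;

# Hypothetical helper: map the leading bytes of a file to a UTF name.
# Order matters: the UTF-32LE BOM (FF FE 00 00) begins with the
# UTF-16LE BOM (FF FE), so the 32-bit patterns are tested first.
sub sniff_bom {
    my ($bytes) = @_;
    return 'UTF-32BE' if $bytes =~ /\A\x00\x00\xFE\xFF/;
    return 'UTF-32LE' if $bytes =~ /\A\xFF\xFE\x00\x00/;
    return 'UTF-8'    if $bytes =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16BE' if $bytes =~ /\A\xFE\xFF/;
    return 'UTF-16LE' if $bytes =~ /\A\xFF\xFE/;
    return;    # no BOM recognised
}

# Hypothetical helper: read the first four bytes through a :raw handle.
sub bom_of_file {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "can't open $path: $!";
    my $got = read $fh, my $head, 4;
    close $fh;
    return unless $got;    # read error or empty file
    return sniff_bom($head);
}
```

    Note that FF FE 00 00 is ambiguous: it could also be UTF-16LE text whose
    first character is U+0000. This sketch simply assumes UTF-32LE there.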

    I've got a module that receives a file handle and starts to work on
    the data. If there is no encoding layer associated with the file handle,
    byte data is returned (the Perl default). If there is an encoding,
    PerlIO converts the data (for example, on a read) to the default utf8
    via the Encode layer.

    I thought file handles were now PerlIO objects, but I can't find
    any documentation on methods that could help me find out
    info on attached (en/de)coding layers. And maybe manipulate
    (temporarily) the layer interactions.

    It's probable that many files are opened and passed to my module.
    It's also most likely they are NOT opened with any particular encoding
    layer specified. As a result, some files are more than likely UTF-16/32,
    and will bomb as my module processes the data.

    I really only care about files that are of UTF encodings. All others
    are exotic and left up to the caller to specify a layer.

    I tried a lot of stuff but had to settle on Encode::Guess. Apparently,
    Encode and its flavors control the conversions, and ANY encoding specified
    in the IO layer will convert to utf8, which is ok, but it's pretty tough
    to match a regex containing FFFE bytes against file data that is utf8.
    The problem is that the file's BOM is converted to utf8 (by the encode
    layer) and then fails to match the regex FFFE bytes. I would have to
    encode both to a new encoding, like UTF-32 for example:

    $fileBOMdata = decode("UTF-32", pack 'L*', (BOM32, map {ord $_} split //, $fileBOMdata));
    $regexLEdata = decode("UTF-32", pack 'L*', (BOM32, map {ord $_} split //, "\x{ff}\x{fe}"));
    $regexBEdata = decode("UTF-32", pack 'L*', (BOM32, map {ord $_} split //, "\x{fe}\x{ff}"));
    $fileBOMdata =~ /^$regexLEdata(\x{0}\x{0}|)?/ or $fileBOMdata =~ /^(\x{0}\x{0}|)?$regexBEdata/
    etc ...

    But I think Encode gets to use byte data thus avoiding all this stuff.

    So, here is what I got working. A snippet, verbose and not trimmed yet.
    If there is another way please let me know.
    -sln

    #############
    open my $fh, "<:encoding(UTF-16)", $fname or die "can't open $fname...";
    # open my $fh, $fname or die "can't open $fname...";
    # binmode ($fh, ":encoding(UTF-16)");

    seek ($fh, 0, 0);

    # UTF-8/16/32 check
    print STDERR "UTF Check: ";

    my $bomdata = '';
    my $number_read = sysread ($fh, $bomdata, 4, 0);
    seek ($fh, 0, 0);

    if (defined $number_read and $number_read > 0)
    {
        use Encode::Guess;

        # test 'guess' behavior ..
        #$bomdata = "\x{ff}\x{fe}\x{0}\x{0}";
        #$bomdata = "\x{fe}\x{ff}\x{0}\x{0}";
        #$bomdata = "\x{0}\x{0}\x{ff}\x{ff}";
        #$bomdata = "\x{4f}";
        #$bomdata = "\x{8f}";

        my $decoder = guess_encoding ( $bomdata ); # ascii/utf8/BOMed UTF

        if (ref($decoder)) {
            my $name = $decoder->name;
            print STDERR "guess $name";
            if ($name =~ /UTF.*?(?:16|32)/i) {
                print STDERR " (not utf8). Adding this layer.\n";
                binmode ($fh, ":encoding($name)");
            } else {
                print STDERR ". Not adding this layer.\n";
            }
        } else {
            print STDERR "$decoder\n";
        }
    } else {
        print STDERR "utf8 or file is empty\n";
    }
    #############
     
    , Jul 27, 2009

  2. wrote:
    > Hi, this subject has probably been hashed over.
    > Maybe someone can steer me in the right direction.
    >
    > If I could actually get unaltered data from the first
    > 4 bytes of a file, I could check this myself
    > (UTF and BOM are all I care about).
    >
    > And I can get byte data if the file hasn't had an encoding
    > translation added to the PerlIO layer. It seems that even
    > bytes read with sysread are decoded in that case (if a layer
    > is specified at open time, or via binmode).
    >
    > I've got a module that receives a file handle and starts to work on
    > the data. If there is no encoding layer associated with the file handle,
    > byte data is returned (the Perl default). If there is an encoding,
    > PerlIO converts the data (for example, on a read) to the default utf8
    > via the Encode layer.
    >
    > I thought file handles were now PerlIO objects, but I can't find
    > any documentation on methods that could help me find out
    > info on attached (en/de)coding layers.


    perldoc PerlIO
    [ SNIP ]
    Querying the layers of filehandles

    The following returns the names of the PerlIO layers on a
    filehandle.

    my @layers = PerlIO::get_layers($fh); # Or FH, *FH, "FH".
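
    As a rough illustration of how that call could be used to detect a
    decoding layer, here is a sketch using in-memory handles. The helper
    name has_decoding_layer is hypothetical, and the sample data is made up:

```perl
use strict;
use warnings;

# Hypothetical helper: true if the handle reports an :encoding(...)
# layer or the :utf8 flag in its layer list.
sub has_decoding_layer {
    my ($fh) = @_;
    my @layers = PerlIO::get_layers($fh);
    return scalar grep { /^(?:encoding|utf8)/ } @layers;
}

# In-memory "files" so the sketch needs nothing on disk.
my $plain_data = "abc";
my $utf16_data = "\xFE\xFF\x00a";    # BOM followed by "a" in UTF-16BE

open my $raw, '<', \$plain_data or die "open: $!";
open my $dec, '<:encoding(UTF-16)', \$utf16_data or die "open: $!";

for my $fh ($raw, $dec) {
    my @layers = PerlIO::get_layers($fh);
    printf "%-30s %s\n", "@layers",
        has_decoding_layer($fh) ? 'decoding layer present' : 'bytes only';
}
```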


    > And maybe manipulate (temporarily) the layer interactions.


    perldoc PerlIO
    [ SNIP ]
    :pop
    A pseudo layer that removes the top-most layer. Gives perl
    code a way to manipulate the layer stack.
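
    A small sketch of :pop in use, assuming an in-memory handle with one
    :encoding layer on top. It only shows the layer list before and after,
    and the sample data is made up:

```perl
use strict;
use warnings;

my $data = "\xFE\xFF\x00a";    # BOM followed by "a" in UTF-16BE
open my $fh, '<:encoding(UTF-16)', \$data or die "open: $!";

my @before = PerlIO::get_layers($fh);

# Remove the top-most layer, which here is the :encoding layer.
binmode($fh, ':pop') or die "binmode: $!";

my @after = PerlIO::get_layers($fh);

print "before: @before\n";
print "after:  @after\n";
```

    Popping blindly is risky when you do not know what is on top of the
    stack, so checking PerlIO::get_layers first seems prudent.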




    John
    --
    Those people who think they know everything are a great
    annoyance to those of us who do. -- Isaac Asimov
     
    John W. Krahn, Jul 27, 2009

  3. Guest

    On Mon, 27 Jul 2009 01:25:08 -0700, "John W. Krahn" wrote:

    > wrote:
    >> Hi, this subject has probably been hashed over.
    >> Maybe someone can steer me in the right direction.
    >>
    >> If I could actually get unaltered data from the first
    >> 4 bytes of a file, I could check this myself
    >> (UTF and BOM are all I care about).
    >>
    >> And I can get byte data if the file hasn't had an encoding
    >> translation added to the PerlIO layer. It seems that even
    >> bytes read with sysread are decoded in that case (if a layer
    >> is specified at open time, or via binmode).
    >>
    >> I've got a module that receives a file handle and starts to work on
    >> the data. If there is no encoding layer associated with the file handle,
    >> byte data is returned (the Perl default). If there is an encoding,
    >> PerlIO converts the data (for example, on a read) to the default utf8
    >> via the Encode layer.
    >>
    >> I thought file handles were now PerlIO objects, but I can't find
    >> any documentation on methods that could help me find out
    >> info on attached (en/de)coding layers.

    >
    >perldoc PerlIO
    >[ SNIP ]
    > Querying the layers of filehandles
    >
    > The following returns the names of the PerlIO layers on a
    > filehandle.
    >
    > my @layers = PerlIO::get_layers($fh); # Or FH, *FH, "FH".
    >
    >
    >> And maybe manipulate (temporarily) the layer interactions.

    >
    >perldoc PerlIO
    >[ SNIP ]
    > :pop
    > A pseudo layer that removes the top-most layer. Gives perl
    > code a way to manipulate the layer stack.
    >
    >
    >
    >
    >John


    Thanks John. I went all through the PerlIO docs today and perliol
    yesterday. Didn't really want to, but Encode led me there. I was
    somehow stuck in the Encode::PerlIO docs and had to go further down
    the left pane to find PerlIO.

    Did massive tests to learn how the stack works (more like a list).
    There is not much you can really do. I played with :raw, :pop, :bytes,
    and added multiple :encoding() layers to see how the stack works
    (it is a mistake that this is actually allowed), debug printing the
    layer list all the time, etc. And :via() was interesting.
    And I am not 100% sure about the accuracy of the :encoding() layer's
    filter. Write a file out as UTF-16/32 (LE/BE) and it's not read back
    the same. With UTF-32 it doesn't even write out the fffe sequence
    (looked at it with a hex editor). It's probably my OS (Windows) or my
    perl config.
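
    One thing that may be worth checking here: as far as the Encode::Unicode
    documentation describes it, only the endian-neutral names (UTF-16, UTF-32)
    prepend a BOM when encoding, while the LE/BE-specific names do not. A
    quick sketch to see the bytes each name produces for the letter "A":

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hex-dump the bytes each encoding name produces for the single character "A".
for my $name (qw(UTF-16 UTF-16LE UTF-16BE UTF-32 UTF-32LE UTF-32BE)) {
    my $bytes = encode($name, "A");
    my $hex   = join ' ', map { sprintf '%02X', ord } split //, $bytes;
    printf "%-8s %s\n", $name, $hex;
}
```

    If the UTF-32 file was written through an LE or BE layer, that could
    explain the missing BOM, though this is only a guess about that test.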

    I settled on the code below, which works rock-solid. Encode::Guess has
    more experience than me with BOMs (heuristics).

    -sln

    ###################
    open my $fh, $fname or die "can't open $fname...";
    #binmode ($fh, ":encoding(UTF-16)");
    #binmode ($fh, ":utf8");

    # UTF-8/16/32 check
    # ------------------
    my ($UtfMsg, $Layers) = (
        'UTF Check: ',
        ':'.join (':', PerlIO::get_layers($fh)).':'
    );

    if ($Layers =~ /:encoding/) {
        $UtfMsg .= "Already have encoding layer";
    } else {
        my ($count, $sample);
        my $utf8layer = $Layers =~ /:utf8/;

        binmode ($fh, ":bytes");
        seek ($fh, 0, 0);

        if (defined($count = sysread ($fh,$sample,4,0)) && $count > 0)
        {
            seek ($fh, 0, 0);
            use Encode::Guess;

            my $decoder = guess_encoding ($sample); # ascii/utf8/BOMed UTF

            if (ref($decoder)) {
                my $name = $decoder->name;
                $decoder = '. Do nothing';
                $UtfMsg .= "guess $name";
                if ($name =~ /UTF.*?(?:16|32)/i) {
                    # $name =~ s/(?:LE|BE)$//i;
                    $decoder = ". Adding $name layer";
                    binmode ($fh, ":encoding($name)");
                }
            }
            $UtfMsg .= $decoder if (defined $decoder);
        }
        binmode ($fh, ":utf8") if ($utf8layer);
    }

    print STDERR "\n$UtfMsg ..\n";
    #############
     
    , Jul 28, 2009
