Opening Unicode files?

Discussion in 'Perl Misc' started by Ilya Zakharevich, Dec 25, 2011.

  1. Does Perl ship with a simple method of opening Unicode files? E.g., I
    would like to have something like

    open my $fh, '< :BOM0or(utf8)', $filename

    where BOM0or does what Perl itself does for Perl files: it looks for the
    first 4 bytes; given that a Perl file starts in ASCII, one can detect
    BOMs, can detect UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE, or see that it
    is none of the above (then the argument in parens explains what to do;
    e.g., Perl itself does BOM0or(latin1)).

    Likewise, if one does not know that the file starts in ASCII, one can
    still detect BOM (which does not appear often in the encodings I know)
    so one could do :BOMor(utf8). Do not recollect seeing such support
    for files open()ed by Perl programs; is there?
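
For illustration, here is a minimal sketch of the requested "BOM or default" behaviour using only core Perl. The helper name open_bom_or and the BOM table are hypothetical and not an existing Perl API. The sketch assumes a seekable file and treats FF FE 00 00 as a UTF-32LE BOM rather than a UTF-16LE BOM followed by a NUL character.

```perl
use strict;
use warnings;

# BOM signatures, longest first so UTF-32LE is tried before UTF-16LE.
my @BOMS = (
    [ "\x00\x00\xFE\xFF", 'UTF-32BE' ],
    [ "\xFF\xFE\x00\x00", 'UTF-32LE' ],
    [ "\xEF\xBB\xBF",     'UTF-8'    ],
    [ "\xFE\xFF",         'UTF-16BE' ],
    [ "\xFF\xFE",         'UTF-16LE' ],
);

# Hypothetical helper: open $filename, pick the encoding from a BOM if one
# is present, otherwise fall back to $default. Returns (handle, encoding).
sub open_bom_or {
    my ($filename, $default) = @_;
    open my $fh, '<:raw', $filename or die "Cannot open $filename: $!";
    my $got = read $fh, my $head, 4;
    die "Cannot read $filename: $!" unless defined $got;

    my ($enc, $skip) = ($default, 0);
    for my $bom (@BOMS) {
        my ($sig, $name) = @$bom;
        if (substr($head, 0, length $sig) eq $sig) {
            ($enc, $skip) = ($name, length $sig);
            last;
        }
    }

    # Reposition just past the BOM (or at the start), then decode from there.
    seek $fh, $skip, 0 or die "Cannot seek in $filename: $!";
    binmode $fh, ":encoding($enc)" or die "Cannot set layer on $filename: $!";
    return ($fh, $enc);
}
```

A call would look like `my ($fh, $enc) = open_bom_or($filename, 'UTF-8');`. Because it seeks, this approach would not work on pipes or sockets.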

    Thanks,
    Ilya
    Ilya Zakharevich, Dec 25, 2011
    #1

  2. Ilya Zakharevich

    Guest

    On Sun, 25 Dec 2011 01:52:10 +0000 (UTC), Ilya Zakharevich
    <> wrote:

    >Does Perl ship with a simple method of opening Unicode files? E.g., I
    >would like to have something like
    >
    > open my $fh, '< :BOM0or(utf8)', $filename
    >
    >where BOM0or does what Perl itself does for Perl files: it looks for the
    >first 4 bytes; given that a Perl file starts in ASCII, one can detect
    >BOMs, can detect UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE, or see that it
    >is none of the above (then the argument in parens explains what to do;
    >e.g., Perl itself does BOM0or(latin1)).
    >
    >Likewise, if one does not know that the file starts in ASCII, one can
    >still detect BOM (which does not appear often in the encodings I know)
    >so one could do :BOMor(utf8). Do not recollect seeing such support
    >for files open()ed by Perl programs; is there?
    >
    >Thanks,
    >Ilya



    Here's what I use and it seems to do what's needed:

    use File::BOM qw( :all );

    # Open specified input file
    open(IFH, '<:via(File::BOM)', $IF)
        or die "Could NOT open specified file ($IF)!\n";
    , Dec 26, 2011
    #2

  3. On 2011-12-26, <> wrote:
    >>Does Perl ship with a simple method of opening Unicode files? E.g., I
    >>would like to have something like
    >>
    >> open my $fh, '< :BOM0or(utf8)', $filename
    >>
    >>where BOM0or does what Perl itself does for Perl files: it looks for the
    >>first 4 bytes; given that a Perl file starts in ASCII, one can detect
    >>BOMs, can detect UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE, or see that it
    >>is none of the above (then the argument in parens explains what to do;
    >>e.g., Perl itself does BOM0or(latin1)).


    Thinking about it more, there are 4 situations:

    a) we know that the first 2 characters in the file are 7-bit, and
    are not 0. Then read the first 2 bytes; if both 0, it is 32BE
    (possibly with [hardly legal] BOM); if BOM-BE, it is 16BE+BOM; if
    high bits are set, it is UTF-8+BOM; if the first byte is 0, it is
    16BE.

    One needs to read the other 2 bytes only if 32BE is detected (and
    only if one wants to guard against BOM), or if the second byte is
    0 - then it may be 16LE or 32LE.

    The only possible confusion is whether the file is actually in
    Unicode encoding, or in an 8-bit encoding (or between UTF-7 and
    UTF-8-no-BOMs).

    b) The only thing known is that the first 2 chars are not 0. Again,
    one reads 2 bytes - but now there is no way to detect UTF-8-BOM.

    c) The only thing known is that the first 2 chars are 7-bit. Then
    there is no way to detect BOMless UTF-16.

    d) General case: 8-bit chars may be present.

    It looks like the decision algorithms are DIFFERENT in these 4 cases;
    hence one needs 4 different "filters": One can call them BOM07, BOM08,
    BOM7, and BOM8.
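
As a rough sketch of situation (a) only, the decision could be written over the first 4 bytes as below. The function name sniff_case_a and its return convention are hypothetical. It also recognises the little-endian BOMs (FF FE and FF FE 00 00), which the list above does not mention, and it checks the full 3-byte UTF-8 BOM rather than just the high bits.

```perl
use strict;
use warnings;

# Hypothetical helper for situation (a): the first two characters are known
# to be 7-bit and not NUL. Takes up to 4 raw bytes from the start of the
# file; returns (encoding name or undef, number of BOM bytes to skip).
sub sniff_case_a {
    my ($head) = @_;
    return ('UTF-32BE', 4) if $head =~ /\A\x00\x00\xFE\xFF/;
    return ('UTF-32BE', 0) if $head =~ /\A\x00\x00/;          # 00 00 00 xx
    return ('UTF-16BE', 2) if $head =~ /\A\xFE\xFF/;
    return ('UTF-32LE', 4) if $head =~ /\A\xFF\xFE\x00\x00/;  # first char is not NUL
    return ('UTF-16LE', 2) if $head =~ /\A\xFF\xFE/;
    return ('UTF-8',    3) if $head =~ /\A\xEF\xBB\xBF/;
    return ('UTF-16BE', 0) if $head =~ /\A\x00[^\x00]/;       # 00 xx
    return ('UTF-32LE', 0) if $head =~ /\A[^\x00]\x00\x00\x00/;
    return ('UTF-16LE', 0) if $head =~ /\A[^\x00]\x00/;       # xx 00 yy 00
    return (undef, 0);    # 8-bit encoding or BOM-less UTF-8: caller decides
}

my ($enc, $bom_len) = sniff_case_a("\x00A\x00B");
print defined $enc ? "$enc (BOM bytes: $bom_len)\n" : "no Unicode signature\n";
```

The undef result corresponds to the "none of the above" branch, where the caller would apply its default (utf8, latin1, ...). The other three situations would need different tables.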

    > Here's what I use and it seems to do what's needed:
    >
    > use File::BOM qw( :all );


    And do you know from which version it is shipped with Perl?
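
One way to check this kind of question locally is Module::CoreList, which is itself bundled with reasonably recent perls. The sketch below is only an illustration; the helper name core_since is made up, and the answer depends on the data in the installed Module::CoreList.

```perl
use strict;
use warnings;
use Module::CoreList;

# Hypothetical helper: describe whether (and since which release) a module
# has been bundled with perl, according to the installed Module::CoreList.
sub core_since {
    my ($module) = @_;
    my $release = Module::CoreList->first_release($module);
    return defined $release ? "in core since perl $release" : "not in core";
}

print "$_: ", core_since($_), "\n" for qw(Encode::Guess File::BOM);
```

The `corelist` command-line tool that comes with the same module gives the same information.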

    > # Open specified input file
    > open(IFH, '<:via(File::BOM)', $IF)
    >     or die "Could NOT open specified file ($IF)!\n";


    Do not see how this may be related: I see no way to inform the filter
    about what is known in advance...

    Thanks,
    Ilya
    Ilya Zakharevich, Dec 27, 2011
    #3
  4. On 2011-12-27, Ben Morrow <> wrote:
    > Encode::Guess, which can be invoked as
    >
    > open my $fh, '< :encoding(Guess)', $filename
    >
    > Somewhat annoyingly, you have to explicitly use Encode::Guess or it
    > won't recognise the encoding name, and you have to use
    > Encode::Guess->set_suspects to set the list of encodings to try.


    Same question as to the other answer: does it ship with Perl? And I
    do not want any guessing; I want a very deterministic procedure...
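
For reference, Encode::Guess is part of the Encode distribution that is bundled with Perl. A minimal sketch of using it on raw bytes with no extra suspects follows. The helper name decode_by_guess is hypothetical. As far as I can tell, the module checks for a BOM first and only then falls back to heuristics, so this is not the strictly deterministic procedure asked for.

```perl
use strict;
use warnings;
use Encode::Guess;    # exports guess_encoding

# Hypothetical helper: decode $octets using Encode::Guess with its default
# suspects only. Returns undef when the module cannot decide.
sub decode_by_guess {
    my ($octets) = @_;
    my $decoder = guess_encoding($octets);
    return unless ref $decoder;    # on failure an error string is returned
    return $decoder->decode($octets);
}

my $text = decode_by_guess("\xFE\xFF\x00h\x00i");    # UTF-16 with a BE BOM
print defined $text ? "decoded: $text\n" : "could not decide\n";
```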

    Thanks,
    Ilya
    Ilya Zakharevich, Dec 27, 2011
    #4
  5. Ilya Zakharevich

    Guest

    On Tue, 27 Dec 2011 21:19:00 +0000 (UTC), Ilya Zakharevich
    <> wrote:

    >On 2011-12-27, Ben Morrow <> wrote:
    >> Encode::Guess, which can be invoked as
    >>
    >> open my $fh, '< :encoding(Guess)', $filename
    >>
    >> Somewhat annoyingly, you have to explicitly use Encode::Guess or it
    >> won't recognise the encoding name, and you have to use
    >> Encode::Guess->set_suspects to set the list of encodings to try.

    >
    >Same question as to the other answer: does it ship with Perl? And I
    >do not want any guessing; I want a very deterministic procedure...
    >
    >Thanks,
    >Ilya



    Do as all perl mongers do - use CPAN to locate, download and install
    the needed function.

    $>perl -MCPAN -e shell

    A similar source is available from ActiveState for Windows.
    , Dec 28, 2011
    #5
  6. On 2011-12-28, <> wrote:
    >>Same question as to the other answer: does it ship with Perl? And I
    >>do not want any guessing; I want a very deterministic procedure...


    > Do as all perl mongers do - use CPAN to locate, download and install
    > the needed function.
    >
    > $>perl -MCPAN -e shell


    I never do "as all perl mongers do". Neither, I expect, do users of
    my code.

    Hope this helps,
    Ilya
    Ilya Zakharevich, Dec 31, 2011
    #6
  7. Ilya Zakharevich

    Tim McDaniel Guest

    In article <>,
    <> wrote:
    >On Tue, 27 Dec 2011 21:19:00 +0000 (UTC), Ilya Zakharevich
    ><> wrote:
    >>Same question as to the other answer: does it ship with Perl? And I
    >>do not want any guessing; I want a very deterministic procedure...

    >
    >Do as all perl mongers do - use CPAN to locate, download and install
    >the needed function.
    >
    >$>perl -MCPAN -e shell


    I was a maintainer of servers at previous jobs and could do that for
    the system. But not at my current job, and if I wanted to do it for a
    shared script, I don't know yet how receptive they would be to a
    request. It's why I "use constant" instead of a more modern and
    convenient module.

    --
    Tim McDaniel,
    Tim McDaniel, Jan 2, 2012
    #7
  8. Ilya Zakharevich

    Guest

    On Tuesday, December 27, 2011 2:19:00 PM UTC-7, Ilya Zakharevich wrote:
    > On 2011-12-27, Ben Morrow <> wrote:
    > > Encode::Guess, which can be invoked as
    > >
    > > open my $fh, '< :encoding(Guess)', $filename
    > >
    > > Somewhat annoyingly, you have to explicitly use Encode::Guess or it
    > > won't recognise the encoding name, and you have to use
    > > Encode::Guess->set_suspects to set the list of encodings to try.

    >
    > Same question as to the other answer: does it ship with Perl? And I
    > do not want any guessing; I want a very deterministic procedure...


    Ilya,

    I understand completely. I find that Encode::Guess is too unreliable for
    my purposes. I have a replacement version that is built on a statistical
    model derived from very large English-language corpora, which it gets
    right 99.79% of the time, including on conflicting 8-bit encodings. For
    example, it knows CP1252 from MacRoman from ISO-8859-1 from ISO-8859-15,
    etc. I have a working alpha version of the code, so if you are interested in this
    technique or wish to know more, please send me mail. You can fetch the
    alpha version from

    http://training.perl.com/scripts/Encode-Guess-Educated-0.03.tar.gz

    I'm having trouble with my PAUSE id, so it isn't on CPAN yet.

    Hope this helps, and do feel free to write. I never look here for anything,
    so am likely to miss a reply.

    --tom
    , Feb 15, 2012
    #8
