How to detect text file encoding in Perl

Discussion in 'Perl Misc' started by chaojen.chen, May 20, 2006.

  1. chaojen.chen

    chaojen.chen Guest

    Hello all,

    If I have a bunch of text files in the same directory and their
    encodings could be UTF-16, UTF-8, or ASCII, is there any way of quickly
    detecting the exact encoding of each of them?

    Thanks,

    Enoch Chen
     
    chaojen.chen, May 20, 2006
    #1

  2. chaojen.chen

    Anno Siegel Guest

    [Please don't top-post, and leave some attribution. Text re-arranged]
    Without even looking at it, I'd say a module with its name in all-caps
    is suspect. Supposing it is actually spelled that way.

    Anno
     
    Anno Siegel, May 20, 2006
    #2

  3. Yeah, it makes you think of creations like POSIX and CGI. ;-)
    It's not.
     
    Gunnar Hjalmarsson, May 20, 2006
    #3
  4. chaojen.chen

    Anno Siegel Guest

    Well, those are acronyms that weren't invented by the authors.

    If GUESS were an acronym, the module would be more than suspect of
    cutesiness.
    Good to know :)

    Anno
     
    Anno Siegel, May 20, 2006
    #4
  5. Forget "quickly": given an ASCII file, it is fundamentally impossible to
    tell that it is not UTF-8.

    If the UTF-16 starts with a BOM, you can distinguish UTF-8 from UTF-16 by
    examining the first two bytes.

    That said, Encode::Guess is probably your friend.
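
    A minimal sketch of the two ideas above, checking for a BOM and then
    falling back to Encode::Guess. The helper names sniff_encoding and
    sniff_file are hypothetical, and the labels returned are illustrative
    only. It assumes the files are small enough to slurp and that UTF-32
    does not occur (a UTF-32LE BOM would be reported as UTF-16LE here).

```perl
use strict;
use warnings;
use Encode::Guess;    # exports guess_encoding

# Best-effort label for a byte string: look for a BOM first, then let
# Encode::Guess choose between its default suspects.
sub sniff_encoding {
    my ($bytes) = @_;
    return 'UTF-8 with BOM'    if $bytes =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16BE with BOM' if $bytes =~ /\A\xFE\xFF/;
    return 'UTF-16LE with BOM' if $bytes =~ /\A\xFF\xFE/;

    my $guess = guess_encoding($bytes);
    # On success $guess is an encoding object; on failure it is an error string.
    return ref $guess ? $guess->name : "undetermined ($guess)";
}

# Hypothetical helper: read a whole file as raw bytes and sniff it.
sub sniff_file {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    local $/;    # slurp mode
    my $bytes = <$fh>;
    close $fh;
    $bytes = '' unless defined $bytes;    # empty file
    return sniff_encoding($bytes);
}
```

    Something like `print "$_: ", sniff_file($_), "\n" for glob '*.txt';`
    would then report on every text file in the current directory. A pure
    ASCII file is reported as ASCII only by convention, since it is equally
    valid UTF-8.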
     
    Brian McCauley, May 20, 2006
    #5
  6. chaojen.chen

    chaojen.chen Guest

    Brian McCauley wrote:
    Hello Brian,

    Thanks for your suggestion. And what does BOM stand for?

    Enoch
     
    chaojen.chen, May 21, 2006
    #6
  7. chaojen.chen

    Guest Guest

    wrote:
    : >
    : > If the utf16 starts with a BOM so you can distinguish utf8 and utf16 by
    : > examining the first two bytes.
    : >

    : Thanks for your suggestion. And what does BOM stand for?

    Google is probably your friend. If not: <B>yte <O>rder <M>ark.

    You frequently get a BOM at the beginning of your file if you store it
    on Windows with Notepad or similar editor simulations. If you choose to
    store your data as UTF-8, or your data _is_ UTF-8, you'll see that after
    storing the bytecount is two bytes more because the byte 0xff 0xef get
    prepended automatically, in order to tell the software which byte order
    is to be expected. This makes sense with UCS-2 Unicode (the "original"
    Unicode encoding) but not with UTF-8 (8-bit transformation format of
    Unicode) because the characters encoded in UTF-8 are self-synchronizing
    and no information about byte order is needed. In contrast, other programs
    behaving correctly frequently complain if the BOM appears where it simply
    doesn't belong.

    Oliver.
     
    Guest, May 21, 2006
    #7
  8. The BOM is the relevant encoding of the Unicode character U+FEFF. No
    way is it 0xff 0xef. The various encoded byte patterns are shown in
    that Unicode FAQ, and in utf-8 it's *three* bytes.
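
    To see the byte patterns for yourself, a short sketch using the core
    Encode module can print how U+FEFF comes out in each encoding scheme.
    The helper name bom_bytes is made up for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hex dump of U+FEFF as serialized in the named encoding scheme.
sub bom_bytes {
    my ($encoding) = @_;
    my $bytes = encode($encoding, "\x{FEFF}");
    return uc join ' ', unpack('(H2)*', $bytes);
}

printf "%-8s %s\n", $_, bom_bytes($_) for qw(UTF-8 UTF-16BE UTF-16LE);
# UTF-8    EF BB BF
# UTF-16BE FE FF
# UTF-16LE FF FE
```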
    "No, a BOM can be used as a signature no matter how the Unicode text
    is transformed"
    Yes, but "UCS-2" is out of date:
    http://www.unicode.org/faq/basic_q.html#23

    The utf-16 encoding form is its present counterpart.
    Nevertheless, the Unicode FAQ points out that utf-8 can usefully
    start with a BOM as an encoding signature.
    Except that it is not inherently incorrect for it to appear at the
    beginning of a utf-8 stream - but see the cited FAQ for details.

    Seems to me you would have done well to read that FAQ yourself, before
    putting misleading opinions on the record.

    regards
     
    Alan J. Flavell, May 21, 2006
    #8
  9. chaojen.chen

    Guest Guest

    : > (Oliver's erroneous statement:)
    : > storing the bytecount is two bytes more because the byte 0xff 0xef get
    : > prepended automatically,

    : The BOM is the relevant encoding of the Unicode character U+FEFF. No
    : way is it 0xff 0xef.

    Oops, I goofed up here, and the twisted order shows exactly what a byte
    order mark is good for. Just imagine this would have been transmitted as
    UCS-2, in Big Endian order.

    : The various encoded byte patterns are shown in
    : that Unicode FAQ, and in utf-8 it's *three* bytes.

    Again, my fault. Shouldn't post when I'm too tired.

    : > This makes sense with UCS-2 Unicode (the "original" Unicode
    : > encoding)

    : Yes, but "UCS-2" is out of date:
    : http://www.unicode.org/faq/basic_q.html#23

    But several (notably MS-based) applications still allow the user to choose
    UCS-2, UTF-8 _and_ Unicode.

    : > but not with UTF-8 (8-bit transformation format of Unicode) because
    : > the characters encoded in UTF-8 are self-synchronizing and no
    : > information about byte order is needed.

    : Nevertheless, the Unicode FAQ points out that utf-8 can usefully
    : start with a BOM as an encoding signature.

    The FAQ says so, but...

    : > In contrast, other programs behaving correctly frequently complain
    : > if the BOM appears where it simply doesn't belong.

    : Except that it is not inherently incorrect for it to appear at the
    : beginning of a utf-8 stream - but see the cited FAQ for details.

    But my experience (with shell scripts, interpretation of shebang lines
    of perl scripts, etc.) runs to the contrary. A UTF-8-encoded file _with_
    BOM causes unnecessary hiccups, even if this is against the formal spec.

    : Seems to me you would have done well to read that FAQ yourself, before
    : putting misleading opinions on the record.

    Sorry, I should have consulted the FAQ, but I stand by my negative experiences
    with superfluous BOMs.

    Oliver.
     
    Guest, May 21, 2006
    #9
  10. [re. my cite of http://www.unicode.org/faq/utf_bom.html#BOM ]
    Which is pretty much the point that the cited BOM FAQ makes, at
    http://www.unicode.org/faq/utf_bom.html#29 , and that was my primary
    reason for that suggestion to "see the cited FAQ for details".

    regards
     
    Alan J. Flavell, May 21, 2006
    #10
  11. That's a curious statement, given that UCS-2 and UTF-8 are parts of the
    Unicode standard. (UTF-16 is, too, BTW)

    [...]
    That's because a BOM isn't just a Byte Order Mark - It is a valid
    character (Zero Width No-Break Space). Of course inserting a Zero Width
    No-Break Space at the beginning of a file isn't against UTF-8 rules,
    just like inserting a normal space at the beginning of a file isn't
    against UTF-8 rules. But it is against the rules for Unix scripts: The
    first character must be a hash sign, not a space (zero width or not).
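
    When reading such a file in Perl, one common workaround is to drop a
    single leading U+FEFF after decoding. This is only a sketch; the
    function name decode_utf8_without_bom is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Decode UTF-8 bytes and remove one leading U+FEFF, if there is one.
# Any U+FEFF later in the text is left alone.
sub decode_utf8_without_bom {
    my ($bytes) = @_;
    my $text = decode('UTF-8', $bytes);
    $text =~ s/\A\x{FEFF}//;
    return $text;
}
```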

    hp
     
    Peter J. Holzer, May 21, 2006
    #11
  12. Well, every ASCII file is also UTF-8, but not vice versa.

    Or, phrased differently, if you can decode a file as UTF-8 and all
    characters have code less than 128, it is ASCII.
    You will probably also find a lot of zero bytes in an UTF-16 coded text
    file which is unlikely in a UTF-8 coded text file.
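
    These heuristics can be sketched roughly as follows. The function name
    classify_bytes and its return labels are invented for illustration, and
    the zero-byte test is only a hint, not proof of UTF-16.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Rough classification of a byte string using the heuristics above.
sub classify_bytes {
    my ($bytes) = @_;

    # Zero bytes are common in UTF-16 text and rare in UTF-8 text.
    return 'probably UTF-16' if $bytes =~ /\x00/;

    # Strict decode; work on a copy because decode with a CHECK
    # argument may modify its input.
    my $copy = $bytes;
    my $text = eval { decode('UTF-8', $copy, FB_CROAK) };
    return 'not UTF-8' unless defined $text;

    # Valid UTF-8 whose characters are all below 128 is plain ASCII.
    return $text =~ /[^\x00-\x7F]/ ? 'UTF-8' : 'ASCII';
}
```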

    hp
     
    Peter J. Holzer, May 21, 2006
    #12
  13. Oh, quite. But one could hardly expect MS to conform to someone
    else's specifications, hmmm?[1] AIUI, when they say "Unicode", they
    actually mean UTF-16, stored in little-endian format with BOM.

    (N.B. one cannot call that UTF-16LE, because UTF-16LE or BE are
    forbidden to start with a BOM. Hope that's clear?).
    At risk of being pedantic: that character cannot be at one and the
    same time a BOM and a ZWNBSP: it's either one or the other. If it's
    not at the beginning, it can only be a ZWNBSP. If it *is* at the
    beginning, it's a matter of convention whether it's a BOM or a ZWNBSP.
    But see the FAQ, http://www.unicode.org/faq/utf_bom.html#27 etc. for a
    better explanation.

    regards

    [1] I see that wackypedia has a note about that:
    http://en.wikipedia.org/wiki/Embrace,_extend_and_extinguish
     
    Alan J. Flavell, May 22, 2006
    #13
  14. Let's hope that the questioner really does understand that ASCII is a
    7-bit code. There seems to be a substantial number of non-specialists
    who still believe in some mythical 8-bit "ASCII" (or "extended ASCII")
    code -- when I've been able to draw them out further on what this
    mythical code might be, it seems to mean different things to different
    people - some believe it to be what I'd know as CP437 the US national
    DOS codepage, some evidently think it means the "multinational" DOS
    codepage CP850, while yet others think it's a synonym for the equally
    mythical "ANSI"* encoding, which in reality is the MS proprietary
    Windows-1252 code - quite different from the DOS encodings.

    *)ANSI never did publish their own 8-bit encoding of this kind - they
    adopted ISO instead.

    In truth, all of these ASCII-*based* 8-bit encodings have their own
    proper names, and none of them has any right to be the mythical 8-bit
    "ASCII".

    Well, every (properly so called) ASCII file is also valid utf-8, as
    you say. But it's also valid iso-8859-x for your choice of x, or
    windows-125y for your choice of y, and so on.
    Indeed.

    And, in practice, if you have a body of plausible text content in
    iso-8859-1, or Windows-1252, containing a non-trivial number of bytes
    above 127, then it is extremely unlikely to look like valid utf-8.
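
    The reason is that in UTF-8 every byte of 0xC0 or above must be followed
    by one or more continuation bytes in the range 0x80-0xBF, whereas in
    Latin-1 text an accented letter is usually followed by an ordinary ASCII
    letter. A small sketch (the name is_valid_utf8 is hypothetical) makes the
    point:

```perl
use strict;
use warnings;
use Encode qw(encode decode FB_CROAK);

# True if the byte string decodes cleanly as strict UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode with a CHECK argument may modify its input
    return eval { decode('UTF-8', $copy, FB_CROAK); 1 } ? 1 : 0;
}

my $text   = "na\x{EF}ve caf\x{E9}";          # two accented letters
my $latin1 = encode('ISO-8859-1', $text);     # one byte per character
my $utf8   = encode('UTF-8', $text);          # two bytes per accented letter

print "Latin-1 bytes valid as UTF-8? ", is_valid_utf8($latin1), "\n";   # 0
print "UTF-8 bytes valid as UTF-8?   ", is_valid_utf8($utf8),   "\n";   # 1
```

    The unlikely exceptions are Latin-1 sequences such as 0xC3 0xA9, which
    happen to form a well-formed UTF-8 character.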
     
    Alan J. Flavell, May 22, 2006
    #14
  15. [A complimentary Cc of this posting was NOT [per weedlist] sent to
    Peter J. Holzer
    True; but also note what the standard says:

    * use as an indication of non-breaking is deprecated; see 2060 instead

    Hope this helps,
    Ilya
     
    Ilya Zakharevich, May 22, 2006
    #15
  16. [A complimentary Cc of this posting was sent to
    Alan J. Flavell
    ASCII is not "a 7-bit code".
    To the contrary. There seems to be a substantial number of
    non-specialists who still believe that the term "ASCII" has some
    unique meaning nowadays. It does not.
    Exactly. This is what ASCII means today: the default "legacy"
    encoding of the given system (probably "in the given COUNTRY setting"
    too, whatever it means); it must be compatible with ANSI's 7-bit
    encoding in its first half. In practice this means one of cp437,
    cp850, or cp125[1-8] (maybe cp1004 too?). Details are clear (if any)
    from context only.

    Hope this helps,
    Ilya
     
    Ilya Zakharevich, May 22, 2006
    #16
  17. [A complimentary Cc of this posting was sent to
    Alan J. Flavell
    To make it clear: UTF-16 is ALWAYS stored with BOM. And it is always
    stored in one of LE or BE schemes. So it is "a LE-variant of UTF-16",
    nothing non-standard-conforming. It is unfortunate indeed that the
    standards do not have a pre-agreed-to name for the variants.

    Hope this helps,
    Ilya
     
    Ilya Zakharevich, May 22, 2006
    #17
  18. Ilya> ASCII is not "a 7-bit code".

    Ilya> To the contrary. There seems to be a substantial number of
    Ilya> non-specialists who still believe that the term "ASCII" has some
    Ilya> unique meaning nowadays. It does not.

    Wikipedia *seriously* disagrees with you:

    <http://en.wikipedia.org/wiki/ASCII>.

    Maybe it's a regional interpretation.

    --
    Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
    <> <URL:http://www.stonehenge.com/merlyn/>
    Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
    See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!

    *** ***
     
    Randal L. Schwartz, May 22, 2006
    #18
  19. Unicode specifies a layering (see chapter 2), consisting of three
    "Encoding Forms", and seven "Encoding Schemes".

    Confusingly, UTF-16 is not only the name of one of the encoding forms,
    but is also the name of one of that form's three encoding schemes.
    Given octet-oriented storage, how else would one store 16-bit units?
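
    Perl's Encode module uses the same scheme names, and a short sketch shows
    the practical difference: with the label "UTF-16" a leading BOM is
    consumed and selects the byte order, while with "UTF-16LE" the same two
    bytes are decoded as an ordinary U+FEFF character. The variable names
    here are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "A" in little-endian UTF-16, preceded by a BOM (FF FE).
my $bytes = "\xFF\xFE" . "A\x00";

my $via_utf16   = decode('UTF-16',   $bytes);   # BOM consumed
my $via_utf16le = decode('UTF-16LE', $bytes);   # BOM kept as U+FEFF

printf "UTF-16:   %d character(s)\n", length $via_utf16;     # 1
printf "UTF-16LE: %d character(s)\n", length $via_utf16le;   # 2
```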
     
    Alan J. Flavell, May 22, 2006
    #19
  20. [A complimentary Cc of this posting was sent to
    Randal L. Schwartz
    Wikipedia is great as a read-and-do-the-opposite tool (but some parts
    of it have become much better in just one year).
    But anyway, ANY dictionary should decide which of two functions,
    prescription or description, it should serve. As an alternative, it
    might mark each entry/section with an appropriate mode-descriptor.

    However, this entry is obviously written in prescription mode, but that
    is nowhere indicated; nor is it written that the most common usage
    deviates a lot from this wishful-thinking description.

    [My wishful thinking is the same as for the authors of this entry;
    the difference is that I understand that it is hopeless to fight the
    "new wave" of M$/Apple-derived jargon.]

    Hope this helps,
    Ilya
     
    Ilya Zakharevich, May 22, 2006
    #20