How to detect text file encoding in Perl

Discussion in 'Perl Misc' started by chaojen.chen@gmail.com, May 20, 2006.

  1. Guest

    Hello all,

    If I have a bunch of text files in the same directory and their
    encodings could be UTF-16, UTF-8, or ASCII, is there any way of quickly
    detecting the exact encoding of each of them?

    Thanks,

    Enoch Chen
    , May 20, 2006
    #1

  2. Anno Siegel Guest

    cnhackTNT <> wrote in comp.lang.perl.misc:

    [Please don't top-post, and leave some attribution. Text re-arranged]

    > > Hello all,
    > >
    > > If I have a bunch of text files in the same directory and their
    > > encodings could be UTF-16, UTF-8, or ASCII, is there any way of quickly
    > > detecting the exact encoding of each of them?


    > Maybe Encode::GUESS could help :)


    Without even looking at it, I'd say a module with its name in all-caps
    is suspect. Supposing it is actually spelled that way.

    Anno
    --
    If you want to post a followup via groups.google.com, don't use
    the broken "Reply" link at the bottom of the article. Click on
    "show options" at the top of the article, then click on the
    "Reply" at the bottom of the article headers.
    Anno Siegel, May 20, 2006
    #2

  3. Anno Siegel wrote:
    > cnhackTNT <> wrote in comp.lang.perl.misc:
    >>
    >>Maybe Encode::GUESS could help :)

    >
    > Without even looking at it, I'd say a module with its name in all-caps
    > is suspect.


    Yeah, it makes you think of creations like POSIX and CGI. ;-)

    > Supposing it is actually spelled that way.


    It's not.

    --
    Gunnar Hjalmarsson
    Email: http://www.gunnar.cc/cgi-bin/contact.pl
    Gunnar Hjalmarsson, May 20, 2006
    #3
  4. Anno Siegel Guest

    Gunnar Hjalmarsson <> wrote in comp.lang.perl.misc:
    > Anno Siegel wrote:
    > > cnhackTNT <> wrote in comp.lang.perl.misc:
    > >>
    > >>Maybe Encode::GUESS could help :)

    > >
    > > Without even looking at it, I'd say a module with its name in all-caps
    > > is suspect.

    >
    > Yeah, it makes you think of creations like POSIX and CGI. ;-)


    Well, those are acronyms that weren't invented by the authors.

    If GUESS were an acronym, the module would be more than suspect of
    cutesiness.

    > > Supposing it is actually spelled that way.

    >
    > It's not.


    Good to know :)

    Anno
    Anno Siegel, May 20, 2006
    #4
  5. wrote:
    > If I have a bunch of text files in the same directory and their
    > encodings could be UTF-16, UTF-8, or ASCII, is there any way of quickly
    > detecting the exact encoding of each of them?


    Forget quickly: given an ASCII file, it is fundamentally impossible to
    tell that it is not UTF-8.

    If the UTF-16 file starts with a BOM, you can distinguish UTF-8 from
    UTF-16 by examining the first two bytes.

    That said, Encode::Guess is probably your friend.
    Brian McCauley, May 20, 2006
    #5
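
A minimal sketch of the Encode::Guess suggestion, for readers who want to try it. It relies on `guess_encoding()`, which Encode::Guess exports and which by default considers ASCII, UTF-8, and BOM-marked UTF-16/32. The helper name `guess_name` and the command-line loop are assumptions made for illustration, not something posted in the thread.

```perl
use strict;
use warnings;
use Encode::Guess;    # exports guess_encoding()

# Return the encoding name Encode::Guess settles on for a byte string,
# or undef when it cannot decide. guess_name() is a hypothetical helper.
sub guess_name {
    my ($octets) = @_;
    my $enc = guess_encoding($octets);
    # On success an encoding object comes back; on failure, an error string.
    return ref($enc) ? $enc->name : undef;
}

# Report a guess for every file named on the command line.
for my $path (@ARGV) {
    open my $fh, '<:raw', $path or do { warn "$path: $!\n"; next };
    my $octets = do { local $/; <$fh> };    # slurp the raw bytes
    close $fh;
    my $name = guess_name($octets);
    printf "%s: %s\n", $path, defined $name ? $name : 'undetermined';
}
```

A pure ASCII file is reported as ASCII even though it is equally valid UTF-8, which is the ambiguity described in the post above.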
  6. Guest

    Brian McCauley wrote:

    > wrote:
    > > If I have a bunch of text files in the same directory and their
    > > encodings could be UTF-16, UTF-8, or ASCII, is there any way of quickly
    > > detecting the exact encoding of each of them?

    >
    > Forget quickly: given an ASCII file, it is fundamentally impossible to
    > tell that it is not UTF-8.
    >
    > If the UTF-16 file starts with a BOM, you can distinguish UTF-8 from
    > UTF-16 by examining the first two bytes.
    >
    > That said, Encode::Guess is probably your friend.


    Hello Brian,

    Thanks for your suggestion. And what does BOM stand for?

    Enoch
    , May 21, 2006
    #6
  7. Guest Guest

    wrote:
    : >
    : > If the UTF-16 file starts with a BOM, you can distinguish UTF-8 from
    : > UTF-16 by examining the first two bytes.
    : >

    : Thanks for your suggestion. And what does BOM stand for?

    Google is probably your friend. If not: <B>yte <O>rder <M>ark.

    You frequently get a BOM at the beginning of your file if you store it
    on Windows with Notepad or similar editor simulations. If you choose to
    store your data as UTF-8, or your data _is_ UTF-8, you'll see that after
    storing the bytecount is two bytes more because the byte 0xff 0xef get
    prepended automatically, in order to tell the software which byte order
    is to be expected. This makes sense with UCS-2 Unicode (the "original"
    Unicode encoding) but not with UTF-8 (8-bit transformation format of
    Unicode) because the characters encoded in UTF-8 are self-synchronizing
    and no information about byte order is needed. In contrast, other programs
    behaving correctly frequently complain if the BOM appears where it simply
    doesn't belong.

    Oliver.


    --
    Dr. Oliver Corff e-mail: -berlin.de
    Guest, May 21, 2006
    #7
  8. On Sun, 21 May 2006, -berlin.de wrote:

    > Google is probably your friend. If not: <B>yte <O>rder <M>ark.


    http://www.unicode.org/faq/utf_bom.html#BOM

    > store your data as UTF-8, or your data _is_ UTF-8, you'll see that after
    > storing the bytecount is two bytes more because the byte 0xff 0xef get
    > prepended automatically,


    The BOM is the relevant encoding of the Unicode character U+FEFF. No
    way is it 0xff 0xef. The various encoded byte patterns are shown in
    that Unicode FAQ, and in utf-8 it's *three* bytes.

    > in order to tell the software which byte order is to be expected.


    "No, a BOM can be used as a signature no matter how the Unicode text
    is transformed"

    > This makes sense with UCS-2 Unicode (the "original" Unicode
    > encoding)


    Yes, but "UCS-2" is out of date:
    http://www.unicode.org/faq/basic_q.html#23

    The utf-16 encoding form is its present counterpart.

    > but not with UTF-8 (8-bit transformation format of Unicode) because
    > the characters encoded in UTF-8 are self-synchronizing and no
    > information about byte order is needed.


    Nevertheless, the Unicode FAQ points out that utf-8 can usefully
    start with a BOM as an encoding signature.

    > In contrast, other programs behaving correctly frequently complain
    > if the BOM appears where it simply doesn't belong.


    Except that it is not inherently incorrect for it to appear at the
    beginning of a utf-8 stream - but see the cited FAQ for details.

    Seems to me you would have done well to read that FAQ yourself, before
    putting misleading opinions on the record.

    regards

    --

    Beware of negative easements.
    Alan J. Flavell, May 21, 2006
    #8
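
The byte patterns mentioned above can be checked directly, since the BOM is just U+FEFF serialized in the encoding at hand. The sketch below uses `encode()` from the core Encode module; the helpers `bom_hex` and `sniff_bom` are hypothetical names introduced here for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hex dump of U+FEFF as serialized by the named encoding.
sub bom_hex {
    my ($encoding) = @_;
    return join ' ', map { sprintf '%02X', $_ }
        unpack 'C*', encode($encoding, "\x{FEFF}");
}

# Report which BOM, if any, a byte string starts with (hypothetical helper).
# Note: a UTF-32LE BOM (FF FE 00 00) would also match the UTF-16LE test.
sub sniff_bom {
    my ($octets) = @_;
    return 'UTF-8'    if $octets =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16BE' if $octets =~ /\A\xFE\xFF/;
    return 'UTF-16LE' if $octets =~ /\A\xFF\xFE/;
    return;
}

# UTF-8 gives three bytes (EF BB BF); the UTF-16 forms give two each.
printf "%-8s %s\n", $_, bom_hex($_) for qw(UTF-8 UTF-16BE UTF-16LE);
```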
  9. Guest Guest

    Alan J. Flavell <> wrote:
    : > (Oliver's erroneous statement:)
    : > storing the bytecount is two bytes more because the byte 0xff 0xef get
    : > prepended automatically,

    : The BOM is the relevant encoding of the Unicode character U+FEFF. No
    : way is it 0xff 0xef.

    Oops, I goofed up here, and the twisted order shows exactly what a byte
    order mark is good for. Just imagine this would have been transmitted as
    UCS-2, in Big Endian order.

    : The various encoded byte patterns are shown in
    : that Unicode FAQ, and in utf-8 it's *three* bytes.

    Again, my fault. Shouldn't post when I'm too tired.

    : > This makes sense with UCS-2 Unicode (the "original" Unicode
    : > encoding)

    : Yes, but "UCS-2" is out of date:
    : http://www.unicode.org/faq/basic_q.html#23

    But several (notably MS-based) applications still allow the user to choose
    UCS-2, UTF-8 _and_ Unicode.

    : > but not with UTF-8 (8-bit transformation format of Unicode) because
    : > the characters encoded in UTF-8 are self-synchronizing and no
    : > information about byte order is needed.

    : Nevertheless, the Unicode FAQ points out that utf-8 can usefully
    : start with a BOM as an encoding signature.

    The FAQ says so, but...

    : > In contrast, other programs behaving correctly frequently complain
    : > if the BOM appears where it simply doesn't belong.

    : Except that it is not inherently incorrect for it to appear at the
    : beginning of a utf-8 stream - but see the cited FAQ for details.

    But my experience (with shell scripts, interpretation of shebang lines
    of perl scripts, etc.) runs to the contrary. A UTF-8-encoded file _with_
    BOM causes unnecessary hiccups, even if this is against the formal spec.

    : Seems to me you would have done well to read that FAQ yourself, before
    : putting misleading opinions on the record.

    Sorry, I should have consulted the FAQ, but I stand by my negative experiences
    with superfluous BOMs.

    Oliver.

    --
    Dr. Oliver Corff e-mail: -berlin.de
    Guest, May 21, 2006
    #9
  10. On Sun, 21 May 2006, -berlin.de wrote:

    > Alan J. Flavell <> wrote:


    [re. my cite of http://www.unicode.org/faq/utf_bom.html#BOM ]

    > : Except that it is not inherently incorrect for it to appear at the
    > : beginning of a utf-8 stream - but see the cited FAQ for details.
    >
    > But my experience (with shell scripts, interpretation of shebang
    > lines of perl scripts, etc.) runs to the contrary. A UTF-8-encoded
    > file _with_ BOM causes unnecessary hiccups, even if this is against
    > the formal spec.


    Which is pretty much the point that the cited BOM FAQ makes, at
    http://www.unicode.org/faq/utf_bom.html#29 , and that was my primary
    reason for that suggestion to "see the cited FAQ for details".

    regards
    Alan J. Flavell, May 21, 2006
    #10
  11. -berlin.de wrote:
    > Alan J. Flavell <> wrote:
    > : > This makes sense with UCS-2 Unicode (the "original" Unicode
    > : > encoding)
    >
    > : Yes, but "UCS-2" is out of date:
    > : http://www.unicode.org/faq/basic_q.html#23
    >
    > But several (notably MS-based) applications still allow the user to choose
    > UCS-2, UTF-8 _and_ Unicode.


    That's a curious statement, given that UCS-2 and UTF-8 are parts of the
    Unicode standard. (UTF-16 is, too, BTW)

    [...]
    > : > In contrast, other programs behaving correctly frequently complain
    > : > if the BOM appears where it simply doesn't belong.
    >
    > : Except that it is not inherently incorrect for it to appear at the
    > : beginning of a utf-8 stream - but see the cited FAQ for details.
    >
    > But my experience (with shell scripts, interpretation of shebang lines
    > of perl scripts, etc.) runs to the contrary. A UTF-8-encoded file _with_
    > BOM causes unnecessary hiccups, even if this is against the formal spec.


    That's because a BOM isn't just a Byte Order Mark - It is a valid
    character (Zero Width No-Break Space). Of course inserting a Zero Width
    No-Break Space at the beginning of a file isn't against UTF-8 rules,
    just like inserting a normal space at the beginning of a file isn't
    against UTF-8 rules. But it is against the rules for Unix scripts: The
    first character must be a hash sign, not a space (zero width or not).

    hp

    --
    _ | Peter J. Holzer | Man könnte sich [die Diskussion] auch
    |_|_) | Sysadmin WSR/LUGA | sparen, wenn man sie sich einfach sparen
    | | | | würde.
    __/ | http://www.hjp.at/ | -- Ralph Angenendt in dang 2006-04-15
    Peter J. Holzer, May 21, 2006
    #11
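
For scripts that trip over a leading BOM, one workaround is to strip it from the raw bytes before the file is used. This is a minimal sketch, assuming the file is UTF-8 and has been read as bytes; `strip_utf8_bom` is a hypothetical helper, not something from the thread.

```perl
use strict;
use warnings;

# Remove a leading UTF-8 BOM (EF BB BF) from a byte string, if present.
# Only the start of the string is touched; U+FEFF elsewhere is left alone.
sub strip_utf8_bom {
    my ($octets) = @_;
    $octets =~ s/\A\xEF\xBB\xBF//;
    return $octets;
}

my $script = "\xEF\xBB\xBF#!/usr/bin/perl\nprint qq{hi\\n};\n";
my $clean  = strip_utf8_bom($script);
print "first character is now '", substr($clean, 0, 1), "'\n";
```

After stripping, the first byte is the hash sign that the kernel expects at the start of a shebang line.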
  12. Brian McCauley wrote:
    > wrote:
    >> If I have a bunch of text files in the same directory and their
    >> encodings could be UTF-16, UTF-8, or ASCII, is there any way of quickly
    >> detecting the exact encoding of each of them?

    >
    > Forget quickly: given an ASCII file, it is fundamentally impossible to
    > tell that it is not UTF-8.


    Well, every ASCII file is also UTF-8, but not vice versa.

    Or, phrased differently, if you can decode a file as UTF-8 and all
    characters have code less than 128, it is ASCII.

    > If the UTF-16 file starts with a BOM, you can distinguish UTF-8 from
    > UTF-16 by examining the first two bytes.


    You will probably also find a lot of zero bytes in a UTF-16 encoded text
    file, which is unlikely in a UTF-8 encoded text file.

    hp

    Peter J. Holzer, May 21, 2006
    #12
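
Both heuristics in the post above are cheap to compute on the raw bytes. The sketch below is illustrative only; the helper names `is_ascii` and `nul_fraction` are assumptions, and the threshold at which a NUL fraction should count as "a lot" is left to the reader.

```perl
use strict;
use warnings;

# True if every byte is below 128, i.e. the data is ASCII (and so also
# valid UTF-8). BOM-less UTF-16 of plain English text passes this test
# too, so check for NUL bytes first.
sub is_ascii {
    my ($octets) = @_;
    return $octets !~ /[^\x00-\x7F]/ ? 1 : 0;
}

# Fraction of bytes that are NUL: close to 0 for ASCII or UTF-8 text,
# around one half for UTF-16 text made mostly of ASCII characters.
sub nul_fraction {
    my ($octets) = @_;
    return 0 unless length $octets;
    my $nuls = ($octets =~ tr/\x00//);    # tr counts matches without changing data
    return $nuls / length($octets);
}

my $utf16le = "h\x00i\x00";
printf "NUL fraction: %.2f\n", nul_fraction($utf16le);
```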
  13. On Mon, 22 May 2006, Peter J. Holzer wrote:

    > -berlin.de wrote:
    >
    > > But several (notably MS-based) applications still allow the user
    > > to choose UCS-2, UTF-8 _and_ Unicode.

    >
    > That's a curious statement, given that UCS-2 and UTF-8 are parts of
    > the Unicode standard. (UTF-16 is, too, BTW)


    Oh, quite. But one could hardly expect MS to conform to someone
    else's specifications, hmmm?[1] AIUI, when they say "Unicode", they
    actually mean UTF-16, stored in little-endian format with BOM.

    (N.B. one cannot call that UTF-16LE, because UTF-16LE or BE are
    forbidden to start with a BOM. Hope that's clear?).

    > That's because a BOM isn't just a Byte Order Mark - It is a valid
    > character (Zero Width No-Break Space).


    At risk of being pedantic: that character cannot be at one and the
    same time a BOM and a ZWNBSP: it's either one or the other. If it's
    not at the beginning, it can only be a ZWNBSP. If it *is* at the
    beginning, it's a matter of convention whether it's a BOM or a ZWNBSP.
    But see the FAQ, http://www.unicode.org/faq/utf_bom.html#27 etc. for a
    better explanation.

    regards

    [1] I see that wackypedia has a note about that:
    http://en.wikipedia.org/wiki/Embrace,_extend_and_extinguish
    Alan J. Flavell, May 22, 2006
    #13
  14. On Mon, 22 May 2006, Peter J. Holzer wrote:

    > Brian McCauley wrote:
    > > wrote:
    > >> If I have a bunch of text files in the same directory and their
    > >> encodings could be UTF-16, UTF-8, or ASCII, is there any way of
    > >> quickly detecting the exact encoding of each of them?

    > >
    > > Forget quickly: given an ASCII file, it is fundamentally impossible
    > > to tell that it is not UTF-8.

    >
    > Well, every ASCII file is also UTF-8, but not vice versa.


    Let's hope that the questioner really does understand that ASCII is a
    7-bit code. There seems to be a substantial number of non-specialists
    who still believe in some mythical 8-bit "ASCII" (or "extended ASCII")
    code -- when I've been able to draw them out further on what this
    mythical code might be, it seems to mean different things to different
    people - some believe it to be what I'd know as CP437 the US national
    DOS codepage, some evidently think it means the "multinational" DOS
    codepage CP850, while yet others think it's a synonym for the equally
    mythical "ANSI"* encoding, which in reality is the MS proprietary
    Windows-1252 code - quite different from the DOS encodings.

    *)ANSI never did publish their own 8-bit encoding of this kind - they
    adopted ISO instead.

    In truth, all of these ASCII-*based* 8-bit encodings have their own
    proper names, and none of them has any right to be the mythical 8-bit
    "ASCII".

    Well, every (properly so called) ASCII file is also valid utf-8, as
    you say. But it's also valid iso-8859-x for your choice of x, or
    windows-125y for your choice of y, and so on.

    > Or, phrased differently, if you can decode a file as UTF-8 and all
    > characters have code less than 128, it is ASCII.


    Indeed.

    And, in practice, if you have a body of plausible text content in
    iso-8859-1, or Windows-1252, containing a non-trivial number of bytes
    above 127, then it is extremely unlikely to look like valid utf-8.
    Alan J. Flavell, May 22, 2006
    #14
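
The last point can be tried out with a strict UTF-8 decode. This is a sketch under assumptions: the sample text and the helper name `is_valid_utf8` are made up for illustration, while `decode()`, `FB_CROAK`, and `LEAVE_SRC` come from the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(encode decode FB_CROAK LEAVE_SRC);

# True if the byte string is well-formed UTF-8 (hypothetical helper).
# FB_CROAK makes decode() die on the first malformed sequence.
sub is_valid_utf8 {
    my ($octets) = @_;
    return eval { decode('UTF-8', $octets, FB_CROAK | LEAVE_SRC); 1 } ? 1 : 0;
}

# Two accented words (sample text is an assumption), serialized two ways.
my $text   = "caf\x{E9} na\x{EF}ve";
my $latin1 = encode('iso-8859-1', $text);
my $utf8   = encode('UTF-8', $text);

printf "ISO-8859-1 bytes: %s\n", is_valid_utf8($latin1) ? 'valid UTF-8' : 'not valid UTF-8';
printf "UTF-8 bytes:      %s\n", is_valid_utf8($utf8)   ? 'valid UTF-8' : 'not valid UTF-8';
```

A failed strict decode therefore rules UTF-8 out, but it does not say which 8-bit encoding the file actually uses.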
  15. [A complimentary Cc of this posting was NOT [per weedlist] sent to
    Peter J. Holzer
    <>], who wrote in article <>:
    > That's because a BOM isn't just a Byte Order Mark - It is a valid
    > character (Zero Width No-Break Space).


    True; but also note what the standard says:

    * use as an indication of non-breaking is deprecated; see 2060 instead

    Hope this helps,
    Ilya
    Ilya Zakharevich, May 22, 2006
    #15
  16. [A complimentary Cc of this posting was sent to
    Alan J. Flavell
    <>], who wrote in article <>:
    > Let's hope that the questioner really does understand that ASCII is a
    > 7-bit code.


    ASCII is not "a 7-bit code".

    > There seems to be a substantial number of non-specialists who still
    > believe in some mythical 8-bit "ASCII" (or "extended ASCII") code.


    To the contrary. There seems to be a substantial number of
    non-specialists who still believe that the term "ASCII" has some
    unique meaning nowadays. It does not.

    > -- when I've been able to draw them out further on what this
    > mythical code might be, it seems to mean different things to
    > different people.


    Exactly. This is what ASCII means today: the default "legacy"
    encoding of the given system (probably "in the given COUNTRY setting"
    too, whatever it means); it must be compatible with ANSI's 7-bit
    encoding in its first half. In practice this means one of cp437,
    cp850, or cp125[1-8] (maybe cp1004 too?). Details are clear (if any)
    from context only.

    Hope this helps,
    Ilya
    Ilya Zakharevich, May 22, 2006
    #16
  17. [A complimentary Cc of this posting was sent to
    Alan J. Flavell
    <>], who wrote in article <>:
    > Oh, quite. But one could hardly expect MS to conform to someone
    > else's specifications, hmmm?[1] AIUI, when they say "Unicode", they
    > actually mean UTF-16, stored in little-endian format with BOM.
    >
    > (N.B one cannot call that UTF-16LE, because UTF-16LE or BE are
    > forbidden to start with a BOM. Hope that's clear?).


    To make it clear: UTF-16 is ALWAYS stored with BOM. And it is always
    stored in one of LE or BE schemes. So it is "a LE-variant of UTF-16",
    nothing non-standard-conforming. It is unfortunate indeed that the
    standards do not have a pre-agreed-to name for the variants.

    Hope this helps,
    Ilya
    Ilya Zakharevich, May 22, 2006
    #17
  18. >>>>> "Ilya" == Ilya Zakharevich <> writes:

    Ilya> ASCII is not "a 7-bit code".

    >> There seems to be a substantial number of non-specialists who still
    >> believe in some mythical 8-bit "ASCII" (or "extended ASCII") code.


    Ilya> To the contrary. There seems to be a substantial number of
    Ilya> non-specialists who still believe that the term "ASCII" has some
    Ilya> unique meaning nowadays. It does not.

    Wikipedia *seriously* disagrees with you:

    <http://en.wikipedia.org/wiki/ASCII>.

    Maybe it's a regional interpretation.

    --
    Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
    <> <URL:http://www.stonehenge.com/merlyn/>
    Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
    See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!

    Randal L. Schwartz, May 22, 2006
    #18
  19. On Mon, 22 May 2006, Ilya Zakharevich wrote:

    > To make it clear: UTF-16 is ALWAYS stored with BOM.


    Unicode specifies a layering (see chapter 2), consisting of three
    "Encoding Forms", and seven "Encoding Schemes".

    Confusingly, UTF-16 is not only the name of one of the encoding forms,
    but is also the name of one of that form's three encoding schemes.

    > And it is always stored in one of LE or BE schemes.


    Given octet-oriented storage, how else would one store 16-bit units?
    Alan J. Flavell, May 22, 2006
    #19
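
Perl's Encode module follows the same naming: the encoding called "UTF-16" writes a BOM, while "UTF-16BE" and "UTF-16LE" write none. The sketch below prints the bytes each one produces for the single character "A"; `hex_bytes` is a hypothetical helper added for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Space-separated hex dump of a byte string (hypothetical helper).
sub hex_bytes {
    my ($octets) = @_;
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', $octets;
}

# "UTF-16" prepends a BOM and writes big-endian; the BE/LE names add no BOM.
for my $name (qw(UTF-16 UTF-16BE UTF-16LE)) {
    printf "%-8s %s\n", $name, hex_bytes(encode($name, 'A'));
}

# When decoding as plain "UTF-16", the byte order is taken from the BOM.
my $decoded = decode('UTF-16', "\xFF\xFE" . "A\x00");
print "decoded: $decoded\n";
```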
  20. [A complimentary Cc of this posting was sent to
    Randal L. Schwartz
    <>], who wrote in article <>:
    > >> There seems to be a substantial number of non-specialists who still
    > >> believe in some mythical 8-bit "ASCII" (or "extended ASCII") code.


    > Ilya> To the contrary. There seems to be a substantial number of
    > Ilya> non-specialists who still believe that the term "ASCII" has some
    > Ilya> unique meaning nowadays. It does not.


    > Wikipedia *seriously* disagrees with you:


    Wikipedia is great as a read-and-do-the-opposite tool (but some parts
    of it have become much better in just one year).

    > <http://en.wikipedia.org/wiki/ASCII>.


    But anyway, ANY dictionary should decide which of two functions,
    prescription or description, it should serve. As an alternative, it
    might mark each entry/section with an appropriate mode descriptor.

    However, this entry is obviously written in prescription mode, but that
    is nowhere indicated; nor does it say that the most common usage
    has deviated a lot from this wishful-thinking description.

    [My wishful thinking is the same as that of the authors of this entry;
    the difference is that I understand that it is hopeless to fight the
    "new wave" of M$/Apple-derived jargon.]

    Hope this helps,
    Ilya
    Ilya Zakharevich, May 22, 2006
    #20