How to detect text file encoding in Perl

chaojen.chen

Hello all,

If I have a bunch of text files in the same directory and their
encodings could be UTF-16, UTF-8, or ASCII, is there any way of quickly
detecting the exact encoding of each of them?

Thanks,

Enoch Chen
 
Anno Siegel

[Please don't top-post, and leave some attribution. Text re-arranged]
> Maybe Encode::GUESS could help :)

Without even looking at it, I'd say a module with its name in all-caps
is suspect. Supposing it is actually spelled that way.

Anno
 
Anno Siegel

Gunnar Hjalmarsson said:
> Yeah, it makes you think of creations like POSIX and CGI. ;-)

Well, those are acronyms that weren't invented by the authors.

If GUESS were an acronym, the module would be more than suspect of
cutesiness.
It's not.

Good to know :)

Anno
 
Brian McCauley

> If I have a bunch of text files in the same directory and their
> encodings could be UTF-16, UTF-8, or ASCII, is there any way of quickly
> detecting the exact encoding of each of them?

Forget quickly: it is fundamentally impossible, given an ASCII file, to
tell that it is not utf8.

If the utf16 starts with a BOM, you can distinguish utf8 and utf16 by
examining the first two bytes.

That said, Encode::Guess is probably your friend.
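
A minimal sketch of what that could look like, assuming the file contents have already been read as raw bytes. The helper name guess_name and the sample strings are made up for illustration. With no extra suspects listed, Encode::Guess tries ASCII, UTF-8 and BOM-marked UTF-16/32.

```perl
use strict;
use warnings;
use Encode;
use Encode::Guess;    # exports guess_encoding()

# Hypothetical helper: return the name of the encoding that
# Encode::Guess settles on, or undef if the guess fails.
sub guess_name {
    my ($bytes) = @_;
    # guess_encoding() returns an encoding object on success
    # and a plain error string on failure.
    my $enc = guess_encoding($bytes);
    return ref($enc) ? $enc->name : undef;
}

# Illustrative samples standing in for file contents.
my %samples = (
    'plain ASCII'   => "hello\n",
    'UTF-8'         => encode('UTF-8',  "caf\x{e9}\n"),
    'UTF-16 w/ BOM' => encode('UTF-16', "hello\n"),    # Encode prepends a BOM
);

for my $label (sort keys %samples) {
    my $name = guess_name($samples{$label});
    printf "%-14s => %s\n", $label, defined $name ? $name : 'no guess';
}
```

For real files, open them with binmode (or the :raw layer) so that Perl hands over the bytes untouched before guessing.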
 
chaojen.chen

Brian McCauley wrote:
> Forget quickly: it is fundamentally impossible, given an ASCII file, to
> tell that it is not utf8.
>
> If the utf16 starts with a BOM, you can distinguish utf8 and utf16 by
> examining the first two bytes.
>
> That said, Encode::Guess is probably your friend.

Hello Brian,

Thanks for your suggestion. And what does BOM stand for?

Enoch
 
Guest

(e-mail address removed) wrote:
: >
: > If the utf16 starts with a BOM, you can distinguish utf8 and utf16 by
: > examining the first two bytes.
: >

: Thanks for your suggestion. And what does BOM stand for?

Google is probably your friend. If not: <B>yte <O>rder <M>ark.

You frequently get a BOM at the beginning of your file if you store it
on Windows with Notepad or similar editor simulations. If you choose to
store your data as UTF-8, or your data _is_ UTF-8, you'll see that after
storing the bytecount is two bytes more because the byte 0xff 0xef get
prepended automatically, in order to tell the software which byte order
is to be expected. This makes sense with UCS-2 Unicode (the "original"
Unicode encoding) but not with UTF-8 (8-bit transformation format of
Unicode) because the characters encoded in UTF-8 are self-synchronizing
and no information about byte order is needed. In contrast, other programs
behaving correctly frequently complain if the BOM appears where it simply
doesn't belong.

Oliver.
 
Alan J. Flavell

> Google is probably your friend. If not: <B>yte <O>rder <M>ark.

http://www.unicode.org/faq/utf_bom.html#BOM

> store your data as UTF-8, or your data _is_ UTF-8, you'll see that after
> storing the bytecount is two bytes more because the byte 0xff 0xef get
> prepended automatically,

The BOM is the relevant encoding of the Unicode character U+FEFF. No
way is it 0xff 0xef. The various encoded byte patterns are shown in
that Unicode FAQ, and in utf-8 it's *three* bytes.
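
As a rough illustration of those byte patterns, here is a small sketch that checks the start of a byte string for a BOM. The names @BOMS and sniff_bom are invented for this example, and the table covers only the UTF-8 and UTF-16 signatures discussed in this thread, so a UTF-32 file would be misreported.

```perl
use strict;
use warnings;
use Encode qw(encode);

# U+FEFF as it appears in each encoding scheme covered here.
# Assumption: UTF-32 is out of scope (its LE BOM also starts FF FE).
my @BOMS = (
    [ 'UTF-8'    => "\xEF\xBB\xBF" ],
    [ 'UTF-16BE' => "\xFE\xFF"     ],
    [ 'UTF-16LE' => "\xFF\xFE"     ],
);

# Return the encoding implied by a leading BOM, or nothing if none is found.
sub sniff_bom {
    my ($bytes) = @_;
    for my $entry (@BOMS) {
        my ($name, $sig) = @$entry;
        return $name if index($bytes, $sig) == 0;
    }
    return;
}

# Cross-check the table by encoding U+FEFF itself and printing the bytes.
for my $enc (qw(UTF-8 UTF-16BE UTF-16LE)) {
    my $hex = join ' ',
        map { sprintf '%02X', ord } split //, encode($enc, "\x{FEFF}");
    print "$enc: $hex\n";
}
```

A missing BOM proves nothing either way, so this can only confirm an encoding, never rule one out.
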

> in order to tell the software which byte order is to be expected.

"No, a BOM can be used as a signature no matter how the Unicode text
is transformed"

> This makes sense with UCS-2 Unicode (the "original" Unicode
> encoding)

Yes, but "UCS-2" is out of date:
http://www.unicode.org/faq/basic_q.html#23

The utf-16 encoding form is its present counterpart.

> but not with UTF-8 (8-bit transformation format of Unicode) because
> the characters encoded in UTF-8 are self-synchronizing and no
> information about byte order is needed.

Nevertheless, the Unicode FAQ points out that utf-8 can usefully
start with a BOM as an encoding signature.

> In contrast, other programs behaving correctly frequently complain
> if the BOM appears where it simply doesn't belong.

Except that it is not inherently incorrect for it to appear at the
beginning of a utf-8 stream - but see the cited FAQ for details.

Seems to me you would have done well to read that FAQ yourself, before
putting misleading opinions on the record.

regards
 
Guest

: > (Oliver's erroneous statement:)
: > storing the bytecount is two bytes more because the byte 0xff 0xef get
: > prepended automatically,

: The BOM is the relevant encoding of the Unicode character U+FEFF. No
: way is it 0xff 0xef.

Oops, I goofed up here, and the twisted order shows exactly what a byte
order mark is good for. Just imagine this would have been transmitted as
UCS-2, in Big Endian order.

: The various encoded byte patterns are shown in
: that Unicode FAQ, and in utf-8 it's *three* bytes.

Again, my fault. Shouldn't post when I'm too tired.

: > This makes sense with UCS-2 Unicode (the "original" Unicode
: > encoding)

: Yes, but "UCS-2" is out of date:
: http://www.unicode.org/faq/basic_q.html#23

But several (notably MS-based) applications still allow the user to choose
UCS-2, UTF-8 _and_ Unicode.

: > but not with UTF-8 (8-bit transformation format of Unicode) because
: > the characters encoded in UTF-8 are self-synchronizing and no
: > information about byte order is needed.

: Nevertheless, the Unicode FAQ points out that utf-8 can usefully
: start with a BOM as an encoding signature.

The FAQ says so, but...

: > In contrast, other programs behaving correctly frequently complain
: > if the BOM appears where it simply doesn't belong.

: Except that it is not inherently incorrect for it to appear at the
: beginning of a utf-8 stream - but see the cited FAQ for details.

But my experience (with shell scripts, interpretation of shebang lines
of perl scripts, etc.) runs to the contrary. A UTF-8-encoded file _with_
BOM causes unnecessary hiccups, even if this is against the formal spec.

: Seems to me you would have done well to read that FAQ yourself, before
: putting misleading opinions on the record.

Sorry, I should have consulted the FAQ, but I stand by my negative experiences
with superfluous BOMs.

Oliver.
 
Alan J. Flavell

Alan J. Flavell <[email protected]> wrote:

[re. my cite of http://www.unicode.org/faq/utf_bom.html#BOM ]
: Except that it is not inherently incorrect for it to appear at the
: beginning of a utf-8 stream - but see the cited FAQ for details.

> But my experience (with shell scripts, interpretation of shebang
> lines of perl scripts, etc.) runs to the contrary. A UTF-8-encoded
> file _with_ BOM causes unnecessary hiccups, even if this is against
> the formal spec.

Which is pretty much the point that the cited BOM FAQ makes, at
http://www.unicode.org/faq/utf_bom.html#29 , and that was my primary
reason for that suggestion to "see the cited FAQ for details".

regards
 
Peter J. Holzer

: > This makes sense with UCS-2 Unicode (the "original" Unicode
: > encoding)

: Yes, but "UCS-2" is out of date:
: http://www.unicode.org/faq/basic_q.html#23

> But several (notably MS-based) applications still allow the user to choose
> UCS-2, UTF-8 _and_ Unicode.

That's a curious statement, given that UCS-2 and UTF-8 are parts of the
Unicode standard. (UTF-16 is, too, BTW)

[...]
: > In contrast, other programs behaving correctly frequently complain
: > if the BOM appears where it simply doesn't belong.

: Except that it is not inherently incorrect for it to appear at the
: beginning of a utf-8 stream - but see the cited FAQ for details.

> But my experience (with shell scripts, interpretation of shebang lines
> of perl scripts, etc.) runs to the contrary. A UTF-8-encoded file _with_
> BOM causes unnecessary hiccups, even if this is against the formal spec.

That's because a BOM isn't just a Byte Order Mark - It is a valid
character (Zero Width No-Break Space). Of course inserting a Zero Width
No-Break Space at the beginning of a file isn't against UTF-8 rules,
just like inserting a normal space at the beginning of a file isn't
against UTF-8 rules. But it is against the rules for Unix scripts: The
first character must be a hash sign, not a space (zero width or not).

hp
 
Peter J. Holzer

Brian said:
> Forget quickly: it is fundamentally impossible, given an ASCII file, to
> tell that it is not utf8.

Well, every ASCII file is also UTF-8, but not vice versa.

Or, phrased differently, if you can decode a file as UTF-8 and all
characters have codes below 128, it is ASCII.

> If the utf16 starts with a BOM, you can distinguish utf8 and utf16 by
> examining the first two bytes.

You will probably also find a lot of zero bytes in a UTF-16 coded text
file which is unlikely in a UTF-8 coded text file.
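
Both observations can be combined into a rough classifier. This is only a sketch: the name classify is invented here, and the 10% threshold for zero bytes is an arbitrary assumption rather than any standard value.

```perl
use strict;
use warnings;
use Encode;

# Hypothetical helper: label a byte string as ASCII, UTF-8,
# probably UTF-16, or not UTF-8 at all.
sub classify {
    my ($bytes) = @_;
    return 'empty' unless length $bytes;

    # Many zero bytes suggest UTF-16 (threshold is an assumption).
    my $nuls = ($bytes =~ tr/\x00//);
    return 'probably UTF-16' if $nuls / length($bytes) > 0.10;

    # Strict decode: FB_CROAK makes decode() die on malformed UTF-8.
    my $text = eval { decode('UTF-8', $bytes, Encode::FB_CROAK) };
    return 'not UTF-8' unless defined $text;

    # Valid UTF-8 with every code point below 128 is plain ASCII.
    return $text =~ /[^\x00-\x7F]/ ? 'UTF-8' : 'ASCII';
}

print classify("hello\n"), "\n";
print classify(encode('UTF-16LE', "hello\n")), "\n";
```

The zero-byte test works well for mostly Western text in UTF-16, but text made up largely of CJK characters contains few zero bytes, so a BOM check is still worth doing first.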

hp
 
Alan J. Flavell

> That's a curious statement, given that UCS-2 and UTF-8 are parts of
> the Unicode standard. (UTF-16 is, too, BTW)

Oh, quite. But one could hardly expect MS to conform to someone
else's specifications, hmmm?[1] AIUI, when they say "Unicode", they
actually mean UTF-16, stored in little-endian format with BOM.

(N.B one cannot call that UTF-16LE, because UTF-16LE or BE are
forbidden to start with a BOM. Hope that's clear?).

> That's because a BOM isn't just a Byte Order Mark - It is a valid
> character (Zero Width No-Break Space).

At risk of being pedantic: that character cannot be at one and the
same time a BOM and a ZWNBSP: it's either one or the other. If it's
not at the beginning, it can only be a ZWNBSP. If it *is* at the
beginning, it's a matter of convention whether it's a BOM or a ZWNBSP.
But see the FAQ, http://www.unicode.org/faq/utf_bom.html#27 etc. for a
better explanation.

regards

[1] I see that wackypedia has a note about that:
http://en.wikipedia.org/wiki/Embrace,_extend_and_extinguish
 
Alan J. Flavell

> Well, every ASCII file is also UTF-8, but not vice versa.

Let's hope that the questioner really does understand that ASCII is a
7-bit code. There seems to be a substantial number of non-specialists
who still believe in some mythical 8-bit "ASCII" (or "extended ASCII")
code -- when I've been able to draw them out further on what this
mythical code might be, it seems to mean different things to different
people - some believe it to be what I'd know as CP437 the US national
DOS codepage, some evidently think it means the "multinational" DOS
codepage CP850, while yet others think it's a synonym for the equally
mythical "ANSI"* encoding, which in reality is the MS proprietary
Windows-1252 code - quite different from the DOS encodings.

*)ANSI never did publish their own 8-bit encoding of this kind - they
adopted ISO instead.

In truth, all of these ASCII-*based* 8-bit encodings have their own
proper names, and none of them has any right to be the mythical 8-bit
"ASCII".

Well, every (properly so called) ASCII file is also valid utf-8, as
you say. But it's also valid iso-8859-x for your choice of x, or
windows-125y for your choice of y, and so on.

> Or, phrased differently, if you can decode a file as UTF-8 and all
> characters have code less than 128, it is ASCII.

Indeed.

And, in practice, if you have a body of plausible text content in
iso-8859-1, or Windows-1252, containing a non-trivial number of bytes
above 127, then it is extremely unlikely to look like valid utf-8.
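
A small sketch of that last point, using Perl's built-in utf8::decode as the validity check. The helper name looks_like_utf8 and the sample phrase are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# True if the byte string is well-formed UTF-8. utf8::decode() converts
# in place and returns false on malformed input, so it works on a copy.
sub looks_like_utf8 {
    my ($bytes) = @_;
    return utf8::decode($bytes) ? 1 : 0;
}

# The same text stored in three encodings (sample text is illustrative).
my $text = "na\x{ef}ve caf\x{e9}";
for my $enc (qw(iso-8859-1 cp1252 UTF-8)) {
    my $verdict = looks_like_utf8(encode($enc, $text))
        ? 'valid UTF-8'
        : 'not valid UTF-8';
    printf "%-10s %s\n", $enc, $verdict;
}
```

A single accented letter followed by an ordinary ASCII character is already enough to break UTF-8 validity, which is why the test is so effective on real 8-bit text.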
 
Ilya Zakharevich

[A complimentary Cc of this posting was NOT [per weedlist] sent to
Peter J. Holzer
> That's because a BOM isn't just a Byte Order Mark - It is a valid
> character (Zero Width No-Break Space).

True; but also note what the standard says:

* use as an indication of non-breaking is deprecated; see 2060 instead

Hope this helps,
Ilya
 
Ilya Zakharevich

[A complimentary Cc of this posting was sent to
Alan J. Flavell
> Let's hope that the questioner really does understand that ASCII is a
> 7-bit code.

ASCII is not "a 7-bit code".

> There seems to be a substantial number of non-specialists who still
> believe in some mythical 8-bit "ASCII" (or "extended ASCII") code.

To the contrary. There seems to be a substantial number of
non-specialists who still believe that the term "ASCII" has some
unique meaning nowadays. It does not.

> -- when I've been able to draw them out further on what this
> mythical code might be, it seems to mean different things to
> different people.

Exactly. This is what ASCII means today: the default "legacy"
encoding of the given system (probably "in the given COUNTRY setting"
too, whatever it means); it must be compatible with ANSI's 7-bit
encoding in its first half. In practice this means one of cp437,
cp850, or cp125[1-8] (maybe cp1004 too?). Details are clear (if any)
from context only.

Hope this helps,
Ilya
 
Ilya Zakharevich

[A complimentary Cc of this posting was sent to
Alan J. Flavell
> Oh, quite. But one could hardly expect MS to conform to someone
> else's specifications, hmmm?[1] AIUI, when they say "Unicode", they
> actually mean UTF-16, stored in little-endian format with BOM.
>
> (N.B one cannot call that UTF-16LE, because UTF-16LE or BE are
> forbidden to start with a BOM. Hope that's clear?).

To make it clear: UTF-16 is ALWAYS stored with BOM. And it is always
stored in one of LE or BE schemes. So it is "a LE-variant of UTF-16",
nothing non-standard-conforming. It is unfortunate indeed that the
standards do not have a pre-agreed-to name for the variants.

Hope this helps,
Ilya
 
Randal L. Schwartz

Ilya> ASCII is not "a 7-bit code".

Ilya> To the contrary. There seems to be a substantial number of
Ilya> non-specialists who still believe that the term "ASCII" has some
Ilya> unique meaning nowadays. It does not.

Wikipedia *seriously* disagrees with you:

<http://en.wikipedia.org/wiki/ASCII>.

Maybe it's a regional interpretation.

--
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
<[email protected]> <URL:http://www.stonehenge.com/merlyn/>
Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!

 
Alan J. Flavell

> To make it clear: UTF-16 is ALWAYS stored with BOM.

Unicode specifies a layering (see chapter 2), consisting of three
"Encoding Forms", and seven "Encoding Schemes".

Confusingly, UTF-16 is not only the name of one of the encoding forms,
but is also the name of one of that form's three encoding schemes.

> And it is always stored in one of LE or BE schemes.

Given octet-oriented storage, how else would one store 16-bit units?
 
Ilya Zakharevich

[A complimentary Cc of this posting was sent to
Randal L. Schwartz
Ilya> To the contrary. There seems to be a substantial number of
Ilya> non-specialists who still believe that the term "ASCII" has some
Ilya> unique meaning nowadays. It does not.

> Wikipedia *seriously* disagrees with you:

Wikipedia is great as a read-and-do-the-opposite tool (but some parts
of it became much better in just one year).

But anyway, ANY dictionary should decide which of two functions,
prescription or description, it should serve. As an alternative, it
might mark each entry/section with an appropriate mode-descriptor.

However, this entry is obviously written in prescription mode, but that
is nowhere indicated; nor is it written that the most common usage
has deviated a lot from this wishful-thinking description.

[My wishful thinking is the same as that of the authors of this entry;
the difference is that I understand that it is hopeless to fight the
"new wave" of M$/Apple-derived jargon.]

Hope this helps,
Ilya
 
