character encodings

Jaap Karssenberg

I have a script that should read files utf8 compliant, so I used
binmode(FILE, ':utf8'). But now it appears some users have latin2
encoded files, causing some regexes to throw warnings about malformed
utf8 chars. Is there a way to detect the character encoding and DWIM ? I
would hate to have to tell my users they should convert everything to
utf8 first.
 
Jürgen Exner

Jaap said:
: I have a script that should read files utf8 compliant, so I used
: binmode(FILE, ':utf8'). But now it appears some users have latin2
: encoded files, causing some regexes to throw warnings about malformed
: utf8 chars. Is there a way to detect the character encoding and DWIM

You don't tell us what kind of files those are.
For e.g. HTML or XML, the meta charset header or the encoding attribute,
respectively, should tell you what encoding to expect.
Just scan for it and evaluate the rest of the file accordingly.
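As a rough sketch of that scan (the regexes here are simplified assumptions, not a real HTML/XML parser — real documents use quoting and whitespace variants these patterns don't cover):

```perl
use strict;
use warnings;

# Guess the declared encoding of an HTML/XML document from its opening
# bytes. Returns the declared name, or undef if none is found.
sub declared_encoding {
    my ($head) = @_;
    # XML declaration: <?xml version="1.0" encoding="ISO-8859-2"?>
    return $1 if $head =~ /<\?xml[^>]*encoding=["']([\w.-]+)["']/i;
    # HTML meta: <meta ... charset=iso-8859-2>
    return $1 if $head =~ /<meta[^>]*charset=["']?([\w.-]+)/i;
    return undef;    # no declaration found
}

print declared_encoding('<?xml version="1.0" encoding="ISO-8859-2"?>'), "\n";
# prints ISO-8859-2
```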

jue
 
Jaap Karssenberg

: You don't tell us what kind of files those are.
: For e.g. HTML or XML, the meta charset header or the encoding
: attribute, respectively, should tell you what encoding to expect.
: Just scan for it and evaluate the rest of the file accordingly.

They can be all kinds of files; the script is there to determine what
they are. In general the files have neither meta data attached to them
nor headers with meta data.
 
Alan J. Flavell

: For e.g. HTML or XML, the meta charset header or the encoding
: attribute, respectively, should tell you what encoding to expect.

Maybe. Depends how the files got there. For HTTP transactions it's
legal (and often preferable) to supply the character coding on the
actual HTTP content-type header, and to make no mention of it inside
the actual body of the content. However, the O.P. speaks of "files",
so presumably you're right, and the HTTP transaction issue is outside
this particular problem domain. But there's still the BOM option to
keep in mind!
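For the BOM option, a minimal sketch of signature detection on the raw bytes (only the common Unicode BOMs are checked; note the 4-byte UTF-32 signatures must be tested before their 2-byte UTF-16 prefixes):

```perl
use strict;
use warnings;

# Map a leading byte-order mark to an encoding name, if present.
# Returns (encoding, BOM length in bytes), or the empty list if no BOM.
sub bom_encoding {
    my ($bytes) = @_;
    return ('UTF-8',    3) if $bytes =~ /^\xEF\xBB\xBF/;
    return ('UTF-32BE', 4) if $bytes =~ /^\x00\x00\xFE\xFF/;
    return ('UTF-32LE', 4) if $bytes =~ /^\xFF\xFE\x00\x00/;
    return ('UTF-16BE', 2) if $bytes =~ /^\xFE\xFF/;
    return ('UTF-16LE', 2) if $bytes =~ /^\xFF\xFE/;
    return;
}

my ($enc) = bom_encoding("\xEF\xBB\xBF<html>");
print $enc, "\n";    # prints UTF-8
```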
: Just scan for it and evaluate the rest of the file accordingly.

But you can't scan it without reading it, and you can't read it
without opening it; so you'd have to open it provisionally with *some*
mode, scan for the stuff that you have described - and then maybe
re-open it with a different mode?

OTOH, if one opens it in raw mode, and scans it in a way which can
accommodate itself to different encodings, then, when the relevant
encoding information has been found, the data can be piped through the
appropriate encoding layers explicitly.
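In Perl, pushing a layer onto an already-open raw handle is just another binmode call, so no re-open is needed. A self-contained sketch of that raw-then-layer flow (using a temp file, and assuming latin2 turned out to be the right encoding):

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a sample file containing a bare latin2 byte for the demo.
my ($tmp, $path) = tempfile();
binmode $tmp, ':raw';
print {$tmp} "caf\xE9\n";    # e-acute as a single latin2 byte
close $tmp;

open my $in, '<:raw', $path or die "open $path: $!";
read $in, my $chunk, 4096;             # scan phase sees raw bytes
seek $in, 0, 0 or die "seek: $!";      # rewind for the real read
binmode $in, ':encoding(iso-8859-2)';  # push the layer in place
my $line = <$in>;
print ord(substr $line, 3, 1), "\n";   # prints 233 (U+00E9, decoded)
close $in;
unlink $path;
```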

There's a lot of options, and I'm not sure of the practical
implications of choosing one or another. If the data is to be
processed by an appropriate HTML or XML module, maybe that module can
adapt to different data encodings as read in raw mode?

What I think it comes down to is that it would definitely be a mistake
to open the file with a utf8 IO layer without being sure that it's
utf-8-encoded, due to the errors that will inevitably result if it
isn't.

hope this helps a bit.
 
Ben Morrow

Jaap Karssenberg said:
: I have a script that should read files utf8 compliant, so I used
: binmode(FILE, ':utf8'). But now it appears some users have latin2
: encoded files, causing some regexes to throw warnings about malformed
: utf8 chars. Is there a way to detect the character encoding and DWIM ? I
: would hate to have to tell my users they should convert everything to
: utf8 first.

You can't, in general. One thing you could try is

1. open the file in :raw mode.
2. read a largeish chunk into a $scalar.
3. turn the utf8 flag on with Encode::_utf8_on($scalar);.
4. check if the data is valid with Encode::is_utf8($scalar, 1);.
5. If it is, reopen the file with :utf8. If it ain't, assume latin2
and reopen with :encoding(latin2).
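A sketch of those five steps, substituting the public Encode::decode with FB_CROAK for the internal _utf8_on/is_utf8 pair — it performs the same well-formedness check without poking the UTF8 flag directly:

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Steps 1-5: open raw, sample a chunk, test it for UTF-8 validity,
# then re-open with the layer the test suggests.
sub guess_and_open {
    my ($path) = @_;

    open my $fh, '<:raw', $path or die "open $path: $!";   # 1.
    read $fh, my $chunk, 8192;                             # 2.
    close $fh;

    # 3./4. decode() with FB_CROAK dies on malformed UTF-8.
    # Caveat: the chunk boundary may split a multi-byte sequence,
    # so a robust version would trim any trailing partial character.
    my $valid = eval { decode('UTF-8', $chunk, FB_CROAK); 1 };

    my $layer = $valid ? ':utf8' : ':encoding(iso-8859-2)'; # 5.
    open $fh, "<$layer", $path or die "reopen $path: $!";
    return $fh;
}
```

(iso-8859-2 is spelled out here; Encode also accepts the alias latin2.)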

It seems there is no way to check if a sequence of bytes forms valid
utf8 without first setting the utf8 flag on... but never mind
that. Note that it is perfectly possible for data that was in fact
saved in latin2 to pass this test, just rather unlikely; and that if
next week you find some users are using latin1 you're completely
screwed, as there's no way to tell latin1 from latin2. :)

Ben
 
Alan J. Flavell

: You don't tell us what kind of files those are.
: For e.g. HTML or XML the meta charset header resp. the encoding
: attribute should tell you what encoding to expect.
: Just scan for it and evaluate the rest of the file accordingly.

: Can be all kinds of files, the script is there to determine what they
: are.

It can't be done, in general. There is no way to reliably distinguish
between the commonly-used 8-bit codes (iso-8859-whatever, etc.), for
example. It would be sheer guesswork, without some kind of additional
knowledge, language analysis or something.

As others have said, utf-8 can be verified for consistency, and the
hypothesis rejected if it proves to be false. But passing the
consistency test doesn't incontrovertibly prove that it's utf-8: it
might just be a coincidence that a particular 8-bit-coded text passed
the utf-8 consistency check.

So we really _do_ need to know more about your situation if we are to
offer any kind of realistic help.
: In general the files have neither meta data attached to them nor
: headers with meta data.

Then you're stuck with trying heuristic methods, IMHO.
 
