character encodings

Jaap Karssenberg

I have a script that should read files utf8 compliant, so I used
binmode(FILE, ':utf8'). But now it appears some users have latin2
encoded files, causing some regexes to throw warnings about malformed
utf8 chars. Is there a way to detect the character encoding and DWIM ? I
would hate to have to tell my users they should convert everything to
utf8 first.
 
Jürgen Exner

Jaap said:
: I have a script that should read files utf8 compliant, so I used
: binmode(FILE, ':utf8'). But now it appears some users have latin2
: encoded files, causing some regexes to throw warnings about malformed
: utf8 chars. Is there a way to detect the character encoding and DWIM

You don't tell us what kind of files those are.
For e.g. HTML or XML, the meta charset header or the encoding attribute,
respectively, should tell you what encoding to expect.
Just scan for it and evaluate the rest of the file accordingly.
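As a rough sketch of that scan (the regexes here are simplified assumptions, not a real HTML/XML parser — real documents use quoting and whitespace variants these patterns don't cover):

```perl
use strict;
use warnings;

# Guess the declared encoding of an HTML/XML document from its opening
# bytes. Returns the declared name, or undef if none is found.
sub declared_encoding {
    my ($head) = @_;
    # XML declaration: <?xml version="1.0" encoding="ISO-8859-2"?>
    return $1 if $head =~ /<\?xml[^>]*encoding=["']([\w.-]+)["']/i;
    # HTML meta: <meta ... charset=iso-8859-2>
    return $1 if $head =~ /<meta[^>]*charset=["']?([\w.-]+)/i;
    return undef;    # no declaration found
}

print declared_encoding('<?xml version="1.0" encoding="ISO-8859-2"?>'), "\n";
# prints ISO-8859-2
```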

jue
 
Jaap Karssenberg

: You don't tell us what kind of files those are.
: For e.g. HTML or XML, the meta charset header or the encoding
: attribute, respectively, should tell you what encoding to expect.
: Just scan for it and evaluate the rest of the file accordingly.

They can be all kinds of files; the script is there to determine what
they are. In general the files have neither meta data attached to them
nor headers with meta data.
 
Alan J. Flavell

: For e.g. HTML or XML, the meta charset header or the encoding
: attribute, respectively, should tell you what encoding to expect.

Maybe. Depends how the files got there. For HTTP transactions it's
legal (and often preferable) to supply the character coding on the
actual HTTP content-type header, and to make no mention of it inside
the actual body of the content. However, the O.P. speaks of "files",
so presumably you're right, and the HTTP transaction issue is outside
this particular problem domain. But there's still the BOM option to
keep in mind!
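For the BOM option, a minimal sketch of signature detection on the raw bytes (only the common Unicode BOMs are checked; note the 4-byte UTF-32 signatures must be tested before their 2-byte UTF-16 prefixes):

```perl
use strict;
use warnings;

# Map a leading byte-order mark to an encoding name, if present.
# Returns (encoding, BOM length in bytes), or the empty list if no BOM.
sub bom_encoding {
    my ($bytes) = @_;
    return ('UTF-8',    3) if $bytes =~ /^\xEF\xBB\xBF/;
    return ('UTF-32BE', 4) if $bytes =~ /^\x00\x00\xFE\xFF/;
    return ('UTF-32LE', 4) if $bytes =~ /^\xFF\xFE\x00\x00/;
    return ('UTF-16BE', 2) if $bytes =~ /^\xFE\xFF/;
    return ('UTF-16LE', 2) if $bytes =~ /^\xFF\xFE/;
    return;
}

my ($enc) = bom_encoding("\xEF\xBB\xBF<html>");
print $enc, "\n";    # prints UTF-8
```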
: Just scan for it and evaluate the rest of the file accordingly.

But you can't scan it without reading it, and you can't read it
without opening it; so you'd have to open it provisionally with *some*
mode, scan for the stuff that you have described - and then maybe
re-open it with a different mode?

OTOH, if one opens it in raw mode, and scans it in a way which can
accommodate itself to different encodings, then, when the relevant
encoding information has been found, the data can be piped through the
appropriate encoding layers explicitly.
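In Perl, pushing a layer onto an already-open raw handle is just another binmode call, so no re-open is needed. A self-contained sketch of that raw-then-layer flow (using a temp file, and assuming latin2 turned out to be the right encoding):

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a sample file containing a bare latin2 byte for the demo.
my ($tmp, $path) = tempfile();
binmode $tmp, ':raw';
print {$tmp} "caf\xE9\n";    # e-acute as a single latin2 byte
close $tmp;

open my $in, '<:raw', $path or die "open $path: $!";
read $in, my $chunk, 4096;             # scan phase sees raw bytes
seek $in, 0, 0 or die "seek: $!";      # rewind for the real read
binmode $in, ':encoding(iso-8859-2)';  # push the layer in place
my $line = <$in>;
print ord(substr $line, 3, 1), "\n";   # prints 233 (U+00E9, decoded)
close $in;
unlink $path;
```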

There's a lot of options, and I'm not sure of the practical
implications of choosing one or another. If the data is to be
processed by an appropriate HTML or XML module, maybe that module can
adapt to different data encodings as read in raw mode?

What I think it comes down to is that it would definitely be a mistake
to open the file with a utf8 IO layer without being sure that it's
utf-8-encoded, due to the errors that will inevitably result if it
isn't.

hope this helps a bit.
 
Ben Morrow

Jaap Karssenberg said:
: I have a script that should read files utf8 compliant, so I used
: binmode(FILE, ':utf8'). But now it appears some users have latin2
: encoded files, causing some regexes to throw warnings about malformed
: utf8 chars. Is there a way to detect the character encoding and DWIM ? I
: would hate to have to tell my users they should convert everything to
: utf8 first.

You can't, in general. One thing you could try is

1. open the file in :raw mode.
2. read a largeish chunk into a $scalar.
3. turn the utf8 flag on with Encode::_utf8_on($scalar);.
4. check if the data is valid with Encode::is_utf8($scalar, 1);.
5. If it is, reopen the file with :utf8. If it ain't, assume latin2
and reopen with :encoding(latin2).
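A sketch of those five steps, substituting the public Encode::decode with FB_CROAK for the internal _utf8_on/is_utf8 pair — it performs the same well-formedness check without poking the UTF8 flag directly:

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Steps 1-5: open raw, sample a chunk, test it for UTF-8 validity,
# then re-open with the layer the test suggests.
sub guess_and_open {
    my ($path) = @_;

    open my $fh, '<:raw', $path or die "open $path: $!";   # 1.
    read $fh, my $chunk, 8192;                             # 2.
    close $fh;

    # 3./4. decode() with FB_CROAK dies on malformed UTF-8.
    # Caveat: the chunk boundary may split a multi-byte sequence,
    # so a robust version would trim any trailing partial character.
    my $valid = eval { decode('UTF-8', $chunk, FB_CROAK); 1 };

    my $layer = $valid ? ':utf8' : ':encoding(iso-8859-2)'; # 5.
    open $fh, "<$layer", $path or die "reopen $path: $!";
    return $fh;
}
```

(iso-8859-2 is spelled out here; Encode also accepts the alias latin2.)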

It seems there is no way to check if a sequence of bytes forms valid
utf8 without first setting the utf8 flag on... but never mind
that. Note that it is perfectly possible for data that was in fact
saved in latin2 to pass this test, just rather unlikely; and that if
next week you find some users are using latin1 you're completely
screwed, as there's no way to tell latin1 from latin2. :)

Ben
 
Alan J. Flavell

: You don't tell us what kind of files those are.
: For e.g. HTML or XML the meta charset header resp. the encoding
: attribute should tell you what encoding to expect.
: Just scan for it and evaluate the rest of the file accordingly.

: Can be all kinds of files, the script is there to determine what they
: are.

It can't be done, in general. There is no way to reliably distinguish
between the commonly-used 8-bit codes (iso-8859-whatever, etc.), for
example. It would be sheer guesswork, without some kind of additional
knowledge, language analysis or something.

As others have said, utf-8 can be verified for consistency, and the
hypothesis rejected if it proves to be false. But passing the
consistency test doesn't incontrovertibly prove that it's utf-8: it
might just be a coincidence that a particular 8-bit-coded text passed
the utf-8 consistency check.

So we really _do_ need to know more about your situation if we are to
offer any kind of realistic help.
: In general the files have neither meta data attached to them nor
: headers with meta data.

Then you're stuck with trying heuristic methods, IMHO.
 
