character encodings

Discussion in 'Perl Misc' started by Jaap Karssenberg, Nov 30, 2003.

  1. I have a script that should read files as utf8, so I used
    binmode(FILE, ':utf8'). But it now appears that some users have
    latin2-encoded files, which makes some regexes throw warnings about
    malformed utf8 chars. Is there a way to detect the character encoding
    and DWIM? I would hate to have to tell my users they should convert
    everything to utf8 first.

    --
    ) ( Jaap Karssenberg || Pardus [Larus] | |0| |
    : : http://pardus-larus.student.utwente.nl/~pardus | | |0|
    ) \ / ( |0|0|0|
    ",.*'*.," Proud owner of "Perl6 Essentials" 1st edition :)
     
    Jaap Karssenberg, Nov 30, 2003
    #1

  2. Jaap Karssenberg wrote:
    > I have a script that should read files utf8 compliant, so I used
    > binmode(FILE, ':utf8'). But now it appears some users have latin2
    > encoded files, causing some regexes to throw warnings about malformed
    > utf8 chars. Is there a way to detect the character encoding and DWIM


    You don't tell us what kind of files those are.
    For HTML or XML, for example, the meta charset header or the encoding
    attribute, respectively, should tell you what encoding to expect.
    Just scan for it and evaluate the rest of the file accordingly.

    jue
     
    Jürgen Exner, Nov 30, 2003
    #2

  3. On Sun, 30 Nov 2003 16:52:46 GMT Jürgen Exner wrote:
    : You don't tell us what kind of files those are.
    : For e.g. HTML or XML the meta charset header resp. the encoding
    : attribute should tell you what encoding to expect.
    : Just scan for it and evaluate the rest of the file accordingly.

    Can be all kinds of files, the script is there to determine what they
    are. In general the files have neither meta data attached to them nor
    headers with meta data.

     
    Jaap Karssenberg, Nov 30, 2003
    #3
  4. On Sun, 30 Nov 2003, Jürgen Exner wrote:

    > For e.g. HTML or XML the meta charset header resp. the encoding attribute
    > should tell you what encoding to expect.


    Maybe. Depends how the files got there. For HTTP transactions it's
    legal (and often preferable) to supply the character coding on the
    actual HTTP content-type header, and to make no mention of it inside
    the actual body of the content. However, the O.P. speaks of "files",
    so presumably you're right, and the HTTP transaction issue is outside
    this particular problem domain. But there's still the BOM option to
    keep in mind!
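
    A minimal sketch of the BOM check, assuming the first few bytes of the
    file have already been read in raw mode. The helper name bom_encoding
    and the table layout are made up for illustration; the byte sequences
    themselves are the standard Unicode byte-order marks.

```perl
use strict;
use warnings;

# Byte-order marks and the encodings they imply. The four-byte marks come
# first so that UTF-32LE (FF FE 00 00) is not mistaken for UTF-16LE (FF FE).
my @BOMS = (
    [ "\x00\x00\xFE\xFF", 'UTF-32BE' ],
    [ "\xFF\xFE\x00\x00", 'UTF-32LE' ],
    [ "\xEF\xBB\xBF",     'UTF-8'    ],
    [ "\xFE\xFF",         'UTF-16BE' ],
    [ "\xFF\xFE",         'UTF-16LE' ],
);

# bom_encoding (hypothetical name) returns the encoding name and the length
# of the mark in bytes, or an empty list when the data has no BOM.
sub bom_encoding {
    my ($head) = @_;
    for my $entry (@BOMS) {
        my ($mark, $name) = @$entry;
        return ($name, length $mark) if index($head, $mark) == 0;
    }
    return;
}
```

    A missing BOM proves nothing either way: plenty of utf-8 files are
    written without one, so this can only ever be a first test.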

    > Just scan for it and evaluate the rest of the file accordingly.


    But you can't scan it without reading it, and you can't read it
    without opening it; so you'd have to open it provisionally with *some*
    mode, scan for the stuff that you have described - and then maybe
    re-open it with a different mode?

    OTOH, if one opens it in raw mode, and scans it in a way which can
    accommodate itself to different encodings, then, when the relevant
    encoding information has been found, the data can be piped through the
    appropriate encoding layers explicitly.
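
    The raw-mode approach could look roughly like the sketch below. The
    helper names declared_encoding and open_with_declared_encoding, the
    1024-byte sniff window and the two patterns are assumptions made for
    illustration, and the byte-level matching only works for
    ASCII-compatible encodings.

```perl
use strict;
use warnings;
use Encode ();

# declared_encoding (hypothetical name) looks at the first bytes of a
# document for an XML declaration or an HTML meta charset and returns the
# declared encoding name, or nothing if none is found.
sub declared_encoding {
    my ($head) = @_;
    return $1 if $head =~ /<\?xml[^>]*\bencoding\s*=\s*["']([A-Za-z0-9._:-]+)["']/;
    return $1 if $head =~ /<meta[^>]*\bcharset\s*=\s*["']?([A-Za-z0-9._:-]+)/i;
    return;
}

# open_with_declared_encoding (hypothetical name) opens the file in raw
# mode, sniffs the head, rewinds, and only then pushes an encoding layer.
sub open_with_declared_encoding {
    my ($path, $fallback) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    read $fh, my $head, 1024;
    my $name = declared_encoding($head // '');
    # Use the fallback when nothing is declared or Encode does not know the name.
    $name = $fallback unless defined $name && Encode::find_encoding($name);
    seek $fh, 0, 0 or die "Cannot rewind $path: $!";
    binmode $fh, ":encoding($name)";
    return ($fh, $name);
}
```

    Because the layer is pushed onto the handle that is already open, the
    file does not have to be opened a second time.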

    There are a lot of options, and I'm not sure of the practical
    implications of choosing one or another. If the data is to be
    processed by an appropriate HTML or XML module, maybe that module can
    adapt to different data encodings as read in raw mode?

    What I think it comes down to is that it would definitely be a mistake
    to open the file with a utf8 IO layer without being sure that it's
    utf-8-encoded, due to the errors that will inevitably result if it
    isn't.

    hope this helps a bit.
     
    Alan J. Flavell, Nov 30, 2003
    #4
  5. Jaap Karssenberg wrote:
    > I have a script that should read files utf8 compliant, so I used
    > binmode(FILE, ':utf8'). But now it appears some users have latin2
    > encoded files, causing some regexes to throw warnings about malformed
    > utf8 chars. Is there a way to detect the character encoding and DWIM ? I
    > would hate to have to tell my users they should convert everything to
    > utf8 first.


    You can't, in general. One thing you could try is

    1. open the file in :raw mode.
    2. read a largeish chunk into a $scalar.
    3. turn the utf8 flag on with Encode::_utf8_on($scalar).
    4. check if the data is valid with Encode::is_utf8($scalar, 1).
    5. if it is, reopen the file with :utf8. If it ain't, assume latin2
       and reopen with :encoding(latin2).

    It seems there is no way to check if a sequence of bytes forms valid
    utf8 without first setting the utf8 flag on... but never mind
    that. Note that it is perfectly possible for data that was in fact
    saved in latin2 to pass this test, just rather unlikely; and that if
    next week you find some users are using latin1 you're completely
    screwed, as there's no way to tell latin1 from latin2. :)
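
    For what it's worth, a strict Encode::decode can run the same validity
    check without touching the utf8 flag by hand. The sketch below swaps
    that in for steps 3 and 4. The helper names is_valid_utf8 and
    open_utf8_or_latin2 are hypothetical, and it slurps the whole file
    rather than a chunk, on the assumption that the files are small, so
    that a multi-byte character cut off at the end of a chunk cannot cause
    a false failure.

```perl
use strict;
use warnings;
use Encode ();

# is_valid_utf8 (hypothetical name) is true when the byte string decodes as
# strict UTF-8 without error. FB_CROAK makes decode() die on malformed
# input; LEAVE_SRC keeps it from modifying its argument.
sub is_valid_utf8 {
    my ($bytes) = @_;
    return eval {
        Encode::decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    } ? 1 : 0;
}

# open_utf8_or_latin2 (hypothetical name) reads the file as raw bytes to
# test it, then reopens it with the layer the test suggests.
sub open_utf8_or_latin2 {
    my ($path) = @_;
    open my $raw, '<:raw', $path or die "Cannot open $path: $!";
    my $bytes = do { local $/; <$raw> };
    close $raw;
    my $layer = is_valid_utf8($bytes // '')
        ? ':encoding(UTF-8)'
        : ':encoding(iso-8859-2)';
    open my $fh, "<$layer", $path or die "Cannot reopen $path: $!";
    return ($fh, $layer);
}
```

    The caveats above still apply: a latin2 file can pass the check by
    accident, and the fallback to latin2 is a guess, not a detection.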

    Ben

    --
    Like all men in Babylon I have been a proconsul; like all, a slave ... During
    one lunar year, I have been declared invisible; I shrieked and was not heard,
    I stole my bread and was not decapitated.
    ~ ~ Jorge Luis Borges, 'The Babylon Lottery'
     
    Ben Morrow, Nov 30, 2003
    #5
  6. On Sun, 30 Nov 2003, Jaap Karssenberg wrote:

    > On Sun, 30 Nov 2003 16:52:46 GMT Jürgen Exner wrote:
    > : You don't tell us what kind of files those are.
    > : For e.g. HTML or XML the meta charset header resp. the encoding
    > : attribute should tell you what encoding to expect.
    > : Just scan for it and evaluate the rest of the file accordingly.
    >
    > Can be all kinds of files, the script is there to determine what they
    > are.


    It can't be done, in general. There is no way to reliably distinguish
    between the commonly-used 8-bit codes (iso-8859-whatever, etc.), for
    example. It would be sheer guesswork, without some kind of additional
    knowledge, language analysis or something.
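
    A small illustration of why the 8-bit codes cannot be told apart from
    the bytes alone. The choice of byte 0xB1 is arbitrary; any byte in the
    upper half where the two tables differ would do.

```perl
use strict;
use warnings;
use Encode ();

# One byte, two equally valid readings. Neither decode reports an error,
# because the byte is mapped to a character in both tables.
my $byte = "\xB1";
my $as_latin1 = Encode::decode('iso-8859-1', $byte);   # U+00B1 PLUS-MINUS SIGN
my $as_latin2 = Encode::decode('iso-8859-2', $byte);   # U+0105 LATIN SMALL LETTER A WITH OGONEK

printf "latin1: U+%04X, latin2: U+%04X\n", ord $as_latin1, ord $as_latin2;
# prints: latin1: U+00B1, latin2: U+0105
```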

    As others have said, utf-8 can be verified for consistency, and the
    hypothesis rejected if it proves to be false. But passing the
    consistency test doesn't incontrovertibly prove that it's utf-8: it
    might just be a coincidence that a particular 8-bit-coded text passed
    the utf-8 consistency check.
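
    A concrete case of such a coincidence, as a sketch. The two latin2
    letters were picked for the example because their bytes happen to form
    a well-formed utf-8 sequence.

```perl
use strict;
use warnings;
use Encode ();

# U+0102 (A with breve) followed by U+0160 (S with caron), saved as
# latin2, is the two bytes C3 A9.
my $latin2_bytes = Encode::encode('iso-8859-2', "\x{102}\x{160}");

# Those same two bytes are also the well-formed UTF-8 encoding of U+00E9
# (e with acute), so a strict decode accepts them without complaint.
my $misread = eval {
    Encode::decode('UTF-8', $latin2_bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
};

printf "bytes: %s, read as UTF-8: U+%04X\n",
    join(' ', map { sprintf '%02X', ord } split //, $latin2_bytes),
    ord $misread;
# prints: bytes: C3 A9, read as UTF-8: U+00E9
```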

    So we really _do_ need to know more about your situation if we are to
    offer any kind of realistic help.

    > In general the files have neither meta data attached to them nor
    > headers with meta data.


    Then you're stuck with trying heuristic methods, IMHO.
     
    Alan J. Flavell, Nov 30, 2003
    #6
