Reading Text File Encoding and converting to Perl's internal UTF-8 encoding

Discussion in 'Perl Misc' started by sln@netherlands.com, Apr 17, 2009.

  1. Guest

    Need help from Unicode gurus or anybody with some knowledge of the subject.

    Say I have a text (character) file that I simply open. I don't know its encoding, so I
    can't open it with any particular encoding layer.

    It would appear to me that at the start of the file there may be an encoding mark (or none):
    assuming a text file, a sort of BOM sequence of octets that marks what its encoding is.

    Given that I might be passed only a file descriptor (I am writing a module), and I rewind the
    position to the start of the file, is there any way I can tell the encoding? If I could, and
    it's not UTF-8, I could decode() the rest of the file as octets, i.e. decode it in memory,
    create a decoded temp file, or possibly re-open it with the proper encoding.

    I think the encoding will be one of the usual 8/16/32-bit UTFs, though with characters from many locales.

    I am still sketchy on where to find a list of encoding markers that would let me work this
    out, and on the methods available for analysis and transformation.

    I know Perl has a massive Encode library ('use Encode'); nevertheless, this is what I need to do to finalize
    a module I'm working on.

    Thanks for the help.
    -sln
    , Apr 17, 2009
    #1

  2. Re: Reading Text File Encoding and converting to Perl's internal UTF-8 encoding

    wrote:

    > Given that I might be passed only a file descriptor (I am writing a module), and I rewind the
    > position to the start of the file, is there any way I can tell the encoding? If I could, and
    > it's not UTF-8, I could decode() the rest of the file as octets, i.e. decode it in memory,
    > create a decoded temp file, or possibly re-open it with the proper encoding.


    As I understand it, and I have just written some Perl code that happily
    mixes two dozen languages in one web page, there isn't a really good way
    of doing what you want. Part of the reason for this is that given a big
    block of text encoded as plain ASCII, the same text in UTF-8 is exactly,
    bit for bit, the same. It's only when you introduce non-ASCII ("wide")
    characters, such as those in other alphabets, that UTF-8 produces
    different bytes.
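
    To illustrate that point, here is a minimal sketch using encode() from the
    core Encode module. The sample strings are made up for the example; nothing
    here comes from the original poster's module.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Pure ASCII text: encoding it as US-ASCII or as UTF-8 gives the same octets,
# so the octets alone cannot tell you which of the two was intended.
my $ascii_text = "Hello, world";
my $as_ascii   = encode('ascii', $ascii_text);
my $as_utf8    = encode('UTF-8', $ascii_text);
print "ASCII-only text identical: ",
    ($as_ascii eq $as_utf8 ? "yes" : "no"), "\n";

# One non-ASCII character (U+00E9, e with acute accent) takes two octets in
# UTF-8, so the octet count now differs from the character count.
my $accented = "caf\x{e9}";
my $octets   = encode('UTF-8', $accented);
printf "characters: %d, UTF-8 octets: %d\n",
    length($accented), length($octets);
```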

    In some cases it may be possible to make an intelligent guess at the
    encoding, but no more.
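
    One such guess can be sketched as follows: look for a Unicode byte order
    mark at the start of the data and, if there is none, try a strict UTF-8
    decode before falling back. This is only a sketch under assumptions:
    sniff_bom() and slurp_decoded() are hypothetical helper names, the handle
    is assumed to be seekable, and the ISO-8859-1 fallback is an arbitrary
    choice, not a detection.

```perl
use strict;
use warnings;
use Encode qw(decode);

# BOM octet sequences, longest first so that UTF-32LE (FF FE 00 00) is tested
# before UTF-16LE (FF FE). Caveat: UTF-16LE text that starts with U+0000
# right after its BOM would be mistaken for UTF-32LE.
my @BOMS = (
    [ "\x00\x00\xFE\xFF", 'UTF-32BE' ],
    [ "\xFF\xFE\x00\x00", 'UTF-32LE' ],
    [ "\xEF\xBB\xBF",     'UTF-8'    ],
    [ "\xFE\xFF",         'UTF-16BE' ],
    [ "\xFF\xFE",         'UTF-16LE' ],
);

# Hypothetical helper: given the leading octets of a file, return
# (encoding name, BOM length), or an empty list when no BOM is present.
sub sniff_bom {
    my ($head) = @_;
    for my $entry (@BOMS) {
        my ($bom, $name) = @$entry;
        return ($name, length $bom) if index($head, $bom) == 0;
    }
    return;
}

# Hypothetical helper: rewind a handle, read raw octets, and decode them.
# Returns (encoding name used, decoded character string).
sub slurp_decoded {
    my ($fh) = @_;
    binmode $fh;                        # read octets, no translation
    seek $fh, 0, 0 or die "seek: $!";   # rewind to the start of the file
    my $octets = do { local $/; <$fh> };
    $octets = '' unless defined $octets;

    my ($enc, $skip) = sniff_bom(substr $octets, 0, 4);
    if (defined $enc) {
        # Skip the BOM itself and decode the remainder.
        return ($enc, decode($enc, substr($octets, $skip)));
    }

    # No BOM: accept the data as UTF-8 only if it decodes without errors.
    my $text = eval {
        decode('UTF-8', $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return ('UTF-8', $text) if defined $text;

    # Last resort (an assumption): every octet string is valid ISO-8859-1.
    return ('ISO-8859-1', decode('ISO-8859-1', $octets));
}

# Usage with an in-memory handle: "hi" in UTF-16LE, preceded by a BOM.
my $data = "\xFF\xFEh\x00i\x00";
open my $fh, '<', \$data or die "open: $!";
my ($enc, $text) = slurp_decoded($fh);
print "$enc: $text\n";                  # prints "UTF-16LE: hi"
```

    Note that pure ASCII input with no BOM is reported as UTF-8 here, which is
    harmless for the reason given above, and that a legacy 8-bit file will be
    labelled ISO-8859-1 whether or not that is what it really is.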

    Incidentally, and somewhat off-topic, is there anyone else for whom the
    letters UTF automatically mean 'use the force'?

    --
    I am Robert Billing, Christian, author, inventor, traveller, cook and
    animal lover. "It burned me from within. It quickened; I was with book
    as a woman is with child."

    Quality e-books for portable readers: http://www.alex-library.com
    Robert Billing, Apr 17, 2009
    #2

  3. Guest

    On Fri, 17 Apr 2009 23:48:10 +0100, Robert Billing wrote:

    > wrote:
    >
    >> Given that I might be passed only a file descriptor (I am writing a module), and I rewind the
    >> position to the start of the file, is there any way I can tell the encoding? If I could, and
    >> it's not UTF-8, I could decode() the rest of the file as octets, i.e. decode it in memory,
    >> create a decoded temp file, or possibly re-open it with the proper encoding.

    >
    >As I understand it, and I have just written some Perl code that happily
    >mixes two dozen languages in one web page, there isn't a really good way
    >of doing what you want. Part of the reason for this is that given a big
    >block of text encoded as plain ASCII, the same text in UTF-8 is exactly,
    >bit for bit, the same. It's only when you introduce non-ASCII ("wide")
    >characters, such as those in other alphabets, that UTF-8 produces
    >different bytes.
    >
    >In some cases it may be possible to make an intelligent guess at the
    >encoding, but no more.
    >
    >Incidentally, and somewhat off-topic, is there anyone else for whom the
    >letters UTF automatically mean 'use the force'?


    I'm sorry, 'I exist, therefore I am' doesn't seem to work.

    -sln
    , Apr 18, 2009
    #3
