Is there any way to discover what charset encoding a file is using?

Discussion in 'Java' started by James, Jun 29, 2004.

  1. James

    James Guest

    Hi all,

    Is there any way to discover what charset encoding a file is
    actually using by reading its content?

    For example, I may have a file which contains some Japanese characters.
    How could I determine whether those characters are actually Japanese ones?

    Thank You.

    James
     
    James, Jun 29, 2004
    #1

  2. Roedy Green

    Roedy Green Guest

    On 29 Jun 2004 00:30:56 -0700, James wrote or quoted:

    >Is there any way to discover what charset encoding a file is
    >actually using by reading its content?
    >
    >For example, I may have a file which contains some Japanese characters.
    >How could I determine whether those characters are actually Japanese ones?


    see http://mindprod.com/projects/encodingidentification.html
    --
    Canadian Mind Products, Roedy Green.
    Coaching, problem solving, economical contract programming.
    See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
     
    Roedy Green, Jun 29, 2004
    #2

  3. Re: Is there any way to discover what charset encoding a file is using?

    James wrote:
    > Is there any way to discover what charset encoding a file is
    > actually using by reading its content?


    Not with anything remotely approaching certainty.

    > For example, I may have a file which contains some Japanese characters.
    > How could I determine whether those characters are actually Japanese ones?


    Your best bet would be to take some common Japanese words, encode them
    in each of the three(!) charsets commonly used in Japan, plus UTF-8
    and UTF-16, and look for matches in the file's bytes.
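
    A rough sketch of that word-matching idea in Java might look like the
    following. The word list, the class name and the scoring rule are all
    made up for illustration, and the charset names assume a JDK that ships
    the extended charsets. ISO-2022-JP is left out here because its encoder
    wraps each word in escape sequences, so a plain byte search for an
    encoded word would rarely match inside running text.

```java
import java.nio.charset.Charset;

public class JapaneseEncodingGuess {
    // Candidate charsets (assumed names; availability depends on the JDK build).
    // UTF-16BE/LE are used instead of "UTF-16" so no byte order mark is emitted.
    static final String[] CANDIDATES = {
        "Shift_JIS", "EUC-JP", "UTF-8", "UTF-16BE", "UTF-16LE"
    };

    // Hypothetical list of common Japanese words, written as Unicode escapes
    // so the source compiles regardless of the compiler's source encoding.
    static final String[] WORDS = {
        "\u3067\u3059", // desu
        "\u307E\u3059", // masu
        "\u3057\u305F", // shita
        "\u3053\u3068", // koto
        "\u65E5\u672C"  // nihon
    };

    // Counts how often the byte pattern occurs in the data (naive search).
    static int countOccurrences(byte[] data, byte[] pattern) {
        int count = 0;
        outer:
        for (int i = 0; i + pattern.length <= data.length; i++) {
            for (int j = 0; j < pattern.length; j++) {
                if (data[i + j] != pattern[j]) {
                    continue outer;
                }
            }
            count++;
        }
        return count;
    }

    // Returns the candidate charset whose encoded words match most often,
    // or null if no word matches under any candidate. This is a guess only.
    static String guess(byte[] data) {
        String best = null;
        int bestScore = 0;
        for (String name : CANDIDATES) {
            if (!Charset.isSupported(name)) {
                continue; // skip charsets this JVM does not provide
            }
            Charset cs = Charset.forName(name);
            int score = 0;
            for (String word : WORDS) {
                score += countOccurrences(data, word.getBytes(cs));
            }
            if (score > bestScore) {
                bestScore = score;
                best = name;
            }
        }
        return best;
    }
}
```

    Calling JapaneseEncodingGuess.guess(bytes) on the raw file content gives
    a best guess, not a certainty. A short file, or one that happens to
    contain none of the chosen words, simply yields null.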

    If you just have a file that might be any language in any encoding,
    you're pretty much f*cked. In the worst case, it might be a *mix*
    of languages encoded in ISO-2022 (which, if I understood it correctly,
    is stateful and uses special command sequences to switch between modes
    in which different languages can be encoded).
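
    For the ISO-2022 case specifically, those mode switches are at least
    easy to spot. Here is a minimal sketch that looks only for the four
    escape sequences of the ISO-2022-JP profile (ESC $ @, ESC $ B, ESC ( B,
    ESC ( J). Other ISO-2022 variants use further sequences that this does
    not cover, and the class and method names are invented for the example.

```java
public class Iso2022JpSniffer {
    private static final byte ESC = 0x1B;

    // Counts ISO-2022-JP escape sequences: ESC $ @ and ESC $ B switch to
    // two-byte JIS characters, ESC ( B and ESC ( J switch back to
    // single-byte ASCII / JIS-Roman.
    static int countEscapes(byte[] data) {
        int count = 0;
        for (int i = 0; i + 2 < data.length; i++) {
            if (data[i] != ESC) {
                continue;
            }
            byte a = data[i + 1];
            byte b = data[i + 2];
            boolean toDoubleByte = a == '$' && (b == '@' || b == 'B');
            boolean toSingleByte = a == '(' && (b == 'B' || b == 'J');
            if (toDoubleByte || toSingleByte) {
                count++;
                i += 2; // skip the rest of the sequence
            }
        }
        return count;
    }

    // Heuristic: ISO-2022-JP is a 7-bit encoding, so any byte with the high
    // bit set rules it out; otherwise at least one escape must be present.
    static boolean looksLikeIso2022Jp(byte[] data) {
        for (byte b : data) {
            if (b < 0) {
                return false;
            }
        }
        return countEscapes(data) > 0;
    }
}
```

    A positive result only says the file is consistent with ISO-2022-JP.
    Pure ASCII text is also valid ISO-2022-JP and contains no escapes at
    all, so a negative result proves little.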
     
    Michael Borgwardt, Jul 1, 2004
    #3
