Is there any way to discover what charset encoding a file is using?

Discussion in 'Java' started by James, Jun 29, 2004.

  1. James

    James Guest

    Hi all,

    Is there any way to discover what charset encoding a file is
    actually using by reading its content?

    For example, I may have a file that contains some Japanese characters.
    How could I determine whether those characters are actually Japanese?

    Thank You.

    James
     
    James, Jun 29, 2004
    #1

  2. James

    Roedy Green Guest

    On 29 Jun 2004 00:30:56 -0700, James wrote:

    >Is there any way to discover what charset encoding a file is
    >actually using by reading its content?
    >
    >For example, I may have a file that contains some Japanese characters.
    >How could I determine whether those characters are actually Japanese?


    See http://mindprod.com/projects/encodingidentification.html
    --
    Canadian Mind Products, Roedy Green.
    Coaching, problem solving, economical contract programming.
    See http://mindprod.com/jgloss/jgloss.html for The Java Glossary.
     
    Roedy Green, Jun 29, 2004
    #2

  3. Re: Is there any way to discover what charset encoding a file is using?

    James wrote:
    > Is there any way to discover what charset encoding a file is
    > actually using by reading its content?


    Not with anything remotely approaching certainty.

    > For example, I may have a file that contains some Japanese characters.
    > How could I determine whether those characters are actually Japanese?


    Your best bet would be to take some common Japanese words, encode them
    in each of the three(!) charsets commonly used in Japan (Shift_JIS,
    EUC-JP and ISO-2022-JP) plus UTF-8 and UTF-16, and look for matches.
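
    A minimal sketch of that probe-word idea, under some assumptions: the
    class name EncodingGuesser, the four probe words (desu, masu, kore,
    shita, written as Unicode escapes) and the simple hit-count scoring are
    all made up for illustration, not taken from any library. ISO-2022-JP is
    left out of the candidate list because its encoder wraps every word in
    escape sequences, so a plain byte search would rarely match inside a
    longer run of text. UTF-16 is tried in both byte orders, and only at
    even offsets. Charsets the JVM does not ship are skipped.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

class EncodingGuesser {
    // Candidate charsets to try (an assumption; extend as needed).
    static final String[] CANDIDATES = {
        "Shift_JIS", "EUC-JP", "UTF-8", "UTF-16BE", "UTF-16LE"
    };

    // Hypothetical probe words: desu, masu, kore, shita (hiragana).
    static final String[] PROBES = {
        "\u3067\u3059", "\u307e\u3059", "\u3053\u308c", "\u3057\u305f"
    };

    // Count how often needle occurs in haystack, testing every step-th offset.
    static int countOccurrences(byte[] haystack, byte[] needle, int step) {
        if (needle.length == 0) return 0;
        int count = 0;
        for (int i = 0; i + needle.length <= haystack.length; i += step) {
            boolean match = true;
            for (int j = 0; j < needle.length; j++) {
                if (haystack[i + j] != needle[j]) { match = false; break; }
            }
            if (match) count++;
        }
        return count;
    }

    // Encode each probe word in each candidate charset and count the hits.
    static Map<String, Integer> score(byte[] data) {
        Map<String, Integer> scores = new LinkedHashMap<>();
        for (String name : CANDIDATES) {
            if (!Charset.isSupported(name)) continue; // not available in this JVM
            Charset cs = Charset.forName(name);
            // UTF-16 code units are two bytes wide, so only even offsets count.
            int step = name.startsWith("UTF-16") ? 2 : 1;
            int hits = 0;
            for (String word : PROBES) {
                hits += countOccurrences(data, word.getBytes(cs), step);
            }
            scores.put(name, hits);
        }
        return scores;
    }

    // Name of the charset with the most hits, or null if nothing matched.
    static String bestGuess(byte[] data) {
        String best = null;
        int bestHits = 0;
        for (Map.Entry<String, Integer> e : score(data).entrySet()) {
            if (e.getValue() > bestHits) {
                best = e.getKey();
                bestHits = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(score(data) + " -> " + bestGuess(data));
    }
}
```

    A null result only means that none of the probe words turned up; it
    says nothing definite about the file. A longer word list would make
    the guess less fragile, but it stays a guess.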

    If you just have a file that might be any language in any encoding,
    you're pretty much f*cked. In the worst case, it might be a *mix*
    of languages encoded in ISO-2022 (which, if I understood it correctly,
    is stateful and uses special command sequences to switch between modes
    in which different languages can be encoded).
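
    The stateful switching does leave a visible trace, at least for the
    Japanese profile: ISO-2022-JP is a 7-bit encoding that announces each
    mode change with an escape sequence such as ESC $ B (switch to
    JIS X 0208) or ESC ( B (back to ASCII). A rough sketch of a check for
    that, assuming a hypothetical class name Iso2022Sniffer and looking
    only at the two-byte-designator forms ESC $ x and ESC ( x:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class Iso2022Sniffer {
    static final byte ESC = 0x1B;

    // Collect every "ESC $ x" and "ESC ( x" sequence found in the data.
    static List<String> findEscapes(byte[] data) {
        List<String> found = new ArrayList<>();
        for (int i = 0; i + 2 < data.length; i++) {
            if (data[i] != ESC) continue;
            char kind = (char) (data[i + 1] & 0xFF);
            char fin = (char) (data[i + 2] & 0xFF);
            if (kind == '$' || kind == '(') {
                found.add("ESC " + kind + " " + fin);
            }
        }
        return found;
    }

    // Heuristic: pure 7-bit data that switches into JIS X 0208 at least once.
    static boolean looksLikeIso2022Jp(byte[] data) {
        for (byte b : data) {
            if (b < 0) return false; // high bit set, so not a 7-bit encoding
        }
        List<String> escapes = findEscapes(data);
        return escapes.contains("ESC $ B") || escapes.contains("ESC $ @");
    }

    public static void main(String[] args) throws IOException {
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(findEscapes(data) + " -> " + looksLikeIso2022Jp(data));
    }
}
```

    This only recognises the Japanese profile; other ISO-2022 variants use
    different designators, and a binary file can contain the same bytes by
    accident.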
     
    Michael Borgwardt, Jul 1, 2004
    #3
