How to read unicode

Discussion in 'Java' started by JR, Jul 2, 2007.

  1. JR

    JR Guest

    I have a Java program that parses text files of metadata and does
    various activities on it. I recently was asked to start working with
    Japanese Unicode characters but am not sure where to begin or whether
    I need to do anything specific for this. This program runs in a DOS
    window on a Western character set PC. Some questions that come to mind
    that I was hoping to get input on:

    1. Would it just work as is if I was running in a DOS window on a
    Japanese version of Windows XP?
    2. If in US, do I have to convert the characters from their graphical
    representation to their Unicode numeric equivalent?
    3. If so is there some way to parse the source data and convert it
    from like MS Mincho to Unicode?
    4. Can I save this data if converted as a standard text file?

    Thanks.

    JR
    JR, Jul 2, 2007
    #1

  2. JR

    stefanomnn Guest

    Hi!
    To read a text file correctly, I think what you need is to know the
    right encoding.
    E.g., suppose it is UTF-16:

    Code:
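    // Needs: import java.io.*;
    FileInputStream fileStream = new FileInputStream("yourFile");
    // Decode the bytes as UTF-16 instead of the platform default encoding.
    BufferedReader reader = new BufferedReader(
            new InputStreamReader(fileStream, "UTF-16"));
    String line = reader.readLine(); // null once the end of the file is reached
    reader.close();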
    
    Now you have the correct representation of your String.
    I hope this helps.
    stefanomnn, Jul 3, 2007
    #2

  3. JR

    Roedy Green Guest

    On Mon, 02 Jul 2007 15:23:51 -0700, JR <> wrote,
    quoted or indirectly quoted someone who said :

    >I have a java program that parses text files of metadata and does
    >various activities on it.


    If you display characters in a GUI, you just use Unicode, and it is the
    GUI's problem to display them. The only tricky part is selecting
    fonts which support the Unicode characters you are using.
    See http://mindprod.com/applets/fontshower.html

    If you display characters on the console, it typically uses an 8-bit
    encoding of some kind. See http://mindprod.com/applets/fileio.html
    for how to convert to various 8-bit encodings.

    The default encoding should be suitable.

    Lie to Windows and tell it you live in Japan to find out what that
    default encoding is.
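
If changing the Windows locale is not convenient, the JVM can also be asked which default encoding it picked up. The following is only a minimal sketch; the class name `ShowDefaultEncoding` is made up for this example. Keep in mind that Java 18 and later use UTF-8 as the default charset regardless of the Windows locale, so on a recent JVM the answer may not reflect the system code page.

```java
import java.nio.charset.Charset;

class ShowDefaultEncoding {
    /** Returns the name of the charset Java uses when no encoding is specified. */
    static String defaultEncodingName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println("Default charset: " + defaultEncodingName());
        // The system property the JVM consults at startup; it may be absent on some JVMs.
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));
    }
}
```

Running this once with a Western locale and once with a Japanese locale shows whether the default actually changes on the JVM in question.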

    See http://mindprod.com/jgloss/encoding.html
    --
    Roedy Green Canadian Mind Products
    The Java Glossary
    http://mindprod.com
    Roedy Green, Jul 3, 2007
    #3
  4. JR

    Chris Smith Guest

    JR <> wrote:
    > I have a java program that parses text files of metadata and does
    > various activities on it. I recently was asked to start working with
    > Japanese Unicode characters but not sure where to begin if I need to
    > do anything specific for this. This program runs in a DOS window on a
    > Western character set PC. Some questions that come to mind that I was
    > hoping to get input on:
    >
    > 1. Would it just work as is if I was running in a DOS window on a
    > Japanese version of Windows XP?


    There are two ways to approach I/O. One is to use the system default
    character encoding. The other is to specify a character encoding. If
    you've used the system default character encoding, then it would
    probably work on a Japanese system with Japanese characters. If you've
    specified an encoding, then it probably won't.

    You should always prefer specifying an encoding when possible. However,
    the encoding you use has to match the encoding of the "metadata text
    files" you are reading. If you can't control those, then your choice is
    made for you. You need to find out from whomever writes these files
    what encoding they use.
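
Reading with an explicitly specified encoding could look roughly like the sketch below. It assumes Java 7 or later (for `java.nio.file.Files` and `StandardCharsets`) and assumes the files turn out to be UTF-8, which is only a guess until whoever writes them confirms it. The class name `ExplicitEncodingRead` and the temporary file are stand-ins for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class ExplicitEncodingRead {
    /** Reads every line of the file, decoding its bytes with the given charset. */
    static List<String> readLines(Path file, Charset charset) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(file, charset)) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a real metadata file: a temporary file saved as UTF-8.
        Path file = Files.createTempFile("metadata", ".txt");
        try {
            Files.write(file, "title=\u65e5\u672c\u8a9e\n".getBytes(StandardCharsets.UTF_8));
            for (String line : readLines(file, StandardCharsets.UTF_8)) {
                System.out.println(line.length() + " chars read");
            }
        } finally {
            Files.delete(file);
        }
    }
}
```

Because the charset is passed in, the same method works unchanged once the real encoding of the files is known, for example `Charset.forName("Shift_JIS")` on a JVM that supports it.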

    > 2. If in US, do I have to convert the characters from their graphical
    > representation to their Unicode numeric equivalent?


    You can't draw characters to the console that aren't in the character
    set for that console. So you'll either need to convert your code to a
    GUI, or give up on drawing Japanese characters on a non-Japanese
    terminal.
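
Whether a given console encoding can represent some text at all can be checked before printing. This is a rough sketch, assuming Java 7 or later; ISO-8859-1 is used here only as a stand-in for a Western console encoding, and the class name `ConsoleEncodingCheck` is invented for the example. It says nothing about fonts, only about the encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ConsoleEncodingCheck {
    /** True if every character of the text has a representation in the charset. */
    static boolean canEncode(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String japanese = "\u65e5\u672c\u8a9e"; // three kanji
        // Stand-in for the console's encoding; a real DOS window uses an OEM code page.
        Charset western = StandardCharsets.ISO_8859_1;
        System.out.println("Japanese text encodable: " + canEncode(japanese, western));
        System.out.println("ASCII text encodable:    " + canEncode("metadata", western));
    }
}
```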

    > 3. If so is there some way to parse the source data and convert it
    > from like MS Mincho to Unicode?


    I don't know what MS Mincho is. Sorry.

    > 4. Can I save this data if converted as a standard text file?


    Sure you can save it. Again, you can save it either in a specific
    encoding, or with the platform default. If the text contains characters
    that can't be encoded with that encoding, they will appear as '?'
    characters.
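
The '?' substitution is easy to reproduce. The sketch below, assuming Java 7 or later, encodes a string to bytes and decodes it again with the same charset, which is roughly what saving and re-reading a file does. The class name `LossyEncodingDemo` is made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class LossyEncodingDemo {
    /** Encodes the text with the charset, then decodes the bytes with the same charset. */
    static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset); // unmappable characters are replaced here
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String text = "name=\u65e5\u672c"; // "name=" followed by two kanji
        // ISO-8859-1 has no kanji, so each one is replaced by '?'.
        System.out.println(roundTrip(text, StandardCharsets.ISO_8859_1)); // prints name=??
        // UTF-8 can represent every Unicode character, so nothing is lost.
        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(text)); // prints true
    }
}
```

The replacement happens silently at encoding time, so the damage is only noticed when the file is read back.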

    --
    Chris Smith
    Chris Smith, Jul 4, 2007
    #4
  5. JR

    Oliver Wong Guest

    "Chris Smith" <> wrote in message
    news:...
    > JR <> wrote:
    >> 2. If in US, do I have to convert the characters from their graphical
    >> representation to their Unicode numeric equivalent?

    >
    > You can't draw characters to the console that aren't in the character
    > set for that console. So you'll either need to convert your code to a
    > GUI, or give up on drawing Japanese characters on a non-Japanese
    > terminal.
    >
    >> 3. If so is there some way to parse the source data and convert it
    >> from like MS Mincho to Unicode?

    >
    > I don't know what MS Mincho is. Sorry.


    It's the name of a font which contains glyphs for Japanese characters
    (and perhaps CJK characters in general) made by Microsoft. It comes with
    Windows and usually when you're using a font that otherwise doesn't
    support CJK characters (e.g. Times or Arial), Windows will silently
    substitute the Mincho font instead, so it's one of the most common fonts
    used for displaying CJK characters (at least in the Windows world).

    The poster also made this post, which implies that (s)he is pretty
    confused about how Unicode, fonts, and related topics work:
    http://groups.google.ca/group/comp....read/thread/853bd25f432f9df5/8804136f5c810c41

    <quote>
    I have some text files with western characters in english, and
    japanese fonts in them.
    </quote>

    I saw that post before seeing this one, so I thought it was just
    sloppy wording or mixed up terminology, but now it really sounds like the
    OP is conflating fonts and text at the conceptual level.

    - Oliver
    Oliver Wong, Jul 4, 2007
    #5
  6. JR

    Roedy Green Guest

    On Mon, 02 Jul 2007 15:23:51 -0700, JR <> wrote,
    quoted or indirectly quoted someone who said :

    >I recently was asked to start working with
    >Japanese Unicode characters but not sure where to begin if I need to
    >do anything specific for this.


    The first thing is to find out how this file is encoded.

    Possibilities include:

    Cp930            Japanese Katakana-Kanji mixed with 4370 UDC, superset of 5026
    Cp939            Japanese Latin Kanji mixed with 4370 UDC, superset of 5035
    Cp942            Japanese (OS/2) superset of 932
    Cp942C           Variant of Cp942. Japanese (OS/2) superset of Cp932
    Cp943            Japanese (OS/2) superset of Cp932 and Shift-JIS
    Cp943C           Variant of Cp943. Japanese (OS/2) superset of Cp932 and Shift-JIS
    Cp33722          IBM-eucJP - Japanese (superset of 5050)

    JIS              Japanese
    JIS0201          JIS 0201, Japanese
    JIS0212          JIS 0212, Japanese
    JISAutoDetect    Detects and converts from Shift-JIS, EUC-JP, ISO 2022 JP
                     (conversion to Unicode only)
    JIS_X0201        Japanese
    JIS_X0212-1990f  Japanese

    Shift_JIS        Shift JIS. Japanese. A Microsoft code that extends
                     csHalfWidthKatakana to include kanji by adding a second
                     byte when the value of the first byte is in the ranges
                     81-9F or E0-EF.

    See http://mindprod.com/jgloss/encoding.html

    I am working on a little utility called EncodingRecogniser which
    should help you. All it does is display any given file presuming any
    of Java's supported encodings, telling you about BOMs.

    I hope to post it some time tonight.
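
The basic idea of viewing the same bytes under several candidate encodings can be sketched in a few lines. This is not the EncodingRecogniser source, only an illustration assuming Java 7 or later; the class name `TryEncodings` and the sample bytes are invented. In real use the bytes would come from the file, for example via `Files.readAllBytes`.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

class TryEncodings {
    /** Decodes the same bytes under each named charset that this JVM supports. */
    static Map<String, String> decodeAll(byte[] bytes, String... charsetNames) {
        Map<String, String> results = new LinkedHashMap<>();
        for (String name : charsetNames) {
            // Some Java runtimes ship without the extended (e.g. Japanese) charsets.
            if (Charset.isSupported(name)) {
                results.put(name, new String(bytes, Charset.forName(name)));
            }
        }
        return results;
    }

    public static void main(String[] args) {
        // Sample data: three kanji saved as UTF-8.
        byte[] bytes = "\u65e5\u672c\u8a9e".getBytes(StandardCharsets.UTF_8);
        Map<String, String> results =
                decodeAll(bytes, "UTF-8", "Shift_JIS", "EUC-JP", "ISO-8859-1");
        for (Map.Entry<String, String> entry : results.entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```

Only the decoding that matches the real encoding gives readable text; the others show garbage or replacement characters, which is usually enough for a human to pick the right one.
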
    --
    Roedy Green Canadian Mind Products
    The Java Glossary
    http://mindprod.com
    Roedy Green, Jul 5, 2007
    #6
  7. JR

    Roedy Green Guest

    On Wed, 04 Jul 2007 23:24:28 GMT, Roedy Green
    <> wrote, quoted or indirectly quoted
    someone who said :

    >I am working on a little utility called EncodingRecogniser which
    >should help you. All it does is display any given file presuming any
    >of Java's supported encodings, telling you about BOMs.


    The utility is now posted with Java source. You can use it online at
    http://mindprod.com/applets/encodingrecogniser.html
    or download it at
    http://mindprod.com/products1.html#ENCODINGRECOGNISER

    I added some whistles -- hex bytes and hex chars, and notification
    where BOMs are detected.
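
For readers who only need the BOM check, it amounts to comparing the first few bytes of the file against known signatures. The sketch below is an independent illustration, not code from the utility; the class name `BomSniffer` is made up, and it deliberately ignores UTF-32, whose little-endian BOM also starts with FF FE.

```java
import java.nio.charset.StandardCharsets;

class BomSniffer {
    /** Returns the encoding implied by a leading byte order mark, or null if there is none. */
    static String detectBom(byte[] b) {
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return "UTF-16LE"; // could also be UTF-32LE, which this sketch does not check
        }
        return null;
    }

    public static void main(String[] args) {
        // Java's "UTF-16" encoder writes a big-endian BOM in front of the data.
        System.out.println(detectBom("x".getBytes(StandardCharsets.UTF_16)));
        // Plain ASCII bytes carry no BOM.
        System.out.println(detectBom("x".getBytes(StandardCharsets.US_ASCII)));
    }
}
```

A missing BOM proves nothing: Shift_JIS, EUC-JP and most UTF-8 files have none, so this check only helps with the Unicode encodings.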

    --
    Roedy Green Canadian Mind Products
    The Java Glossary
    http://mindprod.com
    Roedy Green, Jul 5, 2007
    #7
  8. JR

    Greg R. Broderick Guest

    JR <> wrote in news::

    > I have a java program that parses text files of metadata and does
    > various activities on it. I recently was asked to start working with
    > Japanese Unicode characters but not sure where to begin if I need to
    > do anything specific for this. This program runs in a DOS window on a
    > Western character set PC. Some questions that come to mind that I was
    > hoping to get input on:
    >
    > 1. Would it just work as is if I was running in a DOS window on a
    > Japanese version of Windows XP?
    > 2. If in US, do I have to convert the characters from their graphical
    > representation to their Unicode numeric equivalent?
    > 3. If so is there some way to parse the source data and convert it
    > from like MS Mincho to Unicode?
    > 4. Can I save this data if converted as a standard text file?


    First, I would recommend that you spend some time learning the difference
    between character sets (e.g. Unicode), encodings (e.g. UTF-8) and fonts (e.g.
    MS Mincho). Several web pages that I've found useful for this include:

    http://czyborra.com/
    http://www.i18nguy.com/unicode/codepages.html
    http://www.unicode.org/
    http://www.faqs.org/rfcs/rfc2044.html
    http://www.faqs.org/rfcs/rfc2781.html
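
The character set versus encoding distinction can also be seen directly in Java. In the sketch below (Java 7 or later assumed, class name `CodePointVsBytes` invented for the example), one character has a single Unicode code point but a different byte sequence in each encoding. The font never enters the picture; it only matters when the character is drawn.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CodePointVsBytes {
    /** Formats bytes as space-separated two-digit hex values. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String ch = "\u65e5"; // one kanji
        // The character set assigns the number: U+65E5.
        System.out.printf("code point: U+%04X%n", ch.codePointAt(0));
        // Each encoding turns that number into different bytes.
        System.out.println("UTF-8:     " + hex(ch.getBytes(StandardCharsets.UTF_8)));    // E6 97 A5
        System.out.println("UTF-16BE:  " + hex(ch.getBytes(StandardCharsets.UTF_16BE))); // 65 E5
        if (Charset.isSupported("Shift_JIS")) { // not present in every Java runtime
            System.out.println("Shift_JIS: " + hex(ch.getBytes(Charset.forName("Shift_JIS"))));
        }
    }
}
```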

    Cheers!
    GRB

    --
    ---------------------------------------------------------------------
    Greg R. Broderick

    A. Top posters.
    Q. What is the most annoying thing on Usenet?
    ---------------------------------------------------------------------
    Greg R. Broderick, Jul 5, 2007
    #8