How to read Unicode

JR · Jul 2, 2007

I have a java program that parses text files of metadata and does
various activities on it. I recently was asked to start working with
Japanese Unicode characters but am not sure where to begin or if I need to
do anything specific for this. This program runs in a DOS window on a
Western character set PC. Some questions that come to mind that I was
hoping to get input on:

1. Would it just work as is if I was running in a DOS window on a
Japanese version of Windows XP?
2. If in US, do I have to convert the characters from their graphical
representation to their Unicode numeric equivalent?
3. If so is there some way to parse the source data and convert it
from like MS Mincho to Unicode?
4. Can I save this data if converted as a standard text file?

Thanks.

JR

stefanomnn · Jul 3, 2007

Hi!
To read a text file correctly, I think what you need is to know its
encoding.
E.g. suppose it is UTF-16:

Code:

import java.io.*;

// Decode the file's bytes as UTF-16 while reading.
FileInputStream fileStream = new FileInputStream("yourFile");
BufferedReader reader = new BufferedReader(
        new InputStreamReader(fileStream, "UTF-16"));
String line = reader.readLine(); // first line, decoded to a Java String
reader.close();

Now you have the correct representation of your String.
I hope that helps.

Roedy Green · Jul 3, 2007

I have a java program that parses text files of metadata and does
various activities on it.

If you display characters in a GUI, you just use Unicode, and it is the
GUI's problem to display them. The only tricky part is selecting
fonts which support the Unicode characters you are using.
See http://mindprod.com/applets/fontshower.html

If you display characters on the console, it typically uses an 8-bit
encoding of some kind. See http://mindprod.com/applets/fileio.html
for how to convert to various 8-bit encodings.

The default encoding should be suitable.

Lie to Windows and tell it you live in Japan to find out what that
default encoding is.

See http://mindprod.com/jgloss/encoding.html
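
To see what the default encoding actually is on a given machine, something like the following sketch may help. The class name ShowDefaultEncoding and the list of names probed are made up for illustration; Charset.defaultCharset() and Charset.isSupported() are standard java.nio.charset calls.

```java
import java.nio.charset.Charset;

class ShowDefaultEncoding {
    /** Returns the name of the charset the JVM uses when none is specified. */
    static String defaultEncodingName() {
        return Charset.defaultCharset().name();
    }

    /** Returns true if this JVM ships a charset with the given name. */
    static boolean isSupported(String name) {
        return Charset.isSupported(name);
    }

    public static void main(String[] args) {
        System.out.println("Default encoding: " + defaultEncodingName());
        // Illustrative selection of names; extend as needed.
        for (String name : new String[] {"UTF-8", "UTF-16", "Shift_JIS", "EUC-JP"}) {
            System.out.println(name + " supported: " + isSupported(name));
        }
    }
}
```

Running this once with the Western locale and once with the Japanese locale should show whether the default changes between the two setups.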

Chris Smith · Jul 4, 2007

JR said:
I have a java program that parses text files of metadata and does
various activities on it. I recently was asked to start working with
Japanese Unicode characters but am not sure where to begin or if I need to
do anything specific for this. This program runs in a DOS window on a
Western character set PC. Some questions that come to mind that I was
hoping to get input on:

1. Would it just work as is if I was running in a DOS window on a
Japanese version of Windows XP?

There are two ways to approach I/O. One is to use the system default
character encoding. The other is to specify a character encoding. If
you've used the system default character encoding, then it would
probably work on a Japanese system with Japanese characters. If you've
specified an encoding, then it probably won't.

You should always prefer specifying an encoding when possible. However,
the encoding you use has to match the encoding of the "metadata text
files" you are reading. If you can't control those, then your choice is
made for you. You need to find out from whoever writes these files
what encoding they use.
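
A small sketch of why the encoding has to match: the same bytes decoded with two different encodings give two different strings. The class name EncodingMustMatch and the sample text are assumptions for illustration only; in the real program the bytes would come from the metadata file.

```java
import java.nio.charset.Charset;

class EncodingMustMatch {
    /** Decodes raw file bytes with the named encoding. */
    static String decode(byte[] raw, String encodingName) {
        return new String(raw, Charset.forName(encodingName));
    }

    public static void main(String[] args) {
        // Three kanji written as escapes so the source file itself stays ASCII.
        String original = "\u65e5\u672c\u8a9e";
        byte[] raw = original.getBytes(Charset.forName("UTF-8"));

        // Matching encoding: the text round-trips. Prints true.
        System.out.println(original.equals(decode(raw, "UTF-8")));
        // Mismatched encoding: same bytes, garbled text. Prints false.
        System.out.println(original.equals(decode(raw, "ISO-8859-1")));
    }
}
```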

2. If in US, do I have to convert the characters from their graphical
representation to their Unicode numeric equivalent?

You can't draw characters to the console that aren't in the character
set for that console. So you'll either need to convert your code to a
GUI, or give up on drawing Japanese characters on a non-Japanese
terminal.
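
If the goal is only to inspect the text on a Western console, one workaround is to print the numeric value of each non-ASCII character instead of trying to draw it. This is a rough sketch; the class and method names are hypothetical and not from the original program.

```java
class ShowCodePoints {
    /** Renders text with every non-ASCII char replaced by a backslash-u hex escape. */
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x80) {
                out.append(c); // plain ASCII passes through unchanged
            } else {
                out.append(String.format("\\u%04X", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The kanji is written as an escape so this source file stays ASCII.
        System.out.println(escapeNonAscii("abc \u65e5"));
    }
}
```

The output contains only ASCII, so any console can show it, at the cost of readability.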

3. If so is there some way to parse the source data and convert it
from like MS Mincho to Unicode?

I don't know what MS Mincho is. Sorry.

4. Can I save this data if converted as a standard text file?

Sure you can save it. Again, you can save it either in a specific
encoding, or with the platform default. If the text contains characters
that can't be encoded with that encoding, they will appear as '?'
characters.
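
The '?' substitution can be seen, and checked for in advance, with a sketch along these lines. The class name UnencodableDemo and the sample text are assumptions; String.getBytes(Charset) and CharsetEncoder.canEncode() are standard library calls.

```java
import java.nio.charset.Charset;

class UnencodableDemo {
    /** Encodes text and decodes it again with the same charset, showing what survives a save. */
    static String roundTrip(String text, String encodingName) {
        Charset cs = Charset.forName(encodingName);
        return new String(text.getBytes(cs), cs);
    }

    /** Returns true if every character of text can be represented in the charset. */
    static boolean canSave(String text, String encodingName) {
        return Charset.forName(encodingName).newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String text = "abc \u65e5\u672c\u8a9e"; // ASCII followed by three kanji
        System.out.println(roundTrip(text, "ISO-8859-1")); // prints "abc ???"
        System.out.println(canSave(text, "ISO-8859-1"));   // prints false
        System.out.println(canSave(text, "UTF-8"));        // prints true
    }
}
```

Checking with canSave before writing avoids silently losing characters.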

Oliver Wong · Jul 4, 2007

Chris Smith said:
You can't draw characters to the console that aren't in the character
set for that console. So you'll either need to convert your code to a
GUI, or give up on drawing Japanese characters on a non-Japanese
terminal.

I don't know what MS Mincho is. Sorry.

It's the name of a font which contains glyphs for Japanese characters
(and perhaps CJK characters in general) made by Microsoft. It comes with
Windows and usually when you're using a font that otherwise doesn't
support CJK characters (e.g. Times or Arial), Windows will silently
substitute the Mincho font instead, so it's one of the most common fonts
used for displaying CJK characters (at least in the Windows world).

The poster also made this post which implies that (s)he is pretty
confused about how Unicode, fonts, and related topics work:
http://groups.google.ca/group/comp....read/thread/853bd25f432f9df5/8804136f5c810c41

<quote>
I have some text files with western characters in english, and
japanese fonts in them.
</quote>

I saw that post before seeing this one, so I thought it was just
sloppy wording or mixed up terminology, but now it really sounds like the
OP is conflating fonts and text at the conceptual level.

- Oliver

Roedy Green · Jul 5, 2007

I recently was asked to start working with
Japanese Unicode characters but am not sure where to begin or if I need to
do anything specific for this.

The first thing is to find out how this file is encoded.

Possibilities include:

Cp930            Japanese Katakana-Kanji mixed with 4370 UDC, superset of 5026
Cp939            Japanese Latin Kanji mixed with 4370 UDC, superset of 5035
Cp942            Japanese (OS/2) superset of 932
Cp942C           Variant of Cp942. Japanese (OS/2) superset of Cp932
Cp943            Japanese (OS/2) superset of Cp932 and Shift-JIS.
Cp943C           Variant of Cp943. Japanese (OS/2) superset of Cp932 and Shift-JIS.
Cp33722          IBM-eucJP - Japanese (superset of 5050)

JIS              Japanese
JIS0201          JIS 0201, Japanese
JIS0212          JIS 0212, Japanese
JISAutoDetect    Detects and converts from Shift-JIS, EUC-JP, ISO-2022-JP
                 (conversion to Unicode only)
JIS_X0201        Japanese
JIS_X0212-1990f  Japanese

Shift_JIS        Shift JIS. Japanese. A Microsoft code that extends
                 csHalfWidthKatakana to include kanji by adding a second byte
                 when the value of the first byte is in the ranges 81-9F or
                 E0-EF.

See http://mindprod.com/jgloss/encoding.html

I am working on a little utility called EncodingRecogniser which
should help you. All it does is display any given file presuming any
of Java's supported encodings, telling you about BOMs.

I hope to post it some time tonight.
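
The general idea of viewing one file under several candidate encodings can be sketched as below. This is not the EncodingRecogniser source, only an illustration; the class name TryEncodings and the candidate list are assumptions.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

class TryEncodings {
    /** Decodes the same bytes under each candidate encoding that this JVM supports. */
    static Map<String, String> decodeAll(byte[] raw, String... candidates) {
        Map<String, String> results = new LinkedHashMap<String, String>();
        for (String name : candidates) {
            if (Charset.isSupported(name)) { // skip encodings this JVM lacks
                results.put(name, new String(raw, Charset.forName(name)));
            }
        }
        return results;
    }

    public static void main(String[] args) {
        // Stand-in for bytes read from the mystery file: three kanji in UTF-8.
        byte[] raw = "\u65e5\u672c\u8a9e".getBytes(Charset.forName("UTF-8"));
        Map<String, String> views = decodeAll(raw, "UTF-8", "Shift_JIS", "EUC-JP", "ISO-8859-1");
        for (Map.Entry<String, String> e : views.entrySet()) {
            // Only the correct encoding should produce sensible text.
            System.out.println(e.getKey() + ": " + e.getValue());
        }
    }
}
```

A person who can read Japanese then picks the decoding that looks right; the output is best viewed in a GUI or a file, since a Western console may not display it.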

Roedy Green · Jul 5, 2007

I am working on a little utility called EncodingRecogniser which
should help you. All it does is display any given file presuming any
of Java's supported encodings, telling you about BOMs.

The utility is now posted with Java source. You can use it online at
http://mindprod.com/applets/encodingrecogniser.html
or download it at
http://mindprod.com/products1.html#ENCODINGRECOGNISER

I added some whistles -- hex bytes and hex chars, and notification
where BOMs are detected.
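
For readers curious what BOM detection involves, here is a minimal sketch. It is not taken from the utility above; the class name BomSniffer is hypothetical. The byte patterns are the standard Unicode byte order marks.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

class BomSniffer {
    /** Returns the encoding implied by a leading byte order mark, or null if none is found. */
    static String detectBom(byte[] head) {
        // UTF-32 is checked first because its little-endian BOM begins with the UTF-16LE BOM.
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false; // compare as unsigned
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] head = new byte[4];
        InputStream in = new FileInputStream(args[0]);
        try {
            int n = in.read(head); // may return fewer than 4 bytes, or -1 for an empty file
            byte[] seen = new byte[Math.max(n, 0)];
            System.arraycopy(head, 0, seen, 0, seen.length);
            System.out.println("BOM: " + detectBom(seen));
        } finally {
            in.close();
        }
    }
}
```

Many files carry no BOM at all (Shift_JIS and EUC-JP never do), so a null result says nothing about the encoding.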

Greg R. Broderick · Jul 5, 2007

I have a java program that parses text files of metadata and does
various activities on it. I recently was asked to start working with
Japanese Unicode characters but am not sure where to begin or if I need to
do anything specific for this. This program runs in a DOS window on a
Western character set PC. Some questions that come to mind that I was
hoping to get input on:

1. Would it just work as is if I was running in a DOS window on a
Japanese version of Windows XP?
2. If in US, do I have to convert the characters from their graphical
representation to their Unicode numeric equivalent?
3. If so is there some way to parse the source data and convert it
from like MS Mincho to Unicode?
4. Can I save this data if converted as a standard text file?

First, I would recommend that you spend some time learning the difference
between character sets (e.g. Unicode), encodings (e.g. UTF-8) and fonts (e.g.
MS Mincho). Several web pages that I've found useful for this include:

http://czyborra.com/
http://www.i18nguy.com/unicode/codepages.html
http://www.unicode.org/
http://www.faqs.org/rfcs/rfc2044.html
http://www.faqs.org/rfcs/rfc2781.html
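
As a quick illustration of the character set versus encoding distinction, the sketch below prints one character's Unicode code point and then its bytes under several encodings. The class name CodePointVsBytes is made up for this example. The font plays no part: it only decides how the character is drawn.

```java
import java.nio.charset.Charset;

class CodePointVsBytes {
    /** Formats bytes as space-separated uppercase hex. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Returns the hex bytes that represent text in the named encoding. */
    static String encodedHex(String text, String encodingName) {
        return hex(text.getBytes(Charset.forName(encodingName)));
    }

    public static void main(String[] args) {
        String kanji = "\u65e5"; // one character, one code point
        System.out.println("code point: U+"
                + Integer.toHexString(kanji.codePointAt(0)).toUpperCase()); // U+65E5
        System.out.println("UTF-8:    " + encodedHex(kanji, "UTF-8"));     // E6 97 A5
        System.out.println("UTF-16BE: " + encodedHex(kanji, "UTF-16BE"));  // 65 E5
        if (Charset.isSupported("Shift_JIS")) { // not guaranteed on every JVM
            System.out.println("Shift_JIS: " + encodedHex(kanji, "Shift_JIS"));
        }
    }
}
```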

Cheers!
GRB

