Unicode characters in an ASCII file

WindAndWaves · Nov 22, 2004

Hi Gurus

Do you know if it is possible to display Unicode characters (e.g. Japanese
ones) in an ASCII-based file?

TIA

- Nicolaas

Philip Ronan · Nov 22, 2004

WindAndWaves said:
Do you know if it is possible to display Unicode characters (e.g. Japanese
ones) in an ASCII-based file?

I'm not sure I understand what you mean.

The ASCII character set contains 128 characters consisting of uppercase and
lowercase Roman alphabets, Arabic numerals from 0-9, various punctuation
characters and 32 control codes. No Japanese characters there at all.

The Unicode standard contains thousands of characters, but it isn't the same
thing as ASCII.

If you want to include Japanese characters in an *HTML* file, then you
should use Unicode character entities. For example, the characters for
"Japan" are &#x65E5;&#x672C; (日本).

Is that what you wanted?

Jukka K. Korpela · Nov 22, 2004

Philip Ronan said:
If you want to include Japanese characters in an *HTML* file, then you
should use Unicode character entities.

Make it "could". Why not use UTF-8? But technically it is indeed possible
to write an HTML document that is ASCII encoded, yet contains any
characters you want.

And they are character references, not entities. See
http://www.cs.tut.fi/~jkorpela/chars/ref.html

For example, the characters for
"Japan" are &#x65E5;&#x672C; (日本).

Using decimal notation works more often, though the difference is getting
more and more marginal.
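
To make the two notations concrete, here is a minimal Python sketch that writes the same characters as decimal and as hexadecimal references. The helper names `to_decimal_refs` and `to_hex_refs` are made up for this illustration and do not come from any library.

```python
def to_decimal_refs(text: str) -> str:
    """Write every character as a decimal reference, e.g. &#26085;"""
    return "".join(f"&#{ord(ch)};" for ch in text)


def to_hex_refs(text: str) -> str:
    """Write every character as a hexadecimal reference, e.g. &#x65E5;"""
    return "".join(f"&#x{ord(ch):X};" for ch in text)


print(to_decimal_refs("日本"))  # &#26085;&#26412;
print(to_hex_refs("日本"))      # &#x65E5;&#x672C;
```

Both forms name the same Unicode code points; only the number base differs.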

Sybren Stuvel · Nov 22, 2004

Philip Ronan said:
[...] you should use Unicode character entities.

Jukka K. Korpela replied:

Make it "could". Why not use UTF-8?

UTF-8 *is* unicode. It's just an encoding. Philip didn't specify any
encoding - OP might as well use UCS, although it's a large.

Sybren

Steve Pugh · Nov 22, 2004

Philip Ronan said:
[...] you should use Unicode character entities.

Jukka K. Korpela replied:

Make it "could". Why not use UTF-8?

UTF-8 *is* unicode. It's just an encoding. Philip didn't specify any
encoding - OP might as well use UCS, although it's a large.

Philip said to use "Unicode character entities" and from his example it is
clear that he was talking about numeric character references. As such,
Philip didn't need to specify any encoding: in HTML, character references
always refer to Unicode code points, so the encoding used for the file is
irrelevant.

Jukka was pointing out that instead of the character references the OP
could use UTF-8 and include the characters directly in the page.
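
A small Python sketch of that point, assuming the standard library's `html.unescape` as a rough stand-in for what a browser does when it resolves references (the variable names are only for illustration):

```python
import html

# Markup that contains only ASCII characters, with the Japanese text
# written as numeric character references.
markup = "<p>&#x65E5;&#x672C;</p>"

# Because the markup is pure ASCII, these encodings all give the same bytes.
as_ascii = markup.encode("ascii")
as_latin1 = markup.encode("iso-8859-1")
as_utf8 = markup.encode("utf-8")
same_bytes = as_ascii == as_latin1 == as_utf8

# Resolving the references yields the Unicode characters themselves,
# whichever of those encodings the file was saved in.
resolved = html.unescape(markup)

print(same_bytes)  # True
print(resolved)    # <p>日本</p>
```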

Steve

WindAndWaves · Nov 22, 2004

Steve Pugh said:
Philip Ronan said:

[...] you should use Unicode character entities.

Jukka K. Korpela replied:

Make it "could". Why not use UTF-8?

UTF-8 *is* unicode. It's just an encoding. Philip didn't specify any
encoding - OP might as well use UCS, although it's a large.

Philip said to use "Unicode character entities" and from his example it is
clear that he was talking about numeric character references. As such,
Philip didn't need to specify any encoding: in HTML, character references
always refer to Unicode code points, so the encoding used for the file is
irrelevant.

Jukka was pointing out that instead of the character references the OP
could use UTF-8 and include the characters directly in the page.

Steve

Thank you all for your replies. I now understand that it is indeed
possible to have 'funny' characters in an ASCII file. You see, I have an
index file, which I would like to load quickly, but which also contains some
Japanese, Russian, Chinese, etc. characters (links pointing to translations
of the page). Now, I could either double the file in size by saving it as
Unicode or I could use the codes to specify the characters that I
need.

Can someone please confirm that I understood this correctly?

Thank you

- Nicolaas

PS does anyone know of any programs / online applications that can translate
characters into these codes?

Jukka K. Korpela · Nov 22, 2004

WindAndWaves said:
I have an index file, which I would like to load quickly, but also
contains some Japanese, Russian, Chinese, etc. characters (links
pointing to translations of the page).

Ideally, we would use language negotiation (a protocol for selecting
content based on the language preferences in the browser and information
on existing versions in the server) for sending the user the best
alternative available. But this is unreliable since most people have
wrong language settings in their browsers, so a multilingual index file
is indeed needed for a multilingual site.
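
For readers who have not seen language negotiation, here is a rough Python sketch of the server-side selection step. It assumes the browser's preferences arrive as an `Accept-Language` header value; the function name `pick_language` and the simplified parsing are illustrative only, not a complete implementation of the HTTP rules.

```python
def pick_language(accept_language: str, available: set, default: str = "en") -> str:
    """Pick the best available language for an Accept-Language value.

    Simplified sketch: handles "tag;q=value" items and falls back from
    "en-US" to "en". Wildcards and other details are ignored.
    """
    prefs = []
    for part in accept_language.split(","):
        piece = part.strip()
        if not piece:
            continue
        tag, _, params = piece.partition(";")
        quality = 1.0  # no q parameter means full preference
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        prefs.append((quality, tag.strip().lower()))

    # Highest quality first; the sort is stable, so ties keep header order.
    prefs.sort(key=lambda pref: pref[0], reverse=True)

    for quality, tag in prefs:
        if quality <= 0:
            continue
        if tag in available:
            return tag
        primary = tag.split("-")[0]
        if primary in available:
            return primary
    return default


print(pick_language("ja,en-US;q=0.8,en;q=0.5", {"en", "ja", "ru"}))  # ja
```

If the header is missing or names nothing the site offers, the sketch falls back to the default, which is where a multilingual index page comes in.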

Now, I could either double
the file in size by saving it as unicode or I could use the
codes to specify the characters that I need.

You can use either of the methods, but please note that using Unicode
does not double the file size. Well, sometimes it might, but normally it
won't. In UTF-8, each ASCII character takes just one octet (byte), just
as in a pure ASCII file. Other characters take two or more octets each,
but if your document (including HTML markup, which uses ASCII only) is
dominantly ASCII characters, the increase in file size won't be big, and
it'll probably be a little smaller than the size of a version that uses
references. (After all, &#1234; is seven octets.)
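
The size comparison is easy to check for a given page. A minimal Python sketch, using a made-up fragment of an index page (the links and text are assumptions for illustration):

```python
# Hypothetical fragment of a multilingual index page.
page = '<a href="ja/">日本語</a> <a href="ru/">Русский</a>'

# Option 1: save the characters directly, encoded as UTF-8.
utf8_bytes = page.encode("utf-8")

# Option 2: keep the file ASCII and write non-ASCII characters as
# decimal character references (Python's built-in error handler does this).
ref_bytes = page.encode("ascii", errors="xmlcharrefreplace")

print(len(utf8_bytes), len(ref_bytes))

# A single four-digit decimal reference costs seven octets.
print(len("&#1234;".encode("ascii")))  # 7
```

For this fragment the UTF-8 version comes out smaller, because each reference costs seven or eight octets while the characters themselves take two or three in UTF-8.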

PS does anyone know of any programs / online applications that can
translate characters into these codes?

There are many of them, for different platforms. See
http://www.alanwood.net/unicode/utilities_editors.html
(which is about Unicode editors, which let you work with UTF-8 in
general, but they often have an output mode that uses numeric character
references).
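
If a scripting language is at hand, the conversion can also be done in a few lines. A sketch in Python, relying on the built-in `xmlcharrefreplace` error handler; the function name `escape_non_ascii` is just for this example:

```python
def escape_non_ascii(text: str) -> str:
    """Replace every non-ASCII character with a decimal character reference.

    ASCII characters, including the HTML markup itself, pass through unchanged.
    """
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


print(escape_non_ascii('<a href="ja/">日本</a>'))
# <a href="ja/">&#26085;&#26412;</a>
```

The result can be saved as a plain ASCII file and still displays the original characters in a browser.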

Sybren Stuvel · Nov 23, 2004

Steve Pugh enlightened us with:

Jukka was pointing out that instead of the character references the
OP could use UTF-8 and include the characters directly in the page.

Ah, ok! Indeed, that's very possible, and I do it often.

Sybren

Sybren Stuvel · Nov 23, 2004

WindAndWaves enlightened us with:

I now understand that it is indeed possible to have 'funny'
characters in an ASCII file.

Strictly speaking, it's not. You have references to 'funny'
characters, but the references themselves are ASCII again, so no
'funny' characters are actually in the file. Or you have UTF-8 'funny'
characters in the file, but then the file isn't ASCII any more.
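
The distinction can be checked byte by byte. A small Python sketch (the helper `is_ascii_file` is a name invented here):

```python
def is_ascii_file(data: bytes) -> bool:
    """True if every byte is in the 7-bit ASCII range (0 to 127)."""
    return all(byte < 0x80 for byte in data)


# The same text, once with references and once with the characters themselves.
with_refs = "Japan: &#26085;&#26412;".encode("utf-8")
with_chars = "Japan: 日本".encode("utf-8")

print(is_ascii_file(with_refs))   # True: only ASCII bytes in the file
print(is_ascii_file(with_chars))  # False: the file is UTF-8, not ASCII
```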

You see, I have an index file, which I would like to load quickly,
but also contains some Japanese, Russian, Chinese, etc. characters
(links pointing to translations of the page). Now, I could either
double the file in size by saving it as unicode or I could use the
codes to specify the characters that I need.

You understood it incorrectly. If you were to use UCS to store the
unicode, you'd be right. If you use UTF-8 to store the unicode, the
ASCII characters would still take a single byte, and the others two or
more.
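
To see where the doubling comes from, compare a fixed two-byte encoding with UTF-8. This Python sketch uses the `utf-16-le` codec as a stand-in for UCS-2, which is an assumption that holds for the characters used here; the sample markup is invented.

```python
# Hypothetical list item: mostly ASCII markup around one Russian word.
mostly_ascii = "<li><a href='ru/'>Русский</a></li>"

ucs2_like = mostly_ascii.encode("utf-16-le")  # two bytes per character here
utf8 = mostly_ascii.encode("utf-8")           # one byte per ASCII character

print(len(mostly_ascii), len(ucs2_like), len(utf8))

# Per-character cost in UTF-8: ASCII, Cyrillic, and a Japanese character.
for ch in ("A", "Я", "日"):
    print(ch, len(ch.encode("utf-8")))
```

With the two-byte encoding the size is exactly twice the character count, while the UTF-8 size grows only by the extra octets of the non-ASCII characters.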

PS does anyone know of any programs / online applications that can
translate characters into these codes?

I think HTML tidy can do that.

Sybren
