Unicode characters in ASCII file

WindAndWaves

Hi Gurus

Do you know if it is possible to display Unicode characters (e.g. Japanese
ones) in an ASCII-based file?

TIA

- Nicolaas
 
Philip Ronan

WindAndWaves said:
Do you know if it is possible to display Unicode characters (e.g. Japanese
ones) in an ASCII-based file?

I'm not sure I understand what you mean.

The ASCII character set contains 128 characters consisting of uppercase and
lowercase Roman alphabets, Arabic numerals from 0-9, various punctuation
characters and 33 control codes (including DEL). No Japanese characters there
at all.

The Unicode standard contains thousands of characters, but it isn't the same
thing as ASCII.

If you want to include Japanese characters in an *HTML* file, then you
should use Unicode character entities. For example, the characters for
"Japan" are &#x65E5;&#x672C;.

Is that what you wanted?
 
Jukka K. Korpela

Philip Ronan said:
If you want to include Japanese characters in an *HTML* file, then you
should use Unicode character entities.

Make it "could". Why not use UTF-8? But technically it is indeed possible
to write an HTML document that is ASCII encoded, yet contains any
characters you want.

And they are character references, not entities. See
http://www.cs.tut.fi/~jkorpela/chars/ref.html

Philip Ronan said:
For example, the characters for
"Japan" are &#x65E5;&#x672C;.

Using decimal notation works more often, though the difference is getting
more and more marginal.
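
To illustrate the two notations, here is a minimal sketch in Python (my choice of language, nothing in the thread depends on it). It assumes the references in question are the hexadecimal and decimal forms of U+65E5 and U+672C, and uses the standard library's `html.unescape` to show that both forms resolve to the same characters.

```python
import html

hex_refs = "&#x65E5;&#x672C;"   # hexadecimal numeric character references
dec_refs = "&#26085;&#26412;"   # the same code points written in decimal

decoded_hex = html.unescape(hex_refs)
decoded_dec = html.unescape(dec_refs)

# Both notations name the same Unicode code points.
same_result = decoded_hex == decoded_dec
print(decoded_hex, decoded_dec, same_result)
```

The only practical difference is which notation a given browser understands; the characters referred to are identical.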
 
Sybren Stuvel

Philip Ronan said:
[...] you should use Unicode character entities.

Jukka K. Korpela replied:
Make it "could". Why not use UTF-8?

UTF-8 *is* Unicode. It's just an encoding. Philip didn't specify any
encoding - OP might as well use UCS, although it's large.

Sybren
 
Steve Pugh

Philip Ronan said:
[...] you should use Unicode character entities.

Jukka K. Korpela replied:
Make it "could". Why not use UTF-8?

Sybren Stuvel replied:
UTF-8 *is* Unicode. It's just an encoding. Philip didn't specify any
encoding - OP might as well use UCS, although it's large.

Philip said to use "Unicode character entities" and from his example it is
clear that he was talking about "Numeric character references".
As such Philip didn't need to specify any encoding - in HTML
all character references are always to Unicode so the encoding used would
be irrelevant.

Jukka was pointing out that instead of the character references the OP
could use UTF-8 and include the characters directly in the page.

Steve
 
WindAndWaves

Steve Pugh said:
Philip Ronan said:
[...] you should use Unicode character entities.

Jukka K. Korpela replied:
Make it "could". Why not use UTF-8?

Sybren Stuvel replied:
UTF-8 *is* Unicode. It's just an encoding. Philip didn't specify any
encoding - OP might as well use UCS, although it's large.

Philip said to use "Unicode character entities" and from his example it is
clear that he was talking about "Numeric character references".
As such Philip didn't need to specify any encoding - in HTML
all character references are always to Unicode so the encoding used would
be irrelevant.

Jukka was pointing out that instead of the character references the OP
could use UTF-8 and include the characters directly in the page.

Steve

Thank you all for your replies. I now understand that it is indeed
possible to have 'funny' characters in an ASCII file. You see, I have an
index file, which I would like to load quickly, but which also contains some
Japanese, Russian, Chinese, etc. characters (links pointing to translations
of the page). Now, I could either double the file in size by saving it as
Unicode or I could use the codes to specify the characters that I
need.

Can someone please confirm that I understood this correctly?

Thank you


- Nicolaas

PS Does anyone know of any programs / online applications that can translate
characters into these codes?
 
Jukka K. Korpela

WindAndWaves said:
I have an index file, which I would like to load quickly, but also
contains some Japanese, Russian, Chinese, etc. characters (links
pointing to translations of the page).

Ideally, we would use language negotiation (a protocol for selecting
content based on the language preferences in the browser and information
on existing versions in the server) for sending the user the best
alternative available. But this is unreliable since most people have
wrong language settings in their browsers, so a multilingual index file
is indeed needed for a multilingual site.

WindAndWaves said:
Now, I could either double
the file in size by saving it as unicode or I could use the
codes to specify the characters that I need.

You can use either of the methods, but please note that using Unicode
does not double the file size. Well, sometimes it might, but normally it
won't. In UTF-8, each Ascii character takes just one octet (byte), just
as in a pure Ascii file. Other characters take two or more octets each,
but if your document (including HTML markup, which uses Ascii only) is
dominantly Ascii characters, the increase in file size won't be big, and
it'll probably be a little smaller than the size of a version that uses
references. (After all, &#1234; is seven octets.)
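
The size comparison can be checked with a short Python sketch. The sample string is hypothetical (a couple of made-up translation links), and UTF-16 is used here as a stand-in for the "two octets per character" way of saving as Unicode; neither comes from the thread. The `xmlcharrefreplace` error handler is a standard Python codec option that writes non-ASCII characters as decimal references.

```python
# Hypothetical sample: mostly ASCII markup with a few non-ASCII link labels.
sample = '<a href="ja/">日本語</a> <a href="ru/">Русский</a>'

utf8_bytes = sample.encode("utf-8")
# xmlcharrefreplace turns each non-ASCII character into a decimal &#...; reference.
ref_bytes = sample.encode("ascii", errors="xmlcharrefreplace")
# UTF-16 without a byte order mark: two octets per character for this sample.
utf16_bytes = sample.encode("utf-16-le")

for name, data in [
    ("UTF-8", utf8_bytes),
    ("ASCII with references", ref_bytes),
    ("UTF-16", utf16_bytes),
]:
    print(f"{name}: {len(data)} octets")
```

For a sample like this, dominated by ASCII markup, UTF-8 comes out smallest and the reference version largest; the balance shifts as the share of non-ASCII text grows.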

WindAndWaves said:
PS does anyone know of any programs / online applications that can
translate characters into these codes?

There are many of them, for different platforms. See
http://www.alanwood.net/unicode/utilities_editors.html
(which is about Unicode editors, which let you work with UTF-8 in
general, but they often have an output mode that uses numeric character
references).
 
Sybren Stuvel

Steve Pugh enlightened us with:
Jukka was pointing out that instead of the character references the
OP could use UTF-8 and include the characters directly in the page.

Ah, ok! Indeed, that's very possible, and I do it often.

Sybren
 
Sybren Stuvel

WindAndWaves enlightened us with:
I now understand that it is indeed possible to have 'funny'
characters in an ASCII file.

Strictly speaking, it's not. You have references to 'funny'
characters, but the references themselves are ASCII again, so no
'funny' characters are actually in the file. Or you have UTF-8 'funny'
characters in the file, but then the file isn't ASCII any more.
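
A small Python sketch of that distinction, using a made-up one-line example rather than anything from the thread: the version with references encodes as plain ASCII, while the version containing the characters themselves does not, and needs an encoding such as UTF-8.

```python
with_refs = "Japan: &#26085;&#26412;"   # references only, spelled out in ASCII
direct = "Japan: \u65e5\u672c"           # the characters themselves

# The reference version is pure ASCII and encodes without error.
ascii_bytes = with_refs.encode("ascii")

# The direct version cannot be stored as ASCII at all...
try:
    direct.encode("ascii")
    direct_is_ascii = True
except UnicodeEncodeError:
    direct_is_ascii = False

# ...but it can be stored as UTF-8, where octets above 0x7F appear.
utf8_bytes = direct.encode("utf-8")
has_high_octets = any(b > 0x7F for b in utf8_bytes)

print(direct_is_ascii, has_high_octets)
```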

WindAndWaves said:
You see, I have an index file, which I would like to load quickly,
but also contains some Japanese, Russian, Chinese, etc. characters
(links pointing to translations of the page). Now, I could either
double the file in size by saving it as unicode or I could use the
codes to specify the characters that I need.

You understood it incorrectly. If you were to use UCS to store the
unicode, you'd be right. If you use UTF-8 to store the unicode, the
ASCII characters would still take a single byte, and the others two or
more.

WindAndWaves said:
PS does anyone know of any programs / online applications that can
translate characters into these codes?

I think HTML tidy can do that.
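
If a scripting language is at hand, the conversion is also easy to do yourself. The following is a minimal Python sketch, not a description of how HTML Tidy works; the function name `to_char_refs` is hypothetical. It replaces every non-ASCII character with a numeric character reference, in decimal by default or hexadecimal on request.

```python
def to_char_refs(text: str, hexadecimal: bool = False) -> str:
    """Replace every non-ASCII character with a numeric character reference."""
    out = []
    for ch in text:
        code = ord(ch)
        if code < 128:
            out.append(ch)                 # ASCII passes through unchanged
        elif hexadecimal:
            out.append(f"&#x{code:X};")    # e.g. &#x65E5;
        else:
            out.append(f"&#{code};")       # e.g. &#26085;
    return "".join(out)


decimal_form = to_char_refs("Japan: 日本")
hex_form = to_char_refs("Japan: 日本", hexadecimal=True)

# Python's built-in error handler produces the same decimal form.
builtin_form = "Japan: 日本".encode("ascii", "xmlcharrefreplace").decode("ascii")

print(decimal_form)
print(hex_form)
print(builtin_form == decimal_form)
```

Note that this sketch leaves existing `&` characters alone, so it is meant for text that is already valid HTML apart from the non-ASCII characters.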

Sybren
 
