content-type and unicode

Simply Confusing!

Hi

i'm looking for a simple answer to what could be a complex question.

i'll try to make my question digestible.

i've done some web-pages in chinese. i pretty much ALWAYS work in unicode
sequences, meaning, I convert the word doc's with chinese char's into html,
then transplant the UNICODE SEQUENCES (ie, characters represented with stuff
like this: &#27171;&#30340;&#26481; ... etc ) into my templates.

somewhere I was told that for chinese, you use "big5" (traditional) and
"gb1312" (simplified) for the charset attrib's on the Content-type metatag.
This I did, but occasionally, the browser would display ascii-gibberish, and
occasionally weird things would happen between where I'd download the
gibberish containing file, and my unicode sequences had actually been
replaced by ascii-gibberish. odd.

so then I reverted to using the iso-8859-1 charset attrib, and everything
settled down. no problem. I use the lang-tags zh-tw and zh-cn to ID my
pages as traditional or simplified. (yes, i know that does not relate to
char display).

so i recently found a chinese language site and checked out the source code.
it was puzzling because the charset was utf-8 and the source was actually in
original chinese characters, not unicode.

i'm quite puzzled now. my chinese pages are displaying fine with unicode
under iso-8859-1, but I'm not sure what the "definitive" way is to display
non-latin character sequences. is there one?

i'd be particularly interested in hearing from asians who design asian
sites; also from western coders who have successfully developed chinese
language sites, or other non-latin language sites (russian, hebrew, arabic,
etc...)

thanks for any clarification or comments.

SC
 
J.O. Aho

Simply Confusing! said:
> i've done some web-pages in chinese. i pretty much ALWAYS work in unicode
> sequences, meaning, I convert the word doc's with chinese char's into html,
> then transplant the UNICODE SEQUENCES (ie, characters represented with stuff
> like this: &#27171;&#30340;&#26481; ... etc ) into my templates.
>
> so i recently found a chinese language site and checked out the source code.
> it was puzzling because the charset was utf-8 and the source was actually in
> original chinese characters, not unicode.
>
> i'm quite puzzled now. my chinese pages are displaying fine with unicode
> under iso-8859-1, but I'm not sure what the "definitive" way is to display
> non-latin character sequences. is there one?

iso-8869-1 supports only a-zA-Z and some national characters used mainly
in western and northern Europe, and does not support any form of Chinese
characters. It supports 256 "characters", which would hardly be enough for any
form of Chinese alone.

Character encodings like big5 and gb2312 use two bytes to represent each
character, usually combinations of byte values above the first 128. If you
want to use these encodings, you should save the text in that format and not
convert it to HTML entities, as you do.

UTF-8 is a newer encoding in which you can use all languages at the same
time. It works in a similar way to big5, in that multiple bytes represent a
character, which gets you around the 256-character limit of a single-byte
encoding. UTF-8 is a Unicode encoding.
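
As an illustration of the points above (a minimal Python 3 sketch, not code from the thread), the built-in codecs show which encodings can hold Chinese text and how many bytes each one needs. The helper name try_encode and the sample strings are made up for the example.

```python
# Illustrative sketch: which encodings can represent Chinese text at all.
traditional = "樣的東"   # traditional characters, as in the question
simplified = "样的东"    # the simplified counterparts


def try_encode(text, encoding):
    """Return the encoded bytes, or None if the encoding cannot represent text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None


latin1 = try_encode(traditional, "iso-8859-1")
big5 = try_encode(traditional, "big5")
gb2312 = try_encode(simplified, "gb2312")
utf8 = try_encode(traditional, "utf-8")

print(latin1)       # None: a single-byte Western encoding has no Chinese
print(len(big5))    # 6: two bytes per character
print(len(gb2312))  # 6: two bytes per character
print(len(utf8))    # 9: three bytes per character for these code points
```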
 
Gérard Talbot

J.O. Aho wrote:
> iso-8869-1

You most probably meant iso-8859-1 here.

> supports only a-zA-Z and some national characters used mainly
> in western and northern Europe, and does not support any form of Chinese
> characters. It supports 256 "characters", which would hardly be enough for
> any form of Chinese alone.
>
> Character encodings like big5 and gb2312 use two bytes to represent each
> character, usually combinations of byte values above the first 128. If you
> want to use these encodings, you should save the text in that format and
> not convert it to HTML entities, as you do.

Correct. This is also my recommendation.

> UTF-8 is a newer encoding in which you can use all languages at the same
> time. It works in a similar way to big5, in that multiple bytes represent a
> character, which gets you around the 256-character limit of a single-byte
> encoding. UTF-8 is a Unicode encoding.

Exactly.

Gérard
 
I V

The &#... sequences (I think the correct term is "character
references," hopefully someone will correct me if I'm wrong) are a way of
representing unicode characters in a document that is stored in an
encoding that doesn't include all the unicode characters. Note that the
encoding in which you save your file has no effect on your use of
character references - these will always represent unicode characters, no
matter what encoding you use.

utf-8 allows you to directly store unicode characters in the file, so you
don't need to represent them using the &#... sequences. However, to use
it, you will need to use a text editor that can read and write utf-8
files, and that allows you to insert all the characters that you want to
use.

I don't think there is a "definitive" method. If you are only using a few
characters outside of iso-8859-1, it might be easiest to carry on using
&#... sequences. If you are using a lot of Chinese characters, on the
other hand, it might be easier (and might lower your file size, too) to
use a different encoding, so that you can store the Chinese characters
directly in the file. You could use UTF-8, big5, or another encoding,
depending on what your text editor supports. UTF-8 may be useful if you
are mixing western and Chinese characters because, as J.O. says, UTF-8
allows you to directly insert any unicode character.

> iso-8869-1 supports only a-zA-Z and some national characters used
> mainly in western and northern Europe, and does not support any form of
> Chinese characters. It supports 256 "characters", which would hardly be
> enough for any form of Chinese alone.

While that's true, iso-8859-1 encoded documents can still include any
unicode characters through the use of &#... sequences.
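
A short Python 3 sketch of that idea (added as an illustration, not part of the original post): the standard "xmlcharrefreplace" error handler writes any character the target encoding lacks as a numeric character reference, and html.unescape resolves the references again, roughly as a browser would.

```python
import html

text = "樣的東"

# Save as ISO-8859-1, falling back to &#...; for characters it cannot hold.
encoded = text.encode("iso-8859-1", errors="xmlcharrefreplace")
print(encoded)  # b'&#27171;&#30340;&#26481;'

# Resolving the references yields the same Unicode characters again.
decoded = html.unescape(encoded.decode("iso-8859-1"))
print(decoded == text)  # True
```

The file on disk stays pure ASCII here, which is why the declared encoding has no effect on how the references are interpreted.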
 
Gérard Talbot

Simply Confusing! wrote:
> Hi
>
> i'm looking for a simple answer to what could be a complex question.
>
> i'll try to make my question digestible.
>
> i've done some web-pages in chinese. i pretty much ALWAYS work in unicode
> sequences, meaning, I convert the word doc's with chinese char's into html,
> then transplant the UNICODE SEQUENCES (ie, characters represented with stuff
> like this: &#27171;&#30340;&#26481; ... etc ) into my templates.
>
> somewhere I was told that for chinese, you use "big5" (traditional) and
> "gb1312" (simplified)

You most likely meant to say gb2312 here, not gb1312.

> for the charset attrib's on the Content-type metatag.

You unfortunately need more than that. The web server should also serve the
document with the correct charset (big5 or gb2312) in its HTTP Content-Type
header. Sometimes the web server is misconfigured. You may have to ask your
webserver admin (in my case, I had to) so that - if you're lucky - the Apache
server can be tuned to serve your document as big5 or gb2312.

Content Negotiation (for Apache servers)
http://httpd.apache.org/docs/1.3/content-negotiation.html

One workaround I remember (until the admin of the web server fixes the
problem) is to create an .htaccess file and set the character set in it
with

AddCharset GB2312 .html

AddCharset directive in Apache servers
http://httpd.apache.org/docs/1.3/mod/mod_mime.html#addcharset

FAQ: Setting charset information in .htaccess
http://www.w3.org/International/questions/qa-htaccess-charset

Setting the HTTP charset parameter
http://www.w3.org/International/O-HTTP-charset.en.php

> This I did, but occasionally, the browser would display ascii-gibberish, and
> occasionally weird things would happen between where I'd download the
> gibberish containing file, and my unicode sequences had actually been
> replaced by ascii-gibberish. odd.
>
> so then I reverted to using the iso-8859-1 charset attrib, and everything
> settled down. no problem. I use the lang-tags zh-tw and zh-cn to ID my
> pages as traditional or simplified. (yes, i know that does not relate to
> char display).

What you need here are the HTTP response headers for your webpages, so
that you can know for sure how your webpage is served. From the
symptoms you describe, I would bet this is what is happening: your
webserver is not configured to serve your webpage with the
correct/intended character set.
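
For illustration only (a Python 3 sketch, not from the thread), this is the general symptom when bytes saved in one encoding are decoded under another label. Whether this is what happened on the original poster's server cannot be told from the thread.

```python
text = "樣的東"

stored = text.encode("big5")          # the bytes as saved on disk
wrong = stored.decode("iso-8859-1")   # decoded under the wrong charset label
right = stored.decode("big5")         # decoded under the matching label

print(repr(wrong))    # six unrelated Latin-1 characters: the "gibberish"
print(len(wrong))     # 6: every byte became its own character
print(right == text)  # True
```

ISO-8859-1 assigns a character to every byte value, so the wrong decoding never raises an error; it silently produces the wrong text.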

View HTTP Request and Response Header
http://web-sniffer.net/

Most developer tools/toolbars have an HTTP headers feature.
E.g.:
LiveHTTPHeaders
http://livehttpheaders.mozdev.org/

You can even have a bookmarklet for that:

Jesse Ruderman Validation Bookmarklets
http://www.squarefree.com/bookmarklets/validation.html

More and more browsers now provide such a feature too, or a page info panel
showing how the document was served. For Opera 9:
Opera W3-Dev Menu
http://tobyinkster.co.uk/opera

W3-dev > More Page tests > HTTP Headers
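
Whatever tool shows the headers, the value to look for is the charset parameter of Content-Type. A minimal Python 3 sketch (an illustration, not from the thread) that extracts it with the standard library; the function name charset_from_content_type is hypothetical.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the lowercased charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=Big5"))  # big5
print(charset_from_content_type("text/html"))                # None
```

On a live response, urllib.request.urlopen(url).headers.get_content_charset() performs the same lookup, since the headers object is built on the same Message class.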

> so i recently found a chinese language site and checked out the source code.
> it was puzzling because the charset was utf-8 and the source was actually in
> original chinese characters, not unicode.
>
> i'm quite puzzled now. my chinese pages are displaying fine with unicode
> under iso-8859-1, but I'm not sure what the "definitive" way is to display
> non-latin character sequences. is there one?

Chances are 99%, I'd bet, that your web server is misconfigured and
cannot send your webpage as big5 or gb2312.

> i'd be particularly interested in hearing from asians who design asian
> sites;

On-line Chinese Tools
http://projects.ldc.upenn.edu/Chinese/info_it.htm

Penn State lab courses on computing in foreign scripts:
Tips for Developing Non-English Web Sites
http://tlt.its.psu.edu/suggestions/international/

Penn State lab courses on computing in foreign scripts: Chinese
(Simplified & Traditional)
http://tlt.its.psu.edu/suggestions/international/bylanguage/chinese.html

> also from western coders who have successfully developed chinese
> language sites, or other non-latin language sites (russian, hebrew, arabic,
> etc...)

Help Chinese translation page
http://www.gtalbot.org/DHTMLSection/HelpChineseTranslationPage.html

I have done webpages in Chinese, Russian, Hebrew, Arabic, etc, in over
20 languages, even Inuktitut.

Site Map
http://www.gtalbot.org/Varia/SiteMap.html

Gérard
 
Simply Confusing!

THANKS to all for helping to distill this complex and confusing language
display issue.

PC
 
J.O. Aho

I V said:
> While that's true, iso-8859-1 encoded documents can still include any
> unicode characters through the use of &#... sequences.

Sure, you can use HTML entities to represent characters that aren't supported
in the encoding you save your HTML files in, but it's still not optimal for a
big5 site to use iso-8859-1 with HTML entities. Remember that this only
works in an HTML browser that supports HTML entities for Unicode characters
(most modern ones should). You will not get good representation in search
engines, and searching will not work well.
 
Jukka K. Korpela

Scripsit Simply Confusing!:
> i'm looking for a simple answer to what could be a complex question.

You seem to have got a lot of useful advice, so I'll just throw in some
additional casual remarks.

> somewhere I was told that for chinese, you use "big5" (traditional)
> and "gb1312" (simplified) for the charset attrib's on the
> Content-type metatag. This I did, but occasionally, the browser would
> display ascii-gibberish,

It would be essential to know some URL(s) to see what really happens.
Setting the encoding (charset) in a meta tag is as such correct, though many
people frown upon it, but if the server sends contradicting information
about the encoding, the server wins. Some browsers might incorrectly make
their own guesses even in the presence of encoding information. Finally, it
is possible that the meta tag has some typo and gets ignored - and then (in
the absence of encoding information in HTTP headers) browsers will have to
make their guesses, and they may guess differently.
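
The precedence described above can be sketched in Python 3 as follows. This is a simplified illustration, not how any particular browser is implemented: the function name effective_charset, the regular expression, and the fixed fallback (a stand-in for browser guessing) are all assumptions made for the example.

```python
import re

# Rough pattern for a charset declared inside a <meta> tag (illustrative only).
META_CHARSET = re.compile(
    rb"<meta[^>]+charset\s*=\s*[\"']?([A-Za-z0-9_-]+)", re.IGNORECASE
)


def effective_charset(http_charset, body, fallback="iso-8859-1"):
    """Pick the charset: HTTP header first, then meta tag, then a fallback guess."""
    if http_charset:
        return http_charset.lower()
    match = META_CHARSET.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    return fallback


page = b'<meta http-equiv="Content-Type" content="text/html; charset=big5">'

print(effective_charset("ISO-8859-1", page))    # iso-8859-1: the server wins
print(effective_charset(None, page))            # big5: the meta tag is used
print(effective_charset(None, b"<p>hello</p>")) # iso-8859-1: fallback guess
```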

> I use the lang-tags zh-tw and
> zh-cn to ID my pages as traditional or simplified. (yes, i know that
> does not relate to char display).

Actually they _do_ relate to (affect) character display, even though they do
not affect the question of interpreting data as characters. After characters
have been identified, a browser _may_ use language information to select a
suitable _font_, and a browser _may_ have different treatment for zh-TW and
zh-CN in this respect.

> so i recently found a chinese language site and checked out the
> source code. it was puzzling because the charset was utf-8 and the
> source was actually in original chinese characters, not unicode.

It was probably Unicode - just _real_ Unicode, not &#...; notations
(which aren't part of Unicode at all - they are just an SGML, HTML, or XML
thing defined using Unicode numbers).

The choice between the Chinese encodings and utf-8 is a practical one, and
largely a matter of assumed efficiency. The Chinese encodings have been
designed for Chinese text and they are more efficient for it than utf-8,
which was designed to cover "all" characters in the world so that texts in
Western languages can be represented efficiently.
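
To put rough numbers on the efficiency argument, a small Python 3 sketch (an illustration, not from the thread) compares the size of the same three Chinese characters stored as Big5, as UTF-8, and as numeric character references, next to an English phrase in UTF-8.

```python
chinese = "樣的東"
english = "the same"

sizes = {
    "big5": len(chinese.encode("big5")),
    "utf-8": len(chinese.encode("utf-8")),
    # Pure-ASCII file with every character written as &#NNNNN;
    "references": len(chinese.encode("ascii", errors="xmlcharrefreplace")),
}

print(sizes)                         # {'big5': 6, 'utf-8': 9, 'references': 24}
print(len(english.encode("utf-8")))  # 8: one byte per ASCII character
```

For a full page the HTML markup itself is ASCII in all three cases, so the difference in total file size is smaller than these per-character figures suggest.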
 
Gérard Talbot

J.O. Aho wrote:
> Sure, you can use HTML entities to represent characters that aren't
> supported in the encoding you save your HTML files in, but it's still not
> optimal for a big5 site to use iso-8859-1 with HTML entities. Remember that
> this only works in an HTML browser that supports HTML entities for Unicode
> characters (most modern ones should).

Browser support for Unicode characters is not the most frequent issue in
such cases; font support usually is. You can still have undisplayed
characters while using (named or numerical) character entities,
because the font cannot render (does not support) the referenced glyph.

> You will not get good representation in search
> engines, and searching will not work well.

Gérard
 
Simply Confusing!

Let me see if I have this any where near *understanding* ...

If I use the iso-8859-1 charset definition, I can only use roman
alpha-numeric sequences in my code (ie, like regular english/french/german,
or the unicode &#__; sequence), no problem, correct?

If I use BIG5 or GB2312 ... I (should/can/must?) insert the original chinese
figures into the raw code?

If I use UTF-8, I can use either the unicode sequence or the original
chinese figures.


(ps- looking for simple "what is best answers",
10-words-or-less-kind-of-thing, no long treatises or court-cases please :)

thanks!

SC
 
Gérard Talbot

Simply Confusing! wrote:
> Let me see if I have this any where near *understanding* ...

If you use BIG5 or GB2312 ... then you should/can/must insert "the
original chinese figures into the raw code". That's what J.O. Aho, Jukka
and I have been recommending to you.

Remember that you need to verify that your webserver is properly
configured to send your webpage as big5 or gb2312 in such case or to
contact your webserver admin if the webserver is not sending your
webpage as big5 or gb2312.

Gérard
 
Jukka K. Korpela

Scripsit Simply Confusing!:
> If I use the iso-8859-1 charset definition, I can only use roman
> alpha-numeric sequences in my code (ie, like regular
> english/french/german, or the unicode &#__; sequence), no problem,
> correct?

What you are probably trying to say is correct, but what you are actually
saying is obscure; I mean the "alpha-numeric sequences in my code" part. You
can use almost any Latin letters as used in Western European languages as
such, and for characters that have no iso-8859-1 code, you can use a &#__;
sequence. (Regular English, French, and German contain punctuation marks
that don't exist in iso-8859-1, and French even has a letter, the oe
ligature, that isn't there.)
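
A quick way to check this for any text is sketched below in Python 3 (an illustration, not from the thread); the function name missing_from_latin1 and the sample phrase are made up for the example.

```python
def missing_from_latin1(text):
    """Return the distinct characters of text that ISO-8859-1 cannot encode."""
    missing = []
    for ch in text:
        try:
            ch.encode("iso-8859-1")
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing


# Curly quotes and the oe ligature are outside ISO-8859-1; é and à are inside.
print(missing_from_latin1("“Un cœur”, déjà vu"))  # ['“', 'œ', '”']
```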

> If I use BIG5 or GB2312 ... I (should/can/must?) insert the original
> chinese figures into the raw code?

You can. You probably should, since it doesn't really make much sense to use
a Chinese encoding and yet use &#__; for Chinese characters, but you can, so
"must" would be wrong.

> If I use UTF-8, I can use either the unicode sequence or the original
> chinese figures.

Yes. Here, too, there's normally little reason not to insert the characters
as such.

> (ps- looking for simple "what is best answers",
> 10-words-or-less-kind-of-thing, no long treatises or court-cases
> please :)

Nobody expects the Spanish inquisition!
 
J.O. Aho

Simply Confusing! said:
> Let me see if I have this any where near *understanding* ...
>
> If I use the iso-8859-1 charset definition, I can only use roman
> alpha-numeric sequences in my code (ie, like regular english/french/german,
> or the unicode &#__; sequence), no problem, correct?

Yes, that's true (HTML entities still depend on the fonts installed on the
client machine supporting the characters that you want to display with the
HTML entities).

> If I use BIG5 or GB2312 ... I (should/can/must?) insert the original chinese
> figures into the raw code?

You should type the Chinese in BIG5/GB2312 directly (in the same way as you
would type German in ISO-8859-1); you can still type English (a-zA-Z) in the
CJK encodings.

> If I use UTF-8, I can use either the unicode sequence or the original
> chinese figures.

Not sure what you mean by Unicode, but I suspect you mean HTML entities
(&#__;). UTF-8 is Unicode, and you don't need HTML entities, as you can
represent all characters in UTF-8. You can still get trash if you mix
Unicode with HTML entities in UTF-8, depending on how you insert the HTML
entities into the text.

> (ps- looking for simple "what is best answers",
> 10-words-or-less-kind-of-thing, no long treatises or court-cases please :)

Use UTF-8, don't use HTML entities.
 
Jukka K. Korpela

Scripsit J.O. Aho:
> (HTML entities still depend on the fonts installed
> on the client machine supporting the characters that you want to
> display with the HTML entities).

Please don't confuse people in this already confusing issue. Confusing
entities with character references is understandable (W3C does it too), but
character display does _not_ depend on the way the character was represented
in HTML source.
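
A small Python 3 sketch of this point (an illustration, not from the thread): raw UTF-8 bytes, a decimal character reference, and a hexadecimal one all resolve to the same code point, so everything downstream, including font selection, sees the same character.

```python
import html

from_utf8_bytes = b"\xe6\xa8\xa3".decode("utf-8")  # character stored directly
from_decimal_ref = html.unescape("&#27171;")       # decimal character reference
from_hex_ref = html.unescape("&#x6A23;")           # hexadecimal character reference

print(from_utf8_bytes == from_decimal_ref == from_hex_ref)  # True
print(hex(ord(from_utf8_bytes)))                            # 0x6a23
```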

> Not sure what you mean by Unicode, but I suspect you mean HTML
> entities (&#__;).

Character references (as defined in SGML and XML, not really in HTML
separately).

> UTF-8 is Unicode, and you don't need HTML
> entities, as you can represent all characters in UTF-8. You can still
> get trash if you mix Unicode with HTML entities in UTF-8,
> depending on how you insert the HTML entities into the text.
>
> Use UTF-8, don't use HTML entities.

No. Use UTF-8 or a Chinese encoding, based on criteria that cannot be
discussed without further information. Using ASCII or ISO-8859-1 with
character references would make sense for pages in Chinese only in rare
cases, e.g. when the pages must be editable (somehow) using simple editors
that don't let you input and view Chinese characters.

Use character references or predefined entity references (like &mdash;)
whenever you find them comfortable. You probably won't find them comfortable
at all for Chinese characters, though.
 
