How should Chinese sites be encoded to be listed in search engines?

Pat

Do Google and other search engines support indexing
non-English websites encoded in UTF-8?

Most Chinese websites indexed on Google appear to use
- for Traditional Chinese, charset=big5, encoding=ANSI
- for Simplified Chinese, charset=gb2312, encoding=ANSI


Do they also support
charset=UTF-8, encoding=UTF-8?
 
Nikita the Spider

Pat said:
Do Google and other search engines support indexing
non-English websites encoded in UTF-8?

Yes.


Most Chinese websites indexed on Google appear to use
- for Traditional Chinese, charset=big5, encoding=ANSI
- for Simplified Chinese, charset=gb2312, encoding=ANSI

I don't have any experience with Asian encodings, but my guess is that
Big5 is preferable to UTF-8 because it is more efficient (i.e. takes up
less space) when most of the characters are Asian. If you don't mind
fatter pages, UTF-8 should be fine.

HTH
 
Dylan Sung

Nikita the Spider said:
I don't have any experience with Asian encodings, but my guess is that
Big5 is preferable to UTF-8 because it is more efficient (i.e. takes up
less space) when most of the characters are Asian. If you don't mind
fatter pages, UTF-8 should be fine.

Encodings like GB and Big5 are double-byte encodings: each Chinese character
takes two bytes. Unicode, in its UTF-8 form at least, uses three or more bytes
for East Asian characters (amongst others). So yes, in terms of economy, GB
and Big5 yield text files that have fewer bytes.
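
To put rough numbers on that, here is a small Python sketch. It is only an illustration: the sample phrases and the helper name `encoded_size` are my own choices, and it relies on the `gb2312`, `big5` and `utf-8` codecs that ship with Python.

```python
# Compare the encoded size of the same short phrase in each encoding.
# The sample phrases are arbitrary illustrations, not taken from any real site.
simplified = "中文网站"   # "Chinese website" in simplified characters
traditional = "中文網站"  # the same phrase in traditional characters


def encoded_size(text: str, codec: str) -> int:
    """Return the number of bytes `text` occupies when encoded with `codec`."""
    return len(text.encode(codec))


sizes = {
    "gb2312": encoded_size(simplified, "gb2312"),
    "big5": encoded_size(traditional, "big5"),
    "utf-8 (simplified)": encoded_size(simplified, "utf-8"),
    "utf-8 (traditional)": encoded_size(traditional, "utf-8"),
}

# ASCII markup costs one byte per character in all three encodings.
markup = "<p></p>"
markup_sizes = {
    codec: encoded_size(markup, codec) for codec in ("gb2312", "big5", "utf-8")
}

for name, size in sizes.items():
    print(f"{name}: {size} bytes")
for codec, size in markup_sizes.items():
    print(f"markup in {codec}: {size} bytes")
```

Since HTML tags and attributes are plain ASCII, the overhead on a whole page is usually smaller than the difference seen on the Chinese text alone.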

You can view the repertoire of characters in Unicode as having subsets of GB
and Big5 within it, so you can do direct conversions from GB to Unicode and
from Big5 to Unicode. However, there are characters in GB which do not occur
in Big5 and vice versa, so conversion between the two is lossy. My guess is
that Google employs search algorithms which convert characters to UTF-8 and
then search for web pages containing both simplified characters in GB and
traditional characters in Big5 at the same time. At least, this is what I get
when I enter characters from one or the other character set into their
search field.
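
The conversion behaviour described above can be sketched in Python. This is a rough illustration, not a statement of how any search engine works: the helpers `convert` and `can_encode` are hypothetical names of mine, the two sample characters were picked because one is a simplified-only form and the other a traditional-only form, and the results depend on Python's own `gb2312` and `big5` codec tables.

```python
def convert(data: bytes, src: str, dst: str) -> bytes:
    """Re-encode bytes from one codec to another by way of Unicode."""
    return data.decode(src).encode(dst)


def can_encode(text: str, codec: str) -> bool:
    """Report whether every character in `text` exists in `codec`."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


simplified = "们"   # U+4EEC, simplified form
traditional = "們"  # U+5011, traditional form of the same character

# GB -> UTF-8 and Big5 -> UTF-8 go through Unicode without loss.
gb_as_utf8 = convert(simplified.encode("gb2312"), "gb2312", "utf-8")
big5_as_utf8 = convert(traditional.encode("big5"), "big5", "utf-8")

# Going directly between the two legacy encodings is where characters drop out.
report = {
    "simplified in gb2312": can_encode(simplified, "gb2312"),
    "simplified in big5": can_encode(simplified, "big5"),
    "traditional in big5": can_encode(traditional, "big5"),
    "traditional in gb2312": can_encode(traditional, "gb2312"),
    "both in utf-8": can_encode(simplified + traditional, "utf-8"),
}

for label, ok in report.items():
    print(f"{label}: {ok}")
```

A character that fails `can_encode` for the target codec would have to be replaced or dropped in a GB to Big5 conversion, which is the lossy case; UTF-8 can hold both forms on the same page.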

Dyl.
 
Dylan Sung

Dylan Sung said:
Encodings like GB and Big5 are double-byte encodings: each Chinese character
takes two bytes. Unicode, in its UTF-8 form at least, uses three or more bytes
for East Asian characters (amongst others). So yes, in terms of economy, GB
and Big5 yield text files that have fewer bytes.

You can view the repertoire of characters in Unicode as having subsets of GB
and Big5 within it, so you can do direct conversions from GB to Unicode and
from Big5 to Unicode. However, there are characters in GB which do not occur
in Big5 and vice versa, so conversion between the two is lossy. My guess is
that Google employs search algorithms which convert characters to UTF-8 and
then search for web pages containing both simplified characters in GB and
traditional characters in Big5 at the same time. At least, this is what I get
when I enter characters from one or the other character set into their
search field.


Sorry, I didn't answer the original question. I think that web pages should
declare the encoding they actually use: GB when GB is used, and so
forth. Search engines can do the rest.
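
As a minimal sketch of what "declare the encoding you actually use" means in practice, the Python below writes a page whose meta declaration matches the codec used for the bytes, then reads it back the way a crawler might. Everything here is an assumption for illustration: `render_page`, `declared_charset` and `decode_page` are hypothetical helpers, the regular expression is a simplification of real charset sniffing, and falling back to UTF-8 when nothing is declared is just my choice.

```python
import re


def render_page(body: str, charset: str) -> bytes:
    """Build a tiny HTML page and encode it with the charset it declares."""
    html = (
        "<html><head>"
        f'<meta http-equiv="Content-Type" content="text/html; charset={charset}">'
        "</head>"
        f"<body>{body}</body></html>"
    )
    return html.encode(charset)


# Simplified pattern: finds the first "charset=..." in the raw bytes.
_CHARSET_RE = re.compile(rb'charset=["\']?([A-Za-z0-9_-]+)', re.IGNORECASE)


def declared_charset(raw: bytes):
    """Return the charset named in the first 1024 bytes, or None if absent."""
    match = _CHARSET_RE.search(raw[:1024])
    return match.group(1).decode("ascii").lower() if match else None


def decode_page(raw: bytes) -> str:
    """Decode a page using its own declaration (UTF-8 if none is found)."""
    return raw.decode(declared_charset(raw) or "utf-8")


gb_page = render_page("中文网站", "gb2312")
big5_page = render_page("中文網站", "big5")
utf8_page = render_page("中文网站 中文網站", "utf-8")

for page in (gb_page, big5_page, utf8_page):
    print(declared_charset(page), len(page), "bytes")
```

If the declaration and the real encoding disagree, `decode_page` either raises an error or produces garbled text, which is the situation an honest declaration avoids.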

Dyl.
 
