How should Chinese sites be encoded to be listed in search engines?

Discussion in 'HTML' started by Pat, Jul 17, 2006.

  1. Pat

    Pat Guest

    Would Google and other search engines support the indexing of
    non-English UTF-8 encoded websites?

    Most Chinese websites indexed on Google appear to use:
    - for Traditional Chinese, charset=big5, encoding=ANSI
    - for Simplified Chinese, charset=gb2312, encoding=ANSI


    Do they also support
    charset=UTF-8, encoding=UTF-8?
     
    Pat, Jul 17, 2006
    #1

  2. In article <>,
    "Pat" <> wrote:

    > Would Google and other search engines support the indexing of
    > non-English UTF-8 encoded websites?


    Yes.


    > Most Chinese websites indexed on Google appear to use:
    > - for Traditional Chinese, charset=big5, encoding=ANSI
    > - for Simplified Chinese, charset=gb2312, encoding=ANSI


    I don't have any experience with Asian encodings, but my guess is that
    Big5 is preferable to UTF-8 because it is more compact (i.e. takes up
    less space) when most of the characters are Asian. If you don't mind
    fatter pages, UTF-8 should be fine.

    HTH

    --
    Philip
    http://NikitaTheSpider.com/
    Whole-site HTML validation, link checking and more
     
    Nikita the Spider, Jul 17, 2006
    #2

  3. Dylan Sung

    Dylan Sung Guest

    "Nikita the Spider" <> wrote in message
    news:...
    > In article <>,
    > "Pat" <> wrote:
    >
    >> Would Google and other search engines support the indexing of
    >> non-English UTF-8 encoded websites?

    >
    > Yes.
    >
    >
    >> Most Chinese websites indexed on Google appear to use:
    >> - for Traditional Chinese, charset=big5, encoding=ANSI
    >> - for Simplified Chinese, charset=gb2312, encoding=ANSI

    >
    > I don't have any experience with Asian encodings, but my guess is that
    > Big5 is preferable to UTF-8 because it is more compact (i.e. takes up
    > less space) when most of the characters are Asian. If you don't mind
    > fatter pages, UTF-8 should be fine.


    Encodings like GB and Big5 are double-byte encodings. Unicode (UTF-8 at
    least), however, uses three or more bytes for East Asian characters
    (amongst others). So yes, in terms of economy, GB and Big5 yield text
    files that have fewer bytes.
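
    To put rough numbers on that, here is a small Python sketch that counts
    the bytes for the same short phrase in each encoding. The sample strings
    and the helper names (encoded_size, size_report) are made up for
    illustration; only the codec names come from Python's standard library.

```python
# Compare the encoded size of the same short Chinese phrase in each encoding.
# The sample strings are illustrative and not taken from the thread.
TRADITIONAL = "中文網站"  # "Chinese website", Traditional characters
SIMPLIFIED = "中文网站"   # the same phrase in Simplified characters


def encoded_size(text: str, encoding: str) -> int:
    """Return the number of bytes `text` occupies in `encoding`."""
    return len(text.encode(encoding))


def size_report(text: str, legacy_encoding: str) -> dict:
    """Byte counts for `text` in a legacy encoding and in UTF-8."""
    return {
        legacy_encoding: encoded_size(text, legacy_encoding),
        "utf-8": encoded_size(text, "utf-8"),
    }


if __name__ == "__main__":
    print(size_report(TRADITIONAL, "big5"))
    print(size_report(SIMPLIFIED, "gb2312"))
    # ASCII markup costs one byte per character in all three encodings,
    # so the gap is smaller for a whole page than for bare text.
    print(size_report("<p>" + TRADITIONAL + "</p>", "big5"))
```

    Since HTML tags are plain ASCII, a real page with a lot of markup loses
    less to UTF-8 than the bare text does.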

    You can view the repertoire of characters in Unicode as containing both
    GB and Big5 as subsets, and thus you can convert directly from GB to
    Unicode, and from Big5 to Unicode. However, there are characters in GB
    which do not occur in Big5 and vice versa, so conversion between the two
    is lossy. My guess is that Google employs search algorithms which convert
    characters to UTF-8 and then search for web pages in both simplified (GB)
    and traditional (Big5) characters at the same time; at least this is
    what I get when I enter characters from one or the other character set
    into their search field.
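
    The lossy part can be checked with a short Python sketch. The two sample
    characters and the can_encode helper are my own picks for illustration,
    and this says nothing about how Google actually handles queries.

```python
# Check which encodings can represent a given character.
# Sample characters chosen for illustration: the plural marker "men"
# in its Simplified and Traditional forms.
SIMPLIFIED_CHAR = "们"   # U+4EEC
TRADITIONAL_CHAR = "們"  # U+5011


def can_encode(text: str, encoding: str) -> bool:
    """Return True if every character of `text` exists in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def round_trips(text: str, encoding: str) -> bool:
    """Return True if encoding then decoding gives back the same text."""
    return text.encode(encoding).decode(encoding) == text


if __name__ == "__main__":
    for char in (SIMPLIFIED_CHAR, TRADITIONAL_CHAR):
        for encoding in ("gb2312", "big5", "utf-8"):
            print(repr(char), encoding, can_encode(char, encoding))
```

    Each character fits its own legacy encoding and UTF-8, but not the other
    legacy encoding, which is why going through Unicode is the safe route.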

    Dyl.
     
    Dylan Sung, Jul 17, 2006
    #3
  4. Dylan Sung

    Dylan Sung Guest

    "Dylan Sung" <> wrote in message
    news:e9gfj6$rli$...
    >
    > "Nikita the Spider" <> wrote in message
    > news:...
    >> In article <>,
    >> "Pat" <> wrote:
    >>
    >>> Would Google and other search engines support the indexing of
    >>> non-English UTF-8 encoded websites?

    >>
    >> Yes.
    >>
    >>
    >>> Most Chinese websites indexed on Google appear to use:
    >>> - for Traditional Chinese, charset=big5, encoding=ANSI
    >>> - for Simplified Chinese, charset=gb2312, encoding=ANSI

    >>
    >> I don't have any experience with Asian encodings, but my guess is that
    >> Big5 is preferable to UTF-8 because it is more compact (i.e. takes up
    >> less space) when most of the characters are Asian. If you don't mind
    >> fatter pages, UTF-8 should be fine.

    >
    > Encodings like GB and Big5 are double-byte encodings. Unicode (UTF-8 at
    > least), however, uses three or more bytes for East Asian characters
    > (amongst others). So yes, in terms of economy, GB and Big5 yield text
    > files that have fewer bytes.
    >
    > You can view the repertoire of characters in Unicode as containing both
    > GB and Big5 as subsets, and thus you can convert directly from GB to
    > Unicode, and from Big5 to Unicode. However, there are characters in GB
    > which do not occur in Big5 and vice versa, so conversion between the two
    > is lossy. My guess is that Google employs search algorithms which convert
    > characters to UTF-8 and then search for web pages in both simplified (GB)
    > and traditional (Big5) characters at the same time; at least this is
    > what I get when I enter characters from one or the other character set
    > into their search field.



    Sorry, I didn't answer the original question. I think that web pages
    should declare whichever encoding they actually use: GB when GB is used,
    and so forth. Search engines can do the rest.
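
    As a rough illustration of "declare what you use", here is a Python
    sketch that writes a page whose meta tag matches the bytes it is saved
    in, and then reads the declaration back the way a crawler might. The
    build_page and declared_charset helpers are hypothetical, and real
    crawlers also look at the HTTP Content-Type header, which this ignores.

```python
import re
from typing import Optional


def build_page(body_text: str, encoding: str) -> bytes:
    """Build a tiny HTML page and encode it in the charset it declares."""
    html = (
        "<html><head>"
        f'<meta http-equiv="Content-Type" content="text/html; charset={encoding}">'
        "</head><body><p>" + body_text + "</p></body></html>"
    )
    return html.encode(encoding)


def declared_charset(page_bytes: bytes) -> Optional[str]:
    """Find the charset named in the page; the meta tag itself is ASCII."""
    match = re.search(rb"charset=([A-Za-z0-9_-]+)", page_bytes)
    return match.group(1).decode("ascii") if match else None


if __name__ == "__main__":
    page = build_page("中文网站", "gb2312")
    charset = declared_charset(page)
    # Decoding with the declared charset recovers the original text.
    print(charset, page.decode(charset))
```

    The same two helpers work unchanged with "big5" or "utf-8", as long as
    the body text is representable in the encoding you pick.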

    Dyl.
     
    Dylan Sung, Jul 17, 2006
    #4