Google cached version mangled

Discussion in 'HTML' started by N Cook, May 15, 2005.

  1. N Cook

    N Cook Guest

    I've added a small bit of foreign script to a file and now the Google
    cached version is wholly mangled.
    The Google version starts
    ÿþ
    first letter y with 2 dots over and then a sort of p
    and all that follows is minus spaces and the source html with brackets.
    I tried adding html lang="en" in <> at the beginning of the file but there
    was no change on the Google cached version.
     
    N Cook, May 15, 2005
    #1

  2. Toby Inkster

    Toby Inkster Guest

    N Cook wrote:

    > ÿþ


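    Check your HTTP headers. Those two characters are the UTF-16 byte order
    mark (bytes FF FE) displayed as if they were Latin-1, which is a common
    symptom of a UTF-16 file served without a charset.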
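
    A minimal Python sketch of where those two characters come from, assuming
    the file starts with a little-endian UTF-16 byte order mark and that
    whatever renders it falls back to Latin-1. Both are assumptions about this
    particular page, and `view_as_latin1` is a made-up helper name:

```python
import codecs


def view_as_latin1(text: str) -> str:
    """Encode text as UTF-16 LE with a BOM, then misread the bytes as Latin-1."""
    raw = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
    return raw.decode("latin-1")


mangled = view_as_latin1("<html>")
# The BOM bytes FF FE become the visible characters y-diaeresis and thorn,
# and every ASCII character is followed by a NUL (00) byte.
print(repr(mangled))
```

    The NUL after each character would account for the rest of the page
    looking like raw source rather than rendered HTML.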

    --
    Toby A Inkster BSc (Hons) ARCS
    Contact Me ~ http://tobyinkster.co.uk/contact
     
    Toby Inkster, May 15, 2005
    #2

  3. David Dorward

    N Cook wrote:

    > I've added a small bit of foreign script to a file


    It would help if you showed a URL.

    > I tried adding html lang ="en" in <> at the beginning of the file but no
    > change on the Google cached version.


    The lang attribute tells the user agent what language the document is
    written in. This is useful for things such as telling an aural browser
    which pronunciation guide to use, or for search engines to filter out
    documents if the user specified "Only in language X".

    It doesn't tell the user agent anything about how characters are represented
    in the text file. For that you need to configure your webserver to inform
    the user agent what the character encoding of the file is.

    http://www.cs.tut.fi/~jkorpela/chars/
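
    As a rough illustration of "configure your webserver", here is a sketch
    using Python's built-in http.server. It is not how the original poster's
    host is set up; `with_charset` and `CharsetHandler` are hypothetical names,
    and UTF-8 is an assumed default:

```python
from http.server import SimpleHTTPRequestHandler


def with_charset(mime_type: str, charset: str = "utf-8") -> str:
    """Append a charset parameter to text/* media types that lack one."""
    if mime_type.startswith("text/") and "charset=" not in mime_type:
        return f"{mime_type}; charset={charset}"
    return mime_type


class CharsetHandler(SimpleHTTPRequestHandler):
    """Static file handler that declares the encoding in Content-Type."""

    def guess_type(self, path):
        # The base class picks the media type from the file extension;
        # the charset parameter is added on top of that.
        return with_charset(super().guess_type(path))


print(with_charset("text/html"))
```

    Passing `CharsetHandler` to `http.server.HTTPServer` would serve the
    current directory with that header. On a shared host the equivalent is
    usually a server configuration setting rather than code.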

    --
    David Dorward <http://blog.dorward.me.uk/> <http://dorward.me.uk/>
    Home is where the ~/.bashrc is
     
    David Dorward, May 15, 2005
    #3
  4. N Cook

    N Cook Guest

    "David Dorward" <> wrote in message
    news:d67ic8$pdk$1$...
    > N Cook wrote:
    >
    > > I've added a small bit of foreign script to a file

    >
    > It would help if you showed a URL.
    >
    > > I tried adding html lang ="en" in <> at the beginning of the file but no
    > > change on the Google cached version.

    >
    > The lang attribute tells the user agent what language the document is
    > written in. This is useful for things such as telling an aural browser
    > which pronunciation guide to use, or for search engines to filter out
    > documents if the user specified "Only in language X".
    >
    > It doesn't tell the user agent anything about how characters are
    > represented in the text file. For that you need to configure your
    > webserver to inform the user agent what the character encoding of the
    > file is.
    >
    > http://www.cs.tut.fi/~jkorpela/chars/
    >
    > --
    > David Dorward <http://blog.dorward.me.uk/> <http://dorward.me.uk/>
    > Home is where the ~/.bashrc is


    The actual file is
    http://www.divdev.fsnet.co.uk/dysch.htm
    It was all fine until I added the Hebrew piece near the top, linking (via #)
    to the full Hebrew summary text near the end of the file.
    The Hebrew text reads correctly, right to left etc.; it is just that the
    Google cache does not seem to like it.

    Do I need to add an ISO code number for English, not just the "en"
    designation?
     
    N Cook, May 15, 2005
    #4
  5. N Cook

    N Cook Guest

    That URL has now been changed to try it without any reference to "he".
    The original from this weekend, the one cached on Google, is now parked and
    renamed as
    http://www.divdev.fsnet.co.uk/dysch_old.htm
     
    N Cook, May 15, 2005
    #5
  6. Toby Inkster

    Toby Inkster Guest

    N Cook wrote:

    > http://www.divdev.fsnet.co.uk/dysch.htm


    As I said yesterday, this is a UTF-16 file. You ought to specify that it's
    UTF-16 in the HTTP headers.

    Better yet -- convert it to UTF-8 (which handles Hebrew characters just
    fine!) and specify UTF-8 in the HTTP headers.
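
    A short Python sketch of that conversion, assuming the input really is
    UTF-16 with a byte order mark. `utf16_to_utf8` is an illustrative name,
    and the sample bytes are made up:

```python
import codecs


def utf16_to_utf8(data: bytes) -> bytes:
    """Re-encode UTF-16 bytes as UTF-8 without a byte order mark."""
    # The "utf-16" codec reads the BOM to pick the byte order and drops it;
    # without a BOM it falls back to the platform's native byte order.
    text = data.decode("utf-16")
    return text.encode("utf-8")


# A tiny stand-in for the page: markup around the Hebrew letter gimel (U+05D2).
sample = codecs.BOM_UTF16_LE + "<p>\u05d2</p>".encode("utf-16-le")
converted = utf16_to_utf8(sample)
print(converted)
```

    The result contains no BOM, so nothing precedes the first tag.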

    --
    Toby A Inkster BSc (Hons) ARCS
    Contact Me ~ http://tobyinkster.co.uk/contact
     
    Toby Inkster, May 16, 2005
    #6
  7. Luigi Donatello Asero

    "Toby Inkster" <> wrote in message
    news:p...
    > N Cook wrote:
    >
    > > http://www.divdev.fsnet.co.uk/dysch.htm

    >
    > As I said yesterday, this is a UTF-16 file. You ought to specify that it's
    > UTF-16 in the HTTP headers.
    >
    > Better yet -- convert it to UTF-8 (which handles Hebrew characters just
    > fine!) and specify UTF-8 in the HTTP headers.
    >
    > --
    > Toby A Inkster BSc (Hons) ARCS
    > Contact Me ~ http://tobyinkster.co.uk/contact
    >


    I am not sure whether this is the same subject you are talking about, but I
    have noticed something unusual (for me) about how the website
    https://www.scaiecat-spa-gigi.com can be searched at www.google.se now.
    When I searched for the term "Scaiecat Spa Gigi" I got some hits from this
    website and then a link to other pages of the same website.
    When I did that, I found about 500 results.
    Now I cannot find this link any more, although it is clear that more
    pages have been indexed.
    For example:
    http://www.google.se/search?hl=sv&q=Scaiecat Spa Gigi&meta=
    http://www.google.se/search?hl=sv&q=boende i Italien&meta=
    http://www.google.se/search?hl=sv&q=fakta Italien&meta=
    http://www.google.it/search?q=traduzioni svedese italiano&hl=it&lr=&start=10&sa=N
    http://www.google.it/search?hl=it&q=parlamento svedese&meta=

    Please note that some of the cached links are https addresses and php
    addresses, and others are html addresses.
    In the image section you still find a lot of results using the term
    "Scaiecat Spa Gigi"
    http://images.google.se/images?q=Scaiecat Spa Gigi&hl=sv
    So now I am wondering what has happened.



    --
    Luigi ( un italiano che vive in Svezia)
    https://www.scaiecat-spa-gigi.com/it/partille-a-maggio-2005.html
     
    Luigi Donatello Asero, May 16, 2005
    #7
  8. N Cook

    N Cook Guest

    "Toby Inkster" <> wrote in message
    news:p...
    > N Cook wrote:
    >
    > > http://www.divdev.fsnet.co.uk/dysch.htm

    >
    > As I said yesterday, this is a UTF-16 file. You ought to specify that it's
    > UTF-16 in the HTTP headers.
    >
    > Better yet -- convert it to UTF-8 (which handles Hebrew characters just
    > fine!) and specify UTF-8 in the HTTP headers.
    >
    > --
    > Toby A Inkster BSc (Hons) ARCS
    > Contact Me ~ http://tobyinkster.co.uk/contact
    >


    The Hebrew text as perceived by Google covers the 'letters'
    & # 1488 ... & # 1514 (no spaces).
    Is there a simple way of converting them to equivalents
    that will not upset Google? I'm thinking of a cut and paste
    into an online facility, like online language translation.
    I couldn't find one using the keywords {convert "utf-16 to utf-8" online}.
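
    For what it's worth, those &#...; numbers are the decimal Unicode code
    points of the Hebrew letters, written in plain ASCII. A small Python
    sketch of going back and forth (`to_numeric_refs` is a hypothetical helper
    name; the two sample letters are alef and tav):

```python
import html


def to_numeric_refs(text: str) -> str:
    """Replace every non-ASCII character with a decimal character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


# Alef (U+05D0) and tav (U+05EA), the first and last letters of the range.
refs = to_numeric_refs("\u05d0 ... \u05ea")
print(refs)

# The reverse direction: a reference back to the actual character.
gimel = html.unescape("&#1490;")
print(hex(ord(gimel)))
```

    Because the references are pure ASCII, a file that uses them for all its
    Hebrew reads the same under most common encodings.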
     
    N Cook, May 16, 2005
    #8
  9. N Cook

    N Cook Guest

    For the archives, for anyone else not so computer-wise:
    it looks as though all that is required, when it comes to saving the file
    to disk (in my case from Notepad), is to select the encoding option in
    "Save As" as UTF-8 rather than Unicode, which I had chosen before.
    I will try to FTP the revised UTF-8 version of the file this week.
     
    N Cook, May 18, 2005
    #9
  10. N Cook

    N Cook Guest

    That didn't work.

    This file, basically in English, contains some UTF-16 code for Hebrew,
    Russian and Thai, and is cached with no problem on Google:
    http://pclt.cis.yale.edu/pclt/encoding/
    cached at
    http://64.233.183.104/search?q=cache:VqK1HChCXs0J:pclt.cis.yale.edu/pclt/encoding/+%22iso-8859-8%22+hebrew+russian+thai+yale&hl=en&start=1&ie=UTF-8

    That Hebrew text does not contain character numbers 1494, 1509 and 1510,
    which are in 'my' Hebrew text.
    I've tried a version minus two of these in case they are interpreted as
    control codes, and I've also added a reference to charset=windows-1252.
     
    N Cook, May 20, 2005
    #10
  11. Toby Inkster

    Toby Inkster Guest

    N Cook wrote:

    > That Hebrew text does not contain character numbers 1494, 1509 and 1510
    > which are in 'my' Hebrew text.


    1. You are still not sending a charset in the HTTP header.

    2. You have three bytes of junk before the <HTML> tag. Remove them. If
    your text editor doesn't show you these three bytes, then use a hex editor
    or get a better text editor.
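
    If those three bytes are a UTF-8 byte order mark (EF BB BF), which is an
    assumption and not something stated above, they can be checked for and
    dropped without a hex editor. A sketch, with `strip_utf8_bom` as a made-up
    name:

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 byte order mark (EF BB BF)."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


page = codecs.BOM_UTF8 + b"<HTML>"
cleaned = strip_utf8_bom(page)
print(cleaned)
```

    Decoding with Python's "utf-8-sig" codec has the same effect on text read
    from a file.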

    --
    Toby A Inkster BSc (Hons) ARCS
    Contact Me ~ http://tobyinkster.co.uk/contact
     
    Toby Inkster, May 20, 2005
    #11
  12. N Cook

    N Cook Guest

    "Toby Inkster" <> wrote in message
    news:p...
    > N Cook wrote:
    >
    > > That Hebrew text does not contain character numbers 1494, 1509 and 1510
    > > which are in 'my' Hebrew text.

    >
    > 1. You are still not sending a charset in the HTTP header.
    >
    > 2. You have three bytes of junk before the <HTML> tag. Remove them. If
    > your text editor doesn't show you these three bytes, then use a hex editor
    > or get a better text editor.
    >
    > --
    > Toby A Inkster BSc (Hons) ARCS
    > Contact Me ~ http://tobyinkster.co.uk/contact
    >


    Yes, thanks for that. Viewed as a .txt file in Word, there is some junk
    that has crept in from somewhere.
     
    N Cook, May 20, 2005
    #12
  13. N Cook

    N Cook Guest

    The junk at the top of the file appears in the saved file after selecting
    UTF-8 rather than Unicode in Notepad's options, although it is not
    displayed when viewing the file in Notepad, so it may not be the problem.

    The latest version I've FTP'd has Unicode selected and charset=windows-1252
    at the top.
     
    N Cook, May 20, 2005
    #13
  14. N Cook

    N Cook Guest

    Followup
    I downloaded Hex Editor XVI32 from
    http://www.chmaas.handshake.de/delphi/freeware/xvi32/xvi32.htm
    That allowed me to remove FF,FE / 255,254 / ÿþ / y diaeresis and p with
    ascender
    that clogs up the front of the file.
    Apparently this is the BOM (Byte Order Mark), also known as Zero Width
    Non-Breaking Space (ZWNBSP).
    With the hex editor I also did a "Replace All" of the inter-character 00
    bytes with nothing, and now the bulk of my file
    http://www.divdev.fsnet.co.uk/dysch.htm
    with luck should read OK when Google Cached comes round in a day or two.
    Browser reading of the Hebrew 'unicode' is now junk but I feel I'm now
    getting there.
    It is hopefully just a matter of converting the Hebrew characters from hex
    codes like 05D2 to decimal codes like &#1490;, which Google Cached seems to
    like and also browsers, once I get the hang of cut and paste (block and
    paste?) in the hex editor, or some other fudge.
    I'm using this yale file as a model which reads Hebrew on browser and is
    cached by Google correctly
    http://pclt.cis.yale.edu/pclt/encoding/
    and a bare minimum of HTML eg no "he" LANG designation.
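
    A sketch of the two steps described in this post, in Python rather than a
    hex editor. It assumes the file was little-endian UTF-16, and
    `strip_nul_bytes` is an illustrative name. It shows why the Hebrew turns to
    junk after the 00 bytes go, and how hex 05D2 maps to the decimal reference:

```python
def strip_nul_bytes(data: bytes) -> bytes:
    """Mimic the hex editor's "Replace All" of 00 bytes with nothing."""
    return data.replace(b"\x00", b"")


# UTF-16 LE without a BOM, as after the manual removal of FF FE.
utf16_le = "<p>\u05d2</p>".encode("utf-16-le")
stripped = strip_nul_bytes(utf16_le)
# The ASCII markup survives, but the Hebrew letter's two bytes (D2 05)
# stay behind and no longer mean anything to a browser.
print(stripped)

# Hex code point to decimal numeric character reference.
code_point = int("05D2", 16)
reference = f"&#{code_point};"
print(reference)
```

    Replacing each leftover byte pair with its reference gives a file that is
    plain ASCII throughout.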
     
    N Cook, Jun 1, 2005
    #14
  15. N Cook

    N Cook Guest

    For the archives: the problem of Hebrew Unicode text and the corrupted
    Google cached version seems to be cracked.
    The Google search text is now correct, and the cached version should
    be corrected the next time the spider comes around.
    The solution is written up and will appear in the computer section of
    this file in the next few days:
    http://www.divdev.fsnet.co.uk/repair4.htm
     
    N Cook, Jun 3, 2005
    #15
