Google cached version mangled

N Cook

I've added a small bit of foreign script to a file and now the Google
cached version is wholly mangled.
The Google version starts
ÿþ
(first a letter y with two dots over it, then a sort of p)
and all that follows is minus spaces and the source html with brackets.
I tried adding <html lang="en"> at the beginning of the file, but there was
no change in the Google cached version.
 
David Dorward

N said:
I've added a small bit of foreign script to a file

It would help if you showed a URL.
I tried adding <html lang="en"> at the beginning of the file, but there was
no change in the Google cached version.

The lang attribute tells the user agent what language the document is
written in. This is useful for things such as telling an aural browser
which pronunciation guide to use, or for search engines to filter out
documents if the user specified "Only in language X".

It doesn't tell the user agent anything about how characters are represented
in the text file. For that you need to configure your webserver to inform
the user agent what the character encoding of the file is.

http://www.cs.tut.fi/~jkorpela/chars/
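
To illustrate the distinction: the lang attribute has no bearing on how the bytes of a file are decoded, whereas the charset parameter of the HTTP Content-Type header does. The following is a minimal Python sketch, not anything taken from the poster's server; the helper name charset_from_content_type and the sample page are hypothetical, and latin-1 stands in for whatever a client might guess when no charset is sent.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None if absent."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


# A page marked lang="en" that contains one Hebrew letter (alef, U+05D0).
page = '<html lang="en"><p>\u05d0</p></html>'
raw = page.encode("utf-8")

# With a charset in the header, the client knows how to decode the bytes.
declared = charset_from_content_type("text/html; charset=UTF-8")
correct = raw.decode(declared)

# Without one, a client has to guess; a wrong guess mangles the Hebrew,
# regardless of what the lang attribute says.
missing = charset_from_content_type("text/html")
guessed = raw.decode("latin-1")
```

Here correct equals the original page, while guessed does not, even though both were decoded from the same bytes with the same lang attribute.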
 
N Cook

David Dorward said:
It would help if you showed a URL.


The lang attribute tells the user agent what language the document is
written in. This is useful for things such as telling an aural browser
which pronunciation guide to use, or for search engines to filter out
documents if the user specified "Only in language X".

It doesn't tell the user agent anything about how characters are represented
in the text file. For that you need to configure your webserver to inform
the user agent what the character encoding of the file is.

http://www.cs.tut.fi/~jkorpela/chars/

The actual file is
http://www.divdev.fsnet.co.uk/dysch.htm
All was fine until I added the Hebrew piece near the top, #linking
to the full Hebrew summary text near the end of the file.
The Hebrew text reads correctly, right to left etc.; it is just that the
Google cached version does not seem to like it.

Do I need to add an ISO code number for English, not just the "en"
designation?
 
N Cook

N Cook said:
The actual file is
http://www.divdev.fsnet.co.uk/dysch.htm
All was fine until I added the Hebrew piece near the top, #linking
to the full Hebrew summary text near the end of the file.
The Hebrew text reads correctly, right to left etc.; it is just that the
Google cached version does not seem to like it.

Do I need to add an ISO code number for English, not just the "en"
designation?

That URL has now been converted, as a trial, to a version without any
reference to "he".
The original, which is the version Google cached this weekend, is now
parked, renamed as
http://www.divdev.fsnet.co.uk/dysch_old.htm
 
Luigi Donatello Asero

Toby Inkster said:
As I said yesterday, this is a UTF-16 file. You ought to specify that it's
UTF-16 in the HTTP headers.

Better yet -- convert it to UTF-8 (which handles Hebrew characters just
fine!) and specify UTF-8 in the HTTP headers.

I am not sure whether this is the same subject you are talking about, but I
have noticed something unusual (for me) about how the website
https://www.scaiecat-spa-gigi.com can be searched at www.google.se now.
When I searched for the term "Scaiecat Spa Gigi" I got some hits from this
website and then a link to other pages of the same website.
And when I did, I found about 500 results.
Now I do not find this link any more, although it is clear that there are
more pages which have been indexed.
For example:
http://www.google.se/search?hl=sv&q=Scaiecat+Spa+Gigi&meta=
http://www.google.se/search?hl=sv&q=boende+i+Italien&meta=
http://www.google.se/search?hl=sv&q=fakta+Italien&meta=
http://www.google.it/search?q=traduzioni+svedese+italiano&hl=it&lr=&start=10&sa=N
http://www.google.it/search?hl=it&q=parlamento+svedese&meta=

Please note that some of the cached links are https addresses and php
addresses, and others are html addresses.
In the image section you still find a lot of results by using the term
"Scaiecat Spa Gigi"
http://images.google.se/images?q=Scaiecat+Spa+Gigi&hl=sv
So, now I am wondering what has happened.
 
N Cook

Toby Inkster said:
As I said yesterday, this is a UTF-16 file. You ought to specify that it's
UTF-16 in the HTTP headers.

Better yet -- convert it to UTF-8 (which handles Hebrew characters just
fine!) and specify UTF-8 in the HTTP headers.

The Hebrew text as perceived by Google covers 'letters'
& # 1488 ... & # 1514 (no spaces)
Is there a simple way of converting them to equivalents
that will not upset Google. I'm thinking of a cut & paste
into an online facility like online language translation.
I couldn't find one using keywords {convert "utf-16 to utf-8" online }
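
For anyone reading later, the conversion being asked about can be done without an online service. A minimal Python sketch follows; Python is an assumption here (the thread names no tool), the function name utf16_to_utf8 is hypothetical, and the sample bytes stand in for a file saved as "Unicode" from Notepad.

```python
def utf16_to_utf8(data: bytes) -> bytes:
    """Re-encode UTF-16 bytes (with a leading BOM) as UTF-8 without a BOM."""
    # The "utf-16" codec reads the byte order mark to pick the byte order
    # and drops it from the decoded text.
    text = data.decode("utf-16")
    # Plain "utf-8" writes no byte order mark at the start.
    return text.encode("utf-8")


# Stand-in for a UTF-16 file holding the first and last Hebrew letters
# mentioned above (character numbers 1488 and 1514).
original_text = "<html>\u05d0\u05ea</html>"
utf16_bytes = original_text.encode("utf-16")

utf8_bytes = utf16_to_utf8(utf16_bytes)
```

For a real file, the bytes would be read and written in binary mode, and the server would still need to declare UTF-8 in the HTTP headers, as advised above.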
 
N Cook

N Cook said:
The Hebrew text as perceived by Google covers 'letters'
& # 1488 ... & # 1514 (no spaces)
Is there a simple way of converting them to equivalents
that will not upset Google. I'm thinking of a cut & paste
into an online facility like online language translation.
I couldn't find one using keywords {convert "utf-16 to utf-8" online }

For the archives, for anyone else not so computer-wise.
It looks as though all that is required, when saving the file to disk
(in my case from Notepad), is to select the encoding option in "Save As"
as UTF-8 rather than Unicode, which I had used before.
I will try to ftp the revised UTF-8 version of the file this week.
 
N Cook

N Cook said:
For the archives, for anyone else not so computer-wise.
It looks as though all that is required, when saving the file to disk
(in my case from Notepad), is to select the encoding option in "Save As"
as UTF-8 rather than Unicode, which I had used before.
I will try to ftp the revised UTF-8 version of the file this week.

That didn't work.

This file, basically in English, contains some UTF-16 code for Hebrew,
Russian and Thai and is cached with no problem on Google
http://pclt.cis.yale.edu/pclt/encoding/
cached on
http://64.233.183.104/search?q=cache:VqK1HChCXs0J:pclt.cis.yale.edu/pclt/encoding/+%22iso-8859-8%22+hebrew+russian+thai+yale&hl=en&start=1&ie=UTF-8

That Hebrew text does not contain character numbers 1494, 1509 and 1510,
which are in 'my' Hebrew text.
I've tried a version minus 2 of these in case they are interpreted as
control codes. I've also added a reference to charset=windows-1252.
 
Toby Inkster

N said:
That Hebrew text does not contain character numbers 1494, 1509 and 1510
which are in 'my' Hebrew text.

1. You are still not sending a charset in the HTTP header.

2. You have three bytes of junk before the <HTML> tag. Remove them. If
your text editor doesn't show you these three bytes, then use a hex editor
or get a better text editor.
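
The three bytes in question are consistent with a UTF-8 byte order mark (EF BB BF), which some editors write at the start of a file without displaying it. As a rough Python sketch of how such a prefix can be detected and removed without a hex editor (the function name split_bom is hypothetical, and UTF-32 marks are deliberately ignored here):

```python
import codecs

# Byte order marks this sketch recognises, with the encoding each implies.
# UTF-32 is not handled; its little-endian mark also begins with FF FE.
KNOWN_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]


def split_bom(data: bytes):
    """Return (encoding implied by the BOM or None, data with the BOM removed)."""
    for bom, encoding in KNOWN_BOMS:
        if data.startswith(bom):
            return encoding, data[len(bom):]
    return None, data


# Example: a UTF-8 file that starts with the three-byte mark.
encoding, body = split_bom(b"\xef\xbb\xbf<HTML>")
```

In this example encoding is "utf-8" and body is the file content starting at the <HTML> tag.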
 
N Cook

Toby Inkster said:
1. You are still not sending a charset in the HTTP header.

2. You have three bytes of junk before the <HTML> tag. Remove them. If
your text editor doesn't show you these three bytes, then use a hex editor
or get a better text editor.

Yes, thanks for that. Viewed as a .txt file in Word, there is some junk
that has crept in from somewhere.
 
N Cook

Toby Inkster said:
1. You are still not sending a charset in the HTTP header.

2. You have three bytes of junk before the <HTML> tag. Remove them. If
your text editor doesn't show you these three bytes, then use a hex editor
or get a better text editor.

The junk at the file top appears in the saved file after selecting UTF-8
rather than Unicode in Notepad's options, although it is not displayed when
viewing the file in Notepad, so it may not be the problem.

The latest version I've ftp'd has Unicode selected and charset=windows-1252
at the top.
 
N Cook

Followup
I downloaded Hex Editor XVI32 from
http://www.chmaas.handshake.de/delphi/freeware/xvi32/xvi32.htm
That allowed me to remove FF,FE / 255,254 / ÿþ / y diaeresis and p with
ascender
that clogs up the front of the file.
Apparently this is the BOM (Byte Order Mark), also known as Zero Width
Non-Breaking Space (ZWNBSP).
With the hex editor I also used "Replace All" to change the inter-character
00 bytes to nothing, and now the bulk of my file
http://www.divdev.fsnet.co.uk/dysch.htm
with luck should read OK when Google Cached comes round in a day or two.
Browser reading of the Hebrew 'unicode' is now junk but I feel I'm now
getting there.
It is hopefully just a matter of converting the Hebrew characters from hex
codes like 05D2 to decimal codes like &#1490; (ג), which Google Cached seems
to like, and also browsers, once I get the hang of cut and paste
(block & paste?) in the hex editor or some other fudge.
I'm using this Yale file as a model, which reads Hebrew in the browser and
is cached by Google correctly
http://pclt.cis.yale.edu/pclt/encoding/
and uses a bare minimum of HTML, e.g. no "he" LANG designation.
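
A note on the two steps described above, as a hedged Python sketch rather than the method actually used on the file. Deleting every 00 byte from UTF-16 text leaves the ASCII characters intact, but a Hebrew letter such as gimel (U+05D2) is stored as the two bytes D2 05, with no 00 byte, so what remains is not valid text in any declared encoding. Decoding first and then writing decimal character references avoids that. The function name to_numeric_refs is hypothetical.

```python
def to_numeric_refs(text: str) -> str:
    """Replace every non-ASCII character with a decimal reference like &#1490;."""
    # "xmlcharrefreplace" is a standard Python error handler that emits
    # &#NNNN; for each character the target encoding cannot represent.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


sample = "a\u05d2"  # Latin 'a' followed by Hebrew gimel

# What stripping 00 bytes in a hex editor does to little-endian UTF-16:
utf16_le = sample.encode("utf-16-le")      # bytes 61 00 D2 05
stripped = utf16_le.replace(b"\x00", b"")  # bytes 61 D2 05: 'a' survives, gimel does not

# Decoding first, then converting to references, keeps the file pure ASCII:
as_refs = to_numeric_refs(utf16_le.decode("utf-16-le"))
```

Because the result contains only ASCII characters, it reads the same whichever common encoding a browser or crawler assumes.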
 
N Cook

N Cook said:
Followup
I downloaded Hex Editor XVI32 from
http://www.chmaas.handshake.de/delphi/freeware/xvi32/xvi32.htm
That allowed me to remove FF,FE / 255,254 / ÿþ / y diaeresis and p with
ascender
that clogs up the front of the file.
Apparently this is the BOM (Byte Order Mark), also known as Zero Width
Non-Breaking Space (ZWNBSP).
With the hex editor I also used "Replace All" to change the inter-character
00 bytes to nothing, and now the bulk of my file
http://www.divdev.fsnet.co.uk/dysch.htm
with luck should read OK when Google Cached comes round in a day or two.
Browser reading of the Hebrew 'unicode' is now junk but I feel I'm now
getting there.
It is hopefully just a matter of converting the Hebrew characters from hex
codes like 05D2 to decimal codes like &#1490; (ג), which Google Cached seems
to like, and also browsers, once I get the hang of cut and paste
(block & paste?) in the hex editor or some other fudge.
I'm using this Yale file as a model, which reads Hebrew in the browser and
is cached by Google correctly
http://pclt.cis.yale.edu/pclt/encoding/
and uses a bare minimum of HTML, e.g. no "he" LANG designation.

For the archives: the problem of Hebrew Unicode text and the corrupted
Google cached version seems cracked.
The Google search text is now correct and the cached version should
be corrected the next time the spider comes around.
The solution is written up and will appear in the computer section of
this file in the next few days
http://www.divdev.fsnet.co.uk/repair4.htm
 
