Cyrillic on the web

BusyGuy

Hello people. This is my first visit here (I will become a regular), so
let me introduce myself: I'm a professional webmaster. Mac G5, Aluminium
and Titanium Powerbooks, Eizo monitors, Epson 4800 printer, etc.

OS X 10.4.6, BBEdit and Dreamweaver. (He he, I have a silly sense of
humor. I made a typo and it ended up looking like BB EatDirt.)

I'm not a computer or net newbie but I am far from being an HTML
expert. I've hit a snag and seek some guidance.

I'm trying to author a page in Cyrillic. Everything I've tried has
failed and I have not, so far, found any readable/understandable online
education.

I can, for example, do a page in BBEdit and type the Cyrillic alphabet
in it. The result remains correct on my screen but wrong online. You
can see it at <http://eastwest-commerce.net/test>

I would be grateful for any help. Thanks.
 
Jukka K. Korpela

BusyGuy said:
I can, for example, do a page in BBEdit and type the Cyrillic alphabet
in it. The result remains correct on my screen but wrong online. You
can see it at http://eastwest-commerce.net/test

First, you have a missing quotation mark that prevents at least some
browsers from recognizing the <meta> tag where you try to declare the
character encoding. It should be

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=windows-1251">

Second, the data on the page isn't in windows-1251 encoding, or in any other
encoding commonly used for texts using Cyrillic letters. It might be
MacCyrillic, but such an encoding should not be used on the Web. I'm pretty
sure that the cause of the problem is one or both of the following:
1) Your software saves the data in some Mac-specific encoding.
2) The software that you use for data transfer (e.g., a separate FTP
program, or the upload feature of your authoring software) performs an
incorrect character encoding conversion (e.g., converting from a Mac
encoding to windows-1251 without realizing that the data already was in
windows-1251, making the result all wrong).

You need to save the data in some widely known encoding (such as
windows-1251), or convert it to such an encoding, and transfer it to the
server in a manner that preserves 8-bit bytes as such; in FTP, use binary
mode to prevent any unwanted character set conversion attempts at that
phase.
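
As a rough illustration of "save in a known encoding, then move the bytes untouched", here is a minimal Python sketch. The file name test.html and the sample greeting are made up for the example, and writing to the system temp directory is only an assumption to keep it self-contained; it is not how BBEdit or any FTP client works internally.

```python
import tempfile
from pathlib import Path

# Hypothetical sample page. The Cyrillic text is written with escapes so
# that this source file itself stays pure ASCII.
greeting = "\u041f\u0440\u0438\u0432\u0435\u0442"  # "Privet"
page = (
    '<META HTTP-EQUIV="Content-Type" '
    'CONTENT="text/html; charset=windows-1251">\n'
    f"<title>{greeting}</title>\n"
)

# Encode explicitly, then write the bytes as they are: no newline
# translation and no character set conversion on the way to disk.
data = page.encode("cp1251")
out = Path(tempfile.gettempdir()) / "test.html"  # hypothetical file name
out.write_bytes(data)

# Reading the raw bytes back and decoding them with the declared
# encoding should reproduce the original text exactly.
round_trip = out.read_bytes().decode("cp1251")
```

Binary mode in FTP plays the same role as write_bytes here: the bytes that leave your machine are the bytes that arrive on the server.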

There are some garbage characters at the start of the document, before the
<html> tag. Make sure they get removed.

There's quite a lot to be fixed on the page in other respects, but I won't
go into them (and it would be much easier to rewrite the page than to fix
it), because the above-mentioned principles should handle the problem with
Cyrillic letters.

Note: Windows-1251 does not contain all the Cyrillic letters that are used
in different languages, but it is sufficient for Russian, for example.
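
A quick way to check whether a given text fits into windows-1251 is simply to try encoding it. The following is a small sketch; the helper name fits_windows_1251 and the sample letters are chosen only for illustration.

```python
def fits_windows_1251(text: str) -> bool:
    """Return True if every character of text exists in windows-1251."""
    try:
        text.encode("cp1251")
        return True
    except UnicodeEncodeError:
        return False


russian_shcha = "\u0449"   # Cyrillic small letter shcha, used in Russian
ukrainian_yi = "\u0457"    # Cyrillic small letter yi, used in Ukrainian
kazakh_schwa = "\u04d9"    # Cyrillic small letter schwa, used in Kazakh

results = {
    "russian": fits_windows_1251(russian_shcha),
    "ukrainian": fits_windows_1251(ukrainian_yi),
    "kazakh": fits_windows_1251(kazakh_schwa),
}
```

Here the Russian and Ukrainian letters encode without trouble, while the Kazakh schwa does not, so a page using it would need a different encoding such as UTF-8.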
 
Alan J. Flavell

Jukka K. Korpela said:
There are some garbage characters at the start of the document,
before the <html> tag.

Yes, there seem to be three bytes there: d4 aa f8. I can't help
worrying that they started life as a utf-8 BOM (ef bb bf), and have
been mapped through whatever misguided encoding conversion has
scrambled the rest of the content.

Jukka K. Korpela said:
Make sure they get removed.

They're the key to this puzzle! (Don't throw away the key ;-)

Oh yes, A.Prilop is going to love this!! That's exactly what happens
when one passes ef bb bf through Mr. Pirard's old Mac -> iso-8859-1
conversion table from 1992.

The only good thing one can say about that translation table nowadays
is that it's reversible, so it *would* be possible to translate this
rubbish back into its original form. Whereupon it just might turn out
to be utf-8-encoded...

Hmmm yes, if I take the first 6 bytes of the document title: ad fc 8b
c4 ad bd, and run them back through Pirard's table, I get d0 9f d1 80
d0 b8, which is the utf-8 representation of the three Cyrillic
letters for "Pri" (I'm not going to try to put Cyrillic letters into
this posting!). Going on a bit further, I make it out to be
"Privetst...", does that make some kind of sense?
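
The last step, reading d0 9f d1 80 d0 b8 as utf-8, can be checked with a few lines of Python. This sketch covers only that step; it does not reproduce the Mac -> iso-8859-1 table itself, and the variable names are just for illustration.

```python
import unicodedata

# The six bytes obtained after mapping the title back through the table.
raw = bytes.fromhex("d0 9f d1 80 d0 b8")

# Interpret them as utf-8: each Cyrillic letter takes two bytes.
text = raw.decode("utf-8")

# Report code points and Unicode names rather than printing the letters.
code_points = [f"U+{ord(ch):04X}" for ch in text]
names = [unicodedata.name(ch) for ch in text]

for cp, name in zip(code_points, names):
    print(cp, name)
```

The three code points are U+041F, U+0440 and U+0438, i.e. capital Pe, small Er and small I, which transliterates as "Pri".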

However, I think I'd prefer to start again from fresh materials!!

Evidently one should make a note of this characteristic "d4 aa f8"
signature, in case one comes across it again.

Aha, indeed, Google has seen it:
http://forum.altap.cz/viewtopic.php?t=74&sid=e9d765b713aba13d6b006ffb174467aa

(Oh well, it beats doing the crossword, I suppose.)
 
Jukka K. Korpela

Alan J. Flavell said:
Yes, there seem to be three bytes there: d4 aa f8. I can't help
worrying that they started life as a utf-8 BOM (ef bb bf), and have
been mapped through whatever misguided encoding conversion has
scrambled the rest of the content.

Well spotted.

Alan J. Flavell said:
Oh yes, A.Prilop is going to love this!! That's exactly what happens
when one passes ef bb bf through Mr. Pirard's old Mac -> iso-8859-1
conversion table from 1992.

Sounds quite plausible under the circumstances.

Alan J. Flavell said:
Hmmm yes, if I take the first 6 bytes of the document title: ad fc 8b
c4 ad bd, and run them back through Pirard's table, I get d0 9f d1 80
d0 b8, which is the utf-8 representation of the three Cyrillic
letters for "Pri" (I'm not going to try to put Cyrillic letters into
this posting!). Going on a bit further, I make it out to be
"Privetst...", does that make some kind of sense?

Surely, it's the start of a Russian word that means 'greeting'. (Of course,
using such words in a document title is a waste of precious real estate, but I
digress.)

Alan J. Flavell said:
However, I think I'd prefer to start again from fresh materials!!

Me too. And using UTF-8 for Russian isn't particularly efficient. Using e.g.
windows-1251, you have one octet (byte) for each character. Using UTF-8, you
have one octet for each character in the ASCII range (including characters
used in HTML markup) but two octets for each Cyrillic letter. UTF-8 would be
fine if the document contained, say, a mixture of Russian and French.
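
The size difference is easy to measure. A small Python sketch, with a made-up six-letter sample word and a made-up scrap of markup:

```python
# Hypothetical sample: a six-letter Russian word, written with escapes,
# and a bit of HTML markup that is pure ASCII.
word = "\u041f\u0440\u0438\u0432\u0435\u0442"
markup = "<p></p>"

# Number of octets needed for each piece in each encoding.
word_sizes = {enc: len(word.encode(enc)) for enc in ("cp1251", "utf-8")}
markup_sizes = {enc: len(markup.encode(enc)) for enc in ("cp1251", "utf-8")}

print("word:", word_sizes)
print("markup:", markup_sizes)
```

The word takes 6 octets in windows-1251 and 12 in UTF-8, while the markup takes 7 octets either way, so the overhead for a real page depends on the ratio of text to markup.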
 
BusyGuy

Jukka and Alan, thank you both very much for your kind assistance.

It's still early morning here. I'll get into this immediately after
breakfast and report back in case anyone is interested in a successful
outcome.

However, even before analysis and work, I can say three things:

1 Missing quote mark: how damned silly of me. And uncharacteristic.
Please don't take it as an indication that I'm careless or stupid.

2 Encoding and uploading. I've been using BBEdit to compose and Fetch
to upload. BBEdit, in case you don't know it, is very cool. It will,
for example, pull me up on save if there is a glyph that does not fit
its view of the universe. I think I can use that to advantage later
today.

Fetch is set to upload in "automatic" format. When I uploaded the page
you've seen and then brought a copy back to earth, its Cyrillic content
was corrupted, so that is another interesting area to examine.

3 Garbage characters at the start are a known mystery. They even get
added sometimes to pages that do not contain any Cyrillic. I think they
are put there by BBEdit when it chokes on a glyph. It stops me when I
try to save and announces...well, look at the attachment.

More news as it happens,

grh
 
Alan J. Flavell

Jukka K. Korpela said:
Surely, it's the start of a Russian word that means 'greeting'. (Of
course, using such words in a document title is waste of precious
real estate, but I digress.)

But it might be part of a trading name; I wouldn't dismiss it without
further study...

Jukka K. Korpela said:
Me too. And using UTF-8 for Russian isn't particularly efficient.

Right. Also, I think it's probably unwise to use a utf-8 BOM:
although it's technically legal, I have a hunch that (even when
correctly used) it'll disturb some browser/versions that are still in
use.
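
For anyone who wants to check a file for a leading utf-8 BOM and drop it before uploading, a minimal Python sketch follows. The helper name strip_utf8_bom and the sample bytes are made up for illustration.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading utf-8 BOM (ef bb bf), if it has one."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Hypothetical document bytes: a BOM followed by the start of a page.
with_bom = bytes.fromhex("ef bb bf") + b"<html>"
without_bom = b"<html>"

cleaned_a = strip_utf8_bom(with_bom)
cleaned_b = strip_utf8_bom(without_bom)
```

Both calls give back the same bytes starting at <html>; data without a BOM passes through unchanged.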

[snipped uncontentious remarks...]
 
BusyGuy

Jukka and Alan...

Problem solved and thanks again for your help.

I found a preference in BBEdit that I had set wrongly. Indeed, it was
not saving in windows-1251.

I also set Fetch to upload as raw data but I don't know if that
contributed to the solution.

Jukka, you say that "There's quite a lot [more] to be fixed". My
problem is that I am not an HTML expert. My forte is Photoshop.

I generally author in Dreamweaver then use BBEdit to cull out the code
that my limited experience recognizes as extraneous. You are clearly
seeing more than I. If all it does is bloat the code and maybe slow
things down a tad, I can live with that as I am unfortunately too busy
these days to undertake the necessary (and desired) HTML education. But
you can be sure I will visit this group often as a starting point for
my learning curve.
 
