Input Character Set Handling

Richard Cornford

VK wrote:
[MLW:]
And for sure you have checked *what* charset is indicated in
browser for your "UTF-8" ?

Are you sure you are not, once again, looking for the wrong thing in the
wrong place? (for example, at the Encoding item in the menu for IE's
post-XSLT transformation representation of the XML).

Firefox's 'View Page Info' has no trouble reporting the resource as
UTF-8, and a hex dump of the bytes actually sent shows:-

3C 21 44 4F 43 54 59 50 45 20 72 6F 6F 74 20 5B
0A 20 20 20 20 3C 21 45 4C 45 4D 45 4E 54 20 72
6F 6F 74 20 28 23 50 43 44 41 54 41 29 3E 0A 20
20 20 20 5D 3E 0A 0A 3C 72 6F 6F 74 3E C2 AE 3C
2F 72 6F 6F 74 3E 0A

- which certainly is UTF-8 encoded (the registered trade mark character
is the C2 AE sequence just before the 3C at the end of the penultimate
line).
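[Editor's note: Richard's reading of the dump is easy to verify mechanically. A minimal Python sketch (the hex string is copied from the dump above) decodes the bytes as UTF-8; `decode` would raise an error if the stream were not valid UTF-8:

```python
# Bytes from the hex dump above, decoded as UTF-8.
dump = (
    "3C 21 44 4F 43 54 59 50 45 20 72 6F 6F 74 20 5B"
    " 0A 20 20 20 20 3C 21 45 4C 45 4D 45 4E 54 20 72"
    " 6F 6F 74 20 28 23 50 43 44 41 54 41 29 3E 0A 20"
    " 20 20 20 5D 3E 0A 0A 3C 72 6F 6F 74 3E C2 AE 3C"
    " 2F 72 6F 6F 74 3E 0A"
)
data = bytes.fromhex(dump)          # fromhex ignores the spaces
text = data.decode("utf-8")         # raises UnicodeDecodeError if not UTF-8
print(text)
# The C2 AE sequence decodes to U+00AE, the registered trade mark sign:
assert "<root>\u00ae</root>" in text
```

which prints the small XML document Richard describes, with a single ® between the root tags.]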

Richard.
 
VK

And for sure you have checked *what* charset is indicated in
Are you sure you are not, once again, looking for the wrong thing in the
wrong place? (for example, at the Encoding item in the menu for IE's
post-XSLT transformation representation of the XML).
Firefox's 'View Page Info' has no trouble reporting the resource as
UTF-8, and a hex dump of the bytes actually sent shows:-
3C 21 44 4F 43 54 59 50 45 20 72 6F 6F 74 20 5B
0A 20 20 20 20 3C 21 45 4C 45 4D 45 4E 54 20 72
6F 6F 74 20 28 23 50 43 44 41 54 41 29 3E 0A 20
20 20 20 5D 3E 0A 0A 3C 72 6F 6F 74 3E C2 AE 3C
2F 72 6F 6F 74 3E 0A
- which certainly is UTF-8 encoded (the registered trade mark character
is the C2 AE sequence just before the 3C at the end of the penultimate
line).

Wow! Now I see. Sorry for being so slow, but it just takes a bit for
such sophisticated hack. So instead of say "CYRILLIC CAPITAL LETTER A"
(Unicode 0x0410) we are taking its UTF-8 encoding 208 144 and placing
two 8-bit encoded characters matching 208 and 144. Say in Cyrillic
(Windows-1251) these will be CYRILLIC CAPITAL LETTER R and CYRILLIC
SMALL LETTER DJE (Serbian). With UTF-8 properly declared parser will
take these two characters together and display as one Unicode character
CYRILLIC CAPITAL LETTER A. Just tried it: it works for modern browsers.
Wow... I will definitely add it to our knowledge base, as a sample of
what people may come up with with enough of free time available :)

Sorry again to everyone for being so slow: but it's really...
sophisticated.
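[Editor's note: the effect VK describes - the same two bytes reading as two Windows-1251 characters or as one UTF-8 character, depending on the declared charset - can be reproduced directly. A minimal Python sketch:

```python
# CYRILLIC CAPITAL LETTER A is U+0410; its UTF-8 form is the two bytes D0 90.
raw = "\u0410".encode("utf-8")
assert raw == b"\xd0\x90"            # decimal 208 144, as VK says

# Read those same two bytes as Windows-1251 and you get two characters:
as_cp1251 = raw.decode("cp1251")
print(as_cp1251)                     # CAPITAL ER (U+0420) + SMALL DJE (U+0452)
assert as_cp1251 == "\u0420\u0452"

# Read them under the declared charset, UTF-8, and you get one character back:
assert raw.decode("utf-8") == "\u0410"
```

There is no trick involved: the byte sequence never changes, only the table used to interpret it.]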
 
VK

VK said:
Wow! Now I see. Sorry for being so slow, but it just takes a bit for
such sophisticated hack. So instead of say "CYRILLIC CAPITAL LETTER A"
(Unicode 0x0410) we are taking its UTF-8 encoding 208 144 and placing
two 8-bit encoded characters matching 208 and 144. Say in Cyrillic
(Windows-1251) these will be CYRILLIC CAPITAL LETTER R and CYRILLIC
SMALL LETTER DJE (Serbian). With UTF-8 properly declared parser will
take these two characters together and display as one Unicode character
CYRILLIC CAPITAL LETTER A. Just tried it: it works for modern browsers.

Moreover: even though the physical file contains the Windows-1251 CYRILLIC
CAPITAL LETTER R + CYRILLIC SMALL LETTER DJE combo, in View > Source
and by DOM methods it's reported as Unicode CYRILLIC CAPITAL LETTER A.
It's all very sketchy but I see some client-side protection algorithms
much more elegant and effective than traditional boring obfuscators
(see for instance another recent post here by Senderos). If I come up
with something useful: this thread and your names will be mentioned.
 
Michael Winter

No, I don't believe so, though I can't say I'm certain of just what that
option is supposed to do - I've never looked into it.

Alan Flavell's discussion of form submission and internationalisation[1]
notes that the encoding scheme of the document affects how form data is
transferred. This document mentions that that IE option appears only to
affect anchors, and the path component at that, not form submission or
the query component.
Partially. The first URL leads to illegal HTTP transmission (no charset
provided neither by page nor by server).

Is being wrong a hobby for you or something?

If no encoding scheme is specified, the HTTP/1.1 specification (RFC
2616) states that "media subtypes of the 'text' type are defined to have
a default charset value of 'ISO-8859-1' when received via HTTP" (3.7.1
Canonicalization and Text Defaults).

It isn't unusual for this to be ignored in practice, what with
auto-detection and user preferences, but that doesn't make omitting the
charset parameter "illegal", only ill-advised.
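[Editor's note: the fallback rule Michael quotes is mechanical enough to sketch. The `charset_of` helper below is illustrative, not from any of the posts; it parses a Content-Type header value and applies the RFC 2616 default for text/* media:

```python
from email.message import Message

def charset_of(content_type: str) -> str:
    """Return the charset parameter of a Content-Type header value,
    falling back to ISO-8859-1 as RFC 2616 prescribes for text/* media."""
    m = Message()
    m["Content-Type"] = content_type
    return m.get_param("charset") or "ISO-8859-1"

print(charset_of("text/html; charset=utf-8"))  # utf-8
print(charset_of("text/html"))                 # ISO-8859-1 (the HTTP default)
```

In other words, a missing charset parameter is not an invalid header; it simply means the recipient is entitled to assume ISO-8859-1.]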

[snip]

Mike

[1] FORM submission and i18n, Alan J. Flavell
<http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html>
 
VK

Michael said:
If no encoding scheme is specified, the HTTP/1.1 specification (RFC
2616) states that "media subtypes of the 'text' type are defined to have
a default charset value of 'ISO-8859-1' when received via HTTP" (3.7.1
Canonicalization and Text Defaults).

3.7.1
....
Data in character sets other than "ISO-8859-1" or
its subsets MUST be labeled with an appropriate charset value.
....

But after all, an RFC is an RFC: "Request For Comments" - nothing less but
nothing more; so take it seriously, but with caution.
If anyone wants trouble with documents not shown or broken: no problem,
send your documents with no charset indication of any kind. When your
customers come to complain, just quote them the RFCs - maybe it will
help save your business (I doubt it very much, but feel free to try
:).
From my side it is even good that the W3C considers the HTTP RFCs frozen,
with nothing to correct in there. It means I'll continue to get my money for
helping freshly graduated admins fix their boo-boos and for explaining
to them that the Internet as it is lives in the wires, not in the books.

That is *not* meant to offend anyone participating in this thread or simply
reading it. I just refuse to take the role of some stubborn
bastard forcing charset usage while some RFC allows skipping it.
As I said: anyone is free to do whatever she wants. Just more money for
me anyway.
 
Michael Winter

VK wrote:

[snip]
3.7.1
...
Data in character sets other than "ISO-8859-1" or
its subsets MUST be labeled with an appropriate charset value.
...

So? The data did use the ISO-8859-1 encoding form, so labelling it as
such is not technically required.
But after all RFC is RFC: "Request For Comments" - nothing less but
nothing more; thus take it serious but with caution.

Each distinct version of an Internet standards-related
specification is published as part of the "Request for
Comments" (RFC) document series. This archival series is the
official publication channel for Internet standards documents
and other publications of the IESG, IAB, and Internet
community.
-- 2.1 Requests for Comments (RFCs),
The Internet Standards Process (Revision 3), RFC 2026

Where do you think Internet protocols are specified?
Anyone wants troubles with documents not shown and broken: no
problems, send your documents with no charset indications of any
kind. When your customers will come to complain, just quote them
RFC's - maybe it will help to save your business (I doubt very much,
but feel free to try :).

You seem to have problems reading, so let me paraphrase my previous
post: it is not wrong to omit a charset parameter if the encoding form
is ISO-8859-1, but it is not recommended.

[snip]

Mike
 
Michael Winter

VK wrote:

[snip]

[R. Cornford:]
Wow! Now I see. Sorry for being so slow, but it just takes a bit for
such sophisticated hack.

A hack? No, simply how UTF-8 works. I have no idea what it was you
posted earlier, but it was not a UTF-8 encoded document (at least not in
the spirit it was meant to be).

[snip]

Mike
 
VK

Michael said:
Where do you think Internet protocols are specified?

Mostly and mainly in the same place where the [window] object is :) -
it goes by tradition and by the "templatic" implementation.

Anyway, I did some research (damn time zone change, cannot get to
sleep). Sorry I cannot post URLs, as I used Perl scripts on one of our
clients' servers - they would not like it. Feel free to re-evaluate it
yourselves; watch the shebang path as usual.

[ Test 1 ]
#!/usr/bin/perl
print "Content-Type: text/html; charset=iso-8859-1\n\n";
print <<EndOfBlock;
<html>
<head>
<title>Test 1</title>
</head>
<body>
<form method="GET" action="">
<fieldset>
<input type="text" name="test">
<input type="submit">
</fieldset>
</form>
</body>
</html>
EndOfBlock
exit(0);

[ Test 2 ]
#!/usr/bin/perl
print "Content-Type: text/html; charset=iso-8859-1\n\n";
print <<EndOfBlock;
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Test 2</title>
</head>
<body>
<form method="GET" action="">
<fieldset>
<input type="text" name="test">
<input type="submit">
</fieldset>
</form>
</body>
</html>
EndOfBlock
exit(0);

[ Test 3 ]
#!/usr/bin/perl
print "Content-Type: text/html\n\n";
print <<EndOfBlock;
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Test 3</title>
</head>
<body>
<form method="GET" action="">
<fieldset>
<input type="text" name="test">
<input type="submit">
</fieldset>
</form>
</body>
</html>
EndOfBlock
exit(0);

[Test 1] sets the iso-8859-1 charset in the server header.

[Test 2] sets the iso-8859-1 charset in the server header but UTF-8 in the META
tag. The server header must take priority over meta if the UA is not
broken (thus iso-8859-1 remains).

[Test 3] sets UTF-8 in meta only.

The variant with no charset set at all is not taken into consideration.
Feel free to break your browser yourself :)

In each generated form I typed the same Russian word, which sounds like
"probah" and which means, as I understand, "a probe". See the first match
in the search results:
<http://www.google.com/search?hl=en&q=проба&btnG=Google+Search>

//////////////
[Test 1] (iso-8859-1 set by server header)
Reported charset by all UAs: iso-8859-1

Submission results:

IE 6.0
test=%EF%F0%EE%E1%E0

Firefox 1.5
test=%26%231087%3B%26%231088%3B%26%231086%3B%26%231073%3B%26%231072%3B

Opera 9.02
test=%26%231087%3B%26%231088%3B%26%231086%3B%26%231073%3B%26%231072%3B


//////////////
Test 2 (iso-8859-1 set by server header, overrides meta tag)


Reported charset by all UAs: iso-8859-1

Submission results (watch the change for IE):

IE 6.0
test=%26%231087%3B%26%231088%3B%26%231086%3B%26%231073%3B%26%231072%3B

Firefox 1.5
test=%26%231087%3B%26%231088%3B%26%231086%3B%26%231073%3B%26%231072%3B

Opera 9.02
test=%26%231087%3B%26%231088%3B%26%231086%3B%26%231073%3B%26%231072%3B


//////////////
Test 3 (UTF-8 set by meta tag)

Reported charset by all UAs: UTF-8

Submission results:

IE 6.0
test=%D0%BF%D1%80%D0%BE%D0%B1%D0%B0

Firefox 1.5
test=%D0%BF%D1%80%D0%BE%D0%B1%D0%B0

Opera 9.02
test=%D0%BF%D1%80%D0%BE%D0%B1%D0%B0
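[Editor's note: all three result formats carry the same Russian word, just serialized differently. A Python sketch decoding VK's captured query strings (my decoding, not part of the original tests):

```python
from urllib.parse import unquote, unquote_to_bytes
import html

# Test 1, IE 6: raw Windows-1251 bytes, percent-encoded.
assert unquote_to_bytes("%EF%F0%EE%E1%E0").decode("cp1251") == "проба"

# Tests 1-2, Firefox/Opera (and IE in Test 2): the word cannot be expressed
# in ISO-8859-1, so the browsers fell back to numeric character references.
ncr = unquote("%26%231087%3B%26%231088%3B%26%231086%3B%26%231073%3B%26%231072%3B")
print(ncr)                            # &#1087;&#1088;&#1086;&#1073;&#1072;
assert html.unescape(ncr) == "проба"

# Test 3, all browsers: proper UTF-8 bytes, percent-encoded.
assert unquote_to_bytes("%D0%BF%D1%80%D0%BE%D0%B1%D0%B0").decode("utf-8") == "проба"
```

The NCR fallback is lossy in the sense that the server cannot distinguish a user who typed "проба" from one who literally typed "&#1087;…", which is one reason the declared-UTF-8 variant is the sane configuration.]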
 
Richard Cornford

VK said:
Wow! Now I see. Sorry for being so slow, but it just takes a
bit for such sophisticated hack. So instead of say "CYRILLIC
CAPITAL LETTER A" (Unicode 0x0410) we are taking its UTF-8
encoding 208 144 and placing two 8-bit encoded characters
matching 208 and 144. Say in Cyrillic (Windows-1251) these
will be CYRILLIC CAPITAL LETTER R and CYRILLIC SMALL LETTER
DJE (Serbian). With UTF-8 properly declared parser will
take these two characters together and display as one Unicode
character CYRILLIC CAPITAL LETTER A. Just tried it: it works
for modern browsers. Wow... I will definitely add it to our
knowledge base, as a sample of what people may come up with
with enough of free time available :)

ROTLMLOL. It all just goes straight over your head, doesn't it?
Sorry again to everyone for being so slow: but it's really...
sophisticated.

Sophisticated? I suppose that depends on how rudimentary your intellect
is to start with.

Richard.
 
Bart Van der Donck

VK said:
Partially. The first URL leads to illegal HTTP transmission (no charset
provided neither by page nor by server). This way it activates error
correction mechanics in browser. And UA's error correction is all
separate issue of conversation.

Okay, let's disable such correction mechanisms then; say the following
example in ISO-8859-1. It shows the same result:
http://www.dotinternet.be/temp/exampleISO.htm

I think it's like Michael Winter said (RFC 2616): "Media subtypes of
the 'text' type are defined to have a default charset value of
'ISO-8859-1' when received via HTTP". This specification seems to be
well obeyed by the browsers that I tested.
Say IE 6 SP1 / Win 98SE studies the input stream and by some formal
signs decides that it's Cyrillic.

If that happened, it would still get encoded to %E9 in a query
string. It's only the browser that decides how to display the
character, be it HTML entity И (Cyrillic) or é (Latin-1).
When you change the character table, %E9 might point to a Latin,
Cyrillic or Swahili sign, depending on whatever table is used. That
has no effect on query string encoding; those are two separate things.
These "formal signs" are very fragile and the source is wide open for
the "Korean issue" and "Characters jam" effects. They don't happen here
just because of the simplicity of the page content.

Yes, true.
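[Editor's note: Bart's distinction - the byte on the wire stays the same, only its display depends on the character table - can be illustrated with a short sketch:

```python
from urllib.parse import quote

# 'é' in ISO-8859-1 is the single byte E9; that is what goes on the wire:
assert quote("é".encode("latin-1")) == "%E9"

# How the *same* byte is displayed depends entirely on the table in use:
b = b"\xe9"
print(b.decode("latin-1"))   # é  (Latin-1)
print(b.decode("cp1251"))    # й  (Cyrillic)
```

Query-string encoding and character display are thus two independent steps, which is the point Bart concedes only partially applies to VK's argument.]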
 
Bart Van der Donck

VK said:
You come to say to any Java team guy "Unicode" (unlike
"Candyman" one time will suffice :) and then run away quickly
before he started beating you.

What a luxury. In the Perl world everybody starts fighting with
everybody.
 
Michael Winter

VK said:
Michael said:
Where do you think Internet protocols are specified?

Mostly and mainly in the same place where the [window] object is: :)
it goes per the traditions and per the "templatic" implementation.

You want to compare the object model of competing products to
interworking network protocols?
Any way, I did some research ...

Why? The document I cited from Alan Flavell had already drawn the
necessary conclusions. Did you read it?

[snip]

Mike
 
VK

Any way, I did some research ...
Why? The document I cited from Alan Flavell had already drawn the
necessary conclusions. Did you read it?

Alan Flavell (AFAICT) has no idea about either the Korean Issue, the
Character Jam, or the Phenomenon of the First Non-ASCII Character as
such. So he is not an authority to me until knowledge of these issues
is demonstrated somewhere else in his books.
 
Michael Winter

VK said:
Alan Flavell has no idea (AFAICT) neither about the Korean Issue,

The only time you've referred to a "Korean Issue" in the past, it was
caused by a failure in MSIE to detect an encoding scheme correctly,
producing rather odd results when it guessed UTF-7. The solution to that
is obvious, and Alan addresses it indirectly by recommending that the
user agent should never need to guess. That said, he does touch on it:

In that analysis, I've disregarded utf-7 format (which would be
wrongly identified as us-ascii), as being inappropriate for use
in an HTTP context. One might mention, however, that when MSIE
is set to auto-detect character encodings, it has been known to
mis-identify some us-ascii pages, claiming them to be in utf-7.
-- Heuristic recognition of utf-8?,
FORM submission and i18n, Alan J. Flavell
nor about the Character Jam nor about the Phenomenon of the first
non-ASCII character as such.

If you want a sensible discussion of the issues, actually describe them
properly.

[snip]

Mike
 
VK

Michael said:
If you want a sensible discussion of the issues, actually describe them
properly.

The issue is that UAs act unstably without a charset indicated somehow.
That is especially true for IE6, which also happens to be the most
widely used UA at this time. IE6 is a very old, I would say ancient,
browser (by the Web time scale), with Unicode and UTF-__ encoding
support bolted on somehow, anyhow.

This is only distantly related to JavaScript programming though. Maybe
I'll make a demo page showing what an innocent page can do to IE6 if
the charset is not provided.
 
Paul Gorodyansky

Hello!

VK said:
...

Wow! Now I see. Sorry for being so slow, but it just takes a bit for
such sophisticated hack.
So instead of say "CYRILLIC CAPITAL LETTER A"
(Unicode 0x0410) we are taking its UTF-8 encoding 208 144 and placing
two 8-bit encoded characters matching 208 and 144. Say in Cyrillic
(Windows-1251) these will be CYRILLIC CAPITAL LETTER R and CYRILLIC
SMALL LETTER DJE (Serbian). With UTF-8 properly declared parser will
take these two characters together and display as one Unicode character
CYRILLIC CAPITAL LETTER A. Just tried it: it works for modern browsers.
Wow... I will definitely add it to our knowledge base, as a sample of
what people may come up with with enough of free time available :)

Sorry again to everyone for being so slow: but it's really...
sophisticated.

Sophisticated? Hack? Free time? It's common _practice_, not a
"free time strange example" - please read below.

You wrote that you deal with, say, Japanese and Korean 'legacy' encodings,
so you do know what Shift_JIS is, right? Then why do you write such nonsense
as "take these two characters together and display as one"?

"Two characters"??? UTF-8 is same multi-byte encoding as Shift_JIS -
do you write about ONE Japanese letter which is encoded by 2 bytes
in Shift_JIS in the same manner, that is,
"... one byte matches...character, 2nd byte matches... character then
these 2 characters together ... one Japanese letter"?
I don't think so :)

There are no "characters" there, neither for Russian in UTF-8 nor for Japanese
in Shift_JIS - just _2 bytes_ that represent one Cyrillic letter in multi-byte
encoding "UTF-8" -
same way as another 2 bytes represent one Japanese letter in multi-byte
encoding "Shift_Jis".

As Michael wrote above, you somehow did not thing about the serialization,
about files on the disk.

I don't know why you did not know before about say .HTML files
containing pure UTF-8 text (i.e. real UTF-8 characters as mulit-byte items)
to produce a multilingual page - such I18n examples and well known pages
exist on the Web since I became and I18n engineer back in 1997 :)

For example, for my Cyrillic (Russian) instructional site I prepared
a "Multilingual HTML" section many, many years ago -
it included preparation of .htm _files_ containing UTF-8 text -

and it is NOT a "free time hack, 'just for amusement' example" -

no one in their right mind would have _large_ text represented as in
_your examples_ of UTF-8 - <item>%C2%AE</item> - how do you think
a Web site owner would _maintain/edit/correct_ such a page if - instead of
_readable_ text (say Russian+German letters in UTF-8 encoding) - it contains
just things like >%C2%AE?

In reality most multilingual Web pages serialized as .htm files contain
real UTF-8 text, so it's not a "hack" but a practical thing used everywhere -
and in accordance with UTF-8's definition as a "multi-byte encoding".


Strange (given your statements of I18n knowledge) that we here have to
explain to you UTF-8 facts written for _beginners_ at least 6 years ago on
my site in the "Multilingual HTML" section (A. Flavell's site is listed
there as a source for non-beginners): http://RusWin.net/mix.htm

It has UTF-8 examples, too: http://RusWin.net/utf8euro.htm
and http://RusWin.net/utf8-jap.htm

Same can be said about XML. In both XML and HTML serialization
(files on disk) it is a VERY _common_ practice to have real UTF-8
text in .xml and .html files.
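[Editor's note: Paul's parallel between the two multi-byte encodings can be shown directly. A Python sketch (the particular hiragana letter is my example, not from the thread):

```python
# One Japanese hiragana letter, U+3042, is two bytes in Shift_JIS...
assert "\u3042".encode("shift_jis") == b"\x82\xa0"

# ...exactly as one Cyrillic letter, U+0410, is two bytes in UTF-8:
assert "\u0410".encode("utf-8") == b"\xd0\x90"

# In neither case are the individual bytes "characters" of the encoding;
# each two-byte sequence stands for a single letter.
assert len("\u3042".encode("shift_jis")) == 2 and len("\u3042") == 1
```

Describing either byte of such a pair as a character of some single-byte encoding misses the point in exactly the same way for both Shift_JIS and UTF-8.]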
 
VK

Paul said:
You wrote that you deal with say Japanese and Korean 'legacy' encodings
so say you do know what Shift_Jis is, right? Then why you write such noncense:

I would hardly call Shift_JIS a "legacy" one, as it remains the only one
used in Japan itself :) (thanks to Unicode, Inc. having screwed the entire
nation)

But no, you did not get my post right: I was talking about a standard
Western (Latin 1) page being interpreted as written in Hangul (the Korean
alphabet) or 16-bit Unicode because of a missing charset
indication. I am now really thinking of making a demo set and posting it
at ciwah, as it seems terra incognita for too many people.
(We've called the relevant problem the "Korean issue" - it is a slang term
because 1) an ISO Latin page gets interpreted as UTF-7 with Hangul
(Korean) characters in it, and 2) we got a number of requests
on the matter at the moment of the first big USA - North Korea crisis.
No national offence, I hope.)
"Two characters"??? UTF-8 is same multi-byte encoding as Shift_JIS -
do you write about ONE Japanese letter which is encoded by 2 bytes
in Shift_JIS in the same manner, that is,
"... one byte matches...character, 2nd byte matches... character then
these 2 characters together ... one Japanese letter"?
I don't think so :)

What I missed in the discussion of "storing UTF-8 directly and
delivering it verbatim" was the use of the terms "byte" and "byte sequence"
in application to a *text document*. Naturally everything consists of
bytes, including any .html or .txt file. At the same time there is a
core distinction between a text file and a binary file. A text
file by definition consists of *characters*, not of bytes. And from the
point of view of any "8-bit observer" such a document is nothing but a
set of 8-bit characters. An extra instruction is required
to interpret it in some other way.
After I transformed the explanations from byte terms to character
terms, I understood what Mr. Winter was trying to tell me. It required an
extra abstraction effort on my side, as it's a bit like describing a
painting in terms of wavelengths. On the other side, it helped greatly
to understand a big category of help requests (about broken pages)
we've been getting. Before, I thought the victims were just...
strange people. Now I understand that they are simply looking at it from
"another dimension", and from that dimension what they are doing is
totally correct and makes perfect sense. Unfortunately for them, the
Internet often operates in a dimension different from theirs.
 
VK

No, not at all. Japanese text - using the multi-byte encoding Shift_JIS -
is - as you wrote yourself - a collection of *characters* - Japanese
ones

Only as long as it is declared/auto-recognized as Shift_JIS. Otherwise it's
an 8-bit charset. I invite you once again to read the origin of this
branch of the thread, not just my latest posts.
Why? Multi-byte UTF-8 text with UTF-8 characters is the same concept
as multi-byte Japanese Shift_JIS text, so it's strange that the concept
looks new to you -
I'd understand if you were a person who never dealt with Japanese,
Chinese or Korean...

I dealt a lot with them. Before we discuss the issue further, two things
should be done:

1) the discussion should move to ciwah or even ciwam, as it is too far off
JavaScript IMHO (though somehow connected, so maybe it can be left here?).

2) I want to show the cases I was talking about in this thread: I hate
*abstract* discussions of the kind:
- I can do it because it's written here that I can do it.
- You can never do it because the sh** will happen.
 
Paul Gorodyansky

Hello!

1) the discussion moved to ciwah or even ciwam as it is too far of
JavaScript IMHO (though somehow connected so maybe can be left here ?).

Right. I just posted because I was surprised that the concept of real UTF-8
characters (vs URL-encoding or entities) was so new to you, when it's exactly
the same as, say, Japanese Shift_JIS - both are multi-byte encodings, and for
_both_ it does NOT make any sense to describe a multi-byte *character* as
you did. I.e., I replied to this (which is wrong for a multi-byte encoding,
be it Shift_JIS or UTF-8, because there are *no* 'two 8-bit characters';
it's one multi-byte character):
... we are taking its UTF-8 encoding 208 144 and placing
two 8-bit encoded characters matching 208 and 144. Say in Cyrillic
(Windows-1251) these will be CYRILLIC CAPITAL LETTER R and CYRILLIC
SMALL LETTER DJE (Serbian).

It (describing parts of a multi-byte character as separate characters of _another_
encoding) would be wrong as a way to describe a two-byte Japanese character -
and it's just as wrong to do so for a UTF-8 character.
 
