Translation: "Please don't post replies to CLPM unless you are certain
the information you give is correct".
Oh no, I wouldn't go *that* far; but trying to answer a question about
Hebrew, when you say you're not sure whether iso-8859-1 has Hebrew
characters in it, *did* seem to be rather adventurous, in the
circumstances. IMHO and RTL and YMMV.
Further reading revealed ...
http://www.unicode.org/faq/unicode_web.html#11
says "If you have a single CGI and a single HTML form, then the
browsers will return the data in the encoding of the original form".
Kind-of odd wording they are using, but yes, that's right: by default,
browsers submit their form input using the same character encoding as
the HTML page which contains the form. And this is basically the only
option which works widely enough to be used (putting accept-charset on
the <form...> element is technically valid, but not widely supported).
However, Netscape 4.* versions get this massively wrong when the HTML
page is in utf-8. (Not that I really use NN4.* any more, but I keep
a copy for test purposes).
There's more about this topic (for anyone who's interested ;-) at
http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html
For example, google has
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
If Google thinks the browser is capable of it, indeed it does.
(Try it from NN4.* and you'll find a different result).
Some further playing with a calculator shows that the %D7 which the
OP refers to is simply a hex representation of the first byte of the
UTF-8 encoding of the Hebrew DALET character.
Confirmed: based on the input mentioned in the original posting -
דנה
<!-- 1491 5d3 d793 -->
<!-- 1504 5e0 d7a0 -->
<!-- 1492 5d4 d794 -->
Those are decimal and hexadecimal code points, followed by the utf-8
representation. DALET NUN HE (reading off the unicode page U+05xx,
since I can't actually read Hebrew, sorry).
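For anyone who wants to reproduce those three comment lines rather than trust my calculator, here is a minimal sketch. It assumes the standard Encode module is available; the helper name describe_codepoint is just something I made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Return "decimal hex utf8-hex" for one code point.
# describe_codepoint is a hypothetical helper name, not a library function.
sub describe_codepoint {
    my ($cp) = @_;
    my $octets = encode("UTF-8", chr($cp));   # explicit utf-8 octets
    return sprintf("%d %x %s", $cp, $cp, unpack("H*", $octets));
}

# DALET, NUN, HE
print describe_codepoint($_), "\n" for (1491, 1504, 1492);
```

Each printed line should have the same shape as the comments above: decimal, hexadecimal, then the utf-8 octets in hex.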
This byte does not designate a specific language
(i.e. script) as the OP appears to mistakenly assume.
That's technically accurate; although it just so happens that the
letters of the Hebrew alphabet (not counting the combining marks) all
have "d7" as the first octet (byte) of their utf-8 representations,
so, in a way, it -is- indicative of the Hebrew script.
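That claim about the first octet is easy enough to check mechanically. The following is only a sketch: it assumes the Hebrew letters ALEF through TAV sit at U+05D0..U+05EA, uses the Encode module, and first_octet_hex is a name I invented for the purpose.

```perl
use strict;
use warnings;
use Encode qw(encode);

# First utf-8 octet of a code point, as two lowercase hex digits.
# first_octet_hex is a hypothetical helper name.
sub first_octet_hex {
    my ($cp) = @_;
    return substr(unpack("H*", encode("UTF-8", chr($cp))), 0, 2);
}

# Collect any letter in U+05D0..U+05EA whose first octet is not d7.
my @not_d7 = grep { first_octet_hex($_) ne "d7" } 0x05D0 .. 0x05EA;
print @not_d7 ? "exceptions found\n" : "all letters start with d7\n";

# For contrast, one combining mark: U+05B0 (POINT SHEVA).
print first_octet_hex(0x05B0), "\n";
```

The contrast line is there because at least some of the combining marks fall below U+05C0 and so get a different lead octet.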
Well, the questioner referred to a "get" (which to me indicates
"form-URL-encoded" format), and said at the outset:
| the string דנה
| need to be tern into %D7%93%D7%A0%D7%94
Juergen Exner's reply seemed to be headed in the direction of
understanding the result as a url-encoded utf-8 representation, which
indeed hits the nail on the head, right.
But the original poster then added in a followup:
| i meant to send the real string (my name - dana) but google site
| encoded it.
by which I understood that Google had turned the original encoding
(whatever it might have been) into &#number; notations. Unfortunately
the actual posting
http://groups.google.com/[email protected]&output=gplain
claims to be in:
Content-Type: text/plain; charset=ISO-8859-1
which throws no light at all on what the actual posting details would
have been.
So we really don't know for sure from this whether the questioner is
working in iso-8859-8, utf-8 or what, in their practical application.
I'm pretty sure the OP wanted to search Google for a name in Hebrew.
To submit a search request to something, indeed.
Once the Hebrew text has been entered into Perl's natural Unicode
format, it will be represented internally as utf-8 octets.
Witness, step by step:
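use Encode qw(encode);
my $string = chr(1491) . chr(1504) . chr(1492);
# newer perls unpack by character, so ask for the utf-8 octets explicitly
my $result = unpack("H*", encode("UTF-8", $string));
print $result, "\n";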
Gives the result:
d793d7a0d794
Quod Erat Demonstrandum. It remains to insert the "%" characters
at appropriate points (no doubt a Perl golfer will be along any moment
to boil this down to a one-liner).
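Not golfed, but here is one way the remaining step might look. It is a sketch under assumptions: the input is already a Perl character string, the target is utf-8, and everything outside the unreserved URL characters gets escaped. The name percent_encode_utf8 is mine, and it does not do the "+" for space substitution that form submission uses.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Percent-encode a character string as utf-8 octets.
# percent_encode_utf8 is a hypothetical helper name.
sub percent_encode_utf8 {
    my ($text) = @_;
    my $octets = encode("UTF-8", $text);
    # Escape every octet that is not an unreserved URL character.
    $octets =~ s/([^A-Za-z0-9\-._~])/sprintf("%%%02X", ord($1))/ge;
    return $octets;
}

print percent_encode_utf8(chr(1491) . chr(1504) . chr(1492)), "\n";
```

For the three characters above this yields the %D7%93%D7%A0%D7%94 form that the questioner was asking for. In real code one would more likely reach for a CPAN module such as URI::Escape than roll the substitution by hand.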
But of course I cheated: I created the input by using the chr()
function. I say again, we need to know how our questioner is creating
this input before we can replace that initial step with something
useful.
Great stuff.