Turn a Hebrew string into Unicode

dana livni

Hello,
I hope you can help me.
I have a Hebrew string (it can be in other languages too) and I need to
turn it into something I can send in a "get", but it has to be in Unicode.
For example:
the string דנה
needs to be turned into %D7%93%D7%A0%D7%94

I tried a lot of methods but none of them worked, please help me.
 
Jürgen Exner

dana said:
I hope you can help me.
I have a Hebrew string (it can be in other languages too) and I need to
turn it into something I can send in a "get", but it has to be in Unicode.
For example:
the string דנה

This appears to be _one_ textual representation of those characters in
Unicode, presumably their numerical value in UTF-16?

needs to be turned into %D7%93%D7%A0%D7%94

This on the other hand doesn't look like Unicode at all but rather like
maybe URL encoding?

"Unicode" can be encoded in many different ways. Not only UTF-8 versus
UTF-16 versus UTF-32 but the resulting values can then be encoded in code
points (maybe what you got first), or as Base-64, or as URL-encode, or or
or.
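
To make the point concrete, here is a minimal sketch that prints the same three characters in several of these forms. It assumes a Perl with the core Encode and MIME::Base64 modules; the variable and helper names are made up for the illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);
use MIME::Base64 qw(encode_base64);

# The three Hebrew letters from the question: DALET, NUN, HE
my $text = "\x{5D3}\x{5E0}\x{5D4}";

# Helper (hypothetical name): octets of a byte string as lowercase hex
sub hex_octets { return unpack 'H*', $_[0] }

my $utf8 = encode('UTF-8', $text);

my %forms = (
    code_points => join(' ', map { sprintf 'U+%04X', ord } split //, $text),
    utf8_hex    => hex_octets($utf8),
    utf16be_hex => hex_octets(encode('UTF-16BE', $text)),
    base64      => encode_base64($utf8, ''),
    url_encoded => join('', map { sprintf '%%%02X', ord } split //, $utf8),
);

printf "%-12s %s\n", $_, $forms{$_} for sort keys %forms;
```

All five lines describe the same text; only the url_encoded line matches the %D7%93%D7%A0%D7%94 form from the question.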

Without knowing from where to where you really want to go, it is very
difficult to offer any advice.

jue
 
dana livni

I guess you're right. I need to convert the text in order to send it in a
get request, in the format used by the www.vivvisimo.com site.
I think that every %d7 means that this is Hebrew and that the second
pair stands for the specific letter.
I'm not sure which encoding it is.
I meant to send the real string (my name - dana) but the Google site
encoded it.

Is there any function that takes a string and the encoding to use and
returns a string made of two pairs:
1. one that stands for the language
2. one that stands for the specific letter,
like in my example? With that I will find the encoding I'm looking for.

thanks
 
Ian Wilson

dana said:
I guess you're right. I need to convert the text in order to send it in a
get request, in the format used by the www.vivvisimo.com site.

That's a parked domain; maybe you mean www.vivisimo.com?

I think that every %d7 means that this is Hebrew and that the second
pair stands for the specific letter.

In Unicode, Hebrew glyphs are in the range 0590-05FF

You originally said &#1491;&#1504;&#1492;
Decimal 1491 is hex 05D3, which is the Unicode code point for Hebrew
letter DALET. Presumably this is the first letter of "Dana" in Hebrew.
http://www.unicode.org/charts/
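
The lookup can also be done in Perl rather than in the charts. A small sketch, assuming the core charnames module and using the three decimal values that come up in this thread:

```perl
use strict;
use warnings;
use charnames ();

# Decimal values of the three characters in the question
for my $decimal (1491, 1504, 1492) {
    # viacode() maps a numeric code point to its official Unicode name
    my $name = charnames::viacode($decimal);
    my $in_hebrew_block = ($decimal >= 0x0590 && $decimal <= 0x05FF) ? 'yes' : 'no';
    printf "%d = U+%04X = %s (Hebrew block: %s)\n",
        $decimal, $decimal, $name, $in_hebrew_block;
}
```
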
I'm not sure which encoding it is.
I meant to send the real string (my name - dana) but the Google site
encoded it.

Does vivisimo accept Unicode? I thought most sites expected ISO-8859-1
(Latin 1), which does not include Hebrew characters AFAIK.

Is there any function that takes a string and the encoding to use and
returns a string made of two pairs:
1. one that stands for the language
2. one that stands for the specific letter,
like in my example? With that I will find the encoding I'm looking for.

Does such an encoding exist? A pair of 8-bit bytes would allow 256
languages of 256 glyphs. There must be more than 256 languages in
Unicode and most of them have more than 256 glyphs. So such an encoding
could not represent more than a small subset of Unicode.

http://www.marsengineering.com/charCodeConverter.html
 
Alan J. Flavell

That's a parked domain; maybe you mean www.vivisimo.com?

I worried about the fact that I didn't understand exactly what the
questioner was trying to achieve, so I was reluctant to try to answer
the question, even if I might have some of the relevant expertise.

In Unicode, Hebrew glyphs are in the range 0590-05FF

You originally said &#1491;&#1504;&#1492;
Decimal 1491 is hex 05D3, which is the Unicode code point for Hebrew
letter DALET. Presumably this is the first letter of "Dana" in Hebrew.
http://www.unicode.org/charts/

Looking good so far.

Does vivisimo accept Unicode? I thought most sites expected ISO-8859-1
(Latin 1), which does not include Hebrew characters AFAIK.

This is the point at which your reply lost credibility for me, I'm
afraid. If you don't know that for sure, I'm puzzled that you thought
it helpful to try to offer an answer.

Does such an encoding exist? A pair of 8-bit bytes would allow 256
languages of 256 glyphs.

I'm not sure where you're heading here. Seems to be devising a
problem for which there have long since been solutions.

Current Perl versions have a natural way of representing Unicode
internally; and natural ways of turning it into other useful
representations (could be iso-8859-8; could be HTML &#number;
representations which the questioner evidently already knows about;
etc.) if utf-8 coding is somehow not appropriate.
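
As a rough illustration of those alternatives, here is a sketch that takes the same three characters and produces iso-8859-8 octets and HTML numeric character references. It assumes the core Encode module; the variable names are invented for the example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# DALET, NUN, HE as a Perl character string
my $text = chr(1491) . chr(1504) . chr(1492);

# Legacy single-byte Hebrew encoding: one octet per letter
my $iso_8859_8 = encode('iso-8859-8', $text);

# HTML numeric character references: one &#number; per character
my $ncr = join '', map { sprintf '&#%d;', ord } split //, $text;

print unpack('H*', $iso_8859_8), "\n";
print $ncr, "\n";
```

Which of these is appropriate depends entirely on what the receiving end expects.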

But I still don't feel confident that I know what the original poster
wanted to achieve, so I couldn't offer a practical answer to their
questions yet, not with any degree of confidence.

There must be more than 256 languages in Unicode

Unicode doesn't really "do" languages, except in the context of
disambiguating unified CJK characters. Greek (language) is still
Greek (language) even when transcribed into Latin characters; English
(language) is still Engrish (language) when transcribed into Japanese
writing. Unicode represents *writing systems*, not languages.

have fun
 
dana livni

I'm not sure I understood your answer.

What do I want to do?
I want to create a URI.
It should look like the URIs of Google or Vivisimo.
When you go to one of those sites and search for a word in Hebrew or
any other language, both sites turn every character in it into an
expression in this pattern:
%xx%xx. The first pair seems to mark the language (d7 for Hebrew) and the
second the specific letter.

I want to find a way to do the same.
I hope it is clear enough now.
 
Ian

Alan J. Flavell said:
I worried about the fact that I didn't understand exactly what the
questioner was trying to achieve, so I was reluctant to try to answer
the question, even if I might have some of the relevant expertise.


Looking good so far.

Uh oh.

This is the point at which your reply lost credibility for me, I'm
afraid. If you don't know that for sure, I'm puzzled that you thought
it helpful to try to offer an answer.

Translation: "Please don't post replies to CLPM unless you are certain
the information you give is correct".

Further reading revealed ...

http://www.unicode.org/faq/unicode_web.html#11
says "If you have a single CGI and a single HTML form, then the
browsers will return the data in the encoding of the original form".

Both Google and Vivisimo search forms refer to UTF-8

For example, google has
<meta http-equiv="content-type" content="text/html; charset=UTF-8">


I'm not sure where you're heading here. Seems to be devising a
problem for which there have long since been solutions.

I was attempting reductio ad absurdum. Some further playing with a
calculator shows that the %D7 which the OP refers to is simply a hex
representation of the first byte of the UTF-8 encoding of the Hebrew
DALET character. This byte does not designate a specific language
(i.e. script) as the OP appears to mistakenly assume.
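
The calculator exercise can be sketched in a few lines of Perl. This only handles code points that need two UTF-8 octets (U+0080 to U+07FF), which covers the Hebrew block; the sub name is made up for the illustration.

```perl
use strict;
use warnings;

# Two-octet UTF-8 layout: 110xxxxx 10xxxxxx
# utf8_two_octets is a hypothetical helper, not a library function.
sub utf8_two_octets {
    my ($cp) = @_;
    die "code point outside the two-octet range\n"
        if $cp < 0x80 || $cp > 0x7FF;
    my $lead  = 0xC0 | ($cp >> 6);      # marker bits plus the top five bits
    my $trail = 0x80 | ($cp & 0x3F);    # marker bits plus the low six bits
    return ($lead, $trail);
}

# 1491 is U+05D3, HEBREW LETTER DALET
printf "%%%02X%%%02X\n", utf8_two_octets(1491);   # prints %D7%93
```

The D7 falls out of the top five bits of the code point, so any code point from U+05C0 to U+05FF shares it.
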
Current Perl versions have a natural way of representing Unicode
internally; and natural ways of turning it into other useful
representations (could be iso-8859-8; could be HTML &#number;
representations which the questioner evidently already knows about;
etc.) if utf-8 coding is somehow not appropriate.

But I still don't feel confident that I know what the original poster
wanted to achieve, so I couldn't offer a practical answer to their
questions yet, not with any degree of confidence.

Translation: Fools rush in where angels fear to tread.
I'm pretty sure the OP wanted to search Google for a name in Hebrew.
Point taken however.

Unicode doesn't really "do" languages, except in the context of
disambiguating unified CJK characters. Greek (language) is still
Greek (language) even when transcribed into Latin characters; English
(language) is still Engrish (language) when transcribed into Japanese
writing. Unicode represents *writing systems*, not languages.

This is of course true, my mistake. In fact, there is a web page at
unicode.org which does refer to the number of languages which can be
written using the various writing systems covered by Unicode. This
number is less than 256 :-(

have fun

I did.
 
Alan J. Flavell

Translation: "Please don't post replies to CLPM unless you are certain
the information you give is correct".

Oh no, I wouldn't go *that* far; but trying to answer a question about
Hebrew, when you say you're not sure whether iso-8859-1 has Hebrew
characters in it, *did* seem to be rather adventurous, in the
circumstances. IMHO and RTL and YMMV.

Further reading revealed ...

http://www.unicode.org/faq/unicode_web.html#11
says "If you have a single CGI and a single HTML form, then the
browsers will return the data in the encoding of the original form".

Kind-of odd wording they are using, but yes, that's right: by default,
browsers submit their form input using the same character encoding as
the HTML page which contains the form. And this is basically the only
option which works widely enough to be used (putting accept-charset on
the <form...> element is technically valid, but not widely supported).

However, Netscape 4.* versions get this massively wrong when the HTML
page is in utf-8. (Not that I really use NN4.* any more, but I keep
a copy for test purposes).

There's more about this topic (for anyone who's interested ;-) at
http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html

For example, google has
<meta http-equiv="content-type" content="text/html; charset=UTF-8">

If Google thinks the browser is capable of it, indeed it does.
(Try it from NN4.* and you'll find a different result).

Some further playing with a calculator shows that the %D7 which the
OP refers to is simply a hex representation of the first byte of the
UTF-8 encoding of the Hebrew DALET character.

Confirmed: based on the input mentioned in the original posting -

דנה

<!-- 1491 5d3 d793 -->
<!-- 1504 5e0 d7a0 -->
<!-- 1492 5d4 d794 -->

Those are decimal and hexadecimal code points, followed by the utf-8
representation. DALET NUN HE (reading off the unicode page U+05xx,
since I can't actually read Hebrew, sorry).

This byte does not designate a specific language
(i.e. script) as the OP appears to mistakenly assume.

That's technically accurate; although it just so happens that the
letters of the Hebrew alphabet (not counting the combining marks) all
have "d7" as the first octet (byte) of their utf-8 representation, so, in a
way, it -is- indicative of the Hebrew script.
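
For anyone who wants to check that claim, a quick sketch: it walks the letters U+05D0 (ALEF) to U+05EA (TAV), tallies the first octet of each utf-8 encoding, and compares with one combining mark. It assumes the core Encode module; the sub and variable names are made up.

```perl
use strict;
use warnings;
use Encode qw(encode);

# First octet of a code point's utf-8 encoding, as two hex digits
# (first_octet_hex is a hypothetical helper name)
sub first_octet_hex {
    my ($cp) = @_;
    my $octets = encode('UTF-8', chr($cp));
    return sprintf '%02x', ord(substr($octets, 0, 1));
}

# Tally the lead octet over the 27 Hebrew letters, ALEF through TAV
my %lead_count;
$lead_count{ first_octet_hex($_) }++ for 0x05D0 .. 0x05EA;

for my $lead (sort keys %lead_count) {
    print "$lead: $lead_count{$lead} letters\n";
}

# A combining mark for comparison: U+05B0, HEBREW POINT SHEVA
print "U+05B0 starts with ", first_octet_hex(0x05B0), "\n";
```

The letters all report d7, while the point at U+05B0 reports d6.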

Well, the questioner referred to a "get" (which to me indicates
"form-URL-encoded" format), and said at the outset:

| the string דנה
| need to be tern into %D7%93%D7%A0%D7%94

Juergen Exner's reply seemed to be headed in the direction of
understanding the result as a url-encoded utf-8 representation, which
indeed hits the nail on the head, right.
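
Reading it that way can be checked by going in the reverse direction. A minimal sketch, assuming the core Encode module: undo the %XX escapes, decode the resulting octets as utf-8, and print the code points.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $query = '%D7%93%D7%A0%D7%94';

# Undo the %XX escaping to get the raw octets back
(my $octets = $query) =~ s/%([0-9A-Fa-f]{2})/chr hex $1/ge;

# Interpret those octets as utf-8 to recover the characters
my $text = decode('UTF-8', $octets);

# List the decimal code points of the decoded characters
my $code_points = join ' ', map { ord } split //, $text;
print "$code_points\n";   # prints 1491 1504 1492
```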

But the original poster then added in a followup:

| i meant to send the real string (my name - dana) but google site
| encoded it.

by which I understood that Google had turned the original encoding
(whatever it might have been) into &#number; notations. Unfortunately
the actual posting

http://groups.google.com/[email protected]&output=gplain

claims to be in:

Content-Type: text/plain; charset=ISO-8859-1

which throws no light at all on what the actual posting details would
have been.

So we really don't know for sure from this whether the questioner is
working in iso-8859-8, utf-8 or what, in their practical application.

I'm pretty sure the OP wanted to search Google for a name in Hebrew.

To submit a search request to something, indeed.

Once the Hebrew text has been entered into Perl's natural Unicode
format, it will be represented internally as utf-8 octets.

Witness, step by step:

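use Encode qw(encode);

my $string = chr(1491) . chr(1504) . chr(1492);   # DALET NUN HE

# encode explicitly; Perl 5.10 and later unpack characters, not internal octets
my $result = unpack("H*", encode("UTF-8", $string));

print $result, "\n";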

Gives the result:

d793d7a0d794

Quod Erat Demonstrandum. It remains to insert the "%" characters
at appropriate points (no doubt a Perl golfer will be along any moment
to boil this down to a one-liner).
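
Not quite a one-liner, but a sketch of that last step might look like this. It assumes the core Encode module and escapes every octet outside the RFC 3986 unreserved set; url_encode_utf8 is a name made up here, not a library function.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Percent-encode a character string as utf-8 for use in a query string.
# url_encode_utf8 is a hypothetical helper for this sketch.
sub url_encode_utf8 {
    my ($text) = @_;
    my $octets = encode('UTF-8', $text);
    # Leave unreserved characters alone, escape every other octet as %XX
    $octets =~ s/([^A-Za-z0-9\-._~])/sprintf '%%%02X', ord $1/ge;
    return $octets;
}

my $name = chr(1491) . chr(1504) . chr(1492);
print url_encode_utf8($name), "\n";   # prints %D7%93%D7%A0%D7%94
```

The URI::Escape module on CPAN provides uri_escape_utf8 for the same job, if installing a module is an option.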

But of course I cheated: I created the input by using the chr()
function. I say again, we need to know how our questioner is creating
this input before we can replace that initial step with something
useful.

Great stuff.
 
