Turn a Hebrew string into Unicode

Discussion in 'Perl Misc' started by dana livni, Jun 29, 2004.

  1. dana livni

    dana livni Guest

    hello,
    i hope you can help me.
    i have a Hebrew string (it can be in other languages too) and i need to
    turn it into something i can send in a "get", but it has to be in unicode.
    for example:
    the string &#1491;&#1504;&#1492;
    needs to be turned into %D7%93%D7%A0%D7%94

    i tried a lot of methods but it does not work, please help me.
     
    dana livni, Jun 29, 2004
    #1

  2. dana livni wrote:
    > i hope you can help me.
    > i have a Hebrew string (it can be in other languages too) and i need to
    > turn it into something i can send in a "get", but it has to be in unicode.
    > for example:
    > the string &#1491;&#1504;&#1492;


    This appears to be _one_ textual representation of those characters in
    Unicode, presumably their numerical value in UTF-16?

    > needs to be turned into %D7%93%D7%A0%D7%94


    This on the other hand doesn't look like Unicode at all but rather like
    maybe URL encoding?

    "Unicode" can be encoded in many different ways. Not only UTF-8 versus
    UTF-16 versus UTF-32, but the resulting values can then be written as
    numeric code points (maybe what you got first), or as Base64, or
    URL-encoded, and so on.
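
To make the point concrete, here is a minimal Perl sketch that takes the three code points from the question and prints them in several of the forms mentioned above. The helper names hex_octets and percent_utf8 are invented for this illustration; Encode and MIME::Base64 are core modules.

```perl
use strict;
use warnings;
use Encode qw(encode);
use MIME::Base64 qw(encode_base64);

# DALET NUN HE, built from the decimal code points in the question.
my $text = join '', map { chr } 1491, 1504, 1492;

# Hex dump of the octets produced by a given encoding.
sub hex_octets {
    my ($encoding, $string) = @_;
    return unpack 'H*', encode($encoding, $string);
}

# Percent-escape every octet of the UTF-8 form (URL encoding).
sub percent_utf8 {
    my ($string) = @_;
    my @octets = unpack 'C*', encode('UTF-8', $string);
    return join '', map { sprintf '%%%02X', $_ } @octets;
}

print 'UTF-8:    ', hex_octets('UTF-8',    $text), "\n";   # d793d7a0d794
print 'UTF-16BE: ', hex_octets('UTF-16BE', $text), "\n";   # 05d305e005d4
print 'UTF-32BE: ', hex_octets('UTF-32BE', $text), "\n";
print 'Base64:   ', encode_base64(encode('UTF-8', $text), ''), "\n";
print 'URL:      ', percent_utf8($text), "\n";             # %D7%93%D7%A0%D7%94
```

The same three characters come out as quite different byte sequences, which is why "in Unicode" alone does not pin down the desired output.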

    Without knowing from where to where you really want to go it is very
    difficult to offer any advice.

    jue
     
    Jürgen Exner, Jun 29, 2004
    #2

  3. dana livni

    dana livni Guest

    i guess you're right, i need to convert the text in order to send it in a
    get request - in the format of the www.vivvisimo.com site.
    i think that all the %d7 mean that this is Hebrew and that the second
    pair stands for the specific letter.
    i'm not sure which encoding it is.
    i meant to send the real string (my name - dana) but the google site
    encoded it.

    is there any function that gets a string and the encoding to use and
    returns a string of two pairs:
    1. one that stands for the language
    2. one that stands for the specific letter.
    like in my example? then i will find the encoding i'm looking for.

    thanks
     
    dana livni, Jun 30, 2004
    #3
  4. Ian Wilson

    Ian Wilson Guest

    dana livni wrote:
    > i guess you're right, i need to convert the text in order to send it in a
    > get request - in the format of the www.vivvisimo.com site.


    That's a parked domain, maybe you mean www.vivisimo.com?

    > i think that all the %d7 mean that this is Hebrew and that the second
    > pair stands for the specific letter.


    In Unicode, Hebrew glyphs are in the range 0590-05FF

    You originally said
    >> the string &#1491;&#1504;&#1492;


    Decimal 1491 is Hex 05D3 which is the Unicode code-point for Hebrew
    letter DALET. Presumably this is the first letter of "Dana" in Hebrew.
    http://www.unicode.org/charts/
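
The decimal-to-hex step and the character name can also be checked from Perl itself. A small sketch, assuming the core charnames module is available; describe is a hypothetical helper name.

```perl
use strict;
use warnings;
use charnames ();

# Describe one code point given in decimal, e.g. taken from a &#1491; reference.
sub describe {
    my ($decimal) = @_;
    return sprintf 'U+%04X %s', $decimal, charnames::viacode($decimal);
}

print describe($_), "\n" for 1491, 1504, 1492;
# U+05D3 HEBREW LETTER DALET
# U+05E0 HEBREW LETTER NUN
# U+05D4 HEBREW LETTER HE
```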

    > i'm not sure which encoding it is.
    > i meant to send the real string (my name - dana) but the google site
    > encoded it.


    Does vivisimo accept Unicode? I thought most sites expected ISO-8859-1
    (Latin 1). Which does not include Hebrew characters AFAIK.

    > is there any function that gets a string and the encoding to use and
    > returns a string of two pairs:
    > 1. one that stands for the language
    > 2. one that stands for the specific letter.
    > like in my example? then i will find the encoding i'm looking for.


    Does such an encoding exist? A pair of 8-bit bytes would allow 256
    languages of 256 glyphs. There must be more than 256 languages in
    Unicode and most of them have more than 256 glyphs. So such an encoding
    could not represent more than a small subset of Unicode.

    http://www.marsengineering.com/charCodeConverter.html
     
    Ian Wilson, Jul 2, 2004
    #4
  5. On Fri, 2 Jul 2004, Ian Wilson wrote:

    > dana livni wrote:
    > > i guess you're right, i need to convert the text in order to send it in a
    > > get request - in the format of the www.vivvisimo.com site.

    >
    > That's a parked domain, maybe you mean www.vivisimo.com?
    >
    > > i think that all the %d7 mean that this is Hebrew and that the second
    > > pair stands for the specific letter.


    I worried about the fact that I didn't understand exactly what the
    questioner was trying to achieve, so I was reluctant to try to answer
    the question, even if I might have some of the relevant expertise.

    > In Unicode, Hebrew glyphs are in the range 0590-05FF
    >
    > You originally said
    > >> the string &#1491;&#1504;&#1492;

    >
    > Decimal 1491 is Hex 05D3 which is the Unicode code-point for Hebrew
    > letter DALET. Presumably this is the first letter of "Dana" in Hebrew.
    > http://www.unicode.org/charts/


    Looking good so far.

    > Does vivisimo accept Unicode? I thought most sites expected ISO-8859-1
    > (Latin 1). Which does not include Hebrew characters AFAIK.


    This is the point at which your reply lost credibility for me, I'm
    afraid. If you don't know that for sure, I'm puzzled that you thought
    it helpful to try to offer an answer.

    > > is there any function that gets a string and the encoding to use and
    > > returns a string of two pairs:
    > > 1. one that stands for the language
    > > 2. one that stands for the specific letter.
    > > like in my example? then i will find the encoding i'm looking for.

    >
    > Does such an encoding exist? A pair of 8-bit bytes would allow 256
    > languages of 256 glyphs.


    I'm not sure where you're heading here. Seems to be devising a
    problem for which there have long since been solutions.

    Current Perl versions have a natural way of representing Unicode
    internally; and natural ways of turning it into other useful
    representations (could be iso-8859-8; could be HTML
    representations which the questioner evidently already knows about;
    etc.) if utf-8 coding is somehow not appropriate.

    But I still don't feel confident that I know what the original poster
    wanted to achieve, so I couldn't offer a practical answer to their
    questions yet, not with any degree of confidence.

    > There must be more than 256 languages in Unicode


    Unicode doesn't really "do" languages, except in the context of
    disambiguating unified CJK characters. Greek (language) is still
    Greek (language) even when transcribed into Latin characters; English
    (language) is still Engrish (language) when transcribed into Japanese
    writing. Unicode represents *writing systems*, not languages.

    have fun
     
    Alan J. Flavell, Jul 2, 2004
    #5
  6. dana livni

    dana livni Guest

    i'm not sure i understood your answer.

    what do i want to do?
    i want to create a uri.
    it should look like the uri of google or vivisimo.
    when you enter one of those sites and search for a word in Hebrew or
    any other language, both sites turn every character in it into an
    expression in this pattern:
    %xx%xx. the first pair seems to mark the language (in Hebrew, d7) and the
    second the specific letter.

    i want to find a way to do the same.
    i hope now it is clear enough.
     
    dana livni, Jul 4, 2004
    #6
  7. Ian

    Ian Guest

    "Alan J. Flavell" <> wrote in message news:<>...
    > On Fri, 2 Jul 2004, Ian Wilson wrote:
    >
    > > dana livni wrote:
    > > > i guess you're right, i need to convert the text in order to send it in a
    > > > get request - in the format of the www.vivvisimo.com site.

    > >
    > > That's a parked domain, maybe you mean www.vivisimo.com?
    > >
    > > > i think that all the %d7 mean that this is Hebrew and that the second
    > > > pair stands for the specific letter.

    >
    > I worried about the fact that I didn't understand exactly what the
    > questioner was trying to achieve, so I was reluctant to try to answer
    > the question, even if I might have some of the relevant expertise.
    >
    > > In Unicode, Hebrew glyphs are in the range 0590-05FF
    > >
    > > You originally said
    > > >> the string &#1491;&#1504;&#1492;

    > >
    > > Decimal 1491 is Hex 05D3 which is the Unicode code-point for Hebrew
    > > letter DALET. Presumably this is the first letter of "Dana" in Hebrew.
    > > http://www.unicode.org/charts/

    >
    > Looking good so far.


    Uh oh.


    > > Does vivisimo accept Unicode? I thought most sites expected ISO-8859-1
    > > (Latin 1). Which does not include Hebrew characters AFAIK.

    >
    > This is the point at which your reply lost credibility for me, I'm
    > afraid. If you don't know that for sure, I'm puzzled that you thought
    > it helpful to try to offer an answer.


    Translation: "Please don't post replies to CLPM unless you are certain
    the information you give is correct".

    Further reading revealed ...

    http://www.unicode.org/faq/unicode_web.html#11
    says "If you have a single CGI and a single HTML form, then the
    browsers will return the data in the encoding of the original form".

    Both Google and Vivisimo search forms refer to UTF-8

    For example, google has
    <meta http-equiv="content-type" content="text/html; charset=UTF-8">



    > > > is there any function that gets a string and the encoding to use and
    > > > returns a string of two pairs:
    > > > 1. one that stands for the language
    > > > 2. one that stands for the specific letter.
    > > > like in my example? then i will find the encoding i'm looking for.

    > >
    > > Does such an encoding exist? A pair of 8-bit bytes would allow 256
    > > languages of 256 glyphs.

    >
    > I'm not sure where you're heading here. Seems to be devising a
    > problem for which there have long since been solutions.


    I was attempting reductio ad absurdum. Some further playing with a
    calculator shows that the %D7 which the OP refers to is simply a hex
    representation of the first byte of the UTF-8 encoding of the Hebrew
    DALET character. This byte does not designate a specific language
    (i.e. script) as the OP appears to mistakenly assume.
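
The calculator exercise can be written out as a short sketch. For code points from U+0080 to U+07FF, UTF-8 uses two octets of the form 110xxxxx 10yyyyyy, where xxxxx holds the top five bits of the code point and yyyyyy the low six. The helper utf8_two_octets is a hypothetical name and handles only that range.

```perl
use strict;
use warnings;

# Encode a code point in the range U+0080..U+07FF as two UTF-8 octets.
sub utf8_two_octets {
    my ($cp) = @_;
    die "code point out of two-octet range\n" if $cp < 0x80 || $cp > 0x7FF;
    my $lead  = 0xC0 | ($cp >> 6);      # 110xxxxx: top 5 bits
    my $trail = 0x80 | ($cp & 0x3F);    # 10yyyyyy: low 6 bits
    return ($lead, $trail);
}

printf "U+%04X -> %02X %02X\n", $_, utf8_two_octets($_) for 0x05D3, 0x05E0, 0x05D4;
# U+05D3 -> D7 93
# U+05E0 -> D7 A0
# U+05D4 -> D7 94
```

The lead octet depends only on the top five bits of the code point, so any 64 neighbouring code points share it; that is why %D7 keeps reappearing without being a language tag.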

    > Current Perl versions have a natural way of representing Unicode
    > internally; and natural ways of turning it into other useful
    > representations (could be iso-8859-8; could be HTML
    > representations which the questioner evidently already knows about;
    > etc.) if utf-8 coding is somehow not appropriate.


    > But I still don't feel confident that I know what the original poster
    > wanted to achieve, so I couldn't offer a practical answer to their
    > questions yet, not with any degree of confidence.


    Translation: Fools rush in where angels fear to tread.
    I'm pretty sure the OP wanted to search Google for a name in Hebrew.
    Point taken however.

    > > There must be more than 256 languages in Unicode

    >
    > Unicode doesn't really "do" languages, except in the context of
    > disambiguating unified CJK characters. Greek (language) is still
    > Greek (language) even when transcribed into Latin characters; English
    > (language) is still Engrish (language) when transcribed into Japanese
    > writing. Unicode represents *writing systems*, not languages.


    This is of course true, my mistake. In fact, there is a web page at
    unicode.org which does refer to the number of languages which can be
    written using the various writing systems covered by Unicode. This
    number is less than 256 :-(

    > have fun


    I did.
     
    Ian, Jul 15, 2004
    #7
  8. On Thu, 15 Jul 2004, Ian wrote:

    > > > I thought most sites expected ISO-8859-1
    > > > (Latin 1). Which does not include Hebrew characters AFAIK.

    > >
    > > This is the point at which your reply lost credibility for me, I'm
    > > afraid.

    >
    > Translation: "Please don't post replies to CLPM unless you are certain
    > the information you give is correct".


    Oh no, I wouldn't go *that* far; but trying to answer a question about
    Hebrew, when you say you're not sure whether iso-8859-1 has Hebrew
    characters in it, *did* seem to be rather adventurous, in the
    circumstances. IMHO and RTL and YMMV.

    > Further reading revealed ...
    >
    > http://www.unicode.org/faq/unicode_web.html#11
    > says "If you have a single CGI and a single HTML form, then the
    > browsers will return the data in the encoding of the original form".


    Kind-of odd wording they are using, but yes, that's right: by default,
    browsers submit their forms input using the same character encoding as
    the HTML page which contains the form. And this is basically the only
    option which works widely enough to be used (putting accept-charset on
    the <form...> element is technically valid, but not widely supported).

    However, Netscape 4.* versions get this massively wrong when the HTML
    page is in utf-8. (Not that I really use NN4.* any more, but I keep
    a copy for test purposes).

    There's more about this topic (for anyone who's interested ;-) at
    http://ppewww.ph.gla.ac.uk/~flavell/charset/form-i18n.html

    > For example, google has
    > <meta http-equiv="content-type" content="text/html; charset=UTF-8">


    If Google thinks the browser is capable of it, indeed it does.
    (Try it from NN4.* and you'll find a different result).

    > Some further playing with a calculator shows that the %D7 which the
    > OP refers to is simply a hex representation of the first byte of the
    > UTF-8 encoding of the Hebrew DALET character.


    Confirmed: based on the input mentioned in the original posting -

    &#1491;&#1504;&#1492;

    <!-- 1491 5d3 d793 -->
    <!-- 1504 5e0 d7a0 -->
    <!-- 1492 5d4 d794 -->

    Those are decimal and hexadecimal code points, followed by the utf-8
    representation. DALET NUN HE (reading off the unicode page U+05xx,
    since I can't actually read Hebrew, sorry).

    > This byte does not designate a specific language
    > (i.e. script) as the OP appears to mistakenly assume.


    That's technically accurate; although it just so happens that the
    Hebrew alphabet (not counting the combining marks) in their utf-8
    representations all have "d7" as their first octet (byte), so, in a
    way, it -is- indicative of the Hebrew script.
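
A quick way to check this observation, as a sketch using the core Encode module (lead_octet is a name invented here):

```perl
use strict;
use warnings;
use Encode qw(encode);

# First UTF-8 octet of a code point, as two uppercase hex digits.
sub lead_octet {
    my ($cp) = @_;
    return sprintf '%02X', ord(encode('UTF-8', chr($cp)));
}

# Letters ALEF (U+05D0) through TAV (U+05EA).
my %seen;
$seen{ lead_octet($_) }++ for 0x05D0 .. 0x05EA;
print join(',', sort keys %seen), "\n";    # D7

# A combining mark lower in the block uses a different lead octet.
print lead_octet(0x05B0), "\n";            # D6 (HEBREW POINT SHEVA)
```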

    Well, the questioner referred to a "get" (which to me indicates
    "form-URL-encoded" format), and said at the outset:

    | the string &#1491;&#1504;&#1492;
    | needs to be turned into %D7%93%D7%A0%D7%94

    Juergen Exner's reply seemed to be headed in the direction of
    understanding the result as a url-encoded utf-8 representation, which
    indeed hits the nail on the head, right.

    But the original poster then added in a followup:

    | i meant to send the real string (my name - dana) but google site
    | encoded it.

    by which I understood that Google had turned the original encoding
    (whatever it might have been) into &#number; notations. Unfortunately
    the actual posting

    http://groups.google.com/groups?selm=&output=gplain

    claims to be in:

    Content-Type: text/plain; charset=ISO-8859-1

    which throws no light at all on what the actual posting details would
    have been.

    So we really don't know for sure from this whether the questioner is
    working in iso-8859-8, utf-8 or what, in their practical application.

    > I'm pretty sure the OP wanted to search Google for a name in Hebrew.


    To submit a search request to something, indeed.

    Once the Hebrew text has been entered into Perl's natural Unicode
    format, it will be represented internally as utf-8 octets.

    Witness, step by step:

    my $string = chr(1491) . chr(1504) . chr(1492);

    my $result = unpack("H*",$string);

    print $result, "\n";

    Gives the result:

    d793d7a0d794

    Quod Erat Demonstrandum. It remains to insert the "%" characters
    at appropriate points (no doubt a Perl golfer will be along any moment
    to boil this down to a one-liner).
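
One way to finish the job, sketched under the assumption that the text is already a Perl character string. The helper name url_encode_utf8 is hypothetical. The sketch encodes to UTF-8 explicitly with the core Encode module instead of relying on unpack seeing the internal representation, since that behaviour may differ between perl versions.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Percent-encode a Perl character string as UTF-8 for use in a query string.
# Unreserved characters (RFC 3986) are left alone; every other octet becomes %XX.
sub url_encode_utf8 {
    my ($text) = @_;
    my $octets = encode('UTF-8', $text);
    $octets =~ s/([^A-Za-z0-9\-._~])/sprintf '%%%02X', ord $1/ge;
    return $octets;
}

my $string = chr(1491) . chr(1504) . chr(1492);
print url_encode_utf8($string), "\n";    # %D7%93%D7%A0%D7%94
```

The CPAN module URI::Escape provides uri_escape_utf8 for the same purpose, if installing a module is an option.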

    But of course I cheated: I created the input by using the chr()
    function. I say again, we need to know how our questioner is creating
    this input before we can replace that initial step with something
    useful.
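
As a sketch of what that initial step might look like, here are three plausible ways the input could arrive: as iso-8859-8 octets, as UTF-8 octets, or as decimal numeric character references. All three are assumptions, since the thread does not say which applies, and the helper names are invented for the illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Turn raw input octets into a Perl character string, given the name of
# the encoding they arrived in.
sub to_characters {
    my ($encoding, $octets) = @_;
    return decode($encoding, $octets);
}

# Resolve decimal numeric character references such as &#1491;
sub from_ncr {
    my ($text) = @_;
    $text =~ s/&#(\d+);/chr $1/ge;
    return $text;
}

# List the code points of a character string, for inspection.
sub code_points {
    my ($string) = @_;
    return join ' ', map { sprintf 'U+%04X', ord $_ } split //, $string;
}

print code_points(to_characters('iso-8859-8', "\xE3\xF0\xE4")), "\n";
print code_points(to_characters('UTF-8', "\xD7\x93\xD7\xA0\xD7\x94")), "\n";
print code_points(from_ncr('&#1491;&#1504;&#1492;')), "\n";
# Each line prints: U+05D3 U+05E0 U+05D4
```

Whichever route applies, the result is the same character string, and only then does the choice of output encoding come into play.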

    > > have fun

    >
    > I did.


    Great stuff.
     
    Alan J. Flavell, Jul 15, 2004
    #8