converting UTF-8 to entities like &#x525B; (剛)

Discussion in 'Ruby' started by Jian Lin, May 9, 2009.

  1. Jian Lin

    Jian Lin Guest

    I was trying to convert UTF-8 content into a series of entities like
    &#x525B; (剛) so that whatever the page encoding is, the characters would
    show...

    so I used something like this:
    <%
    begin
    t = ''
    s = Iconv.conv("UTF-32", "UTF-8", some_utf8_string)

    s.scan(/(.)(.)(.)(.)/) do |b1, b2, b3, b4|
    t += ("&#x" + "%02X" % b3.ord) + ("%02X" % b4.ord) + ";"
    end
    rescue => details
    t = "exception " + details
    end
    %>

    <%= t %>

    but some characters get converted, and some don't. Is it true that
    (.)(.)(.)(.) will not necessarily match 4 bytes at a time?

    At first, I was going to use

    s = Iconv.conv("UTF-16", "UTF-8", some_utf8_string)

    but then i found that utf-16 is also variable length... so I used UTF-32
    instead which is fixed length. The UTF-8 string I have is just the
    Basic Plane... so should be all in the 0x0000 to 0xFFFF range in
    unicode.
    --
    Posted via http://www.ruby-forum.com/.
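
The fixed-width versus variable-width point above can be checked with a small sketch. It assumes a modern Ruby (1.9 or later) with String#encode rather than Iconv; the sample characters and the helper name `widths` are illustrative and not from the thread.

```ruby
# Byte widths of one character in UTF-8, UTF-16BE and UTF-32BE.
# `widths` is a hypothetical helper name used only for this illustration.
def widths(ch)
  %w[UTF-8 UTF-16BE UTF-32BE].map { |enc| ch.encode(enc).bytesize }
end

# Assumed samples: ASCII, a Basic Plane character, and one outside the Basic Plane.
["A", "\u525B", "\u{1F600}"].each do |ch|
  puts format("U+%04X  bytes in UTF-8/16/32: %s", ch.ord, widths(ch).join("/"))
end
```

Only the character outside the Basic Plane needs four bytes (a surrogate pair) in UTF-16, while UTF-32 uses four bytes for every character. Taking just the last two bytes of each UTF-32 unit, as the code in the post does, is therefore only safe for Basic Plane text.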
     
    Jian Lin, May 9, 2009
    #1

  2. Robert Dober

    Robert Dober Guest

    On Sat, May 9, 2009 at 2:04 PM, Jian Lin <> wrote:
    Sorry for a rather superficial answer, but can you use the Unicode
    switch for regexen in your Ruby version? That seems to be the problem.

    Robert


    --
    Si tu veux construire un bateau ...
    Ne rassemble pas des hommes pour aller chercher du bois, préparer des
    outils, répartir les tâches, alléger le travail… mais enseigne aux
    gens la nostalgie de l’infini de la mer.

    If you want to build a ship, don’t herd people together to collect
    wood and don’t assign them tasks and work, but rather teach them to
    long for the endless immensity of the sea.

    --
    Antoine de Saint-Exupéry
     
    Robert Dober, May 9, 2009
    #2

  3. Jian Lin

    Jian Lin Guest

    Robert Dober wrote:
    > On Sat, May 9, 2009 at 2:04 PM, Jian Lin <> wrote:
    > sorry for a quite superficial answer, but can you use the unicode
    > switch for regexen in your Ruby Version. This seems to be the problem.
    >
    > Robert



    It really might be the 0 byte that is choking the regular expression
    match... if I use

    s.scan(/(.)(.)(.)(.)/s)

    then it works better, but still not all characters are converted...

    By the way, I have a solution using byte processing... in the next post.

     
    Jian Lin, May 9, 2009
    #3
  4. Jian Lin

    Jian Lin Guest

    this works:

    but i am sure there are more elegant solutions.

    <%
    begin
    t = ''
    s = Iconv.conv("UTF-32", "UTF-8", some_utf8_string)

    (s.length / 4).times do |i|
    b3 = s[i*4 + 2]
    b4 = s[i*4 + 3]
    t += ("&#x" + "%02X" % b3) + ("%02X" % b4) + ";"
    end
    rescue => details
    t = "exception " + details
    end
    %>

    <%= t %>
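
For readers on a current Ruby, where Iconv is no longer bundled, the same byte-level idea can be sketched with String#encode and String#unpack. This is an illustration under that assumption, not the poster's code; `to_entities` is a hypothetical helper name.

```ruby
# Convert a UTF-8 string to hex numeric character references by reading
# its UTF-32BE form as a sequence of big-endian 32-bit integers.
# `to_entities` is a hypothetical name chosen for this sketch.
def to_entities(utf8_string)
  utf8_string.encode("UTF-32BE")   # fixed width, big-endian, no byte order mark
             .unpack("N*")         # one Integer per code point
             .map { |cp| format("&#x%04X;", cp) }
             .join
end

puts to_entities("\u525B\u9ABC")
```

Because the whole 32-bit value is used instead of only the last two bytes, this version also handles characters outside the Basic Plane.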
     
    Jian Lin, May 9, 2009
    #4
  5. Jian Lin

    Jian Lin Guest

    Robert Dober wrote:
    > On Sat, May 9, 2009 at 2:04 PM, Jian Lin <> wrote:
    > sorry for a quite superficial answer, but can you use the unicode
    > switch for regexen in your Ruby Version. This seems to be the problem.
    >
    > Robert



    By the way... Robert... what is "regexen"? Is it a regular
    expression modifier? I'd like it to match absolutely anything
    (newline, 0, etc)... but it seems like there is no match
     
    Jian Lin, May 9, 2009
    #5
  6. On Sat, May 9, 2009 at 8:40 AM, Jian Lin <> wrote:
    > Robert Dober wrote:
    >> On Sat, May 9, 2009 at 2:04 PM, Jian Lin <> wrote:
    >> sorry for a quite superficial answer, but can you use the unicode
    >> switch for regexen in your Ruby Version. This seems to be the problem.
    >>
    >> Robert

    >
    >
    > by the way... Robert... what is the regexen? is it the regular
    > expression modifier? I'd like it to match absolutely anything
    > (newline, 0, etc)... but seems like there is no match


    I'm pretty sure that Robert used regexen as the geeky way of pluralizing
    regex.

    The unicode switch (the u regular expression option) forces the string
    being matched to be interpreted as Unicode (UTF-8); otherwise the regex
    uses whatever encoding the source file containing the regular
    expression has.

    e.g. /./u

    If you want . to match newlines you want the m (multi-line) option.
    Normally . will match anything BUT a new line, m changes this.

    irb(main):001:0> "a\nb".match(/a.b/)
    => nil
    irb(main):002:0> "a\nb".match(/a.b/m)
    => #<MatchData:0x6a248>
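
The newline behaviour matters for this thread because a UTF-32 byte stream can contain the byte 0x0A inside an ordinary character. The sketch below assumes a modern Ruby; U+4E0A is a sample character picked for illustration, not one mentioned in the thread.

```ruby
# "." skips newlines unless the m option is given.
plain = "a\nb".match(/a.b/)    # no match without m
multi = "a\nb".match(/a.b/m)   # matches with m

# Assumed sample: U+4E0A, whose lowest byte is 0x0A (the newline byte).
bytes = "\u4E0A".encode("UTF-32BE").bytes

puts plain.inspect
puts multi.inspect
puts bytes.inspect
puts bytes.include?("\n".ord)
```

A pattern such as /(.)(.)(.)(.)/ without the m option cannot match a four-byte group that contains this byte, which would be one plausible reason for some characters being skipped.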

    --
    Rick DeNatale

    Blog: http://talklikeaduck.denhaven2.com/
    Twitter: http://twitter.com/RickDeNatale
    WWR: http://www.workingwithrails.com/person/9021-rick-denatale
    LinkedIn: http://www.linkedin.com/in/rickdenatale
     
    Rick DeNatale, May 9, 2009
    #6
  7. Jian Lin

    Jian Lin Guest

    Rick Denatale wrote:
    > On Sat, May 9, 2009 at 8:40 AM, Jian Lin <> wrote:
    >> (newline, 0, etc)... but seems like there is no match

    > I'm pretty sure that Robert used regexen as the geeky way of pluralizing
    > regex.
    >
    > The unicode switch (a u regular expression option) forces the use of
    > unicode to interpret the string being matched, otherwise it uses
    > whatever the encoding of the source file containing the regular
    > expression.
    >
    > e.g. /./u
    >
    > If you want . to match newlines you want the m (multi-line) option.
    > Normally . will match anything BUT a new line, m changes this.
    >
    > irb(main):001:0> "a\nb".match(/a.b/)
    > => nil
    > irb(main):002:0> "a\nb".match(/a.b/m)
    > => #<MatchData:0x6a248>


    aha... here i just want to match 4 bytes at a time, no matter what the
    bytes are. Using "m" won't do it... the "u" would be helpful if i
    match one UTF-8 character at a time and then process it... right now i
    actually convert it all at once to UTF-32 and then process it... so I
    wonder if there is a way to match 4 bytes at a time.

     
    Jian Lin, May 9, 2009
    #7
  8. 7stud --

    7stud -- Guest

    Jian Lin wrote:
    > Rick Denatale wrote:
    >> On Sat, May 9, 2009 at 8:40 AM, Jian Lin <> wrote:
    >>> (newline, 0, etc)... but seems like there is no match

    >> I'm pretty sure that Robert used regexen as the geeky way of pluralizing
    >> regex.
    >>
    >> The unicode switch (a u regular expression option) forces the use of
    >> unicode to interpret the string being matched, otherwise it uses
    >> whatever the encoding of the source file containing the regular
    >> expression.
    >>
    >> e.g. /./u
    >>
    >> If you want . to match newlines you want the m (multi-line) option.
    >> Normally . will match anything BUT a new line, m changes this.
    >>
    >> irb(main):001:0> "a\nb".match(/a.b/)
    >> => nil
    >> irb(main):002:0> "a\nb".match(/a.b/m)
    >> => #<MatchData:0x6a248>

    >
    > aha... here i just want to match 4 bytes at a time, no matter what the
    > bytes are. Using "m" won't do it... the "u" would be helpful if i
    > match one UTF-8 character at a time and then process it... right now i
    > actually convert it all at once to UTF-32 and then process it... so I
    > wonder if there is a way to match 4 bytes at a time.


    So what's the problem? A dot matches any byte (with the 'm' switch).
    Make a regex with four dots:

    /..../

    or

    /.{4}/
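
A sketch of scanning four bytes at a time on a modern Ruby, assuming the string is first forced to binary so that "." means one byte. The helper names `four_byte_chunks` and `entities_via_scan` are hypothetical. Both the n option (binary pattern) and the m option ("." also matches the newline byte) are used.

```ruby
# Split the UTF-32BE form of a string into four-byte chunks.
# String#b returns a binary (ASCII-8BIT) copy, so "." matches single bytes.
def four_byte_chunks(utf8_string)
  utf8_string.encode("UTF-32BE").b.scan(/.{4}/mn)
end

# Turn each chunk into a hex numeric character reference.
def entities_via_scan(utf8_string)
  four_byte_chunks(utf8_string).map { |chunk|
    format("&#x%04X;", chunk.unpack("N").first)
  }.join
end

puts entities_via_scan("\u4E0A\u525B")

# Without m, a chunk containing the byte 0x0A is not matched at all.
puts "\u4E0A".encode("UTF-32BE").b.scan(/.{4}/n).inspect
```

The last line prints an empty array, because the sample character's UTF-32 form ends in the newline byte.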
     
    7stud --, May 9, 2009
    #8
  9. 7stud --

    7stud -- Guest

    Whoops. With the 'm' switch:

    /..../m

    or

    /.{4}/m

    7stud --, May 9, 2009
    #9
  10. Jian Lin

    Jian Lin Guest

    7stud -- wrote:
    > 7stud -- wrote:
    > Whoops. With the 'm' switch:
    >
    > /..../m
    >
    > or
    >
    > /.{4}/m



    The problem is that some characters are converted to the correct
    &#x9ABC; (骼) etc., but some characters are not... you can try it if you
    want... just go to Google News and get a China, Taiwan, or HK news headline.

     
    Jian Lin, May 9, 2009
    #10
  11. 7stud --

    7stud -- Guest

    Jian Lin wrote:
    > 7stud -- wrote:
    >> 7stud -- wrote:
    >> Whoops. With the 'm' switch:
    >>
    >> /..../m
    >>
    >> or
    >>
    >> /.{4}/m

    >
    >
    > the problem is that some characters are converted to the correct
    > &#x9ABC; (骼) etc, but some characters are not... you can try if you want...
    > just go to Google News and get a China, taiwan, or hk news headline.


    Then why do you insist that you are trying to match any 4 bytes?
     
    7stud --, May 10, 2009
    #11
  12. Jian Lin

    Jian Lin Guest

    7stud -- wrote:
    > Jian Lin wrote:
    >> 7stud -- wrote:
    >>> 7stud -- wrote:
    >>> Whoops. With the 'm' switch:
    >>>
    >>> /..../m
    >>>
    >>> or
    >>>
    >>> /.{4}/m

    >>
    >>
    >> the problem is that some characters are converted to the correct
    >> &#x9ABC; (骼) etc, but some characters are not... you can try if you want...
    >> just go to Google News and get a China, taiwan, or hk news headline.

    >
    > Then why do you insist that you are trying to match any 4 bytes?


    No... the program converts the UTF-8 string into UTF-32, so that each
    character (code point) is 4 bytes long. Then the program processes
    the result 4 bytes at a time, which is why it scans 4 bytes at a
    time.

     
    Jian Lin, May 10, 2009
    #12
  13. Robert Dober

    Robert Dober Guest

    >
    > I'm pretty sure that Robert used regexen as the geeky way of pluralizing regex.

    I plead guilty your honor :)
     
    Robert Dober, May 10, 2009
    #13
  14. Jian Lin wrote:
    >
    > I was trying to convert UTF-8 content into a series of entities like
    > &#x525B; (剛) so that whatever the page encoding is, the characters would
    > show...
    >


    If you are on 1.9, you could use String#codepoints. Something similar
    to:

    >> '威斯加的中文很不好'.codepoints.to_a.map {|e| "&#x#{e.to_s(16)};"}

    => ["&#x5a01;", "&#x65af;", "&#x52a0;", "&#x7684;", "&#x4e2d;",
    "&#x6587;", "&#x5f88;", "&#x4e0d;", "&#x597d;"]

    HTH
    威斯加
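
A variation on the codepoints idea, sketched for Ruby 1.9 or later: escape only the non-ASCII characters so that plain ASCII text stays readable in the page source. `escape_non_ascii` is a hypothetical helper name, and the sample string is illustrative.

```ruby
# Replace every non-ASCII character with a hex numeric character reference,
# leaving ASCII characters untouched.
# `escape_non_ascii` is a hypothetical name chosen for this sketch.
def escape_non_ascii(utf8_string)
  utf8_string.gsub(/[^\x00-\x7F]/) { |ch| format("&#x%X;", ch.ord) }
end

puts escape_non_ascii("Ruby \u4E2D\u6587")
```

This sketch does not escape HTML-significant ASCII characters such as & or <; those would still need the usual HTML escaping.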
     
    Wisccal Wisccal, May 13, 2009
    #14
  15. Jian Lin

    Jian Lin Guest

    Wisccal Wisccal wrote:

    > If you are on 1.9, you could use String#codepoints. Something similar
    > to:
    >
    >>> '威斯加的中文很不好'.codepoints.to_a.map {|e| "&#x#{e.to_s(16)};"}

    > => ["&#x5a01;", "&#x65af;", "&#x52a0;", "&#x7684;", "&#x4e2d;",
    > "&#x6587;", "&#x5f88;", "&#x4e0d;", "&#x597d;"]
    >
    > HTH
    > 威斯加


    that's really cool... Wisccal, how do you know Chinese?

     
    Jian Lin, May 13, 2009
    #15