Replace Unicode character

Discussion in 'Perl Misc' started by Ryan Chan, Oct 5, 2009.

  1. Ryan Chan

    Ryan Chan Guest

    Hello,

    Below is my code, which is meant to replace the Unicode character "□" with
    an empty string. What is wrong with the code?

    ###################

    use strict;
    use warnings;
    use utf8;

    my $s = "□"; # hex value = A1BC

    $s =~ s/\xA1\xBC//gi;
    print $s;

    ###################


    Thanks.
     
    Ryan Chan, Oct 5, 2009
    #1

  2. Ryan Chan

    Ryan Chan Guest

    Hello,


    Even if I use

    $s =~ s/\xA1BC//gi;

    the result is the same...

    Thanks anyway
     
    Ryan Chan, Oct 5, 2009
    #2

  3. Ben Bullock

    Ben Bullock Guest

    \x{A1BC} works though.

    It's documented in "perldoc perlunicode".

    According to Unicode::UCD this is the character "YI SYLLABLE LIEX".
     
    Ben Bullock, Oct 5, 2009
    #3
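To see why the three spellings behave differently, here is a minimal sketch that prints the code points Perl produces for each escape. The helper name `code_points` is made up for this illustration. It relies on the documented rule that an unbraced `\x` consumes at most two hex digits.

```perl
use strict;
use warnings;

# Return the code points of a string as uppercase hex, one entry per character.
# (Hypothetical helper, not part of the original thread.)
sub code_points {
    my ($str) = @_;
    return map { sprintf('%04X', ord($_)) } split //, $str;
}

my @two_escapes = code_points("\xA1\xBC");   # two characters: U+00A1 and U+00BC
my @short_form  = code_points("\xA1BC");     # \xA1 followed by the letters "B" and "C"
my @braced_form = code_points("\x{A1BC}");   # one character: U+A1BC

print "\\xA1\\xBC  -> @two_escapes\n";   # 00A1 00BC
print "\\xA1BC    -> @short_form\n";     # 00A1 0042 0043
print "\\x{A1BC}  -> @braced_form\n";    # A1BC
```

Only the braced form yields a single character, which is why it is the only one of the three that can match a one-character string.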
  4. Also note that UTF-8 "\xA1\xBC" is not equivalent to U+A1BC. In fact,
    "\xA1\xBC" is not a valid UTF-8 sequence at all: U+A1BC is
    "\xEA\x86\xBC" in UTF-8, and the character in Ryan's posting was U+25A1
    (WHITE SQUARE), or "\xE2\x96\xA1" in UTF-8.

    hp
     
    Peter J. Holzer, Oct 5, 2009
    #4
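The byte sequences quoted above can be checked with the core Encode module. This is only a sketch: the helper name `utf8_hex` and the sample string are assumptions made for the example, not anything from Ryan's program.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hex dump of the UTF-8 encoding of a single code point.
# (Hypothetical helper, not part of the original thread.)
sub utf8_hex {
    my ($code_point) = @_;
    my $bytes = encode('UTF-8', chr($code_point));
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
}

print 'U+A1BC -> ', utf8_hex(0xA1BC), "\n";   # EA 86 BC
print 'U+25A1 -> ', utf8_hex(0x25A1), "\n";   # E2 96 A1

# Removing WHITE SQUARE by its code point. The sample text is made up.
my $s = "before\x{25A1}after";
$s =~ s/\x{25A1}//g;
print "$s\n";                                  # beforeafter
```

If the character really is U+25A1 and the string has been decoded into characters, matching on `\x{25A1}` is the likely fix for the original question.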
  5. Since it has not been spelled out yet:

    $s contains one character. The regex contains two characters. One
    character never matches two characters.

    Funnily, if you're working in a UTF-8 environment, even a simple \xA1 can
    actually be stored as two *bytes*:
    00000000 a1 c2 a1 |...|
    00000003
     
    Jochen Lehmeier, Oct 5, 2009
    #5
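The character-versus-byte distinction can be reproduced without a hexdump. The sketch below assumes the core Encode module and a made-up helper, `hex_bytes`. It encodes the single character U+00A1 two ways, then checks that a one-character string does not match the two-character pattern from the question.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Lowercase hex dump of a byte string.
# (Hypothetical helper, not part of the original thread.)
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf('%02x', ord($_)) } split //, $bytes;
}

my $char   = "\xA1";                        # one character, U+00A1
my $latin1 = encode('ISO-8859-1', $char);   # one byte
my $utf8   = encode('UTF-8', $char);        # two bytes

printf "characters: %d\n", length($char);        # 1
printf "ISO-8859-1: %s\n", hex_bytes($latin1);   # a1
printf "UTF-8:      %s\n", hex_bytes($utf8);     # c2 a1

# A one-character string cannot match a pattern that requires two characters.
my $one_char = "\x{25A1}";
my $matched  = ($one_char =~ /\xA1\xBC/) ? 'yes' : 'no';
print "matches /\\xA1\\xBC/: $matched\n";        # no
```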
  6. sln

    sln Guest

    I guess scalar data can actually be stored as bytes (0..255) before, say,
    decoding octets into Perl's internal form. The resulting string is either
    all ASCII or a mix with the utf8 flag turned on (character semantics).

    I think this is the base storage strategy for Perl. It speeds things up.
    Encoding just converts it back into octets, turning off the utf8 flag
    (byte semantics). This process is not always symmetrical, and there is
    sometimes more than one encoded representation of the same thing.

    Sort of a bastardized system.

    -sln
     
    sln, Oct 5, 2009
    #6
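The decode/encode cycle and the utf8 flag can be observed directly. The following is a rough sketch, assuming the core Encode module and the built-in `utf8::is_utf8` and `utf8::upgrade` functions. `flag_state` is a made-up helper. The flag describes internal storage only, so the middle section shows two strings that compare equal while carrying different flags.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Report whether Perl's internal utf8 flag is set on a string.
# (Hypothetical helper, not part of the original thread.)
sub flag_state {
    my ($str) = @_;
    return utf8::is_utf8($str) ? 'on' : 'off';
}

my $octets = "\xE2\x96\xA1";             # three bytes: UTF-8 for U+25A1
my $text   = decode('UTF-8', $octets);   # one character, U+25A1
my $again  = encode('UTF-8', $text);     # back to three bytes

printf "octets: length %d, flag %s\n", length($octets), flag_state($octets);
printf "text:   length %d, flag %s\n", length($text),   flag_state($text);
printf "again:  length %d, flag %s\n", length($again),  flag_state($again);

# The same one-character string in two internal representations.
my $narrow = "\xE9";
my $wide   = "\xE9";
utf8::upgrade($wide);
printf "equal: %s, flags: %s / %s\n",
    ($narrow eq $wide ? 'yes' : 'no'), flag_state($narrow), flag_state($wide);

# Malformed input does not survive a decode/encode round trip unchanged,
# because the invalid bytes are replaced during decoding.
my $bad        = "\xA1\xBC";
my $round_trip = encode('UTF-8', decode('UTF-8', $bad));
print "malformed input preserved: ", ($round_trip eq $bad ? 'yes' : 'no'), "\n";
```

For well-formed input the round trip gives back the original octets. The asymmetry mentioned above shows up only with input that is not valid in the chosen encoding.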