Replace Unicode character

Discussion in 'Perl Misc' started by Ryan Chan, Oct 5, 2009.

  1. Ryan Chan

    Ryan Chan Guest

    Hello,

    Below is my code, which is meant to replace the Unicode character "□" with an
    empty string. What is wrong with the code?

    ###################

    use strict;
    use warnings;
    use utf8;

    my $s = "□"; # hex value = A1BC

    $s =~ s/\xA1\xBC//gi;
    print $s;

    ###################


    Thanks.
    Ryan Chan, Oct 5, 2009
    #1

  2. Ryan Chan

    Ryan Chan Guest

    Hello,

    On Oct 5, 11:24 pm, bugbear <bugbear@trim_papermule.co.uk_trim> wrote:
    > Your regexp replaces TWO characters, first one A1, second one BC.
    >
    > Since your target string does not contain either of these
    > characters, nothing happens.
    >
    >   BugBear



    Even if I use

    $s =~ s/\xA1BC//gi;

    the result is the same...

    Thanks anyway
    Ryan Chan, Oct 5, 2009
    #2

  3. Ben Bullock

    Ben Bullock Guest

    On Oct 6, 12:28 am, Ryan Chan <> wrote:

    > $s =~ s/\xA1BC//gi;


    \x{A1BC} works though.

    It's documented in "perldoc perlunicode".

    According to Unicode::UCD this is the character "YI SYLLABLE LIEX".
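
    To make the difference between the three spellings in this thread concrete,
    here is a small sketch that lists the code points each escape produces. The
    helper name code_points is made up for illustration; the escapes themselves
    are the ones quoted above.

```perl
use strict;
use warnings;

# Hypothetical helper: list the code points of a string as U+XXXX labels.
sub code_points {
    my ($string) = @_;
    return map { sprintf 'U+%04X', ord } split //, $string;
}

# Source spelling (single-quoted, so left uninterpreted) next to the
# string Perl actually builds from it (double-quoted).
my @samples = (
    [ '\xA1\xBC', "\xA1\xBC" ],    # two characters: U+00A1, U+00BC
    [ '\xA1BC',   "\xA1BC" ],      # three characters: U+00A1, "B", "C"
    [ '\x{A1BC}', "\x{A1BC}" ],    # one character: U+A1BC
);

for my $sample (@samples) {
    my ($source, $string) = @$sample;
    my @labels = code_points($string);
    printf "%-9s -> %d character(s): %s\n", $source, scalar @labels, "@labels";
}
```

    Without braces, \x takes at most two hex digits, so only the braced form
    can name a code point above U+00FF.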
    Ben Bullock, Oct 5, 2009
    #3
  4. On 2009-10-05 15:36, Ben Bullock <> wrote:
    > On Oct 6, 12:28 am, Ryan Chan <> wrote:
    >> $s =~ s/\xA1BC//gi;

    >
    > \x{A1BC} works though.
    >
    > It's documented in "perldoc perlunicode".
    >
    > According to Unicode::UCD this is the character "YI SYLLABLE LIEX".
    >


    Also note that UTF-8 "\xA1\xBC" is not equivalent to U+A1BC. In fact,
    "\xA1\xBC" is not valid UTF-8 at all. U+A1BC is "\xEA\x86\xBC" in UTF-8,
    and the character in Ryan's posting was U+25A1 (WHITE SQUARE), or
    "\xE2\x96\xA1" in UTF-8.
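
    The byte sequences above can be checked with the core Encode module, and
    the same code point can then be used in the substitution. This is only a
    sketch: the helper names utf8_hex and strip_white_square are invented here,
    and it assumes the string already holds decoded characters (as it does for
    a literal under "use utf8").

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: UTF-8 encoding of a code point as hex bytes.
sub utf8_hex {
    my ($code_point) = @_;
    my $bytes = encode('UTF-8', chr $code_point);
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

print utf8_hex(0xA1BC), "\n";    # EA 86 BC
print utf8_hex(0x25A1), "\n";    # E2 96 A1

# Hypothetical helper: remove every U+25A1 (WHITE SQUARE) from a
# character string. /i is not needed, the character has no case.
sub strip_white_square {
    my ($text) = @_;
    $text =~ s/\x{25A1}//g;
    return $text;
}

print strip_white_square("a\x{25A1}b"), "\n";    # ab
```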

    hp
    Peter J. Holzer, Oct 5, 2009
    #4
  5. On Mon, 05 Oct 2009 17:18:20 +0200, Ryan Chan <>
    wrote:

    > Below is my code, which is meant to replace the Unicode character "□" with an
    > empty string. What is wrong with the code?


    Since it has not been spelled out yet:

    $s contains one character. The regex contains two characters. One
    character never matches two characters.

    Funnily, if you're working in a UTF-8 environment, even a simple \xA1 can
    actually be stored as two *bytes*:

    > perl -e '$s="\xa1"; print $s; binmode STDOUT,":encoding(utf8)"; print
    > $s;' | hexdump -C

    00000000 a1 c2 a1 |...|
    00000003
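
    The same character/byte distinction can be seen without hexdump by
    comparing the length of a string with the length of its UTF-8 encoding. A
    minimal sketch, assuming the core Encode module; the helper name
    char_and_byte_counts is made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: number of characters in a string, and number of
# bytes in its UTF-8 encoding.
sub char_and_byte_counts {
    my ($string) = @_;
    my $bytes = encode('UTF-8', $string);
    return (length($string), length($bytes));
}

printf "U+00A1: %d character, %d bytes\n", char_and_byte_counts("\xA1");
printf "U+25A1: %d character, %d bytes\n", char_and_byte_counts("\x{25A1}");
```

    Both strings are one character long; only the encoded byte counts differ
    (two and three bytes respectively).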
    Jochen Lehmeier, Oct 5, 2009
    #5
  6. Ryan Chan

    Guest

    On Mon, 05 Oct 2009 20:56:39 +0200, "Jochen Lehmeier" <> wrote:

    >On Mon, 05 Oct 2009 17:18:20 +0200, Ryan Chan <>
    >wrote:
    >
    >> Below is my code, which is meant to replace the Unicode character "□" with an
    >> empty string. What is wrong with the code?

    >
    >Since it has not been spelled out yet:
    >
    >$s contains one character. The regex contains two characters. One
    >character never matches two characters.
    >
    >Funnily, if you're working in a UTF-8 environment, even a simple \xA1 can
    >actually be stored as two *bytes*:
    >
    >> perl -e '$s="\xa1"; print $s; binmode STDOUT,":encoding(utf8)"; print
    >> $s;' | hexdump -C

    >00000000 a1 c2 a1 |...|
    >00000003


    I guess scalar data can be stored as plain bytes (0..255) until, say,
    octets are decoded into Perl's internal form. The resulting string is
    either all ASCII or a mix with the utf8 flag turned on (character semantics).

    I think this is the base storage strategy for Perl. It speeds things up.
    Encoding just converts the string back into octets, turning off the utf8 flag
    (byte semantics). This process is not always symmetrical, and there is
    sometimes more than one encoded representation of the same thing.
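
    As a rough illustration of the storage point, the sketch below holds the
    same one-character string in both internal forms, using the always-available
    utf8::upgrade and utf8::is_utf8 functions. The variable names are arbitrary,
    and this only shows the flag; it makes no claim about performance.

```perl
use strict;
use warnings;

my $native   = "\xA1";       # one character, utf8 flag off
my $upgraded = "\xA1";
utf8::upgrade($upgraded);    # same character, now stored internally as UTF-8

print 'native flag:   ', (utf8::is_utf8($native)   ? 'on' : 'off'), "\n";
print 'upgraded flag: ', (utf8::is_utf8($upgraded) ? 'on' : 'off'), "\n";

# The two strings still compare equal and have the same length,
# because both hold the single character U+00A1.
print 'equal:  ', ($native eq $upgraded ? 'yes' : 'no'), "\n";
print 'length: ', length($upgraded), "\n";
```

    The flag describes how the string is stored, not what it contains, which is
    why the comparison still succeeds.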

    Sort of a bastardized system.

    -sln
    , Oct 5, 2009
    #6
