Unicode in regexp

Discussion in 'Perl Misc' started by patari, May 21, 2007.

  1. patari

    patari Guest

    Hi,

    I have some text which has unicode character \u+2013 for example:
    PERFORMANCE – A COMPARATIVE STUDY

    How can I find this character and change it to two - characters for
    LaTeX?

    Somehow next code doesn't work, assuming that $str contains string
    mentioned earlier:

    $str =~ s/\x{2013}/--/g;

    If I save that text in a UTF-8 file and open that file like this
    open(FILE,"<:utf8","text.txt");
    then above regular expression works. How could I get regexp to work
    for text that is not read from a file which is specified to be in
    UTF-8 encoding?
     
    patari, May 21, 2007
    #1

  2. patari

    Guest

    On May 21, 8:09 pm, patari <> wrote:
    > Hi,
    >
    > I have some text which has unicode character \u+2013 for example:
    > PERFORMANCE - A COMPARATIVE STUDY
    >
    > How can I find this character and change it to two - characters for
    > LaTeX?
    >
    > Somehow next code doesn't work, assuming that $str contains string
    > mentioned earlier:
    >
    > $str =~ s/\x{2013}/--/g;
    >
    > If I save that text in a UTF-8 file and open that file like this
    > open(FILE,"<:utf8","text.txt");
    > then above regular expression works. How could I get regexp to work
    > for text that is not read from a file which is specified to be in
    > UTF-8 encoding?



    Hello,

    Save your script in UTF-8 encoding and just use the unicode
    characters, rather than \x{****} form, in the regexp:

    $str =~ s/–/--/g; # First "–" is \x{2013}, not dash.

    Or,

    decode it first, perform substitution, and encode it back:

    use Encode;
    $octets = decode("UTF-8", $str);
    $octets =~ s/\x{2013}/--/g;
    $str = encode("UTF-8", $octets);
     
    , May 21, 2007
    #2

  3. patari

    Mumia W. Guest

    On 05/21/2007 06:09 AM, patari wrote:
    > [...]
    > Somehow next code doesn't work, assuming that $str contains string
    > mentioned earlier:
    >
    > $str =~ s/\x{2013}/--/g;
    >
    > If I save that text in a UTF-8 file and open that file like this
    > open(FILE,"<:utf8","text.txt");
    > then above regular expression works. How could I get regexp to work
    > for text that is not read from a file which is specified to be in
    > UTF-8 encoding?
    >


    Where does the text come from?

    How do you know that u+2013 is in that text?
     
    Mumia W., May 21, 2007
    #3
  4. patari

    patari Guest

    On 21 May, 18:57, "Mumia W." <paduille.4061.mumia.w> wrote:
    > On 05/21/2007 06:09 AM, patari wrote:
    >
    > > [...]
    > > Somehow next code doesn't work, assuming that $str contains string
    > > mentioned earlier:

    >
    > > $str =~ s/\x{2013}/--/g;

    >
    > > If I save that text in a UTF-8 file and open that file like this
    > > open(FILE,"<:utf8","text.txt");
    > > then above regular expression works. How could I get regexp to work
    > > for text that is not read from a file which is specified to be in
    > > UTF-8 encoding?

    >
    > Where does the text come from?
    >
    > How do you know that u+2013 is in that text?



    Text comes originally from user of cgi application, but in this case
    the text is fetched from database. I know that character u+2013
    because the text is viewed with browser where it shows, and I can copy
    that for example to emacs which tells me the code of the character.
     
    patari, May 22, 2007
    #4
  5. patari

    patari Guest

    Hi,

    On 21 May, 15:37, wrote:
    > Hello,
    >
    > Save your script in UTF-8 encoding and just use the unicode
    > characters, rather than \x{****} form, in the regexp:
    >
    > $str =~ s/–/--/g; # First "–" is \x{2013}, not dash.
    >
    > Or,
    >
    > decode it first, perform substitution, and encode it back:
    >
    > use Encode;
    > $octets = decode("UTF-8", $str);
    > $octets =~ s/\x{2013}/--/g;
    > $str = encode("UTF-8", $octets);



    Unfortunately that doesn't work either. It only changes that character
    and characters like ä to some mess of characters. I think that decode
    and encode should be changed.
    Neither does $str =~ s/\x20\x13/--/g; work.

    But thanks to you and Petr Vileta I finally got the solution by
    combining your hints. I first encoded the string
    my $octets = encode("UTF-8",$str);
    and then printed it to Apache's log. The character seemed to be encoded
    \xc2\x96. Using this I could match the regexp and change the
    character.

    Here is the solution if anyone else bumps into similar problems:
    my $octets = encode("UTF-8",$str);
    if ($octets =~ /\xc2\x96/) {
    $octets =~ s/\xc2\x96/--/g;
    }
    $str = decode("UTF-8",$octets);

    I'm still wondering why \x{2013} didn't match after encode. It seems
    that encode also changes that character and in this case codes it as
    \xc2\x96.
     
    patari, May 22, 2007
    #5
  6. On May 21, 12:09 pm, patari <> wrote:
    > I have some text which has unicode character \u+2013 for example:
    > PERFORMANCE - A COMPARATIVE STUDY


    Unicode text is an abstract series of code points.

    When you pass Unicode character data from one place to another (e.g.
    web form to web server, web server to web browser, application to
    database, database to application, file to application, application to
    file...) you need the two ends to agree what encoding is being used to
    serialise the abstract series of code points into a series of bytes.

    Perl has two types of string: Unicode strings and byte strings. Byte
    strings contain bytes or, sometimes, ASCII text. There are various
    rules about what happens if you treat a byte string containing bytes
    in the range 0x80-0xFF as text but I'm not going to go into those here.
    You should ideally explicitly say when you want to convert a byte
    sequence to a Unicode character sequence and specify what encoding you
    are using.

    So, when you want to read your sample text (as a series of bytes from
    an external source) into a Perl Unicode string you need to make sure
    that you tell Perl (somehow) what encoding is being used.
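
    One way to tell Perl the encoding is an I/O layer on the filehandle. The following is only a sketch: the scratch file and the variable names are made up for illustration, and it assumes the external source really is UTF-8.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write the three UTF-8 bytes of U+2013 to a scratch file (raw, no layer).
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
binmode $tmp_fh;
print {$tmp_fh} "\xE2\x80\x93";
close $tmp_fh or die "close: $!";

# Read it back, naming the encoding so Perl decodes bytes into characters.
open my $in, '<:encoding(UTF-8)', $path or die "open: $!";
my $line = <$in>;
close $in;

# One character, not three bytes.
printf "length=%d first=U+%04X\n", length($line), ord($line);
```

    With the layer in place the string holds a single character, U+2013, so a pattern such as \x{2013} can match it.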

    > How can I find this character and change it to two - characters for
    > LaTeX?
    >
    > Somehow next code doesn't work, assuming that $str contains string
    > mentioned earlier:
    >
    > $str =~ s/\x{2013}/--/g;


    The code is right; the assumption is wrong. $str did not contain
    U+2013.

    From evidence elsewhere in this thread I can determine that $str
    either was not a Unicode string at all (in which case it contained
    only bytes - one of which was 0x96) or it was a Unicode string and
    contained U+96.

    Now it just so happens that in Latin1 the byte 0x96 encodes the
    Unicode code point U+96 and in Windows-1250 the byte 0x96 encodes the
    Unicode code point U+2013.

    So I conclude that at some point your Unicode text has been passed
    from one place to another in such a way that the sender thinks it's
    using Windows-1250 encoding and the receiver thinks it's Latin1
    encoding. The effect of this is to transform the printable Unicode
    character 'EN DASH' into the non-printable Unicode control character
    'START OF GUARDED AREA'.

    There is not sufficient evidence presented in this thread to work out
    where this corruption occurred.
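
    The mismatch itself is easy to reproduce. A minimal sketch, assuming nothing about where it happened in this particular application: the same single byte is decoded under the two encodings named above.

```perl
use strict;
use warnings;
use Encode qw(decode);

# One raw byte, as it might arrive from a database or a CGI form.
my $byte = "\x96";

# The same byte decoded under two different assumptions.
my $as_cp1250 = decode('cp1250', $byte);      # what the sender meant
my $as_latin1 = decode('iso-8859-1', $byte);  # what the receiver understood

printf "cp1250:     U+%04X\n", ord($as_cp1250);  # U+2013, EN DASH
printf "iso-8859-1: U+%04X\n", ord($as_latin1);  # U+0096, a control character
```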

    > If I save that text in a UTF-8 file and open that file like this
    > open(FILE,"<:utf8","text.txt");
    > then above regular expression works. How could I get regexp to work
    > for text that is not read from a file which is specified to be in
    > UTF-8 encoding?


    By making sure that you know what encoding is being used by the place
    that you are reading it from and instructing Perl to decode it from
    that encoding into Unicode.
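
    For text that does not come from a file, the decoding can be done explicitly with the Encode module. This is a sketch under a loud assumption: that the bytes coming back from the database are Windows-1250. The sample string and variable names are hypothetical; substitute whatever encoding the source really uses.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical raw bytes as fetched from the database (assumed cp1250).
my $raw = "PERFORMANCE \x96 A COMPARATIVE STUDY";

my $text = decode('cp1250', $raw);  # bytes -> Unicode characters
$text =~ s/\x{2013}/--/g;           # EN DASH -> two hyphens for LaTeX

my $out = encode('UTF-8', $text);   # characters -> bytes for output
print "$out\n";
```

    The substitution runs on the decoded character string, which is the only place where \x{2013} can exist.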
     
    Brian McCauley, May 22, 2007
    #6
  7. On May 22, 8:48 am, patari <> wrote:
    > On 21 May, 18:57, "Mumia W." <paduille.4061.mumia.w> wrote:
    > > On 05/21/2007 06:09 AM, patari wrote:

    >
    > > > [...]
    > > > Somehow next code doesn't work, assuming that $str contains string
    > > > mentioned earlier:

    >
    > > > $str =~ s/\x{2013}/--/g;

    >
    > > > If I save that text in a UTF-8 file and open that file like this
    > > > open(FILE,"<:utf8","text.txt");
    > > > then above regular expression works. How could I get regexp to work
    > > > for text that is not read from a file which is specified to be in
    > > > UTF-8 encoding?

    >
    > > Where does the text come from?

    >
    > > How do you know that u+2013 is in that text?

    >
    > Text comes originally from user of cgi application, but in this case
    > the text is fetched from database. I know that character u+2013
    > because the text is viewed with browser where it shows, and I can copy
    > that for example to emacs which tells me the code of the character.


    That is a bad inference.
     
    Brian McCauley, May 22, 2007
    #7
  8. On May 22, 9:50 am, patari <> wrote:

    > I first encoded the string
    > my $octets = encode("UTF-8",$str);
    > and then printed it to Apache's log. The character seemed to be encoded
    > \xc2\x96.


    Which tells us that the character is U+96.

    > I'm still wondering why \x{2013} didn't match after encode.


    encode() returns a byte string. It contains only bytes. \x{2013} is
    not a byte so it can never exist in a byte string.
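
    A short sketch of both points (variable names are illustrative only): what encode() produces for U+2013 and for U+0096, and the fact that the resulting byte string cannot match \x{2013}.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $dash_octets  = encode('UTF-8', "\x{2013}");  # EN DASH
my $u0096_octets = encode('UTF-8', "\x{96}");    # the control character

# Hex dump of each byte string.
my $dash_hex  = unpack 'H*', $dash_octets;
my $u0096_hex = unpack 'H*', $u0096_octets;

# A byte string holds only values 0x00-0xFF, so this never matches.
my $matches = ($dash_octets =~ /\x{2013}/) ? 'yes' : 'no';

print "U+2013 encodes to $dash_hex\n";
print "U+0096 encodes to $u0096_hex\n";
print "byte string matches \\x{2013}: $matches\n";
```

    The bytes \xc2\x96 seen in the Apache log are the second case, not the first.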

    > It seems
    > that encode also changes that character and in this case codes it as
    > \xc2\x96.


    No, there's no reason to believe that $str ever contained U+2013
     
    Brian McCauley, May 22, 2007
    #8
  9. On May 21, 1:37 pm, wrote:

    > use Encode;
    > $octets = decode("UTF-8", $str);


    Your variable naming is confusing. decode() takes a byte (aka octet)
    string as an argument and returns a string of Unicode characters (not
    a string of bytes).
     
    Brian McCauley, May 22, 2007
    #9
  10. patari

    Guest

    On May 23, 2:24 am, Brian McCauley <> wrote:
    > On May 21, 1:37 pm, wrote:
    >
    > > use Encode;
    > > $octets = decode("UTF-8", $str);

    >
    > Your variable naming is confusing. decode() takes a byte (aka octet)
    > string as an argument and returns a string of Unicode characters (not
    > a string of bytes).


    Oops,

    You are right. I copied that code from "perldoc Encode" but I made the
    mistake and wrote the names the wrong way about. :'(

    Thanks for pointing it out.
     
    , May 22, 2007
    #10
  11. [A complimentary Cc of this posting was NOT [per weedlist] sent to
    Brian McCauley
    <>], who wrote in article <>:

    > Perl has two types of string: Unicode strings and byte strings.


    Perl has only one type of strings. Perl strings consist of
    characters. Characters are small integers (for some value of "small");
    [there is also some cultural baggage associated to the integers, which
    influences some Perl operations, as in ucfirst()].

    Is it too hard to understand?

    Puzzled,
    Ilya
     
    Ilya Zakharevich, Jun 12, 2007
    #11