Unicode in regexp

patari · May 21, 2007

Hi,

I have some text that contains the Unicode character U+2013 (EN DASH),
for example:
PERFORMANCE – A COMPARATIVE STUDY

How can I find this character and change it to two - characters for
LaTeX?

Somehow the following code doesn't work, assuming that $str contains
the string mentioned earlier:

$str =~ s/\x{2013}/--/g;

If I save that text in a UTF-8 file and open that file like this
open(FILE,"<:utf8","text.txt");
then the above regular expression works. How can I get the regexp to
work for text that is not read from a file declared to be UTF-8
encoded?

gypark2 · May 21, 2007

Hello,

Save your script in UTF-8 encoding and just use the unicode
characters, rather than \x{****} form, in the regexp:

$str =~ s/–/--/g; # The first "–" is U+2013 (EN DASH), not a hyphen.

Or,

decode it first, perform substitution, and encode it back:

use Encode;
$octets = decode("UTF-8", $str);
$octets =~ s/\x{2013}/--/g;
$str = encode("UTF-8", $octets);

Mumia W. · May 21, 2007

[...]
Somehow the following code doesn't work, assuming that $str contains
the string mentioned earlier:

$str =~ s/\x{2013}/--/g;

If I save that text in a UTF-8 file and open that file like this
open(FILE,"<:utf8","text.txt");
then the above regular expression works. How can I get the regexp to
work for text that is not read from a file declared to be UTF-8
encoded?

Where does the text come from?

How do you know that U+2013 is in that text?

patari · May 22, 2007

[...]
Somehow the following code doesn't work, assuming that $str contains
the string mentioned earlier:

$str =~ s/\x{2013}/--/g;

If I save that text in a UTF-8 file and open that file like this
open(FILE,"<:utf8","text.txt");
then the above regular expression works. How can I get the regexp to
work for text that is not read from a file declared to be UTF-8
encoded?

Where does the text come from?

How do you know that U+2013 is in that text?

The text originally comes from a user of a CGI application, but in this
case it is fetched from a database. I know the character is U+2013
because it shows up when the text is viewed in a browser, and I can
copy it from there into, for example, Emacs, which tells me the code
of the character.

patari · May 22, 2007

Hi,

Hello,

Save your script in UTF-8 encoding and just use the unicode
characters, rather than \x{****} form, in the regexp:

$str =~ s/–/--/g; # The first "–" is U+2013 (EN DASH), not a hyphen.

Or,

decode it first, perform substitution, and encode it back:

use Encode;
$octets = decode("UTF-8", $str);
$octets =~ s/\x{2013}/--/g;
$str = encode("UTF-8", $octets);

Unfortunately that doesn't work either. It only turns that character
and characters like ä into a mess of characters. I think that decode
and encode should be swapped.
Neither does $str =~ s/\x20\x13/--/g; work.

But thanks to you and Petr Vileta I finally got the solution by
combining your hints. I first encoded the string
my $octets = encode("UTF-8",$str);
and then printed it to Apache's log. The character seemed to be
encoded as \xc2\x96. Using this I could match the regexp and change
the character.

Here is the solution if anyone else bumps into similar problems:
my $octets = encode("UTF-8",$str);
if ($octets =~ /\xc2\x96/) {
    $octets =~ s/\xc2\x96/--/g;
}
$str = decode("UTF-8",$octets);

I'm still wondering why \x{2013} didn't match after encode. It seems
that encode also changes that character and in this case codes it as
\xc2\x96.

Brian McCauley · May 22, 2007

I have some text that contains the Unicode character U+2013 (EN DASH),
for example:
PERFORMANCE – A COMPARATIVE STUDY

Unicode text is an abstract series of code points.

When you pass Unicode character data from one place to another (e.g.
web form to web server, web server to web browser, application to
database, database to application, file to application, application to
file...) you need the two ends to agree what encoding is being used to
serialise the abstract series of code points into a series of bytes.

Perl has two types of string: Unicode strings and byte strings. Byte
strings contain bytes or, sometimes, ASCII text. There are various
rules about what happens if you treat a byte string containing bytes
in the range 0x80-0xFF as text but I'm not going to go into those here.
You should ideally explicitly say when you want to convert a byte
sequence to a Unicode character sequence and specify what encoding you
are using.

So, when you want to read your sample text (as a series of bytes from
an external source) into a Perl Unicode string you need to make sure
that you tell Perl (somehow) what encoding is being used.
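
A minimal sketch of that round trip, assuming (purely for illustration) that the external source delivers UTF-8 bytes. The sample string and the variable names are made up for the example; only decode() and encode() from the Encode module are real API.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: the external source hands us UTF-8 encoded bytes.
# "\xE2\x80\x93" is the UTF-8 encoding of U+2013 (EN DASH).
my $bytes = "PERFORMANCE \xE2\x80\x93 A COMPARATIVE STUDY";

# Bytes -> characters, naming the encoding explicitly.
my $text = decode('UTF-8', $bytes);

# The pattern is now matched against characters, so \x{2013} can match.
(my $latex = $text) =~ s/\x{2013}/--/g;

# Characters -> bytes again before the data leaves the program.
my $out = encode('UTF-8', $latex);
print "$out\n";    # PERFORMANCE -- A COMPARATIVE STUDY
```

If the source actually uses a different encoding, the name passed to decode() has to change accordingly; the substitution itself stays the same.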

How can I find this character and change it to two - characters for
LaTeX?

Somehow the following code doesn't work, assuming that $str contains
the string mentioned earlier:

$str =~ s/\x{2013}/--/g;

The code is right; the assumption is wrong. $str did not contain
U+2013.

From evidence elsewhere in this thread I can determine that $str
either was not a Unicode string at all (in which case it contained
only bytes - one of which was 0x96) or it was a Unicode string and
contained U+96.

Now it just so happens that in Latin1 the byte 0x96 encodes the
Unicode code point U+96 and in Windows-1250 the byte 0x96 encodes the
Unicode code point U+2013.

So I conclude that at some point your Unicode text has been passed
from one place to another in such a way that the sender thinks it's
using Windows-1250 encoding and the receiver thinks it's Latin1
encoding. The effect of this is to transform the printable Unicode
character 'EN DASH' into the non-printable Unicode control character
'START OF GUARDED AREA'.
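
The mismatch is easy to reproduce. The sketch below decodes the same single byte under both encodings; it assumes the byte in question really is 0x96, as the log output elsewhere in this thread suggests, and the variable names are illustrative only.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The single byte in question, as it might arrive from the database.
my $byte = "\x96";

# Read as Windows-1250 (what the sender meant): EN DASH.
my $as_cp1250 = decode('cp1250', $byte);

# Read as Latin-1 (what the receiver assumed): a C1 control character.
my $as_latin1 = decode('iso-8859-1', $byte);

printf "cp1250:     U+%04X\n", ord $as_cp1250;    # U+2013
printf "iso-8859-1: U+%04X\n", ord $as_latin1;    # U+0096
```

Same byte, two different characters; which one ends up in $str depends entirely on which encoding the receiving side assumed.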

There is not sufficient evidence presented in this thread to work out
where this corruption occurred.

If I save that text in a UTF-8 file and open that file like this
open(FILE,"<:utf8","text.txt");
then the above regular expression works. How can I get the regexp to
work for text that is not read from a file declared to be UTF-8
encoded?

By making sure that you know what encoding is being used by the place
that you are reading it from and instructing Perl to decode it from
that encoding into Unicode.
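
For a filehandle, one way to give that instruction is an :encoding layer on open. This is only a sketch: an in-memory filehandle stands in for the real source so the example is self-contained, and UTF-8 is an assumed encoding. For a database handle the equivalent setting depends on the driver and is not shown here.

```perl
use strict;
use warnings;

# Hypothetical input: the raw UTF-8 bytes a file or socket would deliver.
my $raw = "PERFORMANCE \xE2\x80\x93 A COMPARATIVE STUDY\n";

# The :encoding layer decodes bytes into characters as they are read.
open my $in, '<:encoding(UTF-8)', \$raw or die "Cannot open: $!";
my $line = <$in>;
close $in;
chomp $line;

# $line holds characters, so the code point escape matches.
$line =~ s/\x{2013}/--/g;
print "$line\n";    # PERFORMANCE -- A COMPARATIVE STUDY
```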

Brian McCauley · May 22, 2007

On 05/21/2007 06:09 AM, patari wrote:

[...]
Somehow the following code doesn't work, assuming that $str contains
the string mentioned earlier:
$str =~ s/\x{2013}/--/g;
If I save that text in a UTF-8 file and open that file like this
open(FILE,"<:utf8","text.txt");
then the above regular expression works. How can I get the regexp to
work for text that is not read from a file declared to be UTF-8
encoded?

Where does the text come from?

How do you know that U+2013 is in that text?

The text originally comes from a user of a CGI application, but in this
case it is fetched from a database. I know the character is U+2013
because it shows up when the text is viewed in a browser, and I can
copy it from there into, for example, Emacs, which tells me the code
of the character.

That is a bad inference.

Brian McCauley · May 22, 2007

I first encoded the string
my $octets = encode("UTF-8",$str);
and then printed it to Apache's log. The character seemed to be encoded
\xc2\x96.

Which tells us that the character is U+96.

I'm still wondering why \x{2013} didn't match after encode.

encode() returns a byte string. It contains only bytes. \x{2013} is
not a byte so it can never exist in a byte string.
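
To see this concretely, here is a small sketch that hex-dumps what encode() returns for the two characters discussed in this thread. hex_bytes() is a helper made up for the example, not part of Encode.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Illustrative helper: hex-dump the bytes of a byte string.
sub hex_bytes {
    my ($octets) = @_;
    return join ' ', map { sprintf '%02x', ord } split //, $octets;
}

my $u0096 = encode('UTF-8', "\x{96}");      # U+0096, a C1 control
my $u2013 = encode('UTF-8', "\x{2013}");    # U+2013, EN DASH

print hex_bytes($u0096), "\n";    # c2 96
print hex_bytes($u2013), "\n";    # e2 80 93
```

Both results contain only values below 0x100, which is why a pattern for \x{2013} cannot match either of them. A real EN DASH would have shown up in the log as \xe2\x80\x93, not \xc2\x96.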

It seems
that encode also changes that character and in this case codes it as
\xc2\x96.

No, there's no reason to believe that $str ever contained U+2013.

Brian McCauley · May 22, 2007

use Encode;
$octets = decode("UTF-8", $str);

Your variable naming is confusing. decode() takes a byte (aka octet)
string as an argument and returns a string of Unicode characters (not
a string of bytes).

gypark2 · May 22, 2007

Your variable naming is confusing. decode() takes a byte (aka octet)
string as an argument and returns a string of Unicode characters (not
a string of bytes).

Oops,

You are right. I copied that code from "perldoc Encode" but I made a
mistake and wrote the names the wrong way round. :'(

Thanks for pointing it out.

Ilya Zakharevich · Jun 12, 2007

[A complimentary Cc of this posting was NOT [per weedlist] sent to
Brian McCauley]

Perl has two types of string: Unicode strings and byte strings.

Perl has only one type of string. Perl strings consist of
characters. Characters are small integers (for some value of "small");
[there is also some cultural baggage associated with the integers, which
influences some Perl operations, as in ucfirst()].
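
A short sketch of that view: split a string into characters and print the integer behind each one. The two sample strings are made up for the example.

```perl
use strict;
use warnings;

# One string whose characters all fit in a byte, one with a larger value.
my $narrow = "caf\x{E9}";        # ends in U+00E9
my $wide   = "\x{2013}dash";     # starts with U+2013

# Either way, the string is a sequence of integers.
my @narrow_codes = map { ord } split //, $narrow;
my @wide_codes   = map { ord } split //, $wide;

print "@narrow_codes\n";    # 99 97 102 233
print "@wide_codes\n";      # 8211 100 97 115 104
```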

Is it too hard to understand?

Puzzled,
Ilya
