Unicode in regexp

patari

Hi,

I have some text which contains the Unicode character U+2013 (en dash), for example:
PERFORMANCE - A COMPARATIVE STUDY

How can I find this character and change it to two - characters for
LaTeX?

Somehow the following code doesn't work, assuming that $str contains the
string mentioned earlier:

$str =~ s/\x{2013}/--/g;

If I save that text in a UTF-8 file and open that file like this
open(FILE,"<:utf8","text.txt");
then the above regular expression works. How can I get the regexp to
work for text that is not read from a file which is specified to be in
UTF-8 encoding?
 
gypark2

Hello,

Save your script in UTF-8 encoding and just use the unicode
characters, rather than \x{****} form, in the regexp:

$str =~ s/-/--/g; # First "-" is \x{2013}, not dash.

Or,

decode it first, perform substitution, and encode it back:

use Encode;
$octets = decode("UTF-8", $str);
$octets =~ s/\x{2013}/--/g;
$str = encode("UTF-8", $octets);
 
Mumia W.

[...]
Somehow next code doesn't work, assuming that $str contains string
mentioned earlier:

$str =~ s/\x{2013}/--/g;

If I save that text in a UTF-8 file and open that file like this
open(FILE,"<:utf8","text.txt");
then above regular expression works. How could I get regexp to work
for text that is not read from a file which is specified to be in
UTF-8 encoding?

Where does the text come from?

How do you know that u+2013 is in that text?
 
patari

[...]
Somehow next code doesn't work, assuming that $str contains string
mentioned earlier:
$str =~ s/\x{2013}/--/g;
If I save that text in a UTF-8 file and open that file like this
open(FILE,"<:utf8","text.txt");
then above regular expression works. How could I get regexp to work
for text that is not read from a file which is specified to be in
UTF-8 encoding?

Where does the text come from?

How do you know that u+2013 is in that text?


The text originally comes from a user of a CGI application, but in this
case it is fetched from a database. I believe the character is U+2013
because it shows when the text is viewed in a browser, and I can copy
it to, for example, Emacs, which tells me the code of the character.
 
patari

Hi,

Hello,

Save your script in UTF-8 encoding and just use the unicode
characters, rather than \x{****} form, in the regexp:

$str =~ s/-/--/g; # First "-" is \x{2013}, not dash.

Or,

decode it first, perform substitution, and encode it back:

use Encode;
$octets = decode("UTF-8", $str);
$octets =~ s/\x{2013}/--/g;
$str = encode("UTF-8", $octets);


Unfortunately that doesn't work either. It only changes that character
and characters like ä to some mess of characters. I think that decode
and encode should be changed.
Neither does $str =~ s/\x20\x13/--/g; work.

But thanks to you and Petr Vileta I finally got the solution by
combining your hints. I first encoded the string
my $octets = encode("UTF-8",$str);
and then printed it to Apache's log. The character seemed to be encoded
as \xc2\x96. Using this I could match the regexp and change the
character.

Here is the solution if anyone else bumps into similar problems:
my $octets = encode("UTF-8",$str);
if ($octets =~ /\xc2\x96/) {
$octets =~ s/\xc2\x96/--/g;
}
$str = decode("UTF-8",$octets);

I'm still wondering why \x{2013} didn't match after encode. It seems
that encode also changes that character and in this case codes it as
\xc2\x96.
 
Brian McCauley

I have some text which has unicode character \u+2013 for example:
PERFORMANCE - A COMPARATIVE STUDY

Unicode text is an abstract series of code points.

When you pass Unicode character data from one place to another (e.g.
web form to web server, web server to web browser, application to
database, database to application, file to application, application to
file...) you need the two ends to agree what encoding is being used to
serialise the abstract series of code points into a series of bytes.

Perl has two types of string: Unicode strings and byte strings. Byte
strings contain bytes or, sometimes, ASCII text. There are various
rules about what happens if you treat a byte string containing bytes
in the range 0x80-0xFF as text but I'm not going to go into those here.
You should ideally explicitly say when you want to convert a byte
sequence to a Unicode character sequence and specify what encoding you
are using.

So, when you want to read your sample text (as a series of bytes from
an external source) into a Perl Unicode string you need to make sure
that you tell Perl (somehow) what encoding is being used.
How can I find this character and change it to two - characters for
LaTeX?

Somehow next code doesn't work, assuming that $str contains string
mentioned earlier:

$str =~ s/\x{2013}/--/g;

The code is right; the assumption is wrong. $str did not contain
U+2013.
From evidence elsewhere in this thread I can determine that $str
either was not a Unicode string at all (in which case it contained
only bytes - one of which was 0x96) or it was a Unicode string and
contained U+96.

Now it just so happens that in Latin1 the byte 0x96 encodes the
Unicode code point U+96 and in Windows-1250 the byte 0x96 encodes the
Unicode code point U+2013.

So I conclude that at some point your Unicode text has been passed
from one place to another in such a way that the sender thinks it's
using Windows-1250 encoding and the receiver thinks it's Latin1
encoding. The effect of this is to transform the printable Unicode
character 'EN DASH' into the non-printable Unicode control character
'START OF GUARDED AREA'.
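
The mismatch described above can be reproduced in a few lines. This is only a sketch: the single byte 0x96 stands in for whatever was really transmitted, and the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical input: the one byte a Windows-1250 sender would use for an en dash.
my $byte = "\x96";

# What the sender meant versus what a Latin1 receiver ends up with.
my $intended = decode("cp1250",     $byte);
my $received = decode("iso-8859-1", $byte);

printf "as cp1250: U+%04X\n", ord($intended);   # U+2013
printf "as Latin1: U+%04X\n", ord($received);   # U+0096

# Encoding the wrongly decoded character as UTF-8 gives the two bytes
# that turned up in the Apache log.
my $utf8 = encode("UTF-8", $received);
my $hex  = join " ", map { sprintf "%02x", ord($_) } split //, $utf8;
print "UTF-8 bytes: $hex\n";                    # c2 96
```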

There is not sufficient evidence presented in this thread to work out
where this corruption occurred.
If I save that text in a UTF-8 file and open that file like this
open(FILE,"<:utf8","text.txt");
then above regular expression works. How could I get regexp to work
for text that is not read from a file which is specified to be in
UTF-8 encoding?

By making sure that you know what encoding is being used by the place
that you are reading it from and instructing Perl to decode it from
that encoding into Unicode.
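
As a rough sketch of that advice: if the source turned out to be Windows-1250 (an assumption here, since the thread never establishes the real encoding), decoding explicitly would let the original substitution work unchanged. $raw is a made-up stand-in for the bytes fetched from the database.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical raw bytes, assumed to be Windows-1250; 0x96 is the en dash there.
my $raw = "PERFORMANCE \x96 A COMPARATIVE STUDY";

# Bytes -> characters, naming the encoding explicitly.
my $str = decode("cp1250", $raw);

# The substitution from the original post now finds U+2013.
$str =~ s/\x{2013}/--/g;

# Characters -> bytes again before writing the result out.
my $out = encode("UTF-8", $str);
print "$out\n";   # PERFORMANCE -- A COMPARATIVE STUDY
```

If the real source uses a different encoding, only the name passed to decode() should need to change.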
 
Brian McCauley

On 05/21/2007 06:09 AM, patari wrote:
[...]
Somehow next code doesn't work, assuming that $str contains string
mentioned earlier:
$str =~ s/\x{2013}/--/g;
If I save that text in a UTF-8 file and open that file like this
open(FILE,"<:utf8","text.txt");
then above regular expression works. How could I get regexp to work
for text that is not read from a file which is specified to be in
UTF-8 encoding?
Where does the text come from?
How do you know that u+2013 is in that text?

Text comes originally from user of cgi application, but in this case
the text is fetched from database. I know that character u+2013
because the text is viewed with browser where it shows, and I can copy
that for example to emacs which tells me the code of the character.

That is a bad inference.
 
Brian McCauley

I first encoded the string
my $octets = encode("UTF-8",$str);
and then printed it to Apaches log. The character seemed to be encoded
\xc2\x96.

Which tells us that the character is U+96.
I'm still wondering why \x{2013} didn't match after encode.

encode() returns a byte string. It contains only bytes. \x{2013} is
not a byte so it can never exist in a byte string.
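
A small sketch of that point, using a string that really does contain U+2013 (unlike the $str in this thread): after encode() the character is gone and only its three UTF-8 bytes remain, so the pattern has to be written in bytes to match.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $chars  = "\x{2013}";               # one character: EN DASH
my $octets = encode("UTF-8", $chars);  # byte string: E2 80 93

print length($chars),  "\n";           # 1
print length($octets), "\n";           # 3

# No element of the byte string has the value 0x2013, so this cannot match.
my $char_match = ($octets =~ /\x{2013}/)     ? 1 : 0;
# Matching the encoded form byte by byte does work.
my $byte_match = ($octets =~ /\xE2\x80\x93/) ? 1 : 0;

print "char pattern: $char_match, byte pattern: $byte_match\n";   # 0, 1
```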
It seems
that encode also changes that character and in this case codes it as
\xc2\x96.

No, there's no reason to believe that $str ever contained U+2013.
 
Brian McCauley

use Encode;
$octets = decode("UTF-8", $str);

Your variable naming is confusing. decode() takes a byte (aka octet)
string as an argument and returns a string of Unicode characters (not
a string of bytes).
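
For illustration, the same round trip with the names the right way round might look like this. The sample bytes are invented and assumed to be valid UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical UTF-8 input: "caf", e-acute (C3 A9), space, en dash (E2 80 93), " bar".
my $octets = "caf\xC3\xA9 \xE2\x80\x93 bar";

my $string = decode("UTF-8", $octets);   # octets in, characters out
$string =~ s/\x{2013}/--/g;              # work on characters
$octets = encode("UTF-8", $string);      # characters in, octets out
```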
 
gypark2

Your variable naming is confusing. decode() takes a byte (aka octet)
string as an argument and returns a string of Unicode characters (not
a string of bytes).

Oops,

You are right. I copied that code from "perldoc Encode" but I made the
mistake and wrote the names the wrong way about. :'(

Thanks for pointing it out.
 
Ilya Zakharevich

[A complimentary Cc of this posting was NOT [per weedlist] sent to
Brian McCauley]
Perl has two types of string: Unicode strings and byte strings.

Perl has only one type of strings. Perl strings consist of
characters. Characters are small integers (for some value of "small");
[there is also some cultural baggage associated to the integers, which
influences some Perl operations, as in ucfirst()].
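
One way to see the "characters are integers" view is to list the integer behind each character with ord(). This is only a sketch with a made-up three-character string.

```perl
use strict;
use warnings;

# 'A', the control character U+0096 and the en dash U+2013 in one string.
my $s = "A\x{96}\x{2013}";

# Each character is just an integer as far as ord() is concerned.
my @codes = map { sprintf "U+%04X", ord($_) } split //, $s;
print "@codes\n";   # U+0041 U+0096 U+2013
```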

Is it too hard to understand?

Puzzled,
Ilya
 
