I have some text which contains the Unicode character U+2013, for example:
PERFORMANCE – A COMPARATIVE STUDY
Unicode text is an abstract series of code points.
When you pass Unicode character data from one place to another (e.g.
web form to web server, web server to web browser, application to
database, database to application, file to application, application to
file...) you need the two ends to agree what encoding is being used to
serialise the abstract series of code points into a series of bytes.
Perl has two types of string: Unicode strings and byte strings. Byte
strings contain bytes or, sometimes, ASCII text. There are various
rules about what happens if you treat a byte string containing bytes
in the range 0x80-0xFF as text, but I'm not going to go into those here.
You should ideally explicitly say when you want to convert a byte
sequence to a Unicode character sequence and specify what encoding you
are using.
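As a minimal sketch of what "explicitly say" can look like, here is the
core Encode module being used to go from bytes to characters and back.
The sample bytes and the variable names are mine, chosen purely for
illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# A byte string: the UTF-8 serialisation of "A", EN DASH (U+2013), "B".
my $bytes = "A\xE2\x80\x93B";

# Explicitly decode the bytes into a Unicode character string,
# naming the encoding rather than leaving Perl to guess.
my $chars = decode('UTF-8', $bytes);

# Five bytes on the wire, three characters in the string.
printf "bytes: %d, characters: %d\n", length($bytes), length($chars);

# Going the other way: encode before handing the text to the outside world.
my $out = encode('UTF-8', $chars);
```

The point is only that the encoding name appears in the code at the
boundary, so nothing is left to defaults.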
So, when you want to read your sample text (as a series of bytes from
an external source) into a Perl Unicode string you need to make sure
that you tell Perl (somehow) what encoding is being used.
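One way to tell Perl is an encoding layer on open(). The sketch below
assumes, purely for the sake of the example, that the external source
is a file in Windows-1250; it writes a small temporary file first so
that it can be run as is. The file contents and variable names are
hypothetical.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a sample file of raw Windows-1250 bytes (0x96 is EN DASH there).
my ($tmp, $path) = tempfile(UNLINK => 1);
binmode $tmp;
print {$tmp} "PERFORMANCE \x96 A COMPARATIVE STUDY\n";
close $tmp or die "close: $!";

# Read it back, naming the encoding in the open() layer so that
# Perl decodes the bytes into Unicode characters as they are read.
open my $in, '<:encoding(cp1250)', $path or die "open $path: $!";
my $line = <$in>;
close $in;
chomp $line;

# The character after "PERFORMANCE " is now a real EN DASH.
printf "U+%04X\n", ord(substr($line, 12, 1));
```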
How can I find this character and change it to two - characters for
LaTeX?
Somehow the following code doesn't work, assuming that $str contains
the string mentioned earlier:
$str =~ s/\x{2013}/--/g;
The code is right; the assumption is wrong. $str did not contain
U+2013.
From evidence elsewhere in this thread I can determine that $str
either was not a Unicode string at all (in which case it contained
only bytes - one of which was 0x96) or it was a Unicode string and
contained U+0096.
Now it just so happens that in Latin1 the byte 0x96 encodes the
Unicode code point U+0096 and in Windows-1250 the byte 0x96 encodes the
Unicode code point U+2013.
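This is easy to check with the Encode module. A small sketch (the
variable names are mine) that decodes the same single byte under both
encodings:

```perl
use strict;
use warnings;
use Encode qw(decode);

# The one byte in question.
my $byte = "\x96";

# Decode the same byte under the two encodings.
my $as_cp1250 = decode('cp1250', $byte);
my $as_latin1 = decode('iso-8859-1', $byte);

printf "cp1250: U+%04X\n", ord($as_cp1250);   # prints "cp1250: U+2013"
printf "latin1: U+%04X\n", ord($as_latin1);   # prints "latin1: U+0096"
```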
So I conclude that at some point your Unicode text has been passed
from one place to another in such a way that the sender thinks it's
using Windows-1250 encoding and the receiver thinks it's Latin1
encoding. The effect of this is to transform the printable Unicode
character 'EN DASH' into the non-printable Unicode control character
'START OF GUARDED AREA'.
There is not sufficient evidence presented in this thread to work out
where this corruption occurred.
If I save that text in a UTF-8 file and open that file like this
open(FILE,"<:utf8","text.txt");
then the above regular expression works. How can I get the regexp to work
for text that is not read from a file which is specified to be in
UTF-8 encoding?
By making sure that you know what encoding is being used by the place
that you are reading it from and instructing Perl to decode it from
that encoding into Unicode.
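For text that arrives as a byte string rather than through a file
handle, a rough sketch might look like this. It assumes the source
really is Windows-1250, which is only my guess from the 0x96 byte; $raw
and $latex are names made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical input: raw bytes from a source assumed to use Windows-1250.
my $raw = "PERFORMANCE \x96 A COMPARATIVE STUDY";

# Decode first, so that $str holds Unicode characters and not bytes.
my $str = decode('cp1250', $raw);

# Now the original substitution matches: EN DASH becomes "--" for LaTeX.
(my $latex = $str) =~ s/\x{2013}/--/g;

print "$latex\n";   # prints "PERFORMANCE -- A COMPARATIVE STUDY"
```

If the source turns out to use a different encoding, only the name
passed to decode() needs to change.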