Replacing unicode characters

Erik Sandblom

Hi

I'm trying to replace double quotation marks in a UTF-8 document:

$string =~ s#\x{201D}#”#g;

My script is in latin-1. Otherwise I would just try putting in the
characters literally.

I'm using Perl 5.8. Replacing windows characters works beautifully:
$string =~ s/\x93/“/g;

I thought using the character codes \x{201D}, meant you don't have to worry
about telling Perl what character encoding is being used. Is this not true?

Erik Sandblom
 
Ben Morrow

Erik Sandblom said:
I'm trying to replace double quotation marks in a UTF-8 document:

$string =~ s#\x{201D}#”#g;

How have you read in $string? If the file is UTF8, you need to tell
Perl so, or it will assume Latin1:

open my $FH, '<:utf8', $filename or die...;

or, better

open my $FH, '<:encoding(utf8)', $filename or die...;

as this will be more resilient if the file isn't actually utf8.
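A minimal sketch of the approach above, assuming a throwaway sample file created with File::Temp (the file contents and the variable names are illustrative, not taken from Erik's script). Once the :encoding layer has decoded the input, \x{201D} matches the actual character:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small UTF-8 sample file (hypothetical content) to read back.
my ($tmp_fh, $filename) = tempfile();
binmode $tmp_fh, ':raw';
print {$tmp_fh} "He said \xE2\x80\x9CHi\xE2\x80\x9D\n";    # UTF-8 bytes for U+201C and U+201D
close $tmp_fh or die "Cannot close $filename: $!";

# Decode on input, so that $string holds characters rather than bytes.
open my $FH, '<:encoding(utf8)', $filename or die "Cannot open $filename: $!";
my $string = do { local $/; <$FH> };
close $FH;

# The code point escapes now match the decoded curly quotes.
(my $replaced = $string) =~ s/[\x{201C}\x{201D}]/"/g;
print $replaced;
```

Without the layer, $string would hold the three raw bytes of each quote and the pattern would not match.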

Ben
 
Erik Sandblom

In article [email protected], Ben Morrow at
(e-mail address removed) wrote on 04-02-12 19.26:
How have you read in $string? If the file is UTF8, you need to tell
Perl so, or it will assume Latin1:


Really? The system I'm on is RedHat 8, and I understand they have some
default variable somewhere saying that "everything is UTF-8 unless otherwise
specified". And Perl then follows that. But I'm not sure.

open my $FH, '<:encoding(utf8)', $filename or die...;


Thank you, this solved my problem. I also had to unset "use bytes;" by
putting in "no bytes;". Apparently "use bytes" makes Perl treat characters
as being two-digit rather than four-digit. Wrong terminology I'm sure, but
it may help someone else in my position. I had previously set "use bytes" to
be able to use accented characters in good-old latin-1.

Thanks again.
 
Ben Morrow

Erik Sandblom said:
In article [email protected], Ben Morrow at
(e-mail address removed) wrote on 04-02-12 19.26:

Really? The system I'm on is RedHat 8, and I understand they have some
default variable somewhere saying that "everything is UTF-8 unless otherwise
specified". And Perl then follows that. But I'm not sure.

If you are using 5.8.0 and have your LC_ALL environment var set to
something with UTF8 in, perl will push a :utf8 onto all filehandles by
default. This behaviour was disabled in 5.8.1, as it caused *lots* of
compatibility problems.

Thank you, this solved my problem. I also had to unset "use bytes;"
by putting in "no bytes;". Apparently "use bytes" makes Perl treat
characters as being two-digit rather than four-digit. Wrong
terminology I'm sure,

Yes... are the digits you are referring to hex digits, so you actually
mean 8-bit (eg. \x12) rather than 16-bit (eg. \x{1234})? In this case,
you are under a misapprehension: the more recent versions of Unicode
are in fact 21-bit character encodings, not 16-bit: that is, \x{12345}
is a valid Unicode character number (currently not assigned a
character).

but it may help someone else in my position. I had previously set
"use bytes" to be able to use accented characters in good-old
latin-1.

You shouldn't need to do this: if you're mixing character sets, I'd
strongly recommend you convert everything to Perl's internal Unicode
using Encode. If you want latin1 literals in your Perl source, put
use encoding 'latin1';
at the top; and don't try to mix encodings (ie. have both latin1 and
utf8 literals) in one source file.

Ben
 
Tulan W. Hu

Ben Morrow said:
If you are using 5.8.0 and have your LC_ALL environment var set to
something with UTF8 in, perl will push a :utf8 onto all filehandles by
default. This behaviour was disabled in 5.8.1, as it caused *lots* of
compatibility problems.

Ben,

How about perl 5.8.2?
I have a UTF-8 file, but I just use a regular open to read it
and print the strings out after reading them. It seems OK.
I use Unicode::String to convert the lines to Latin-1.
The following code seems to work OK.

use File::Slurp;
use Unicode::String qw(utf8 latin1);

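# Read all lines of the UTF-8 file as plain bytes.
my @l2 = read_file('filename');
foreach my $nline (@l2) {
    # Wrap the UTF-8 bytes in a Unicode::String object.
    my $l = utf8($nline);
    print "$l";          # default stringification (UTF-8)
    print $l->latin1;    # the same line converted to Latin-1
}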

Do you see any problem with this?
 
Alan J. Flavell

Really? The system I'm on is RedHat 8, and I understand they have some
default variable somewhere saying that "everything is UTF-8 unless otherwise
specified".

Sort of. The key is that the default locale has UTF-8 in it.

And Perl then follows that.

Specifically, 5.8.0 follows that. But it confused too many people, as
Ben Morrow has already mentioned.

Thank you, this solved my problem. I also had to unset "use bytes;" by
putting in "no bytes;".

Who put "use bytes" in there in the first place? IMHO it's offered as
a quick fix for those who had made a tacit assumption in their coding
(roughly speaking, that character data could be handled identically to
binary data, without giving any thought to the difference. The old
unix-hardened Perl hackers used to be very bad about that, but, with
Perl's increasing claim to be platform-portable, that stance no longer
held water, if I could mix a metaphor).

Apparently "use bytes" makes Perl treat characters
as being two-digit rather than four-digit.

There's something in what you say, though I don't think I'd quite have
put it like that...

Wrong terminology I'm sure,

I can only confirm your assumption! (SCNR ;)

I had previously set "use bytes" to
be able to use accented characters in good-old latin-1.

That's the kind of situation where I'd gripe about it being the wrong
solution, even if - in the limited circumstances you needed it - it
gave the impression of doing the right thing.

If you're going to be processing text (as opposed to binary data),
then I think in the long term it will pay off to be honest with Perl
(>= 5.8) about that, and tell it frankly what coding is involved.

By the way, don't confuse the processing of character data on
input/output streams with how Perl deals with characters that are
specified within the source code. They're two different topics, and
need to be grasped accordingly.

The unicode introduction and spec in the Perl documentation is pretty
good, although it's rather silent about a few areas where the
implementation falls short of what the documentation might lead one to
expect (previous discussions here will show some detail about that).
But for the most part I've found it does what it says it does: the key
part is to approach the documentation with a fairly open mind, rather
than assuming that it's sure to be more or less what one had expected.
OK, fair enough, I don't know what it was that *you* expected, but
I've met several people who thought it was obvious and didn't bother
to RTFM, and then were astonished that they could make no sense of
what seemed to be happening.

good luck
 
Erik Sandblom

In article [email protected], Ben Morrow at
(e-mail address removed) wrote on 04-02-12 21.45:
Yes... are the digits you are referring to hex digits, so you actually
mean 8-bit (eg. \x12) rather than 16-bit (eg. \x{1234})?


Yes, that's right. I'm finally getting the hang of hexadecimal and I've
deduced 16-bit comes from that 2 to the power of four is 16. But what does
that really mean, considering each "digit", as I still call them, can have
16 different numbers, and not 2? That would be 16 to the power of four which
is a large number, about 66 000 unless I'm mistaken.

In this case,
you are under a misapprehension: the more recent versions of Unicode
are in fact 21-bit character encodings, not 16-bit: that is, \x{12345}
is a valid Unicode character number (currently not assigned a
character).


Oh my goodness, that's a lot of characters. Why doesn't everyone just learn
English? ;-)

You shouldn't need to do this: if you're mixing character sets, I'd
strongly recommend you convert everything to Perl's internal Unicode
using Encode. If you want latin1 literals in your Perl source, put
use encoding 'latin1';
at the top; and don't try to mix encodings (ie. have both latin1 and
utf8 literals) in one source file.


Well, what I've done is used latin-1 literals and saved the file in latin-1
encoding. Then I have used utf8 codes like \x{201D} to represent utf8
characters. I've written "use bytes" at the top of my perl script. Forgive
my ignorance but how would it behave differently with "use encoding latin1"
at the top?

Thanks for all your help,
Erik Sandblom
 
Ben Morrow

Erik Sandblom said:
In article [email protected], Ben Morrow at
(e-mail address removed) wrote on 04-02-12 21.45:

Yes, that's right. I'm finally getting the hang of hexadecimal and I've
deduced 16-bit comes from that 2 to the power of four is 16. But what does
that really mean, considering each "digit", as I still call them, can have
16 different numbers, and not 2? That would be 16 to the power of four which
is a large number, about 66 000 unless I'm mistaken.

:) Not quite. 'Bits' refer to the binary representation (base 2, as
hex is base 16) of a number. A 2-digit hex number, say 0x82, can also
be written as an 8-digit binary number (an 8-bit number: 'bit' is
short for 'binary digit'): 0b1000_0010. The 0x here indicates hex, and
the 0b binary; the _s are just put in to make the number easier to
read.

Hexadecimal has 16 different digits, binary but 2; and as you say, 2^4
= 16, so each hex digit represents 4 binary digits. Thus a four-digit
hex number is a 4*4 = 16-bit binary number: as you say, there are
65536 of them.

You can get Perl to print out the decimal, hex and binary
representations of a number using sprintf with the %d, %x and %b
formats.
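A short sketch of that sprintf usage, applied to the 0x82 example above (the variable names are only illustrative):

```perl
use strict;
use warnings;

my $n = 0x82;    # the two-digit hex example from above

# %d, %x and %b give the decimal, hex and binary forms of the same number.
my $forms = sprintf '%d %x %b', $n, $n, $n;
print "$forms\n";    # prints "130 82 10000010"

# Four hex digits cover 16 ** 4 values, which is the same as 2 ** 16.
my $four_digit_values = 16 ** 4;
print "$four_digit_values\n";    # prints "65536"
```

The binary form has eight digits for two hex digits, which is the "4 bits per hex digit" rule in action.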
Oh my goodness, that's a lot of characters. Why doesn't everyone just learn
English? ;-)

It is indeed a lot... most of them are unused at present, but they had
just too many with all the Chinese-Japanese-Korean ideograms and all
the Arabic ligatures to fit into 16 bits.

Well, what I've done is used latin-1 literals and saved the file in latin-1
encoding. Then I have used utf8 codes like \x{201D} to represent utf8
characters. I've written "use bytes" at the top of my perl script. Forgive
my ignorance but how would it behave differently with "use encoding latin1"
at the top?

'use bytes' disables Perl's Unicode support, and makes it treat all
strings as sequences of 8-bit bytes. When 'use bytes' is not in
effect, strings can be thought of as sequences of 21-bit numbers (in
fact, the representation is more compact than that, which occasionally
'leaks through' when things go wrong).

Under 'use bytes', you are declaring that your data is 'binary' as
opposed to 'textual'. The fact that if you treat it as textual Perl
will pretend it's Latin1 is for backwards compatibility only: I would
say that 'strictly' speaking Perl ought to give an error if you try
and use characters outside of ASCII (but then, Perl didn't get where
it is today by being strict about things :). In fact, under 'use
bytes', even if you state some data is textual by pushing an :encoding
layer onto the filehandle, Perl will still treat the data as 8-bit
bytes; which is one of the ways the underlying representation can
'leak through' as I mentioned above.

'use encoding 'latin1'' *just* declares that your source file is in
Latin1. It doesn't affect how Perl views your data at all: data that
comes from a filehandle marked with :raw will be considered to be
'binary', ie. a sequence of 8-bit bytes; and data which comes from a
filehandle marked with :encoding will be considered to be 'textual',
i.e. a sequence of 21-bit Unicode codepoints.
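As a rough illustration of the bytes-versus-characters distinction, here is a small sketch that uses Encode's decode() on a hard-coded byte string instead of a filehandle layer (that substitution, and all variable names, are assumptions made for the example):

```perl
use strict;
use warnings;
use Encode qw(decode);

# The three UTF-8 bytes that encode U+201D (right double quotation mark).
my $octets = "\xE2\x80\x9D";

# Decoding turns those bytes into a single character.
my $text = decode('UTF-8', $octets);

my $octet_count = length $octets;    # 3 bytes
my $char_count  = length $text;      # 1 character

# Inside a 'use bytes' block, length() counts the internal bytes instead.
my $under_bytes;
{
    use bytes;
    $under_bytes = length $text;     # 3
}

print "$octet_count $char_count $under_bytes\n";    # prints "3 1 3"
```

The last number is the underlying representation "leaking through": the string is one character, but 'use bytes' exposes the three bytes Perl stores for it.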

This is all a little confusing: you may need to think about it a bit
before it sinks in. I know I did... :)

References: perldoc perluniintro, perldoc perlunicode, unicode.org,
perldoc PerlIO, perldoc PerlIO::encoding.

Ben
 
Ben Morrow

Tulan W. Hu said:
How about perl 5.8.2?

5.8.1+ no longer does this *by default*. You can still make the
filehandle UTF8ish manually, or give perl the -C switch to make it
honour the environment vars.

I have a UTF-8 file, but I just use a regular open to read it
and print the strings out after reading them. It seems OK.
I use Unicode::String to convert the lines to Latin-1.
The following code seems to work OK.

use File::Slurp;
use Unicode::String qw(utf8 latin1);

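# Read all lines of the UTF-8 file as plain bytes.
my @l2 = read_file('filename');
foreach my $nline (@l2) {
    # Wrap the UTF-8 bytes in a Unicode::String object.
    my $l = utf8($nline);
    print "$l";          # default stringification (UTF-8)
    print $l->latin1;    # the same line converted to Latin-1
}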

Do you see any problem with this?

None at all if you're happy with it. This is how one would have done
it pre-5.8. The point of 5.8 is that it can now be done like this
instead:

# state that our file is in UTF8
open my $FILE, '<:encoding(utf8)', 'filename' or die...;

# state that we want output in latin1
binmode STDOUT, ':encoding(latin1)';

# now just copy it all across
print while <$FILE>;

which is simpler.

Ben
 
Tulan W. Hu

Ben Morrow said:
None at all if you're happy with it. This is how one would have done
it pre-5.8. The point of 5.8 is that it can now be done like this
instead:

# state that our file is in UTF8
open my $FILE, '<:encoding(utf8)', 'filename' or die...;

# state that we want output in latin1
binmode STDOUT, ':encoding(latin1)';

# now just copy it all across
print while <$FILE>;

which is simpler.

I tried the above and got the following error message
"\x{2019}" does not map to iso-8859-1 at utf.pl line 8, <$FILE> line 161.
but the pre-5.8 code just removes the characters for me.
In my case, I want it to just remove the character instead of giving me an error,
since other programs cannot handle Unicode yet.
 
Alan J. Flavell

I tried the above and got the following error message
"\x{2019}" does not map to iso-8859-1 at utf.pl line 8, <$FILE> line 161.

That's a correct statement of fact, isn't it?

but the pre-5.8 code just removes the characters for me.

Why is it advantageous to hide away an error? It's not as if you
can't hide it for yourself, if you know in advance that hiding is what
you want; but if you haven't decided in advance, then surely an error
report is preferable to unannounced loss of data?

In my case, I want it to just remove the char

Shouldn't be too hard to program, no?

since other programs cannot handle unicode yet.

Maybe it would be more constructive to down-convert it into some kind
of ASCII or iso-8859-1 surrogate, though.
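One possible way to do the removal explicitly, sketched under the assumption that the text has already been decoded into characters (the sample string and variable names are made up for illustration):

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical decoded input containing U+2019, which Latin-1 lacks.
my $text = "It\x{2019}s caf\x{E9} time";

# Drop every character with no Latin-1 equivalent (code points above 0xFF).
(my $latin1_safe = $text) =~ s/[^\x{00}-\x{FF}]//g;

# Everything left maps to iso-8859-1, so nothing unmappable remains.
my $octets = encode('iso-8859-1', $latin1_safe);
```

Here the data loss is at least a deliberate, visible step in the program rather than a side effect of the conversion.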
 
Bart Lateur

Tulan said:
I tried the above and got the following error message
"\x{2019}" does not map to iso-8859-1 at utf.pl line 8, <$FILE> line 161.
but the pre-5.8 code just removes the characters for me.

Don't use Latin-1 for the encoding, try cp1252 (AKA Windows) instead.
There, \x{2019} turns out to be chr(0x92) ("right single quotation mark"). For the
whole list, see


<http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT>

If you don't want the Windows character set, I'd replace all "single
quotation marks" with apostrophes, ("'", chr(39)), and all "double
quotation marks" with quotes ('"', chr(34)).
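Both suggestions can be sketched with Encode and tr///; the sample text and variable names below are illustrative assumptions:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical decoded text with curly double and single quotes.
my $text = "\x{201C}It\x{2019}s fine\x{201D}";

# Option 1: encode as cp1252, where the curly quotes have their own bytes.
my $cp1252 = encode('cp1252', $text);    # 0x93 ... 0x92 ... 0x94

# Option 2: map curly single and double quotes to plain ASCII ones.
(my $ascii = $text) =~ tr/\x{2018}\x{2019}\x{201C}\x{201D}/''""/;
print "$ascii\n";    # prints "It's fine" wrapped in plain double quotes
```

After option 2 the string contains only ASCII, so it can be written out as Latin-1 without any "does not map" error.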
 
Tulan W. Hu

Bart Lateur said:
Don't use Latin-1 for the encoding, try cp1252 (AKA Windows) instead.
There, \x{2019} turns out to be chr(0x92) ("right single quotation mark"). For the
whole list, see

<http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT>

If you don't want the Windows character set, I'd replace all "single
quotation marks" with apostrophes, ("'", chr(39)), and all "double
quotation marks" with quotes ('"', chr(34)).

You are right. I don't want the Windows-only character set.
I have no control over the input files because I get them from a vendor.
I have to convert those UTF-8 files to Latin-1 for other programs.
Is 'iso-8859-1' the same as 'latin1'?
 
Alan J. Flavell

Is 'iso-8859-1' the same as 'latin1'?

Strictly speaking, "Latin1" denotes a character repertoire, not a
specific encoding. The Latin 1 repertoire can be encoded with
CP850 (sometimes called "DOS Latin 1"), CP-1047 (EBCDIC Latin 1),
iso-8859-1, or indeed with a subset of utf-8 etc. or some other
encoding which covers the repertoire. That's the theoretical
position.

But in practice, when anyone refers to Latin 1 in relation to
a character encoding, they surely mean the appropriate ISO encoding,
namely iso-8859-1.

By the way, don't forget that not all ISO repertoires are Latin. So,
although they both start at 1, after a while, the numbering gets out
of step: the encoding iso-8859-7 is for the Greek repertoire, not
Latin-anything, and there are encodings for Cyrillic, Arabic,
Hebrew; the iso encoding for the latin 9 repertoire is iso-8859-15,
for example.
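As far as Perl's Encode module is concerned, the practical answer can be checked directly. A small sketch (the variable names are illustrative):

```perl
use strict;
use warnings;
use Encode qw(encode find_encoding);

# Encode treats 'latin1' as an alias; ask it for the canonical name.
my $canonical = find_encoding('latin1')->name;
print "$canonical\n";    # prints "iso-8859-1"

# The euro sign is in Latin 9 (iso-8859-15) but not in Latin 1.
my $euro_latin9 = encode('iso-8859-15', "\x{20AC}");
printf "%02X\n", ord $euro_latin9;    # prints "A4"
```

So 'latin1' and 'iso-8859-1' name the same encoding to Encode, while Latin 9 lives under the differently numbered iso-8859-15.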
 
Bart Lateur

Tulan said:
You are right. I don't want the Windows-only character set.
I have no control over the input files because I get them from a vendor.
I have to convert those UTF-8 files to Latin-1 for other programs.

Eh, Windows (cp 1252) is Latin-1 plus a few more printable characters
where Latin-1 has "control characters" (but actually nothing, it's a
"taboo zone").

Is 'iso-8859-1' the same as 'latin1'?

Yes.
 
Bart Lateur

Alan said:
By the way, don't forget that not all ISO repertoires are Latin. So,
although they both start at 1, after a while, the numbering gets out
of step: the encoding iso-8859-7 is for the Greek repertoire, not
Latin-anything, and there are encodings for Cyrillic, Arabic,
Hebrew; the iso encoding for the latin 9 repertoire is iso-8859-15,
for example.

For more info on this, the best site I ever found is
<http://czyborra.com>, one (at the time) student's work. However, it's a
few years old, not really maintained, and worst of all, sometimes the
DNS doesn't even resolve -- which is a real shame.

Google still has quite a lot of its pages in its cache, and perhaps
there are other websites that hold archives of old sites...
 
