Guide for dealing with alternate character sets?

Bernie Cosell

I'm pretty buffaloed about the prospect of moving some of my programs
to the new version of RedHat that is native UTF-8. I don't understand
all the implications of it, and I wonder if there's some kind of
tutorial or programming or practices guide to deal with it besides
perlunicode(1)?

I note that much of my Perl code is already ugly because of a
character convention mismatch: on our system, the line terminator is
just \012, but on stuff coming in over a socket, the line terminator
is \015\012 and so I have some really sloppy code for inserting and
removing the "\r" in [most of? :o)] the right places in the code, but
it has always felt a bit awkward.

Once we move to the new system, it'll get worse: *most* of the stuff
coming in over TCP connections will still be just ISO-Latin [with
\r\n] and my "local files" will be UTF-8 [with just \n], and I don't
know *what* I'm going to do.

I've read the "perlunicode" man page and it is more or less clear, but
I'm not sure how to structure my program in an environment that
necessarily has to handle data streams that could be *either* UTF-8 or
ISO-Latin [it is at least fathomable, if a bit tricky, to do one or
the other]. And in the process, any tricks for cleaning up the
programming to handle \r\n vs \n on a per-stream basis would be
nice..:o)

Thanks!
/Bernie\
 
Alan J. Flavell

> I'm pretty buffaloed about the prospect of moving some of my programs
> to the new version of RedHat that is native UTF-8. I don't understand
> all the implications of it, and I wonder if there's some kind of
> tutorial or programming or practices guide to deal with it besides
> perlunicode(1)?

That's a hard one at the moment. AFAICS the doc still has massively
raw edges, looking in places more like implementers' jottings than
finished user documentation. But it's getting there.

> I note that much of my Perl code is already ugly because of a
> character convention mismatch: on our system, the line terminator is
> just \012, but on stuff coming in over a socket, the line terminator
> is \015\012 and so I have some really sloppy code for inserting and
> removing the "\r"

<pounce>

What you're removing is \015. It might happen to be the same as \r
on your platform, but you should take heed of what perlport says.

> in [most of? :o)] the right places in the code, but
> it has always felt a bit awkward.

Seems to me that you need to get a grip on that *before* you tackle
unicode. Unicode sure doesn't help you with that detail (and if you
check the past discussions relating to utf-16, you'll see that we
seemed to have found a bug in newline handling here, which I'm afraid
I lost sight of, so I'm not sure if anyone took it to the developers
and/or found a resolution).

> Once we move to the new system, it'll get worse: *most* of the stuff
> coming in over TCP connections will still be just ISO-Latin [with
> \r\n] and my "local files" will be UTF-8 [with just \n], and I don't
> know *what* I'm going to do.

Basically you need to either keep them apart, or settle on a canonical
internal representation and convert the other.

Try to keep internal and external representations apart, as if they
were measured in different currencies and had to be converted each
time they cross the border. The similarity is misleading - you'd
stand a better chance of getting this right if your internal coding
was totally different from the external one (as on, e.g., IBM
mainframes), where such discipline is unavoidable.
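
A minimal sketch of that border-crossing discipline, assuming the incoming data really is iso-8859-1 with CR LF line ends and the local files are UTF-8. The helper names `from_wire` and `to_local_file` are made up for illustration; `decode` and `encode` come from the standard Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helpers; the names are illustrative, not from any module.

# Inbound: network bytes (assumed iso-8859-1, CR LF line ends) -> Perl text.
sub from_wire {
    my ($bytes) = @_;
    my $text = decode('iso-8859-1', $bytes);
    $text =~ s/\015\012/\n/g;    # wire line ends -> internal newline
    return $text;
}

# Outbound: Perl text -> UTF-8 bytes destined for a local file.
sub to_local_file {
    my ($text) = @_;
    return encode('UTF-8', $text);
}

my $wire  = "na\xEFve caf\xE9\015\012";
my $text  = from_wire($wire);
my $bytes = to_local_file($text);
printf "%d characters inside, %d bytes outside\n",
    length($text), length($bytes);
```

Everything between the two helpers works on characters only, so the rest of the program never needs to know which encoding a given stream used.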

> I've read the "perlunicode" man page and it is more or less clear, but
> I'm not sure how to structure my program in an environment that
> necessarily has to handle data streams that could be *either* UTF-8 or
> ISO-Latin

Will you be told, or do you have to guess?

And if you get anywhere near some recent Windows apps, you're likely
to get utf-16 to deal with also.

> [it is at least fathomable, if a bit tricky, to do one or
> the other]. And in the process, any tricks for cleaning up the
> programming to handle \r\n vs \n on a per-stream basis would be
> nice..:o)

If you want to work portably, please stop referring to \r\n when
dealing with sockets data: perlport's advice is better.

About the only positive thing I can say at this point is that CR and
LF are exactly the same in utf-8 as they are in iso-8859-1 as they are
in us-ascii.
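
For what it's worth, here is a rough sketch of the perlport advice: spell the wire terminator out in octal instead of writing "\r\n". The helper names `chomp_wire` and `wire_line` are invented for this example.

```perl
use strict;
use warnings;

# Spell out the wire bytes in octal, as perlport recommends, rather than
# trusting "\r" and "\n" to mean the same thing on every platform.
my $CRLF = "\015\012";

# Strip one trailing CR LF from a line read off a socket.
sub chomp_wire {
    my ($line) = @_;
    $line =~ s/\015\012\z//;
    return $line;
}

# Terminate an outgoing protocol line with CR LF.
sub wire_line {
    my ($line) = @_;
    return $line . $CRLF;
}

print length(wire_line(chomp_wire("HELO example\015\012"))), "\n";
```

The Socket module also exports ready-made `$CR`, `$LF` and `$CRLF` variables via its `:crlf` tag, which may be tidier than defining your own.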
 
Lawrence D'Oliveiro

> I note that much of my Perl code is already ugly because of a
> character convention mismatch: on our system, the line terminator is
> just \012, but on stuff coming in over a socket, the line terminator
> is \015\012..,

All code for reading text files should be able to deal with all three
line-termination conventions: CR, LF and CR-LF. PostScript figured out
how to do this back in the 1980s, when is everybody else going to catch
up?
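
One way to sketch a reader that tolerates all three conventions, assuming the whole text is already in memory. `split_lines` is a hypothetical helper, not a library function.

```perl
use strict;
use warnings;

# Split text into lines, accepting CR LF, lone CR, or lone LF as the
# terminator. CR LF is listed first so that it is consumed as a single
# terminator rather than as a CR followed by an empty line.
sub split_lines {
    my ($text) = @_;
    return split /\015\012|\015|\012/, $text;
}

my @lines = split_lines("one\015\012two\015three\012four");
print scalar(@lines), " lines\n";
```

This only works safely on complete text. On a stream read in chunks, a CR at the very end of one chunk might be the first half of a CR LF, so a real reader would need to hold it back until the next chunk arrives.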

For writing files, the Internet standard is CR-LF.

> Once we move to the new system, it'll get worse: *most* of the stuff
> coming in over TCP connections will still be just ISO-Latin [with
> \r\n] and my "local files" will be UTF-8 [with just \n], and I don't
> know *what* I'm going to do.

My feeling is to convert everything incoming to Unicode before working
on it, and if necessary convert it to something else before outputting
it.

For character manipulation, a fixed-length code like UTF-16 is probably
easier to work with than a variable-length code like UTF-8. The only
real reason for the existence of UTF-8 is backward compatibility: it
allows all the old code, that was written over decades to deal only with
7-bit ASCII, to be declared Unicode-compatible (after a fashion),
because 7-bit ASCII is a subset of UTF-8.
 
Alan J. Flavell

> For character manipulation, a fixed-length code like UTF-16

utf-16 is not "fixed length".

> is probably easier to work with than a variable-length code like
> UTF-8.

If you're working with Perl, then I'd recommend working _with_ Perl.

> The only real reason for the existence of UTF-8 is backward
> compatibility:

There's something in what you say, but it isn't the whole story.

But anyway, one can work with Unicode in Perl without tangling with
these details. The internal representation can be kept under the
covers as long as everything goes to plan (it can, admittedly, be
useful to understand the internal representation when things start
going wrong and detailed debugging is called for). But it would be
frankly perverse to disregard Perl's Unicode support and insist on
handling data in utf-16 format, as you seem to be suggesting - and
especially if it results in needing to tangle with surrogate pairs.
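
A small illustration of the surrogate-pair point, using the standard Encode module. `encoded_length` is a made-up helper, and the two sample characters are arbitrary picks from inside and outside the Basic Multilingual Plane.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Number of bytes one character occupies in a given encoding.
sub encoded_length {
    my ($encoding, $char) = @_;
    return length(encode($encoding, $char));
}

my $bmp    = "\x{20AC}";     # EURO SIGN, inside the BMP
my $astral = "\x{1D11E}";    # MUSICAL SYMBOL G CLEF, outside the BMP

for my $enc (qw(UTF-8 UTF-16BE UTF-32BE)) {
    printf "%-8s U+20AC: %d bytes, U+1D11E: %d bytes\n",
        $enc, encoded_length($enc, $bmp), encoded_length($enc, $astral);
}
```

In UTF-16 the second character needs two 16-bit units (a surrogate pair), so the per-character width is not constant. Only the 32-bit form is the same width for both.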

There _might_ be something to be said for working in UCS-4 (as it used
to be called) or utf-32.

> allows all the old code, that was written over decades to deal only with
> 7-bit ASCII, to be declared Unicode-compatible (after a fashion),

So how would you describe utf-7 in your terms?

cheers
 
Alan J. Flavell

> And I confess that I haven't even *tried* to worry about specific
> char sets
  ^^^^^^^^^

Think "encodings". That attribute "charset" is an unfortunate piece
of MIME terminology - a hang-over from simpler times.

In Perl 5.8 representation, the native text "character set" internally
is Unicode, or at least a subset of it, irrespective of the external
encoding of the data.

(That's excluding the situation where you read an external encoding,
effectively as binary, and manipulate it yourself, rather than taking
advantage of its native text functionality.)

> [most everything from the Internet comes in with a charset specification..
> but I can't deal with ISO-8859-15 or who knows what,

_You_ don't need to. Perl handles that, as long as you talk to it
nicely...

> so I just assume [sigh!] plain ISO-Latin.

ISO Latin _what_??? There are at least nine of them, and some of them
exist in alternative encodings (e.g. CP-1047 Latin-1 EBCDIC).

The term "ISO Latin (n)" properly defines a _repertoire_ of
characters. Sure, it might then seem natural to represent them by
using the applicable iso-8859 encoding (iso-8859-15 for Latin-9, for
example), but it isn't the only option. It's best to keep a clear
head when tangling with this stuff.

> Is there some more-general way, assuming I
> convert over to using UTF-8 "inside" my programs to take a string that I
> think is, say, ISO-8859-15 or some Korean character set or whatever and
> convert it to UTF?

Surely, what you want to convert to "inside" your programs is Perl's
own Unicode representation? (Which happens to be based on utf-8, but
you don't need to deal with that directly).

> I note that in perlunicode it says:

Sorry, I don't find that text in the document. Which version are you
looking at? For any kind of sensible progress, you should be using at
least 5.8.0. Oh yeah, Google finds that text in two-year-old documents
that appear to relate to Perl 5.6.*. I'd say give those a miss -
you're going to make yourself a lot more work trying to get this to
work in 5.6, and by the time you've got it working, 5.6 will be so
long in the tooth that no-one will care.

> but I confess to being a bit confused by the man page: is there actual
> machinery for converting to UTF-8? Can I somehow take a string that I
> believe to be ISO-Latin and in some easy way "UTF-ize" it

Several ways.

If you were referring to iso-8859-1, then in Perl 5.8.0 you just use
it, and when Perl feels the need to do so it will automagically
upgrade it to Unicode. Take a look at
http://www.perldoc.com/perl5.8.0/pod/perluniintro.html#Perl's-Unicode-Model
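
A quick way to watch that automatic upgrade happen, as a sketch. It assumes a Perl recent enough to provide `utf8::is_utf8`, and the flag it reports is the sort of under-the-covers detail you normally only look at when debugging.

```perl
use strict;
use warnings;

my $latin1 = "caf\xE9";               # four characters, all below 0x100
my $wide   = $latin1 . "\x{263A}";    # appending U+263A forces an upgrade

printf "before: %d chars, utf8 flag %s\n",
    length($latin1), utf8::is_utf8($latin1) ? 'on' : 'off';
printf "after:  %d chars, utf8 flag %s\n",
    length($wide), utf8::is_utf8($wide) ? 'on' : 'off';

# The e-acute is still code point 0xE9 after the upgrade.
printf "4th char is U+%04X\n", ord(substr($wide, 3, 1));
```

The character values do not change across the upgrade, only the internal storage, which is why iso-8859-1 data can simply be used as it is.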

If you're dealing with, say, iso-8859-15 then you'd want to specify an
encoding layer (when doing i/o), or apply an explicit conversion.
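
As a sketch of the encoding-layer route, the following reads iso-8859-15 data and writes it back out as UTF-8. In-memory filehandles stand in for real files so that the example is self-contained, and the sample data is invented.

```perl
use strict;
use warnings;

# Stand-in for an iso-8859-15 file: byte 0xA4 is the euro sign there
# (in iso-8859-1 the same byte is the generic currency sign U+00A4).
my $latin9_bytes = "price: 10 \xA4\n";

# Read through an encoding layer: the program sees characters, not bytes.
open my $in, '<:encoding(iso-8859-15)', \$latin9_bytes or die "open: $!";
my $line = <$in>;
close $in;

# Write through a UTF-8 layer: characters are encoded on the way out.
my $utf8_bytes = '';
open my $out, '>:encoding(UTF-8)', \$utf8_bytes or die "open: $!";
print {$out} $line;
close $out;

printf "read U+%04X, wrote %d bytes\n",
    ord(substr($line, 10, 1)), length($utf8_bytes);
```

With a real file or socket the same layer can be put on an already-open handle with `binmode($fh, ':encoding(iso-8859-15)')`. The explicit-conversion alternative is `Encode::decode` on a string you have already read as bytes.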

Beware of the interaction with your locale setting! In RedHat 9 for
example the default locale setting will imply Unicode, and Perl (5.8.0
and later) will respond to this in various ways.

> have a UTF internal string and ISO-Latin it for pushing out to a web
> browser or SMTP server or whatever].

Not such a good idea. Modern browsers handle utf-8 fine, and that
crusty old NN4.* handles utf-8 somewhat better than it handles
8-bit-encoded data plus references.

So if you've got utf-8, flaunt it...

(WebTV excepted, I should say. But that would take us too far OT
for this group).
 
Ben Morrow

Alan J. Flavell said:
> Beware of the interaction with your locale setting! In RedHat 9 for
> example the default locale setting will imply Unicode, and Perl (5.8.0
> and later) will respond to this in various ways.

5.8.0 only. This feature was considered to cause more problems than it
was worth, and was removed by default (although it can be activated
with a stolen -C switch) in 5.8.1. If you have a Unicode default
locale, it may be worth switching to 5.8.1 to save yourself some
pain... :)

Ben
 
Alan J. Flavell

> Ignore surrogates.

No thanks.

> I don't know what the hell they're doing there.

All the more reason not to ignore them, then, until one does know.

Check the meaning of the "utf" initialism ("transformation format"),
as opposed to the true fixed-length representation(s), specifically
iso-10646-ucs-4. AIUI, Unicode don't recommend UCS format for
transmission (interworking) encoding, but if you want a fixed-length
internal representation then ucs-4 *could* be an appropriate choice.

But, as I said, Perl has already made its choice of internal
representation. I'd say "use it". Some complain that it's
inefficient, but don't forget the first three rules of optimisation:
"don't optimise yet".
 
