Where is the ucs-32 codec?

beni.cherniavsky · Jun 4, 2006

Python seems to be missing a UCS-32 codec, even in wide builds (not
that the build should matter).
Is there some deep reason or should I just contribute a patch?

If it's just a bug, should I call the codec 'ucs-32' or 'utf-32'? Or
both (aliased)?
There should be '-le' and '-be' variants, I suppose. Should there be a
variant without explicit endianness, using a BOM to decide (like
'utf-16')?
And it should combine surrogates into valid characters (on all builds),
like the 'utf-8' codec does, right?
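
A minimal sketch of what the requested variants could look like, written for present-day Python 3 rather than the 2.4 build under discussion. The helper name `encode_utf32` is hypothetical, and defaulting to little-endian when a BOM is written is an arbitrary choice for this sketch, not something the thread settles.

```python
import struct

BOM = 0xFEFF  # byte order mark, written as the first code unit


def encode_utf32(text, byteorder=None):
    """Hypothetical helper: encode text as UTF-32.

    byteorder is 'little', 'big', or None. None writes a BOM followed by
    little-endian data (an assumption made for this sketch).
    Lone surrogates are passed through unchanged; a real codec would
    have to decide how to treat them.
    """
    code_points = [ord(ch) for ch in text]
    if byteorder is None:
        code_points.insert(0, BOM)
        byteorder = "little"
    prefix = "<" if byteorder == "little" else ">"
    # One unsigned 32-bit integer per code point.
    return struct.pack("%s%dI" % (prefix, len(code_points)), *code_points)
```

With `byteorder="big"` the result corresponds to a '-be' variant, with `"little"` to a '-le' variant, and with no argument the BOM tells a reader which order was used.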

Erik Max Francis · Jun 4, 2006

> Python seems to be missing a UCS-32 codec, even in wide builds (not
> that the build should matter).
> Is there some deep reason or should I just contribute a patch?
>
> If it's just a bug, should I call the codec 'ucs-32' or 'utf-32'? Or
> both (aliased)?
> There should be '-le' and '-be' variants, I suppose. Should there be a
> variant without explicit endianness, using a BOM to decide (like
> 'utf-16')?
> And it should combine surrogates into valid characters (on all builds),
> like the 'utf-8' codec does, right?

Note that UTF-32 is UCS-4. UCS-32 ("Universal Character Set in 32
octets") wouldn't make much sense.

Not that Python has a UCS-4 encoding available either. I'm really not
sure why.

Méta-MCI · Jun 4, 2006

Hi!

Look at: http://cjkpython.berlios.de (iconvcodec)

(Serge Orlov has built a version for Python 2.4 "special for me"; thanks to
him).

@-salutations

Martin v. Löwis · Jun 5, 2006

> Python seems to be missing a UCS-32 codec, even in wide builds (not
> that the build should matter).
> Is there some deep reason or should I just contribute a patch?

The only reason is that nobody has needed one so far, and because
it is quite some work to do if done correctly. Why do you need it?

> There should be '-le' and '-be' variants, I suppose. Should there be a
> variant without explicit endianness, using a BOM to decide (like
> 'utf-16')?

Right.

> And it should combine surrogates into valid characters (on all builds),
> like the 'utf-8' codec does, right?

Right.

Also, it should support the incremental interface (as any multi-byte
codec should).
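
A rough illustration of what the incremental interface has to cope with: input may arrive in chunks that split a 4-byte code unit, so leftover bytes must be carried over to the next call. The class below is a hypothetical sketch for little-endian data without BOM handling; only its `decode(data, final=False)` signature is modelled on `codecs.IncrementalDecoder.decode`.

```python
import struct


class Utf32LeIncrementalDecoder:
    """Hypothetical incremental decoder for little-endian UTF-32."""

    def __init__(self):
        self.pending = b""  # bytes of a code unit that is not complete yet

    def decode(self, data, final=False):
        data = self.pending + data
        usable = len(data) - len(data) % 4  # decode whole 4-byte units only
        self.pending = data[usable:]
        if final and self.pending:
            raise ValueError("truncated UTF-32 data")
        code_points = struct.unpack("<%dI" % (usable // 4), data[:usable])
        # chr() rejects values above 0x10FFFF with a ValueError.
        return "".join(chr(cp) for cp in code_points)
```

Feeding `b"H\x00"` and then `b"\x00\x00i\x00\x00\x00"` yields an empty string for the first call and `"Hi"` for the second.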

If you want it complete, it should also support line-oriented input.
Notice that .readline/.readlines is particularly difficult to implement,
as you can't rely on the underlying stream's .readline implementation
to provide meaningful results.
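
One way to see why the underlying stream's .readline cannot be trusted: a byte-level readline stops at the first 0x0A byte, and in UTF-32 that byte can be part of a different character. This sketch assumes a current Python 3, where a `utf-32-le` codec is available, and uses U+010A purely as a convenient example character.

```python
import io

text = "\u010a\n"               # U+010A followed by a real newline
raw = text.encode("utf-32-le")  # b'\n\x01\x00\x00' + b'\n\x00\x00\x00'
stream = io.BytesIO(raw)

# The byte stream knows nothing about 4-byte code units, so it cuts
# the "line" after the first 0x0A byte, in the middle of U+010A.
first_chunk = stream.readline()
```

Here `first_chunk` is a single byte, which is not a whole code unit. Even for a genuine newline in little-endian data, the three trailing zero bytes would end up at the start of the next "line".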

While we are discussing problems: there also is the issue whether
.readline/.readlines should take the additional Unicode linebreak
characters into account (e.g. U+2028, U+2029), and if so, whether
that should be restricted to "universal newlines" mode.
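
For comparison, this is how the two notions of a line differ in current Python 3 (not the Python of this thread): `str.splitlines` treats U+2028 and U+2029 as line boundaries, while `io.StringIO.readline` with its default settings only stops at `"\n"`. The sample string is made up for the illustration.

```python
import io

sample = "one\u2028two\nthree"

# splitlines() honours the Unicode line separator U+2028.
by_splitlines = sample.splitlines()

# readline() on a text stream only looks for "\n" here.
by_readline = io.StringIO(sample).readline()
```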

Regards,
Martin

Erik Max Francis · Jun 5, 2006

Martin said:
> The only reason is that nobody has needed one so far, and because
> it is quite some work to do if done correctly. Why do you need it?

Why would it be "quite some work"? Converting from UTF-16 to UTF-32 is
pretty straightforward, and UTF-16 is already supported.
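
The core of that conversion is combining each surrogate pair into one code point. A minimal sketch, with the hypothetical helper name `combine_surrogates`; it passes lone surrogates through unchanged, which a real codec might instead treat as an error.

```python
def combine_surrogates(units):
    """Hypothetical helper: turn UTF-16 code units into code points."""
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        if is_high and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
            low = units[i + 1]
            # High surrogate carries the top 10 bits, low the bottom 10.
            code_points.append(
                0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
            )
            i += 2
        else:
            # BMP code unit, or a lone surrogate passed through as-is.
            code_points.append(unit)
            i += 1
    return code_points
```

For example, the pair 0xD834, 0xDD1E combines to U+1D11E. Each resulting code point can then be written out as one 32-bit integer in the chosen byte order.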

Martin v. Löwis · Jun 5, 2006

Erik said:
> Why would it be "quite some work"? Converting from UTF-16 to UTF-32 is
> pretty straightforward, and UTF-16 is already supported.

I would like to see it correct, unlike the current UTF-16 codec. Perhaps
whoever contributes a UTF-32 codec could also deal with the defects of
the UTF-16 codec.

Regards,
Martin

cben · Jun 9, 2006

Méta-MCI said:
> Hi!
>
> Look at: http://cjkpython.berlios.de (iconvcodec)
>
> (Serge Orlov has built a version for Python 2.4 "special for me"; thanks to
> him).

Thanks for the pointer.
iconvcodec should do the job, but I still want a native implementation
to be included with any Python.

cben · Jun 9, 2006

Somebody asked me about generating UTF-32 (he didn't have a choice of
output format).
I was about to propose the obvious ``u.encode('utf-32')`` but
discovered it's missing.
Someone proposed 'unicode-internal' but it depends on the build and is
an ugly answer.
Next time, I want Guido's Time Machine to just work, so I have to fix
this ;-).

> I would like to see it correct, unlike the current UTF-16 codec. Perhaps
> whoever contributes a UTF-32 codec could also deal with the defects of
> the UTF-16 codec.

Now this is interesting, as I hoped to base my code on UTF-16 (and
perhaps UTF-8 for combining surrogates)... Can you elaborate?

I could attempt to fix UTF-16 as well but I don't have the expertise to
choose the right behaviour,
so you'll have to specify precisely what it should do (that it doesn't
do now).

Fredrik Lundh · Jun 9, 2006

> Somebody asked me about generating UTF-32 (he didn't have a choice of
> output format). I was about to propose the obvious ``u.encode('utf-32')``
> but discovered it's missing.

hint 1:
'\x00\x00\x00H\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o'

hint 2:
'little'

</F>

Martin v. Löwis · Jun 9, 2006

> Now this is interesting, as I hoped to base my code on UTF-16 (and
> perhaps UTF-8 for combining surrogates)... Can you elaborate?

The codec doesn't do line-oriented input correctly (i.e. readline);
it raises NotImplementedError.

Regards,
Martin
