Where is the ucs-32 codec?

  • Thread starter beni.cherniavsky

beni.cherniavsky

Python seems to be missing a UCS-32 codec, even in wide builds (not
that the build should matter).
Is there some deep reason or should I just contribute a patch?

If it's just a bug, should I call the codec 'ucs-32' or 'utf-32'? Or
both (aliased)?
There should be '-le' and '-be' variants, I suppose. Should there be a
variant without explicit endianness, using a BOM to decide (like
'utf-16')?
And it should combine surrogates into valid characters (on all builds),
like the 'utf-8' codec does, right?
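
A minimal sketch of how the three variants asked about above could behave, assuming a hypothetical helper `encode_utf32` (not part of Python's codecs module). An explicit byte order writes no BOM; omitting it writes U+FEFF first in the machine's native order, mirroring what 'utf-16' does.

```python
import struct
import sys

BOM = 0xFEFF  # U+FEFF, written first when no byte order is named


def encode_utf32(text, byteorder=None):
    """Hypothetical helper: encode a str as UTF-32 bytes.

    byteorder is 'little', 'big', or None (native order plus a BOM).
    """
    code_points = [ord(ch) for ch in text]
    if byteorder is None:
        # No explicit order: use the native one and announce it with a BOM.
        byteorder = sys.byteorder
        code_points.insert(0, BOM)
    fmt = "<I" if byteorder == "little" else ">I"
    # Every code point becomes exactly one 4-byte unit.
    return b"".join(struct.pack(fmt, cp) for cp in code_points)
```

With this sketch, `encode_utf32("Hi", "big")` gives `b"\x00\x00\x00H\x00\x00\x00i"`, and `encode_utf32("Hi")` is four bytes longer because of the leading BOM.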
 

Erik Max Francis

beni.cherniavsky said:
> Python seems to be missing a UCS-32 codec, even in wide builds (not
> that the build should matter).
> Is there some deep reason or should I just contribute a patch?
>
> If it's just a bug, should I call the codec 'ucs-32' or 'utf-32'? Or
> both (aliased)?
> There should be '-le' and '-be' variants, I suppose. Should there be a
> variant without explicit endianness, using a BOM to decide (like
> 'utf-16')?
> And it should combine surrogates into valid characters (on all builds),
> like the 'utf-8' codec does, right?

Note that UTF-32 is UCS-4. UCS-32 ("Universal Character Set in 32
octets") wouldn't make much sense.

Not that Python has a UCS-4 encoding available either. I'm really not
sure why.
 
Martin v. Löwis

> Python seems to be missing a UCS-32 codec, even in wide builds (not
> that the build should matter).
> Is there some deep reason or should I just contribute a patch?

The only reason is that nobody has needed one so far, and because
it is quite some work to do if done correctly. Why do you need it?

> There should be '-le' and '-be' variants, I suppose. Should there be a
> variant without explicit endianness, using a BOM to decide (like
> 'utf-16')?

Right.

> And it should combine surrogates into valid characters (on all builds),
> like the 'utf-8' codec does, right?

Right.

Also, it should support the incremental interface (as any multi-byte
codec should).
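
To make the incremental requirement concrete, here is a minimal sketch, assuming a hypothetical class `IncrementalUTF32BEDecoder` (illustrative only, not the codecs module's implementation). It holds back any bytes that do not yet form a complete 4-byte unit and only reports an error for leftovers once the caller signals the end of input.

```python
import struct


class IncrementalUTF32BEDecoder:
    """Hypothetical sketch of an incremental big-endian UTF-32 decoder."""

    def __init__(self):
        self.pending = b""  # bytes received but not yet decodable

    def decode(self, data, final=False):
        data = self.pending + data
        usable = len(data) - len(data) % 4  # whole 4-byte units only
        self.pending = data[usable:]
        if final and self.pending:
            raise ValueError("truncated UTF-32 data")
        chars = []
        for offset in range(0, usable, 4):
            (code_point,) = struct.unpack_from(">I", data, offset)
            chars.append(chr(code_point))
        return "".join(chars)
```

Feeding the bytes of "Hi" in arbitrary chunks yields the same text as feeding them at once; a chunk boundary in the middle of a unit just delays that character until the next call.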

If you want it complete, it should also support line-oriented input.
Notice that .readline/.readlines is particularly difficult to implement,
as you can't rely on the underlying stream's .readline implementation
to provide meaningful results.
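
A small demonstration of why the underlying stream's .readline cannot be trusted: a byte stream stops at the first 0x0A byte, and in UTF-32 that byte can occur inside a character. The helper `to_utf32_be` is a hypothetical stand-in for the missing codec.

```python
import io
import struct


def to_utf32_be(text):
    # Hypothetical stand-in for a 'utf-32-be' encoder.
    return b"".join(struct.pack(">I", ord(ch)) for ch in text)


# U+0A05 encodes as 00 00 0A 05, so it contains a 0x0A byte.
data = to_utf32_be("\u0a05bc\nsecond line\n")

first_chunk = io.BytesIO(data).readline()
print(first_chunk)           # b'\x00\x00\n'
print(len(first_chunk) % 4)  # 3: the "line" ends in the middle of a character
```

A codec-level readline therefore has to decode first and look for line ends in the decoded text, not in the raw bytes.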

While we are discussing problems: there is also the issue of whether
.readline/.readlines should take the additional Unicode linebreak
characters into account (e.g. U+2028, U+2029), and if so, whether
that should be restricted to "universal newlines" mode.
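
The two candidate behaviours can be seen side by side on a plain string: str.splitlines() treats U+2028 and U+2029 as line boundaries, while splitting on "\n" alone does not. This is only an illustration of the choice, not a statement of what a codec's readline does.

```python
text = "one\u2028two\u2029three\nfour"

# Unicode-aware: LINE SEPARATOR and PARAGRAPH SEPARATOR end a line.
unicode_aware = text.splitlines()

# Newline-only: just "\n" ends a line.
newline_only = text.split("\n")

print(unicode_aware)      # ['one', 'two', 'three', 'four']
print(len(newline_only))  # 2
```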

Regards,
Martin
 

Erik Max Francis

Martin said:
> The only reason is that nobody has needed one so far, and because
> it is quite some work to do if done correctly. Why do you need it?

Why would it be "quite some work"? Converting from UTF-16 to UTF-32 is
pretty straightforward, and UTF-16 is already supported.
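
The per-character step referred to here can be sketched as follows, assuming a hypothetical helper `utf16_units_to_utf32_be` that takes UTF-16 code units as integers. A surrogate pair is folded into one code point; everything else is copied through and widened to four bytes.

```python
import struct


def utf16_units_to_utf32_be(units):
    """Hypothetical helper: UTF-16 code units (ints) to UTF-32-BE bytes."""
    out = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # High surrogate: must be followed by a low surrogate.
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                raise ValueError("unpaired high surrogate")
            low = units[i + 1]
            code_point = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
            i += 2
        elif 0xDC00 <= unit <= 0xDFFF:
            raise ValueError("unpaired low surrogate")
        else:
            code_point = unit
            i += 1
        out.append(struct.pack(">I", code_point))
    return b"".join(out)
```

For example, the pair 0xD834, 0xDD1E becomes the single unit for U+1D11E. This covers only the character conversion, not the stream or line handling raised earlier in the thread.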
 
Martin v. Löwis

Erik said:
> Why would it be "quite some work"? Converting from UTF-16 to UTF-32 is
> pretty straightforward, and UTF-16 is already supported.

I would like to see it correct, unlike the current UTF-16 codec. Perhaps
whoever contributes a UTF-32 codec could also deal with the defects of
the UTF-16 codec.

Regards,
Martin
 

cben

Somebody asked me about generating UTF-32 (he didn't have a choice of
the output format).
I was about to propose the obvious ``u.encode('utf-32')`` but
discovered it's missing.
Someone proposed 'unicode-internal' but it depends on the build and is
an ugly answer.
Next time, I want Guido's Time Machine to just work, so I have to fix
this ;-).

> I would like to see it correct, unlike the current UTF-16 codec. Perhaps
> whoever contributes a UTF-32 codec could also deal with the defects of
> the UTF-16 codec.

Now this is interesting, as I hoped to base my code on UTF-16 (and
perhaps UTF-8 for combining surrogates)... Can you elaborate?

I could attempt to fix UTF-16 as well, but I don't have the expertise to
choose the right behaviour, so you'll have to specify precisely what it
should do (that it doesn't do now).
 

Fredrik Lundh

> Somebody asked me about generating UTF-32 (he didn't have a choice of the
> output format). I was about to propose the obvious ``u.encode('utf-32')``
> but discovered it's missing.

hint 1:
'\x00\x00\x00H\x00\x00\x00e\x00\x00\x00l\x00\x00\x00l\x00\x00\x00o'

hint 2:
'little'

</F>
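
The expressions that produced the two hints did not survive in this copy of the thread. One way to obtain values like them, offered as an assumption and not as what was originally typed, is to pack each code point with struct and to check sys.byteorder:

```python
import struct
import sys

# Big-endian, 4 bytes per character, no BOM.
hint1 = b"".join(struct.pack(">I", ord(ch)) for ch in "Hello")

# 'little' or 'big', depending on the machine running this.
hint2 = sys.byteorder

print(hint1)
print(hint2)
```

Swapping ">I" for "<I" gives the little-endian form; sys.byteorder tells you which of the two matches the local machine.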
 
Martin v. Löwis

> Now this is interesting, as I hoped to base my code on UTF-16 (and
> perhaps UTF-8 for combining surrogates)... Can you elaborate?

The codec doesn't do line-oriented input correctly (i.e. readline);
it raises NotImplementedError.

Regards,
Martin
 
