Codecs for ISO 8859-11 (Thai) and 8859-16 (Romanian)

Peter Jacobi

I've seen from the 2.4alpha announcements, that the CJK codecs made it
into this version.

I'd like to ask whether (or how to) add the missing ISO 8859 codes:

ISO 8859-11 (= TIS620) for Thai
ISO 8859-16 for Romanian

They are easily built from the Unicode mapping files like the other
ISO 8859 codecs and it would just be nice, if they were included in
the standard distribution.

Peter
 
Martin v. Löwis

Peter said:
They are easily built from the Unicode mapping files like the other
ISO 8859 codecs and it would just be nice, if they were included in
the standard distribution.

Can you produce a patch? Please upload it to sf.net/projects/python.

ISO-8859-11 is actually very difficult to implement, as it is unclear
whether the characters \x80..\x9F are assigned in this character set
or not. In fact, it is unclear whether the character set contains
even C0.

Regards,
Martin
 
Christos TZOTZIOY Georgiou

Peter said:
I'd like to ask whether (or how to) add the missing ISO 8859 codes:

Martin asked for a patch; it would be nice if you could provide one. On
"how": just take any Lib/encodings/iso8859_?.py and edit the dict
argument to the decoding_map.update call.
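That recipe boils down to the codecs charmap machinery; a minimal sketch, using a single entry of the Thai mapping for illustration:

```python
import codecs

# Start from an identity map and override the positions that differ,
# which is exactly the shape of the stock Lib/encodings/iso8859_?.py modules.
decoding_map = codecs.make_identity_dict(range(256))
decoding_map.update({
    0x00a1: 0x0e01,  # THAI CHARACTER KO KAI
})

# charmap_decode returns (decoded_text, number_of_bytes_consumed).
text, consumed = codecs.charmap_decode(b'\xa1', 'strict', decoding_map)
assert text == '\u0e01' and consumed == 1
```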
 
Richard Brodie

Martin v. Löwis said:
ISO-8859-11 is actually very difficult to implement, as it is unclear
whether the characters \x80..\x9F are assigned in this character set
or not. In fact, it is unclear whether the character set contains
even C0.

That seems like a very fine distinction to me; the Unicode mapping tables
are the same for those points as in ISO-8859-1, so what's the difference?
 
Martin v. Löwis

Richard said:
That seems like a very fine distinction to me; the Unicode mapping tables
are the same for those points as in ISO-8859-1, so what's the difference?

For ISO-8859-1, I believe the standard actually says that those code
points are C1. For ISO-8859-11, you can find various statements in the
net, some claiming that it includes C1, and some claiming that it
doesn't. Somebody would actually have to take a look at ISO-8859-11 to
find out which is the case.

The issue is complicated by two facts:
- many sources indicate that ISO-8859-11 is derived by taking TIS-620,
and adding NBSP into 0xa0. Now, it seems quite clear that TIS-620 does
*not* include C1.
- some sources indicate certain restrictions w.r.t. control functions,
e.g. in

http://www.nectec.or.th/it-standards/iso8859-11/

which says "control functions are not used to create composite graphic
symbols from two or more graphic characters (see 6). "
I don't know what this means, especially as section 6 does not talk
about control functions. Section 7 says that any control functions
are out of scope of ISO 8859, which I believe is factually incorrect.

Regards,
Martin
 
Peter Jacobi

Hi Christos, All,

Christos "TZOTZIOY" Georgiou said:
Martin asked for a patch, which would be nice if you could provide. On
"how": just take any lib/encodings/iso8859_?.py and edit the dict
argument to the decoding_map.update call.

Thanks for the hint, but I've already succeeded in generating the
necessary files. It's even easier than your solution, as the utility
gencodec.py in Tools/scripts generates them automatically from (1:1)
Unicode mapping files (ftp://ftp.unicode.org/Public/MAPPINGS/).
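The mapping files are plain text with tab-separated hex columns; roughly, the parsing step gencodec.py performs looks like the sketch below (the sample lines are a hypothetical excerpt in the style of 8859-11.TXT; the real tool also writes out the codec module itself):

```python
sample = (
    "# Hypothetical excerpt in the style of 8859-11.TXT\n"
    "0xA1\t0x0E01\t#\tTHAI CHARACTER KO KAI\n"
    "0xA2\t0x0E02\t#\tTHAI CHARACTER KHO KHAI\n"
)

mapping = {}
for line in sample.splitlines():
    line = line.split('#', 1)[0].strip()  # drop comments and blank lines
    if not line:
        continue
    byte_hex, cp_hex = line.split()[:2]   # byte value, Unicode code point
    mapping[int(byte_hex, 16)] = int(cp_hex, 16)

assert mapping == {0xA1: 0x0E01, 0xA2: 0x0E02}
```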

I'll add the generated files at the end of this post.

The remaining question, and it seems the more difficult one, is a
question of process. Whether and how to add these to the normal
Python distribution.

Regards,
Peter Jacobi

Thai:
=== iso8859_11.py ===
""" Python Character Mapping Codec generated from '8859-11.TXT' with gencodec.py.

Written by Marc-Andre Lemburg ([email protected]).

(c) Copyright CNRI, All Rights Reserved. NO WARRANTY.
(c) Copyright 2000 Guido van Rossum.

"""#"

import codecs

### Codec APIs

class Codec(codecs.Codec):

    def encode(self,input,errors='strict'):

        return codecs.charmap_encode(input,errors,encoding_map)

    def decode(self,input,errors='strict'):

        return codecs.charmap_decode(input,errors,decoding_map)

class StreamWriter(Codec,codecs.StreamWriter):
    pass

class StreamReader(Codec,codecs.StreamReader):
    pass

### encodings module API

def getregentry():

    return (Codec().encode,Codec().decode,StreamReader,StreamWriter)

### Decoding Map

decoding_map = codecs.make_identity_dict(range(256))
decoding_map.update({
0x00a1: 0x0e01, # THAI CHARACTER KO KAI
0x00a2: 0x0e02, # THAI CHARACTER KHO KHAI
0x00a3: 0x0e03, # THAI CHARACTER KHO KHUAT
0x00a4: 0x0e04, # THAI CHARACTER KHO KHWAI
0x00a5: 0x0e05, # THAI CHARACTER KHO KHON
0x00a6: 0x0e06, # THAI CHARACTER KHO RAKHANG
0x00a7: 0x0e07, # THAI CHARACTER NGO NGU
0x00a8: 0x0e08, # THAI CHARACTER CHO CHAN
0x00a9: 0x0e09, # THAI CHARACTER CHO CHING
0x00aa: 0x0e0a, # THAI CHARACTER CHO CHANG
0x00ab: 0x0e0b, # THAI CHARACTER SO SO
0x00ac: 0x0e0c, # THAI CHARACTER CHO CHOE
0x00ad: 0x0e0d, # THAI CHARACTER YO YING
0x00ae: 0x0e0e, # THAI CHARACTER DO CHADA
0x00af: 0x0e0f, # THAI CHARACTER TO PATAK
0x00b0: 0x0e10, # THAI CHARACTER THO THAN
0x00b1: 0x0e11, # THAI CHARACTER THO NANGMONTHO
0x00b2: 0x0e12, # THAI CHARACTER THO PHUTHAO
0x00b3: 0x0e13, # THAI CHARACTER NO NEN
0x00b4: 0x0e14, # THAI CHARACTER DO DEK
0x00b5: 0x0e15, # THAI CHARACTER TO TAO
0x00b6: 0x0e16, # THAI CHARACTER THO THUNG
0x00b7: 0x0e17, # THAI CHARACTER THO THAHAN
0x00b8: 0x0e18, # THAI CHARACTER THO THONG
0x00b9: 0x0e19, # THAI CHARACTER NO NU
0x00ba: 0x0e1a, # THAI CHARACTER BO BAIMAI
0x00bb: 0x0e1b, # THAI CHARACTER PO PLA
0x00bc: 0x0e1c, # THAI CHARACTER PHO PHUNG
0x00bd: 0x0e1d, # THAI CHARACTER FO FA
0x00be: 0x0e1e, # THAI CHARACTER PHO PHAN
0x00bf: 0x0e1f, # THAI CHARACTER FO FAN
0x00c0: 0x0e20, # THAI CHARACTER PHO SAMPHAO
0x00c1: 0x0e21, # THAI CHARACTER MO MA
0x00c2: 0x0e22, # THAI CHARACTER YO YAK
0x00c3: 0x0e23, # THAI CHARACTER RO RUA
0x00c4: 0x0e24, # THAI CHARACTER RU
0x00c5: 0x0e25, # THAI CHARACTER LO LING
0x00c6: 0x0e26, # THAI CHARACTER LU
0x00c7: 0x0e27, # THAI CHARACTER WO WAEN
0x00c8: 0x0e28, # THAI CHARACTER SO SALA
0x00c9: 0x0e29, # THAI CHARACTER SO RUSI
0x00ca: 0x0e2a, # THAI CHARACTER SO SUA
0x00cb: 0x0e2b, # THAI CHARACTER HO HIP
0x00cc: 0x0e2c, # THAI CHARACTER LO CHULA
0x00cd: 0x0e2d, # THAI CHARACTER O ANG
0x00ce: 0x0e2e, # THAI CHARACTER HO NOKHUK
0x00cf: 0x0e2f, # THAI CHARACTER PAIYANNOI
0x00d0: 0x0e30, # THAI CHARACTER SARA A
0x00d1: 0x0e31, # THAI CHARACTER MAI HAN-AKAT
0x00d2: 0x0e32, # THAI CHARACTER SARA AA
0x00d3: 0x0e33, # THAI CHARACTER SARA AM
0x00d4: 0x0e34, # THAI CHARACTER SARA I
0x00d5: 0x0e35, # THAI CHARACTER SARA II
0x00d6: 0x0e36, # THAI CHARACTER SARA UE
0x00d7: 0x0e37, # THAI CHARACTER SARA UEE
0x00d8: 0x0e38, # THAI CHARACTER SARA U
0x00d9: 0x0e39, # THAI CHARACTER SARA UU
0x00da: 0x0e3a, # THAI CHARACTER PHINTHU
0x00db: None,
0x00dc: None,
0x00dd: None,
0x00de: None,
0x00df: 0x0e3f, # THAI CURRENCY SYMBOL BAHT
0x00e0: 0x0e40, # THAI CHARACTER SARA E
0x00e1: 0x0e41, # THAI CHARACTER SARA AE
0x00e2: 0x0e42, # THAI CHARACTER SARA O
0x00e3: 0x0e43, # THAI CHARACTER SARA AI MAIMUAN
0x00e4: 0x0e44, # THAI CHARACTER SARA AI MAIMALAI
0x00e5: 0x0e45, # THAI CHARACTER LAKKHANGYAO
0x00e6: 0x0e46, # THAI CHARACTER MAIYAMOK
0x00e7: 0x0e47, # THAI CHARACTER MAITAIKHU
0x00e8: 0x0e48, # THAI CHARACTER MAI EK
0x00e9: 0x0e49, # THAI CHARACTER MAI THO
0x00ea: 0x0e4a, # THAI CHARACTER MAI TRI
0x00eb: 0x0e4b, # THAI CHARACTER MAI CHATTAWA
0x00ec: 0x0e4c, # THAI CHARACTER THANTHAKHAT
0x00ed: 0x0e4d, # THAI CHARACTER NIKHAHIT
0x00ee: 0x0e4e, # THAI CHARACTER YAMAKKAN
0x00ef: 0x0e4f, # THAI CHARACTER FONGMAN
0x00f0: 0x0e50, # THAI DIGIT ZERO
0x00f1: 0x0e51, # THAI DIGIT ONE
0x00f2: 0x0e52, # THAI DIGIT TWO
0x00f3: 0x0e53, # THAI DIGIT THREE
0x00f4: 0x0e54, # THAI DIGIT FOUR
0x00f5: 0x0e55, # THAI DIGIT FIVE
0x00f6: 0x0e56, # THAI DIGIT SIX
0x00f7: 0x0e57, # THAI DIGIT SEVEN
0x00f8: 0x0e58, # THAI DIGIT EIGHT
0x00f9: 0x0e59, # THAI DIGIT NINE
0x00fa: 0x0e5a, # THAI CHARACTER ANGKHANKHU
0x00fb: 0x0e5b, # THAI CHARACTER KHOMUT
0x00fc: None,
0x00fd: None,
0x00fe: None,
0x00ff: None,
})

### Encoding Map

encoding_map = codecs.make_encoding_map(decoding_map)
=== eof ===
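For anyone who wants to sanity-check the generated tables without installing the module, the same helpers the file relies on allow a quick round-trip (only the KO KAI entry is reproduced here):

```python
import codecs

decoding_map = codecs.make_identity_dict(range(256))
decoding_map.update({0x00a1: 0x0e01})  # THAI CHARACTER KO KAI
encoding_map = codecs.make_encoding_map(decoding_map)

text, _ = codecs.charmap_decode(b'\xa1', 'strict', decoding_map)
data, _ = codecs.charmap_encode(text, 'strict', encoding_map)
assert data == b'\xa1'  # decode followed by encode is lossless
```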

Romanian:
=== iso8859_16.py ===
""" Python Character Mapping Codec generated from '8859-16.TXT' with gencodec.py.

Written by Marc-Andre Lemburg ([email protected]).

(c) Copyright CNRI, All Rights Reserved. NO WARRANTY.
(c) Copyright 2000 Guido van Rossum.

"""#"

import codecs

### Codec APIs

class Codec(codecs.Codec):

    def encode(self,input,errors='strict'):

        return codecs.charmap_encode(input,errors,encoding_map)

    def decode(self,input,errors='strict'):

        return codecs.charmap_decode(input,errors,decoding_map)

class StreamWriter(Codec,codecs.StreamWriter):
    pass

class StreamReader(Codec,codecs.StreamReader):
    pass

### encodings module API

def getregentry():

    return (Codec().encode,Codec().decode,StreamReader,StreamWriter)

### Decoding Map

decoding_map = codecs.make_identity_dict(range(256))
decoding_map.update({
0x00a1: 0x0104, # LATIN CAPITAL LETTER A WITH OGONEK
0x00a2: 0x0105, # LATIN SMALL LETTER A WITH OGONEK
0x00a3: 0x0141, # LATIN CAPITAL LETTER L WITH STROKE
0x00a4: 0x20ac, # EURO SIGN
0x00a5: 0x201e, # DOUBLE LOW-9 QUOTATION MARK
0x00a6: 0x0160, # LATIN CAPITAL LETTER S WITH CARON
0x00a8: 0x0161, # LATIN SMALL LETTER S WITH CARON
0x00aa: 0x0218, # LATIN CAPITAL LETTER S WITH COMMA BELOW
0x00ac: 0x0179, # LATIN CAPITAL LETTER Z WITH ACUTE
0x00ae: 0x017a, # LATIN SMALL LETTER Z WITH ACUTE
0x00af: 0x017b, # LATIN CAPITAL LETTER Z WITH DOT ABOVE
0x00b2: 0x010c, # LATIN CAPITAL LETTER C WITH CARON
0x00b3: 0x0142, # LATIN SMALL LETTER L WITH STROKE
0x00b4: 0x017d, # LATIN CAPITAL LETTER Z WITH CARON
0x00b5: 0x201d, # RIGHT DOUBLE QUOTATION MARK
0x00b8: 0x017e, # LATIN SMALL LETTER Z WITH CARON
0x00b9: 0x010d, # LATIN SMALL LETTER C WITH CARON
0x00ba: 0x0219, # LATIN SMALL LETTER S WITH COMMA BELOW
0x00bc: 0x0152, # LATIN CAPITAL LIGATURE OE
0x00bd: 0x0153, # LATIN SMALL LIGATURE OE
0x00be: 0x0178, # LATIN CAPITAL LETTER Y WITH DIAERESIS
0x00bf: 0x017c, # LATIN SMALL LETTER Z WITH DOT ABOVE
0x00c3: 0x0102, # LATIN CAPITAL LETTER A WITH BREVE
0x00c5: 0x0106, # LATIN CAPITAL LETTER C WITH ACUTE
0x00d0: 0x0110, # LATIN CAPITAL LETTER D WITH STROKE
0x00d1: 0x0143, # LATIN CAPITAL LETTER N WITH ACUTE
0x00d5: 0x0150, # LATIN CAPITAL LETTER O WITH DOUBLE ACUTE
0x00d7: 0x015a, # LATIN CAPITAL LETTER S WITH ACUTE
0x00d8: 0x0170, # LATIN CAPITAL LETTER U WITH DOUBLE ACUTE
0x00dd: 0x0118, # LATIN CAPITAL LETTER E WITH OGONEK
0x00de: 0x021a, # LATIN CAPITAL LETTER T WITH COMMA BELOW
0x00e3: 0x0103, # LATIN SMALL LETTER A WITH BREVE
0x00e5: 0x0107, # LATIN SMALL LETTER C WITH ACUTE
0x00f0: 0x0111, # LATIN SMALL LETTER D WITH STROKE
0x00f1: 0x0144, # LATIN SMALL LETTER N WITH ACUTE
0x00f5: 0x0151, # LATIN SMALL LETTER O WITH DOUBLE ACUTE
0x00f7: 0x015b, # LATIN SMALL LETTER S WITH ACUTE
0x00f8: 0x0171, # LATIN SMALL LETTER U WITH DOUBLE ACUTE
0x00fd: 0x0119, # LATIN SMALL LETTER E WITH OGONEK
0x00fe: 0x021b, # LATIN SMALL LETTER T WITH COMMA BELOW
})

### Encoding Map

encoding_map = codecs.make_encoding_map(decoding_map)
=== eof ===
 
Martin v. Löwis

Peter said:
The remaining question, and it seems the more difficult one, is a
question of process. Whether and how to add these to the normal
Python distribution.

The process is actually very easy. Anybody willing to contribute them
would have to upload them to SF (sf.net/projects/python).

Regards,
Martin
 
Peter Jacobi

Hi Martin, All,

Martin v. Löwis said:
The process is actually very easy. Anybody willing to contribute them
would have to upload them to SF (sf.net/projects/python).

Perhaps I have just misunderstood your email. I read it this way (in my own words):

Taking into account unanswered questions about ISO 8859-11 and TIS620,
whoever wants to contribute has to do further research, starting with,
but not limited to, buying the ISO standard.

The prospective contributor in addition has to provide support for this
patch and answer all questions about the details involved.

Sorry, this is at the moment out of scope for me. I have a patch, using
information from a source which is reliable enough for my personal
requirements, and the patch is now available on USENET for everyone
who wants to investigate further.

Regards,
Peter Jacobi
 
Martin v. Löwis

Peter said:
Perhaps I have just misunderstood your email. I read it this way (in my own words):

[snipped]
No - this is indeed my view on the issue. However, this is a technical
view; the *process* is completely independent, and very
straightforward. Submit the patch to SF, and somebody (probably Marc-Andre
Lemburg) will review it. The reviewer might ask questions or request
further changes (such as adding documentation); then the patch gets
accepted or rejected.

I know that *I* would ask questions as to why the submitter thinks the
patch is correct, and I would request that the submitter commits to
maintaining the patch. If you are unwilling to make such a commitment,
I can understand that - it just means that Python 2.4 might not have
these codecs (and we haven't discussed the 8859-16 at all).

Regards,
Martin
 
Peter Jacobi

I've added an entry in the RFE tracker at http://sf.net/projects/python

Regarding the correctness doubts, I can provide these three points so far:

a) ISO 8859-n vs ISO-8859-n
If the information at
http://en.wikipedia.org/wiki/ISO_8859-1#ISO_8859-1_vs_ISO-8859-1
is correct, Python 8859-n
codecs do implement the ISO standard charsets ISO 8859-n
in the specialized IANA forms ISO-8859-n (and in agreement
with the Unicode mapping files). So any difficult C0/C1
wording in the original ISO standard can be disregarded.
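Point a) is easy to verify against the shipped codecs; in a current Python (which by now ships an iso8859-11 codec generated from the Unicode mapping file) both the C0 and the disputed C1 range decode as identity:

```python
# C0 is 0x00..0x1F; the disputed C1 range is 0x80..0x9F.
controls = bytes(range(0x00, 0x20)) + bytes(range(0x80, 0xA0))
expected = ''.join(chr(b) for b in controls)

for name in ('iso8859-1', 'iso8859-11'):
    assert controls.decode(name) == expected  # identity pass-through
```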

b) libiconv ISO 8859-11
The implementation by Bruno Haible in libiconv does agree
with the Unicode mapping file:
http://cvs.sourceforge.net/viewcvs.py/libiconv/libiconv/lib/

c) IBM ICU4C
The implementation in ICU4C does agree with the Unicode
mapping file:
http://oss.software.ibm.com/cvs/icu/charset/data/ucm/

Regards,
Peter Jacobi
 
Martin v. Löwis

Peter said:
a) ISO 8859-n vs ISO-8859-n
If the information at
http://en.wikipedia.org/wiki/ISO_8859-1#ISO_8859-1_vs_ISO-8859-1
is correct, Python 8859-n
codecs do implement the ISO standard charsets ISO 8859-n
in the specialized IANA forms ISO-8859-n (and in agreement
with the Unicode mapping files). So any difficult C0/C1
wording in the original ISO standard can be disregarded.

I see. According to RFC 1345, this is definitely the case
for ISO-8859-1. ISO-8859-16 is not defined in an RFC, but
in

http://www.iana.org/assignments/charset-reg/ISO-8859-16

This is a confusing document, as it both refers to ISO/IEC
8859-16:2001 (no control characters), and the Unicode character
map (with control characters). We might interpret this as a
mistake, and assume that it was intended to include control
characters (as all the other ISO-8859-n).

For ISO-8859-11, the situation is even more confusing, as it is not a
registered IANA character set, according to

http://www.iana.org/assignments/character-sets

Therefore, strictly speaking, it would be a protocol violation to use
iso-8859-11 in, say, a MIME charset= header.

Regards,
Martin
 
Peter Jacobi

Hi Martin, All,

Martin v. Löwis said:
Therefore, strictly speaking, it would be a protocol violation to use
iso-8859-11 in, say, a MIME charset= header.

Strictly speaking, there are some more dark corners to check.
All ISO charsets should, strictly speaking, be qualified by year. And
in fact there have been some prominent changes, e.g. in 8859-7 (Greek).
What to do about them?

Looking around:
- the RFC references a single fixed (older) version
- Unicode mapping files and libiconv track the newest version
- IBM ICU4C provides all versions
- Python (not by planning, I assume) has a "middle" version with
some features of the old mapping table (no currency signs) and some
features of the new (0xA1=0x2018, 0xA2=0x2019)

Weird.
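The observation about Python's "middle" 8859-7 table can be reproduced directly; note that in a current Python the table has since been regenerated, so the currency positions may behave differently than in a 2004-era build:

```python
# 0xA1/0xA2 are the single quotation marks in every table revision Python
# has shipped:
assert b'\xa1\xa2'.decode('iso8859-7') == '\u2018\u2019'

# Whether 0xA4/0xA5 decode (euro and drachma signs, added in the 2003
# revision) depends on which mapping-table revision the codec was built from:
try:
    print(b'\xa4\xa5'.decode('iso8859-7'))
except UnicodeDecodeError:
    print('codec built from a pre-2003 mapping table')
```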

Best Regards,
Peter Jacobi
 
Martin v. Löwis

Peter said:
Looking around:
- the RFC references a single fixed (older) version
- Unicode mapping files and libiconv track the newest version
- IBM ICU4C provides all versions
- Python (not by planning, I assume) has a "middle" version with
some features of the old mapping table (no currency signs) and some
features of the new (0xA1=0x2018, 0xA2=0x2019)

Indeed. Adding new codecs is not a matter of just compiling a few files
that somebody else has produced; it requires a lot of expertise.
I would therefore have preferred it if Python had not included any
codecs, but instead relied on the codecs that come with the platform
(e.g. iconv on Unix, IE DLLs on Windows).

Now, things came out differently, and we are now in charge of
maintaining what we got. This requires great care, and expert volunteers
are always welcome. Unfortunately, in the Unicode/character sets/l10n
world, there is no one true way, so experts need to stand up and voice
their opinion, hoping that contributors become at least aware of the
issues.

In the specific case of ISO-8859-7, I was until just now unaware of the
issue - I would not have guessed that ISO ever dared to change a part
of 8859. If this is ever going to be changed, I would suggest the
following approach:
- provide two encodings, ISO-8859-7:1987 and ISO-8859-7:2003. Without
checking, I would hope that the version in RFC 1345 is identical with
8859-7:1987.
- make ISO-8859-7 an alias for ISO-8859-7:1987.
Of course, somebody should really talk to IANA and come up with a
preferred MIME name. Apparently, ISO-8859-7-EURO and ISO-8859-7-2003
have been proposed.

Regards,
Martin
 
Martin v. Löwis

Peter said:
a) ISO 8859-n vs ISO-8859-n
If the information at
http://en.wikipedia.org/wiki/ISO_8859-1#ISO_8859-1_vs_ISO-8859-1
is correct, Python 8859-n
codecs do implement the ISO standard charsets ISO 8859-n
in the specialized IANA forms ISO-8859-n (and in agreement
with the Unicode mapping files). So any difficult C0/C1
wording in the original ISO standard can be disregarded.

I have just asked Markus Kuhn about this, who has registered
ISO-8859-16 with IANA. He believes that his registration does
not include control characters (neither C0 nor C1), just as
the ISO standard does not contain any. Wrt. RFC 1345 he points
out that this is not an Internet Standard, but a private
collection of Keld Simonsen, i.e. it is not binding.

Regards,
Martin
 
