Hi Michael,
Processing LDIF is one thing, doing LDAP operations another.
LDIF itself is meant to be ASCII-clean. But each attribute value can carry any
byte sequence (e.g. attribute 'jpegPhoto'). There's no further processing by
the ldif module - it simply returns byte sequences.
The access protocol LDAPv3 mandates UTF-8 encoding for Unicode strings on the
wire if attribute syntax is DirectoryString, IA5String (mainly ASCII) or similar.
So if your LDIF input returns UTF-16 encoded attribute values for e.g.
attribute 'cn' or 'o' or another attribute not being of OctetString or Binary
syntax, something's wrong with the producer of the LDIF data.
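A rough way to check what a parsed value actually contains is sketched below.
The function name describe_value is made up for illustration and is not part of
the ldif module; the NUL-byte test is only a heuristic for spotting UTF-16, not
a reliable detector.

```python
import base64


def describe_value(raw):
    """Classify a raw attribute value (a byte string) by its likely encoding.

    Hypothetical helper, not part of python-ldap.
    """
    # UTF-16 text usually starts with a BOM or contains NUL bytes,
    # which should not occur in UTF-8 encoded directory strings.
    if raw.startswith((b'\xff\xfe', b'\xfe\xff')) or b'\x00' in raw:
        return 'possibly UTF-16'
    try:
        raw.decode('utf-8')
    except UnicodeDecodeError:
        return 'not UTF-8'
    return 'UTF-8'


# Example with a base64 value as it would appear after '::' in an LDIF file.
raw_value = base64.b64decode('T3R0by1NZcOfbWVyLVN0cmHDn2UgMQ==')
print(describe_value(raw_value))
```

A value that comes back as 'possibly UTF-16' for an attribute like 'cn' would
point at the tool that produced the LDIF rather than at the parser.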
That could be. I am using MS's ldifde.exe to dump a Domino and an AD directory for
comparative processing. The problem is I don't have much control over the data in
the directory and I do know that DNs have non-ASCII characters unique to the
I wonder what the string really is. At least the base64-encoding you provided
before decodes as UTF-8 but I'm not sure whether it's the right sequence of
Unicode code points you're expecting.
u'det\\3310wbb\\pg'
I still can't figure out what you're really doing though. I'd recommend
stripping down your operations to a very simple test code snippet illustrating
the issue and posting that here.
So I have removed all my likely broken attempts at working with this data and will
soon have some simple code, but at this point I may have an indication of what is
awry with my data.
After parsing the data for a user, I am simply taking a value from the LDIF file and
writing it back out to another file, which fails. The value parsed is:
officestreetaddress:: T3R0by1NZcOfbWVyLVN0cmHDn2UgMQ==
  File "C:\Python27\lib\site-packages\ldif.py", line 202, in unparse
    self._unparseChangeRecord(record)
  File "C:\Python27\lib\site-packages\ldif.py", line 181, in _unparseChangeRecord
    self._unparseAttrTypeandValue(mod_type,mod_val)
  File "C:\Python27\lib\site-packages\ldif.py", line 142, in _unparseAttrTypeandValue
    self._unfoldLDIFLine(':: '.join([attr_type,base64.encodestring(attr_value).replace('\n','')]))
  File "C:\Python27\lib\base64.py", line 315, in encodestring
    pieces.append(binascii.b2a_base64(chunk))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xdf' in position 7: ordinal not in range(128)
c:\python27\lib\base64.py(315)encodestring()
-> pieces.append(binascii.b2a_base64(chunk))
(Pdb) l
310     def encodestring(s):
311         """Encode a string into multiple lines of base-64 data."""
312         pieces = []
313         for i in range(0, len(s), MAXBINSIZE):
314             chunk = s[i : i + MAXBINSIZE]
315  ->         pieces.append(binascii.b2a_base64(chunk))
316         return "".join(pieces)
317
318
319     def decodestring(s):
320         """Decode a string."""
(Pdb) args
s = Otto-Meßmer-Straße 1
So moving up a frame or two and looking at the entry dict, I see a modlist entry of:
('streetAddress', [u'Otto-Me\xdfmer-Stra\xdfe 1']) which is correct:
In [2]: 'T3R0by1NZcOfbWVyLVN0cmHDn2UgMQ=='.decode('base64').decode('utf-8')
Out[2]: u'Otto-Me\xdfmer-Stra\xdfe 1'
Looking at the stack trace, I think I see the issue:
(Pdb) import base64
(Pdb) base64.encodestring(u'Otto-Me\xdfmer-Stra\xdfe 1'.encode('utf-8')).replace('\n','')
'T3R0by1NZcOfbWVyLVN0cmHDn2UgMQ=='
I now have exactly the value I started with. Two changes have eliminated all the
errors: making sure that wherever I handle the original values I return UTF-8
decoded objects for use in a modlist to write later, and subclassing LDIFWriter to
override _unparseAttrTypeandValue so that it does the encoding.
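The encoding step can be sketched on its own, without touching the ldif module.
This is only an illustration of the idea, not the actual subclass: the helper
names encode_values and to_ldif_line are hypothetical, and the ':: ' join simply
mirrors the line shown in the traceback.

```python
import base64


def encode_values(values, encoding='utf-8'):
    """Return the values with any Unicode text items encoded to bytes.

    Hypothetical helper; byte strings are passed through unchanged.
    """
    text_type = type(u'')
    return [v.encode(encoding) if isinstance(v, text_type) else v
            for v in values]


def to_ldif_line(attr_type, value):
    """Build a base64 'attr:: value' line from a byte string value."""
    # base64 works on bytes, so the value must already be encoded here.
    b64 = base64.b64encode(value).decode('ascii')
    return ':: '.join([attr_type, b64])


# The modlist entry seen in the debugger session.
mod_type, mod_vals = ('streetAddress', [u'Otto-Me\xdfmer-Stra\xdfe 1'])
for val in encode_values(mod_vals):
    print(to_ldif_line(mod_type, val))
```

Encoding just before the base64 step keeps the rest of the code working with
Unicode objects and confines byte handling to the point of output.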
What finally remains is ldifde.exe's output of what looks like U+00BF, an inverted
question mark, for some values. Otherwise this issue looks solved.
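To see which values are affected, something like the following could be run over
each parsed entry. It is a sketch that assumes the marks really are literal
U+00BF characters and that an entry is a dict mapping attribute names to lists
of values; the function name is made up for illustration.

```python
def values_with_inverted_question_mark(entry):
    """Return (attr, text) pairs whose value contains U+00BF.

    Hypothetical helper; 'entry' maps attribute names to lists of values.
    """
    hits = []
    for attr, values in sorted(entry.items()):
        for v in values:
            # Byte strings are assumed to be UTF-8; undecodable bytes are replaced.
            text = v.decode('utf-8', 'replace') if isinstance(v, bytes) else v
            if u'\xbf' in text:
                hits.append((attr, text))
    return hits


# Made-up sample entry: b'\xc2\xbf' is the UTF-8 encoding of U+00BF.
sample = {'cn': [b'Jos\xc2\xbf'], 'o': [u'Example']}
print(values_with_inverted_question_mark(sample))
```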
Thanks for everything,
jlc