Newbie problem with codecs


derek / nul

Good people

I am writing my first python program, so be gentle :)
This program needs to read and write UTF-16LE files.

When I run the program I get the following error

Traceback (most recent call last):
File "apply_physics.py", line 12, in ?
codecs.lookup(BOM_UTF16_LE)
NameError: name 'BOM_UTF16_LE' is not defined



Could someone point to my mistake please?

Derek


==============================
#!c:/program files/python/python.exe
#
# win32 python 2.3

import sys, string
import codecs

codecs.lookup(BOM_UTF16_LE)
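
For reference, a minimal sketch (not from the thread) of why the NameError appears:
BOM_UTF16_LE lives inside the codecs module, so the bare name is undefined unless it
is qualified with the module or imported explicitly.

import codecs

print repr(codecs.BOM_UTF16_LE)   # qualify the name with its module: '\xff\xfe'

from codecs import BOM_UTF16_LE   # ...or pull the constant into the local namespace
print repr(BOM_UTF16_LE)

As the replies below explain, this only cures the NameError; the value is a byte-order
mark, not an encoding name, so it is still the wrong argument for codecs.lookup().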
 

Alex Martelli

derek / nul wrote:
...
Traceback (most recent call last):
File "apply_physics.py", line 12, in ?
codecs.lookup(BOM_UTF16_LE)
NameError: name 'BOM_UTF16_LE' is not defined

Could someone point to my mistake please?

Change the statement to:

codecs.lookup(codecs,BOM_UTF16_LE)


Alex
 

Andrew Dalke

derek said:
Alex
Change the statement to:
codecs.lookup(codecs,BOM_UTF16_LE)

Typo? Shouldn't that be a "." instead of a "."?

In any case
Traceback (most recent call last):
File "<interactive input>", line 1, in ?
File "C:\PYTHON23\Lib\encodings\__init__.py", line 84, in search_function
globals(), locals(), _import_tail)
ValueError: Empty module name
In any case, the "BOM" means "byte order marker" and
the constant is the string prefix used to indicate which
UTF16 encoding is used. It isn't the encoding name.

Perhaps the following is what the OP wanted?
(<built-in function utf_16_le_encode>, <built-in function utf_16_le_decode>,
<class encodings.utf_16_le.StreamReader at 0x01396840>, <class
encodings.utf_16_le.StreamWriter at 0x01396810>)

But I am not Martin. ;)

Andrew
(e-mail address removed)
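
A short sketch (Python 2.3, as in the thread) of the distinction being made here: the
BOM constant is data that may appear at the front of a file, while lookup() wants the
name of a codec.

import codecs

print repr(codecs.BOM_UTF16_LE)   # '\xff\xfe' -- two bytes of data, not a codec name
print codecs.lookup("utf-16-le")  # an encoding name; on 2.3 this returns the
                                  # (encoder, decoder, StreamReader, StreamWriter)
                                  # tuple shown above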
 

Alex Martelli

Andrew Dalke wrote:
...
Typo? Shouldn't that be a "." instead of a "."?

A dot, not a comma -- sorry, the font I use makes them hard
to tell apart (at least w/my failing eyesight) and they're right
next to each other on the keyboard... which I guess is why I
think you have dots on both sides of "instead" in your phrase?-).

In any case

Traceback (most recent call last):
File "<interactive input>", line 1, in ?
File "C:\PYTHON23\Lib\encodings\__init__.py", line 84, in
search_function
globals(), locals(), _import_tail)
ValueError: Empty module name

In any case, the "BOM" means "byte order marker" and
the constant is the string prefix used to indicate which
UTF16 encoding is used. It isn't the encoding name.

Perfectly right -- codecs.BOM_UTF16_LE is just a 2-character
string which doesn't name a codec (but rather gives the BOM
for one). I saw that obvious-cause NameError and didn't look
any deeper -- thanks for doing so.

Perhaps the following is what the OP wanted?

(<built-in function utf_16_le_encode>, <built-in function utf_16_le_decode>, ...

I won't dare guess, but it's certainly one possibility.


Alex
 

derek / nul

Typo? Shouldn't that be a "." instead of a "."?

In any case

Traceback (most recent call last):
File "<interactive input>", line 1, in ?
File "C:\PYTHON23\Lib\encodings\__init__.py", line 84, in search_function
globals(), locals(), _import_tail)
ValueError: Empty module name

In any case, the "BOM" means "byte order marker" and
the constant is the string prefix used to indicate which
UTF16 encoding is used. It isn't the encoding name.

Perhaps the following is what the OP wanted?

(<built-in function utf_16_le_encode>, <built-in function utf_16_le_decode>,
<class encodings.utf_16_le.StreamReader at 0x01396840>, <class
encodings.utf_16_le.StreamWriter at 0x01396810>)

Andrew, this is what I was expecting, but my system does not do it.

codecs.lookup("utf-16-le")

this is the code cut from my program, but there is NO output from my program.

I am using www.python.org/doc/lib/module-codecs.html (4.9 codecs -- Codec
registry and base classes).

Should I assume that the codecs module is not working?

Derek
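
A guess at what is going on here, sketched for clarity: codecs.lookup() is not broken,
it simply returns a value. The interactive prompt echoes the result automatically, but
inside a script nothing appears unless the result is printed or used.

import codecs

entry = codecs.lookup("utf-16-le")   # returns the codec entry; produces no output by itself
print entry                          # in a script, print (or otherwise use) the result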
 

derek / nul

Still, this might help. Suppose you wanted to read from a utf-16-le
encoded file and write to a utf-8 encoded file. You can do

Very close, I want to read a utf16le into memory, convert to text, change 100
lines in the file, convert back to utf16le and write back to disk.
The other option is to do the conversion through strings
instead of through files.

# s = "....some set of bytes with your utf-16 in it .."
s = open("input.utf16", "rb").read() # the whole file

# convert to unicode, given the encoding
t = unicode(s, "utf-16-le")

# convert to utf-8 encoding
s2 = t.encode("utf-8")

open("output.utf8", "rb").write(s2)

My code so far
-------------------------------------------
import codecs
codecs.lookup("utf-16-le")
eng_file = open("c:/program files/microsoft games/train
simulator/trains/trainset/dash9/dash9.eng", "rb").read() # read the whole file

t = unicode(eng_file, "utf-16-le")
print t
-----------------------------------------------------

The print fails (as expected) with a non printing char '\ufeff' which is of
course the BOM.
Is there a nice way to strip off the BOM?

At the line where the conversion to utf-8 is done, I would like to convert to plain
text instead, but I cannot find a built-in command.

Many thanks so far
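
A small sketch of one way to drop the BOM, building on the code above (t is already
"text" in the sense of a unicode string; splitlines() gives the individual lines):

t = unicode(eng_file, "utf-16-le")
if t.startswith(u'\ufeff'):
    t = t[1:]                 # after decoding, the BOM is a single character
lines = t.splitlines(True)    # keep the line endings for writing back later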
 

Andrew Dalke

derek / nul
My code so far ...
t = unicode(eng_file, "utf-16-le")
print t
-----------------------------------------------------

The print fails (as expected) with a non printing char '\ufeff' which is of
course the BOM.
Is there a nice way to strip off the BOM?

How does it fail? It may be because print tries to convert the
data as appropriate for your IDE or terminal, and fails. Eg, the
default expects ASCII. See

http://www.python.org/cgi-bin/faqw.py?req=show&file=faq04.102.htp

As a guess, since you're on MS Windows, your terminal might
expect mbcs. Try

print t.encode('mbcs')

If you really want to strip it off, do t[2:] (or [4:]?), to get the
string after the first 2/4 characters (the BOM) in the string. But
I doubt that's the correct solution.

Andrew
(e-mail address removed)
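
One detail worth pinning down with a sketch: by the time the bytes have been decoded,
the BOM is a single unicode character, so the slice would be t[1:] rather than t[2:].
Combining that with the mbcs suggestion (a Windows console is assumed, as in the thread):

if t[:1] == u'\ufeff':
    t = t[1:]                        # the decoded BOM occupies one character
print t.encode('mbcs', 'replace')    # 'replace' substitutes '?' for characters the
                                     # console codepage cannot represent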
 

derek / nul


How does it fail?

File "apply_physics.py", line 21, in ?
print t
File "C:\Program Files\Python\lib\encodings\cp850.py", line 18, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character '\ufeff' in position
0: character maps to said:
It may be because print tries to convert the
data as appropriate for your IDE or terminal, and fails. Eg, the
default expects ASCII. See

http://www.python.org/cgi-bin/faqw.py?req=show&file=faq04.102.htp

As a guess, since you're on MS Windows, your terminal might
expect mbcs. Try

print t.encode('mbcs')

If you really want to strip it off, do t[2:] (or [4:]?), to get the
string after the first 2/4 characters (the BOM) in the string. But
I doubt that's the correct solution.

Andrew
(e-mail address removed)
 

derek / nul

Andrew Dalke said:

I don't know enough to handle this problem. Anyone else care to try?

Andrew,

I am not concerned about that problem.
I need a pointer to converting utf-16-le to text

Derek
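
A hedged sketch of the read/edit/write round trip described earlier in the thread
(the path and the editing step are placeholders). Because the file starts with a BOM,
the "utf-16" codec is used, so the BOM is consumed on reading and written back on output:

import codecs

path = "dash9.eng"   # placeholder for the real path

infile = codecs.open(path, "rb", "utf-16")   # decodes to unicode, drops the BOM
lines = infile.read().splitlines(True)       # keep the line endings
infile.close()

# ... change the lines that need changing here ...

outfile = codecs.open(path, "wb", "utf-16")  # the utf-16 writer emits a BOM first
outfile.writelines(lines)                    # on a little-endian machine this gives
outfile.close()                              # the original UTF-16LE-with-BOM layout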
 

Mike Brown

derek / nul said:
Very close, I want to read a utf16le into memory, convert to text, change 100
lines in the file, convert back to utf16le and write back to disk.


My code so far
-------------------------------------------
import codecs
codecs.lookup("utf-16-le")
eng_file = open("c:/program files/microsoft games/train
simulator/trains/trainset/dash9/dash9.eng", "rb").read() # read the whole file

t = unicode(eng_file, "utf-16-le")
print t
-----------------------------------------------------

The print fails (as expected) with a non printing char '\ufeff' which is of
course the BOM.
Is there a nice way to strip off the BOM?

derek / nul said:
I need a pointer to converting utf-16-le to text

If there is a BOM, then it is not UTF-16LE; it is UTF-16.
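
In other words (a minimal sketch, reusing the byte string read earlier in the thread):

t = unicode(eng_file, "utf-16")     # reads the BOM, picks the byte order, drops it
t2 = unicode(eng_file, "utf-16-le") # treats the BOM as data, so t2 starts with u'\ufeff'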
 

derek / nul

If there is a BOM, then it is not UTF-16LE; it is UTF-16.

This paragraph is from http://www.egenix.com/files/python/unicode-proposal.txt

It explains the difference between utf-16-le and utf-16-be




Standard Codecs:
----------------

Standard codecs should live inside an encodings/ package directory in the
Standard Python Code Library. The __init__.py file of that directory should
include a Codec Lookup compatible search function implementing a lazy module
based codec lookup.

Python should provide a few standard codecs for the most relevant
encodings, e.g.

'utf-8': 8-bit variable length encoding
'utf-16': 16-bit variable length encoding (little/big endian)
'utf-16-le': utf-16 but explicitly little endian
'utf-16-be': utf-16 but explicitly big endian
'ascii': 7-bit ASCII codepage
'iso-8859-1': ISO 8859-1 (Latin 1) codepage
'unicode-escape': See Unicode Constructors for a definition
'raw-unicode-escape': See Unicode Constructors for a definition
'native': Dump of the Internal Format used by Python

Common aliases should also be provided per default, e.g. 'latin-1'
for 'iso-8859-1'.

Note: 'utf-16' should be implemented by using and requiring byte order
marks (BOM) for file input/output.

All other encodings such as the CJK ones to support Asian scripts
should be implemented in separate packages which do not get included
in the core Python distribution and are not a part of this proposal.
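
The practical difference between the three UTF-16 codecs, sketched on the encoding
side (byte values shown for a little-endian machine such as the OP's):

u = u"A"
print repr(u.encode("utf-16"))     # '\xff\xfeA\x00' -- BOM plus native byte order
print repr(u.encode("utf-16-le"))  # 'A\x00'         -- explicit little endian, no BOM
print repr(u.encode("utf-16-be"))  # '\x00A'         -- explicit big endian, no BOM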
 
