Newbie problem with codecs


derek / nul

Good people

I am writing my first python program, so be gentle :)
This program needs to read and write UTF-16LE files.

When I run the program I get the following error

Traceback (most recent call last):
File "apply_physics.py", line 12, in ?
codecs.lookup(BOM_UTF16_LE)
NameError: name 'BOM_UTF16_LE' is not defined



Could someone point to my mistake please?

Derek


==============================
#!c:/program files/python/python.exe
#
# win32 python 2.3

import sys, string
import codecs

codecs.lookup(BOM_UTF16_LE)
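
For reference, a minimal sketch (not from the thread) of why the NameError appears:
BOM_UTF16_LE lives inside the codecs module, so the bare name is undefined unless it
is qualified with the module or imported explicitly.

import codecs

print repr(codecs.BOM_UTF16_LE)   # qualify the name with its module: '\xff\xfe'

from codecs import BOM_UTF16_LE   # ...or pull the constant into the local namespace
print repr(BOM_UTF16_LE)

As the replies below explain, this only cures the NameError; the value is a byte-order
mark, not an encoding name, so it is still the wrong argument for codecs.lookup().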
 

Alex Martelli

derek / nul wrote:
...
Traceback (most recent call last):
File "apply_physics.py", line 12, in ?
codecs.lookup(BOM_UTF16_LE)
NameError: name 'BOM_UTF16_LE' is not defined

Could someone point to my mistake please?

Change the statement to:

codecs.lookup(codecs,BOM_UTF16_LE)


Alex
 

Andrew Dalke

derek said:
Alex
Change the statement to:
codecs.lookup(codecs,BOM_UTF16_LE)

Typo? Shouldn't that be a "." instead of a "."?

In any case
Traceback (most recent call last):
File "<interactive input>", line 1, in ?
File "C:\PYTHON23\Lib\encodings\__init__.py", line 84, in search_function
globals(), locals(), _import_tail)
ValueError: Empty module name
In any case, the "BOM" means "byte order marker" and
the constant is the string prefix used to indicate which
UTF16 encoding is used. It isn't the encoding name.

Perhaps the following is what the OP wanted?
(<built-in function utf_16_le_encode>, <built-in function utf_16_le_decode>,
<class encodings.utf_16_le.StreamReader at 0x01396840>, <class
encodings.utf_16_le.StreamWriter at 0x01396810>)

But I am not Martin. ;)

Andrew
(e-mail address removed)
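
A short sketch (Python 2.3, as in the thread) of the distinction being made here: the
BOM constant is data that may appear at the front of a file, while lookup() wants the
name of a codec.

import codecs

print repr(codecs.BOM_UTF16_LE)   # '\xff\xfe' -- two bytes of data, not a codec name
print codecs.lookup("utf-16-le")  # an encoding name; on 2.3 this returns the
                                  # (encoder, decoder, StreamReader, StreamWriter)
                                  # tuple shown above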
 

Alex Martelli

Andrew Dalke wrote:
...
Typo? Shouldn't that be a "." instead of a "."?

A dot, not a comma -- sorry, the font I use makes them hard
to tell apart (at least w/my failing eyesight) and they're right
next to each other on the keyboard... which I guess is why I
think you have dots on both sides of "instead" in your phrase?-).

In any case

Traceback (most recent call last):
File "<interactive input>", line 1, in ?
File "C:\PYTHON23\Lib\encodings\__init__.py", line 84, in
search_function
globals(), locals(), _import_tail)
ValueError: Empty module name

In any case, the "BOM" means "byte order marker" and
the constant is the string prefix used to indicate which
UTF16 encoding is used. It isn't the encoding name.

Perfectly right -- codecs.BOM_UTF16_LE is just a 2-character
string which doesn't name a codec (but rather gives the BOM
for one). I saw that obvious-cause NameError and didn't look
any deeper -- thanks for doing so.

Perhaps the following is what the OP wanted?

(<built-in function utf_16_le_encode>, <built-in function utf_16_le_decode>, ...

I won't dare guess, but it's certainly one possibility.


Alex
 

derek / nul

Typo? Shouldn't that be a "." instead of a "."?

In any case

Traceback (most recent call last):
File "<interactive input>", line 1, in ?
File "C:\PYTHON23\Lib\encodings\__init__.py", line 84, in search_function
globals(), locals(), _import_tail)
ValueError: Empty module name

In any case, the "BOM" means "byte order marker" and
the constant is the string prefix used to indicate which
UTF16 encoding is used. It isn't the encoding name.

Perhaps the following is what the OP wanted?

(<built-in function utf_16_le_encode>, <built-in function utf_16_le_decode>,
<class encodings.utf_16_le.StreamReader at 0x01396840>, <class
encodings.utf_16_le.StreamWriter at 0x01396810>)

Andrew, this is what I was expecting, but my system does not do it.

codecs.lookup("utf-16-le")

this is the code cut from my program, but there is NO output from my program.

I am using www.python.org/doc/lib/module-codecs.html (4.9 codecs -- Codec
registry and base classes).

Should I assume that the codecs module is not working?

Derek
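
A guess at what is going on here, sketched for clarity: codecs.lookup() is not broken,
it simply returns a value. The interactive prompt echoes the result automatically, but
inside a script nothing appears unless the result is printed or used.

import codecs

entry = codecs.lookup("utf-16-le")   # returns the codec entry; produces no output by itself
print entry                          # in a script, print (or otherwise use) the result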
 

derek / nul

Still, this might help. Suppose you wanted to read from a utf-16-le
encoded file and write to a utf-8 encoded file. You can do

Very close, I want to read a utf16le into memory, convert to text, change 100
lines in the file, convert back to utf16le and write back to disk.
The other option is to do the conversion through strings
instead of through files.

# s = "....some set of bytes with your utf-16 in it .."
s = open("input.utf16", "rb").read() # the whole file

# convert to unicode, given the encoding
t = unicode(s, "utf-16-le")

# convert to utf-8 encoding
s2 = t.encode("utf-8")

open("output.utf8", "rb").write(s2)

My code so far
-------------------------------------------
import codecs
codecs.lookup("utf-16-le")
eng_file = open("c:/program files/microsoft games/train
simulator/trains/trainset/dash9/dash9.eng", "rb").read() # read the whole file

t = unicode(eng_file, "utf-16-le")
print t
-----------------------------------------------------

The print fails (as expected) with a non printing char '\ufeff' which is of
course the BOM.
Is there a nice way to strip off the BOM?

At the line where the conversion to utf-8 is done, I would like to convert to plain
text instead, but I cannot find a built-in command.

Many thanks so far
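
A small sketch of one way to drop the BOM, building on the code above (t is already
"text" in the sense of a unicode string; splitlines() gives the individual lines):

t = unicode(eng_file, "utf-16-le")
if t.startswith(u'\ufeff'):
    t = t[1:]                 # after decoding, the BOM is a single character
lines = t.splitlines(True)    # keep the line endings for writing back later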
 

Andrew Dalke

derek / nul
My code so far ...
t = unicode(eng_file, "utf-16-le")
print t
-----------------------------------------------------

The print fails (as expected) with a non printing char '\ufeff' which is of
course the BOM.
Is there a nice way to strip off the BOM?

How does it fail? It may be because print tries to convert the
data as appropriate for your IDE or terminal, and fails. Eg, the
default expects ASCII. See

http://www.python.org/cgi-bin/faqw.py?req=show&file=faq04.102.htp

As a guess, since you're on MS Windows, your terminal might
expect mbcs. Try

print t.encode('mbcs')

If you really want to strip it off, do t[2:] (or [4:]?), to get the
string after the first 2/4 characters (the BOM) in the string. But
I doubt that's the correct solution.

Andrew
(e-mail address removed)
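
One detail worth pinning down with a sketch: by the time the bytes have been decoded,
the BOM is a single unicode character, so the slice would be t[1:] rather than t[2:].
Combining that with the mbcs suggestion (a Windows console is assumed, as in the thread):

if t[:1] == u'\ufeff':
    t = t[1:]                        # the decoded BOM occupies one character
print t.encode('mbcs', 'replace')    # 'replace' substitutes '?' for characters the
                                     # console codepage cannot represent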
 

derek / nul


How does it fail?

File "apply_physics.py", line 21, in ?
print t
File "C:\Program Files\Python\lib\encodings\cp850.py", line 18, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character '\ufeff' in position
0: character maps to said:
It may be because print tries to convert the
data as appropriate for your IDE or terminal, and fails. Eg, the
default expects ASCII. See

http://www.python.org/cgi-bin/faqw.py?req=show&file=faq04.102.htp

As a guess, since you're on MS Windows, your terminal might
expect mbcs. Try

print t.encode('mbcs')

If you really want to strip it off, do t[2:] (or [4:]?), to get the
string after the first 2/4 characters (the BOM) in the string. But
I doubt that's the correct solution.

Andrew
(e-mail address removed)
 

derek / nul

Andrew Dalke said:

I don't know enough to handle this problem. Anyone else care to try?

Andrew,

I am not concerned about that problem.
I need a pointer to converting utf-16-le to text

Derek
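
A hedged sketch of the read/edit/write round trip described earlier in the thread
(the path and the editing step are placeholders). Because the file starts with a BOM,
the "utf-16" codec is used, so the BOM is consumed on reading and written back on output:

import codecs

path = "dash9.eng"   # placeholder for the real path

infile = codecs.open(path, "rb", "utf-16")   # decodes to unicode, drops the BOM
lines = infile.read().splitlines(True)       # keep the line endings
infile.close()

# ... change the lines that need changing here ...

outfile = codecs.open(path, "wb", "utf-16")  # the utf-16 writer emits a BOM first
outfile.writelines(lines)                    # on a little-endian machine this gives
outfile.close()                              # the original UTF-16LE-with-BOM layout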
 

Mike Brown

derek / nul said:
Very close, I want to read a utf16le into memory, convert to text, change 100
lines in the file, convert back to utf16le and write back to disk.


My code so far
-------------------------------------------
import codecs
codecs.lookup("utf-16-le")
eng_file = open("c:/program files/microsoft games/train
simulator/trains/trainset/dash9/dash9.eng", "rb").read() # read the whole file

t = unicode(eng_file, "utf-16-le")
print t
-----------------------------------------------------

The print fails (as expected) with a non printing char '\ufeff' which is of
course the BOM.
Is there a nice way to strip off the BOM?

derek / nul said:
I need a pointer to converting utf-16-le to text

If there is a BOM, then it is not UTF-16LE; it is UTF-16.
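
In other words (a minimal sketch, reusing the byte string read earlier in the thread):

t = unicode(eng_file, "utf-16")     # reads the BOM, picks the byte order, drops it
t2 = unicode(eng_file, "utf-16-le") # treats the BOM as data, so t2 starts with u'\ufeff'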
 

derek / nul

If there is a BOM, then it is not UTF-16LE; it is UTF-16.

This paragraph is from http://www.egenix.com/files/python/unicode-proposal.txt

It explains the difference between utf-16-le and utf-16-be




Standard Codecs:
----------------

Standard codecs should live inside an encodings/ package directory in the
Standard Python Code Library. The __init__.py file of that directory should
include a Codec Lookup compatible search function implementing a lazy module
based codec lookup.

Python should provide a few standard codecs for the most relevant
encodings, e.g.

'utf-8': 8-bit variable length encoding
'utf-16': 16-bit variable length encoding (little/big endian)
'utf-16-le': utf-16 but explicitly little endian
'utf-16-be': utf-16 but explicitly big endian
'ascii': 7-bit ASCII codepage
'iso-8859-1': ISO 8859-1 (Latin 1) codepage
'unicode-escape': See Unicode Constructors for a definition
'raw-unicode-escape': See Unicode Constructors for a definition
'native': Dump of the Internal Format used by Python

Common aliases should also be provided per default, e.g. 'latin-1'
for 'iso-8859-1'.

Note: 'utf-16' should be implemented by using and requiring byte order
marks (BOM) for file input/output.

All other encodings such as the CJK ones to support Asian scripts
should be implemented in separate packages which do not get included
in the core Python distribution and are not a part of this proposal.
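
The practical difference between the three UTF-16 codecs, sketched on the encoding
side (byte values shown for a little-endian machine such as the OP's):

u = u"A"
print repr(u.encode("utf-16"))     # '\xff\xfeA\x00' -- BOM plus native byte order
print repr(u.encode("utf-16-le"))  # 'A\x00'         -- explicit little endian, no BOM
print repr(u.encode("utf-16-be"))  # '\x00A'         -- explicit big endian, no BOM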
 
