Unicode list

Rehceb Rotkiv · Apr 1, 2007

Hello,

I have this little grep-like program:

++++++++++snip++++++++++
#!/usr/bin/python

import sys
import re

pattern = sys.argv[1]
inputfile = file(sys.argv[2], 'r')

for line in inputfile:
    matches = re.findall(pattern, line)
    if matches:
        print matches
++++++++++snip++++++++++

Like this, the program prints some characters as strange escape
sequences, which is due to the input file being encoded in utf-8: When I
convert "re.findall..." to a string and wrap an "unicode()" around it,
the matches get printed correctly. Is it possible to make "matches"
unicode without saving it as a single string first? The function
"unicode()" seems only to work for strings. Or is there a general way of telling
Python to abandon the ancient and evil land of iso-8859 for good and use
utf-8 only?

Regards,
Rehceb

Paul Boddie · Apr 1, 2007

Rehceb said:
Like this, the program prints some characters as strange escape
sequences, which is due to the input file being encoded in utf-8:

So the UTF-8 data gets printed to your terminal which isn't configured
for UTF-8, right?

Rehceb said:
When I convert "re.findall..." to a string and wrap an "unicode()" around it,
the matches get printed correctly.

How do you meaningfully convert it to a string? The matches variable
refers to a list, but you surely don't want to be dealing with the
list's string representation.

Rehceb said:
Is it possible to make "matches" unicode without saving it as a single string first?

Why not convert your input into Unicode and then, for the benefit of
certain kinds of character classes, use re.findall in Unicode mode (by
specifying re.U as a flag)? Then, each match will be produced as a
Unicode object.
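
A minimal sketch of that suggestion, assuming the input bytes are UTF-8. The sample line and the pattern are made up for illustration, and the code is written so that it runs under both Python 2 and Python 3 (where re.U is already the default for text patterns):

```python
import re

# Hypothetical input line as raw UTF-8 bytes, which is what reading the
# file without decoding yields.
raw_line = u"na\u00efve caf\u00e9 42".encode("utf-8")

# Decode first, so the regex engine sees characters rather than bytes.
line = raw_line.decode("utf-8")

# With re.U, \w follows the Unicode character database, so it also
# matches the accented letters.
matches = re.findall(u"\\w+", line, re.U)

# Each match is a Unicode object: the two words and the number.
print(len(matches))  # 3
```

Without the decode step, the same pattern would be applied to bytes and would split the words at the accented letters.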

Rehceb said:
The function "unicode()" seems only to work for strings. Or is there a general way of telling
Python to abandon the ancient and evil land of iso-8859 for good and use utf-8 only?

The only refuge from ancient and evil lands is found by climbing the
mountain of Unicode: convert from encoded text as soon as you can,
work only with Unicode objects, produce encoded text only when
necessary.
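
One way to picture the climb, as a rough sketch with a made-up sample value and an assumed UTF-8 encoding on both sides. It shows why working on bytes is awkward: lengths and case conversion only make sense on the decoded text.

```python
# Encoded text arriving from outside (file, socket, terminal).
raw = u"caf\u00e9".encode("utf-8")

# Convert from encoded text as soon as possible.
text = raw.decode("utf-8")

# The byte string counts bytes, the Unicode object counts characters.
print(len(raw))   # 5
print(len(text))  # 4

# Work only with the Unicode object ...
shouted = text.upper()

# ... and produce encoded text only when it has to leave the program.
out = shouted.encode("utf-8")
```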

Paul

Guest · Apr 1, 2007

Rehceb said:
Like this, the program prints some characters as strange escape
sequences, which is due to the input file being encoded in utf-8: When I
convert "re.findall..." to a string and wrap an "unicode()" around it,
the matches get printed correctly. Is it possible to make "matches"
unicode without saving it as a single string first? The function
"unicode()" seems only to work for strings. Or is there a general way of telling
Python to abandon the ancient and evil land of iso-8859 for good and use
utf-8 only?

Python does not live in the ancient and evil land of iso-8859; it lives
in the ancient and evil land of ASCII.

When printing a list, the individual elements are converted with repr(),
not with str(). For a string object, repr() adds escape codes for all
bytes that are not printable ASCII characters. To avoid this call to
repr, you need to iterate over the list yourself, and print it:

if matches:
    for m in matches:
        print m,
    print
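
The effect can be seen without a terminal getting in the way. This is only a sketch: the sample value is made up, and the byte string stands in for what a UTF-8 file yields. It also runs under Python 3, where the repr() of a byte string additionally carries a b prefix.

```python
# One match, held as UTF-8 bytes.
matches = [u"caf\u00e9".encode("utf-8")]

# Turning the whole list into text applies repr() to each element ...
list_text = str(matches)
same_as_repr = (list_text == "[" + repr(matches[0]) + "]")

# ... and repr() is where the backslash escapes come from.
has_escapes = "\\xc3\\xa9" in list_text

print(same_as_repr)  # True
print(has_escapes)   # True
print(list_text)
```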

HTH,
Martin

Georg Brandl · Apr 1, 2007

Rehceb said:
Like this, the program prints some characters as strange escape
sequences, which is due to the input file being encoded in utf-8

As Paul said, your terminal is likely set to iso-8859 encoding, which
is why it doesn't display UTF-8 correctly. The above program produces
correct UTF-8 output.

What you could do is:
1. read the file in as unicode
2. print the unicode to the terminal (will use the terminal encoding) or
convert the unicode to strings with an explicit encoding before printing

codecs.open() is very helpful for step 1, BTW.
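
The two steps could look roughly like this. The sample file, its contents and the pattern are invented for the sketch, and UTF-8 is assumed for both the file and the output:

```python
import codecs
import os
import re
import tempfile

# Create a small UTF-8 sample file so the sketch is self-contained.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with codecs.open(path, "w", encoding="utf-8") as out:
    out.write(u"gr\u00fcn blau\nrot gelb\n")

# Step 1: codecs.open() decodes while reading, so every line is Unicode.
found = []
with codecs.open(path, "r", encoding="utf-8") as inputfile:
    for line in inputfile:
        found.extend(re.findall(u"gr\\w+", line, re.U))

os.remove(path)

# Step 2: convert to byte strings with an explicit encoding before output.
encoded = [m.encode("utf-8") for m in found]
print(len(encoded))  # 1
```

Encoding explicitly keeps the output independent of whatever encoding the terminal happens to report.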

Georg

Rehceb Rotkiv · Apr 1, 2007

Martin said:
When printing a list, the individual elements are converted with repr(),
not with str(). For a string object, repr() adds escape codes for all
bytes that are not printable ASCII characters.

Thanks Martin, you're right: it was the repr() calls that messed up the
output. Iterating over the list as you proposed is even 1/100 s faster.

Regards,
Rehceb
