Printing a list containing Unicode strings

Xah Lee

If I have a nested list whose atoms are Unicode strings, e.g.

# -*- coding: utf-8 -*-
ttt=[[u"¡ú",u"¡ü"], [u"¦Á¦Â¦Ã"],...]
print ttt

how can I print it without getting the u'\u1234' escape notation?
I.e., I want it printed just like this: [[u"→"], ...]

I can of course write a loop and, for each string, use
encode("utf-8"), but is there an easier way?

Thx.

Xah
(e-mail address removed)
∑ http://xahlee.org/
 
Carsten Haese

If I have a nested list whose atoms are Unicode strings, e.g.

# -*- coding: utf-8 -*-
ttt=[[u"→",u"↑"], [u"αβγ"],...]
print ttt

how can I print it without getting the u'\u1234' escape notation?
I.e., I want it printed just like this: [[u"→"], ...]

I can of course write a loop and, for each string, use
encode("utf-8"), but is there an easier way?

It's not quite clear why you want to do this, but this is how you could
do it:

print repr(ttt).decode("unicode_escape").encode("utf-8")
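Unpacking the one-liner step by step (Python 2; the intermediate names
are just for illustration):

s = repr(ttt)                   # ASCII str containing \uXXXX escape sequences
u = s.decode("unicode_escape")  # interpret the escapes, yielding a unicode object
print u.encode("utf-8")         # encode to UTF-8 bytes the terminal can display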

However, I am getting the impression that this is a "How can I use 'X'
to achieve 'Y'?" question instead of the preferable "How can I achieve
'Y'?" type of question. In other words, printing the repr() of a list
might not be the best solution to reach the actual goal, which you have
not stated.

HTH,
 
Xah Lee

Xah Lee wrote:

If I have a nested list whose atoms are Unicode strings, e.g.

# -*- coding: utf-8 -*-
ttt=[[u"¡ú",u"¡ü"], [u"¦Á¦Â¦Ã"],...]
print ttt

how can I print it without getting the u'\u1234' escape notation?
I.e., I want it printed just like this: [[u"→"], ...]


Carsten Haese wrote:

It's not quite clear why you want to do this, but this is how you
could do it:

print repr(ttt).decode("unicode_escape").encode("utf-8")


Super! Thanks a lot.

About why I want to... I think it's just simpler and easier on the
eye.

Here's an example output from my program:
[[u' ', 1022], [u'↑', 472], [u' ', 128], [u'→w', 300], [u'→s', 12],
[u'→|', 184],...]

Wouldn't it be preferable if Python printed like this by default?

Xah
(e-mail address removed)
∑ http://xahlee.org/
 
Xah Lee

This post contains some notes and corrections to an online article
regarding Unicode and Python.

--------------

By happenstance I was reading:

Unicode HOWTO
http://www.amk.ca/python/howto/unicode

Here are some problems I see:

• No conspicuous authorship. (However, oddly, it has a conspicuous
acknowledgement listing of names.) (This problem is an indirect
consequence of the communism fanaticism ushered in by the OpenSource
movement.) (Originally I was just going to write to the author with
some corrections.)

• It's very wasteful of space. In most texts, the majority of the
code points are less than 127, or less than 255, so a lot of space is
occupied by zero bytes.

Not true. In Asia, most chars have Unicode numbers above 255. Considered
globally, *possibly* today there are more computer files in Chinese
than in all Latin-alphabet based languages.

• Many Internet standards are defined in terms of textual data, and
can't handle content with embedded zero bytes.

Not sure what he means by "can't handle content with embedded zero
bytes". Overall I think this sentence is silly, and he's probably
thinking of unix/linux.

• Encodings don't have to handle every possible Unicode
character, ....

This is inane. An encoding, by definition, turns numbers into binary
numbers (in our context, it means an encoding handles all Unicode chars
by definition). What he really meant to say is something like this:
"Practically speaking, most computer languages in Western society
don't need to support Unicode with respect to the language's source
file".

•
UTF-8 has several convenient properties:
1. It can handle any Unicode code point.
....


As mentioned before, by definition, any Unicode encoding encodes the
entire Unicode char set. Mentioning the above as a "convenient
property" is inane.

• 4. UTF-8 is fairly compact; the majority of code points are turned
into two bytes, and values less than 128 occupy only a single byte.

Note here that UTF-8 is relatively compact only if most of your text
is in the Latin alphabet. If you are not an Occidental man and you
write Chinese, UTF-8 is comparatively inefficient. (UTF-8, as one of
the Unicode encodings, is probably comparatively inefficient for
Japanese, Korean, Arabic, or any non-Latin-alphabet based languages.)

Also note, the article overly focuses on UTF-8. Microsoft's Windows NT
is probably the first major operating system to support Unicode
thoroughly, and it uses UTF-16. For much of America and Europe, which
are currently roughly the leaders in computing, UTF-8 is more efficient
in some sense (e.g. at least in disk space requirements). But
considering global computing, in particular Chinese & Japanese, UTF-16
is overall superior to UTF-8.

Part of the reason that UTF-8 is favored in this article has to do
with Linux (and unix). The reason unixes in general have chosen UTF-8
instead of UTF-16 is largely because unix is one motherfucking bag of
shit, such that it is impossible to support UTF-16 without scrapping a
large chunk of unix things.

PS: I did not read the article in detail, but only roughly, to see how
Python handles Unicode, because I was often confused by Python's
encode/decode/unicode methods and functions.

.... I am gonna continue reading that article for the Python-specific
issues...

Also note, this post is posted thru groups.google.com, and it contains
the double angle quotation mark chars. As of 2 weeks ago, these
quotation marks seem to be deleted in the process of posting, i.e.
Unicode names "LEFT-POINTING DOUBLE ANGLE QUOTATION MARK" and "RIGHT-
POINTING DOUBLE ANGLE QUOTATION MARK". Here, I enclose the double
angle quotation marks inside double curly quotes: " ". If inside the
double curly quotes you see spaces, then that means Google Groups
fucked up.

References and Further readings:

• Unicode in Perl & Python
http://xahlee.org/perl-python/unicode.html

• the Journey of a Foreign Character thru Internet
http://xahlee.org/Periodic_dosage_dir/t2/non-ascii_journey.html

• Unicode Characters Example
http://xahlee.org/Periodic_dosage_dir/t1/20040505_unicode.html

• Python's unicodedata module
http://xahlee.org/perl-python/unicodedata_module.html

• Emacs and Unicode Tips
http://xahlee.org/emacs/emacs_n_unicode.html

• Java Tutorial: Unicode in Java
http://xahlee.org/java-a-day/unicode_in_java.html

• Character Sets and Encoding in HTML
http://xahlee.org/js/html_chars.html

Xah
(e-mail address removed)
∑ http://xahlee.org/
 
J. Cliff Dyer

Xah said:
This post contains some notes and corrections to an online article
regarding Unicode and Python.

--------------

By happenstance I was reading:

Unicode HOWTO
http://www.amk.ca/python/howto/unicode

Here are some problems I see:

• No conspicuous authorship. (However, oddly, it has a conspicuous
acknowledgement listing of names.) (This problem is an indirect
consequence of the communism fanaticism ushered in by the OpenSource
movement.) (Originally I was just going to write to the author with
some corrections.)

• It's very wasteful of space. In most texts, the majority of the
code points are less than 127, or less than 255, so a lot of space is
occupied by zero bytes.

Not true. In Asia, most chars have Unicode numbers above 255. Considered
globally, *possibly* today there are more computer files in Chinese
than in all Latin-alphabet based languages.
That's an interesting point. I'd be interested to see numbers on
that, and how those numbers have changed over the past five years.
Sadly, such data is most likely impossible to obtain.

However, it should be pointed out that most *code*, whether written in
the United States, New Zealand, India, China, or Botswana, is written
in English. In part this is because English has become a standard of
sorts, much as Italian was a standard for musical notation, owing in
part to the US's former (and perhaps current, but certainly fading)
dominance in the field, and in part to the lack of solid support for
Unicode among many programming languages and compilers. Thus the
author's bias, while inaccurate, is still understandable.
• Many Internet standards are defined in terms of textual data, and
can't handle content with embedded zero bytes.

Not sure what he means by "can't handle content with embedded zero
bytes". Overall I think this sentence is silly, and he's probably
thinking of unix/linux.

• Encodings don't have to handle every possible Unicode
character, ....

This is inane. An encoding, by definition, turns numbers into binary
numbers (in our context, it means an encoding handles all Unicode chars
by definition). What he really meant to say is something like this:
"Practically speaking, most computer languages in Western society
don't need to support Unicode with respect to the language's source
file".

•
UTF-8 has several convenient properties:
1. It can handle any Unicode code point.
...


As mentioned before, by definition, any Unicode encoding encodes the
entire Unicode char set. Mentioning the above as a "convenient
property" is inane.
No, it's not inane. UCS-2, for example, is a fixed-width, 2-byte
encoding that can handle any Unicode code point up to 0xFFFF, but
cannot handle the 3- and 4-byte extension sets. UCS-2 was developed
for applications in which having fixed-width characters is essential,
but has the limitation of not being able to handle every Unicode code
point. IIRC, when it was developed, it did handle every code point,
and then Unicode grew. There is also a UCS-4 to handle this
limitation. UTF-16 is based on a two-byte unit, but is variable
width, like UTF-8, which makes it flexible enough to handle any code
point, but harder to process, and a bear to seek through to a certain
point.
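The difference is easy to see with Python's codecs (Python 2 on a
wide-unicode build; expected values are in the comments):

bmp = u'\u4e2d'         # a BMP character (a CJK ideograph)
astral = u'\U00010384'  # a code point beyond U+FFFF (a Ugaritic letter)
len(bmp.encode('utf-16-be'))     # 2: one 16-bit unit
len(astral.encode('utf-16-be'))  # 4: a surrogate pair, which UCS-2 cannot express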

(I'm politely ignoring your ill-reasoned attacks on non-Microsoft OSes).

Cheers,
Cliff
 
Marc 'BlackJack' Rintsch

• Many Internet standards are defined in terms of textual data, and
can't handle content with embedded zero bytes.

Not sure what he means by "can't handle content with embedded zero
bytes". Overall I think this sentence is silly, and he's probably
thinking of unix/linux.

No, he's probably thinking of all the text-based protocols (HTTP,
SMTP, …) and of the fact that one of the most used programming
languages, C, can't cope with embedded null bytes in strings.
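A trivial check of the C point (Python 2; Python's own strings are
unaffected):

s = 'a\x00b'
len(s)  # 3: a Python string may contain NUL bytes
# a C strlen() on the same bytes would report 1, stopping at the NUL,
# which is why text-oriented protocols avoid embedded zero bytes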
• Encodings don't have to handle every possible Unicode
character, ....

This is inane. An encoding, by definition, turns numbers into binary
numbers (in our context, it means an encoding handles all Unicode chars
by definition).

How do you encode Chinese characters with the ISO-8859-1 encoding? This
encoding obviously doesn't handle *all* Unicode characters.
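Concretely (Python 2):

u'\u4e2d'.encode('iso-8859-1')
# raises UnicodeEncodeError: Latin-1 has no byte for U+4E2D (中)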
•
UTF-8 has several convenient properties:
1. It can handle any Unicode code point.
...


As mentioned before, by definition, any Unicode encoding encodes the
entire Unicode char set. Mentioning the above as a "convenient
property" is inane.

You are being silly here.

Ciao,
Marc 'BlackJack' Rintsch
 
Xah Lee

J. Cliff Dyer wrote:
" ...UCS-2, for example, is a fixed width, 2-byte encoding that can
handle any unicode code point up to 0xffff, but cannot handle the 3
and 4 byte extension sets. "

I was going to reply to say that this is a good point. But on my way I
looked up Wikipedia:
http://en.wikipedia.org/wiki/UTF-16/UCS-2

quote:
" In computing, UTF-16 (16-bit Unicode Transformation Format) is a
variable-length character encoding for Unicode, capable of encoding
the entire Unicode repertoire. "

and
" UCS-2 (2-byte Universal Character Set) is an obsolete character
encoding which is a predecessor to UTF-16. The UCS-2 encoding form is
nearly identical to that of UTF-16, except that it does not support
surrogate pairs and therefore can only encode characters in the BMP
range U+0000 through U+FFFF. "

So the matter isn't simple. (I.e., it is not decisive to say I'm
incorrect in my original criticism of that article's statement on
UTF-8.)

------------

Btw, I think I should mention that I read the Unicode 3 specification
from cover to cover in 2002. (One heavy, thick, large, deep-blue-
colored book.)

Another resource that contributed to my understanding of Unicode is
the book "CJKV Information Processing" by Ken Lunde, which I read in
the same year.

Also of interest: I learned about a year ago that the Chinese encoding
GB 18030
http://en.wikipedia.org/wiki/GB_18030
which all computers sold in China are required by law to support, is
actually a Unicode encoding. Specifically, it encompasses all the
chars in Unicode.
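The claim is easy to check with Python's bundled codec (a rough
sketch; the 'gb18030' codec ships with the standard library):

# GB 18030 round-trips arbitrary Unicode, not just Chinese characters
for ch in (u'\u4e2d', u'\u2192', u'\u0391'):  # 中, →, Α
    assert ch.encode('gb18030').decode('gb18030') == ch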

Also relevant to our discussion: recently I was looking at alexa.com's
web ranking:

http://alexa.com/site/ds/top_sites?ts_mode=global&lang=none

and noticed several pure Chinese-language websites are among the top 100.

Baidu.com (百度) is ranked 8th today, followed by
腾讯网 (Tencent, http://www.qq.com) at 12, and
新浪 (Sina, sina.com.cn) at 19, etc.

It is somewhat amazing in the context of computing and languages. No
other non-English language comes close.

(Note here also that Chinese, as measured by number of speakers, is
roughly 4 times English.
http://en.wikipedia.org/wiki/Ethnologue_list_of_most_spoken_languages
This fact, coupled with the development and commercialization of China
in the past decade, is a reason for the above web ranking result.)

Not relevant to our discussion, but I happened to also notice a site
named youporn.com (was ranked 69 a few weeks ago). youporn.com is
basically like youtube.com, but with porn vids. It has long been my
thought that the progress of humanity in a society can be measured by
its popularity and acceptance of porn. (In fact I recall seeing some
academic (or not) report about this a few months ago... can't remember
where now.) Society as a whole has improved dramatically since the
communication revolution, in particular the one started by the web.

(See Xah's Porn Outspeak:
http://xahlee.org/PageTwo_dir/Personal_dir/porn_movies.html )

For more info about youporn.com, see:

http://en.wikipedia.org/wiki/Youporn

Curious parties might also check out

http://en.wikipedia.org/wiki/Youtube

which is a major phenomenon that, in my opinion, has contributed to
the progress of humanity far more than, say, any university or
educational institution.

(My thesis in general, in this direction, is that communication, the
main medium of knowledge, is the utmost factor in the human animal's
progress with respect to what's generally considered humanitarianism.
More important than, say, the need to decry war, have laws, maintain
peace, spread gospels, aid the poor, etc. (And in fact, in this
thesis, I consider what are commonly regarded as good activities, such
as aiding the poor, or any moral attitude and activity about the good
of humanity (such as OpenSource), to be in fact criminal in their
effects, and almost in their intention too...))

PS: for some reason, messages posted thru the Google Groups service
over the past week or so have had the Unicode double angle bracket
chars (U+00AB and U+00BB) stripped off. For that reason, in this msg
I've also used double curly quotes "" wherever I have double angle
brackets.

Xah
(e-mail address removed)
∑ http://xahlee.org/
 
Sion Arrowsmith

Xah Lee said:
" It's very wasteful of space. In most texts, the majority of the
code points are less than 127, or less than 255, so a lot of space is
occupied by zero bytes. "

Not true. In Asia, most chars have Unicode numbers above 255. Considered
globally, *possibly* today there are more computer files in Chinese
than in all Latin-alphabet based languages.

This doesn't hold water. There are many good reasons for preferring
UTF-16 over UTF-8, but unless you know you're only ever going to be
handling scripts from Unicode blocks above Arabic, it's reasonable
to assume that UTF-8 will be at least as compact. Consider that
transcoding a Chinese file from UTF-16 to UTF-8 will probably increase
its size by 50% (the CJK ideograph blocks encode to 3 bytes). While
transcoding a document in a Western European language the other way
can be expected to increase its size by up to 100% (every single-
byte character is doubled). You'd have to be talking about double the
volume of CJK data before switching from UTF-8 to UTF-16 becomes even
a break-even proposition space-wise.
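The arithmetic is easy to verify (Python 2; the sample strings are mine):

latin = u'hello world ' * 100  # pure ASCII text
cjk = u'\u4e2d\u6587' * 100    # pure CJK text (中文)
len(latin.encode('utf-8')), len(latin.encode('utf-16-be'))  # 1200 vs 2400: UTF-16 doubles ASCII
len(cjk.encode('utf-8')), len(cjk.encode('utf-16-be'))      # 600 vs 400: UTF-8 is 50% larger for CJK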

(It's curious to note that the average word length in English is
often taken to be 6 letters. Similarly, in UTF-8-encoded Chinese the
average word length is 6 bytes....)
 
