unicode by default

Terry Reedy

Right. *Under the hood* Python uses UCS-2 (which is not exactly the
same thing as UTF-16, by the way) to represent Unicode strings.

I know some people say that, but according to the definitions of the
unicode consortium, that is wrong! The earlier UCS-2 *cannot* represent
chars in the Supplementary Planes. The later (1996) UTF-16, which Python
uses, can. The standard has long considered 'UCS-2' obsolete. See

https://secure.wikimedia.org/wikipedia/en/wiki/UTF-16/UCS-2
or http://www.unicode.org/faq/basic_q.html#14

The latter says: "Q: What is the difference between UCS-2 and UTF-16?
A: UCS-2 is obsolete terminology which refers to a Unicode
implementation up to Unicode 1.1, before surrogate code points and
UTF-16 were added to Version 2.0 of the standard. This term should now
be avoided."

It goes on: "Sometimes in the past an implementation has been labeled
"UCS-2" to indicate that it does not support supplementary characters
and doesn't interpret pairs of surrogate code points as characters. Such
an implementation would not handle processing of character properties,
code point boundaries, collation, etc. for supplementary characters."

I know that 16-bit Python *does* use surrogate pairs for supplementary
chars and at least some properties work for them. I am not sure exactly
what the rest means.

However, this is entirely transparent. To the Python programmer, a
unicode string is just an abstraction of a sequence of code-points.
You don't need to think about UCS-2 at all. The only times you need
to worry about encodings are when you're encoding unicode characters
to byte strings, or decoding bytes to unicode characters, or opening a
stream in text mode; and in those cases the only encoding that matters
is the external one.

If one uses unicode chars in the Supplementary Planes above the BMP (the
first 2**16), which require surrogate pairs for 16 bit unicode (UTF-16),
then the abstraction leaks.
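
To make the leak concrete, here is a minimal sketch of the surrogate-pair arithmetic UTF-16 uses for code points above U+FFFF. The helper names `to_surrogate_pair` and `utf16_code_units` are made up for illustration. On a 16-bit build, len() and indexing count these 16-bit units rather than code points. The sketch assumes Python 3.3 or later, where 16-bit builds no longer exist, so it inspects the encoded form instead.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary-plane code point into UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000      # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits -> low surrogate
    return high, low


def utf16_code_units(text):
    """Return the 16-bit code units of the UTF-16 encoding of text."""
    data = text.encode("utf-16-be")    # big-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], "big")
            for i in range(0, len(data), 2)]


clef = "\U0001D11E"                    # MUSICAL SYMBOL G CLEF
pair = to_surrogate_pair(ord(clef))
units = utf16_code_units(clef)
print([hex(u) for u in units])         # two code units for one code point
```

A string of one BMP character plus the clef is two code points but three UTF-16 code units, which is the length a 16-bit build would report.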
 
Ian Kelly

I know some people say that, but according to the definitions of the unicode
consortium, that is wrong! The earlier UCS-2 *cannot* represent chars in the
Supplementary Planes. The later (1996) UTF-16, which Python uses, can. The
standard has long considered 'UCS-2' obsolete. See

https://secure.wikimedia.org/wikipedia/en/wiki/UTF-16/UCS-2
or http://www.unicode.org/faq/basic_q.html#14

At the first link, in the section _Use in major operating systems and
environments_ it states, "The Python language environment officially
only uses UCS-2 internally since version 2.1, but the UTF-8 decoder to
"Unicode" produces correct UTF-16. Python can be compiled to use UCS-4
(UTF-32) but this is commonly only done on Unix systems."

PEP 100 says:

The internal format for Unicode objects should use a Python
specific fixed format <PythonUnicode> implemented as 'unsigned
short' (or another unsigned numeric type having 16 bits). Byte
order is platform dependent.

This format will hold UTF-16 encodings of the corresponding
Unicode ordinals. The Python Unicode implementation will address
these values as if they were UCS-2 values. UCS-2 and UTF-16 are
the same for all currently defined Unicode character points.
UTF-16 without surrogates provides access to about 64k characters
and covers all characters in the Basic Multilingual Plane (BMP) of
Unicode.

It is the Codec's responsibility to ensure that the data they pass
to the Unicode object constructor respects this assumption. The
constructor does not check the data for Unicode compliance or use
of surrogates.

I'm getting out of my depth here, but that implies to me that while
Python stores UTF-16 and can correctly encode/decode it to UTF-8,
other codecs might only work correctly with UCS-2, and the unicode
class itself ignores surrogate pairs.

Although I'm not sure how much this might have changed since the
original implementation, especially for Python 3.
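
On the Python 3 question: from Python 3.3 onward (PEP 393) the 16-bit/32-bit build distinction is gone and strings are indexed by code point. A quick check, assuming a Python 3.3+ interpreter; the variable names are only for illustration:

```python
import sys

clef = "\U0001D11E"   # one supplementary-plane code point

print(hex(sys.maxunicode))   # full Unicode range on Python 3.3+
print(len(clef))             # counts code points, not UTF-16 units

# The UTF-8 and UTF-16 codecs both round-trip the character.
utf8_bytes = clef.encode("utf-8")
utf16_bytes = clef.encode("utf-16-le")
round_trip_ok = (utf8_bytes.decode("utf-8") == clef
                 and utf16_bytes.decode("utf-16-le") == clef)

# A lone surrogate is not a character, so the strict codec refuses it.
try:
    "\ud834".encode("utf-8")
    lone_surrogate_rejected = False
except UnicodeEncodeError:
    lone_surrogate_rejected = True

print(round_trip_ok, lone_surrogate_rejected)
```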
 
jmfauth

...
to worry about encodings are when you're encoding unicode characters
to byte strings, or decoding bytes to unicode characters


A small but important correction/clarification:

In Unicode, "unicode" does not encode a *character*. It
encodes a *code point*, a number, the integer associated
to the character.
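
A small sketch of that distinction, assuming Python 3: ord() gives the integer code point, and only an encoding turns that integer into bytes, with a different result per encoding.

```python
import unicodedata

ch = "\u03b1"                      # GREEK SMALL LETTER ALPHA
code_point = ord(ch)               # character -> integer code point
name = unicodedata.name(ch)
print("U+%04X" % code_point, name)

# The same code point, serialized by three different encodings.
as_utf8 = ch.encode("utf-8")
as_utf16 = ch.encode("utf-16-be")
as_utf32 = ch.encode("utf-32-be")
print(len(as_utf8), len(as_utf16), len(as_utf32))

# chr() goes back from the integer to the one-character string.
same_char = chr(code_point) == ch
```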

jmf
 
harrismh777

jmfauth said:
A small but important correction/clarification:

In Unicode, "unicode" does not encode a *character*. It
encodes a *code point*, a number, the integer associated
to the character.

That is a huge code-point... pun intended.

.... and there is another point that I continue to be somewhat puzzled
about, and that is the issue of fonts.

One of my hobbies at the moment is ancient Greek (biblical studies,
Septuaginta LXX, and Greek New Testament). I have these texts on my
computer in a folder in several formats... pdf, unicode 'plaintext',
osis.xml, and XML.

These texts may be found at http://sblgnt.com

I am interested for the moment only in the 'plaintext' stream,
because it is unicode. (first, in unicode, according to all the doc
there is no such thing as 'plaintext,' so keep that in mind).

When I open the text stream in one of my unicode editors I can see
'most' of the characters in a rudimentary Greek font with accents;
however, I also see many tiny square blocks indicating (I think) that
the code points do *not* have a corresponding character in my unicode
font for that Greek symbol (whatever it is supposed to be).
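
Python cannot tell you what a font contains, but it can tell you which code points are actually in the text, which is the first thing to know when hunting for a font that covers them. A minimal sketch, assuming Python 3; the function name `inventory` and the sample phrase are illustrative only. Precomposed polytonic Greek letters live in the Greek Extended block (U+1F00 to U+1FFF), which many fonts leave out.

```python
import unicodedata
from collections import Counter


def inventory(text):
    """List each distinct non-ASCII character as (code point, name, count)."""
    counts = Counter(ch for ch in text if ord(ch) > 0x7F)
    rows = []
    for ch in sorted(counts):                      # sorted by code point
        name = unicodedata.name(ch, "<no name>")
        rows.append(("U+%04X" % ord(ch), name, counts[ch]))
    return rows


# Sample phrase written with escapes so it survives any editor or font.
sample = "\u1f10\u03bd \u1f00\u03c1\u03c7\u1fc7"
for row in inventory(sample):
    print(row)
```

Running this over the whole text gives a list of code points to compare against a candidate font's coverage.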

The point, or question is, how does one go about making sure that
there is a corresponding font glyph to match a specific unicode code
point for display in a particular terminal (editor, browser, whatever) ?

The unicode consortium is very careful to make sure that thousands
of symbols have a unique code point (that's great !) but how do these
thousands of symbols actually get displayed if there is no font
consortium? Are there collections of 'standard' fonts for unicode that
I am not aware of? Is there a unix linux package that can be installed
that drops at least 'one' default standard font that will be able to
render all or 'most' (whatever I mean by that) code points in unicode?
Is this a Python issue at all?


kind regards,
m harris
 
Robert Kern

The unicode consortium is very careful to make sure that thousands of symbols
have a unique code point (that's great !) but how do these thousands of symbols
actually get displayed if there is no font consortium? Are there collections of
'standard' fonts for unicode that I am not aware?

There are some well-known fonts that try to cover a large section of the Unicode
standard.

http://en.wikipedia.org/wiki/Unicode_typeface

Is there a unix linux package
that can be installed that drops at least 'one' default standard font that will
be able to render all or 'most' (whatever I mean by that) code points in
unicode? Is this a Python issue at all?

Not really.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
that is made terrible by our own mad attempt to interpret it as though it had
an underlying truth."
-- Umberto Eco
 
Terry Reedy

The unicode consortium is very careful to make sure that thousands of
symbols have a unique code point (that's great !) but how do these
thousands of symbols actually get displayed if there is no font
consortium? Are there collections of 'standard' fonts for unicode that I
am not aware? Is there a unix linux package that can be installed that
drops at least 'one' default standard font that will be able to render
all or 'most' (whatever I mean by that) code points in unicode? Is this
a Python issue at all?

Easy, practical use of unicode is still a work in progress.
 
harrismh777

Terry said:
Easy, practical use of unicode is still a work in progress.

Apparently... the good news for me is that SBL provides their unicode
font here:

http://www.sbl-site.org/educational/biblicalfonts.aspx

I'm getting much closer here, but now the problem is typing. The pain
with unicode fonts is that the glyph is tied to the code point for the
represented character, and not tied to any code point that matches any
keyboard scan code for typing. :-}

So, I can now see the ancient text with accents and apparatus in all of
my editors, but I still cannot type any ancient Greek with my
keyboard... because I have to make up a keymap first. <sigh>

I don't find that SBL (nor Logos Software) has provided keymaps as
yet... rats.

I can read the text with Python though... yessss.
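
For reference, a minimal sketch of reading such a file in Python 3. It assumes the plaintext is UTF-8 encoded (not confirmed here), and the file name `sample_greek.txt` is hypothetical; the demo writes a small temporary file so that it is self-contained.

```python
import io
import os
import tempfile


def read_text(path, encoding="utf-8"):
    """Decode the bytes of a file into a str using an explicit encoding."""
    with io.open(path, encoding=encoding) as f:
        return f.read()


sample = "\u1f10\u03bd \u1f00\u03c1\u03c7\u1fc7"    # short Greek phrase

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample_greek.txt")    # hypothetical file name
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(sample)
    loaded = read_text(path)
    size_on_disk = os.path.getsize(path)

print(len(loaded), "code points,", size_on_disk, "bytes on disk")
```

Naming the encoding explicitly avoids depending on the platform default, which differs between systems.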


m harris
 
Nobody

The unicode consortium is very careful to make sure that thousands
of symbols have a unique code point (that's great !) but how do these
thousands of symbols actually get displayed if there is no font
consortium? Are there collections of 'standard' fonts for unicode that I
am not aware? Is there a unix linux package that can be installed that
drops at least 'one' default standard font that will be able to render all
or 'most' (whatever I mean by that) code points in unicode?

Using the original meaning of "font" (US) or "fount" (commonwealth), you
can't have a single font cover the whole of Unicode. A font isn't a random
set of glyphs, but a set of glyphs in a common style, which can only
practically be achieved for a specific alphabet.

You can bundle multiple fonts covering multiple repertoires into a single
TTF (etc) file, but there's not much point.

In software, the term "font" is commonly used to refer to some ad-hoc
mapping between codepoints and glyphs. This typically works by either
associating each specific font with a specific repertoire (set of
codepoints), or by simply trying each font in order until one is found
with the correct glyph.
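
The "try each font in order" strategy can be sketched in a few lines. This is only a toy model: the font names and coverage sets below are invented, and a real renderer would read coverage from the font files themselves.

```python
# Hypothetical fonts, each with the set of code points it has glyphs for.
FONTS = [
    ("LatinSans", set(range(0x20, 0x7F))),
    ("GreekPolytonic", set(range(0x370, 0x400)) | set(range(0x1F00, 0x2000))),
]


def pick_font(ch, fonts=FONTS):
    """Return the first font covering ch, or None if no font has a glyph."""
    for name, coverage in fonts:
        if ord(ch) in coverage:
            return name
    return None        # a renderer would draw a placeholder box here


def assign_fonts(text):
    """Pair every character of text with the font chosen for it."""
    return [(ch, pick_font(ch)) for ch in text]


for ch, font in assign_fonts("a\u1f10\U0001D11E"):
    print("U+%04X" % ord(ch), font)
```

The None case corresponds to the small square blocks mentioned earlier in the thread: the code point is valid, but no installed font supplies a glyph.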

This is a sufficiently common problem that the FontConfig library exists
to simplify a large part of it.

Is this a Python issue at all?

No.
 
jmfauth

...
I'm getting much closer here,
...

You should really understand that Unicode is a domain in its
own right. It is independent of any OS, programming language
or application. It is up to these tools to be "unicode"
compliant.

Working in a full unicode mode (at least for texts) is
today practically a solved problem. But you have to ensure
the whole toolchain is unicode compliant (editors,
fonts (OpenType technology), rendering devices, ...).

Tip. This list is certainly not the best place to get
information. I suggest you start by reading
about XeTeX. XeTeX is the "new" TeX engine working only
in a unicode mode. From this starting point, you will
come across plenty of web sites covering the "unicode
world", tools, fonts, ...

A variant is to visit sites speaking about *typography*.

jmf
 
Terry Reedy

harrismh777 wrote:

Apparently... the good news for me is that SBL provides their unicode
font here:

http://www.sbl-site.org/educational/biblicalfonts.aspx

I'm getting much closer here, but now the problem is typing. The pain
with unicode fonts is that the glyph is tied to the code point for the
represented character, and not tied to any code point that matches any
keyboard scan code for typing. :-}

So, I can now see the ancient text with accents and apparatus in all of
my editors, but I still cannot type any ancient Greek with my
keyboard... because I have to make up a keymap first. <sigh>

I don't find that SBL (nor Logos Software) has provided keymaps as
yet... rats.

You need what is called, at least with Windows, an IME -- Input Method
Editor. These are part of (or associated with) the OS, so they can be
used with *any* application that will accept unicode chars (in whatever
encoding) rather than just ascii chars. Windows has about a hundred or
so, including Greek. I do not know if that includes classical Greek with
the extra marks.
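
Until a keymap or IME is set up, one workaround inside Python source is to spell characters by name or escape instead of typing them. A minimal sketch, assuming Python 3; it is no substitute for a real input method, but it shows that a base letter plus a combining mark normalizes to the precomposed polytonic letter.

```python
import unicodedata

alpha = "\N{GREEK SMALL LETTER ALPHA}"
psili = "\N{COMBINING COMMA ABOVE}"          # smooth breathing mark

# NFC normalization merges base letter + mark into one precomposed letter.
composed = unicodedata.normalize("NFC", alpha + psili)

# The same character can also be looked up directly by its Unicode name.
by_name = unicodedata.lookup("GREEK SMALL LETTER ALPHA WITH PSILI")

print("U+%04X" % ord(composed), unicodedata.name(composed))
```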
 
