decoding keyboard input when using curses

Arnaud Delobelle

Hi all,

I am looking for advice on how to use unicode with curses. First I will
explain my understanding of how curses deals with keyboard input and how
it differs from what I would like.

The curses module has a window.getch() function to capture keyboard
input. This function returns an integer which is more or less:

* a byte if the key which was pressed is a printable character (e.g. a,
F, &);

* an integer > 255 if it is a special key, e.g. if you press KEY_UP it
returns 259.

As far as I know, curses is totally unicode unaware, so if the key
pressed is printable but not ASCII, the getch() function will return one
or more bytes depending on the encoding in the terminal.

E.g. given utf-8 encoding, if I press the key 'é' on my keyboard (which is
encoded as '\xc3\xa9' in utf-8), I will need two calls to getch() to get
this: the first one will return 0xC3 and the second one 0xA9.
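
To make the byte counts concrete, here is a small check of how 'é' is encoded in UTF-8. It is written for Python 3 (an assumption; the rest of this post uses Python 2.5) and only illustrates the encoding itself, not what curses does.

```python
# Python 3 sketch: the UTF-8 encoding of U+00E9 (é) is two bytes.
encoded = "\u00e9".encode("utf-8")
byte_values = list(encoded)            # the integers a byte-wise reader would see
print(encoded)                         # the raw bytes object
print([hex(b) for b in byte_values])   # the same bytes, shown in hex
```

The two values in byte_values are the 0xC3 and 0xA9 mentioned above.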

Instead of getting a stream of bytes and special keycodes (with value >
255) from getch(), what I want is a stream of *unicode characters* and
special keycodes.

So, still assuming utf-8 encoding in the terminal, if I type:

Té[KEY_UP]ça

iterating calls to the getch() function will give me this sequence of
integers:

84, 195, 169, 259,    195, 167, 97
T-  é-------  KEY_UP  ç-------  a-

But what I want is to get this stream instead:

u'T', u'é', 259, u'ç', u'a'


I can pipe the stream of output from getch() directly through an
instance of codecs.getreader('utf-8') because getch() sometimes returns
the integer values of the 'special keys'.

Now I will present to you the solution I have come up with so far. I am
really unsure whether it is a good way to solve this problem as both
unicode and curses still feel quite mysterious to me. What I would
appreciate is some advice on how to do it better - or someone to point
out that I have a gross misunderstanding of what is going on!

This has been tested in Python 2.5

-------------------- uctest.py ------------------------------
# -*- coding:utf-8 -*-

import codecs
import curses

# This gives the return codes given by curses.window.getch() when
# "Té[KEY_UP]ça" is typed in a terminal with utf-8 encoding:

codes = map(ord, "Té") + [curses.KEY_UP] + map(ord, "ça")


# This class defines a file-like object from a curses window 'win'
# whose read() function will return the next byte (as a character)
# given by win.getch() if it's a byte or return the empty string and
# set the code attribute to the value of win.getch().

# It is not used in this test; the Stream class below is used
# instead.

class CursesStream(object):
    def __init__(self, win):
        self.getch = win.getch
    def read(self):
        c = self.getch()
        if c == -1:
            self.code = None
            return ''
        elif c > 255:
            self.code = c
            return ''
        else:
            return chr(c)

# This class simulates CursesStream above with a predefined list of
# keycodes to return - handy for testing.

class Stream(object):
    def __init__(self, codes):
        self.codes = iter(codes)
    def read(self):
        try:
            c = self.codes.next()
        except StopIteration:
            self.code = None
            return ''
        if c > 255:
            self.code = c
            return ''
        else:
            return chr(c)

def getkeys(stream, encoding):
    '''Given a CursesStream object and an encoding, yield the decoded
    unicode characters and special keycodes that curses sends'''
    read = codecs.getreader(encoding)(stream).read
    while True:
        c = read()
        if c:
            yield c
        elif stream.code is None:
            return
        else:
            yield stream.code


# Test getkeys with

for c in getkeys(Stream(codes), 'utf-8'):
    if isinstance(c, unicode):
        print 'Char\t', c
    else:
        print 'Code\t', c

-------------------- running uctest.py ------------------------------

marigold:junk arno$ python uctest.py
Char T
Char é
Code 259
Char ç
Char a
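
For comparison, the same idea can be sketched with an incremental decoder instead of a file-like wrapper around codecs.getreader(). This is a different technique from the script above, written for Python 3 (an assumption; the script above is Python 2.5). The name decode_keys is made up for illustration, and KEY_UP is hard-coded to the value 259 quoted earlier so the sketch runs without a terminal.

```python
import codecs

KEY_UP = 259  # value quoted in this post for curses.KEY_UP (hard-coded for the sketch)

def decode_keys(codes, encoding="utf-8"):
    """Yield str characters and int special keycodes from getch()-style integers."""
    decoder = codecs.getincrementaldecoder(encoding)()
    for c in codes:
        if c < 0:
            continue              # -1 means "no input" in non-blocking mode
        if c > 255:
            decoder.reset()       # drop any half-read character
            yield c               # special key: pass the integer through
        else:
            text = decoder.decode(bytes([c]))
            if text:              # empty until a full character has arrived
                yield text

codes = list("T\u00e9".encode("utf-8")) + [KEY_UP] + list("\u00e7a".encode("utf-8"))
result = list(decode_keys(codes))
print(result)
```

With a real window, the codes argument could be a generator that keeps calling win.getch(); that part is not exercised here.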

Thanks if you have read this far!
 
Arnaud Delobelle

Arnaud Delobelle said:
I can pipe the stream of output from getch() directly through an
  ^^^ I mean *can't*
instance of codecs.getreader('utf-8') because getch() sometimes returns
the integer values of the 'special keys'.
[...]

I reread my post 3 times before sending it, honest!
 
Chris Jones


Disclaimer: I am not familiar with the curses python implementation and
I'm neither an ncurses nor a "unicode" expert by a long shot.

:)
I am looking for advice on how to use unicode with curses. First I will
explain my understanding of how curses deals with keyboard input and how
it differs with what I would like.

The curses module has a window.getch() function to capture keyboard
input. This function returns an integer which is more or less:

* a byte if the key which was pressed is a printable character (e.g. a,
F, &);

* an integer > 255 if it is a special key, e.g. if you press KEY_UP it
returns 259.

The getch(3NCURSES) function returns an integer. Provided it's large
enough to accommodate the highest possible value, the actual size in
bytes of the integer should be irrelevant.
As far as I know, curses is totally unicode unaware,

My impression is that rather than "unicode unaware", it is "unicode
transparent" - or (nitpicking) "UTF8 transparent" - since I'm not sure
other flavors of unicode are supported.
so if the key pressed is printable but not ASCII,

... nitpicking again, but ASCII is a 7-bit encoding: 0-127.
the getch() function will return one or more bytes depending on the
encoding in the terminal.

I don't know about the python implementation, but my guess is that it
should closely follow the underlying ncurses API - so the above is
basically correct, although it's not a question of the number of bytes
but rather the returned range of integers - if your locale is en_US then
that should be 0-255.. if it is en_US.utf8 the range is considerably
larger.
E.g. given utf-8 encoding, if I press the key 'é' on my keyboard (which
encoded as '\xc3\xa9' in utf-8), I will need two calls to getch() to get
this: the first one will return 0xC3 and the second one 0xA9.

No. A single call to getch() will grab your "é" and return 0xc3a9,
decimal 50089.
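
For reference, 50089 is the number you get when the two UTF-8 bytes of "é" are read as one big-endian integer; the Unicode code point of "é" is a different number. The following Python 3 lines (an assumption; nothing in this thread uses Python 3) only check that arithmetic and say nothing about what getch() returns on a given system.

```python
# Python 3 sketch: packed UTF-8 bytes versus the Unicode code point.
ch = "\u00e9"                              # é
utf8 = ch.encode("utf-8")                  # two bytes: 0xC3 0xA9
packed = int.from_bytes(utf8, "big")       # the bytes read as a single integer
code_point = ord(ch)                       # U+00E9
print(hex(packed), packed)
print(hex(code_point), code_point)
```

So an integer such as 0xc3a9 would still need to be split into bytes and decoded before it could be treated as a character.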
Instead of getting a stream of bytes and special keycodes (with value >
255) from getch(), what I want is a stream of *unicode characters* and
special keycodes.

This is what getch(3NCURSES) does: it returns the integer value of one
"unicode character".

Likewise, I would assume that looping over the python equivalent of
getch() will not return a stream of bytes but rather a "stream" of
integers that map one to one to the "unicode characters" that were
entered at the terminal.

Note: I am only familiar with languages such as English, Spanish,
French, etc. where only one terminal cell is used for each glyph. My
understanding is that things get somewhat more complicated with
languages that require so-called "wide characters" - two terminal cells
per character, but that's a different issue.
So, still assuming utf-8 encoding in the terminal, if I type:

Té[KEY_UP]ça

iterating call to the getch() function will give me this sequence of
integers:

84, 195, 169, 259, 195, 167, 97
T- é------- KEY_UP ç------- a-

But what I want to get this stream instead:

u'T', u'é', 259, u'ç', u'a'

No, for the above, getch() will return:

84, 50089, 259, 50087, 97

... which is "functionally" equivalent to:

u'T', u'é', 259, u'ç', u'a'

[..]

So shouldn't this issue boil down to just a matter of casting the
integers to the "u" data type?

This short snippet may help clarify the above:

-----------------------------------------------------------------------
#include <locale.h>
#include <ncurses.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

int unichar;

int main(int argc, char *argv[])
{
    setlocale(LC_ALL, "en_US.UTF-8");  /* make sure UTF8 */
    initscr();                         /* start curses mode */
    raw();
    keypad(stdscr, TRUE);              /* pass special keys */
    unichar = getch();                 /* read terminal */

    mvprintw(24, 0, "Key pressed is = %4x ", unichar);

    refresh();
    getch();                           /* wait */
    endwin();                          /* leave curses mode */
    return 0;
}
-----------------------------------------------------------------------

Hopefully you have access to a C compiler:

$ gcc -lncurses uni00.c -o uni00

Hope this helps... Whatever the case, please keep me posted.

CJ
 
Arnaud Delobelle

Hi Chris, thanks for your detailed reply.
Disclaimer: I am not familiar with the curses python implementation and
I'm neither an ncurses nor a "unicode" expert by a long shot.

:)


The getch(3NCURSES) function returns an integer. Provided it's large
enough to accommodate the highest possible value, the actual size in
bytes of the integer should be irrelevant.

Sorry, I was somehow mixing up what happens in general and what happens
with utf-8 (probably because I have only done tests with utf-8), where
the number of bytes used to encode a character varies.
My impression is that rather than "unicode unaware", it is "unicode
transparent" - or (nitpicking) "UTF8 transparent" - since I'm not sure
other flavors of unicode are supported.

.. nitpicking again, but ASCII is a 7-bit encoding: 0-127.


I don't know about the python implementation, but my guess is that it
should closely follow the underlying ncurses API - so the above is
basically correct, although it's not a question of the number of bytes
but rather the returned range of integers - if your locale is en.US then
that should be 0-255.. if it is en_US.utf8 the range is considerably
larger.

In my tests, my locale is en_GB.utf8 and the python getch() function
does return a number of bytes - see below.
No. A single call to getch() will grab your " é" and return 0xc3a9,
decimal 50089.

It is the case though that on my machine, if I press 'é' then call
getch() it will return 0xC3. A further call to getch() will return
0xA9. This is why I was talking about getch() returning bytes: to me it
behaves as if it returns the encoded characters byte by byte.
This is what getch(3NCURSES) does: it returns the integer value of one
"unicode character".

It is not what happens in my tests. I have made a simple testing
script, see below.
Likewise, I would assume that looping over the python equivalent of
getch() will not return a stream of bytes but rather a "stream" of
integers that map one to one to the "unicode characters" that were
entered at the terminal.


Note: I am only familiar with languages such as English, Spanish,
French, etc. where only one terminal cell is used for each glyph. My
understanding is that things get somewhat more complicated with
languages that require so-called "wide characters" - two terminal cells
per character, but that's a different issue.
So, still assuming utf-8 encoding in the terminal, if I type:

Té[KEY_UP]ça

iterating call to the getch() function will give me this sequence of
integers:

84, 195, 169, 259, 195, 167, 97
T- é------- KEY_UP ç------- a-

But what I want to get this stream instead:

u'T', u'é', 259, u'ç', u'a'

No, for the above, getch() will return:

84, 50089, 259, 50087, 97

.. which is "functionally" equivalent to:

u'T', u'é', 259, u'ç', u'a'

[..]

So shouldn't this issue boil down to just a matter of casting the
integers to the "u" data type?

This short snippet may help clarify the above:

-----------------------------------------------------------------------
#include <locale.h>
#include <ncurses.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

int unichar;

int main(int argc, char *argv[])
{
setlocale(LC_ALL, "en_US.UTF.8"); /* make sure UTF8 */
initscr(); /* start curses mode */
raw();
keypad(stdscr, TRUE); /* pass special keys */
unichar = getch(); /* read terminal */

mvprintw(24, 0, "Key pressed is = %4x ", unichar);

refresh();
getch(); /* wait */
endwin(); /* leave curses mode */
return 0;
}
-----------------------------------------------------------------------

Hopefully you have access to a C compiler:

$ gcc -lncurses uni00.c -o uni00

Thanks for this. When I test it on my machine (BTW it is MacOS 10.5.7),
if I type an ASCII character (e.g. 'A'), I get its ASCII code (0x41),
but if I type a non-ascii character (e.g. '§') I get back to the prompt
immediately. It must be because two values are queued for getch. I
should try it on a Linux machine, but I don't have one handy at the
moment.

I have made a little test script in Python which is similar but will
only stop when 'Esc' is pressed.

--------------------------------------------------
import curses

def getcodes(win):
    codes = []
    while True:
        c = win.getch()
        if c == 27:
            return codes
        codes.append(c)

print curses.wrapper(getcodes)
--------------------------------------------------

If I try this in a Terminal and type 'souçi[ESC]', I get this:

[115, 111, 117, 195, 167, 105]
s--, o--, u--, ç-------, i--

As you see, two calls to getch() were necessary after typing 'ç'.
BTW on the same terminal:

marigold:junk arno$ locale
LANG="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_CTYPE="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_ALL=

I will have to do tests with other encodings.
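
As a starting point for those tests, the list captured above can be decoded offline, and the byte sequences for 'ç' compared across encodings. This is a Python 3 sketch (an assumption; the script above is Python 2) and it does not touch curses at all.

```python
# Python 3 sketch: decode the captured getch() values and compare encodings.
codes = [115, 111, 117, 195, 167, 105]     # values reported above for 'souçi'
text = bytes(codes).decode("utf-8")
print(text)

# How many getch() calls 'ç' would need if the terminal used each encoding,
# assuming getch() hands back one byte per call as observed above.
per_encoding = {}
for enc in ("utf-8", "latin-1"):
    per_encoding[enc] = list("\u00e7".encode(enc))
    print(enc, per_encoding[enc])
```

In a Latin-1 terminal the same key would presumably arrive as a single value, so any decoding layer has to know the terminal's encoding.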
 
Chris Jones

On Sun, May 31, 2009 at 04:05:20AM EDT, Arnaud Delobelle wrote:

[..]
Thanks for this. When I test it on my machine (BTW it is MacOS 10.5.7),
if I type an ASCII character (e.g. 'A'), I get its ASCII code (0x41),
but if I type a non-ascii character (e.g. '§') I get back to the prompt
immediately. It must be because two values are queued for getch. I
should try it on a Linux machine, but I don't have one handy at the
moment.

Well so much for transparency :-(

Try this:

#include <locale.h>
#include <ncurses.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

int ct;
wint_t unichar;

int main(int argc, char *argv[])
{
    setlocale(LC_ALL, "");             /* make sure UTF8 */
    initscr();
    raw();
    keypad(stdscr, TRUE);
    ct = get_wch(&unichar);            /* read character */
    mvprintw(24, 0, "Key pressed is = %4x ", unichar);

    refresh();
    get_wch(&unichar);                 /* wait */
    endwin();
    return 0;
}

gcc -lncursesw uni10.c -o uni10 # different lib..
             ^

Seems there's more to it than my assumption that the python wrapper was
not wrapping as transparently as it should.. here I'm using wide
characters.. not sure how this translates to python.
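
On the Python side, later versions of the curses module (3.3 onward, so not the Python 2.5 discussed in this thread) expose window.get_wch(), which returns a str for ordinary characters and an int for special keys. It is only available when Python was built against a wide-character curses library. A minimal sketch, with collect_keys and main as made-up names:

```python
import curses
import locale

ESC = "\x1b"

def collect_keys(win, stop=ESC):
    """Read keys with win.get_wch() until `stop` is typed; return them as a list."""
    keys = []
    while True:
        key = win.get_wch()       # str for characters, int for special keys
        if key == stop:
            return keys
        keys.append(key)

def main():
    # Needs a real terminal. The locale is set first so that curses
    # knows which encoding the terminal uses.
    locale.setlocale(locale.LC_ALL, "")
    print(curses.wrapper(collect_keys))
```

Calling main() from a terminal runs the loop; collect_keys itself only needs an object with a get_wch() method, which makes it easy to try with a stand-in window.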

CJ
 
Arnaud Delobelle

Chris Jones said:
Try this:

#include <locale.h>
#include <ncurses.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

/* Here I need to add the following include to get wint_t on macOS X*/

#include <wctype.h>

int ct;
wint_t unichar;

int main(int argc, char *argv[])
{
setlocale(LC_ALL, ""); /* make sure UTF8 */
initscr();
raw();
keypad(stdscr, TRUE);
ct = get_wch(&unichar); /* read character */
mvprintw(24, 0, "Key pressed is = %4x ", unichar);

refresh();
get_wch();
endwin();
return 0;
}

gcc -lncursesw uni10.c -o uni10 # different lib..
             ^

My machine doesn't know about libncursesw:

marigold:c arno$ ls /usr/lib/libncurses*
/usr/lib/libncurses.5.4.dylib
/usr/lib/libncurses.dylib
/usr/lib/libncurses.5.dylib

So I've compiled it with libncurses as before and it works.

This is what I get:

If I run the program and type 'é', I get a code of 'e9'.

In python:
é

So it has been encoded using isolatin1. I really don't understand why.
I'll have to investigate this further.

If I change the line:

setlocale(LC_ALL, ""); /* make sure UTF8 */

to

setlocale(LC_ALL, "en_GB.UTF-8"); /* make sure UTF8 */

then the behaviour is the same as before (i.e. get_wch() gets called
twice instantly).

I'll do some more investigating (when I can think of *what* to
investigate) and I will tell you my findings.

Thanks
 
Chris Jones

Try this:

#include <locale.h>
#include <ncurses.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

/* Here I need to add the following include to get wint_t on macOS X*/

#include <wctype.h>

Ah.. interesting. My posts were rather linux-centric, I must say.

Naturally, the library & header files setup on MacOS would likely
differ.

[..]
My machine doesn't know about libncursesw:

marigold:c arno$ ls /usr/lib/libncurses*
/usr/lib/libncurses.5.4.dylib
/usr/lib/libncurses.dylib
/usr/lib/libncurses.5.dylib

So I've compiled it with libncurses as before and it works.

Nothing to complain about.. makes it even more seamless..!
This is what I get:

If I run the program and type 'é', I get a code of 'e9'.

In python:

é

So it has been encoded using isolatin1. I really don't understand why.
I'll have to investigate this further.

That's why I tested both my efforts with a euro symbol that I entered
via the Compose key + E= .. which generates 0x20AC in a UTF-8 setup.
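
The reason the euro sign is a more telling test can be sketched in a few lines of Python 3 (an assumption; the thread uses Python 2.5 and C). For 'é' the code point happens to equal its Latin-1 byte, so a result of e9 cannot tell "wide character returned" apart from "Latin-1 encoded"; for '€' the two cases differ.

```python
# Python 3 sketch: code point, UTF-8 bytes and Latin-1 byte for é and €.
table = {}
for ch in ("\u00e9", "\u20ac"):            # é and €
    code_point = hex(ord(ch))
    utf8 = ch.encode("utf-8").hex()
    try:
        latin1 = ch.encode("latin-1").hex()
    except UnicodeEncodeError:
        latin1 = None                      # not representable in Latin-1
    table[ch] = (code_point, utf8, latin1)
    print(ch, code_point, utf8, latin1)
```

A reading of 20ac for '€' therefore points to a decoded wide character rather than a single-byte encoding.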
If I change the line:

setlocale(LC_ALL, ""); /* make sure UTF8 */

to

setlocale(LC_ALL, "en_GB.UTF-8"); /* make sure UTF8 */

then the behaviour is the same as before (i.e. get_wch() gets called
twice instantly).

I'll do some more investigating (when I can think of *what* to
investigate) and I will tell you my findings.

This is really strange.

I am absolutely certain that I had gotten my initial version, with the
old getch() .. etc. routines, to work, but since I continued
experimenting with the same source snippet, I no longer had it available
to investigate where I erred.

I copy/pasted the snippet back from my earlier post.. and so far I
haven't been able to get it to work again.. :-(

The logic behind my assuming that getch() et al. should work
transparently in a UTF-8 context is that:

1. the 3NCURSES man pages are rather detailed and I could not find UTF-8
or "unicode" mentioned anywhere apart from some obscure configuration
option that is likely turned on everywhere by default.

2. More importantly, having to switch to the "wide character" macros or
functions instead of the "narrow" versions would have meant that just
about every application that uses ncurses would have had to be
heavily patched - I would have imagined that all the changes would be
made in the library in a transparent fashion so that the maintainers
of such apps would only have had to link against the new ncurses
lib..???

I'll keep you posted if I find something useful.

CJ
 
