Python And Internationalization maybe a pre-pep?

Brian Kelley

I have been using gettext and various utilities to provide
internationalization for a wxPython application, and I have not really
liked the process. It uses a macro-style notation to indicate which
strings should be internationalized.

Essentially, everything looks like this:
_("STRING")

A good description of this system is located here:
http://wiki.wxpython.org/index.cgi/Internationalization

I got to thinking that this is python, I shouldn't have to do this. Why
not have a mechanism to capture this data at compile time?

I made a new python string type and got enough working on the parser to
accept

string = i"whatever"

Which is just a subtype of string. When Python parses this, it adds it
as an entry in the internationalization table (essentially a
dictionary). It also adds it as an entry to the standard gettext
mechanism to indicate strings that need to be localized. When the
string is created, if a value exists in the international table for the
given string, the string is converted to the current locale.

The locale is set through the standard locale module. When this
happens, all internationalized strings, when used, are converted to
their international representations (using unicode if necessary)
automagically through the gettext system.

Hopefully I'm not reinventing the wheel but one nice thing about this
system is that after some slight code modifications to the python
sources declaring exception error messages to be international, I now
have a nice dictionary of error messages that need to be localized.
These would be localized in the standard way using gettext tools.

Now, I don't know if this is a good idea or not, and I may be
reinventing some wheels, but it has some appealing characteristics and
ties in with the gettext system really well. Of course, I'm a bit leery
of changing the Python parser, but I was uncomfortable with the two-step
process of writing the code, then passing it through a source
analyzer so that it could be internationalized.

Of course I have some issues. For example, if you use an international
string in a dictionary and its internal value changes based on the
locale, bad things happen. This requires that the locale be set only once at the
beginning of the application otherwise undesirable events occur :)
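
A small sketch of this hazard, using a hypothetical in-memory catalog in place of real gettext files (the names `CATALOG`, `translate` and `open_file` are invented for illustration): if the translated text is used as the dictionary key, a lookup made after the language changes misses. Keying on the untranslated message and translating only for display avoids the problem.

```python
# Hypothetical catalog standing in for real gettext data.
CATALOG = {"es": {"Open": "Abrir"}}


def translate(msgid, language):
    """Return the translation for `language`, or the msgid itself if none exists."""
    return CATALOG.get(language, {}).get(msgid, msgid)


def open_file():
    return "opened"


# Fragile: the key is whatever the message translated to when the dict was built.
by_label = {translate("Open", "en"): open_file}
found_after_switch = translate("Open", "es") in by_label

# Sturdier: key on the untranslated message and translate only for display.
by_msgid = {"Open": open_file}
label = translate("Open", "es")
```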

What I would like is something like

import internationalization

and then at compile time strings like i"whatever" would be analyzed but
would be normal strings without "import internationalization"

So, am I being silly, redundant or just plain different?

Brian
 
Martin v. Loewis

Brian said:
Essentially, everything looks like this
_("STRING")

I got to thinking that this is python, I shouldn't have to do this. Why
not have a mechanism to capture this data at compile time?

Precisely because it is Python you should have to do this. The language
that looks like line noise is a different one...

Explicit is better than implicit.
Simple is better than complex.

Which is just a subtype of string. When Python parses this, it adds it
as an entry in the internationalization table (essentially a
dictionary). It also adds it as an entry to the standard gettext
mechanism to indicate strings that need to be localized. When the
string is created, if a value exists in the international table for the
given string, the string is converted to the current locale.

Using what textual domain?

Now, I don't know if this is a good idea or not, and I may be
reinventing some wheels, but it has some appealing characteristics and
ties in with the gettext system really well. Of course, I'm a bit leery
of changing the Python parser, but I was uncomfortable with the two-step
process of writing the code, then passing it through a source
analyzer so that it could be internationalized.

What is your problem with that technique? You have to extract the
strings *anyway*, because the translators should get a list of the
strings to translate, and not need to worry with your source code.

So, am I being silly, redundant or just plain different?

I do wonder what kind of application are you looking at. How
many strings? How many different translators? How often do
the strings change? Are the strings presented to the immediate
user of the application sitting in front of the terminal where
the application is running, or are there multiple simultaneous
accesses to the same application, e.g. through a Web server?

Regards,
Martin
 
Brian Kelley

Martin said:
Precisely because it is Python you should have to do this. The language
that looks like line noise is a different one...

Explicit is better than implicit.
Simple is better than complex.

I suppose we see this a little differently. What I am suggesting is, I
think, still explicit; the major difference is that the
internationalization tables are generated when the code is compiled.

I don't see too much difference between using
_("STRING")

to indicate a string to be internationalized and
i"STRING"

Except that the latter will be automatically noticed at compile time, not
run-time.

I have also played with

international("STRING")

Using what textual domain?

I am using the same mechanism as gettext but tying it into the "import
locale" mechanism.

What is your problem with that technique? You have to extract the
strings *anyway*, because the translators should get a list of the
strings to translate, and not need to worry with your source code.

I think this happens fine in both cases. The mechanism for
internationalizing with wxPython just didn't feel, well, pythonic. It
felt kind of tacked into place. Of course I feel the same way about
most C macros :)

I do wonder what kind of application are you looking at. How
many strings? How many different translators? How often do
the strings change? Are the strings presented to the immediate
user of the application sitting in front of the terminal where
the application is running, or are there multiple simultaneous
accesses to the same application, e.g. through a Web server?

The number of strings doesn't really matter I think as long as you can
automatically generate the ones that need to be translated. Both
mechanisms do this.

I hadn't previously thought about multiple simultaneous users but this
could fit in nicely. After some experimentation with the string
classes, it turns out that as long as the __repr__ of the string stays
unchanged, i.e. in this case the original english version, then the
__str__ of a string (the locale specific changes) can change willy-nilly
and not affect things like dictionaries and class name lookups.
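
A rough sketch of this idea in present-day Python, under stated assumptions: the class name `International`, the `CATALOGS` table and `set_language` are all hypothetical, and this is not the code from the post. Note that dictionary lookups actually depend on the hash and equality of the object, which a str subclass inherits from the original text; only `__str__` is overridden here.

```python
# Hypothetical catalog: language code -> {msgid: translation}.
CATALOGS = {"es": {"This is good": "Esto es bueno"}}
_current = {"language": None}  # None means "show the original text"


def set_language(code):
    _current["language"] = code


class International(str):
    """Hashes and compares as the original text; str() gives the translation."""

    def __str__(self):
        msgid = str.__str__(self)  # plain-str copy of the original text
        table = CATALOGS.get(_current["language"], {})
        return table.get(msgid, msgid)


message = International("This is good")
lookup = {message: "handler"}

set_language("es")
print(message)                 # display form follows the selected language
print(lookup["This is good"])  # the lookup still works by the original text
```

Concatenation and other str methods still operate on the original text in this sketch, so translation only happens where `str()` is applied.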

Perhaps I am optimizing the wrong end of the stick. I could change the
gettext _("STRING") functionality to (mostly) do what I want without the
need for parser changes. However, I was entranced by the thought of
writing (in normal python)

raise Exception(i"You shouldn't reach this exception at line %s"%line)

and automatically generating the translation files at compile time. Of
course this is different than

raise Exception(_("You shouldn't reach this exception at line %s"%line))

which generates them at run-time or by using the source code-analyzer.
I personally think the first one is slightly more explicit. Perhaps

raise Exception(
    internationalize("You shouldn't reach this exception at line %s"%line)
)

would be better.
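
One detail worth noting about the function-call forms above: when `%` is applied inside the call, the lookup sees the already formatted text, which will not match a catalog entry for the template. The sketch below illustrates this with a hypothetical one-entry catalog and a stand-in `_`; the Spanish text is illustrative only.

```python
# Hypothetical one-entry catalog; the translation text is illustrative.
CATALOG = {
    "You shouldn't reach this exception at line %s":
        "No se debería llegar a esta excepción en la línea %s",
}


def _(msgid):
    """Stand-in translator: return the catalog entry, or the msgid if missing."""
    return CATALOG.get(msgid, msgid)


line = 42

# Formatting first: the lookup key is "... at line 42", which is not in the catalog.
formatted_first = _("You shouldn't reach this exception at line %s" % line)

# Translating first: the template is found, then the value is filled in.
translated_first = _("You shouldn't reach this exception at line %s") % line
```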

Compile-time generation just feels safer to me. The code-analyzer might
miss special cases and what not, but if it is generated at compile time
it will work all the time.

I suppose that there really isn't much difference in the long run as
long as tools exist that make these translations relatively easy but I
can't quite shake the thought of trying to teach my brother, a
biologist, how to enable localization in his code in an easy manner.

Anyway, you have made good points and maybe all I need are some better
integrated tools for localizing my applications.

Since you brought it up, how does gettext handle multiple users? Does
each user get a different gettext library (dll) instance?

Brian
 
Serge Orlov

Brian Kelley said:
The number of strings doesn't really matter I think as long as you can
automatically generate the ones that need to be translated. Both
mechanisms do this.

I hadn't previously thought about multiple simultaneous users but this
could fit in nicely. After some experimentation with the string
classes, it turns out that as long as the __repr__ of the string stays
unchanged, i.e. in this case the original english version, then the
__str__ of a string (the locale specific changes) can change willy-nilly
and not affect things like dictionaries and class name lookups.

If you keep the translator instance in a global variable, then users won't
be able to select different languages.

I suppose that there really isn't much difference in the long run as
long as tools exist that make these translations relatively easy but I
can't quite shake the thought of trying to teach my brother, a
biologist, how to enable localization in his code in an easy manner.

Anyway, you have made good points and maybe all I need are some better
integrated tools for localizing my applications.

Exactly. At the link you provided, the description of creating the database of
strings for translation takes 120 words, whereas the whole text is about
2500 words. And I believe collecting the strings to translate should take
place only before a release. How often is that?

Since you brought it up, how does gettext handle multiple users? Does
each user get a different gettext library (dll) instance?

No. Different translator class instances. Your pre-pep doesn't help
here.

-- Serge.
 
Neil Hodgson

Brian Kelley:
I made a new python string type and got enough working on the parser to
accept

string = i"whatever"

I would not like to see strings gain more interpretation prefix
characters without very strong justification as the combinational growth
makes code interpretation more difficult. The "i" would need to be used in
conjunction with the "u" or "U" Unicode prefix and possibly the "r" or "R"
raw prefix.

The order of interpretation here is reasonably obvious as the
internationalization would be applied after the treatment as Unicode and
raw. However, when used in conjunction with other proposed prefix characters
such as for variable interpolation, the ordering becomes more questionable.

Neil
 
Martin v. Löwis

Brian said:
I don't see too much difference between using
_("STRING")

to indicate a string to be internationalized and
i"STRING"

Except that the latter will be automatically noticed at compile time, not
run-time.

I'm not really sure what "compile time" is in Python. If it is the time
when the .pyc files are generated: What happens if the user using the
pyc files has a language different from the language of the
administrator creating the files?

Or, what precisely does it mean that it is "noticed"?

I am using the same mechanism as gettext but tying it into the "import
locale" mechanism.

That does not answer my question? How precisely do you define the
textual domain?

In gettext, I typically put a function at the beginning of each module

def _(msg):
    return gettext.dgettext("grep", msg)

thus defining the textual domain as "grep". In your syntax, I see no
place to define the textual domain.

I think this happens fine in both cases. The mechanism for
internationalizing with wxPython just didn't feel, well, pythonic. It
felt kind of tacked into place. Of course I feel the same way about
most C macros :)

Can you be more specific? Do you dislike function calls?

Perhaps I am optimizing the wrong end of the stick.

Why is this an optimization in the first place? Do you make anything run
faster? If so, what and how?

Compile-time generation just feels safer to me. The code-analyzer might
miss special cases and what not, but if it is generated at compile time
it will work all the time.

Please again elaborate: What is compile-time, and what is generation,
in this context?

Since you brought it up, how does gettext handle multiple users? Does
each user get a different gettext library (dll) instance?

Gettext, in Python, is a pure Python module - so no DLL.

The gettext module provides the GNUTranslations class. You can have
multiple such objects, and need to explicitly use the language's
object on gettext() invocation, e.g.

print current_language.gettext("This is good")

If you can be certain that users will use the application one after
another, never simultaneously, you can do

def _(msg):
    return current_language.gettext(msg)

print _("This is good")

There might be other ways to determine the current context, e.g.
by looking at thread state. In that case, you still can use
the _ function, but implement it in looking at the thread state.

Regards,
Martin
 
Brian Kelley

Martin said:
I'm not really sure what "compile time" is in Python. If it is the time
when the .pyc files are generated: What happens if the user using the
pyc files has a language different from the language of the
administrator creating the files?

Or, what precisely does it mean that it is "noticed"?




That does not answer my question? How precisely do you define the
textual domain?

In gettext, I typically put a function at the beginning of each module

def _(msg):
    return gettext.dgettext("grep", msg)

thus defining the textual domain as "grep". In your syntax, I see no
place to define the textual domain.



Can you be more specific? Do you dislike function calls?



Why is this an optimization in the first place? Do you make anything run
faster? If so, what and how?



Please again elaborate: What is compile-time, and what is generation,
in this context?

I'll back up for a second to try to explain my previous rationale,
now that it appears to have changed, that is.

By compile time I essentially meant "not run time". For instance, if you
had a function

def foo(...):
    return i"Message"

You wouldn't have to execute foo directly for the parse tree to notice
that i"Message" was a string that is supposed to be internationalized.
This process gets this string placed in the internationalization table
which I would then use various tools to modify and translate.
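
Finding marked strings without executing the code can be approximated today with the standard `ast` module. The sketch below is an assumption-laden illustration rather than the parser change described here: since `i"..."` is not valid Python, it looks for the `international("...")` call form instead, and `extract_marked_strings` is a hypothetical helper. It relies on `ast.Constant`, so it needs Python 3.8 or later.

```python
import ast

SOURCE = '''
def foo():
    return international("Message")

def bar():
    return "not marked"
'''


def extract_marked_strings(source, marker="international"):
    """Collect literal first arguments of calls to `marker` without running the code."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == marker
            and node.args
            and isinstance(node.args[0], ast.Constant)
            and isinstance(node.args[0].value, str)
        ):
            found.append(node.args[0].value)
    return found


messages = extract_marked_strings(SOURCE)
print(messages)
```

Neither `foo` nor `bar` is ever called; only the call to the marker name contributes a message.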

The international string class would then have an internal mechanism to
notice what the locale is and what the value of the string should be
when printed or used. This is transparent to dictionaries and the like,
so all the developer has to do is use

i"Message"

This should be transparent to GUI widgets and the like, so you can place
i"Message" in a list box and it will automatically get internationalized,
but when you get the result the user selected, it can still be used
for dictionary lookups exactly the same as the original i"Message".

The gettext module provides the GNUTranslations class. You can have
multiple such objects, and need to explicitly use the language's
object on gettext() invocation, e.g.

print current_language.gettext("This is good")

This is the mechanism that I was originally using for the international
string class.

If you can be certain that users will use the application one after
another, never simultaneously, you can do

def _(msg):
    return current_language.gettext(msg)

print _("This is good")

My version of this is
message = i"This is good"
print message

There might be other ways to determine the current context, e.g.
by looking at thread state. In that case, you still can use
the _ function, but implement it in looking at the thread state.

I'll take a look at this.

An example of how I am currently using this is as follows:

message = i"This is good"
or
message = international("This is good")

lookup = {}
print message # -> "This is good"
lookup[message] = func

locale.set_locale("es") # Spanish, this is from memory, might be wrong
print message # -> "Esto es bueno"
assert lookup[message] == func

locale.set_locale("it") # Italian
print message # -> "Ciò è buona"
assert lookup[message] == func

My current thinking is as follows, I could get most of this effect by
not having a new string type (i"") and just making an
internationalization string type. Then I would just use code analyzer
tools to generate the translation tables as Serge suggested, heck, they
could even use the parse module if I was so inclined.

In either case, I far prefer using internation("This is good") over the
_("This is good") function call, mainly because it is more explicit and
because using _ overrides the interpreter's built-in for getting the last result.

Thanks for your inspection of this rambling.
Brian
 
Martin v. Löwis

Brian Kelley said:
def foo(...):
    return i"Message"

You wouldn't have to execute foo directly for the parse tree to notice
that i"Message" was a string that is supposed to be
internationalized. This process gets this string placed in the
internationalization table which I would then use various tools to
modify and translate.

But this is the technology today! The xgettext utility (either
GNU xgettext, or xgettext.py) extracts internationalized strings
from Python code into "internationalization tables". These tables
are called "PO files". They get translated, then converted into
an efficient internal representation, called "MO files".

I'm uncertain how this relates to the internationalization tables
you are proposing, but please recognize that a single application
may have *multiple* such tables, called "textual domains". To
find a translation, you not only need the message id, but you
also need the textual domain.
This should be transparent to GUI widgets and the like, so you can place
i"Message" in a list box and it will automatically get internationalized,
but when you get the result the user selected, it can still be used
for dictionary lookups exactly the same as the original i"Message".

Using what textual domain?

My current thinking is as follows, I could get most of this effect by
not having a new string type (i"") and just making an
internationalization string type. Then I would just use code analyzer
tools to generate the translation tables as Serge suggested, heck,
they could even use the parse module if I was so inclined.

xgettext.py already performs very simple parsing of source files,
to extract messages. Why reinvent the wheel?

In either case, I far prefer using internation("This is good") over
the _("This is good") function call.

So do call it internation(), or internationalization() or something
else. It is just a function:

def internation(msg):
    return gettext.dgettext("some domain", msg)

Regards,
Martin
 
Brian Kelley

Martin said:
I'm uncertain how this relates to the internationalization tables
you are proposing, but please recognize that a single application
may have *multiple* such tables, called "textual domains". To
find a translation, you not only need the message id, but you
also need the textual domain.

I understand this. I was just curious whether this should be internal
or external to Python. Various comments have indicated that it really
doesn't matter. This might be a red herring in the current discussion;
let's forget that I mentioned parsing at all :)

Using what textual domain?

As I understand it, it doesn't matter. I have been wrong before though.
The textual domains are set by setting the locale and what I
considered the "neat" part was that the dictionary look ups are
independent of textual domain. That is, you could switch textual
domains on the fly and not affect dictionary look-ups. I don't think
this would work with the original gettext system. That is, as
I understand it, if you changed the locale you would need to call the _()
function again, which means that any previous dictionaries would need to
be re-created with the new string values.

xgettext.py already performs very simple parsing of source files,
to extract messages. Why reinvent the wheel?

I agree with this. I just initially didn't like the idea of using an
independent source code analyzer. Let's assume that I will start with
xgettext and modify that instead.

So do call it internation(), or internationalization() or something
else. It is just a function:

Ahh, but mine isn't a function, it is a subclass of string.

So in my version of gettext, calling
_("Hello")

doesn't return the current localized string; it returns a string object
that looks up the current localization on the fly for the purposes of
printing, essentially str operations.

Your comments about looking at the thread state are quite useful, I
might play with that.

Again, if I am reinventing the wheel on this, I apologize. And again,
thanks for bearing with me.

Brian
 
Brian Kelley

Neil said:
Brian Kelley:




I would not like to see strings gain more interpretation prefix
characters without very strong justification as the combinational growth
makes code interpretation more difficult. The "i" would need to be used in
conjunction with the "u" or "U" Unicode prefix and possibly the "r" or "R"
raw prefix.

The order of interpretation here is reasonably obvious as the
internationalization would be applied after the treatment as Unicode and
raw. However, when used in conjunction with other proposed prefix characters
such as for variable interpolation, the ordering becomes more questionable.

If the gettext mechanism is followed, it would make sense that
i"whatever" would be a special form of unicode so multiple prefixes
might not be necessary.

That being said, after some of the preceding discussions, I'm not sure
that introducing a new built-in datatype would be that useful. I'm
leaning toward something like:

string = international("whatever")

Which would be a bit more typing but would allow

international(r"whatever")

I think that we've pretty much beaten this topic into the ground though,
it's time to finish coding up the data type and see how it works.

Thanks for your input.
Brian
 
