Prothon should not borrow Python strings!

Paul Prescod

I skimmed the tutorial and something alarmed me.

"Strings are a powerful data type in Prothon. Unlike many languages,
they can be of unlimited size (constrained only by memory size) and can
hold any arbitrary data, even binary data such as photos and movies. They
are of course also good for their traditional role of storing and
manipulating text."

This view of strings is about a decade out of date with modern
programming practice. From the programmer's point of view, a string
should be a list of characters. Characters are logical objects that have
properties defined by Unicode. This is the model used by Java,
Javascript, XML and C#.

Characters are an extremely important logical concept for human beings
(computers are supposed to serve human beings!) and they need
first-class representation. It is an accident of history that the
language you grew up with has so few characters that they can have a
one-to-one correspondence with bytes.

I can understand why you might be afraid to tackle all of Unicode for
version 1.0. Don't bother. All you need to do today to avoid the dead
end is DO NOT ALLOW BINARY DATA IN STRINGS. Have a binary data type.
Have a character string type. Give them a common "prototype" if you
wish. Let them share methods. But keep them separate in your code. The
result of reading a file is a binary data string. The result of parsing
an XML file is a character string. These are as different as the bits
that represent an integer in a particular file format and a logical integer.
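
The analogy with integers can be made concrete. The sketch below uses Python 3 (which postdates this thread and keeps bytes and character strings apart), not Prothon, and the sample byte values are picked purely for illustration. The same bytes give different integers depending on byte order, and different characters depending on encoding.

```python
# The same two bytes are different integers under different byte orders.
raw_int = b"\x00\x01"
as_big = int.from_bytes(raw_int, "big")        # 1
as_little = int.from_bytes(raw_int, "little")  # 256

# The same single byte is a different character under different encodings.
raw_char = b"\xe9"
as_latin1 = raw_char.decode("latin-1")  # U+00E9, e with acute accent
as_cp437 = raw_char.decode("cp437")     # a different character entirely

# Bytes and text are distinct types and do not mix silently.
try:
    raw_char + as_latin1
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(as_big, as_little, as_latin1 != as_cp437, mixed_ok)
```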

Even if your character data type is today limited to characters between
0 and 255, you can easily extend that later. But once you have megabytes
of code that makes no distinction between characters and bytes it will
be too late. It would be like trying to tease apart integers and floats
after having treated them as indistinguishable. (which brings me to my
next post)

Paul Prescod
 
Mark Hahn

Paul Prescod said:
I can understand why you might be afraid to tackle all of Unicode for
version 1.0. Don't bother. All you need to do today to avoid the dead
end is DO NOT ALLOW BINARY DATA IN STRINGS. Have a binary data type.
Have a character string type. Give them a common "prototype" if you
wish. Let them share methods. But keep them separate in your code. The
result of reading a file is a binary data string. The result of parsing
an XML file is a character string. These are as different as the bits
that represent an integer in a particular file format and a logical
integer.

This is very timely. I would like to resolve issues like this by July and
that deadline is coming up very fast.

We have had discussions on the Prothon mailing list about how to handle
Unicode properly but no one pointed this out. It makes perfect sense to me.

Is there any dynamic language that already does this right for us to steal
from or is this new territory? I know for sure that I don't want to steal
Java's streams. I remember hating them with a passion.
 
Paul Prescod

Mark said:
...

Is there any dynamic language that already does this right for us to steal
from or is this new territory? I know for sure that I don't want to steal
Java's streams. I remember hating them with a passion.

I don't consider myself an expert: there are just some big mistakes that
I can recognize. But I'll give you as much guidance as I can.

Start here:

http://www.joelonsoftware.com/articles/Unicode.html

Summary:

"""It does not make sense to have a string without knowing what encoding
it uses. You can no longer stick your head in the sand and pretend that
"plain" text is ASCII.

There Ain't No Such Thing As Plain Text.

If you have a string, in memory, in a file, or in an email message, you
have to know what encoding it is in or you cannot interpret it or
display it to users correctly."""

One thing I should have told you is that it is just as important to get
your internal APIs right as your syntax. If you embed the "ASCII
assumption" into your APIs you will have a huge legacy of third party
modules that expect all characters to be <255 and you'll be stuck in the
same cul de sac as Python.

I would define macros like

#define PROTHON_CHAR int

and functions like

Prothon_String_As_UTF8
Prothon_String_As_ASCII // raises error if there are high characters

Obviously I can't think through the whole API. Look at Python,
JavaScript and JNI, I guess.

http://java.sun.com/docs/books/jni/html/objtypes.html#4001

The gist is that extensions should not poke into the character string
data structure expecting the data to be a "char *" of ASCII bytes.
Rather it should ask you to decode the data into a new buffer. Maybe you
could do some tricky buffer reuse if the encoding they ask for happens
to be the same as your internal structure (look at the Java "isCopy"
stuff). But if you promise users the ability to directly fiddle with the
internal data then you may have to break that promise one day.

To get from a Prothon string to a C string requires encoding because
_there ain't no such thing as a plain string_. If the C programmer
doesn't tell you how they want the data encoded, how will you know?
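
A rough model of that API shape, written in Python 3 rather than C. The class and method names (OpaqueString, as_utf8, as_ascii) are hypothetical and only loosely mirror the Prothon_String_As_UTF8 and Prothon_String_As_ASCII names suggested above. The point is just that callers name an encoding and get a fresh buffer back, never a view of the internal storage.

```python
class OpaqueString:
    """Hypothetical string object whose internal storage is private."""

    def __init__(self, text):
        # Internal representation: a tuple of code points. Callers never
        # see this directly, so it could change in a later version.
        self._code_points = tuple(ord(ch) for ch in text)

    def _encode(self, encoding):
        text = "".join(chr(cp) for cp in self._code_points)
        return text.encode(encoding)  # always a new bytes object

    def as_utf8(self):
        return self._encode("utf-8")

    def as_ascii(self):
        # Raises UnicodeEncodeError if any character is above 127.
        return self._encode("ascii")


s = OpaqueString("caf\u00e9")
utf8_buffer = s.as_utf8()

try:
    s.as_ascii()
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
```

Because the caller only ever holds an encoded copy, the internal representation stays free to change.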

If you get the APIs right, it will be much easier to handle everything
else later.

Choosing an internal encoding is actually pretty tricky because there
are space versus time tradeoffs and you need to make some guesses about
how often particular characters are likely to be useful to your users.

==

On the question of types: there are two models that seem to work okay in
practice. Python's split between byte strings and Unicode strings is
actually not bad except that the default string literal is a BYTE string
(for historical reasons) rather than a character string.
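
For comparison, a small sketch in Python 3, which was released after this thread and makes the default literal a character string. The raw bytes literal below is only a stand-in for how a plain literal behaved in the Python discussed here, where the \u escape was not interpreted.

```python
text = "a \u1234"    # character string: the escape becomes one character
data = br"a \u1234"  # byte string: backslash, 'u' and four digits stay separate bytes

print(len(text))  # 3
print(len(data))  # 8
```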

Here's what Javascript does (i.e. better):

<script>
str = "a \u1234"
alert(str.length) // 3
</script>

===

By the way, if you have the courage to distance yourself from every
other language under the sun, I would propose that you throw an
exception on unknown escape sequences. It is very easy in Python to
accidentally use an escape sequence that is incorrect as above. Plus,
it is near impossible to add new escape sequences to Python because they
may break some code somewhere. I don't understand why this case is
special enough to break the usual Python commitment to "not guess" what
programmers mean in the face of ambiguity. This is another one of those
things you have to get right at the beginning because it is tough to
change later! Also, I totally hate how character numbers are not
delimited. It should be \u{1} or \u{1234} or \u{12345}. I find Python
totally weird:
UnicodeDecodeError: 'unicodeescape' codec can't decode bytes in position
0-4: end of string in escape sequence
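
To make the proposal concrete, here is a minimal sketch of a strict escape processor in Python 3. The function name, the small escape table and the exact \u{...} rules are assumptions made for illustration, not anything Prothon or Python defines. Unknown escapes raise an error instead of being passed through, and character numbers must be delimited with braces.

```python
import re

# Hypothetical escape table for illustration; a real language would have more.
SIMPLE_ESCAPES = {"n": "\n", "t": "\t", "\\": "\\", '"': '"'}

# Delimited character number: one to six hex digits inside braces.
HEX_ESCAPE = re.compile(r"\{([0-9A-Fa-f]{1,6})\}")


def unescape_strict(text):
    """Expand escapes in text, raising ValueError on anything unknown."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(text):
            raise ValueError("dangling backslash at end of string")
        nxt = text[i + 1]
        if nxt in SIMPLE_ESCAPES:
            out.append(SIMPLE_ESCAPES[nxt])
            i += 2
        elif nxt == "u":
            match = HEX_ESCAPE.match(text, i + 2)
            if match is None:
                raise ValueError("malformed \\u{...} escape at index %d" % i)
            code_point = int(match.group(1), 16)
            if code_point > 0x10FFFF:
                raise ValueError("code point out of range at index %d" % i)
            out.append(chr(code_point))
            i = match.end()
        else:
            raise ValueError("unknown escape sequence \\%s at index %d" % (nxt, i))
    return "".join(out)
```

With this sketch, unescape_strict(r"a \u{1234}") yields a three-character string, while unescape_strict(r"abc\q") raises ValueError rather than guessing what was meant.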

====

So anyhow, the Python model is that there is a distinction between
character strings (which Python calls "unicode strings") and byte
strings (called 8-bit strings). If you want to decode data you are
reading from a file, you can just:

file("filename").read().decode("ascii")

or

file("filename").read().decode("utf-8")
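
The same pattern sketched in Python 3, where file() no longer exists and a file opened in binary mode returns bytes. The temporary file and its contents are made up for the example.

```python
import os
import tempfile

# Write some UTF-8 encoded bytes to a temporary file.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write("na\u00efve".encode("utf-8"))
    path = handle.name

try:
    with open(path, "rb") as handle:  # binary mode: read() returns bytes
        raw = handle.read()
finally:
    os.remove(path)

text = raw.decode("utf-8")  # explicit decode: bytes -> characters

try:
    raw.decode("ascii")  # wrong guess: the data is not pure ASCII
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print(type(raw).__name__, len(raw))    # bytes 6
print(type(text).__name__, len(text))  # str 5
```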

Here's an illustration of a clean split between character strings and
byte strings:
"abc"

Now the Javascript model, which also seems to work, is a little bit
different. There is only one string type, but each character can take
values up to 2^16 (more on this number later).

http://www.mozilla.org/js/language/es4/formal/notation.html#string

If you read binary data in JavaScript, the implementations seem to just
map each byte to a corresponding Unicode code point (another way of
saying that is that they default to the latin-1 encoding). This should
work in most browsers:

<HTML>
<SCRIPT language = "Javascript">
// Fetch a binary file synchronously and show the raw response as a string
datafile = "http://www.python.org/pics/pythonHi.gif"

httpconn = new XMLHttpRequest();
httpconn.open("GET",datafile,false);
httpconn.send(null);
alert(httpconn.responseText);
</SCRIPT>
<BODY></BODY>
</HTML>

(ignore the reference to "Xml" above. For some reason Microsoft decided
to conflate XML and HTTP in their APIs. In this case we are doing
nothing with XML whatsoever)
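
The "default to latin-1" behaviour described above can be sketched in Python 3, used here only as a neutral way to show the mapping. Decoding as latin-1 turns every byte value into the code point with the same number, so no byte is ever rejected and the round trip is lossless.

```python
all_bytes = bytes(range(256))

# latin-1 maps byte value N to code point N for every N in 0..255.
as_text = all_bytes.decode("latin-1")
code_points = [ord(ch) for ch in as_text]

# Encoding back with the same codec recovers the original bytes exactly.
round_trip = as_text.encode("latin-1")
```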

I was going to write that Javascript also has a function that allows you
to explicitly decode. That would be logical. You could imagine that you
could do as many levels of decoding as you like:

objXml.decode("utf-8").decode("latin-1").decode("utf-8").decode("koi8-r")

This model is a little bit "simpler" in that there is only one string
object and the programmer just keeps straight in their head whether it
has been decoded already (or how many times it has been decoded, if for
some strange reason it were double or triple-encoded).
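
A sketch of the double-encoding case in Python 3. Because Python 3 keeps bytes and characters as separate types, each layer shows up as an explicit encode or decode step instead of something the programmer tracks in their head. The sample word is arbitrary.

```python
original = "caf\u00e9"

# Encode once, then mistakenly treat the bytes as latin-1 text and encode again.
once = original.encode("utf-8")
twice = once.decode("latin-1").encode("utf-8")

# Undo the layers in reverse order.
step1 = twice.decode("utf-8")    # still garbled: one layer remains
step2 = step1.encode("latin-1")  # back to the singly encoded bytes
recovered = step2.decode("utf-8")
```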

But it turns out that I can't find a Javascript Unicode decoding
function through Google. More evidence that Javascript is brain-dead I
suppose.

Anyhow, that describes two models: one where byte (0-255) and character
(0-2**16 or 2**32) strings are strictly separated and one where byte
strings are just treated as a subset of character strings. What you
absolutely do not want is to leave character handling totally in the
domain of the application programmer as C and early versions of
Python did.

On to character ranges. Strictly speaking, the Unicode code space tops out
at U+10FFFF, a little over 2^20 code points. You'll notice that this is well
beyond 2^16, which is a much more convenient (and space efficient) number.
There are three basic ways of dealing with this situation.

1. You can use two bytes per character and simply ignore the issue.
"Those characters are not available. Deal with it!" That isn't as crazy
as it sounds because the high characters are not in common use yet.

2. You could directly use 3 (or more likely 4) bytes per character.
"Memory is cheap. Deal with it!"

3. You could do tricks where you sort of page switch from two-byte to
four-byte mode using "surrogates".[1] This is actually not that far from
"1" if you leave the manipulation of the surrogates entirely in
application code. I believe this is the strategy used by Java[2] and
Javascript.[3]
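
The surrogate arithmetic behind option 3 can be sketched in a few lines of Python 3. The helper name to_surrogate_pair is made up for this example; the result is checked against Python's own UTF-16 codec, and the byte counts contrast options 2 and 3.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into two 16-bit code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000    # 20 bits of payload
    high = 0xD800 + (offset >> 10)   # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits
    return high, low


clef = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the 16-bit range
high, low = to_surrogate_pair(ord(clef))

utf16 = clef.encode("utf-16-be")  # option 3: two 16-bit units, 4 bytes
utf32 = clef.encode("utf-32-be")  # option 2: always 4 bytes per character

print(hex(high), hex(low))  # 0xd834 0xdd1e
print(len("A".encode("utf-16-be")), len(utf16), len(utf32))  # 2 4 4
```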

[1] http://www.i18nguy.com/surrogates.html

[2] "The methods that only accept a char value cannot support
supplementary characters. They treat char values from the surrogate
ranges as undefined characters."

http://java.sun.com/j2se/1.5.0/docs/api/java/lang/Character.html

[3] "Characters are single Unicode 16-bit code points. We write them
enclosed in single quotes ‘ and ’. There are exactly 65536 characters:
‘«u0000»’, ‘«u0001»’, ...,‘A’, ‘B’, ‘C’, ..., ‘«uFFFF»’ (see also
notation for non-ASCII characters). Unicode surrogates are considered to
be pairs of characters for the purpose of this specification."

http://www.mozilla.org/js/language/js20-2000-07/formal/notation.html

From a correctness point of view, 4-byte chars is obviously
Unicode-correct. From a performance point of view, most language
designers have chosen to sweep the issue under the rug and hope
that 16 bits per char continue to be enough "most of the time" and that
those who care about more will explicitly write their own code to deal
with high characters.

Paul Prescod
 
Roger Binns

Mark said:
Is there any dynamic language that already does this right for us to steal
from or is this new territory? I know for sure that I don't want to steal
Java's streams. I remember hating them with a passion.

Java's bytes being signed also caused no end of annoyance for me.
In our protocol marshalling code (thankfully mostly auto generated)
there was lots of code just to turn the signed bytes back into
unsigned bytes.
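
The kind of fix-up code being described looks roughly like this, sketched in Python 3 with the struct module standing in for Java's signed byte type. The sample bytes are arbitrary.

```python
import struct

raw = b"\xff\x80\x7f"

# Interpreting each byte as signed gives negative values for 0x80..0xFF.
signed = struct.unpack("3b", raw)

# Masking with 0xFF maps each value back into the 0..255 range.
unsigned = tuple(value & 0xFF for value in signed)

print(signed)    # (-1, -128, 127)
print(unsigned)  # (255, 128, 127)
```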

(I also *very* strongly agree with Paul.)

Roger
 
Roger Binns

Paul said:
Choosing an internal encoding is actually pretty tricky because there
are space versus time tradeoffs and you need to make some guesses about
how often particular characters are likely to be useful to your users.

There are two ways to deal with it. One is to convert to an internal
"UNICODE" format such as utf8, or using arrays of 16 or 32 bit integers.

You also have to decide if you are going to normalise the string.
For example you can have characters followed by a combining accent.
On display they are one character, but often there is a codepoint
for the single character combined with the accent, so you could
reduce the two down to one. There are also other characters such as
those that specify the direction of the following text which are
considered noise in some contexts.
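
A small illustration of the combining-accent case, using the unicodedata module from the Python 3 standard library. It shows why the decision matters: the two spellings display the same but differ in length and compare unequal until normalised.

```python
import unicodedata

decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
composed = "\u00e9"     # LATIN SMALL LETTER E WITH ACUTE

nfc = unicodedata.normalize("NFC", decomposed)  # compose where possible
nfd = unicodedata.normalize("NFD", composed)    # decompose fully

print(len(decomposed), len(composed))  # 2 1
print(decomposed == composed)          # False
print(nfc == composed, nfd == decomposed)  # True True
```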

The other way of dealing with things is to keep the text as it
was given, and not do any conversion or normalisation on it.
This is generally more future proof, but does burden other code
with having to deal with conversion issues (for example NT/2K/XP
only uses 16 bits for codepoints which is less than the full
range now).

If you want to score extra bonus points, you should also store
the locale of the string along with the encoding. I won't elaborate
here why.

Another design to consider is to allow tags that cover character
ranges and then assign properties to those tags (such as locale,
encoding), but importantly allow multiple tags per character.
(If you have used the Tk text widget you'll understand what I
am thinking of).

Paul said:
By the way, if you have the courage to distance yourself from every
other language under the sun, I would propose that you throw an
exception on unknown escape sequences.

Perl did that first :) It didn't distinguish between arrays of
bytes and arrays of characters so you easily end up with humongous
amounts of warnings about invalid UTF8 stuff when dealing with
bytes. (I have no idea what goes on under the hood - you just
see it when installing Perl stuff like SpamAssassin).

In addition to all the excellent notes from Paul, I would recommend
you consult with someone familiar with the locale and encoding
issues for Hebrew, Arabic and various oriental languages such
as Japanese, Korean, Vietnamese and Tibetan. Bonus points for
Tamil :)

Just to make life even more interesting, you should realise that
there is more than one system of digits. You can see how Java
handles the issue here:

http://java.sun.com/j2se/1.4.2/docs/api/java/awt/font/NumericShaper.html

Since you are doing new language design, I also think there would
be great value in forcing things so that you do not have
strings embedded in the program, and they have to come from
external resource files. This also gives you the opportunity to
deal with string interpolation issues and get them right.
(It also means that "Hello, World" remains one line, but also
requires an external file with the message, or some other
mechanism).

The other Java i18n pages make for interesting reading:

http://java.sun.com/j2se/corejava/intl/index.jsp

Roger
 
Greg Ewing

Paul said:
The
result of reading a file is a binary data string. The result of parsing
an XML file is a character string.

What if the file you're reading is a text file?
 
Mark Hahn

Roger said:
In addition to all the excellent notes from Paul, I would recommend
you consult with someone familiar with the locale and encoding
issues for Hebrew, Arabic and various oriental languages such
as Japanese, Korean, Vietnamese and Tibetan. Bonus points for
Tamil :)

I sure hope you are kidding. If not you are scaring me away from doing
anything.

I want to do the best thing. I want someone who knows what's best and that
I can trust to help out and tell me what to do. I want to develop Prothon,
not become an expert on glyphs and international character coding.

Roger said:
Since you are doing new language design, I also think there would
be great value in forcing things so that you do not have
strings embedded in the program, and they have to come from
external resource files. This also gives you the opportunity to
deal with string interpolation issues and get them right.
(It also means that "Hello, World" remains one line, but also
requires an external file with the message, or some other
mechanism).

Do you mean for the interpreter or some enabling tool for the Prothon
programs? Doing this for the interpreter is on the to-do list.

Thanks for all the tips. I'd like for you and Paul to help out with Prothon
in this area if you could. At least let me bounce the plans off you two as
I go.
 
Roger Binns

Greg said:
What if the file you're reading is a text file?

On Windows, Linux and Mac (and most other operating systems)
it is stored as a sequence of bytes. To convert the bytes
to a sequence of characters (ie text) you have to know
what the encoding was that produced the sequence of
bytes.

This can be non-trivial, but pretending that the issue
doesn't exist leads you down the path to the issues present
today in Python and several other languages.

Roger
 
Fredrik Lundh

Greg said:
What if the file you're reading is a text file?

if you don't know what encoding a file uses, "text files" contain chunks of
binary data separated by newlines (and/or carriage return characters).

http://www.python.org/peps/pep-0320.html mentions a textfile(filename,
mode, encoding) constructor that hides the ugly "U" flag, and sets up proper
codecs, if necessary.

since you don't always know the encoding until you've looked inside the
file (cf. emacs encoding directive, Python, XML, etc), it would also be nice
to have a "setencoding" method (or a writable "encoding" attribute). but
adding that to existing file-like objects may turn out to be a lot of work;
easy-to-find-and-use stream wrappers are probably a better idea.
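
a rough sketch of such a wrapper in Python 3, using io.TextIOWrapper from the standard library. the open_text helper, its regular expression and the utf-8 default are assumptions made for this example, not an existing API; the idea is to peek at the first line in binary mode, look for an emacs/Python style coding directive, and only then wrap the stream with a decoder.

```python
import io
import re

# Matches directives such as "-*- coding: latin-1 -*-" in the first line.
CODING_RE = re.compile(rb"coding[:=]\s*([-\w.]+)")


def open_text(binary_stream, default="utf-8"):
    """Wrap a seekable binary stream in a decoder chosen from its first line."""
    first_line = binary_stream.readline()
    match = CODING_RE.search(first_line)
    encoding = match.group(1).decode("ascii") if match else default
    binary_stream.seek(0)  # rewind so the wrapper sees the whole file
    return io.TextIOWrapper(binary_stream, encoding=encoding)


raw = "# -*- coding: latin-1 -*-\ncaf\u00e9\n".encode("latin-1")
stream = open_text(io.BytesIO(raw))
lines = stream.read().splitlines()
```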

</F>
 
gabriele renzi

Mark said:
I sure hope you are kidding. If not you are scaring me away from doing
anything.

Sorry if it's kind of OT, but a huge thread about this appeared in
comp.lang.ruby some time ago.
Quoting a little for you:

"""
|As far as I can see, currently 20 bits are sufficient :)
|http://www.unicode.org/charts/
|
|And anything after "Special" looks really quite special to me. At least
|western languages as well as Kanji, Hiragana and Katakana are supported.
|IMHO pragmatically 16 bits are good enough.

I assume you're saying that there's no more than 65536 characters on
earth in daily use, even including Asian ideograms (Kanjis).

You are right, if we can live in the idealistic world.

The problems are:

* Japan, China, Korea and Taiwan have characters from same origin,
but with different glyph (appearance). Due to Han unification,
Unicode assigns same character code number to those characters.
We used to use encodings to switch country information (script) in
internationalized applications. Unicode does not allow this
approach. We need to implement another layer to switch script.

* Due to historical reason and unification, some characters do not
round trip through conversion from/to Unicode. Sometimes we loose
information by implicit Unicode conversion.

* Asian people have used multibyte encoding (EUC-JP for example) for
long time. We have gigabytes of legacy encoding files. The cost
of code conversion is not negligible. We also have to care about
the round trip problem.

* There are some huge set of characters little known to western
world. For example, the TRON code contains 170,000 characters.
They are important to researchers, novelists, and people who care
characters.
"""
 
Paul Prescod

Greg said:
What if the file you're reading is a text file?

In the most rigorously consistent model, you would decode the data, just
as if you were reading a file that happened to be constructed of a list
of integers.

But of course there are a variety of shortcuts you could implement, like
a "text file" object or a "read as ASCII" flag for a file object or ...
practicality beats purity.

Paul Prescod
 
Paul Prescod

You make some good points, Roger. But bear in mind that we're trying to
design 1.0 of a language and that the real language designer has no
Unicode experience...

Roger said:
There are two ways to deal with it. One is to convert to an internal
"UNICODE" format such as utf8, or using arrays of 16 or 32 bit integers.

Agree: space versus time.

Roger said:
You also have to decide if you are going to normalise the string.
For example you can have characters followed by a combining accent.
On display they are one character, but often there is a codepoint
for the single character combined with the accent, so you could
reduce the two down to one. There are also other characters such as
those that specify the direction of the following text which are
considered noise in some contexts.

First, it is probably too much work to normalize for a 1.0 language
designer (even Python doesn't). Second, it is quite possibly the wrong
thing to do at a programming language level. Just as sometimes you want
to work with the raw bits of a file, sometimes you will want to work
with the un-normalized representation of a string.

Roger said:
The other way of dealing with things is to keep the text as it
was given, and not do any conversion or normalisation on it.
This is generally more future proof, but does burden other code
with having to deal with conversion issues (for example NT/2K/XP
only uses 16 bits for codepoints which is less than the full
range now).

Surrogate pairs are a little different than accent
normalization...surrogate pairs are just a space-saving hack to make up
for the difference between 16 bit implementations and the 20 bit space
Unicode uses.

Roger said:
Another design to consider is to allow tags that cover character
ranges and then assign properties to those tags (such as locale,
encoding), but importantly allow multiple tags per character.
(If you have used the Tk text widget you'll understand what I
am thinking of).

I'd say that's also beyond 1.0!

Roger said:
Perl did that first :)

Sorry for the confusion. At this point in the discussion I was not
talking about Unicode issues any more. I was just talking about plain
old escape sequences:
abc\q\y\z

Roger said:
In addition to all the excellent notes from Paul, I would recommend
you consult with someone familiar with the locale and encoding
issues for Hebrew, Arabic and various oriental languages such
as Japanese, Korean, Vietnamese and Tibetan. Bonus points for
Tamil :)

That's probably a little daunting for 1.0. The question is what is the
minimum possible he can get away with in the next few months.

Roger said:
...
Since you are doing new language design, I also think there would
be great value in forcing things so that you do not have
strings embedded in the program, and they have to come from
external resource files. This also gives you the opportunity to
deal with string interpolation issues and get them right.
(It also means that "Hello, World" remains one line, but also
requires an external file with the message, or some other
mechanism).

Seems a little over-strict for me. If I'm writing an HTML handling
program I have to keep the HTML tags in a separate file?

I think it is a good idea to have a built-in language mechanism for
localization but it is another thing I'd put off beyond 1.0.

Paul Prescod
 
Roger Binns

Mark said:
I sure hope you are kidding. If not you are scaring me away from doing
anything.

You can get the design done right early, and worry about the implementation
later.

[ external string resource files ]
Do you mean for the interpreter or some enabling tool for the Prothon
programs? Doing this for the interpreter is on the to-do list.

I mean for user programs and making it a fundamental part of the
language (ie you must use it).

The current model of print statements compatible with 1950's teletypes
is old and busted.

How about inspiration from 1960's era mainframes (or the more recent AS/400)?

The AS/400 is actually an excellent example of an alternate approach.
Your programs have a separate resource that defines "screens" (think
of a full display on a terminal). The resource is rich in that it
can be longer than a screenful (ie you need page down), and it
defines both output fields and input fields, including validation
information and type information for the fields.

When the web arrived, they could instantly web enable the applications
with zero changes to the application code. Similarly you could build
a gui for the apps automatically as well.

Taking a step back to what that could inspire in Prothon, how about making
I/O richer. Instead of teletype style print, make the resource files
richer.

I should be able to run someone's program and have it output
to a teletype, a GUI, HTML, XML or whatever else becomes popular in
the next ten years, without changing a line of code of the application.
(Bonus points for encapsulating the command line arguments that way :)

Now you could do all this through libraries (ie as an optional part
of the libraries rather than part of the language design) but the
moment you do that, a lot of code won't use it, and a scheme like the
above is only useful if all code uses it.

Additionally the current schemes available in languages like Python
are a royal pain in the butt. The language makes it hard to do.

Try the following steps:

- Start with a program that says "Hello, World"
- Then change it to say it in English and French
- Then make it take a command line argument that accepts a
name, and have it say Hello, Name (or Name, Hello depending
on the locale)
- Ok, now make it output HTML, XML, teletype and a gui

If the language/library design is done well, each of those should
add one line of code. Every language I am aware of at the moment
makes it take way way more than that, which leads people not to
bother, which means that code randomly doesn't work with other
libraries, and everything is still stuck at the lowest common
denominator (printing to a teletype).
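
As a rough sketch of the direction those steps point in, here is a tiny message catalog with named placeholders and pluggable output formats, in Python 3. Everything in it is hypothetical: the catalog contents, the "x-name-first" pseudo-locale, and the greet and renderer names are invented for illustration, and a real design would load the catalog from external resource files.

```python
import html

# Hypothetical message catalog; in practice one external file per locale.
CATALOG = {
    "en": {"greeting": "Hello, {name}"},
    "fr": {"greeting": "Bonjour, {name}"},
    # Made-up locale showing that word order belongs to the resource.
    "x-name-first": {"greeting": "{name}, hello"},
}


def render_text(message):
    return message


def render_html(message):
    return "<p>" + html.escape(message) + "</p>"


# Adding an output format means adding one entry here, not editing callers.
RENDERERS = {"text": render_text, "html": render_html}


def greet(name, locale="en", output="text"):
    template = CATALOG[locale]["greeting"]
    return RENDERERS[output](template.format(name=name))


print(greet("World"))
print(greet("World", locale="x-name-first"))
print(greet("A & B", locale="fr", output="html"))
```

Named placeholders let each locale choose its own word order, and the application code never mentions the output format directly.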

Roger
 
Roger Binns

Paul said:
Agree: space versus time.

It is somewhat more subtle than that. The choice is between keeping all
strings as you get them (unnormalised) and consuming CPU whenever you have
to deal with them, versus consuming CPU up front, when a string is first
presented, to normalise it and munge it into a universal storage encoding.

Paul said:
First, it is probably too much work to normalize for a 1.0 language
designer (even Python doesn't). Second, it is quite possibly the wrong
thing to do at a programming language level. Just as sometimes you want
to work with the raw bits of a file, sometimes you will want to work
with the un-normalized representation of a string.

The language implementor does however need to take a stance.
If they decide that normalisation will never happen, then all
other code may have to deal with normalisation issues (for
example what is len on an unnormalized string?)

Conversely they could decide to always normalize which means that
other code doesn't have to worry about it.

The worst thing to do is not make any decision, since that
is equivalent to making both decisions and code will always
have to worry about whether it is or isn't normalised.

Paul said:
I'd say that's also beyond 1.0!

It could be implemented beyond 1.0, but should be designed before
that. We have already seen the email pointing out some of the
issues with the Han unification and how you really need to know
the character origin to render it correctly even though the codepoint
is the same.

Paul said:
That's probably a little daunting for 1.0. The question is what is the
minimum possible he can get away with in the next few months.

For design you need to get it right at the beginning. For implementation
you can wait a while. The moment you start taking short cuts, they
turn into arbitrary design decisions and you tie yourself to things
you wouldn't want to be tied to.

Paul said:
Seems a little over-strict for me. If I'm writing an HTML handling
program I have to keep the HTML tags in a separate file?

Why not? See my earlier response to Mark for some ideas on how
to handle that.

Roger
 
Mark Hahn

Paul et al: I have a number of wild ideas and questions about text and
binary strings and also a few things to discuss about the long integers you
brought up, but the Python list is not the proper place to drag these
discussions out.

Is there any chance I can get you (and hopefully others participating here)
to join the prothon-user mailing list for a week or two to discuss these
issues? The traffic is only one tenth the traffic here on c.l.p. so it
won't be much burden. Our discussions are interesting. (We have no pesky
users yet :)

You can experience the warm feeling that can only be achieved by helping
steer a new language away from mediocrity and towards greatness <big grin>.
 
Paul Prescod

Mark said:
Paul et al: I have a number of wild ideas and questions about text and
binary strings and also a few things to discuss about the long integers you
brought up, but the Python list is not the proper place to drag these
discussions out.

Is there any chance I can get you (and hopefully others participating here)
to join the prothon-user mailing list for a week or two to discuss these
issues? The traffic is only one tenth the traffic here on c.l.p. so it
won't be much burden. Our discussions are interesting. (We have no pesky
users yet :)

I'll join, but I don't know when I'll find time to contribute...long
weekend coming up etc.

Paul Prescod
 
