RE Module Performance

Fábio Santos · Jul 15, 2013

Does this mean he passes the Turing test?

Chris Angelico · Jul 15, 2013

Does this mean he passes the Turing test?

Yes, absolutely. The original Turing test was defined in terms of five
minutes of analysis, and Dihedral and jmf have clearly been
indistinguishably human across that approximate period.

ChrisA

Tim Delaney · Jul 16, 2013

Yes, absolutely. The original Turing test was defined in terms of five
minutes of analysis, and Dihedral and jmf have clearly been
indistinguishably human across that approximate period.

The big difference between them is that the jmfbot does not appear to
evolve its routines in response to external sources - it seems to be stuck
in a closed feedback loop.

Tim Delaney

Devyn Collier Johnson · Jul 16, 2013

Am 07/12/2013 07:16 PM, schrieb MRAB:

FYI, you're better off going to http://pypi.python.org/pypi/regex
because that will take you to the latest version.

Thank you everyone for the suggestions. I have not tried them yet.

Devyn Collier Johnson

88888 Dihedral · Jul 18, 2013

Devyn Collier Johnsonæ–¼ 2013å¹´7æœˆ16æ—¥æ˜ŸæœŸäºŒUTC+8ä¸‹åˆ6æ™‚30åˆ†33ç§’å¯«é“ï¼š

Am 07/12/2013 07:16 PM, schrieb MRAB:

Thank you everyone for the suggestions. I have not tried them yet.

Devyn Collier Johnson

I was thinking to decompose RE patterns into string matching
formats of various strings in some formats.

Anyway that involves some compiler techniques.

wxjmfauth · Jul 24, 2013

Le samedi 13 juillet 2013 01:13:47 UTC+2, Michael Torrie a écrit :

Variable number of bytes is a problematic way to saying it. UTF-8 is a

variable-number-of-bytes encoding scheme where each character can be 1,

2, 4, or more bytes, depending on the unicode character. As you can

imagine this sort of encoding scheme would be very slow to do slicing

with (looking up a character at a certain position). Python uses

fixed-width encoding schemes, so they preserve the O(n) lookup speeds,

but python will use 1, 2, or 4 bytes per every character in the string,

depending on what is needed. Just in case the OP might have

misunderstood what you are saying.

jmf sees the case where a string is promoted from one width to another,

and thinks that the brief slowdown in string operations to accomplish

this is a problem. In reality I have never seen anyone use the types of

string operations his pseudo benchmarks use, and in general Python 3's

string behavior is pretty fast. And apparently much more correct than

if jmf's ideas of unicode were implemented.

------

Sorry, you are not understanding Unicode. What is a Unicode
Transformation Format (UTF), what is the goal of a UTF and
why it is important for an implementation to work with a UTF.

Short example. Writing an editor with something like the
FSR is simply impossible (properly).

jmf

Chris Angelico · Jul 24, 2013

Short example. Writing an editor with something like the
FSR is simply impossible (properly).

jmf, have you ever written an editor with *any* string representation?
Are you speaking from any level of experience at all?

ChrisA

Chris Angelico · Jul 24, 2013

I've screwed up plenty of times in python, but can write code like a pro
when I'm feeling better(on SSI and medicaid). An editor can be built simply,
but it's preference that makes the difference. Some might have used tkinter,
gtk. wxpython or other methods for the task.

I think the main issue in responding is your library preference, or widget
set preference. These can make you right with some in your response, or
wrong with others that have a preferable gui library that coincides with
one's personal cognitive structure that makes t

jmf's point is more about writing the editor widget (Scintilla, as
opposed to SciTE), which most people will never bother to do. I've
written several text editors, always by embedding someone else's
widget, and therefore not concerning myself with its internal string
representation. Frankly, Python's strings are a *terrible* internal
representation for an editor widget - not because of PEP 393, but
simply because they are immutable, and every keypress would result in
a rebuilding of the string. On the flip side, I could quite plausibly
imagine using a list of strings; whenever text gets inserted, the
string gets split at that point, and a new string created for the
insert (which also means that an Undo operation simply removes one
entire string). In this usage, the FSR is beneficial, as it's possible
to have different strings at different widths.

But mainly, I'm just wondering how many people here have any basis
from which to argue the point he's trying to make. I doubt most of us
have (a) implemented an editor widget, or (b) tested multiple
different internal representations to learn the true pros and cons of
each. And even if any of us had, that still wouldn't have any bearing
on PEP 393, which is about applications, not editor widgets. As stated
above, Python strings before AND after PEP 393 are poor choices for an
editor, ergo arguing from that standpoint is pretty useless. Not that
that bothers jmf...

ChrisA

Michael Torrie · Jul 24, 2013

Sorry, you are not understanding Unicode. What is a Unicode
Transformation Format (UTF), what is the goal of a UTF and
why it is important for an implementation to work with a UTF.

Really? Enlighten me.

Personally, I would never use UTF as a representation *in memory* for a
unicode string if it were up to me. Why? Because UTF characters are
not uniform in byte width so accessing positions within the string is
terribly slow and has to always be done by starting at the beginning of
the string. That's at minimum O(n) compared to FSR's O(1). Surely you
understand this. Do you dispute this fact?

UTF is a great choice for interchange, though, and indeed that's what it
was designed for.

Are you calling for UTF to be adopted as the internal, in-memory
representation of unicode? Or would you simply settle for UCS-4?
Please be clear here. What are you saying?

Short example. Writing an editor with something like the
FSR is simply impossible (properly).

How? FSR is just an implementation detail. It could be UCS-4 and it
would also work.

Chris Angelico · Jul 24, 2013

Really? Enlighten me.

Personally, I would never use UTF as a representation *in memory* for a
unicode string if it were up to me. Why? Because UTF characters are
not uniform in byte width so accessing positions within the string is
terribly slow and has to always be done by starting at the beginning of
the string. That's at minimum O(n) compared to FSR's O(1). Surely you
understand this. Do you dispute this fact?

Take care here; UTF is a general term for Unicode Translation Formats,
of which one (UTF-32) is fixed-width. Every other UTF-n is variable
width, though, so your point still stands. UTF-32 is the basis for
Python's FSR.

ChrisA

Michael Torrie · Jul 24, 2013

Frankly, Python's strings are a *terrible* internal representation
for an editor widget - not because of PEP 393, but simply because
they are immutable, and every keypress would result in a rebuilding
of the string. On the flip side, I could quite plausibly imagine
using a list of strings; whenever text gets inserted, the string gets
split at that point, and a new string created for the insert (which
also means that an Undo operation simply removes one entire string).
In this usage, the FSR is beneficial, as it's possible to have
different strings at different widths.

Very good point. Seems like this is exactly what is tripping up jmf in
general. His pseudo benchmarks are bogus for this exact reason. No one
uses python strings in this fashion. Editors certainly would not. But
then again his argument in the past does not mention editors. But it
makes me wonder if jmf is using python strings appropriately, or even
realizes they are immutable.

But mainly, I'm just wondering how many people here have any basis
from which to argue the point he's trying to make. I doubt most of
us have (a) implemented an editor widget, or (b) tested multiple
different internal representations to learn the true pros and cons
of each.

Maybe, but simply thinking logically, FSR and UCS-4 are equivalent in
pros and cons, and the cons of using UCS-2 (the old narrow builds) are
well known. UCS-2 simply cannot represent all of unicode correctly.
This is in the PEP of course.

His most recent argument that Python should use UTF as a representation
is very strange to be honest. The cons of UTF are apparent and widely
known. The main con is that UTF strings are O(n) for indexing a
position within the string.

Terry Reedy · Jul 24, 2013

I used exactly this, a list of strings, for a Python-coded text-only
mock editor to replace the tk Text widget in idle tests. It works fine
for the purpose. For small test texts, the inefficiency of immutable
strings is not relevant.

Tk apparently uses a C-coded btree rather than a Python list. All
details are hidden, unless one finds and reads the source ;-), but but
it uses C arrays rather than Python strings.

For my purpose, the mock Text works the same in 2.7 and 3.3+.

Maybe, but simply thinking logically, FSR and UCS-4 are equivalent in
pros and cons,

They both have the pro that indexing is direct *and correct*. The cons
are different.

and the cons of using UCS-2 (the old narrow builds) are
well known. UCS-2 simply cannot represent all of unicode correctly.

Python's narrow builds, at least for several releases, were in between
USC-2 and UTF-16 in that they used surrogates to represent all unicodes
but did not correct indexing for the presence of astral chars. This is a
nuisance for those who do use astral chars, such as emotes and CJK name
chars, on an everyday basis.

Stefan Behnel · Jul 24, 2013

FÃ¡bio Santos, 16.07.2013 00:54:

Does this mean he passes the Turing test?

I'd say that "it" appears more correct. Or is there any indication of a
specific bot gender? (I sure might have missed it...)

Note that being of a specific gender is definitely not required to pass the
Turing test. I'm not even sure pretending to have one is required.

Stefan

Chris Angelico · Jul 24, 2013

I used exactly this, a list of strings, for a Python-coded text-only mock
editor to replace the tk Text widget in idle tests. It works fine for the
purpose. For small test texts, the inefficiency of immutable strings is not
relevant.

Tk apparently uses a C-coded btree rather than a Python list. All details
are hidden, unless one finds and reads the source ;-), but but it uses C
arrays rather than Python strings.

For my purpose, the mock Text works the same in 2.7 and 3.3+.

Thanks for that report! And yes, it's going to behave exactly the same
way, because its underlying structure is an ordered list of ordered
lists of Unicode codepoints, ergo 3.3/PEP 393 is merely a question of
performance. But if you put your code onto a narrow build, you'll have
issues as seen below.

They both have the pro that indexing is direct *and correct*. The cons are
different.

They're close enough, though. It's simply a performance tradeoff - use
the memory all the time, or take a bit of overhead to give yourself
the option of using less memory. The difference is negligible compared
to...

Python's narrow builds, at least for several releases, were in between USC-2
and UTF-16 in that they used surrogates to represent all unicodes but did
not correct indexing for the presence of astral chars. This is a nuisance
for those who do use astral chars, such as emotes and CJK name chars, on an
everyday basis.

.... this. If nobody had ever thought of doing a multi-format string
representation, I could well imagine the Python core devs debating
whether the cost of UTF-32 strings is worth the correctness and
consistency improvements... and most likely concluding that narrow
builds get abolished. And if any other language (eg ECMAScript)
decides to move from UTF-16 to UTF-32, I would wholeheartedly support
the move, even if it broke code to do so. To my mind, exposing UTF-16
surrogates to the application is a bug to be fixed, not a feature to
be maintained. But since we can get the best of both worlds with only
a small amount of overhead, I really don't see why anyone should be
objecting.

ChrisA

Terry Reedy · Jul 24, 2013

Thanks for that report! And yes, it's going to behave exactly the same
way, because its underlying structure is an ordered list of ordered
lists of Unicode codepoints, ergo 3.3/PEP 393 is merely a question of
performance. But if you put your code onto a narrow build, you'll have
issues as seen below.

I carefully said 'For my purpose', which is to replace the tk Text
widget. Up to 8.5, Tk's text is something like Python's narrow-build
unicode.

If put astral chars into the toy editor, then yes, it would not work on
narrow builds, but would on 3.3+.

...

If nobody had ever thought of doing a multi-format string
representation, I could well imagine the Python core devs debating
whether the cost of UTF-32 strings is worth the correctness and
consistency improvements... and most likely concluding that narrow
builds get abolished. And if any other language (eg ECMAScript)
decides to move from UTF-16 to UTF-32, I would wholeheartedly support
the move, even if it broke code to do so.

Making a UTF-16 implementation correct requires converting abstract
'character' array indexes to concrete double byte array indexes. The
simple O(n) method of scanning the string from the beginning for each
index operation is too slow. When PEP393 was being discussed, I devised
a much faster way to do the conversion.

The key idea is to add an auxiliary array of the abstract indexes of the
astral chars in the abstract array. This is easily created when the
string is created and can be done afterward with one linear scan (which
is how I experimented with Python code). The length of that array is the
number of surrogate pairs in the concrete 16-bit codepoint array.
Subtracting that number from the length of the concrete array gives the
length of the abstract array.

Given a target index of a character in the abstract array, use the
auxiliary array to determine k, the number of astral characters that
precede the target character. That can be done with either a O(k) linear
scan or O(log k) binary search. Add 2 * k to the abstract index to get
the corresponding index in the concrete array. When slicing a string
with i0 and i1, slice the auxiliary array with k0 and k1 and adjusting
the contained indexes downward to get the corresponding auxiliary array.

To my mind, exposing UTF-16 surrogates to the application is a bug
to be fixed, not a feature to be maintained.

It is definitely not a feature, but a proper UTF-16 implementation would
not expose them except to codecs, just as with the PEP 393
implementation. (In both cases, I am excluding the sys size function as
'exposing to the application'.)

But since we can get the best of both worlds with only
a small amount of overhead, I really don't see why anyone should be
objecting.

I presume you are referring to the PEP 393 1-2-4 byte implementation.
Given how well it has been optimized, I think it was the right choice
for Python. But a language that now uses USC2 or defective UTF-16 on all
platforms might find the auxiliary array an easier fix.

Chris Angelico · Jul 24, 2013

It is definitely not a feature, but a proper UTF-16 implementation would not
expose them except to codecs, just as with the PEP 393 implementation. (In
both cases, I am excluding the sys size function as 'exposing to the
application'.)

I presume you are referring to the PEP 393 1-2-4 byte implementation. Given
how well it has been optimized, I think it was the right choice for Python.
But a language that now uses USC2 or defective UTF-16 on all platforms might
find the auxiliary array an easier fix.

I'm referring here to objections like jmf's, and also to threads like this:

http://mozilla.6506.n7.nabble.com/Flexible-String-Representation-full-Unicode-for-ES6-td267585.html

According to the ECMAScript people, UTF-16 and exposing surrogates to
the application is a critical feature to be maintained. I disagree.
But it's not my language, so I'm stuck with it. (I ended up writing a
little wrapper function in C that detects unpaired surrogates, but
that still doesn't deal with the possibility that character indexing
can create a new character that was never there to start with.)

ChrisA

Michael Torrie · Jul 24, 2013

I'm referring here to objections like jmf's, and also to threads like this:

http://mozilla.6506.n7.nabble.com/Flexible-String-Representation-full-Unicode-for-ES6-td267585.html

According to the ECMAScript people, UTF-16 and exposing surrogates to
the application is a critical feature to be maintained. I disagree.
But it's not my language, so I'm stuck with it. (I ended up writing a
little wrapper function in C that detects unpaired surrogates, but
that still doesn't deal with the possibility that character indexing
can create a new character that was never there to start with.)

This is starting to drift off topic here now, but after reading your
comments on that post, and others objections, I don't fully understand
why making strings simply "unicode" in javascript breaks compatibility
with older scripts. What operations are performed on strings that
making unicode an abstract type would break? Is it just in the input
and output of text that must be decoded and encode? Why should a script
care about the internal representation of unicode strings? Is it
because the incorrect behavior of UTF-16 and the exposed surrogates (and
subsequent incorrect indexing) are actually depended on by some scripts?

Chris Angelico · Jul 25, 2013

I don't fully understand
why making strings simply "unicode" in javascript breaks compatibility
with older scripts. What operations are performed on strings that
making unicode an abstract type would break?

Imagine this in JavaScript and Python (apart from the fact that JS
doesn't do backslash escapes past 0x10000):

a = "asdf\U00012345qwer";
b = a[[..10];

What will this do? It depends on whether UTF-16 is used or not.

ChrisA

Dennis Lee Bieber · Jul 25, 2013

In absence of specific information to the contrary, I'll usually refer
to computers and programs by masculine pronouns; no particular reason
for it, just the same sort of convention that has most ships and boats
being "she".

Heh... The Grey Wolf (1999 Jeep Cherokee) is "female", but the
Dungbeetle (Aprilia Scarabeo 500) tends to be an "it" <G>

Serhiy Storchaka · Jul 25, 2013

24.07.13 21:15, Chris Angelico Ð½Ð°Ð¿Ð¸ÑÐ°Ð²(Ð»Ð°):

To my mind, exposing UTF-16
surrogates to the application is a bug to be fixed, not a feature to
be maintained.

Python 3 uses code points from U+DC80 to U+DCFF (which are in surrogates
area) to represent undecodable bytes with surrogateescape error handler.

import syntax	0	Jul 29, 2013
Cross-Platform Python3 Equivalent to notify-send	1	Jul 27, 2013
Aloha! Check out the Betabots!	0	Oct 1, 2013
Critic my module	13	Jul 25, 2013
PEP8 79 char max	3	Jul 29, 2013
List as Contributor	0	Jul 20, 2013
Play Ogg Files	0	Jul 20, 2013
Share Code Tips	13	Jul 19, 2013

RE Module Performance

Fábio Santos

Chris Angelico

Tim Delaney

Devyn Collier Johnson

88888 Dihedral

wxjmfauth

Chris Angelico

Chris Angelico

Michael Torrie

Chris Angelico

Michael Torrie

Terry Reedy

Stefan Behnel

Chris Angelico

Terry Reedy

Chris Angelico

Michael Torrie

Chris Angelico

Dennis Lee Bieber

Serhiy Storchaka

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads