Unicode and Python - how often do you index strings?

Chris Angelico

A current discussion regarding Python's Unicode support centres (or
centers, depending on how close you are to the cent[er]{2} of the
universe) around one critical question: Is string indexing common?

Python strings can be indexed with integers to produce characters
(strings of length 1). They can also be iterated over from beginning
to end. Lots of operations can be built on either one of those two
primitives; the question is, how much can NOT be implemented
efficiently over iteration, and MUST use indexing? Theories are great,
but solid use-cases are better - ideally, examples from actual
production code (actual code optional).
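
To make the distinction concrete, here is a minimal sketch (char_at is
purely illustrative, not a real API):

# Indexing: O(1) random access to an arbitrary position.
text = "Is string indexing common?"
middle = text[len(text) // 2]

# Iteration: characters are visited front to back; no random access needed.
vowels = sum(1 for ch in text if ch in "aeiou")

# Anything indexing can do, iteration can emulate - just not in O(1):
def char_at(s, n):
    for i, ch in enumerate(s):
        if i == n:
            return ch
    raise IndexError(n)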

I know the collective experience of python-list can't fail to bring up
a few solid examples here :)

Thanks in advance, all!!

ChrisA
 
Roy Smith

Chris Angelico said:
A current discussion regarding Python's Unicode support centres (or
centers, depending on how close you are to the cent[er]{2} of the
universe)

<sarcasm style="regex-pedant">Um, you mean cent(er|re), don't you? The
pattern you wrote also matches centee and centrr.</sarcasm>
around one critical question: Is string indexing common?

Not in our code. I've got 80008 non-blank lines of Python (2.7) source
handy. I tried a few heuristics to find patterns which might be string
indexing.

$ find . -name '*.py' | xargs egrep '\[[^]][0-9]+\]'

and then looked them over manually. I see this pattern a bunch of times
(in a single-use script):

data['shard_key'] = hashlib.md5(str(id)).hexdigest()[:4]

We do this once:

if tz_offset[0] == '-':

We do this somewhere in some command-line parsing:

process_match = args.process[:15]

There's this little gem:

return [dedup(x[1:-1].lower()) for x in
re.findall('(\[[^\]\[]+\]|\([^\)\(]+\))',title)]

It appears I wrote this one, but I don't remember exactly what I had in
mind at the time...

withhyphen = number if '-' in number else (number[:-2] + '-' +
number[-2:]) # big assumption here

Anyway, there's a bunch more, but the bottom line is that in our code,
indexing into a string (at least explicitly in application source code)
is a pretty rare thing.
 
Chris Angelico

Chris Angelico said:
A current discussion regarding Python's Unicode support centres (or
centers, depending on how close you are to the cent[er]{2} of the
universe)

<sarcasm style="regex-pedant">Um, you mean cent(er|re), don't you? The
pattern you wrote also matches centee and centrr.</sarcasm>

Maybe there's someone who spells it that way! Let's not be excluding
people. That'd be rude.
around one critical question: Is string indexing common?

Not in our code. I've got 80008 non-blank lines of Python (2.7) source
handy. I tried a few heuristics to find patterns which might be string
indexing.

$ find . -name '*.py' | xargs egrep '\[[^]][0-9]+\]'

and then looked them over manually. I see this pattern a bunch of times
(in a single-use script):

data['shard_key'] = hashlib.md5(str(id)).hexdigest()[:4]

Slicing is a form of indexing too, although in this case (slicing from
the front) it could be implemented on top of UTF-8 without much
problem.
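
For instance, a prefix slice over UTF-8 needs only a single forward
scan; a rough sketch (utf8_prefix is just an illustrative name, not
anything in CPython):

# Take the first n characters of UTF-8-encoded bytes with one forward scan.
# A byte starts a new code point unless its top two bits are 10 (continuation).
def utf8_prefix(data, n):
    count = 0
    for i, b in enumerate(data):
        if b & 0xC0 != 0x80:      # lead byte: a new character starts here
            if count == n:
                return data[:i]
            count += 1
    return data                   # fewer than n characters: return everything

utf8_prefix("héllo wörld".encode("utf-8"), 4).decode("utf-8")  # 'héll'
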
withhyphen = number if '-' in number else (number[:-2] + '-' +
number[-2:]) # big assumption here

This *definitely* counts; if strings were represented internally in
UTF-8, this would involve two scans (although a smart implementation
could probably count backward rather than forward). By the way, any
time you slice up to the third from the end, you win two extra awesome
points, just for putting [:-3] into your code and having it mean
something. But I digress.
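
The backward scan can be sketched the same way, since UTF-8
continuation bytes are always of the form 10xxxxxx (split_last_n is
likewise only an illustrative name):

# Split UTF-8-encoded bytes into (head, last n characters) by walking backward.
def split_last_n(data, n):
    i = len(data)
    seen = 0
    while i > 0 and seen < n:
        i -= 1
        if data[i] & 0xC0 != 0x80:   # lead byte of a code point
            seen += 1
    return data[:i], data[i:]

head, tail = split_last_n("12345".encode("utf-8"), 2)
(head + b"-" + tail).decode()   # '123-45'
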
Anyway, there's a bunch more, but the bottom line is that in our code,
indexing into a string (at least explicitly in application source code)
is a pretty rare thing.

Thanks. Of course, the pattern you searched for is looking only for
literals; it's a bit harder to find cases where the index (or slice
position) comes from a variable or expression, and those situations
are also rather harder to optimize (the MD5 prefix is clearly better
scanned from the front, the number tail is clearly better scanned from
the back - but with a variable?).

ChrisA
 
wxjmfauth

On Wednesday, June 4, 2014 at 02:39:54 UTC+2, Chris Angelico wrote:
A current discussion regarding Python's Unicode support centres (or
centers, depending on how close you are to the cent[er]{2} of the
universe) around one critical question: Is string indexing common?

Python strings can be indexed with integers to produce characters
(strings of length 1). They can also be iterated over from beginning
to end. Lots of operations can be built on either one of those two
primitives; the question is, how much can NOT be implemented
efficiently over iteration, and MUST use indexing? Theories are great,
but solid use-cases are better - ideally, examples from actual
production code (actual code optional).

I know the collective experience of python-list can't fail to bring up
a few solid examples here :)

Thanks in advance, all!!

ChrisA

=============

Like many, you are not understanding unicode because
you do not understand the coding of characters.

You do not understand the coding of the characters
because you do not understand the mathematics behind it.

You focussed on the wrong problem.

(All this stuff has been discussed, tested and worked on
20 (twenty) years ago.)

Sorry.

jmf
 
Rustom Mody

The language is ENGLISH so the correct spelling is Centre regional
variations my be common but they are incorrect

"my"?

O mee Oo my -- cockney (or Aussie) pedant??
 
Michael Torrie

Like many, you are not understanding unicode because
you do not understand the coding of characters.

If that is true, then I'm sure a well-written paragraph or two can set
him straight. You continually berate people for not understanding
unicode, but you've posted nothing to explain anything, nor demonstrate
your own understanding. That's one reason your posts are so frustrating
and considered trolling. You never ever explain yourself, instead just
flailing around and muttering about folks not understanding unicode,
just as you've done here, true to form.
You do not understand the coding of the characters
because you do not understand the mathematics behind it.

flamebaiting here... FSR *is* UTF-32 internally, compresses off leading
zero bits during string creation.
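
(For what it's worth, one rough way to watch that per-string width
selection in CPython 3.3+ is to compare sys.getsizeof for strings whose
widest character differs; the exact byte counts are implementation
details and vary by build.)

import sys

# Each string below has 100 characters; the storage width is chosen
# per string from the widest code point it contains (PEP 393).
for s in ["a" * 100,             # code points < 256: 1 byte per character
          "\u0394" * 100,        # BMP code points: 2 bytes per character
          "\U0001F600" * 100]:   # astral code points: 4 bytes per character
    print(len(s), sys.getsizeof(s))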
You focussed on the wrong problem.

Frankly it is you who is focused on the wrong problem, at least with
this particular thread. I think you got distracted by the subject line.
Chris's original post really has nothing to do with unicode at all.
He's simply asking for use cases for string indexing where O(1) is
desired or necessary. Could be old Python 2 byte strings, or Python 3
unicode strings. It does not matter. Unicode is orthogonal to his
question.

Maybe his purpose in asking the question is to justify a fixed-length
encoding scheme (which is what FSR actually is), or maybe it is to
explore the costs of using a much slower, but more compact,
variable-length encoding scheme like UTF-8. Particularly in the context
of low-memory applications where unicode support would be nice, but
memory is at a premium. But either way, you got hung up on the wrong thing.
(All this stuff has been discussed, tested and worked on
20 (twenty) years ago.)

Sorry.

As am I.
 
Rustom Mody

A current discussion regarding Python's Unicode support centres (or
centers, depending on how close you are to the cent[er]{2} of the
universe) around one critical question: Is string indexing common?

Not exactly on-topic for this thread...
Still thought it might interest some:
http://www.unicodeit.net/
 
wxjmfauth

On Wednesday, June 4, 2014 at 16:50:59 UTC+2, Michael Torrie wrote:
If that is true, then I'm sure a well-written paragraph or two can set
him straight. You continually berate people for not understanding
unicode, but you've posted nothing to explain anything, nor demonstrate
your own understanding. That's one reason your posts are so frustrating
and considered trolling. You never ever explain yourself, instead just
flailing around and muttering about folks not understanding unicode,
just as you've done here, true to form.

flamebaiting here... FSR *is* UTF-32 internally, compresses off leading
zero bits during string creation.

Frankly it is you who is focused on the wrong problem, at least with
this particular thread. I think you got distracted by the subject line.
Chris's original post really has nothing to do with unicode at all.
He's simply asking for use cases for string indexing where O(1) is
desired or necessary. Could be old Python 2 byte strings, or Python 3
unicode strings. It does not matter. Unicode is orthogonal to his
question.

Maybe his purpose in asking the question is to justify a fixed-length
encoding scheme (which is what FSR actually is), or maybe it is to
explore the costs of using a much slower, but more compact,
variable-length encoding scheme like UTF-8. Particularly in the context
of low-memory applications where unicode support would be nice, but
memory is at a premium. But either way, you got hung up on the wrong thing.

As am I.

=========

Unicode? I have the feeling this is similar to explaining that
i (the imaginary number) is not equal to sqrt(-1).

jmf

PS: I once gave you a link pointing to the unicode.org docs; you
obviously did not read it.
 
Marko Rauhamaa

(e-mail address removed):
Unicode? I have the feeling this is similar to explaining that
i (the imaginary number) is not equal to sqrt(-1).

jmf

PS: I once gave you a link pointing to the unicode.org docs; you
obviously did not read it.

Sir, you are an artist, a poet even!

With admiration,


Marko
 
wxjmfauth

On Thursday, June 5, 2014 at 06:25:49 UTC+2, Rustom Mody wrote:
%%%%%%%%%%

Stick with Xe(La)TeX and do not spend too much time on
the web, and you will learn a lot about unicode.

Send me a private e-mail and I will explain how this
whole font configuration in a TeX unicode engine (not
a utf-8 engine!) works.

jmf
 
Mark H Harris

{snipped all the mess}

And you have many times been given a link explaining the problems with
posting from Google Groups but deliberately choose not to make your
replies readable.

The problem is that things look fine in Google Groups. What helps is
getting to see what the mess looks like from Thunderbird or an
equivalent client.
 
Johannes Bauer

I know the collective experience of python-list can't fail to bring up
a few solid examples here :)

I also just grepped lots of code and found surprisingly few instances
of index-search. Most are with constant indices. One particular example
that comes up a lot is

line = line[:-1]

Which truncates the trailing "\n" of a textfile line.

Then some indexing in the form of

negative = (line[0] == "-")

All in all I'm actually a bit surprised this isn't too common.

Cheers,
Johannes


--
At least not publicly!
Ah, the newest and to this day most ingenious stroke of our great
cosmologists: the Secret Prediction.
- Karl Kaos on Rüdiger Thomas in dsa <[email protected]>
 
Mark Lawrence

The problem is that things look fine in Google Groups. What helps is
getting to see what the mess looks like from Thunderbird or an
equivalent client.

Wrong. 99.99% of people when asked politely take action so there is no
problem. The remaining 0.01% consists of one complete ignoramus.
 
Johannes Bauer

Johannes Bauer said:
line = line[:-1]
Which truncates the trailing "\n" of a textfile line.

use line.rstrip() for that.

rstrip has different functionality than what I'm doing.

Cheers,
Johannes

--
At least not publicly!
Ah, the newest and to this day most ingenious stroke of our great
cosmologists: the Secret Prediction.
- Karl Kaos on Rüdiger Thomas in dsa <[email protected]>
 
Ryan Hiebert

2014-06-05 13:42 GMT-05:00 Johannes Bauer said:
Johannes Bauer said:
line = line[:-1]
Which truncates the trailing "\n" of a textfile line.

use line.rstrip() for that.

rstrip has different functionality than what I'm doing.


How so? I was using line=line[:-1] for removing the trailing newline, and
just replaced it with rstrip('\n'). What are you doing differently?
 
Paul Rubin

Ryan Hiebert said:
How so? I was using line=line[:-1] for removing the trailing newline, and
just replaced it with rstrip('\n'). What are you doing differently?

rstrip removes all the newlines off the end, whether there are zero or
multiple. In perl the difference is chomp vs chop. line=line[:-1]
removes one character, that might or might not be a newline.
 
Chris Angelico

2014-06-05 13:42 GMT-05:00 Johannes Bauer said:
line = line[:-1]
Which truncates the trailing "\n" of a textfile line.

use line.rstrip() for that.

rstrip has different functionality than what I'm doing.


How so? I was using line=line[:-1] for removing the trailing newline, and
just replaced it with rstrip('\n'). What are you doing differently?

>>> line = "Hello,\nworld!\n\n"
>>> line[:-1]
'Hello,\nworld!\n'
>>> line.rstrip('\n')
'Hello,\nworld!'

If it's guaranteed to end with exactly one newline, then and only then
will they be identical.

ChrisA
 
