Unicode and Python - how often do you index strings?

Chris Angelico

A current discussion regarding Python's Unicode support centres (or
centers, depending on how close you are to the cent[er]{2} of the
universe) around one critical question: Is string indexing common?

Python strings can be indexed with integers to produce characters
(strings of length 1). They can also be iterated over from beginning
to end. Lots of operations can be built on either one of those two
primitives; the question is, how much can NOT be implemented
efficiently over iteration, and MUST use indexing? Theories are great,
but solid use-cases are better - ideally, examples from actual
production code (actual code optional).
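
To make the distinction concrete, here is a minimal sketch (char_at is
purely illustrative, not a real API):

# Indexing: O(1) random access to an arbitrary position.
text = "Is string indexing common?"
middle = text[len(text) // 2]

# Iteration: characters are visited front to back; no random access needed.
vowels = sum(1 for ch in text if ch in "aeiou")

# Anything indexing can do, iteration can emulate - just not in O(1):
def char_at(s, n):
    for i, ch in enumerate(s):
        if i == n:
            return ch
    raise IndexError(n)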

I know the collective experience of python-list can't fail to bring up
a few solid examples here :)

Thanks in advance, all!!

ChrisA
 
Roy Smith

Chris Angelico said:
A current discussion regarding Python's Unicode support centres (or
centers, depending on how close you are to the cent[er]{2} of the
universe)

<sarcasm style="regex-pedant">Um, you mean cent(er|re), don't you? The
pattern you wrote also matches centee and centrr.</sarcasm>
around one critical question: Is string indexing common?

Not in our code. I've got 80008 non-blank lines of Python (2.7) source
handy. I tried a few heuristics to find patterns which might be string
indexing.

$ find . -name '*.py' | xargs egrep '\[[^]][0-9]+\]'

and then looked them over manually. I see this pattern a bunch of times
(in a single-use script):

data['shard_key'] = hashlib.md5(str(id)).hexdigest()[:4]

We do this once:

if tz_offset[0] == '-':

We do this somewhere in some command-line parsing:

process_match = args.process[:15]

There's this little gem:

return [dedup(x[1:-1].lower()) for x in
re.findall('(\[[^\]\[]+\]|\([^\)\(]+\))',title)]

It appears I wrote this one, but I don't remember exactly what I had in
mind at the time...

withhyphen = number if '-' in number else (number[:-2] + '-' +
number[-2:]) # big assumption here

Anyway, there's a bunch more, but the bottom line is that in our code,
indexing into a string (at least explicitly in application source code)
is a pretty rare thing.
 
Chris Angelico

Chris Angelico said:
A current discussion regarding Python's Unicode support centres (or
centers, depending on how close you are to the cent[er]{2} of the
universe)

<sarcasm style="regex-pedant">Um, you mean cent(er|re), don't you? The
pattern you wrote also matches centee and centrr.</sarcasm>

Maybe there's someone who spells it that way! Let's not be excluding
people. That'd be rude.
around one critical question: Is string indexing common?

Not in our code. I've got 80008 non-blank lines of Python (2.7) source
handy. I tried a few heuristics to find patterns which might be string
indexing.

$ find . -name '*.py' | xargs egrep '\[[^]][0-9]+\]'

and then looked them over manually. I see this pattern a bunch of times
(in a single-use script):

data['shard_key'] = hashlib.md5(str(id)).hexdigest()[:4]

Slicing is a form of indexing too, although in this case (slicing from
the front) it could be implemented on top of UTF-8 without much
problem.
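
For instance, a prefix slice over UTF-8 needs only a single forward
scan; a rough sketch (utf8_prefix is just an illustrative name, not
anything in CPython):

# Take the first n characters of UTF-8-encoded bytes with one forward scan.
# A byte starts a new code point unless its top two bits are 10 (continuation).
def utf8_prefix(data, n):
    count = 0
    for i, b in enumerate(data):
        if b & 0xC0 != 0x80:      # lead byte: a new character starts here
            if count == n:
                return data[:i]
            count += 1
    return data                   # fewer than n characters: return everything

utf8_prefix("héllo wörld".encode("utf-8"), 4).decode("utf-8")  # 'héll'
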
withhyphen = number if '-' in number else (number[:-2] + '-' +
number[-2:]) # big assumption here

This *definitely* counts; if strings were represented internally in
UTF-8, this would involve two scans (although a smart implementation
could probably count backward rather than forward). By the way, any
time you slice up to the third from the end, you win two extra awesome
points, just for putting [:-3] into your code and having it mean
something. But I digress.
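
The backward scan can be sketched the same way, since UTF-8
continuation bytes are always of the form 10xxxxxx (split_last_n is
likewise only an illustrative name):

# Split UTF-8-encoded bytes into (head, last n characters) by walking backward.
def split_last_n(data, n):
    i = len(data)
    seen = 0
    while i > 0 and seen < n:
        i -= 1
        if data[i] & 0xC0 != 0x80:   # lead byte of a code point
            seen += 1
    return data[:i], data[i:]

head, tail = split_last_n("12345".encode("utf-8"), 2)
(head + b"-" + tail).decode()   # '123-45'
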
Anyway, there's a bunch more, but the bottom line is that in our code,
indexing into a string (at least explicitly in application source code)
is a pretty rare thing.

Thanks. Of course, the pattern you searched for is looking only for
literals; it's a bit harder to find cases where the index (or slice
position) comes from a variable or expression, and those situations
are also rather harder to optimize (the MD5 prefix is clearly better
scanned from the front, the number tail is clearly better scanned from
the back - but with a variable?).

ChrisA
 
wxjmfauth

On Wednesday, June 4, 2014 at 02:39:54 UTC+2, Chris Angelico wrote:
A current discussion regarding Python's Unicode support centres (or
centers, depending on how close you are to the cent[er]{2} of the
universe) around one critical question: Is string indexing common?

Python strings can be indexed with integers to produce characters
(strings of length 1). They can also be iterated over from beginning
to end. Lots of operations can be built on either one of those two
primitives; the question is, how much can NOT be implemented
efficiently over iteration, and MUST use indexing? Theories are great,
but solid use-cases are better - ideally, examples from actual
production code (actual code optional).

I know the collective experience of python-list can't fail to bring up
a few solid examples here :)

Thanks in advance, all!!

ChrisA

=============

Like many, you are not understanding unicode because
you do not understand the coding of characters.

You do not understand the coding of the characters
because you do not understand the mathematics behind it.

You focussed on the wrong problem.

(All this stuff has been discussed, tested and worked on
20 (twenty) years ago.)

Sorry.

jmf
 
Rustom Mody

The language is ENGLISH so the correct spelling is Centre regional
variations my be common but they are incorrect

"my"?

O mee Oo my -- cockney (or Aussie) pedant??
 
Michael Torrie

Like many, you are not understanding unicode because
you do not understand the coding of characters.

If that is true, then I'm sure a well-written paragraph or two can set
him straight. You continually berate people for not understanding
unicode, but you've posted nothing to explain anything, nor demonstrate
your own understanding. That's one reason your posts are so frustrating
and considered trolling. You never ever explain yourself, instead just
flailing around and muttering about folks not understanding unicode,
just as you've done here, true to form.
You do not understand the coding of the characters
because you do not understand the mathematics behind it.

flamebaiting here... FSR *is* UTF-32 internally, compresses off leading
zero bits during string creation.
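
(For what it's worth, one rough way to watch that per-string width
selection in CPython 3.3+ is to compare sys.getsizeof for strings whose
widest character differs; the exact byte counts are implementation
details and vary by build.)

import sys

# Each string below has 100 characters; the storage width is chosen
# per string from the widest code point it contains (PEP 393).
for s in ["a" * 100,             # code points < 256: 1 byte per character
          "\u0394" * 100,        # BMP code points: 2 bytes per character
          "\U0001F600" * 100]:   # astral code points: 4 bytes per character
    print(len(s), sys.getsizeof(s))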
You focussed on the wrong problem.

Frankly it is you who is focused on the wrong problem, at least with
this particular thread. I think you got distracted by the subject line.
Chris's original post really has nothing to do with unicode at all.
He's simply asking for use cases for string indexing where O(1) is
desired or necessary. Could be old Python 2 byte strings, or Python 3
unicode strings. It does not matter. Unicode is orthogonal to his
question.

Maybe his purpose in asking the question is to justify a fixed-length
encoding scheme (which is what FSR actually is), or maybe it is to
explore the costs of using a much slower, but more compact,
variable-length encoding scheme like UTF-8. Particularly in the context
of low-memory applications where unicode support would be nice, but
memory is at a premium. But either way, you got hung up on the wrong thing.
(All this stuff has been discussed, tested and worked on
20 (twenty) years ago.)

Sorry.

As am I.
 
Rustom Mody

A current discussion regarding Python's Unicode support centres (or
centers, depending on how close you are to the cent[er]{2} of the
universe) around one critical question: Is string indexing common?

Not exactly on-topic for this thread...
Still thought it might interest some:
http://www.unicodeit.net/
 
wxjmfauth

On Wednesday, June 4, 2014 at 16:50:59 UTC+2, Michael Torrie wrote:
If that is true, then I'm sure a well-written paragraph or two can set
him straight. You continually berate people for not understanding
unicode, but you've posted nothing to explain anything, nor demonstrate
your own understanding. That's one reason your posts are so frustrating
and considered trolling. You never ever explain yourself, instead just
flailing around and muttering about folks not understanding unicode,
just as you've done here, true to form.

flamebaiting here... FSR *is* UTF-32 internally, compresses off leading
zero bits during string creation.

Frankly it is you who is focused on the wrong problem, at least with
this particular thread. I think you got distracted by the subject line.
Chris's original post really has nothing to do with unicode at all.
He's simply asking for use cases for string indexing where O(1) is
desired or necessary. Could be old Python 2 byte strings, or Python 3
unicode strings. It does not matter. Unicode is orthogonal to his
question.

Maybe his purpose in asking the question is to justify a fixed-length
encoding scheme (which is what FSR actually is), or maybe it is to
explore the costs of using a much slower, but more compact,
variable-length encoding scheme like UTF-8. Particularly in the context
of low-memory applications where unicode support would be nice, but
memory is at a premium. But either way, you got hung up on the wrong thing.

As am I.

=========

Unicode? I have the feeling this is similar to explaining that
i (the imaginary number) is not equal to sqrt(-1).

jmf

PS: I once gave you a link pointing to the unicode.org docs; you
obviously did not read it.
 
Marko Rauhamaa

(e-mail address removed):
Unicode? I have the feeling this is similar to explaining that
i (the imaginary number) is not equal to sqrt(-1).

jmf

PS: I once gave you a link pointing to the unicode.org docs; you
obviously did not read it.

Sir, you are an artist, a poet even!

With admiration,


Marko
 
wxjmfauth

On Thursday, June 5, 2014 at 06:25:49 UTC+2, Rustom Mody wrote:
%%%%%%%%%%

Stick with Xe(La)TeX and do not spend too much time on
the web, and you will learn a lot about unicode.

Send me a private e-mail and I will explain how this
whole font configuration in a TeX unicode engine (not
a utf-8 engine!) works.

jmf
 
Mark H Harris

{snipped all the mess}

And you have many times been given a link explaining the problems with
posting from Google Groups but deliberately choose not to make your
replies readable.

The problem is that things look fine in Google Groups. What helps is
getting to see what the mess looks like from Thunderbird or an
equivalent client.
 
Johannes Bauer

I know the collective experience of python-list can't fail to bring up
a few solid examples here :)

I also just grepped lots of code and found surprisingly few instances
of index-search. Most are with constant indices. One particular example
that comes up a lot is

line = line[:-1]

Which truncates the trailing "\n" of a textfile line.

Then some indexing in the form of

negative = (line[0] == "-")

All in all I'm actually a bit surprised this isn't too common.

Cheers,
Johannes


--
At least not publicly!
Ah, the newest and to this day most ingenious stroke of our great
cosmologists: the Secret Prediction.
- Karl Kaos on Rüdiger Thomas in dsa <[email protected]>
 
Mark Lawrence

The problem is that things look fine in Google Groups. What helps is
getting to see what the mess looks like from Thunderbird or an
equivalent client.

Wrong. 99.99% of people when asked politely take action so there is no
problem. The remaining 0.01% consists of one complete ignoramus.
 
Johannes Bauer

Johannes Bauer said:
line = line[:-1]
Which truncates the trailing "\n" of a textfile line.

use line.rstrip() for that.

rstrip has different functionality than what I'm doing.

Cheers,
Johannes

--
At least not publicly!
Ah, the newest and to this day most ingenious stroke of our great
cosmologists: the Secret Prediction.
- Karl Kaos on Rüdiger Thomas in dsa <[email protected]>
 
Ryan Hiebert

2014-06-05 13:42 GMT-05:00 Johannes Bauer said:
Johannes Bauer said:
line = line[:-1]
Which truncates the trailing "\n" of a textfile line.

use line.rstrip() for that.

rstrip has different functionality than what I'm doing.


How so? I was using line=line[:-1] for removing the trailing newline, and
just replaced it with rstrip('\n'). What are you doing differently?
 
Paul Rubin

Ryan Hiebert said:
How so? I was using line=line[:-1] for removing the trailing newline, and
just replaced it with rstrip('\n'). What are you doing differently?

rstrip removes all the newlines off the end, whether there are zero or
multiple. In perl the difference is chomp vs chop. line=line[:-1]
removes one character, that might or might not be a newline.
 
Chris Angelico

2014-06-05 13:42 GMT-05:00 Johannes Bauer said:
line = line[:-1]
Which truncates the trailing "\n" of a textfile line.

use line.rstrip() for that.

rstrip has different functionality than what I'm doing.


How so? I was using line=line[:-1] for removing the trailing newline, and
just replaced it with rstrip('\n'). What are you doing differently?

>>> line = "Hello,\nworld!\n\n"
>>> line[:-1]
'Hello,\nworld!\n'
>>> line.rstrip('\n')
'Hello,\nworld!'

If it's guaranteed to end with exactly one newline, then and only then
will they be identical.

ChrisA
 
