Trouble sorting lists (unicode/locale related?)

xtian · Sep 22, 2003

David Eppstein said:
For example, which one of the following would be more efficient or,
moreover, more pythonic?

if aa[:3] == 'abc':

vs

if aa.startswith('abc'):

The latter is clearly more readable.

More Pythonic, too, I think. "Readability counts," and "There should be
one-- and preferably only one --obvious way to do it." In this case,
startswith must be the one obvious way, or else why would it exist in
the standard library at all?

It's also much more maintainable - if in the future that 'abc' needs
to change to 'abcdef', the slicing version requires changes in two
places (just asking for bugs), while startswith requires only one. Of
course, all these things (readability, maintainability and
pythonicism) are fairly closely interrelated.
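
To make the maintenance hazard concrete, here is a small sketch in modern Python 3. The names PREFIX, starts_with_slice and starts_with_method are made up for illustration. It shows how the slice version silently breaks when the literal is lengthened but the slice bound is not:

```python
PREFIX = 'abcdef'  # was 'abc'; only this literal was updated


def starts_with_slice(s):
    # The hard-coded 3 was forgotten when PREFIX grew to six characters,
    # so a 3-character slice is compared with a 6-character string.
    return s[:3] == PREFIX


def starts_with_method(s):
    # startswith() takes its length from the argument itself.
    return s.startswith(PREFIX)


print(starts_with_slice('abcdefgh'))   # False: 'abc' != 'abcdef'
print(starts_with_method('abcdefgh'))  # True
```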

xtian

Jeremy Fincher · Sep 22, 2003

Shu-Hsien Sheu said:
For example, which one of the following would be more efficient or,
moreover, more pythonic?

if aa[:3] == 'abc':

vs

if aa.startswith('abc'):

Python is about maintainability, and the latter is significantly more
maintainable than the former; if the string you're checking against
changes in size, using .startswith doesn't require you to change your
sourcecode anywhere else.

Jeremy

Paul · Sep 22, 2003

However, what if you don't want case sensitivity? For example, to
check if a file is a jpg, I do name[-3:].lower() == 'jpg'. This will
work with both foo.jpg and foo.JPG.

Is this slower than name.lower().endswith('jpg')? Is there a better
solution altogether?

Paul

Bob Gailer said:
Shu-Hsien Sheu said:

Hi,

I have a question about the comparison of efficiency of string slicing and
using string.startswith.
For example, which one of the following would be more efficient or,
moreover, more pythonic?

if aa[:3] == 'abc':

vs

if aa.startswith('abc'):

Here's one way to address this kind of question:
... st = time.time()
... for i in range(500000):
...     if aa.startswith('abc'): pass
... print time.time() - st
...
... st = time.time()
... for i in range(500000):
...     if aa[:3] == 'abc': pass
... print time.time() - st
...
1.01100003719
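
For a measurement that is less sensitive to loop overhead, the standard timeit module can be used instead of hand-written time.time() loops. A minimal sketch in Python 3, assuming aa is some string that starts with 'abc' (the value below is made up, not taken from the post); the absolute numbers depend entirely on the machine and interpreter:

```python
import timeit

setup = "aa = 'abcdefghij'"  # assumed test value, not from the original post

# Each call runs the statement `number` times and returns the total seconds.
t_method = timeit.timeit("aa.startswith('abc')", setup=setup, number=500000)
t_slice = timeit.timeit("aa[:3] == 'abc'", setup=setup, number=500000)

print("startswith: %.3f s" % t_method)
print("slice:      %.3f s" % t_slice)
```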

Bob Gailer
303 442 2625

--

Alex Martelli · Sep 22, 2003

Paul said:
However, what if you don't want case sensitivity? For example, to
check if a file is a jpg, I do name[-3:].lower() == 'jpg'. This will
work with both foo.jpg and foo.JPG.

Is this slower than name.lower().endswith('jpg')? Is there a better
solution altogether?

timeit.py gives me 0.9 microseconds for the less maintainable
name[-3:].lower()=='jpg' vs 1.7 for name.lower().endswith('jpg')
[and over 3 for re's]. Point is, _do I *care*_? How many millions
of filenames will I check, when the extra overhead of checking
a million filenames in the more maintainable way is less than a
second? How long will it have taken me to get those millions
of filenames into memory in the first place?

If this operation IS on the bottleneck, and a 0.8 microsecond
difference matters, I'll do the slicing -- but 99 times out of 100
I'll do the .endswith instead (I'll _start_ that way 100 times out
of 100, and optimize it iff profiling tells me that matters...).

Alex

Peter Hansen · Sep 23, 2003

Paul said:
However, what if you don't want case sensitivity? For example, to
check if a file is a jpg, I do name[-3:].lower() == 'jpg'. This will
work with both foo.jpg and foo.JPG.

Is this slower than name.lower().endswith('jpg')? Is there a better
solution altogether?

Yes, of course.

import os
if os.path.splitext(name)[1].lower() == 'jpg':
    pass

That also handles the problem with files named "ThisFileIs.NotAjpg"
being mistreated, as the other solutions do. ;-)

-Peter

David Eppstein · Sep 23, 2003

Peter Hansen said:
Paul said:

However, what if you don't want case sensitivity? For example, to
check if a file is a jpg, I do name[-3:].lower() == 'jpg'. This will
work with both foo.jpg and foo.JPG.

Is this slower than name.lower().endswith('jpg')? Is there a better
solution altogether?

Yes, of course.

import os
if os.path.splitext(name)[1].lower() == 'jpg':
    pass

That also handles the problem with files named "ThisFileIs.NotAjpg"
being mistreated, as the other solutions do. ;-)

I was about to post the same answer. One minor nit, though: it should be

os.path.splitext(name)[1].lower() == '.jpg'

It may be a little longer than the other solutions, but it expresses the
meaning more clearly.
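
A quick sketch of the corrected check (Python 3; is_jpg is a hypothetical helper name). os.path.splitext keeps the leading dot in the extension, which is why the comparison has to be against '.jpg':

```python
import os.path


def is_jpg(name):
    # splitext('foo.JPG') returns ('foo', '.JPG'); the dot stays with the extension.
    return os.path.splitext(name)[1].lower() == '.jpg'


print(os.path.splitext('foo.JPG'))   # ('foo', '.JPG')
print(is_jpg('foo.jpg'))             # True
print(is_jpg('foo.JPG'))             # True
print(is_jpg('ThisFileIs.NotAjpg'))  # False: the extension is '.NotAjpg'
```

By contrast, 'ThisFileIs.NotAjpg'[-3:].lower() == 'jpg' is true, which is the false positive Peter pointed out.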

Terry Reedy · Sep 23, 2003

Shu-Hsien Sheu said:
Hi,

For example, which one of the following would be more efficient or,
moreover, more pythonic?

if aa[:3] == 'abc':

This creates and deletes a temporary object.

if aa.startswith('abc'):

This makes a function call. As Bob showed for his system, the two
overheads are about the same for a three char prefix. If one were
checking a 30000 byte prefix, the call might win.

Terry J. Reedy

Peter Hansen · Sep 23, 2003

David said:
Peter Hansen said:

Paul said:

However, what if you don't want case sensitivity? For example, to
check if a file is a jpg, I do name[-3:].lower() == 'jpg'. This will
work with both foo.jpg and foo.JPG.

Is this slower than name.lower().endswith('jpg')? Is there a better
solution altogether?

Yes, of course.

import os
if os.path.splitext(name)[1].lower() == 'jpg':
    pass

That also handles the problem with files named "ThisFileIs.NotAjpg"
being mistreated, as the other solutions do. ;-)

I was about to post the same answer. One minor nit, though: it should be

os.path.splitext(name)[1].lower() == '.jpg'

Oops, thanks! I _always_ make that mistake. It just seems that if
os.path.split() does not return any path separators in the components,
then os.path.splitext() shouldn't return the extension separator...

Luckily we have tests around here to catch that kind of thing.

-Peter

Peter Otten · Sep 23, 2003

Duncan said:
Note that if anyone proposes this seriously, it should generate a 3-tuple
(mapping(item), index, item) rather than the 2-tuple you suggest.

This is because the mapping function could reasonably be used to impose an
ordering on objects that have no natural order, so you need to be sure
that the comparison never falls through to the original object even where
the mapping compares equal. It also has the side effect of making the sort
stable (although if stability is a goal you might want another option to
reverse the sort which would use '-index' as the second element and call
.reverse() on the result).

So my demo implementation was faulty :-(
Let me try again:

def sort(self, cmpfunc=None, mapfunc=None):
    if mapfunc:
        list.sort(self, lambda x, y: cmp(mapfunc(x), mapfunc(y)))
    else:
        list.sort(self, cmpfunc)

Seriously, if sort() were to be thus extended, the dsu pattern would rather
act as a performance enhancement under the hood. I prefer plain C

struct {
    PyObject *decorator;
    int index; /* iff stability is not guaranteed by the sort algorithm */
    PyObject *data;
}

over n-tuples and would hope to reuse the sorting infrastructure already
there in listobject.c to compare only the decorators. I am not familiar
with the C source, so be gentle if that is not feasible.

FWIW, I think something like this belongs in the standard library rather
than as a method on lists or a new builtin.

If you think of the comparison as a two-step process,

(1) extract or calculate an attribute
(2) wrap it into a comparison function

consider the following use cases:

(a) lambda x, y: cmp(y, x)
(b) lambda x, y: cmp(x.attr.lower(), y.attr.lower())

All code following the latter pattern could be rewritten (more clearly, I
think) as

alist.sort(mapfunc=lambda x: x.attr.lower())

and would automatically benefit from dsu (which could even be turned off
transparently for the client code for very large lists, where the memory
footprint becomes a problem).

The litmus test whether to put it into a utility function or into an already
existing method that needs only a minor and completely backwards-compatible
interface change would then be:

Are (b)-style comparison functions nearly as common as, or more common than,
(a)-style functions?
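
For readers unfamiliar with the decorate-sort-undecorate (dsu) pattern discussed here, a minimal pure-Python sketch follows (Python 3; dsu_sort and the Record class are illustrative names, not a proposed API). It uses the 3-tuple form Duncan describes, so the comparison never reaches the original objects and ties keep their original order:

```python
class Record:
    """Toy object with no natural ordering (comparing two would raise TypeError)."""

    def __init__(self, attr):
        self.attr = attr


def dsu_sort(items, mapfunc):
    # Decorate: (key, original index, item). The unique index breaks ties,
    # so tuple comparison never falls through to the item itself.
    decorated = [(mapfunc(item), index, item) for index, item in enumerate(items)]
    decorated.sort()
    # Undecorate: keep only the original items.
    return [item for _key, _index, item in decorated]


records = [Record('b'), Record('A'), Record('a'), Record('B')]
result = dsu_sort(records, lambda x: x.attr.lower())
print([r.attr for r in result])  # ['A', 'a', 'b', 'B']
```

Later Python versions (2.4 onward) expose the same idea directly as the key argument of list.sort() and sorted().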

Peter

PS: You and Alex Martelli dismissed my somewhat unfocused idea faster than I
could sort it out. I hope you will find the above useful, though.

Shu-Hsien Sheu · Sep 23, 2003

Many thanks for all the answers!
I really enjoy Python and learned a lot here.

-shuhsien

Shu-Hsien Sheu · Sep 23, 2003

Hi,

In my understanding, using try/except rather than if/else is more
pythonic. However, sometimes it is difficult to use the latter.
For example, I want to search for a substring in a list composed of
strings. It is considered "positive" if there is a match, no matter how
many.

my_test = ['something', 'others', 'still others']

case 1: try/except

hit = 0
for i in my_test:
    try:
        i.index('some')
        hit = 1
    except ValueError:
        pass

case 2: if/else

hit = 0
for i in my_test:
    if 'some' in i:
        hit = 1

It seems to me that in simple searching/matching, using if might be
better and the code is smaller. Try/except would have its strength in
catching multiple errors. However, problems occur if the criterion is
composed of "or" rather than "and". For instance:

if (a in b) or (c in b):
    *do something

try:
    b.index(a)
    b.index(c)
    *do something
except ValueError:
    pass

The above two are very different.

Am I right here?

-shuhsien

Gerrit Holl · Sep 23, 2003

Shu-Hsien Sheu said:
In my understanding, using try/except rather than if/else is more
pythonic. However, sometimes it is difficult to use the latter.
For example, I want to search for a substring in a list composed of
strings. It is considered "positive" if there is a match, no matter how
many.

my_test = ['something', 'others', 'still others']

case 1: try/except

hit = 0
for i in my_test:
    try:
        i.index('some')
        hit = 1
    except ValueError:
        pass

case 2: if/else

hit = 0
for i in my_test:
    if 'some' in i:
        hit = 1

Much faster would be:
def check():
    for elem in my_test:
        if 'some' in elem:
            return True
    return False

...this way, it immediately stops checking all following values once it finds
a single match.
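
In current Python versions the same early exit can be written with the built-in any() and a generator expression (neither was available when this thread was written; any() arrived in Python 2.5). A small sketch, where contains_substring is an illustrative name:

```python
my_test = ['something', 'others', 'still others']


def contains_substring(strings, needle):
    # any() stops consuming the generator at the first True value.
    return any(needle in s for s in strings)


print(contains_substring(my_test, 'some'))  # True
print(contains_substring(my_test, 'xyz'))   # False
```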

It seems to me that in simple searching/matching, using if might be
better and the code is smaller. Try/except would have its strength in
catching multiple errors.
Agreed.

However, problems occur if the criterion is
composed of "or" rather than "and". For instance:

if (a in b) or (c in b):
    *do something

try:
    b.index(a)
    b.index(c)
    *do something
except ValueError:
    pass

The above two are very different.

It would be more similar to use 'if (a in b) and (c in b)',
because that is what the try/except block does. If so, I
think it has the same effect.
I would absolutely prefer the former, because I don't like function
calls that change no object and whose return value is
thrown away.

regards,
Gerrit.

Stephen Horne · Sep 23, 2003

Hi,

In my understanding, using try/except rather than if/else is more
pythonic.

When to use an exception can be a difficult issue in any language.

A common suggestion as to when to use them is 'only for errors' - but
what exactly is an error? Really, it's just a special case - and if you
are handling that special case it isn't an error any more.

For example, if your program runs out of memory and cannot complete a
requested task, as long as it handles that special case by reporting
that it ran out of memory and not by crashing or whatever, it is
probably doing what any decent requirements document would specify -
in case of insufficient memory, report the problem and cleanly abort
the tasks that cannot be completed without terminating or crashing. In
terms of the requirements, no error has occurred - a special case has
simply been handled exactly as per the requirements.

Actually, I'm not really convinced by that argument either. I think I
first read it in Stroustrup, though I can't be bothered checking ATM.
Anyway, what I'm trying to express is that when to use an exception is
more about intuitions and common sense than hard-and-fast rules.

The nearest thing to a hard-and-fast rule, though, is that if you
throw an exception the condition should really be exceptional - a
special case as opposed to a normal case. try/except is not just an
alternative to if/else, and neither is it an alternative to 'break' in
loops (though Icon programmers may be tempted to use it that way).

One major benefit of exceptions is that you don't have to keep
checking 'has any error occurred' in multiple if statements in a
function. Any function that does anything even remotely complicated
will have many different possible error conditions. Making sure that
normal-case code is not run when a failure has already occurred can be
a major headache. Enough of a headache that in C, many people think
error handling is the only case where a 'goto' is acceptable (so an
error can trigger a goto straight to the cleanup code at the end of a
function, keeping the rest of the function clean).

There are other issues, of course - exceptions allow your functions to
cope properly with errors that you don't even know about in functions
that you call, for instance.

But basically, if you handle exceptional conditions by raising
exceptions, your logic will get much clearer as for the most part it
only has to deal with the normal case. The exceptional cases get
handled by the exception handlers.

But if you use try/except as an alternative to if/else, people who
have to read your code may well take up dark magics purely so they can
curse you more effectively ;-)

Tim Rowe · Sep 23, 2003

Hi,

In my understanding, using try/except rather than if/else is more
pythonic.

Rule of thumb: when the block of code is still doing what it's
supposed to do, use if/else. If it's failing to do what it's supposed
to do, use try/except. "except" should be an /exception/!

However, sometimes it is difficult to use the latter.
For example, I want to search for a substring in a list composed of
strings. It is considered "positive" if there is a match, no matter how
many.

my_test = ['something', 'others', 'still others']

case 1: try/except

hit = 0
for i in my_test:
    try:
        i.index('some')
        hit = 1
    except ValueError:
        pass

I'd reckon that to be a bad use of try/except; the "exception" is a
perfectly normal case.

case 2: if/else

hit = 0
for i in my_test:
    if 'some' in i:
        hit = 1

My /guess/ is that this would be faster than case 1, as well as
clearer!

It seems to me that in simple searching/matching, using if might be
better and the code is smaller. Try/except would have its strength in
catching multiple errors. However, problems occur if the criterion is
composed of "or" rather than "and". For instance:

if (a in b) or (c in b):
    *do something

try:
    b.index(a)
    b.index(c)
    *do something
except ValueError:
    pass

The above two are very different.

Yes. The first is clear and concise, the second is verbose and
unclear! Also the second could mask a genuine ValueError if a, b, or
c is an evaluation rather than a simple variable, so you'd think that
neither a nor c was in b when in fact you have no idea: something went
invisibly wrong and you never actually completed the search.

So try/except /only/ when something has gone wrong and you need to go
into some sort of recovery or termination, /not/ for routine tests.
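
A sketch of the masking problem described above (Python 3; the function name and inputs are invented for illustration). Here each search term is computed by int(), which itself raises ValueError on bad input, and the handler cannot tell the two causes apart:

```python
def both_present(b, raw_a, raw_c):
    try:
        # int() raises ValueError on malformed input, and so does
        # str.index() when the substring is missing.
        b.index(str(int(raw_a)))
        b.index(str(int(raw_c)))
        return True
    except ValueError:
        return False


print(both_present('12 34', '12', '34'))    # True: both found
print(both_present('12 34', '12', '56'))    # False: '56' really is absent
print(both_present('12 34', '12', 'oops'))  # False, but only because int('oops') failed
```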

Dennis Lee Bieber · Sep 23, 2003

Shu-Hsien Sheu fed this fish to the penguins on Tuesday 23 September
2003 08:10 am:

It seems to me that in simple searching/matching, using if might be
better and the code is smaller. Try/except would have its strength in
catching multiple errors. However, problems occur if the criterion is
composed of "or" rather than "and". For instance:

if (a in b) or (c in b):
    *do something

Python conditionals short-circuit. If the first clause is true, the
second (for an OR) is never even executed.

try:
    b.index(a)
    b.index(c)
    *do something
except ValueError:
    pass

The above two are very different.

Yes, in that this is equivalent to an AND, rather than an OR
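
The difference can be checked directly. In this sketch (Python 3, illustrative function names), the try/except version agrees with "and", not with "or":

```python
def with_or(a, b, c):
    # Short-circuits: (c in b) is not evaluated when (a in b) is true.
    return (a in b) or (c in b)


def with_try(a, b, c):
    try:
        b.index(a)  # raises ValueError if a is missing ...
        b.index(c)  # ... and so does this if c is missing
        return True
    except ValueError:
        return False


b = 'something'
pairs = [('some', 'thing'), ('some', 'xyz'), ('xyz', 'thing'), ('xyz', 'abc')]
for a, c in pairs:
    # Columns: a, c, "or" result, try/except result, "and" result
    print(a, c, with_or(a, b, c), with_try(a, b, c), (a in b) and (c in b))
```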

--

Bob Gailer · Sep 24, 2003

Shu-Hsien Sheu said:
Hi,

In my understanding, using try/except rather than if/else is more pythonic.

If/else and try/except are both Pythonic. In many cases you don't even get
to choose.

However, sometimes it is difficult to use the latter.
For example, I want to search for a substring in a list composed of
strings. It is considered "positive" if there is a match, no matter how many.

my_test = ['something', 'others', 'still others']

case 1: try/except

hit = 0
for i in my_test:
    try:
        i.index('some')
        hit = 1
    except ValueError:
        pass

case 2: if/else

hit = 0
for i in my_test:
    if 'some' in i:
        hit = 1

Consider breaking out of the loop at the first success

for i in my_test:
    if 'some' in i:
        hit = 1
        break
else:
    hit = 0

A list comprehension can also be an alternative:

hit = [i for i in my_test if 'some' in i]

It seems to me that in simple searching/matching, using if might be
better and the code is smaller. Try/except would have its strength in
catching multiple errors. However, problems occur if the criterion is
composed of "or" rather than "and". For instance:

if (a in b) or (c in b):
    *do something

try:
    b.index(a)
    b.index(c)
    *do something
except ValueError:
    pass

The above two are very different.

Am I right here?

AFAIAC it's a matter of style.

Bob Gailer
303 442 2625

Hung Jung Lu · Sep 27, 2003

Tim Rowe said:
Rule of thumb: when the block of code is still doing what it's
supposed to do, use if/else. If it's failing to do what it's supposed
to do, use try/except. "except" should be an /exception/! ......
So try/except /only/ when something has gone wrong and you need to go
into some sort of recovery or termination, /not/ for routine tests.

You have a valid point of view, which nonetheless is not shared by
everyone. This is a recurring subject in the newsgroup.

Python exceptions have been used for other purposes, as can be seen
from Python FAQ (e.g. "4.22 Why is there no goto?" in
http://www.python.org/doc/faq/general.html)

The "for" loop in Python is also implemented internally with
exceptions. E.g.: http://groups.google.com/[email protected]&oe=UTF-8&output=gplain,
where it is mentioned:

"... In some other languages, 'non failure' mode exceptions may be
unusual, but it's the normal idiom in Python."
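
What the quoted remark refers to is the iterator protocol: a for loop keeps requesting items until the iterator raises StopIteration, which signals normal completion rather than a failure. A rough Python 3 sketch of the equivalent hand-written loop (manual_for is an illustrative name):

```python
def manual_for(iterable, body):
    # Roughly what `for item in iterable: body(item)` does behind the scenes.
    iterator = iter(iterable)
    while True:
        try:
            item = next(iterator)
        except StopIteration:
            break  # normal end of the loop, signalled by an exception
        body(item)


collected = []
manual_for(['a', 'b', 'c'], collected.append)
print(collected)  # ['a', 'b', 'c']
```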

regards,

Hung Jung

Hung Jung Lu · Sep 27, 2003

Shu-Hsien Sheu said:
catching multiple errors. However, problems occur if the criterion is
composed of "or" rather than "and". For instance:

if (a in b) or (c in b):
    *do something

try:
    b.index(a)
    b.index(c)
    *do something
except ValueError:
    pass

For the "or" case, actually the exception trick works rather well.
Usually people raise events by hand:

class Found(Exception): pass

try:
    if a in b: raise Found
    if c in b: raise Found
except Found:
    do something

Of course, one can supply additional information, as in:

try:
    if a in b: raise Found('a in b')
    if c in b: raise Found('c in b')
except Found, e:
    print str(e)

Hung Jung

Tim Rowe · Sep 29, 2003

You have a valid point of view, which nonetheless is not shared by
everyone. This is a recurring subject in the newsgroup.

<examples snipped>

That's the nice thing about rules of thumb: they're not binding.

Internal constructs don't bother me -- ultimately it all comes down to
machine code which will be a mass of goto's. That doesn't mean that
my code should be (could be!) a mass of goto's.

And the last time I needed a goto for anything I was programming in
BASIC; I don't need to use exceptions to emulate it in a language with
decent flow control constructs (Python qualifies, of course!).

Stephen Horne · Sep 29, 2003

Internal constructs don't bother me -- ultimately it all comes down to
machine code which will be a mass of goto's. That doesn't mean that
my code should be (could be!) a mass of goto's.

And the last time I needed a goto for anything I was programming in
BASIC; I don't need to use exceptions to emulate it in a language with
decent flow control constructs (Python qualifies, of course!).

Ah - a good old goto debate ;-)

I have long claimed that goto is useful in rare circumstances, that it
can be clearer and more maintainable than structured code in those
rare cases, and that while anything can be abused it is still
perfectly possible for a good programmer to use it responsibly and to
refactor when the code becomes too messy. Structured, OO, and indeed
any style of programming can become unreadable and unmaintainable
(hence the fuss about refactoring).

That said, it was quite a shock when I actually found a case where I
needed a goto about six months ago. Quite literally I made several
attempts at writing a function in a structured way and got it wrong
each time. Each time I came back to it the previous version seemed a
horrible mess so I rewrote it. Eventually I figured this was stupid,
especially as it didn't seem so hard in my head. I put that mental
picture onto paper - and as I did so I realised it simply wasn't a
structured job. It was very much a state transition model. But because
it completes all its work in one go, and doesn't need to maintain the
state between calls, it simply hadn't occurred to me to think of it
that way.

I remember telling a college lecturer that if state transition models
are appropriate for some things then so must gotos be appropriate, and
that in a 'short running' state machine a goto would be the logical
way to implement the transitions. At the time, I didn't think it would
be so long before I found an example ;-)

Anyway, this example was in some fairly low level library code I was
writing in C++. I don't see any need for goto in Python, trust that it
will never be added, and don't like to see exceptions abused that way.

Just sharing a rather pointless anecdote.
