reduce() anomaly?

Georgy Pruss · Nov 19, 2003

| Georgy Pruss:
| > To me, it's very wrong that you can read any radix numbers, but can't
| > print them. If str(int(s)) == s and int(str(n)) == n (with some limits), I
| don't
| > see why str(n,radix) can't be symmetrical to int(s,radix).
|
| But your objection would also be handled with a special-case function
| which only took an int/long and a base and returned the string
| representation for that base, no? This could be a method or a class
| method of int, or a new function in math.

I have such function already.

| What makes it special enough
| to warrant being part of the string constructor?

Just wanted to get rid of some useless builtin functions and make the
language prettier.

I wouldn't mind if Python had some analog of scanf() as an opposite
operator to '%'.

BTW is there some Python's equivalents to C's strspn, strcspn, strpbrk,
which return a leading sub-string, entirely [not] consisting of characters
of some char.set; and something like strtoul which parses the string and
returns the number and the position where the scan ended?

| > BTW there's no symmetry for str() and list()/tuple()/dict() etc.
|
| There is a symmetry for str(float(s)) == s and str(complex(s)) == s.
| Why shouldn't those take a base?
|
| Well, almost symmetry. str(float("1.1")) != "1.1"

Fortunatelly, str() here is a bit wiser than repr().True

Regarding the complex numbers, I guess the inconsistency with them
is to some extent the consequence of the fact that Python has no
complex literals (contrary to e.g. J) and the str(cmplx) returns some
expression in parenthesis instead of one complex number. So this
case is the same as with lists and other compound objects.

Georgy

| Andrew
| (e-mail address removed)
|

Andrew Dalke · Nov 19, 2003

Georgy Pruss:

Just wanted to get rid of some useless builtin functions and make the
language prettier.

But I don't think that functions with special cases (and the special
case of 'takes a base but only if the first arg is an integer) is pretty.
As I've mentioned, I've not problems if hex/oct are moved to some
module in some future version of Python.

BTW is there some Python's equivalents to C's strspn, strcspn, strpbrk,
which return a leading sub-string, entirely [not] consisting of characters
of some char.set; and something like strtoul which parses the string and
returns the number and the position where the scan ended?

Not directly. In Python those are most often done with regexps.
Eg,

import re
def strspn(s, t):
# kinda slow way to construct the pattern, but it does correctly
# handle the '-' and ']' cases. Usually one would write the regexp
# directly and not try to get close to the C API.
pat = re.compile("(" + "|".join(map(re.escape, t) + ")*")
m = pat.match(s)
if not m:
return 0
return m.end()

Fortunatelly, str() here is a bit wiser than repr().
True

Wiser? Or more approximate? IEE 754 math cannot explicitly
store the value "1.1".

Regarding the complex numbers, I guess the inconsistency with them
is to some extent the consequence of the fact that Python has no
complex literals (contrary to e.g. J) and the str(cmplx) returns some
expression in parenthesis instead of one complex number. So this
case is the same as with lists and other compound objects.

In some sense you are correct. Python doesn't have a complex
literal -- it only has an imaginary literal, so '1+2j' is implemented as

0 LOAD_CONST 1 (1)
3 LOAD_CONST 2 (2j)
6 BINARY_ADD

So I'll change my statement and point out that

str(complex("3j")) == "3j"

(note the lack of parens) so the same symmetry you argue for
str/int should work for str/complex (though complex assumes
floats for the real and imaginary values, so it's more akin to the
str/float relationship.)

Andrew
(e-mail address removed)

Georgy Pruss · Nov 19, 2003

Paul Rubin · Nov 19, 2003

Andrew Dalke said:
The 'against' side says, as you do, that having list.sort required
to be stable now places an extra barrier to anyone else who
implements a list-alike sort, because that sort should now also
be stable.

It's also a barrier to anyone who implements list sort, not just
listalike sort. Of course that doesn't affect CPython or Jython
users, who already have a built-in stable list sort. But what about
PyPy or maybe someone trying to reimplement Python from scratch in C?

What mollifies the latter view is that some user-defined sorts
won't have ties, so there will never be a problem.

I'm not a sorting whiz but my feeling is that if an application needs
sort to be stable, that's already a sign that something isn't really
right about it. Sorting is supposed to mean turning an unordered
permutation into an ordered one. So if list.sort gives you a useable
answer (i.e. one that satisfies your application's needs), then
randomizing the list and then sorting it should also give you a
useable answer. The way to do that is make your comparison function
look at every part of the record that you care about, not just select
some single field and count on stability to take care of the rest. If
at all possible, design the comparison function so that there are
never any ties. To do otherwise has in my experience been a source of
bugs that don't become evident til much later.

Paul Rubin · Nov 19, 2003

Andrew Dalke said:
Do you want unix-style unique where repeats are merged into
one, so that 1 2 2 1 -> 1 2 1 or do you want it to return

That list is not already sorted and so uniqing it should throw an
exception. Uniq should only have to work on sorted lists.

Andrew Dalke · Nov 19, 2003

Paul Rubin:

It's also a barrier to anyone who implements list sort, not just
listalike sort. Of course that doesn't affect CPython or Jython
users, who already have a built-in stable list sort. But what about
PyPy or maybe someone trying to reimplement Python from scratch in C?

My recent statement on this hasn't changed -- this will affect
very few people, and the code to do a simple (perhaps not optimally
efficient for Python) sort is available in many books and libraries, so
should have very little influence on this decision.

I'm not a sorting whiz but my feeling is that if an application needs
sort to be stable, that's already a sign that something isn't really
right about it.

The standard example for stable sorts is in a spreadsheet-like
grid display where the ordering of the rows can be changed based
on the values in a given column (eg, by clicking on the column
header). Rows with column entries with the same value should
keep the same relative order, because that's what people expect.

Why? Because that lets the user define the sort order -- first
sort on tertiary key, then sort on secondary key, then sort on
primary key.

The way to do that is make your comparison function
look at every part of the record that you care about, not just select
some single field and count on stability to take care of the rest.

There are two answers to that. All non-stable sorts can be
turned into a stable sort by tacking on a final comparison based
on the original position. My spreadsheet example doesn't require
having a stable sort because I create my own. But then there's
the inelegance of creating a new list to store the original position,
and the extra dispatch needed to check that new field.

And if there isn't a stable sort, nor one induced by creating
an extra list, then what is the interface for creating the correct
comparison function? It could keep track of all the column
clicks, to know which ones are used as primary, secondary,
etc. sorts keys... but at some point it ends up with a tie
which cannot be resolved (in which case there's no problem)
or resolved by assuming the datatype of a given column.

For example, suppose I click on the 2nd column in

alpha 1
beta 2
beta 3
gamma 2
delta 2

There are ties with the three values of 2. The first column
can break the tie, but only by assuming that the first column
can be sorted alphabetically, to create

alpha 1
beta 2
delta 2
gamma 2
beta 3

I argue that the correct order should preserve the original
sort order, to create

alpha 1
beta 2
gamma 2
delta 2
beta 3

because the user's order is actually the alphabetical order of the
original greek alphabet, implicit in the original order of the data
file. (The user could click on the column header to resort for
that field, which would likely be done alphabetically, but at that
point that's the same as telling the computer to treat those fields
as english words, so sorting alphabetically is fine.)

If at all possible, design the comparison function so that there are
never any ties. To do otherwise has in my experience been a source of
bugs that don't become evident til much later.

I agree with that. But what to do when there are ties? I believe
people expect a stable sort (even without knowing what 'stable'
means) so it's helpful that Python's native sort be stable.

This is obviously a hard decision to make -- Python's sort has
been around for 10 years before this new requirement -- but I
don't think it's a poor one, and I've only seen theoretical
arguments for otherwise.

Andrew
(e-mail address removed)

Andrew Dalke · Nov 20, 2003

Paul Rubin:

Uniq should only have to work on sorted lists.

*shrug* Okay, that's different than the unix 'uniq'
command. I've needed the unix one more than I've
needed yours. And I've used what can now be
written as

unique_keys = dict.from_keys(lst).keys()

much more often. Well, it requires that the objects
be hashable and define == while yours requires
that they define == and < (or define < and use it
twice). Note that your uniq wouldn't work on
complex numbers because those don't define <.

Tack on a 'unique_keys.sort()' and those two lines
give pretty much what you want, except the check
that the original list is sorted.

Andrew
(e-mail address removed)

Aahz · Nov 20, 2003

Paul Rubin:

*shrug* Okay, that's different than the unix 'uniq' command. I've
needed the unix one more than I've needed yours.

Huh?!?! uniq has always to my knowledge only worked on sorted input.
Reading the man page on two different systems confirms my knowledge.

Paul Rubin · Nov 20, 2003

Andrew Dalke said:
needed yours. And I've used what can now be
written as

unique_keys = dict.from_keys(lst).keys()

This from_keys operation must be something new, and consing up a
dictionary like that is a woeful amount of overhead. But I guess
it would do the job.

A.M. Kuchling · Nov 20, 2003

Huh?!?! uniq has always to my knowledge only worked on sorted input.
Reading the man page on two different systems confirms my knowledge.

Yes, but uniq doesn't report an error if its input isn't sorted.

--amk

Ben Finney · Nov 20, 2003

A.M. Kuchling said:
A.M. Kuchling said:

[the Unix command] uniq doesn't report an error if its input isn't
sorted.

Click to expand...

Maybe it should. If its behavior on unsorted input isn't specified,
you shouldn't assume it will act in any particular way.

The specification on the man page for GNU uniq seems clear on this:

DESCRIPTION
Discard all but one of successive identical lines from INPUT
(or standard input), writing to OUTPUT (or standard output).

It doesn't care if the input is sorted or unsorted; it describes
behaviour on successive identical lines, not the total set of all lines.

Paul Rubin · Nov 20, 2003

A.M. Kuchling said:
Yes, but uniq doesn't report an error if its input isn't sorted.

Maybe it should. If its behavior on unsorted input isn't specified,
you shouldn't assume it will act in any particular way.

Erik Max Francis · Nov 20, 2003

Aahz said:
Huh?!?! uniq has always to my knowledge only worked on sorted input.
Reading the man page on two different systems confirms my knowledge.

uniq doesn't care whether the input is sorted or not. All it does is
collapse multiple consecutive duplicate lines into a single line. Using
uniq in conjunction with sort is certainly a common mode, but it's
hardly required.

--
Erik Max Francis && (e-mail address removed) && http://www.alcyone.com/max/
__ San Jose, CA, USA && 37 20 N 121 53 W && &tSftDotIotE
/ \
\__/ There are no dull subjects. There are only dull writers.
-- H.L. Mencken

Paul Rubin · Nov 20, 2003

Ben Finney said:
The specification on the man page for GNU uniq seems clear on this:

DESCRIPTION
Discard all but one of successive identical lines from INPUT
(or standard input), writing to OUTPUT (or standard output).

It doesn't care if the input is sorted or unsorted; it describes
behaviour on successive identical lines, not the total set of all lines.

OK, that's reasonable behavior.

Andrew Dalke · Nov 20, 2003

Me:
Paul Rubin:

This from_keys operation must be something new, and consing up a
dictionary like that is a woeful amount of overhead. But I guess
it would do the job.

Hmm, I misspelled it -- should be 'fromkeys' without the underscore.
Got confused with 'has_key'....

It's a new class method for dict

fromkeys(...)
dict.fromkeys(S[,v]) -> New dict with keys from S and values equal to v.
v defaults to None.

I said 'what can now be written as' meaning that before the
class method I would write it as

d = {}
for x in lst:
d[x] = 1
unique_keys = d.keys()

I don't know what you mean by 'consing up a dictionary'
taking up a 'woeful amount of overhead'. It's a bit of overhead,
but not woefully so. (And what does 'consing' mean? It's
a Lisp thing, yes? A 'cons cell' is the first / rest pair?)

BTW, another option is

unique_keys = list(sets.Set(lst))

but I still haven't internalized the sets module enough to
remember it trippingly.

For the case where I want a list of unique keys and where
I don't care about the resulting order then either of these
should work nicely.

Andrew
(e-mail address removed)

Paul Rubin · Nov 20, 2003

Andrew Dalke said:
I don't know what you mean by 'consing up a dictionary'
taking up a 'woeful amount of overhead'. It's a bit of overhead,
but not woefully so. (And what does 'consing' mean? It's
a Lisp thing, yes? A 'cons cell' is the first / rest pair?)

Sorry, yeah, Lisp jargon. A cons cell is a pair but "consing" has a
more generalized informal meaning of allocating any kind of storage on
the heap, which will have probably to be garbage collected later.
Consing up an object means building it up dynamically.

The way I usually uniq a list goes something like (untested):

def uniq(list):
p = 0
for i in xrange(1, len(list)):
if list != list[p]:
p += 1
list[p] = list
del list[p+1:]

So it just scans through the list once and then does a single del
operation.

BTW, another option is

unique_keys = list(sets.Set(lst))

but I still haven't internalized the sets module enough to
remember it trippingly.

Click to expand...

Yes, more consing

Terry Reedy · Nov 20, 2003

Paul Rubin said:
Sorry, yeah, Lisp jargon. A cons cell is a pair but "consing" has a
more generalized informal meaning of allocating any kind of storage on
the heap, which will have probably to be garbage collected later.
Consing up an object means building it up dynamically.

This 'generalized informal' meaning is new to me also ;-).

The way I usually uniq a list goes something like (untested):

def uniq(list):
p = 0
for i in xrange(1, len(list)):
if list != list[p]:
p += 1
list[p] = list
del list[p+1:]

So it just scans through the list once and then does a single del
operation.

I believe this requires list to be sorted, generally O(nlogn) if not
already so, while hash solution (sets, dict) is O(n) (+O(n) temporary
space). So either method can be 'best' in particular situation.

Terry J. Reedy

Curt · Nov 20, 2003

Erik Max Francis said:
Aahz wrote:

uniq doesn't care whether the input is sorted or not. All it does is
collapse multiple consecutive duplicate lines into a single line. Using
uniq in conjunction with sort is certainly a common mode, but it's
hardly required.

curty@einstein:~$ less uniq.txt
flirty
curty
flirty
curty

curty@einstein:~$ uniq uniq.txt
flirty
curty
flirty
curty

curty@einstein:~$ sort uniq.txt | uniq
curty
flirty

Maybe my uniq is unique.

curty@einstein:~$ man uniq

NAME
uniq - remove duplicate lines from a sorted file
******

Paul Rubin · Nov 20, 2003

Terry Reedy said:
This 'generalized informal' meaning is new to me also ;-).

See also: http://catb.org/~esr/jargon/html/C/cons.html

I believe this requires list to be sorted,

Yes, by intention. The discussion topic was removing duplicates from
a sorted list. I should have been a little more explicit though.

Ben Finney · Nov 20, 2003

Erik Max Francis said:
Erik Max Francis said:

[Unix command] uniq doesn't care whether the input is sorted or not.
All it does is collapse multiple consecutive duplicate lines into a
single line. Using uniq in conjunction with sort is certainly a
common mode, but it's hardly required.

Click to expand...

curty@einstein:~$ less uniq.txt
flirty
curty
flirty
curty

curty@einstein:~$ uniq uniq.txt
flirty
curty
flirty
curty

curty@einstein:~$ sort uniq.txt | uniq
curty
flirty

Maybe my uniq is unique.

curty@einstein:~$ man uniq

NAME
uniq - remove duplicate lines from a sorted file

I suspect your uniq is not unique; merely poorly documented. Pass it a
file consisting of unsorted input, but multiple consecutive identical
lines; you should see them collapsed to one.

Your OS needs a better 'man uniq', since that description doesn't say
what the expected behaviour is with unsorted input. The GNU 'man uniq'
doesn't mention sorted input, but applies to any input.

Read STDIN as bytes rather than a string	0	Jun 19, 2012
NamedTemporaryFile does not match documentation	0	Feb 21, 2013
SSL on 3.2.2	0	Feb 25, 2012
Random module missing?	0	Sep 23, 2009
Why aren't OrderedDicts comparable with < etc?	17	Jul 16, 2009
trying to find nose.tools.assert_raises_regexp	2	May 23, 2014
(any)dbm module lacks a context manager	0	Jan 31, 2013
how to solve it?	1	Aug 1, 2011

reduce() anomaly?

Georgy Pruss

Andrew Dalke

Georgy Pruss

Paul Rubin

Paul Rubin

Andrew Dalke

Andrew Dalke

Aahz

Paul Rubin

A.M. Kuchling

Ben Finney

Paul Rubin

Erik Max Francis

Paul Rubin

Andrew Dalke

Paul Rubin

Terry Reedy

Curt

Paul Rubin

Ben Finney

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads