Does '#hash' mean anything in IDLE?

John Coleman

Greetings,
I am currently trying to learn Python through the excellent
"Learning Python" book. I wrote my first non-trivial program, which
began with several comment lines. One of the comment lines began with
'#hash'. IDLE doesn't colorize it as a comment line but instead colors
the word 'hash' in purple as if it were a keyword. Weird. The behavior
seems easy to trigger: Just open up a new window in IDLE and enter
these two lines:

#This is a test
#hash should still be a comment line

Then, after saving, the second line is not colored as a comment line
though the first is.
Is this a bug, or do comment lines which begin with #hash have some
special meaning?
My program ran fine, so it seems that the interpreter itself is
ignoring the line.
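
One way to check that the interpreter really does treat both lines as comments, independent of how IDLE colors them, is to run them through the standard tokenize module. This is only a minimal sketch and assumes Python 3 (which this thread predates); the variable names are made up for illustration.

```python
import io
import tokenize

# The two lines from the post, exactly as typed into IDLE.
source = "#This is a test\n#hash should still be a comment line\n"

# generate_tokens() wants a readline callable that returns str.
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

# Keep only the tokens the tokenizer classifies as comments.
comments = [tok.string for tok in tokens if tok.type == tokenize.COMMENT]
print(comments)
```

Both lines come back as comment tokens, which fits the observation that the program ran fine and only the editor's coloring is off.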

-John Coleman
 
John Coleman

John said:
Greetings,
I am currently trying to learn Python through the excellent
"Learning Python" book. I wrote my first non-trivial program, which
began with several comment lines. One of the comment lines began with
'#hash'. IDLE doesn't colorize it as a comment line but instead colors
the word 'hash' in purple as if it were a keyword. Weird. The behavior
seems easy to trigger: Just open up a new window in IDLE and enter
these two lines:

#This is a test
#hash should still be a comment line

Then, after saving, the second line is not colored as a comment line
though the first is.
Is this a bug, or do comment lines which begin with #hash have some
special meaning?
My program ran fine, so it seems that the interpreter itself is
ignoring the line.

-John Coleman

It isn't just #hash, but also things like #dict, #int, #len at the
start of a comment line which defeat IDLE's colorization algorithm.
Interestingly, things like #while or #for behave as expected so it
seems to be built-ins rather than keywords which are the problem. To
answer my own question - this is pretty clearly a (harmless) bug.
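
The built-in versus keyword split can be checked directly with the standard keyword and builtins modules. A small sketch, assuming Python 3; it says nothing about how IDLE's colorizer is actually implemented, only that the names fall into the two groups guessed above.

```python
import builtins
import keyword

names = ["hash", "dict", "int", "len", "while", "for"]

# For each name: (is it a reserved keyword?, is it a built-in name?)
report = {
    name: (keyword.iskeyword(name), hasattr(builtins, name))
    for name in names
}

for name, (is_kw, is_builtin) in report.items():
    kind = "keyword" if is_kw else "built-in" if is_builtin else "neither"
    print(name, kind)
```

The names that trigger the miscoloring are all built-ins, and the ones that behave are all keywords.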

-John Coleman
 
John Salerno

me too!
It isn't just #hash, but also things like #dict, #int, #len at the
start of a comment line which defeat IDLE's colorization algorithm.
Interestingly, things like #while or #for behave as expected so it
seems to be built-ins rather than keywords which are the problem. To
answer my own question - this is pretty clearly a (harmless) bug.

also notice that putting a space after # stops the problem
 
John Coleman

John said:
me too!


also notice that putting a space after # stops the problem

How do you like Python so far? I like dictionary objects the best so
far. I'm coming to Python from VBScript, so I already knew the value of
such things, but in Python they are better supported.

Here is the program I was talking about, which *really* shows the power
of dictionaries:

*****************************************************************************************

#Python program to discover word with most 1-word anagrams

#The following hash function has the property
#that words which are anagrams of each other
#hash to the same string. It assumes that input
#is lower case in range 'a' to 'z'

def letter_hash(word):
    codes = 26 * [0]
    for c in word:
        codes[ord(c)-97] += 1
    return_string = ''
    for i in range(26):
        j = codes[i]
        if j > 0:
            return_string += (str(j)+chr(i+97))
    return return_string

#main program:

hashes = {}

#first load dictionary of hashes

for line in open('C:\\yawl.txt').readlines():
    word = line.strip().lower() #to be safe
    my_hash = letter_hash(word)
    hashes.setdefault(my_hash,[]).append(word)

#now find word with most anagrams

max_len = 0
max_words = []
for word_list in hashes.itervalues():
    if len(word_list) > max_len:
        max_len = len(word_list)
        max_words = word_list
print max_words


**********************************************************

"yawl" stands for "Yet Another Word List". It is a text-list of some
240,000 English words, including all sorts of archaic and technical
phrases. Google for "yawl word list" if you want to track down a copy.
The output is

['apers', 'apres', 'asper', 'pares', 'parse', 'pears', 'prase',
'presa', 'rapes', 'reaps', 'spaer', 'spare', 'spear']

These 13 words are anagrams of each other. They contain some pretty
obscure words: asper is a 17th century Turkish coin and spaer is an
archaic Scottish-dialect word for prophet (you can see "speaker"
if you squint).
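
For anyone trying this on a current interpreter, here is a rough Python 3 rendering of the same program. It is a sketch, not the original: the word list is a tiny made-up sample standing in for yawl.txt, print is a function, and itervalues() becomes values().

```python
def letter_hash(word):
    """Count each letter a-z, e.g. 'spare' -> '1a1e1p1r1s'."""
    codes = 26 * [0]
    for c in word:
        codes[ord(c) - ord('a')] += 1
    return_string = ''
    for i in range(26):
        j = codes[i]
        if j > 0:
            return_string += str(j) + chr(i + ord('a'))
    return return_string

# Assumption: a handful of words instead of the lines of yawl.txt.
sample_words = ['spare', 'spear', 'parse', 'stop', 'post', 'cat']

# Group words under their letter-count string.
hashes = {}
for word in sample_words:
    hashes.setdefault(letter_hash(word), []).append(word)

# Find the largest group, same loop as in the post.
max_len = 0
max_words = []
for word_list in hashes.values():
    if len(word_list) > max_len:
        max_len = len(word_list)
        max_words = word_list
print(max_words)
```

With the sample above the largest family is the three anagrams of 'spare'.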

-John Coleman
 
Blackbird

John Coleman said:
John said:
me too!


also notice that putting a space after # stops the problem

How do you like Python so far? I like dictionary objects the best so
far. I'm coming to Python from VBScript, so I already knew the value
of such things, but in Python they are better supported.

Here is the program I was talking about, which *really* shows the
power of dictionaries:

****************************************************************************
*************

#Python program to discover word with most 1-word anagrams
[...]

Nice!

I think this simpler version of letter_hash should work too:

def letter_hash(word):
    w = [c for c in word]
    w.sort()
    return "".join(w)
 
John Coleman

Blackbird said:
John Coleman said:
John said:
John Coleman wrote:
John Coleman wrote:
Greetings,
I am currently trying to learn Python through the excellent
"Learning Python" book.

me too!

It isn't just #hash, but also things like #dict, #int, #len at the
start of a comment line which defeat IDLE's colorization algorithm.
Interestingly, things like #while or #for behave as expected so it
seems to be built-ins rather than keywords which are the problem. To
answer my own question - this is pretty clearly a (harmless) bug.

also notice that putting a space after # stops the problem

How do you like Python so far? I like dictionary objects the best so
far. I'm coming to Python from VBScript, so I already knew the value
of such things, but in Python they are better supported.

Here is the program I was talking about, which *really* shows the
power of dictionaries:

****************************************************************************
*************

#Python program to discover word with most 1-word anagrams
[...]

Nice!

I think this simpler version of letter_hash should work too:

def letter_hash(word):
    w = [c for c in word]
    w.sort()
    return "".join(w)

Nice suggestion. No need to actually count the multiplicity as long as
you don't lose the information. Your function is much more readable
than mine.
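
A quick way to convince yourself that sorting keeps the multiplicity information: compare a counting key with the sorted-letters key on a few pairs. This is a sketch assuming Python 3; count_key is a hypothetical stand-in for the original letter_hash, written with collections.Counter, and the word pairs are made up.

```python
from collections import Counter

def count_key(word):
    # Multiplicity-counting key, in the spirit of the original letter_hash.
    counts = Counter(word)
    return ''.join(str(counts[c]) + c for c in sorted(counts))

def sorted_key(word):
    # Blackbird's version: the sorted letters themselves.
    w = [c for c in word]
    w.sort()
    return "".join(w)

pairs = [('spare', 'spear'), ('aab', 'abb'), ('stop', 'pots'), ('cat', 'cart')]

# For every pair, both keys should give the same verdict on "anagram or not".
agreement = [
    (count_key(a) == count_key(b)) == (sorted_key(a) == sorted_key(b))
    for a, b in pairs
]
print(agreement)
```

Note that 'aab' and 'abb' use the same letters with different counts, and both keys keep them apart.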

-John Coleman
 
John Salerno

John said:
How do you like Python so far?

So far I'm pretty impressed with its simplicity. Seems like anything I
write in 5+ lines can be condensed to one line! :) Coming from C#, it
was just such a surprise to see how much easier it is to do things in
Python. I'm also interested in the dictionary object, simply because
it's something a little different than I'm used to, and very convenient
to have built-in.
 
Scott David Daniels

John said:
Blackbird said:
I think this simpler version of letter_hash should work too:

def letter_hash(word):
    w = [c for c in word]
    w.sort()
    return "".join(w)

And, for 2.4 or later:

def letter_hash(word):
    return "".join(sorted(word))

sorted takes an iterable, and strings are iterables.
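
A short illustration of that point, assuming Python 3 (sorted behaves the same way here as in 2.4): whatever iterable goes in, a new list comes out, which is why the join is still needed to get a string back.

```python
# A string iterates over its characters, so sorted() returns them as a list.
letters = sorted("spear")

# Any iterable works the same way, e.g. a generator expression.
upper = sorted(c.upper() for c in "spear")

# join() turns the list of characters back into a single string.
key = "".join(letters)
print(letters, upper, key)
```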

--Scott David Daniels
(e-mail address removed)
 
Paul Rubin

Scott David Daniels said:
And, for 2.4 or later:

def letter_hash(word):
    return "".join(sorted(word))

sorted takes an iterable, and strings are iterables.

I don't think the "hash" is really a hash in the normal sense--in
particular, it has to be collision-free. So I'd just call it
"sorted_word". Here's my version of the program:

================================================================
from sets import Set

def sorted_word(word):
    return ''.join(sorted(word))

d = {}
for line in file('/usr/share/dict/words'):
    word = line.strip().lower()
    d.setdefault(sorted_word(word), Set()).add(word)

print sorted(d.iteritems(), key=lambda (x,y): -len(y))[:1]

================================================================

For the version of /usr/dict/words with Fedora Core 4, I get the words
('reast', 'stare', 'arest', 'tares', 'resat', 'aster', 'treas',
'teras', 'tears', 'rates', 'serta', 'tarse', 'astre', 'strae',
'tresa'). I don't know what most of those words mean.

Note that I sorted the dictionary items in order to get the max
element. That is sort of bogus because it's an O(N log N) operation
while finding the maximum should only need O(N). But it leads to
a convenient spelling. It would be nice if "max" accepted a "key"
argument the way that the sorting functions do.
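
For readers on later versions: max and min did gain a "key" argument starting with Python 2.5. A minimal sketch of the O(N) spelling, assuming Python 3 and a small made-up dictionary in place of the one built from the word list. The lambda indexes the item tuple because Python 3 no longer allows tuple unpacking in lambda parameters.

```python
# Hypothetical stand-in for the dictionary d built from a word list.
d = {
    'aeprs': {'spare', 'spear', 'parse'},
    'opst': {'stop', 'post'},
    'act': {'cat'},
}

# One pass over the items; no sort needed.
best_key, best_words = max(d.items(), key=lambda item: len(item[1]))
print(best_key, sorted(best_words))
```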

I used sets.Set instead of a list in order to prevent ABC and abc
(identical after processing through .lower()) from counting as
anagrams of each other.
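
To make the difference concrete, here is the same grouping done with lists and with sets side by side. A sketch assuming Python 3, where the built-in set replaces sets.Set; the input words are made up to include case variants.

```python
raw_words = ['ABC', 'abc', 'cab', 'Cab']

as_lists, as_sets = {}, {}
for raw in raw_words:
    word = raw.strip().lower()
    key = ''.join(sorted(word))
    # A list keeps every occurrence, so case variants are counted twice.
    as_lists.setdefault(key, []).append(word)
    # A set keeps each lowercased word once.
    as_sets.setdefault(key, set()).add(word)

print(as_lists['abc'], sorted(as_sets['abc']))
```

With lists the family looks four words long; with sets it is the two distinct words.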
 
John Coleman

Scott said:
John said:
Blackbird said:
I think this simpler version of letter_hash should work too:

def letter_hash(word):
    w = [c for c in word]
    w.sort()
    return "".join(w)

And, for 2.4 or later:

def letter_hash(word):
    return "".join(sorted(word))

sorted takes an iterable, and strings are iterables.

--Scott David Daniels
(e-mail address removed)

Impressive. It is this ability to create single expressions (like
"".join(sorted(word))) which take the place of entire algorithms in
other languages which I find one of the most striking features of
Python.

-John Coleman
 
Scott David Daniels

Paul said:
I don't think the "hash" is really a hash in the normal sense--in
particular, it has to be collision-free. So I'd just call it
"sorted_word". Here's my version of the program:

================================================================
from sets import Set

"Cute" form for this:

try:
    set
except NameError:
    from sets import Set as set

Then you get native sets for 2.4, and sets.Set for 2.3

d = {}
for line in file('/usr/share/dict/words'):
    word = line.strip().lower()
    d.setdefault(sorted_word(word), Set()).add(word)

print sorted(d.iteritems(), key=lambda (x,y): -len(y))[:1] ....

Note that I sorted the dictionary items in order to get the max
element. That is sort of bogus because it's an O(N log N) operation
while finding the maximum should only need O(N). But it leads to
a convenient spelling. It would be nice if "max" accepted a "key"
argument the way that the sorting functions do.

Using a variant of DSU (Decorate-Sort-Undecorate) with max for S,
rather than sort:

print max((len(words), words) for words in d.itervalues())
or:
size, words = max((len(words), words) for words in d.itervalues())
print size, sorted(words)
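
The same decorate-then-max idea in Python 3 spelling, as a sketch: values() replaces itervalues(), print is a function, and d is a small made-up dictionary of word lists rather than the real one.

```python
# Hypothetical stand-in for the dictionary of anagram families.
d = {
    'aeprs': ['spare', 'spear', 'parse'],
    'opst': ['stop', 'post'],
    'act': ['cat'],
}

# Decorate each family with its size, take the max, then unpack.
size, words = max((len(words), words) for words in d.values())
print(size, sorted(words))
```

Tuples compare element by element, so the size decides the winner whenever the sizes differ.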


--Scott David Daniels
(e-mail address removed)
 
Blackbird

Scott said:
Paul said:
I don't think the "hash" is really a hash in the normal sense--in
particular, it has to be collision-free. So I'd just call it
"sorted_word". Here's my version of the program:

================================================================
from sets import Set

"Cute" form for this:

try:
    set
except NameError:
    from sets import Set as set

Then you get native sets for 2.4, and sets.Set for 2.3

d = {}
for line in file('/usr/share/dict/words'):
    word = line.strip().lower()
    d.setdefault(sorted_word(word), Set()).add(word)

print sorted(d.iteritems(), key=lambda (x,y): -len(y))[:1] ...

Note that I sorted the dictionary items in order to get the max
element. That is sort of bogus because it's an O(N log N) operation
while finding the maximum should only need O(N). But it leads to
a convenient spelling. It would be nice if "max" accepted a "key"
argument the way that the sorting functions do.

Using a variant of DSU (Decorate-Sort-Undecorate) with max for S,
rather than sort:

print max((len(words), words) for words in d.itervalues())
or:
size, words = max((len(words), words) for words in d.itervalues())
print size, sorted(words)


--Scott David Daniels
(e-mail address removed)

Your code is Pylegant. Jee, I just learned of Python two weeks ago, and my
copy of the cookbook arrived yesterday. And now I'm coining new words.
What is this language doing to me?

Blackbird
 
Paul Rubin

Scott David Daniels said:
Using a variant of DSU (Decorate-Sort-Undecorate) with max for S,
rather than sort:

print max((len(words), words) for words in d.itervalues())
or:
size, words = max((len(words), words) for words in d.itervalues())
print size, sorted(words)

That's nice and concise but it doesn't completely fix the running time
issue. max(words, key=len) should run in O(N) time where N is the
number of words, but
max((len(words), words) for words in d.itervalues())
might need time proportional to the total lengths of the words. I.e.
suppose words=['aaaaaaaa','aaaaaaab','aaaaaaac']. They are the same
length so the cmp builtin proceeds to the next component of the tuples
being compared. That means the strings get compared character by
character. That's maybe not too bad for dictionary words, but isn't
too great for arbitrary strings which might be millions of chars each.
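
The fall-through can be made visible by counting comparisons. This is only a sketch, assuming Python 3 (where max does accept key); Tracked is a hypothetical wrapper invented for the demonstration, not anything from the thread.

```python
class Tracked:
    """Hypothetical wrapper around a string that counts comparisons."""
    comparisons = 0

    def __init__(self, text):
        self.text = text

    def __len__(self):
        return len(self.text)

    def __eq__(self, other):
        Tracked.comparisons += 1
        return self.text == other.text

    def __lt__(self, other):
        Tracked.comparisons += 1
        return self.text < other.text

    def __gt__(self, other):
        Tracked.comparisons += 1
        return self.text > other.text

words = [Tracked(s) for s in ('aaaaaaaa', 'aaaaaaab', 'aaaaaaac')]

# Decorated tuples: equal lengths force a fall-through to the strings.
Tracked.comparisons = 0
_, longest = max((len(w), w) for w in words)
decorated_comparisons = Tracked.comparisons

# key=len: only the integer lengths are ever compared.
Tracked.comparisons = 0
longest_by_key = max(words, key=len)
keyed_comparisons = Tracked.comparisons

print(decorated_comparisons, keyed_comparisons)
```

The two calls also disagree on the winner here: the tuple version breaks the tie by string order, while the key version keeps the first item it saw.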

In general that DSU pattern doesn't even always work: you might want
the max of objects which don't directly support any comparison methods
that the cmp builtin understands. The "key" callable is needed to
extract something that can be compared, or else there could be a
user-supplied comparison function like .sort() and sorted support.
 
Kent Johnson

Paul said:
Scott David Daniels said:
Using a variant of DSU (Decorate-Sort-Undecorate) with max for S,
rather than sort:

print max((len(words), words) for words in d.itervalues())
or:
size, words = max((len(words), words) for words in d.itervalues())
print size, sorted(words)


That's nice and concise but it doesn't completely fix the running time
issue. max(words, key=len) should run in O(N) time where N is the
number of words, but
max((len(words), words) for words in d.itervalues())
might need time proportional to the total lengths of the words. I.e.
suppose words=['aaaaaaaa','aaaaaaab','aaaaaaac']. They are the same
length so the cmp builtin proceeds to the next component of the tuples
being compared. That means the strings get compared character by
character. That's maybe not too bad for dictionary words, but isn't
too great for arbitrary strings which might be millions of chars each.

use
max((len(words), i, words) for i, words in enumerate(d.itervalues()))

The index will always disambiguate and words will never be compared.
In general that DSU pattern doesn't even always work: you might want
the max of objects which don't directly support any comparison methods
that the cmp builtin understands.

Using the index as above prevents the objects from ever being compared.
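
A sketch of both halves of this exchange, assuming Python 3, where objects without ordering methods raise TypeError when compared (the Python 2 of this thread would instead fall back to an arbitrary default ordering). Opaque is a hypothetical class made up for the example.

```python
class Opaque:
    """Hypothetical object with a size but no ordering methods."""
    def __init__(self, name, size):
        self.name = name
        self.size = size

items = [Opaque('a', 3), Opaque('b', 3), Opaque('c', 1)]

# Plain decoration: the tie on size falls through to the objects themselves.
try:
    max((item.size, item) for item in items)
    plain_decoration_works = True
except TypeError:
    plain_decoration_works = False

# With the index in the middle, every tie is settled before the object
# is reached, so the objects are never compared.
size, index, winner = max((item.size, i, item) for i, item in enumerate(items))
print(plain_decoration_works, size, index, winner.name)
```

One side effect worth knowing: because a larger index wins a tie, this form returns the last of the tied items rather than the first.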

Kent
 
Paul Rubin

Kent Johnson said:
use
max((len(words), i, words) for i, words in enumerate(d.itervalues()))

The index will always disambiguate and words will never be compared.

OK, but that starts to get pretty obscure. Supporting key and cmp
args for max and min is a lot cleaner. Maybe there's some more
general approach to supporting them for all these functions (sorted,
.sort(), min, max, heapq, etc.) in an automatic, uniform way, e.g. by
wrapping the sequences instead of using keyword args, perhaps by
extending generic __cmp__ somehow. I'll think about this some more.
 
