On 12 Jul 2006 08:56:56 -0700, (e-mail address removed) declaimed the following
in comp.lang.python:
This is the second thread this week in which someone inquires about
using hash() on lots of URLs... if a third one appears I'd suspect a
homework assignment...
key, so I do not want to have the keys stored in memory. Could I just
hash() the url strings first and use the resulting integer as the key?
Hash functions are not guaranteed to produce UNIQUE values. They
typically are used to determine an entry point in a presized array.
Since multiple data values may have the same hash value, one needs to
account for collisions. One determines collisions by comparing the data
value against the value found in the table (not the hash value -- you
still need access to the original data value) and "jumping" (based upon
some algorithm) if the values do not match... Or the initial table
contains pointers to (linked) lists of the real values.
I think what I'm after here is more like a tradition hash table. If I
do it this way am I going to get the memory savings I am after? Will
the hash function always generate unique keys? Also, would the same
technique work for a set?
No, No, Sets already use the Python method (typically the object ID)
to determine uniqueness, though for strings it may be straight
comparisons -- I've not used Python sets (heck, I only just upgraded to
2.4 a few weeks ago).
For "tradition" hash table:
http://www.sparknotes.com/cs/searching/hashtables/section1.html
http://en.wikipedia.org/wiki/Hashtable
Hash tables are not a data storage optimization -- they ARE a data
/search/ optimization...
If storage space is a concern, and not look-up time, I'd suggest a
tree structure, based upon the parts of the URL... Using the above
links, a tree of the form:
{ "http" : [
{ "
www.sparknotes.com" : [
{ "cs" : [
{ "searching" : [
{ "hashtables" : [
"section1.html" ]
} ]
} ]
} ,
{ "en.wikipedia.ord" : [
{ "wiki: [
"Hashtable" ]
} ]
} ] ,
"ftp" : [ ... ]
}
This structure stores each component of the URL only ONCE, so you
don't have all the repeated data for nodes within a site. Since you know
they are URLs, the "://" is implicit for the top level look-up, and a
"/" is implicit for each node. The end of the tree is reached when the
item found in that level's list is not a dictionary. NOTE: you may have
to handle the case of an "implicit" page using dictionary with a null
entry as a special flag:
http://some.site/ #default index.html
http://some.site/aPage.htm #explicit page
{ "http" : [
{ "some.site" : [ "",
"aPage.htm" ]
} ]
}
This is only reasonable for in-memory data where the prime concern
is memory size and not processing to insert/retrieve/delete data. If an
RDBM is going to be used, it would be better to just rely upon the DBMS
to be decently optimized for indexing operations using the full URL.
--
Wulfraed Dennis Lee Bieber KD6MOG
(e-mail address removed) (e-mail address removed)
HTTP://wlfraed.home.netcom.com/
(Bestiaria Support Staff: (e-mail address removed))
HTTP://www.bestiaria.com/