How do I get a reference to a KEY value of a dictionary?

B

Ben Finney

Is this a waste of time? Because I know in python I cannot be certain
that the argument strings that are read from files are even garbage
collected anyway. I could certainly do the job with duplicate
strings, but it would seem wasteful. I am a C programmer, and old
habits die hard.

Python dynamically binds names ("variables") to objects; many types,
including immutable strings, will not be duplicated if an identical one
is already stored.

The upshot is: store an identical string as many times as you like,
because PYthon will simply keep references to the one string object for
each duplicate.
 
A

Andy C

I am new to python, so please bear with me if I am making some
conceptual error.

Basically I want to create a graph with an adjacency list
representation, but I don't want any of the adjacency lists to have
duplicate strings when it is avoidable. I have a function createEdge
that adds an edge to the graph. The arguments will be distinct since
they are read from text files. But basically I want to use the
dictionary as a string pool, and if the argument string equals
something in the pool already, don't use the argument string, just a
use a reference to something in the string pool already.

Is this a waste of time? Because I know in python I cannot be certain
that the argument strings that are read from files are even garbage
collected anyway. I could certainly do the job with duplicate
strings, but it would seem wasteful. I am a C programmer, and old
habits die hard.

The following code works -- I tested it and entries in the adjacency
list that are equivalent are in fact identical. But it seems rather
stupid to have a dictionary of the form {'alice': 'alice', 'bob':
'bob'}, etc. i.e. the keys and values are the same. It would be nice
if I can get a reference to the key in the same time as a hash lookup.
I know I can do dictionary.keys().index( 'string' ) or something but
that would be pretty inefficient, I believe.

class GraphAccumulator:
def __init__( self, fileFunction ):
self.fileFunction = fileFunction
self.adjList = {} # adjacency list
self.nodes = {} # list of nodes for
preventing string dupes

def createEdge( self, node1, node2 ):

# another way of doing by using a dictionary

nodes = self.nodes
if nodes.has_key( node2 ):
node2 = nodes[ node2 ]
else:
nodes[ node2 ] = node2

adjList = self.adjList
if adjList.has_key( node1 ):
adjList[ node1 ].append( node2 );
else:
adjList[ node1 ] = [ node2 ]; # singleton edge list
 
T

Terry Reedy

Andy C said:
I am new to python, so please bear with me if I am making some
conceptual error.

Basically I want to create a graph with an adjacency list
representation, but I don't want any of the adjacency lists to have
duplicate strings when it is avoidable. I have a function createEdge
that adds an edge to the graph. The arguments will be distinct since
they are read from text files. But basically I want to use the
dictionary as a string pool, and if the argument string equals
something in the pool already, don't use the argument string, just a
use a reference to something in the string pool already.

Thinking in terms of 'references' can both help and hinder. (There
have been some long recent discussion.)

Are you familiar with this?

Help on built-in function intern:

intern(...)
intern(string) -> string

``Intern'' the given string. This enters the string in the
(global)
table of interned strings whose purpose is to speed up dictionary
lookups.
Return the string itself or the previously interned string object
with the
same value.

TJR
 
A

Andy C

I don't think this is true in many cases. For example:
False

Also, when the strings come from different files read in using readline(),
it appears that duplicates aren't eliminated. I tried my code without the
hash table duplicate elimination, and the adjacency lists has non-identical
strings. After I constructed the adjacency lists, I compared them with 'is'
and they were not identical. After I used my hash table hack solution, all
equivalent strings were identical.
 
A

Andy C

Thanks, I think that is exactly what I'm looking for. I guess if you do a =
intern(a), then if a is equivalent but not identical to something in the
global string table, it will assign a to reference the object in the global
string table, and thus the old "a" value can be garbage collected.

Well, for a C/C++ programmer, they ARE references, and I can't imagine
thinking otherwise. They are implemented as references/pointers in the
interpreter. Thinking in value semantics can simplify things, but there
will always be some cases where it doesn't work.

I think my example is a good one. If you have a graph in an adjacency list
representation, you don't want the nodes to be copies of each other. They
should just point to a global list of nodes. This can definitely matter...
the worst case is when you have a complete graph, where you would have
n(n-1)/2 (or n(n-1) for a complete directed graph) node objects where you
only need n. Instead of 100,000 nodes (not an unreasonable size), you could
have 10 billion (totally unreasonable).

But it is a good question whether the garbage collection will make the
difference worth it, in most cases. I could reassign my references, but I
don't know when the old objects get deleted. That is, I can do a =
intern(a), but the old value of a could hang around in memory for who knows
how long. If anyone could shed some light on this, I would appreciate it.
 
A

Andrew Dalke

Andy C:
Basically I want to create a graph with an adjacency list
representation, but I don't want any of the adjacency lists to have
duplicate strings when it is avoidable.

Took me a while but I finally figured out what you're asking.

Look in the docs for 'intern', which is a builtin. It's rarely used
because most people don't worry about this sort of duplication.
Is this a waste of time? Because I know in python I cannot be certain
that the argument strings that are read from files are even garbage
collected anyway. I could certainly do the job with duplicate
strings, but it would seem wasteful. I am a C programmer, and old
habits die hard.

You can never be certain that anything is every garbage collected.
Python-the-language makes no guarantees on that. Python-in-C
does make a stronger statement, but it's an implementation issue. As
a C programmer you'll be happy that it's a reference counted scheme
and so long as there are no cycles (which can't happen with a string),
the string will be deallocated when your code lets go of it.

This is true even of interned strings, at least in newer Pythons. If
no one references an interned string, it too goes away. (Older
Pythons make the strings immortal.)

Here's how you might change your code.

class GraphAccumulator:
def __init__( self, fileFunction ):
self.fileFunction = fileFunction
self.adjList = {} # adjacency list

def createEdge( self, node1, node2 ):

# another way of doing by using a dictionary
node1 = intern(node1)
node2 = intern(node2)
adjList = self.adjList
if adjList.has_key( node1 ):
adjList[ node1 ].append( node2 );
else:
adjList[ node1 ] = [ node2 ]; # singleton edge list

But personally I would put the intern outside of the graph
code - either in the reader or the code between the reader
and this one. Otherwise, consider:

# fully connect the graph
nodes = graph.adjList.keys()
for n1 in nodes:
for n2 in nodes:
if n1 != n2:
graph.createEdge(n1, n1)

This does extra interning even though it isn't needed.

But then again, normally I wouldn't worry about the interning
at all. ... Perhaps I should... Hmmm...

Andrew
(e-mail address removed)
 
A

Aahz

Basically I want to create a graph with an adjacency list
representation, but I don't want any of the adjacency lists to have
duplicate strings when it is avoidable. I have a function createEdge
that adds an edge to the graph. The arguments will be distinct since
they are read from text files. But basically I want to use the
dictionary as a string pool, and if the argument string equals
something in the pool already, don't use the argument string, just a
use a reference to something in the string pool already.

That makes some sense, but why use the string object for both key and
value? Just do

d[key] = True

You could optimize your coding style (if not your speed) in Python 2.3
by using sets. I'd recommend against using intern(), because in Python
2.2 and earlier, interned strings *never* get garbage collected. Even
in Python 2.3, you may end up with worse memory behavior.
 
A

Andy C

I don't see how I can do this and let me eliminate duplicates. I need to
assign the old duplicate string to the unique string that already exists.
Hence the question, how do I get a reference to the KEY value? I know I can
use keys() and do a linear search, but that is much more inefficient. I
would like to get a reference to the key value in the same time that it
takes to do a hash lookup (constant time).

Basically I would want to rewrite this section of the code I posted:

nodes = self.nodes
if nodes.has_key( node2 ):
node2 = nodes[ node2 ]
else:
nodes[ node2 ] = node2

This dictionary seems stupid, I agree. The keys and values are the same.
But in the first part of the if, I want to reassign node2 to an equivalent
string that already exists. How can I do that?

The intern solution seems reasonable, and it appears that it was designed
specifically for this problem. I wasn't aware of the implementation
problems. But I'm still curious about different ways to do it.
 
B

Bengt Richter

Ahhhh.... Right. Hmmmmm.... <thinks hard> You're correct, you do
need to set the value to the key. I think using a dict is better than
using intern(). Here's an optimization:

node2 = node_cache.setdefault(node2, node2)

I would sure be more comfortable with a change in nomenclature, e.g.,

node2name = node_name_cache.setdefault(node2name, node2name)

so that it is clear that you are dealing with name strings, not typical graph
vertex/node objects.

BTW, reading from a file, you can probably not in general avoid forward references,
(i.e., names of adjacent nodes yet to be defined), but they should be resolvable at the end,
so you could make nodes that refer to adjacent nodes directly instead of by name.
Nodes would carry a ref to their own (string) name, and go by that same name string in a graph dict,
so there should be no duplication of strings, just references to them. E.g.,
(I started to just sketch something, and ... well, it's hard to stop programming Python
before something is running at least preliminarily ;-) I subclassed dict to keep nodes in by name.

====< graph.py >=========================================================
def boxit(s):
lines = s.splitlines()
wid = max(map(len,lines))
ret = ['+-%s-+' % ('-'*wid)]
fmt = '| %%-%ds |'%wid
for line in lines: ret.append(fmt%line)
ret.append(ret[0])
return '\n'.join(ret)

def errep(Ex, *rest):
if __debug__: print boxit('%s: %s'%(Ex.__name__, rest and rest[0])) # and blunder on
else: raise Ex(*rest)

class Node(object):
__slots__ = 'name', 'adj_list'
def __init__(self, name):
self.name = name
self.adj_list = None # signify undefined, vs [] for no adjacent nodes

class Graph(dict):
def load(self, infile, print_lines=False):
for line in infile:
if print_lines: print line.rstrip()
sline = line.strip()
if not sline or sline.startswith('#'): continue
node_def_names = line.split() # assume line like 'nodename adj_node1_name adj_node2_name ...'
node_name = node_def_names[0]
if node_name in self:
# don't allow re-definition
errep(ValueError, 'Node %r being re-defined:\n new adj: %r\n old adj: %r' %(
node_name, node_def_names[1:], self[node_name].adj_list))
else:
self[node_name] = Node(node_name)
# set adj list, making making this a defined node
self[node_name].adj_list = node_def_names[1:]
# change adj lists to direct references to adjacent nodes
for node in self.values():
adj_list = node.adj_list
assert adj_list is not None
for i in xrange(len(adj_list)):
adj_name = adj_list
try:
adj_list = self[adj_name]
except KeyError:
errep( ValueError, 'node %r refers to undefined adj node %r' % (node.name, adj_name))
# continuing if debugging
adj_list = Node(adj_name) # adj_list None (vs []) signifies undefined
def show(self):
node_names = self.keys()
node_names.sort()
visited = {}
ret = []
def prtree(node, indent=0):
if node.name in visited:
if indent: ret.append('%s%s (see previous)'%(indent*' ',node.name))
else:
visited[node.name]=True
if node.adj_list is None:
ret.append('%s%s (undefined) '%(indent*' ',node.name))
return
ret.append('%s%s'%(indent*' ',node.name))
for adj in node.adj_list:
prtree(adj, indent+1)
for node_name in node_names:
prtree(self[node_name])
ret.append('')
return '\n'.join(ret)

def test(filepath=''):
graph = Graph() # name-string -> Node instance
if not filepath:
import StringIO
f_graph = StringIO.StringIO("""
A B C D
B E F
C G H
G A
D B
E
F
H
D E F
X E Y
""")
else:
f_graph = file(filepath)

print 'Loading graph from %s ...\n--' % (filepath or '(test)')
graph.load(f_graph, True) # print the lines read
print '--'
f_graph.close()

print graph.show()

if __name__ == '__main__':
import sys
args = sys.argv[1:]
test(args and args[0] or '')
=========================================================================

Result:

[16:33] C:\pywk\clp\pp>graph.py
Loading graph from (test) ...
--

A B C D
B E F
C G H
G A
D B
E
F
H
D E F
+----------------------------------------+
| ValueError: Node 'D' being re-defined: |
| new adj: ['E', 'F'] |
| old adj: ['B'] |
+----------------------------------------+
X E Y

+-------------------------------------------------------+
| ValueError: node 'X' refers to undefined adj node 'Y' |
+-------------------------------------------------------+
--
A
B
E
F
C
G
A (see previous)
H
D
B (see previous)
X
E (see previous)
Y (undefined)

(Not tested beyond what you see. NTBWYS? ;-)

Regards,
Bengt Richter
 
B

Bengt Richter

Well, that seems very sensical, so how come it hasn't made it into the
language? And what's wrong with intern? (Though intern only works on
strings, not for immutable objects in general. I believe someone was asking
a pretty much identical question here, and someone replied with the
'memoize' pattern).

Can this be done without C, now that you can subclass the built-in
dictionary?
For a subclass of dict that may be of interest, see my post in this thread
timestamped less than 2 minutes before this post of yours ;-)

Regards,
Bengt Richter
 
A

Andy C

Thanks, looks interesting... but it actually doesn't really address the
question I was asking. It doesn't matter since I think the intern solution
is fine for me. But if I'm not mistaken, your solution still stores the
node names in the values of the dictionary (via the Node objects). So in
effect you do still have a dictionary of the form { 'node1name':
('node1name', other node1 data), 'node2name': ('node2name', other node2
data)). It doesn't seem as silly when you have a real node object, though.
In my application the nodes are really just strings; I don't need anything
else.

Andy
 
B

Bengt Richter

Thanks, looks interesting... but it actually doesn't really address the
question I was asking. It doesn't matter since I think the intern solution
I thought it did ;-)
is fine for me. But if I'm not mistaken, your solution still stores the
node names in the values of the dictionary (via the Node objects). So in
No, just references, not duplicate strings.
effect you do still have a dictionary of the form { 'node1name':
('node1name', other node1 data), 'node2name': ('node2name', other node2
data)). It doesn't seem as silly when you have a real node object, though.
In my application the nodes are really just strings; I don't need anything
else.
Well, don't you need the adjacency lists you were talking about? I assumed they were
all lists of out-edges associated with graph vertices. So I defined a Node class
with the two things: a name and an adj_list. You can eliminate the name reference
and just let the graph dict keys be the only reference, but then you will not
so conveniently have the name available to algorithms that process the nodes.
E.g., for some node myNode you might have to search the dict something like

for key,value in graph.items(): # untested!
if value is myNode: myName=key; break
else:
myName = "?? can't find name of %r ??'%myNode

One caveat though, if your graph has cycles, changing the adjacency lists to direct node
references instead of names will create reference cycles that may have to be broken for
the gc. I have to look into that. You can comment that fixup loop out if you want to
keep adjacency lists as name string references instead of node references.

BTW, have you looked at the sets module of 2.3? That would let you have a set of strings
with no associated values, and no duplicates (but how do you represent adjacency lists then?)

Anyway, unless I goofed, the { 'node1name':('node1name', other node 1 data) } analogy is
not using two separate strings. It's more like the result of (untested!):

s = 'node1name'
graph.update({ s:(s, other node 1 data)}) # pseudo-tuple-element at graph[1] ;-)
del s

To check, I added the assert below, where it scans the node names in the
graph.show() method, and it didn't complain:

def show(self):
node_names = self.keys()
8< snip >8
for node_name in node_names:
assert node_name is self[node_name].name
prtree(self[node_name])
ret.append('')
return '\n'.join(ret)

Regards,
Bengt Richter
 
A

Andy C

is fine for me. But if I'm not mistaken, your solution still stores the
No, just references, not duplicate strings.

OK, you're right -- I assumed that the key and value were duplicated, but
that's not the case. I don't know why I thought that, maybe since the key
must be immutable and the value is mutable. So I guess having a dictionary
of the form { 'A': 'A', 'B': 'B' } is not as stupid as it first seems, since
only a reference is stored. But still I wonder why the language doesn't
have a facility for getting a reference to the key value in constant time.
Apparently someone did it by modifying the C source for the dictionary to
add a ref_key() accessor. It seems like it would be useful quite often.
One caveat though, if your graph has cycles, changing the adjacency lists to direct node
references instead of names will create reference cycles that may have to be broken for
the gc. I have to look into that. You can comment that fixup loop out if you want to
keep adjacency lists as name string references instead of node references.

Good point... I thought though that the new python (2.3?) will GC data with
cycles.
BTW, have you looked at the sets module of 2.3? That would let you have a set of strings
with no associated values, and no duplicates (but how do you represent
adjacency lists then?)

Actually, I did run across the sets type recently. It is actually built *on
top of* dict, so it's no better. Ex:
a = Set([1,2,3])
a Set([1, 2, 3])
a._data
{1: True, 2: True, 3: True}

It just stores a dummy value as the value of the dict.

BTW I am using a regular dictionary for the adjacency list, with the list of
destination edges. I was using another dictionary for the 'cache' of names.

Thanks for the help.

Andy
 
A

Aahz

OK, you're right -- I assumed that the key and value were duplicated,
but that's not the case. I don't know why I thought that, maybe since
the key must be immutable and the value is mutable. So I guess having
a dictionary of the form { 'A': 'A', 'B': 'B' } is not as stupid as it
first seems, since only a reference is stored. But still I wonder why
the language doesn't have a facility for getting a reference to the key
value in constant time. Apparently someone did it by modifying the C
source for the dictionary to add a ref_key() accessor. It seems like
it would be useful quite often.

Well, you're the first person I recall caring about this specific issue.
Of course, general caching issues come up quite frequently. All the
people I've seen wanting to use intern() come at it from a performance
rather than memory perspective, for which a dict would be no use. ;-)
 
A

Andy C

These days I have an extension which merely subclasses dict to add on
a key reference method and an increment method which does something
like:

You did this in C or python? If in python, how did you do it? (without
keys()?) If in C, why don't you submit it to the source tree?

thanks,
Andy
 
J

John Machin

Andy C said:
You did this in C or python? If in python, how did you do it? (without
keys()?) If in C, why don't you submit it to the source tree?

C. There were two people interested in having it in the core, me and Pat Malone.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,755
Messages
2,569,537
Members
45,023
Latest member
websitedesig25

Latest Threads

Top