searching a value of a dict (each value is a list)

S

Seongsu Lee

Hi,

I have a dictionary with million keys. Each value in the
dictionary has a list with up to thousand integers.
Follow is a simple example with 5 keys.

dict = {1: [1, 2, 3, 4, 5],
2: [10, 11, 12],
900000: [100, 101, 102, 103, 104, 105],
900001: [20, 21, 22],
999999: [15, 16, 17, 18, 19]}

I want to find out the key value which has a specific
integer in the list of its value. For example, if I search
104 in the list, 900000 must be returned.

How can I do this with Python? Ideas?
 
S

Seongsu Lee

Hi,

I have a dictionary with million keys. Each value in the
dictionary has a list with up to thousand integers.
Follow is a simple example with 5 keys.

dict = {1: [1, 2, 3, 4, 5],
2: [10, 11, 12],
900000: [100, 101, 102, 103, 104, 105],
900001: [20, 21, 22],
999999: [15, 16, 17, 18, 19]}

I want to find out the key value which has a specific
integer in the list of its value. For example, if I search
104 in the list, 900000 must be returned.

How can I do this with Python? Ideas?

Hi,

I just let the dict work in bidirectional fashion so that
I can find out what I want by both key and value. A mark
or prefix was needed to distinguish between keys originated
from keys and keys originated from values. (value * (-1))

from pprint import pprint
dict = {1: [1, 2, 3, 4, 5],
2: [10, 11, 12],
900000: [100, 101, 102, 103, 104, 105],
900001: [20, 21, 22],
999999: [15, 16, 17, 18, 19]}
for k, v in dict.items():
for x in v:
dict[x * -1] = k
pprint(dict)

{-105: 900000,
-104: 900000,
-103: 900000,
-102: 900000,
-101: 900000,
-100: 900000,
-22: 900001,
-21: 900001,
-20: 900001,
-19: 999999,
-18: 999999,
-17: 999999,
-16: 999999,
-15: 999999,
-12: 2,
-11: 2,
-10: 2,
-5: 1,
-4: 1,
-3: 1,
-2: 1,
-1: 1,
1: [1, 2, 3, 4, 5],
2: [10, 11, 12],
900000: [100, 101, 102, 103, 104, 105],
900001: [20, 21, 22],
999999: [15, 16, 17, 18, 19]}

What do you think of this? Ideas with less space complexity?
 
P

Pablo Ziliani

Seongsu Lee escribió:
Hi,

I have a dictionary with million keys. Each value in the
dictionary has a list with up to thousand integers.
(...)

I want to find out the key value which has a specific
integer in the list of its value.

Sorry if this is unhelpful, but have you considered moving your data
model a proper database?
I ask because unless someone knows of a specific module, I think we are
in DB's authentic realm. Is the fastest solution, probably not just for
this particular operation you are trying to do.

Regards,
Pablo
 
S

Seongsu Lee

Seongsu Lee escribió:




Sorry if this is unhelpful, but have you considered moving your data
model a proper database?
I ask because unless someone knows of a specific module, I think we are
in DB's authentic realm. Is the fastest solution, probably not just for
this particular operation you are trying to do.

Regards,
Pablo

Hi Pablo,

Thank you for your posting! I wanted to solve the problem within
a given environment, python, and I think it is solved by
a dictionary with bidirectional key. I have posted it and want to
know if other knows more elegant way to do it.
 
B

Bruno Desthuilliers

Seongsu Lee a écrit :
Hi,

I have a dictionary with million keys. Each value in the
dictionary has a list with up to thousand integers.
Follow is a simple example with 5 keys.

dict = {1: [1, 2, 3, 4, 5],
2: [10, 11, 12],
900000: [100, 101, 102, 103, 104, 105],
900001: [20, 21, 22],
999999: [15, 16, 17, 18, 19]}

I want to find out the key value which has a specific
integer in the list of its value. For example, if I search
104 in the list, 900000 must be returned.

How can I do this with Python? Ideas?

A common solution is to build a reversed index - that is, a dict mapping
values to lists of keys. Now if you have a really huge dataset, this may
become cumbersome.
 
B

bearophileHUGS

Seongsu Lee:
What do you think of this? Ideas with less space complexity?

You can put the second group of keys in a second dictionary, so you
don't have to mangle them, and it may be a bit faster.

Regarding the space complexity, I don't know how you can reduce it
with Python. Probably you can create a list of long, sort it and use
bisect on it to find the keys. Such longs can be a combination of
shifted integer OR the other integer that is the key of the original
dict. But I don't how much you can gain like this.

Another solution to reduce space is to use a tiny external module
written in C, Pyrex or D. Here follows some simple D code you can
modify a bit to make it work with Pyd (http://pyd.dsource.org/):


import std.traits: ReturnType;
import std.stdio: writefln;

struct TyInt_int {
int el, n;
int opCmp(TyInt_int other) {
if (el == other.el)
return 0;
return (el < other.el) ? -1 : 1;
}
}

int bisect(TyElem, TyData, TyFun)(TyElem[] a, TyData x, TyFun key) {
int lo = 0;
int hi = a.length;
while (lo < hi) {
int mid = (lo + hi) / 2;
if (x < key(a[mid]))
hi = mid;
else
lo = mid + 1;
}
return lo;
}

void main() {
int[][int] int_arr;
int_arr[1] = [1, 2, 3, 4, 5];
int_arr[1] = [10, 11, 12],
int_arr[900000] = [100, 101, 102, 103, 104, 105],
int_arr[900001] = [20, 21, 22],
int_arr[999999] = [15, 16, 17, 18, 19];

int tot_len = 0;
foreach(arr; int_arr)
tot_len += arr.length;

auto data_arr = new TyInt_int[](tot_len);
int i = 0;
foreach(n, arr; int_arr)
foreach(el; arr)
data_arr[i++] = TyInt_int(el, n);

data_arr.sort;
writefln(bisect(data_arr, 100, (TyInt_int ii){return ii.el;}));
}

Bye,
bearophile
 
J

Jeremy C B Nicoll

Seongsu Lee said:
Hi,

I have a dictionary with million keys. Each value in the
dictionary has a list with up to thousand integers.
Follow is a simple example with 5 keys.

dict = {1: [1, 2, 3, 4, 5],
2: [10, 11, 12],
900000: [100, 101, 102, 103, 104, 105],
900001: [20, 21, 22],
999999: [15, 16, 17, 18, 19]}

I want to find out the key value which has a specific
integer in the list of its value. For example, if I search
104 in the list, 900000 must be returned.

Are the integers in the lists unique? I mean, could 104 occur in more than
one list? If it did, would it matter which key was returned?
How can I do this with Python? Ideas?

When I see something like this my natural response is to think that the data
structure is inappropriate for the use it's being put to.

The code someone else posted to reverse the keys is all very well, but
surely hugely wasteful on cpu, maybe storage, and elapsed time.

Even if the dict in this form is needed for some other reason, couldn't the
code that created it also create a reverse index at the same time?
 
B

bearophileHUGS

Jeremy C B Nicoll:
The code someone else posted to reverse the keys is all very well, but
surely hugely wasteful on cpu, maybe storage, and elapsed time.

If you are talking about my D code then I know it, the creation of the
first dict has to be skipped, if possible... The code I have posted
must be adapted.

Bye,
bearophile
 
J

Jeremy C B Nicoll

Jeremy C B Nicoll:

If you are talking about my D code then I know it...

No I meant the code that used python to iterate over the dict and create
zillions of extra keys. I've deleted earlier posts in the thread and wasn't
sure who suggested that.
 
Z

Zepo Len

I have a dictionary with million keys. Each value in the
dictionary has a list with up to thousand integers.
Follow is a simple example with 5 keys.

dict = {1: [1, 2, 3, 4, 5],
2: [10, 11, 12],
900000: [100, 101, 102, 103, 104, 105],
900001: [20, 21, 22],
999999: [15, 16, 17, 18, 19]}

I want to find out the key value which has a specific
integer in the list of its value. For example, if I search
104 in the list, 900000 must be returned.

How can I do this with Python? Ideas?

def find_key(dict, num):
for k in dict:
if num in dict[k]:
return k
 
S

Seongsu Lee

Seongsu Lee said:
I have a dictionary with million keys. Each value in the
dictionary has a list with up to thousand integers.
Follow is a simple example with 5 keys.
dict = {1: [1, 2, 3, 4, 5],
2: [10, 11, 12],
900000: [100, 101, 102, 103, 104, 105],
900001: [20, 21, 22],
999999: [15, 16, 17, 18, 19]}
I want to find out the key value which has a specific
integer in the list of its value. For example, if I search
104 in the list, 900000 must be returned.

Are the integers in the lists unique? I mean, could 104 occur in more than
one list? If it did, would it matter which key was returned?

Yes, the intergers in the lists are unique.
When I see something like this my natural response is to think that the data
structure is inappropriate for the use it's being put to.

The code someone else posted to reverse the keys is all very well, but
surely hugely wasteful on cpu, maybe storage, and elapsed time.

Even if the dict in this form is needed for some other reason, couldn't the
code that created it also create a reverse index at the same time?

The reason I use the dict for my data is to speed up the search by
key.

The code could create also a reverse index (a reverse dict) at the
time. But as you said, it waste the space and I wanted to ask someone
who may know some way to reduce the waste of space while searching
fast.
 
J

Jeremy C B Nicoll

Seongsu Lee said:
The reason I use the dict for my data is to speed up the search by key.

Ok, I understand that once the overhead of creating the dict has been done,
getting access to values within it is quick. And taking the time to create
a set of reverse keys speeds up the reverse access.

Rather than scanning the whole dict and creating reverse keys for everything
in it first, there might be an advantage in making search logic test if
there is a reverse key for the negative integer to be searched for, and if
so use it, otherwise scan the dict creating and examining reverse keys until
the integer is found. That way if the integer you're looking for is early
in the dict you'd only create reverse keys for the integers from the start
of the dict until the required one.


We don't know what external(?) process created the data that you've stored
in the dict, nor what you use it for - or more to the point - how often. If
you're going to make just one search of that data then there's little point
in having a fast search after a slow dict creation. On the other hand if
you have many many searches to do the initial overhead might be acceptable.


(I don't know how slow creating the dict would be for a typical example of a
million keys each keying lists of 1-1000 integers.)
The code could create also a reverse index (a reverse dict) at the
time. But as you said, it waste the space and I wanted to ask someone
who may know some way to reduce the waste of space while searching
fast.

Is the dict used by anything else? If the data in it was held in some other
form would that cause your program (or other programs) lots of problems? If
the range of values of the integers being stored is suitable, you might
sensibly use several or many smaller dicts to store all the data (and thus
save time reverse-keying much less of it).
 
A

Adonis Vargas

Seongsu said:
Hi,

I have a dictionary with million keys. Each value in the
dictionary has a list with up to thousand integers.
Follow is a simple example with 5 keys.

dict = {1: [1, 2, 3, 4, 5],
2: [10, 11, 12],
900000: [100, 101, 102, 103, 104, 105],
900001: [20, 21, 22],
999999: [15, 16, 17, 18, 19]}

I want to find out the key value which has a specific
integer in the list of its value. For example, if I search
104 in the list, 900000 must be returned.

How can I do this with Python? Ideas?

You can try this:

items = {1: [1, 2, 3, 4, 5],
2: [10, 11, 12],
900000: [100, 101, 102, 103, 104, 105],
900001: [20, 21, 22],
999999: [15, 16, 17, 18, 19]}

def findItem(item, dictionary):
for key, value in dictionary.iteritems():
if item in value:
print key, value

findItem(104, items)

This will allow you to work with the existing dataset without needing to
duplicate it. It will print all occurrances.

Also, you should never use reserved words like 'dict' this creates
confusion and can cause Python to misbehave since you are rebinding the
name.

Hope this helps.

Adonis Vargas
 
B

bearophileHUGS

Adonis Vargas:
Also, you should never use reserved words like 'dict' this creates
confusion and can cause Python to misbehave since you are rebinding the
name.
Adonis Vargas

After hearing this suggestion for the 300th time, I think it may be
the moment to fix this problem in Python3, and make the Python
compiler issue a syntax error if someone tries to reassign such kind
of words, like dict, set, etc.

Bye,
bearophile
 
S

Seongsu Lee

Seongsu said:
I have a dictionary with million keys. Each value in the
dictionary has a list with up to thousand integers.
Follow is a simple example with 5 keys.
dict = {1: [1, 2, 3, 4, 5],
2: [10, 11, 12],
900000: [100, 101, 102, 103, 104, 105],
900001: [20, 21, 22],
999999: [15, 16, 17, 18, 19]}
I want to find out the key value which has a specific
integer in the list of its value. For example, if I search
104 in the list, 900000 must be returned.
How can I do this with Python? Ideas?

You can try this:

items = {1: [1, 2, 3, 4, 5],
2: [10, 11, 12],
900000: [100, 101, 102, 103, 104, 105],
900001: [20, 21, 22],
999999: [15, 16, 17, 18, 19]}

def findItem(item, dictionary):
for key, value in dictionary.iteritems():
if item in value:
print key, value

findItem(104, items)

This will allow you to work with the existing dataset without needing to
duplicate it. It will print all occurrances.

Hi,

Yes, it works. But I think it works in O(n * m), doesn't it?
(n is # of keys in the dictionary and m is # of items in the list.)
So, we need to create a reverse index. (a reverse dictionary) or
need something better at least, I think.
Also, you should never use reserved words like 'dict' this creates
confusion and can cause Python to misbehave since you are rebinding the
name.

Yep. :)
 
M

MonkeeSage

Seongsu said:
Hi,
I have a dictionary with million keys. Each value in the
dictionary has a list with up to thousand integers.
Follow is a simple example with 5 keys.
dict = {1: [1, 2, 3, 4, 5],
2: [10, 11, 12],
900000: [100, 101, 102, 103, 104, 105],
900001: [20, 21, 22],
999999: [15, 16, 17, 18, 19]}
I want to find out the key value which has a specific
integer in the list of its value. For example, if I search
104 in the list, 900000 must be returned.
How can I do this with Python? Ideas?
You can try this:
items = {1: [1, 2, 3, 4, 5],
2: [10, 11, 12],
900000: [100, 101, 102, 103, 104, 105],
900001: [20, 21, 22],
999999: [15, 16, 17, 18, 19]}
def findItem(item, dictionary):
for key, value in dictionary.iteritems():
if item in value:
print key, value
findItem(104, items)
This will allow you to work with the existing dataset without needing to
duplicate it. It will print all occurrances.

Hi,

Yes, it works. But I think it works in O(n * m), doesn't it?
(n is # of keys in the dictionary and m is # of items in the list.)
So, we need to create a reverse index. (a reverse dictionary) or
need something better at least, I think.
Also, you should never use reserved words like 'dict' this creates
confusion and can cause Python to misbehave since you are rebinding the
name.

Yep. :)
Hope this helps.
Adonis Vargas- µû¿Â ÅؽºÆ® ¼û±â±â -
- µû¿Â ÅؽºÆ® º¸±â -

If I'm not mistaken, building a reverse dictionary like that will be
O(n*m) because dict/list access is O(n) (ammortized). Somebody correct
me if I'm wrong. In that case, it really depends on how you will use
the dict to see whether you get any benefit from building the reversed
dict. If you want to do several lookups, then the initial overhead
(speed/memory) of building the reversed dict might be worth it so that
you can just run lookups at O(n). But if you only need it once, it is
a waste of time and space to create a reverse dict when your access
time is the same for the lookup as for building the reversed dict.

If you do need more than one lookup, it would also be a good
optimization strategy to build the reverse dict in parallel, as you
execute the first search; that way you can combine the time spent on
building the reverse dict and the lookup, to get a total of O(n*m)
rather than O(n^2*m). The first search is "free" since you need the
reverse dict anyway.

Regards,
Jordan
 
N

Neil Cerutti

If I'm not mistaken, building a reverse dictionary like that will be
O(n*m) because dict/list access is O(n) (ammortized). Somebody correct
me if I'm wrong. In that case, it really depends on how you will use
the dict to see whether you get any benefit from building the reversed
dict. If you want to do several lookups, then the initial overhead
(speed/memory) of building the reversed dict might be worth it so that
you can just run lookups at O(n).

It also depends on if the dictionary shall be mutated between
reverse lookups.
But if you only need it once, it is a waste of time and space
to create a reverse dict when your access time is the same for
the lookup as for building the reversed dict.

If you do need more than one lookup, it would also be a good
optimization strategy to build the reverse dict in parallel, as
you execute the first search; that way you can combine the time
spent on building the reverse dict and the lookup, to get a
total of O(n*m) rather than O(n^2*m). The first search is
"free" since you need the reverse dict anyway.

It wouldn't be merely an optimization if reverse lookups and
mutations were interleaved.
 
M

MonkeeSage

It also depends on if the dictionary shall be mutated between
reverse lookups.



It wouldn't be merely an optimization if reverse lookups and
mutations were interleaved.

Well true, but you enter a whole other level of complexity in that
case...something like Theta(log(n*(m-n))). I might have calculated
that incorrectly, but that just goes to show how complex a lookup
is(!) in such a case.

Regards,
Jordan
 
M

MonkeeSage

It also depends on if the dictionary shall be mutated between
reverse lookups.



It wouldn't be merely an optimization if reverse lookups and
mutations were interleaved.

Well true, but you enter a whole other level of complexity in that
case...something like Theta(log(n*(m-n))). I might have calculated
that incorrectly, but that just goes to show how complex a lookup
is(!) in such a case.

Regards,
Jordan
 
M

MonkeeSage

Well true, but you enter a whole other level of complexity in that
case...something like Theta(log(n*(m-n))). I might have calculated
that incorrectly, but that just goes to show how complex a lookup
is(!) in such a case.

Regards,
Jordan

Sorry for the double-post...google is being beligerant right now.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,768
Messages
2,569,574
Members
45,048
Latest member
verona

Latest Threads

Top