which datastructure for fast sorted insert?

N

notnorwegian

im writing a webcrawler.
after visiting a new site i want to store it in alphabetical order.

so obv i want fast insert. i want to delete duplicates too.

which datastructure is best for this?
 
N

notnorwegian

use a set to store them:
s=set()
s.add('a')
s.add('b')
s set(['a', 'b'])
s.add('a')
s set(['a', 'b'])
s.add('c')
s

set(['a', 'c', 'b'])



it does remove duplicates, but is it not ordered. to order it you can
use:

['a', 'b', 'c']

hth,
Rares

sets dont seem to be so good because there is no way to iterate them.

s.pop() remove and return an arbitrary element from s; raises
KeyError if empty

i dont want to remove i just want to get the string that is stored
there.
 
M

Marc 'BlackJack' Rintsch

sets dont seem to be so good because there is no way to iterate them.

Err:

In [82]: for x in set(['a', 'b', 'c']):
....: print x
....:
a
c
b

Ciao,
Marc 'BlackJack' Rintsch
 
G

Gabriel Genellina

im writing a webcrawler.
after visiting a new site i want to store it in alphabetical order.

so obv i want fast insert. i want to delete duplicates too.

which datastructure is best for this?

Use a list, and the bisect module to keep it sorted:

py> import bisect
py> def insert_nodupes(slist, item):
.... i = bisect.bisect_left(slist, item)
.... if i>=len(slist) or slist != item:
.... slist.insert(i, item)
....
py> items = []
py> insert_nodupes(items, 'link to some site')
py> insert_nodupes(items, 'another link')
py> insert_nodupes(items, 'third site')
py> insert_nodupes(items, 'another link')
py> items
['another link', 'link to some site', 'third site']
 
T

Terry Reedy

| >>> l=list(s)
| >>> l.sort()

This can be condensed to l = sorted(s)

| >>> l
| ['a', 'b', 'c']
 
S

Stefan Behnel

im writing a webcrawler.
after visiting a new site i want to store it in alphabetical order.

so obv i want fast insert. i want to delete duplicates too.

which datastructure is best for this?

Keep the data redundantly in two data structures. Use collections.deque() to
append and remove as in a queue, and set() to find duplicates.

Again, no ordering, but very fast insert/delete/dup-check.

Stefan
 
R

Rodrigo Lazo

Stefan Behnel said:
Keep the data redundantly in two data structures. Use collections.deque() to
append and remove as in a queue, and set() to find duplicates.

what about heapq for sorting?
 
I

I V

Use a list, and the bisect module to keep it sorted:

That's worth doing if you need the data to be sorted after each insert.
If the OP just needs the data to be sorted at the end, using a data
structure with fast inserts (like a set) and then sorting at the end will
likely be faster.
 
N

notnorwegian

sets dont seem to be so good because there is no way to iterate them.

Err:

In [82]: for x in set(['a', 'b', 'c']):
   ....:     print x
   ....:
a
c
b

Ciao,
        Marc 'BlackJack' Rintsch


i meant like set[pos], not iterate but access a specific position in
the set.

now i have to do:
s = toSet.pop()
toSet.add(s)

if i want to get the url of the next item in a set, how would i do
that?
i can do this with a list:

def scrapeSitesX(startAddress, toList, listPos):
return scrapeSites(toList[listPos], toList, listPos+1)
to use recursion with a list.
i can work around that but it doesnt get as elegant.
 
I

I V

i meant like set[pos], not iterate but access a specific position in the
set.

If you need to access arbitrary elements, use a list instead of a set
(but you'll get slower inserts). OTOH, if you just need to be able to get
the next item from the set, you can use an iterator:
now i have to do:
s = toSet.pop()
toSet.add(s)

i = iter(toSet)

s = i.next()
if i want to get the url of the next item in a set, how would i do that?
i can do this with a list:

def scrapeSitesX(startAddress, toList, listPos): return
scrapeSites(toList[listPos], toList, listPos+1) to use recursion with a
list.
i can work around that but it doesnt get as elegant.

Do you have to use recursion here? Why not:

for site in toList:
scrape(site)
 
M

miller.paul.w

im writing a webcrawler.
after visiting a new site i want to store it in alphabetical order.

so obv i want fast insert. i want to delete duplicates too.

which datastructure is best for this?

I think you ought to re-examine your requirements. Why do you need to
store in alphabetical order? (And, what does that even mean for a web
site?) Personally, I'd probably just use a simple dict for in-memory
storage, and use a database for external storage. In that case,
adding a site or page to the "seen" structure is a simple matter of
doing

urls[key] = url

Dicts are basically hash tables, so this insert is amortized O(1) and
you get dupe-detection for free, provided the key is unique. If you
want to display the contents in alphabetical order by "key," you can
do this by:

keyz = sorted (urls.keys())
for key in keyz:
print urls[key]

The problem with this is that you pay the sorting cost every time you
want to display the results. If that is a problem, you can maintain
an in-memory index (and rely on the database to do it for you in the
case of externally stored results).
 
N

notnorwegian

i meant like set[pos], not iterate but access a specific position in the
set.

If you need to access arbitrary elements, use a list instead of a set
(but you'll get slower inserts). OTOH, if you just need to be able to get
the next item from the set, you can use an iterator:
now i have to do:
s = toSet.pop()
toSet.add(s)

i = iter(toSet)

s = i.next()
if i want to get the url of the next item in a set, how would i do that?
i can do this with a list:
def scrapeSitesX(startAddress, toList, listPos): return
scrapeSites(toList[listPos], toList, listPos+1) to use recursion with a
list.
i can work around that but it doesnt get as elegant.

Do you have to use recursion here? Why not:

for site in toList:
scrape(site)


i have tried both.

anyway how do i concatenate sets? which i have to if use to funcions
like you suggest.

and also when iterating a set is there any other way to avoid
exceptions than have a counter and compare to len(set).
is there no method that cna check if iter != None
 
N

notnorwegian

i meant like set[pos], not iterate but access a specific position in the
set.
If you need to access arbitrary elements, use a list instead of a set
(but you'll get slower inserts). OTOH, if you just need to be able to get
the next item from the set, you can use an iterator:
i = iter(toSet)
s = i.next()
if i want to get the url of the next item in a set, how would i do that?
i can do this with a list:
def scrapeSitesX(startAddress, toList, listPos): return
scrapeSites(toList[listPos], toList, listPos+1) to use recursion with a
list.
i can work around that but it doesnt get as elegant.
Do you have to use recursion here? Why not:
for site in toList:
scrape(site)

i have tried both.

anyway how do i concatenate sets? which i have to if use to funcions
like you suggest.

and also when iterating a set is there any other way to avoid
exceptions than have a counter and compare to len(set).
is there no method that cna check if iter != None

nevermind got it how to concatenate.

but no builtin method for concatenating right?
 
N

notnorwegian

Traceback (most recent call last):
File "C:/Python25/Progs/WebCrawler/spider2.py", line 47, in <module>
x = scrapeSites("http://www.yahoo.com")
File "C:/Python25/Progs/WebCrawler/spider2.py", line 31, in
scrapeSites
site = iterator.next()
RuntimeError: Set changed size during iteration


def joinSets(set1, set2):
for i in set2:
set1.add(i)
return set1

def scrapeSites(startAddress):
site = startAddress
sites = set()
iterator = iter(sites)
pos = 0
while pos < 10:#len(sites):
newsites = scrapeSite(site)
joinSets(sites, newsites)
pos += 1
site = iterator.next()
return sites

def scrapeSite(address):
toSet = set()
site = urllib.urlopen(address)
for row in site:
obj = url.search(row)
if obj != None:
toSet.add(obj.group())
return toSet


wtf? im not multithreading or anything so how can the size change here?
 
S

sturlamolden

what about heapq for sorting?

Heap is the data structure to use for 'fast (nearly) sorted inserts'.
But heapq do not support (as far as I know) deletion of duplicates.
But a custom heap class coud do that of course.
 
G

Gabriel Genellina

def joinSets(set1, set2):
for i in set2:
set1.add(i)
return set1

Use the | operator, or |=
Traceback (most recent call last):
File "C:/Python25/Progs/WebCrawler/spider2.py", line 47, in <module>
x = scrapeSites("http://www.yahoo.com")
File "C:/Python25/Progs/WebCrawler/spider2.py", line 31, in
scrapeSites
site = iterator.next()
RuntimeError: Set changed size during iteration

You will need two sets: the one you're iterating over, and another collecting new urls. Once you finish iterating the first, continue with the new ones; stop when it's empty.
def scrapeSites(startAddress):
site = startAddress
sites = set()
iterator = iter(sites)
pos = 0
while pos < 10:#len(sites):
newsites = scrapeSite(site)
joinSets(sites, newsites)
pos += 1
site = iterator.next()
return sites

Try this (untested):

def scrapeSites(startAddress):
allsites = set() # all links found so far
pending = set([startAddress]) # pending sites to examine
while pending:
newsites = set() # new links
for site in pending:
newsites |= scrapeSite(site)
pending = newsites - allsites
allsites |= newsites
return allsites
wtf? im not multithreading or anything so how can the size change here?

You modified the set you were iterating over. Another example of the same problem:

d = {'a': 1, 'b': 2, 'c':3}
for key in d:
d[key+key]=0
 
I

I V

def scrapeSites(startAddress):
site = startAddress
sites = set()
iterator = iter(sites)
pos = 0
while pos < 10:#len(sites):
newsites = scrapeSite(site)
joinSets(sites, newsites)

You change the size of "sites" here, which means you can no longer use
the iterator.
wtf? im not multithreading or anything so how can the size change here?

How about:

def scrape_sites(start_address):
sites = set([start_address])
sites_scraped = set()
# This will run until it doesn't find any new sites, which may be
# a long time
while len(sites) > 0:
new_sites = set()
for site in sites:
new_sites.update(scrape_site(site))
sites_scraped.update(sites)
sites = new_sites.difference(sites_scraped)
return sites
 
P

Paul Rubin

im writing a webcrawler.
after visiting a new site i want to store it in alphabetical order.
so obv i want fast insert. i want to delete duplicates too.
which datastructure is best for this?

Nobody has mentioned balanced-tree data structures so you might want
to look them up.
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

No members online now.

Forum statistics

Threads
474,431
Messages
2,571,677
Members
48,796
Latest member
Greg L.

Latest Threads

Top