which datastructure for fast sorted insert?

notnorwegian · May 25, 2008

im writing a webcrawler.
after visiting a new site i want to store it in alphabetical order.

so obv i want fast insert. i want to delete duplicates too.

which datastructure is best for this?

Rares Vernica · May 25, 2008

use a set to store them:

s=set()
s.add('a')
s.add('b')
s set(['a', 'b'])
s.add('a')
s set(['a', 'b'])
s.add('c')
s set(['a', 'c', 'b'])

Click to expand...

Click to expand...

it does remove duplicates, but is it not ordered. to order it you can
use:
['a', 'b', 'c']

hth,
Rares

notnorwegian · May 25, 2008

use a set to store them:

s=set()
s.add('a')
s.add('b')
s set(['a', 'b'])
s.add('a')
s set(['a', 'b'])
s.add('c')
s

Click to expand...

Click to expand...

set(['a', 'c', 'b'])

it does remove duplicates, but is it not ordered. to order it you can
use:

['a', 'b', 'c']

hth,
Rares

sets dont seem to be so good because there is no way to iterate them.

s.pop() remove and return an arbitrary element from s; raises
KeyError if empty

i dont want to remove i just want to get the string that is stored
there.

Marc 'BlackJack' Rintsch · May 25, 2008

sets dont seem to be so good because there is no way to iterate them.

Err:

In [82]: for x in set(['a', 'b', 'c']):
....: print x
....:
a
c
b

Ciao,
Marc 'BlackJack' Rintsch

Gabriel Genellina · May 25, 2008

En Sun said:
im writing a webcrawler.
after visiting a new site i want to store it in alphabetical order.

so obv i want fast insert. i want to delete duplicates too.

which datastructure is best for this?

Use a list, and the bisect module to keep it sorted:

py> import bisect
py> def insert_nodupes(slist, item):
.... i = bisect.bisect_left(slist, item)
.... if i>=len(slist) or slist != item:
.... slist.insert(i, item)
....
py> items = []
py> insert_nodupes(items, 'link to some site')
py> insert_nodupes(items, 'another link')
py> insert_nodupes(items, 'third site')
py> insert_nodupes(items, 'another link')
py> items
['another link', 'link to some site', 'third site']

Terry Reedy · May 25, 2008

| >>> l=list(s)
| >>> l.sort()

This can be condensed to l = sorted(s)

| >>> l
| ['a', 'b', 'c']

Stefan Behnel · May 25, 2008

im writing a webcrawler.
after visiting a new site i want to store it in alphabetical order.

so obv i want fast insert. i want to delete duplicates too.

which datastructure is best for this?

Keep the data redundantly in two data structures. Use collections.deque() to
append and remove as in a queue, and set() to find duplicates.

Again, no ordering, but very fast insert/delete/dup-check.

Stefan

Rodrigo Lazo · May 25, 2008

Stefan Behnel said:
Keep the data redundantly in two data structures. Use collections.deque() to
append and remove as in a queue, and set() to find duplicates.

what about heapq for sorting?

I V · May 25, 2008

Use a list, and the bisect module to keep it sorted:

That's worth doing if you need the data to be sorted after each insert.
If the OP just needs the data to be sorted at the end, using a data
structure with fast inserts (like a set) and then sorting at the end will
likely be faster.

notnorwegian · May 25, 2008

sets dont seem to be so good because there is no way to iterate them.

Click to expand...

Err:

In [82]: for x in set(['a', 'b', 'c']):
....: print x
....:
a
c
b

Ciao,
Marc 'BlackJack' Rintsch

i meant like set[pos], not iterate but access a specific position in
the set.

now i have to do:
s = toSet.pop()
toSet.add(s)

if i want to get the url of the next item in a set, how would i do
that?
i can do this with a list:

def scrapeSitesX(startAddress, toList, listPos):
return scrapeSites(toList[listPos], toList, listPos+1)
to use recursion with a list.
i can work around that but it doesnt get as elegant.

I V · May 25, 2008

i meant like set[pos], not iterate but access a specific position in the
set.

If you need to access arbitrary elements, use a list instead of a set
(but you'll get slower inserts). OTOH, if you just need to be able to get
the next item from the set, you can use an iterator:

now i have to do:
s = toSet.pop()
toSet.add(s)

i = iter(toSet)

s = i.next()

if i want to get the url of the next item in a set, how would i do that?
i can do this with a list:

def scrapeSitesX(startAddress, toList, listPos): return
scrapeSites(toList[listPos], toList, listPos+1) to use recursion with a
list.
i can work around that but it doesnt get as elegant.

Do you have to use recursion here? Why not:

for site in toList:
scrape(site)

miller.paul.w · May 25, 2008

im writing a webcrawler.
after visiting a new site i want to store it in alphabetical order.

so obv i want fast insert. i want to delete duplicates too.

which datastructure is best for this?

I think you ought to re-examine your requirements. Why do you need to
store in alphabetical order? (And, what does that even mean for a web
site?) Personally, I'd probably just use a simple dict for in-memory
storage, and use a database for external storage. In that case,
adding a site or page to the "seen" structure is a simple matter of
doing

urls[key] = url

Dicts are basically hash tables, so this insert is amortized O(1) and
you get dupe-detection for free, provided the key is unique. If you
want to display the contents in alphabetical order by "key," you can
do this by:

keyz = sorted (urls.keys())
for key in keyz:
print urls[key]

The problem with this is that you pay the sorting cost every time you
want to display the results. If that is a problem, you can maintain
an in-memory index (and rely on the database to do it for you in the
case of externally stored results).

notnorwegian · May 25, 2008

i meant like set[pos], not iterate but access a specific position in the
set.

Click to expand...

If you need to access arbitrary elements, use a list instead of a set
(but you'll get slower inserts). OTOH, if you just need to be able to get
the next item from the set, you can use an iterator:

now i have to do:
s = toSet.pop()
toSet.add(s)

Click to expand...

i = iter(toSet)

s = i.next()

if i want to get the url of the next item in a set, how would i do that?
i can do this with a list:

Click to expand...

def scrapeSitesX(startAddress, toList, listPos): return
scrapeSites(toList[listPos], toList, listPos+1) to use recursion with a
list.
i can work around that but it doesnt get as elegant.

Click to expand...

Do you have to use recursion here? Why not:

for site in toList:
scrape(site)

i have tried both.

anyway how do i concatenate sets? which i have to if use to funcions
like you suggest.

and also when iterating a set is there any other way to avoid
exceptions than have a counter and compare to len(set).
is there no method that cna check if iter != None

notnorwegian · May 25, 2008

i meant like set[pos], not iterate but access a specific position in the
set.

Click to expand...

Click to expand...

If you need to access arbitrary elements, use a list instead of a set
(but you'll get slower inserts). OTOH, if you just need to be able to get
the next item from the set, you can use an iterator:

Click to expand...

i = iter(toSet)

Click to expand...

s = i.next()

if i want to get the url of the next item in a set, how would i do that?
i can do this with a list:
def scrapeSitesX(startAddress, toList, listPos): return
scrapeSites(toList[listPos], toList, listPos+1) to use recursion with a
list.
i can work around that but it doesnt get as elegant.

Click to expand...

Click to expand...

Do you have to use recursion here? Why not:

Click to expand...

for site in toList:
scrape(site)

Click to expand...

i have tried both.

anyway how do i concatenate sets? which i have to if use to funcions
like you suggest.

and also when iterating a set is there any other way to avoid
exceptions than have a counter and compare to len(set).
is there no method that cna check if iter != None

nevermind got it how to concatenate.

but no builtin method for concatenating right?

notnorwegian · May 25, 2008

Traceback (most recent call last):
File "C:/Python25/Progs/WebCrawler/spider2.py", line 47, in <module>
x = scrapeSites("http://www.yahoo.com")
File "C:/Python25/Progs/WebCrawler/spider2.py", line 31, in
scrapeSites
site = iterator.next()
RuntimeError: Set changed size during iteration

def joinSets(set1, set2):
for i in set2:
set1.add(i)
return set1

def scrapeSites(startAddress):
site = startAddress
sites = set()
iterator = iter(sites)
pos = 0
while pos < 10:#len(sites):
newsites = scrapeSite(site)
joinSets(sites, newsites)
pos += 1
site = iterator.next()
return sites

def scrapeSite(address):
toSet = set()
site = urllib.urlopen(address)
for row in site:
obj = url.search(row)
if obj != None:
toSet.add(obj.group())
return toSet

wtf? im not multithreading or anything so how can the size change here?

sturlamolden · May 25, 2008

what about heapq for sorting?

Heap is the data structure to use for 'fast (nearly) sorted inserts'.
But heapq do not support (as far as I know) deletion of duplicates.
But a custom heap class coud do that of course.

Gabriel Genellina · May 25, 2008

En Sun said:
def joinSets(set1, set2):
for i in set2:
set1.add(i)
return set1

Use the | operator, or |=

Traceback (most recent call last):
File "C:/Python25/Progs/WebCrawler/spider2.py", line 47, in <module>
x = scrapeSites("http://www.yahoo.com")
File "C:/Python25/Progs/WebCrawler/spider2.py", line 31, in
scrapeSites
site = iterator.next()
RuntimeError: Set changed size during iteration

You will need two sets: the one you're iterating over, and another collecting new urls. Once you finish iterating the first, continue with the new ones; stop when it's empty.

def scrapeSites(startAddress):
site = startAddress
sites = set()
iterator = iter(sites)
pos = 0
while pos < 10:#len(sites):
newsites = scrapeSite(site)
joinSets(sites, newsites)
pos += 1
site = iterator.next()
return sites

Try this (untested):

def scrapeSites(startAddress):
allsites = set() # all links found so far
pending = set([startAddress]) # pending sites to examine
while pending:
newsites = set() # new links
for site in pending:
newsites |= scrapeSite(site)
pending = newsites - allsites
allsites |= newsites
return allsites

wtf? im not multithreading or anything so how can the size change here?

You modified the set you were iterating over. Another example of the same problem:

d = {'a': 1, 'b': 2, 'c':3}
for key in d:
d[key+key]=0

I V · May 25, 2008

def scrapeSites(startAddress):
site = startAddress
sites = set()
iterator = iter(sites)
pos = 0
while pos < 10:#len(sites):
newsites = scrapeSite(site)
joinSets(sites, newsites)

You change the size of "sites" here, which means you can no longer use
the iterator.

wtf? im not multithreading or anything so how can the size change here?

How about:

def scrape_sites(start_address):
sites = set([start_address])
sites_scraped = set()
# This will run until it doesn't find any new sites, which may be
# a long time
while len(sites) > 0:
new_sites = set()
for site in sites:
new_sites.update(scrape_site(site))
sites_scraped.update(sites)
sites = new_sites.difference(sites_scraped)
return sites

Paul Rubin · May 26, 2008

im writing a webcrawler.
after visiting a new site i want to store it in alphabetical order.
so obv i want fast insert. i want to delete duplicates too.
which datastructure is best for this?

Nobody has mentioned balanced-tree data structures so you might want
to look them up.

Sorted indexes in Python	0	Oct 17, 2008
Help to decide which sw to use	2	Nov 18, 2023
How do I use Find and Loop in VBA for Excel to identify, delete, and insert blank row for values greater than 6?	0	Feb 28, 2022
newbie: datastructure `dictionary' question	4	Sep 9, 2006
Best way to insert sorted in a list	10	Jun 17, 2011
Unable to read input from keyboard, in below C code, for a BST.	0	Jul 20, 2025
Can't decide which language to get back into programming with	1	Mar 28, 2023
Im having some issues with my html website	1	Jun 3, 2024

which datastructure for fast sorted insert?

notnorwegian

Rares Vernica

notnorwegian

Marc 'BlackJack' Rintsch

Gabriel Genellina

Terry Reedy

Stefan Behnel

Rodrigo Lazo

I V

notnorwegian

I V

miller.paul.w

notnorwegian

notnorwegian

notnorwegian

sturlamolden

Gabriel Genellina

I V

Paul Rubin

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads