[SQL] Pick random rows from SELECT?

Discussion in 'Python' started by Gilles Ganault, Sep 21, 2009.

  1. Hello

    I have a working Python script that SELECTs rows from a database to
    fetch a company's name from a web-based database.

    Since this list is quite big and the site is the bottleneck, I'd like
    to run multiple instances of this script, and figured a solution would
    be to pick rows at random from the dataset, check in my local database
    if this item has already been taken care of, and if not, download
    details from the remote web site.

    If someone's done this before, should I perform the randomization in
    the SQL query (SQLite using the APSW wrapper
    http://code.google.com/p/apsw/), or in Python?

    Thank you.

    Here's some simplified code:

    # cursor, headers and re_name are set up earlier in the full script
    sql = 'SELECT id,label FROM companies WHERE activity=1'
    rows = list(cursor.execute(sql))
    for row in rows:
        id = row[0]
        label = row[1]

        print strftime("%H:%M")
        url = "http://www.acme.com/details.php?id=%s" % id
        req = urllib2.Request(url, None, headers)
        response = urllib2.urlopen(req).read()

        name = re_name.search(response)
        if name:
            name = name.group(1)
            sql = 'UPDATE companies SET name=? WHERE id=?'
            cursor.execute(sql, (name, id))
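
    For what it's worth, the two variants I have in mind would look roughly
    like this (untested sketch against the same companies table):

    # Option 1: let SQLite shuffle the result set
    sql = 'SELECT id,label FROM companies WHERE activity=1 ORDER BY RANDOM()'
    rows = list(cursor.execute(sql))

    # Option 2: fetch in the usual order and shuffle in Python
    import random
    rows = list(cursor.execute('SELECT id,label FROM companies WHERE activity=1'))
    random.shuffle(rows)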
    Gilles Ganault, Sep 21, 2009
    #1

  2. Peter Otten replied:

    Gilles Ganault wrote:

    > I have a working Python script that SELECTs rows from a database to
    > fetch a company's name from a web-based database.
    >
    > Since this list is quite big and the site is the bottleneck, I'd like
    > to run multiple instances of this script, and figured a solution would
    > be to pick rows at random from the dataset, check in my local database
    > if this item has already been taken care of, and if not, download
    > details from the remote web site.
    >
    > If someone's done this before, should I perform the randomization in
    > the SQL query (SQLite using the APSW wrapper
    > http://code.google.com/p/apsw/), or in Python?
    >
    > Thank you.
    >
    > Here's some simplified code:
    >
    > sql = 'SELECT id,label FROM companies WHERE activity=1'
    > rows = list(cursor.execute(sql))
    > for row in rows:
    >     id = row[0]
    >     label = row[1]
    >
    >     print strftime("%H:%M")
    >     url = "http://www.acme.com/details.php?id=%s" % id
    >     req = urllib2.Request(url, None, headers)
    >     response = urllib2.urlopen(req).read()
    >
    >     name = re_name.search(response)
    >     if name:
    >         name = name.group(1)
    >         sql = 'UPDATE companies SET name=? WHERE id=?'
    >         cursor.execute(sql, (name, id))


    I don't think you need to randomize the requests. Instead you could control
    a pool of worker processes using

    http://docs.python.org/library/multiprocessing.html
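
    Something along these lines might do (untested sketch; the regex, the
    database file name and the pool size are placeholders -- adapt them to
    the real script):

    import multiprocessing
    import re
    import urllib2

    import apsw

    re_name = re.compile(r'<h1>(.+?)</h1>')  # placeholder pattern

    def fetch_name(company_id):
        # Runs in a worker process: fetch the details page, extract the name.
        url = "http://www.acme.com/details.php?id=%s" % company_id
        response = urllib2.urlopen(url).read()
        match = re_name.search(response)
        return company_id, (match.group(1) if match else None)

    if __name__ == "__main__":
        cursor = apsw.Connection("companies.db").cursor()
        ids = [row[0] for row in
               cursor.execute('SELECT id FROM companies WHERE activity=1')]
        pool = multiprocessing.Pool(processes=5)
        # imap_unordered hands back results as soon as any worker finishes.
        for company_id, name in pool.imap_unordered(fetch_name, ids):
            if name is not None:
                cursor.execute('UPDATE companies SET name=? WHERE id=?',
                               (name, company_id))
        pool.close()
        pool.join()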

    Peter
    Peter Otten, Sep 21, 2009
    #2

  3. On Mon, 21 Sep 2009 10:59:38 +0200, Gilles Ganault <>
    declaimed the following in gmane.comp.python.general:

    > Since this list is quite big and the site is the bottleneck, I'd like
    > to run multiple instances of this script, and figured a solution would
    > be to pick rows at random from the dataset, check in my local database
    > if this item has already been taken care of, and if not, download
    > details from the remote web site.
    >

    You really think making MULTIPLE, overlapping requests to a web site
    is going to be more efficient than just suffering the single transfer
    time of one large query?

    > If someone's done this before, should I perform the randomization in
    > the SQL query (SQLite using the APSW wrapper
    > http://code.google.com/p/apsw/), or in Python?
    >

    Pardon, I thought you implied the bottleneck is the web-site
    database -- I'd worry about any web-site that exposes a file-server
    based database to direct user access.

    > Here's some simplified code:
    >
    > sql = 'SELECT id,label FROM companies WHERE activity=1'
    > rows = list(cursor.execute(sql))
    > for row in rows:
    >     id = row[0]
    >     label = row[1]
    >
    >     print strftime("%H:%M")
    >     url = "http://www.acme.com/details.php?id=%s" % id
    >     req = urllib2.Request(url, None, headers)
    >     response = urllib2.urlopen(req).read()
    >
    >     name = re_name.search(response)
    >     if name:
    >         name = name.group(1)
    >         sql = 'UPDATE companies SET name=? WHERE id=?'
    >         cursor.execute(sql, (name, id))


    Ah... You mean you are retrieving the names from a local database,
    and then requesting web-site details based upon that name.

    No matter how you look at it, you appear to want to process the
    entire local list of companies... Multiple randomized local queries will
    just add to the final run-time as you start to get duplicates -- and
    have to reject that one to query for another random name.

    I'd suggest either a pool of threads -- 5-10, each reading company
    names from a shared QUEUE, which is populated by the main thread
    (remember to commit() so that you don't block on database updates by the
    threads). OR... determine how many companies there are, and start
    threads feeding them <start> and <length> (length being #names /
    #threads, round up -- start then being 0*length+1, 1*length+1, etc...)
    and use those in thread specific selects using "... limit <length>
    offset <start>"... This way each thread retrieves its own limited set of
    companies (make sure to use the same sorting criteria).
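
    A rough sketch of the queue-based variant (untested; it keeps all the
    database writes in the main thread to stay clear of SQLite's
    cross-thread restrictions, and the regex, db file name and thread count
    are placeholders):

    import re
    import threading
    import urllib2
    from Queue import Queue

    import apsw

    re_name = re.compile(r'<h1>(.+?)</h1>')  # placeholder pattern
    jobs = Queue()
    results = Queue()

    def worker():
        # Pull company ids off the shared queue, fetch the page, report back.
        while True:
            company_id = jobs.get()
            if company_id is None:      # sentinel: no more work
                break
            url = "http://www.acme.com/details.php?id=%s" % company_id
            response = urllib2.urlopen(url).read()
            match = re_name.search(response)
            if match:
                results.put((company_id, match.group(1)))

    threads = [threading.Thread(target=worker) for _ in range(5)]
    for t in threads:
        t.start()

    cursor = apsw.Connection("companies.db").cursor()
    for (company_id,) in cursor.execute('SELECT id FROM companies WHERE activity=1'):
        jobs.put(company_id)
    for _ in threads:
        jobs.put(None)                  # one sentinel per worker

    for t in threads:
        t.join()

    # All database updates happen here, in the main thread.
    while not results.empty():
        company_id, name = results.get()
        cursor.execute('UPDATE companies SET name=? WHERE id=?',
                       (name, company_id))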
    --
    Wulfraed Dennis Lee Bieber KD6MOG
    HTTP://wlfraed.home.netcom.com/
    Dennis Lee Bieber, Sep 22, 2009
    #3
  4. On Mon, 21 Sep 2009 21:40:02 -0700, Dennis Lee Bieber
    <> wrote:
    > I'd suggest either a pool of threads -- 5-10, each reading company
    >names from a shared QUEUE, which is populated by the main thread
    >(remember to commit() so that you don't block on database updates by the
    >threads). OR... determine how many companies there are, and start
    >threads feeding them <start> and <length> (length being #names /
    >#threads, round up -- start then being 0*length+1, 1*length+1, etc...)
    >and use those in thread specific selects using "... limit <length>
    >offset <start>"... This way each thread retrieves its own limited set of
    >companies (make sure to use the same sorting criteria).


    Thanks for the help :)
    Gilles Ganault, Sep 23, 2009
    #4
