design choice: multi-threaded / asynchronous wxpython client?

Discussion in 'Python' started by bullockbefriending bard, Apr 27, 2008.

  1. I am a complete ignoramus and newbie when it comes to designing and
    coding networked clients (or servers for that matter). I have a copy
    of Goerzen (Foundations of Python Network Programming) and once
    pointed in the best direction should be able to follow my nose and get
    things sorted... but I am not quite sure which is the best path to
    take and would be grateful for advice from networking gurus.

    I am writing a program to display horse racing tote odds in a desktop
    client program. I have access to an HTTP (open one of several URLs,
    and I get back an XML doc with some data... not XML-RPC.) source of
    XML data which I am able to parse and munge with no difficulty at all.
    I have written and successfully tested a simple command line program
    which allows me to repeatedly poll the server and parse the XML. Easy
    enough, but the real world production complications are:

    1) The data for the race about to start updates every (say) 15
    seconds, and the data for earlier and later races updates only every
    (say) 5 minutes. There is no point for me to be hammering the server
    with requests every 15 seconds for data for races after the upcoming
    race... I should query for this perhaps every 150s to be safe. But for
    the upcoming race, I must not miss any updates and should query every
    ~7s to be safe. So... in the middle of a race meeting the situation
    might be:
    race 1 (race done with, no longer querying), race 2 (race done with,
    no longer querying), race 3 (about to start, data on server for this
    race updating every 15s, my client querying every 7s), races 4-8 (data
    on server for these races updating every 5 mins, my client querying
    every 2.5 mins)

    2) After a race has started and betting is cut off and there are
    consequently no more tote updates for that race (it is possible to
    determine when this occurs precisely because of an attribute in the
    XML data), I need to stop querying (say) race 3 every 7s and remove
    race 4 from the 150s query group and begin querying its data every 7s.

    3) I need to dump this data (for all races, not just current about to
    start race) to text files, store it as BLOBs in a DB *and* update real
    time display in a wxpython windowed client.
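The hand-over rule in points 1) and 2) can be sketched as a pure function, which is easy to test separately from any threading or networking. The `Race` class and its `betting_closed` flag (a stand-in for the XML attribute mentioned above, whose real name is not given) are illustrative assumptions; the 7 s and 150 s values are the ones from the post.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

FAST_INTERVAL = 7.0    # seconds, for the race about to start
SLOW_INTERVAL = 150.0  # seconds, for later races


@dataclass
class Race:
    number: int
    betting_closed: bool = False  # hypothetical stand-in for the XML attribute


def polling_intervals(races: List[Race]) -> Dict[int, Optional[float]]:
    """Map race number -> polling interval in seconds (None = stop polling)."""
    intervals: Dict[int, Optional[float]] = {}
    upcoming_seen = False
    for race in sorted(races, key=lambda r: r.number):
        if race.betting_closed:
            intervals[race.number] = None           # race done with
        elif not upcoming_seen:
            intervals[race.number] = FAST_INTERVAL  # first open race
            upcoming_seen = True
        else:
            intervals[race.number] = SLOW_INTERVAL  # later races
    return intervals
```

Recomputing the mapping whenever a race's betting closes promotes the next race to the fast interval without any explicit bookkeeping of query groups.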

    My initial thought was to have two threads for the different update
    polling cycles. In addition I would probably need another thread to
    handle UI stuff, and perhaps another for dealing with file/DB data
    write out. But, I wonder if using Twisted is a better idea? I will
    still need to handle some threading myself, but (I think) only for
    keeping wxpython happy by doing all this other stuff off the main
    thread + perhaps also persisting received data in yet another thread.

    I have zero experience with these kinds of design choices and would be
    very happy if those with experience could point out the pros and cons
    of each (synchronous/multithreaded, or Twisted) for dealing with the
    two differing sample rates problem outlined above.

    Many TIA!
     
    bullockbefriending bard, Apr 27, 2008
    #1

  2. bullockbefriending bard

    Eric Wertman Guest

    Hi, that does look like a lot of fun... You might consider breaking
    that into 2 separate programs. Write one that's threaded to keep a db
    updated properly, and write a completely separate one to handle
    displaying data from your db. This would allow you to later change or
    add a web interface without having to muck with the code that handles
    data.
     
    Eric Wertman, Apr 27, 2008
    #2

  3. bullockbefriending bard

    David Guest

    >
    > 1) The data for the race about to start updates every (say) 15
    > seconds, and the data for earlier and later races updates only every
    > (say) 5 minutes. There is no point for me to be hammering the server
    > with requests every 15 seconds for data for races after the upcoming


    Try using an HTTP HEAD instruction instead to check if the data has
    changed since last time.
     
    David, Apr 27, 2008
    #3
  4. On Apr 27, 10:05 pm, "Eric Wertman" <> wrote:
    > Hi, that does look like a lot of fun... You might consider breaking
    > that into 2 separate programs.  Write one that's threaded to keep a db
    > updated properly, and write a completely separate one to handle
    > displaying data from your db.  This would allow you to later change or
    > add a web interface without having to muck with the code that handles
    > data.


    Thanks for the good point. It certainly is a lot of 'fun'. One of
    those jobs which at first looks easy (XML, very simple to parse data),
    but a few gotchas in the real-time nature of the beast.

    After thinking about your idea more, I am sure this decoupling of
    functions and making everything DB-centric can simplify a lot of
    issues. I quite like the idea of persisting pickled or YAML data along
    with the raw XML (for archival purposes + occurs to me I might be able
    to do something with XSLT to get it directly into screen viewable form
    without too much work) to a DB and then having a client program which
    queries most recent time-stamped data for display.

    A further complication is that at a later point, I will want to do
    real-time time series prediction on all this data (viz. predicting
    actual starting prices at post time x minutes in the future). Assuming
    I can quickly (enough) retrieve the relevant last n tote data samples
    from the database in order to do this, then it will indeed be much
    simpler to make things much more DB-centric... as opposed to
    maintaining all this state/history in program data structures and
    updating it in real time.
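A minimal sketch of the DB-centric idea, assuming SQLite and a single table of time-stamped raw XML samples. The table and column names (`tote_sample`, `race_no`, `fetched_at`, `raw_xml`) are invented for illustration and any other database would do.

```python
import sqlite3
import time


def open_store(path=":memory:"):
    """Open (or create) the sample store."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS tote_sample (
               race_no    INTEGER NOT NULL,
               fetched_at REAL    NOT NULL,   -- Unix time of the poll
               raw_xml    BLOB    NOT NULL
           )"""
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_sample "
        "ON tote_sample (race_no, fetched_at)"
    )
    return conn


def save_sample(conn, race_no, raw_xml, fetched_at=None):
    """Store one polled document for a race."""
    if fetched_at is None:
        fetched_at = time.time()
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO tote_sample (race_no, fetched_at, raw_xml) "
            "VALUES (?, ?, ?)",
            (race_no, fetched_at, raw_xml),
        )


def last_samples(conn, race_no, n):
    """Return the n most recent (fetched_at, raw_xml) rows, newest first."""
    cur = conn.execute(
        "SELECT fetched_at, raw_xml FROM tote_sample "
        "WHERE race_no = ? ORDER BY fetched_at DESC LIMIT ?",
        (race_no, n),
    )
    return cur.fetchall()
```

With the index on `(race_no, fetched_at)`, both the display client and a later prediction step can ask for the last n samples of one race without scanning the whole table.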
     
    bullockbefriending bard, Apr 27, 2008
    #4
  5. bullockbefriending bard

    Jorge Godoy Guest

    bullockbefriending bard wrote:

    > A further complication is that at a later point, I will want to do
    > real-time time series prediction on all this data (viz. predicting
    > actual starting prices at post time x minutes in the future). Assuming
    > I can quickly (enough) retrieve the relevant last n tote data samples
    > from the database in order to do this, then it will indeed be much
    > simpler to make things much more DB-centric... as opposed to
    > maintaining all this state/history in program data structures and
    > updating it in real time.


    If instead of storing XML and YAML you store the data points, you can do
    everything from inside the database.

    PostgreSQL supports Python stored procedures / functions and also supports
    using R in the same way, for manipulating data. Then you can work with
    everything and just retrieve the resulting information.

    You might try storing the raw data and the XML / YAML, but I believe that
    keeping those sync'ed might cause you some extra work.
     
    Jorge Godoy, Apr 27, 2008
    #5
  6. On Apr 27, 10:10 pm, David <> wrote:
    > >  1) The data for the race about to start updates every (say) 15
    > >  seconds, and the data for earlier and later races updates only every
    > >  (say) 5 minutes. There is  no point for me to be hammering the server
    > >  with requests every 15 seconds for data for races after the upcoming

    >
    > Try using an HTTP HEAD instruction instead to check if the data has
    > changed since last time.


    Thanks for the suggestion... am I going about this the right way here?

    import urllib2
    request = urllib2.Request("http://get-rich.quick.com")
    request.get_method = lambda: "HEAD"
    http_file = urllib2.urlopen(request)

    print http_file.headers

    ->>>
    Age: 0
    Date: Sun, 27 Apr 2008 16:07:11 GMT
    Content-Length: 521
    Content-Type: text/xml; charset=utf-8
    Expires: Sun, 27 Apr 2008 16:07:41 GMT
    Cache-Control: public, max-age=30, must-revalidate
    Connection: close
    Server: Microsoft-IIS/6.0
    X-Powered-By: ASP.NET
    X-AspNet-Version: 1.1.4322
    Via: 1.1 jcbw-nc3 (NetCache NetApp/5.5R4D6)

    Date is the time of the server response and not of the last data update. It
    is definitely the time of the server's response to my request and bears no
    relation to when the live XML data was updated. I know this for a fact
    because right now there is no active race meeting and any data still
    available is static and many hours old. I would not feel confident
    rejecting incoming data as duplicate based only on same content length
    criterion. Am I missing something here?

    Actually there doesn't seem to be too much difficulty performance-wise
    in fetching and parsing (minidom) the XML data and checking the
    internal (it's an attribute) update time stamp in the parsed doc. If
    timings got really tight, presumably I could more quickly check each
    doc's time stamp with SAX (time stamp comes early in data as one might
    reasonably expect) before deciding whether to go the whole hog with
    minidom if the time stamp has in fact changed since I last polled the
    server.
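The SAX pre-check described above might look like the following sketch. The element name `odds` and attribute name `updated` are placeholders, since the real feed's names are not given; the handler aborts parsing as soon as the attribute has been read.

```python
import xml.sax


class _Found(Exception):
    """Raised to abort parsing once the attribute has been read."""


class TimestampHandler(xml.sax.ContentHandler):
    def __init__(self, element, attribute):
        super().__init__()
        self.element = element
        self.attribute = attribute
        self.value = None

    def startElement(self, name, attrs):
        # Stop at the first matching element that carries the attribute.
        if name == self.element and self.attribute in attrs:
            self.value = attrs[self.attribute]
            raise _Found()


def peek_timestamp(xml_bytes, element="odds", attribute="updated"):
    """Return the attribute value, or None if it is not present."""
    handler = TimestampHandler(element, attribute)
    try:
        xml.sax.parseString(xml_bytes, handler)
    except _Found:
        pass  # we deliberately stopped early
    return handler.value
```

Comparing the returned value with the one from the previous poll decides whether the full minidom parse is needed at all.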

    But if there is something I don't get about HTTP HEAD approach, please
    let me know as a simple check like this would obviously be a good
    thing for me.
     
    bullockbefriending bard, Apr 27, 2008
    #6
  7. bullockbefriending bard

    Jorge Godoy Guest

    bullockbefriending bard wrote:

    > 3) I need to dump this data (for all races, not just current about to
    > start race) to text files, store it as BLOBs in a DB *and* update real
    > time display in a wxpython windowed client.


    Why in a BLOB? Why not into specific data types and normalized tables? You
    can also save the BLOB for backup or auditing, but this won't allow you to
    use your DB to the best of its capabilities... It will just act as a data
    container, the same as a network share (which would not penalize you too
    much to have connections open/closed).
     
    Jorge Godoy, Apr 27, 2008
    #7
  8. On 2008-04-27, David <> wrote:
    >>
    >> 1) The data for the race about to start updates every (say) 15
    >> seconds, and the data for earlier and later races updates only every
    >> (say) 5 minutes. There is no point for me to be hammering the server
    >> with requests every 15 seconds for data for races after the upcoming

    >
    > Try using an HTTP HEAD instruction instead to check if the data has
    > changed since last time.


    A GET with If-Modified-Since is still better
    (http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html 14.25)
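A hedged sketch of such a conditional GET using Python 3's `urllib.request` (the thread itself uses Python 2's `urllib2`). The function names are made up for this example. It only helps if the server sends a `Last-Modified` header and honours `If-Modified-Since`; the headers shown in post #6 contain no `Last-Modified`, so it may not apply to this particular feed.

```python
import urllib.error
import urllib.request


def conditional_request(url, last_modified=None):
    """Build a GET request that asks for the body only if it has changed."""
    request = urllib.request.Request(url)
    if last_modified is not None:
        request.add_header("If-Modified-Since", last_modified)
    return request


def fetch_if_changed(url, last_modified=None, timeout=10):
    """Return (body, last_modified); body is None when the server says 304."""
    request = conditional_request(url, last_modified)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read(), response.headers.get("Last-Modified")
    except urllib.error.HTTPError as err:
        if err.code == 304:  # Not Modified: keep what we already have
            return None, last_modified
        raise
```

The caller keeps the returned `last_modified` string and passes it back in on the next poll.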

    --
    Jarkko Torppa
     
    Jarkko Torppa, Apr 27, 2008
    #8
  9. I think twisted is overkill for this problem. Threading, elementtree
    and urllib should more than suffice. One thread polling the server for
    each race with the desired polling interval. Each time some data is
    treated, that thread sends a signal containing information about what
    changed. The gui listens to the signal and will, if needed, update
    itself with the new information. The database handler also listens to
    the signal and updates the db.
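One way to sketch this design with only the standard library: each race gets a polling thread, and the "signal" is approximated here by putting a tuple on a shared `queue.Queue` (a substitution, not necessarily the mechanism meant above). The `fetch` callable, which returns a `(timestamp, data)` pair for a race number, is a hypothetical placeholder for the real HTTP and XML code.

```python
import queue
import threading


class RacePoller(threading.Thread):
    """Poll one race at a fixed interval and report changed data."""

    def __init__(self, race_no, fetch, interval, updates):
        super().__init__(daemon=True)
        self.race_no = race_no
        self.fetch = fetch        # hypothetical: race_no -> (timestamp, data)
        self.interval = interval  # seconds between polls
        self.updates = updates    # queue.Queue read by the GUI / DB side
        self._stop_event = threading.Event()
        self._last_stamp = None

    def run(self):
        while not self._stop_event.is_set():
            stamp, data = self.fetch(self.race_no)
            if stamp != self._last_stamp:  # only report real changes
                self._last_stamp = stamp
                self.updates.put((self.race_no, stamp, data))
            # Sleep, but wake immediately if stop() is called.
            self._stop_event.wait(self.interval)

    def stop(self):
        self._stop_event.set()
```

A consumer on the GUI or database side would take items off `updates`; a real poller would also need to catch network errors inside `run` so that one failed request does not end the thread.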






    --
    mvh Björn
     
    BJörn Lindqvist, Apr 27, 2008
    #9
  10. On Apr 27, 11:12 pm, Jorge Godoy <> wrote:
    > bullockbefriending bard wrote:
    > > A further complication is that at a later point, I will want to do
    > > real-time time series prediction on all this data (viz. predicting
    > > actual starting prices at post time x minutes in the future). Assuming
    > > I can quickly (enough) retrieve the relevant last n tote data samples
    > > from the database in order to do this, then it will indeed be much
    > > simpler to make things much more DB-centric... as opposed to
    > > maintaining all this state/history in program data structures and
    > > updating it in real time.

    >
    > If instead of storing XML and YAML you store the data points, you can do
    > everything from inside the database.
    >
    > PostgreSQL supports Python stored procedures / functions and also supports
    > using R in the same way, for manipulating data.  Then you can work with
    > everything and just retrieve the resulting information.
    >
    > You might try storing the raw data and the XML / YAML, but I believe that
    > keeping those sync'ed might cause you some extra work.


    Tempting thought, but one of the problems with this kind of horse
    racing tote data is that a lot of it is for combinations of runners
    rather than single runners. Whilst there might be (say) 14 horses in a
    race, there are 91 quinella price combinations (1-2 through 13-14,
    i.e. the 2-subsets of range(1, 15)) and 364 trio price combinations.
    It is not really practical (I suspect) to have database tables with
    columns for that many combinations?
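The counts quoted above can be checked with `itertools.combinations`, treating quinellas and trios as unordered subsets of the runner numbers, as the "2-subsets" wording implies.

```python
from itertools import combinations

runners = range(1, 15)  # 14 runners, numbered 1..14

quinellas = list(combinations(runners, 2))  # unordered pairs
trios = list(combinations(runners, 3))      # unordered triples

print(len(quinellas), quinellas[0], quinellas[-1])  # 91 (1, 2) (13, 14)
print(len(trios))                                   # 364
```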

    I certainly DO have a horror of having my XML / whatever else formats
    getting out of sync. I also have to worry about the tote company later
    changing their XML format. From that viewpoint, there is indeed a lot
    to be said for storing the tote data as numbers in tables.
     
    bullockbefriending bard, Apr 27, 2008
    #10
  11. On Apr 27, 11:27 pm, "BJörn Lindqvist" <> wrote:
    > I think twisted is overkill for this problem. Threading, elementtree
    > and urllib should more than suffice. One thread polling the server for
    > each race with the desired polling interval. Each time some data is
    > treated, that thread sends a signal containing information about what
    > changed. The gui listens to the signal and will, if needed, update
    > itself with the new information. The database handler also listens to
    > the signal and updates the db.


    So, if I understand you correctly:

    Assuming 8 races and we are just about to start race 1, we would
    have 8 polling threads with the race 1 thread polling at a faster rate
    than the other ones. After race 1 betting has closed, I could dispense with
    that thread, change the race 2 thread to poll faster, and so on...? I had
    been rather stupidly thinking of just two polling threads, one for the
    current race and one for races not yet run... but starting out with a
    thread for each extant race seems simpler given there then is no need
    to handle the mechanics of shifting the polling of races from the
    omnibus slow thread to the current race fast thread.

    Having got my minidom parser working nicely, I'm inclined to stick
    with it for now while I get other parts of the problem licked into
    shape. However, I do take your point that it's probably overkill for
    this simple kind of structured, mostly numerical data and will try to
    find time to experiment with the elementtree approach later. No harm
    at all in shaving the odd second off document parse times.
     
    bullockbefriending bard, Apr 27, 2008
    #11
  12. bullockbefriending bard

    Jorge Godoy Guest

    bullockbefriending bard wrote:

    > Tempting thought, but one of the problems with this kind of horse
    > racing tote data is that a lot of it is for combinations of runners
    > rather than single runners. Whilst there might be (say) 14 horses in a
    > race, there are 91 quinella price combinations (1-2 through 13-14,
    > i.e. the 2-subsets of range(1, 15)) and 364 trio price combinations.
    > It is not really practical (I suspect) to have database tables with
    > columns for that many combinations?
    >
    > I certainly DO have a horror of having my XML / whatever else formats
    > getting out of sync. I also have to worry about the tote company later
    > changing their XML format. From that viewpoint, there is indeed a lot
    > to be said for storing the tote data as numbers in tables.


    I don't understand anything about horse races... But it should be possible
    to normalize such information into some tables (not necessarily one). But
    then, there is nothing that prevents you from having dozens of columns on
    one table if it is needed (it might not be the most efficient solution
    performance and disk space-wise depending on what you have, but it works).

    Using things like that you can even enhance your system and provide more
    information about each horse, its race history, price history, etc.

    I love working with data and statistics, so even though I don't know the
    rules and workings of horse racing, I can think of several things I'd like
    to track or extract from the information you seem to have :)

    How does that price thing work? Are these the payout ratios for bets?
    What is a quinella or a trio? Two or three horses in a defined order
    winning the race?
     
    Jorge Godoy, Apr 27, 2008
    #12
  13. bullockbefriending bard

    David Guest

    > Date is the time of the server response and not of the last data update. It
    > is definitely the time of the server's response to my request and bears no
    > relation to when the live XML data was updated. I know this for a fact
    > because right now there is no active race meeting and any data still
    > available is static and many hours old. I would not feel confident
    > rejecting incoming data as duplicate based only on same content length
    > criterion. Am I missing something here?


    It looks like the data is dynamically generated on the server, so the
    web server doesn't know if/when the data changed. You will usually see
    a meaningful modification time only for static content (images, HTML
    files, etc). You could go by the Cache-Control line and only fetch data
    every 30 seconds, but it's possible for you to miss some updates this way.

    Another thing you could try (if necessary, this is a bit of an
    overkill) - download the first part of the XML (GET request with a
    range header), and check the timestamp you mentioned. If that changed
    then re-request the doc (a download resume is risky, the XML might
    change between your 2 requests).
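A sketch of the partial download, again with Python 3's `urllib.request`. The function names and the 512-byte default are arbitrary choices for this example. A server is free to ignore the `Range` header and send the whole document with status 200, so the code reports which case occurred.

```python
import urllib.request


def range_request(url, num_bytes=512):
    """Build a GET request for only the first num_bytes of the document."""
    request = urllib.request.Request(url)
    request.add_header("Range", "bytes=0-%d" % (num_bytes - 1))
    return request


def head_of_document(url, num_bytes=512, timeout=10):
    """Return (first_bytes, partial) for url."""
    request = range_request(url, num_bytes)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # 206 means the range was honoured; 200 means everything was sent.
        partial = response.status == 206
        return response.read(num_bytes), partial
```

The returned fragment is not a complete XML document, so it has to be searched with an incremental parser or a plain string match rather than loaded into a DOM.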

    David.
     
    David, Apr 27, 2008
    #13
  14. bullockbefriending bard

    David Guest

    > 3) I need to dump this data (for all races, not just current about to
    > start race) to text files, store it as BLOBs in a DB *and* update real
    > time display in a wxpython windowed client.


    A few important questions:

    1) How real-time must the display be? (should update immediately after
    you get new XML data, or is it ok to update a few seconds later?).

    2) How much data is being processed at peak? (100 records a second, 1000?)

    3) Does your app need to share fetched data with other apps? If so,
    how? (read from db, download HTML, RPC, etc).

    4) Does your app need to use data from previous executions? (eg: if
    you restart it, does it need to have a fully populated UI, or can it
    start from an empty UI and start updating as it downloads new XML
    updates).

    How you answer the above questions determines what kind of algorithm
    will work best.

    David.

    PS: I suggest that you contact the people you're downloading the XML
    from if you haven't already. eg: it might be against their TOS to
    constantly scrape data (I assume not, since they provide XML). You
    don't want them to black-list your IP address ;-). Also, maybe they
    have ideas for efficient data retrieval (eg: RSS feeds).
     
    David, Apr 27, 2008
    #14
  15. bullockbefriending bard

    David Guest

    > Tempting thought, but one of the problems with this kind of horse
    > racing tote data is that a lot of it is for combinations of runners
    > rather than single runners. Whilst there might be (say) 14 horses in a
    > race, there are 91 quinella price combinations (1-2 through 13-14,
    > i.e. the 2-subsets of range(1, 15)) and 364 trio price combinations.
    > It is not really practical (I suspect) to have database tables with
    > columns for that many combinations?


    If you normalise your tables correctly, these will be represented as
    one-to-many or many-to-many relationships in your database. Like the
    other poster I don't know the first thing about horses, and I may be
    misunderstanding something, but here is one (basic) normalised db
    schema:

    tables & descriptions:

    - horse - holds info about each horse
    - race - one record per race. Has times, etc
    - race_horse - holds records linking horses and races together.

    You can derive all possible horse combinations from the above info.
    You don't need to store it in the db unless you need to link something
    else to it (eg: betting data). In which case:

    - combination - represents one combination of horses.
    - combination_horse - links a combination to 1 horse. 1 of these per
    horse per combination.
    - bet - Represents a bet. Has a foreign-key relationship with combination
    (and other tables, eg: bettor, race)

    With a structure like the above you don't need hundreds of database columns :)
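As a rough illustration of the first three tables, here is an in-memory SQLite sketch. The column names are guesses, not part of the schema described above, and the demo reuses the horse id as the runner number for brevity. The self-join shows how the pair combinations can be derived from `race_horse` instead of being stored.

```python
import sqlite3

SCHEMA = """
CREATE TABLE horse (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE race (
    id        INTEGER PRIMARY KEY,
    post_time TEXT
);
CREATE TABLE race_horse (
    race_id  INTEGER NOT NULL REFERENCES race(id),
    horse_id INTEGER NOT NULL REFERENCES horse(id),
    PRIMARY KEY (race_id, horse_id)
);
"""

# Every unordered pair of horses entered in one race.
PAIRS_SQL = """
SELECT a.horse_id, b.horse_id
FROM race_horse AS a
JOIN race_horse AS b
  ON a.race_id = b.race_id AND a.horse_id < b.horse_id
WHERE a.race_id = ?
ORDER BY a.horse_id, b.horse_id
"""


def demo():
    """Create one race with 14 horses and return all pairs for it."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO race (id, post_time) VALUES (1, NULL)")
    for n in range(1, 15):
        conn.execute(
            "INSERT INTO horse (id, name) VALUES (?, ?)", (n, "horse %d" % n)
        )
        conn.execute(
            "INSERT INTO race_horse (race_id, horse_id) VALUES (1, ?)", (n,)
        )
    return conn.execute(PAIRS_SQL, (1,)).fetchall()
```

For a 14-horse race, `demo()` returns 91 rows, one per pair, so prices can be stored as rows keyed by combination rather than as columns.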

    David.
     
    David, Apr 27, 2008
    #15
  16. bullockbefriending bard wrote:

    > 1) The data for the race about to start updates every (say) 15
    > seconds, and the data for earlier and later races updates only
    > every (say) 5 minutes. There is no point for me to be hammering the
    > server with requests every 15 seconds for data for races after the
    > upcoming race... I should query for this perhaps every 150s to be
    > safe. But for the upcoming race, I must not miss any updates and
    > should query every ~7s to be safe. So... in the middle of a race
    > meeting the situation might be:


    I don't fully understand this, but can't you design the server in a
    way that you can connect to it and it notifies you about important
    things? IMHO, polling isn't ideal.

    > My initial thought was to have two threads for the different
    > update polling cycles. In addition I would probably need another
    > thread to handle UI stuff, and perhaps another for dealing with
    > file/DB data write out.


    No need for any additional threads. UI, networking and file I/O can
    operate asynchronously. Using wxPython's timers with callback
    functions, you should need only standard Python modules (except
    wx).

    > But, I wonder if using Twisted is a better idea?


    IMHO that's only advisable if you like to create your own protocols and
    reuse them in different apps, or need full-featured customisable
    implementations of advanced protocols.

    Additionally, you'd *have to* use multiple threads: One for the
    Twisted event loop and one for the wxPython one.

    There is a wxreactor in Twisted which integrates the wxPython event
    loop, but I stopped using it due to strange deadlock problems which
    began with some wxPython version. Also, it seems it's no longer in
    development. But my alternative works perfectly (main thread with
    Twisted, and a GUI thread for wxPython, communicating over Python
    standard queues).

    You'd only need additional threads if you would do heavy number
    crunching inside the wxPython or Twisted thread. For the respective
    event loop not to hang, it's advisable to use a separate thread for
    long-running calculations.

    > I have zero experience with these kinds of design choices and
    > would be very happy if those with experience could point out the
    > pros and cons of each (synchronous/multithreaded, or Twisted) for
    > dealing with the two differing sample rates problem outlined
    > above.


    I'd favor an "as few threads as necessary" approach. In my experience
    this saves pain (i. e. deadlocks and boilerplate queueing code).

    Regards,


    Björn

    --
    BOFH excuse #27:

    radiosity depletion
     
    Bjoern Schliessmann, Apr 27, 2008
    #16
