regarding threading

Discussion in 'Python' started by akash shetty, Oct 14, 2003.

  1. akash shetty

    akash shetty Guest

    hi,
    I'm developing code that searches a large biological
    database for certain patterns. The file is 3.5GB, the
    search pattern is a ten-letter string, and the database
    consists of paragraphs. The code I've developed searches
    the data paragraph-wise (using xreadlines), but this
    takes an awful amount of time (about 7 minutes). Is there
    any way to speed this up?
    Is threading feasible, and what code would I thread
    (since all I do is process the database)? There are no
    other concurrent tasks. So should I divide the database
    into parts and multithread the search on those parts
    concurrently? Is that feasible? Or should I be using some
    kind of multiprocessing, running the parts (files) as
    different processes?
    Please help.
    Thanks
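
    The paragraph-wise scan described above might look roughly like this (a sketch, not the OP's actual code; the blank-line paragraph convention and plain substring matching are assumptions):

```python
def search_paragraphs(path, pattern):
    """Scan a text file paragraph by paragraph (paragraphs separated
    by blank lines) and return the paragraphs containing `pattern`."""
    hits = []
    buf = []
    with open(path) as f:
        for line in f:  # lazy line iteration, xreadlines-style
            if line.strip():
                buf.append(line)
            else:
                # blank line ends a paragraph: test it, then reset
                if buf and pattern in "".join(buf):
                    hits.append("".join(buf))
                buf = []
        # don't forget a final paragraph with no trailing blank line
        if buf and pattern in "".join(buf):
            hits.append("".join(buf))
    return hits
```

    Every byte of the 3.5GB file still passes through Python string machinery here, which is where the 7 minutes go.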

    akash shetty, Oct 14, 2003
    #1

  2. akash shetty wrote:

    > but this takes an awful amount of time (about 7 minutes).
    > is there any way to speed this up?
    > is threading feasible, and what code do I thread (since
    > all I do is process the database)? there are no other
    > concurrent tasks. so do I divide the database into parts
    > and multithread the search on these parts concurrently?
    > is this feasible? or should I be using some kind of
    > multiprocessing, running the parts (files) as different
    > processes?


    Multiple threads/processes won't buy you anything unless you have a
    multiprocessor machine. In fact, they'll slow things down, as context
    switches (which are considerably slower between processes than
    between threads) also take time.

    Threads only buy you performance on single-processor machines if you
    have to deal with asynchronous events like network packets or user
    interaction.

    For speeding up your search: if you search brute-force, you could
    try something like a shift-and algorithm.
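
    For reference, the shift-and idea can be sketched in a few lines of Python (a sketch, not Diez's code; it tracks all partial matches in parallel as bits of one integer):

```python
def shift_and(text, pattern):
    """Return the index of the first occurrence of `pattern` in `text`
    using the shift-and bit-parallel algorithm, or -1 if absent."""
    # For each character, a bitmask of the positions where it
    # occurs in the pattern.
    masks = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << i)
    goal = 1 << (len(pattern) - 1)  # bit set when a full match ends here
    state = 0
    for pos, c in enumerate(text):
        # Extend every partial match by one character and also try
        # starting a fresh match at this position (the `| 1`).
        state = ((state << 1) | 1) & masks.get(c, 0)
        if state & goal:
            return pos - len(pattern) + 1
    return -1
```

    For a ten-letter pattern the whole state fits in one machine word, which is what makes the algorithm attractive; in pure Python, though, plain `str.find` (implemented in C) will usually still win.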

    And it might help to use C and memory-map parts of the file, but I
    have to admit that I have no experience in that field.
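
    Memory mapping doesn't actually require dropping to C; Python's own `mmap` module can do it (a sketch; the path and pattern are invented):

```python
import mmap

def mmap_find(path, pattern):
    """Return the byte offset of the first occurrence of `pattern`
    (a bytes object) in the file at `path`, via a memory map."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; the OS pages data in on demand.
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            return mm.find(pattern)  # -1 if not found
        finally:
            mm.close()
```

    The search itself runs in C inside `mm.find`, and no paragraph splitting or per-line string handling happens in Python at all.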

    Diez
    Diez B. Roggisch, Oct 14, 2003
    #2

  3. akash shetty

    Andrew Dalke Guest

    akash shetty:
    > the file is 3.5GB. the search pattern is a ten-letter
    > string. the database consists of paragraphs.
    > the code I've developed searches the data paragraph-wise
    > (using xreadlines), but this takes an awful amount of
    > time (about 7 minutes).


    How are you doing the search? Character by character,
    string.find, or regular expressions? What's a "paragraph"?
    Might memory-mapping the file speed things up?

    If you don't have a multiple-processor machine,
    using threads won't make a difference. How many
    processors does your machine have?

    Andrew
    Andrew Dalke, Oct 14, 2003
    #3
  4. Actually, with Python, even on a dual-processor machine,
    multi-threading will get you NO speed increase. This is because even
    though you have multiple threads, only ONE of them is running at a
    time (whichever one holds the Global Interpreter Lock, or GIL).
    Python switches between threads every so often (every 100 bytecodes
    is the default, if I remember correctly, but it can be changed).

    The exception is if you write a C extension module... you can
    explicitly release the GIL and reacquire it before returning to
    Python. That allows another Python thread to run at the same time as
    your C module.

    Some Python extension modules implement this (I've been working with
    Fredrik Lundh to get this into PIL), but most don't... it's a
    personal gripe of mine, but I understand the necessity for the time
    being. The GIL makes Python pretty "thread safe" even without locks
    on shared objects, but in my opinion that should be up to the
    programmers to deal with, or die with, by themselves.

    Hopefully some day we'll get to a Python version that can internally
    handle threads properly.

    Kevin Cazabon.

    "Diez B. Roggisch" <> wrote in message news:<bmgp8g$m2831$-berlin.de>...
    > Multiple threads/processes won't buy you anything unless you have a
    > multiprocessor machine. [...]
    Kevin Cazabon, Oct 14, 2003
    #4
  5. akash shetty

    Neil Hodgson Guest

    Andrew Dalke:

    > If you don't have a multiple processor machine,
    > using threads won't make a difference. How many
    > processors do you have on a machine?


    There may be some advantage in overlapping computation with I/O,
    although it would depend on the relative costs of the search and the
    I/O. With a 3.5-gigabyte file the problem may well be I/O-bound, in
    which case splitting the file across multiple disks and using one
    thread per split may increase performance.
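
    The overlap Neil describes can be sketched with one reader thread feeding chunks to one searcher thread through a bounded queue (a sketch with invented names, under the simplifying assumption that no match straddles a chunk boundary; a real version would overlap adjacent chunks by `len(pattern) - 1` bytes):

```python
import threading
import queue

def threaded_search(path, pattern, chunk_size=1 << 20):
    """Find all byte offsets of `pattern` in the file at `path`,
    reading in one thread while searching in another."""
    chunks = queue.Queue(maxsize=4)  # bounded: reader can't race far ahead
    hits = []

    def reader():
        with open(path, "rb") as f:
            offset = 0
            while True:
                data = f.read(chunk_size)
                if not data:
                    break
                chunks.put((offset, data))
                offset += len(data)
        chunks.put(None)  # sentinel: end of file

    def searcher():
        while True:
            item = chunks.get()
            if item is None:
                break
            offset, data = item
            i = data.find(pattern)
            while i != -1:
                hits.append(offset + i)
                i = data.find(pattern, i + 1)

    t_read = threading.Thread(target=reader)
    t_scan = threading.Thread(target=searcher)
    t_read.start(); t_scan.start()
    t_read.join(); t_scan.join()
    return hits
```

    Whether this helps depends on the point made earlier in the thread: `f.read` releases the GIL while waiting on the disk, so the searcher can run during that wait, but if the disk keeps up with the search the extra thread only adds overhead.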

    Neil
    Neil Hodgson, Oct 14, 2003
    #5
  6. akash shetty

    Andrew Dalke Guest

    Neil Hodgson:
    > In which case splitting the file
    > onto multiple disks and using 1 thread for each split may increase
    > performance.


    But then so would disk striping, or a bigger cache, or ... hmm,
    perhaps the data is on a networked filesystem and the slow
    performance comes from the network? Hard to know without
    more info from the OP.

    Andrew
    Andrew Dalke, Oct 15, 2003
    #6
  7. akash shetty

    Anand Pillai Guest

    If your program is network-bound, there might be some
    performance gain to be had by using threads, taking
    into account the GIL and all that.

    I/O-bound... hard to say; it depends on how many I/O
    operations you do per second, the disk cache, and whether
    you use multiple disks. Too many factors.

    But in your case, it does not look as if the program is
    network-bound, so threading may not help here and in fact
    might even slow down performance owing to the GIL.

    The best option for you might be to speed up your search.
    If you are searching for patterns, use regexps rather than string
    or character search, since the latter slows things down
    considerably. But if you are doing just a plain substring search,
    *don't* use regexps: I found that simple string search is faster
    in most cases.
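
    For a fixed substring like the OP's ten-letter pattern, the two approaches give the same answer, so the choice is purely about speed (a sketch with invented data):

```python
import re

def find_both(text, pattern):
    """Locate a fixed pattern with str.find and with a compiled
    regexp; both agree on the first match offset (-1 if absent)."""
    plain = text.find(pattern)
    m = re.compile(re.escape(pattern)).search(text)
    regexp = m.start() if m else -1
    return plain, regexp
```

    Since the pattern here contains no wildcards or alternation, the regexp engine buys nothing, and `str.find` skips its overhead entirely.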

    Otherwise, think about indexing your data using LuPy or
    some other indexer and searching the index. You can write
    a small function to rebuild the index when your actual
    data changes; otherwise, i.e. in most normal searches,
    use the index as a cache and search there.

    Index searching is many times faster than searching with
    strings or regexps, and a lot of research has gone into it.
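
    Since the pattern here is always exactly ten letters, even a plain dictionary mapping every 10-character window to its offsets illustrates the indexing idea (a sketch, not LuPy; for a 3.5GB file the index itself would be huge, so a real indexer or an on-disk structure would be needed):

```python
def build_index(text, k=10):
    """Map every k-character substring of `text` to the list of
    offsets where it starts (a k-mer index)."""
    index = {}
    for i in range(len(text) - k + 1):
        index.setdefault(text[i:i + k], []).append(i)
    return index

def lookup(index, pattern):
    """All offsets of `pattern` (length k) in one dict probe,
    no scan of the text at all."""
    return index.get(pattern, [])
```

    Building the index costs one pass over the data, but every search after that is a single hash lookup instead of a 3.5GB scan, which is the trade-off Anand is pointing at.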

    HTH.

    -Anand

    "Andrew Dalke" <> wrote in message news:<3e1jb.1308$>...
    > But then so would disk striping, or a bigger cache, or ... hmm,
    > perhaps the data is on a networked filesystem and the slow
    > performance comes from the network? [...]
    Anand Pillai, Oct 15, 2003
    #7
