regarding threading


akash shetty

hi,
I'm developing code that searches a large biological database for certain
patterns. The file is 3.5 GB, the search pattern is a ten-letter string,
and the database consists of paragraphs. The code I've written searches
the data paragraph by paragraph (using xreadlines), but it takes an awful
amount of time (about 7 minutes). Is there any way to speed this up?
Is threading feasible here, and what code would I thread, since all I do
is process the database? There are no other concurrent tasks. Should I
divide the database into parts and multithread the search over those parts
concurrently? Is that feasible, or should I be using some kind of
multiprocessing, running the parts (files) as different processes?
Please help.
Thanks

 

Diez B. Roggisch

akash said:
but it takes an awful amount of time (about 7 minutes). Is there any way
to speed this up? Is threading feasible here, and what code would I
thread, since all I do is process the database? There are no other
concurrent tasks. Should I divide the database into parts and multithread
the search over those parts concurrently? Is that feasible, or should I be
using some kind of multiprocessing, running the parts (files) as different
processes?

Multiple threads/processes won't buy you anything unless you have a
multiprocessor machine. In fact, they will slow things down, because
context switches (which are considerably slower between processes than
between threads) also take time.

Threads only buy you performance on a single-processor machine if you have
to deal with asynchronous events such as network packets or user
interaction.

To speed up your search: if you are currently searching brute-force, you
could try something like a shift-and algorithm.
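A minimal sketch of the shift-and (bitap) idea for exact matching of a
short pattern - the function name and the example data are only
illustrative, not from the original post:

def shift_and_search(pattern, text):
    """Shift-and (bitap) exact search; returns index of first match or -1.

    Works for patterns up to the machine word size, so a ten-letter
    pattern is fine."""
    m = len(pattern)
    masks = {}
    for i, c in enumerate(pattern):
        # bit i of masks[c] is set if pattern[i] == c
        masks[c] = masks.get(c, 0) | (1 << i)
    accept = 1 << (m - 1)        # bit that signals a complete match
    state = 0
    for pos, c in enumerate(text):
        # shift in one more character, keeping only states consistent with c
        state = ((state << 1) | 1) & masks.get(c, 0)
        if state & accept:
            return pos - m + 1
    return -1

# shift_and_search("ACGTACGTAC", "TTTACGTACGTACGG") -> 3

Whether this beats the C-coded string find for a single fixed pattern is
another question; shift-and pays off mainly when the pattern has character
classes or a few errors are allowed.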

And it might help to use C and memory-map parts of the file - but I have
to admit that I have no experience in that field.

Diez
 

Andrew Dalke

akash shetty:
The file is 3.5 GB, the search pattern is a ten-letter string, and the
database consists of paragraphs. The code I've written searches the data
paragraph by paragraph (using xreadlines), but it takes an awful amount of
time (about 7 minutes).

How are you doing the search? Character by character, string.find, or
regular expressions? What's a "paragraph"? Might memory-mapping the file
speed things up?
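For instance, a rough sketch of memory mapping plus find(), using Python's
mmap module - the file name and pattern here are made up, and mapping the
whole 3.5 GB at once assumes enough address space (on a 32-bit machine you
would have to map smaller windows instead):

import mmap

PATTERN = b"ACGTACGTAC"    # stand-in for the real ten-letter pattern

f = open("database.txt", "rb")
m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)  # 0 = map whole file
pos = m.find(PATTERN)
while pos != -1:
    # pos is the byte offset of a match in the file
    pos = m.find(PATTERN, pos + 1)
m.close()
f.close()

No line or paragraph objects get built this way, so the per-line overhead
of xreadlines disappears; the operating system pages the data in as find()
walks through it.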

If you don't have a multiple-processor machine, using threads won't make a
difference. How many processors does your machine have?

Andrew
 

Kevin Cazabon

Actually, with Python, even on a dual-processor machine, multi-threading
will get you NO speed increase. This is because even though you have
multiple threads, only ONE of them runs at a time - whichever one holds
the Global Interpreter Lock (GIL). Python switches between threads every
so often (every 100 bytecodes by default, if I remember correctly, but
that can be changed).
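The knob Kevin refers to is the interpreter's check interval; a tiny
illustration, using the Python-2-era call (newer versions replaced it with
sys.setswitchinterval) and an arbitrary value - note this only changes how
often the interpreter considers switching threads, it does not remove the
GIL:

import sys

sys.setcheckinterval(1000)   # default is 100 bytecode instructions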

The exception is if you write a C extension module... you can explicitly
release the GIL and reacquire it before returning to Python. That allows
another Python thread to run at the same time as your C module.

Some Python extension modules implement this (I've been working with
Fredrik Lundh to get this into PIL), but most don't... it's a personal
gripe of mine, but I understand the necessity for the time being. The GIL
makes Python pretty "thread safe" even without locks on shared objects,
though in my opinion that should be up to programmers to deal with (or die
with) by themselves.

Hopefully some day we'll get a Python version that can handle threads
properly internally.

Kevin Cazabon.
 

Neil Hodgson

Andrew Dalke:
If you don't have a multiple-processor machine, using threads won't make a
difference. How many processors does your machine have?

There may be some advantage in overlapping computation with I/O, although
it would depend on the relative costs of the search and the I/O. With a
3.5 gigabyte file the problem may well be I/O bound, in which case
splitting the file across multiple disks and using one thread per split
may increase performance.
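Even with a single disk, a reader thread can fetch the next block while
the main thread scans the previous one, since the blocking read releases
the GIL. A rough sketch along those lines - the file name, pattern and
block size are made up:

import threading
try:
    import queue               # Python 3
except ImportError:
    import Queue as queue      # Python 2

PATTERN = b"ACGTACGTAC"        # stand-in for the real ten-letter pattern
BLOCK = 4 * 1024 * 1024        # 4 MB read blocks
OVERLAP = len(PATTERN) - 1     # carry-over so boundary-spanning matches aren't lost

def reader(path, blocks):
    # Producer: read the file in big blocks and hand them to the searcher.
    f = open(path, "rb")
    carry = b""
    filepos = 0
    while True:
        data = f.read(BLOCK)
        if not data:
            break
        blocks.put((filepos - len(carry), carry + data))
        filepos += len(data)
        carry = data[-OVERLAP:]
    f.close()
    blocks.put(None)           # sentinel: end of file

def searcher(blocks):
    # Consumer: scan each block while the reader is busy with the next read.
    count = 0
    while True:
        item = blocks.get()
        if item is None:
            break
        base, data = item
        pos = data.find(PATTERN)
        while pos != -1:
            count += 1         # a match starts at absolute byte offset base + pos
            pos = data.find(PATTERN, pos + 1)
    return count

blocks = queue.Queue(maxsize=4)    # bounded queue keeps memory use modest
t = threading.Thread(target=reader, args=("database.txt", blocks))
t.start()
matches = searcher(blocks)
t.join()

With one disk this mainly hides read latency behind the scan; the
multiple-disk, one-thread-per-split variant described above would extend
the same idea.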

Neil
 

Andrew Dalke

Neil Hodgson:
In which case splitting the file across multiple disks and using one
thread per split may increase performance.

But then so would disk striping, or a bigger cache, or .. hmm,
perhaps the data is on a networked filesystem and the slow
performance comes from the network? Hard to know without
more info from the OP.

Andrew
 

Anand Pillai

If your program is network bound there might be some performance gain to
be had from using threads, taking the GIL and all that into account.

If it is I/O bound... hard to say; it depends on how many I/O operations
you do per second, the disk cache, and whether you use multiple disks -
too many factors.

But in your case it does not look as if the program is network bound, so
threading may not help here and might in fact slow performance down owing
to the GIL.

The best option for you might be to speed up the search itself. If you are
searching for patterns, use regexps rather than plain string or
character-by-character search, since the latter slows matters down
considerably. If you are doing just a substring search, *don't* use
regexps: I have found that simple string search is faster in most cases.
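A quick way to check that claim on your own data is a small timing sketch
like the one below - the synthetic text and pattern are only placeholders,
and the numbers will depend on your Python version and hardware:

import re
import time

pattern = "ACGTACGTAC"                  # stand-in ten-letter pattern
text = ("N" * 999990 + pattern) * 50    # roughly 50 MB of synthetic data
rx = re.compile(pattern)

t0 = time.time()
n_find, pos = 0, text.find(pattern)
while pos != -1:
    n_find += 1
    pos = text.find(pattern, pos + 1)
print("str.find: %d matches in %.3fs" % (n_find, time.time() - t0))

t0 = time.time()
n_re = len(rx.findall(text))
print("re:       %d matches in %.3fs" % (n_re, time.time() - t0))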

Otherwise, think about indexing your data using LuPy or some other indexer
and searching the index. You can write a small function that rebuilds this
index whenever the actual data changes; otherwise, i.e. for most normal
searches, use the index as a cache and search there.

Index searching is many times faster than searching with strings or
regexps, and a lot of research has gone into it.

HTH.

-Anand
 
