regarding threading


akash shetty

hi,
I'm developing code that searches a large biological database for certain
patterns. The file is 3.5 GB, the search pattern is a ten-letter string,
and the database consists of paragraphs. The code I've written searches
the data paragraph by paragraph (using xreadlines), but it takes an awful
amount of time (about 7 minutes). Is there any way to speed this up?
Is threading feasible here, and what code would I thread, since all I do
is process the database? There are no other concurrent tasks. Should I
divide the database into parts and multithread the search over those parts
concurrently? Is that feasible, or should I be using some kind of
multiprocessing, running the parts (files) as different processes?
Please help.
Thanks

 

Diez B. Roggisch

akash said:
but it takes an awful amount of time (about 7 minutes). Is there any way
to speed this up? Is threading feasible here, and what code would I
thread, since all I do is process the database? There are no other
concurrent tasks. Should I divide the database into parts and multithread
the search over those parts concurrently? Is that feasible, or should I be
using some kind of multiprocessing, running the parts (files) as different
processes?

Multiple threads/processes won't buy you anything unless you have a
multiprocessor machine. In fact, they will slow things down, because
context switches (which are considerably slower between processes than
between threads) also take time.

Threads only buy you performance on a single-processor machine if you have
to deal with asynchronous events such as network packets or user
interaction.

To speed up your search: if you are currently searching brute-force, you
could try something like a shift-and algorithm.
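A minimal sketch of the shift-and (bitap) idea for exact matching of a
short pattern - the function name and the example data are only
illustrative, not from the original post:

def shift_and_search(pattern, text):
    """Shift-and (bitap) exact search; returns index of first match or -1.

    Works for patterns up to the machine word size, so a ten-letter
    pattern is fine."""
    m = len(pattern)
    masks = {}
    for i, c in enumerate(pattern):
        # bit i of masks[c] is set if pattern[i] == c
        masks[c] = masks.get(c, 0) | (1 << i)
    accept = 1 << (m - 1)        # bit that signals a complete match
    state = 0
    for pos, c in enumerate(text):
        # shift in one more character, keeping only states consistent with c
        state = ((state << 1) | 1) & masks.get(c, 0)
        if state & accept:
            return pos - m + 1
    return -1

# shift_and_search("ACGTACGTAC", "TTTACGTACGTACGG") -> 3

Whether this beats the C-coded string find for a single fixed pattern is
another question; shift-and pays off mainly when the pattern has character
classes or a few errors are allowed.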

And it might help to use C and memory-map parts of the file - but I have
to admit that I have no experience in that field.

Diez
 

Andrew Dalke

akash shetty:
The file is 3.5 GB, the search pattern is a ten-letter string, and the
database consists of paragraphs. The code I've written searches the data
paragraph by paragraph (using xreadlines), but it takes an awful amount of
time (about 7 minutes).

How are you doing the search? Character by character, string.find, or
regular expressions? What's a "paragraph"? Might memory-mapping the file
speed things up?
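For instance, a rough sketch of memory mapping plus find(), using Python's
mmap module - the file name and pattern here are made up, and mapping the
whole 3.5 GB at once assumes enough address space (on a 32-bit machine you
would have to map smaller windows instead):

import mmap

PATTERN = b"ACGTACGTAC"    # stand-in for the real ten-letter pattern

f = open("database.txt", "rb")
m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)  # 0 = map whole file
pos = m.find(PATTERN)
while pos != -1:
    # pos is the byte offset of a match in the file
    pos = m.find(PATTERN, pos + 1)
m.close()
f.close()

No line or paragraph objects get built this way, so the per-line overhead
of xreadlines disappears; the operating system pages the data in as find()
walks through it.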

If you don't have a multiple-processor machine, using threads won't make a
difference. How many processors does your machine have?

Andrew
 

Kevin Cazabon

Actually, with Python, even on a dual-processor machine, multi-threading
will get you NO speed increase. This is because even though you have
multiple threads, only ONE of them runs at a time - whichever one holds
the Global Interpreter Lock (GIL). Python switches between threads every
so often (every 100 bytecodes by default, if I remember correctly, but
that can be changed).
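The knob Kevin refers to is the interpreter's check interval; a tiny
illustration, using the Python-2-era call (newer versions replaced it with
sys.setswitchinterval) and an arbitrary value - note this only changes how
often the interpreter considers switching threads, it does not remove the
GIL:

import sys

sys.setcheckinterval(1000)   # default is 100 bytecode instructions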

The exception is if you write a C extension module... you can explicitly
release the GIL and reacquire it before returning to Python. That allows
another Python thread to run at the same time as your C module.

Some Python extension modules implement this (I've been working with
Fredrik Lundh to get this into PIL), but most don't... it's a personal
gripe of mine, but I understand the necessity for the time being. The GIL
makes Python pretty "thread safe" even without locks on shared objects,
though in my opinion that should be up to programmers to deal with (or die
with) by themselves.

Hopefully some day we'll get a Python version that can handle threads
properly internally.

Kevin Cazabon.
 

Neil Hodgson

Andrew Dalke:
If you don't have a multiple-processor machine, using threads won't make a
difference. How many processors does your machine have?

There may be some advantage in overlapping computation with I/O, although
it would depend on the relative costs of the search and the I/O. With a
3.5 gigabyte file the problem may well be I/O bound, in which case
splitting the file across multiple disks and using one thread per split
may increase performance.
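Even with a single disk, a reader thread can fetch the next block while
the main thread scans the previous one, since the blocking read releases
the GIL. A rough sketch along those lines - the file name, pattern and
block size are made up:

import threading
try:
    import queue               # Python 3
except ImportError:
    import Queue as queue      # Python 2

PATTERN = b"ACGTACGTAC"        # stand-in for the real ten-letter pattern
BLOCK = 4 * 1024 * 1024        # 4 MB read blocks
OVERLAP = len(PATTERN) - 1     # carry-over so boundary-spanning matches aren't lost

def reader(path, blocks):
    # Producer: read the file in big blocks and hand them to the searcher.
    f = open(path, "rb")
    carry = b""
    filepos = 0
    while True:
        data = f.read(BLOCK)
        if not data:
            break
        blocks.put((filepos - len(carry), carry + data))
        filepos += len(data)
        carry = data[-OVERLAP:]
    f.close()
    blocks.put(None)           # sentinel: end of file

def searcher(blocks):
    # Consumer: scan each block while the reader is busy with the next read.
    count = 0
    while True:
        item = blocks.get()
        if item is None:
            break
        base, data = item
        pos = data.find(PATTERN)
        while pos != -1:
            count += 1         # a match starts at absolute byte offset base + pos
            pos = data.find(PATTERN, pos + 1)
    return count

blocks = queue.Queue(maxsize=4)    # bounded queue keeps memory use modest
t = threading.Thread(target=reader, args=("database.txt", blocks))
t.start()
matches = searcher(blocks)
t.join()

With one disk this mainly hides read latency behind the scan; the
multiple-disk, one-thread-per-split variant described above would extend
the same idea.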

Neil
 

Andrew Dalke

Neil Hodgson:
In which case splitting the file across multiple disks and using one
thread per split may increase performance.

But then so would disk striping, or a bigger cache, or .. hmm,
perhaps the data is on a networked filesystem and the slow
performance comes from the network? Hard to know without
more info from the OP.

Andrew
 

Anand Pillai

If your program is network bound there might be some performance gain to
be had from using threads, taking the GIL and all that into account.

If it is I/O bound... hard to say; it depends on how many I/O operations
you do per second, the disk cache, and whether you use multiple disks -
too many factors.

But in your case it does not look as if the program is network bound, so
threading may not help here and might in fact slow performance down owing
to the GIL.

The best option for you might be to speed up the search itself. If you are
searching for patterns, use regexps rather than plain string or
character-by-character search, since the latter slows matters down
considerably. If you are doing just a substring search, *don't* use
regexps: I have found that simple string search is faster in most cases.
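A quick way to check that claim on your own data is a small timing sketch
like the one below - the synthetic text and pattern are only placeholders,
and the numbers will depend on your Python version and hardware:

import re
import time

pattern = "ACGTACGTAC"                  # stand-in ten-letter pattern
text = ("N" * 999990 + pattern) * 50    # roughly 50 MB of synthetic data
rx = re.compile(pattern)

t0 = time.time()
n_find, pos = 0, text.find(pattern)
while pos != -1:
    n_find += 1
    pos = text.find(pattern, pos + 1)
print("str.find: %d matches in %.3fs" % (n_find, time.time() - t0))

t0 = time.time()
n_re = len(rx.findall(text))
print("re:       %d matches in %.3fs" % (n_re, time.time() - t0))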

Otherwise, think about indexing your data using LuPy or some other indexer
and searching the index. You can write a small function that rebuilds this
index whenever the actual data changes; otherwise, i.e. for most normal
searches, use the index as a cache and search there.

Index searching is many times faster than searching with strings or
regexps, and a lot of research has gone into it.

HTH.

-Anand
 
