Is it better to use threads or fork in the following case

  • Thread starter grocery_stocker

grocery_stocker

Let's say there is a new zip file with updated information every 30
minutes on a remote website. Now, I wanna connect to this website
every 30 minutes, download the file, extract the information, and then
have the program search the file for certain items.

Would it be better to use threads to break this up? I would have one
thread download the data and another actually process the data.
Or would it be better to use fork?
 
Diez B. Roggisch

grocery_stocker said:
Let's say there is a new zip file with updated information every 30
minutes on a remote website. Now, I wanna connect to this website
every 30 minutes, download the file, extract the information, and then
have the program search the file for certain items.

Would it be better to use threads to break this up? I would have one
thread download the data and another actually process the data.
Or would it be better to use fork?

Neither. Why do you think you need concurrency at all?

Diez
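
For reference, a purely sequential version of the job described in the
question could look roughly like the sketch below. It is only an
illustration under assumptions: the names search_zip and poll_forever,
the idea that the archive holds UTF-8 text files, and "search" meaning a
plain substring match are all made up for the example and are not from
the original post.

```python
import io
import time
import urllib.request
import zipfile


def search_zip(zip_bytes, needles):
    """Return (member, line) pairs for every line containing one of the needles."""
    hits = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for member in archive.namelist():
            # Assumption: members are text files; undecodable bytes are replaced.
            text = archive.read(member).decode("utf-8", errors="replace")
            for line in text.splitlines():
                if any(needle in line for needle in needles):
                    hits.append((member, line))
    return hits


def poll_forever(url, needles, interval=30 * 60):
    """Download, search, then sleep until the next 30-minute slot. No concurrency."""
    while True:
        started = time.time()
        with urllib.request.urlopen(url) as response:
            data = response.read()
        for member, line in search_zip(data, needles):
            print(member, line)
        # Sleep for whatever is left of the interval (never a negative amount).
        time.sleep(max(0, started + interval - time.time()))
```

As long as download plus search finishes well inside the 30-minute
window, nothing here needs a second thread or process.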
 
grocery_stocker

Neither. Why do you think you need concurrency at all?

Okay, here is what was going through my mind. I'm on a 56k dialup
modem. What happens if it takes me 15 minutes to download the file?
Now let's say during those 15 minutes, the program needs to parse the
data in the existing file.
 
Diez B. Roggisch

grocery_stocker said:
Okay, here is what was going through my mind. I'm on a 56k dialup
modem. What happens if it takes me 15 minutes to download the file?
Now let's say during those 15 minutes, the program needs to parse the
data in the existing file.

Is this an exercise in asking 20 hypothetical questions?

Getting concurrency right isn't trivial, so if you absolutely don't
need this, don't do it.

Diez
 
CTO

Probably better just to send a HEAD request and see if the file has
been updated within the time you're looking at before any unpack. Even
on a 56k that's going to be pretty fast, and you don't risk unpacking
an old file while a new version is on the way.
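
A rough sketch of that HEAD check, using only the standard library. It
assumes the server sends a Last-Modified header, which not every server
does; the function names are invented for the example.

```python
import urllib.request
from email.utils import parsedate_to_datetime


def parse_last_modified(header_value):
    """Turn an HTTP date such as 'Wed, 21 Oct 2015 07:28:00 GMT' into a datetime."""
    if not header_value:
        return None
    return parsedate_to_datetime(header_value)


def remote_last_modified(url):
    """Send a HEAD request (no body is transferred) and read Last-Modified."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=30) as response:
        return parse_last_modified(response.headers.get("Last-Modified"))


def needs_download(remote_time, last_seen):
    """Download when either timestamp is unknown or the remote copy is newer."""
    if remote_time is None or last_seen is None:
        return True
    return remote_time > last_seen
```

The caller would remember the timestamp of the last file it fetched and
pass it in as last_seen on the next round.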

If you still want to be able to unpack the old file if there's an
update, then you're probably right about needing to run it
concurrently, and personally I'd just fork it for ease of use. It
doesn't sound like you're trying to run 100,000 of these at the same
time, and you're saving the file anyway.

Geremy Condra
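
For anyone unsure what "just fork it" would look like in Python, here is
a minimal, Unix-only sketch with os.fork. The helper names run_in_child
and wait_for are made up for illustration; the task passed in would be
whatever does the download.

```python
import os


def run_in_child(task, *args):
    """Fork; run task(*args) in the child and return the child's PID to the parent."""
    pid = os.fork()
    if pid == 0:
        # Child process: do the work, then leave immediately via os._exit so
        # none of the parent's remaining code runs in the child.
        status = 0
        try:
            task(*args)
        except Exception:
            status = 1
        os._exit(status)
    # Parent process: carry on with other work; the child runs independently.
    return pid


def wait_for(pid):
    """Block until the child exits and return its exit code (0 means success)."""
    _, raw_status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(raw_status)
```

The parent can keep working on the old file and call wait_for(pid) when
it is ready to pick up the new one from disk.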
 
Paul Hankin

Okay, here is what was going through my mind. I'm on a 56k dialup
modem. What happens if it takes me 15 minutes to download the file?
Now let's say during those 15 minutes, the program needs to parse the
data in the existing file.

If your modem is going at full speed for those 15 minutes, you'll have
around 6.3MB of data. Even after decompressing, and unless the data is
in some quite difficult to parse format, it'll take seconds to
process.
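
The 6.3 figure is just line rate times time, converted from bits to
bytes. A quick check, assuming the full nominal 56,000 bits per second
and no protocol overhead (the function name is made up):

```python
def bytes_transferred(bits_per_second, minutes):
    """Bytes moved at a constant line rate, ignoring protocol overhead."""
    return bits_per_second * minutes * 60 // 8


print(bytes_transferred(56_000, 15))  # 6300000 bytes, i.e. about 6.3 MB
```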
 
grocery_stocker

grocery_stocker wrote:

Is this an exercise in asking 20 hypothetical questions?

No. This is the prelude to me writing a real-life Python program.
 
Gabriel Genellina

If your modem is going at full speed for those 15 minutes, you'll have
around 6.3MB of data. Even after decompressing, and unless the data is
in some quite difficult to parse format, it'll take seconds to
process.

In addition, the zip file format stores the directory at the end of the
file. So you can't process it until it's completely downloaded.
Concurrency doesn't help here.
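
This is easy to see with the zipfile module: it looks for the
end-of-central-directory record at the tail of the file, so a partial
download is rejected. A small self-contained demonstration (the archive
is built in memory, and list_members is a made-up helper name):

```python
import io
import zipfile


def list_members(zip_bytes):
    """Return the member names, or None if the bytes are not a complete zip."""
    try:
        with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
            return archive.namelist()
    except zipfile.BadZipFile:
        return None


# Build a tiny archive in memory, then simulate a half-finished download.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("items.txt", "widget 1\nwidget 2\n")
complete = buffer.getvalue()
partial = complete[: len(complete) // 2]

print(list_members(complete))  # ['items.txt']
print(list_members(partial))   # None
```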
 
CTO

In addition, the zip file format stores the directory at the end of the  
file. So you can't process it until it's completely downloaded.  
Concurrency doesn't help here.

Don't think that's relevant, if I'm understanding the OP correctly.
Let's say you've downloaded the file once and you're doing whatever
the app does with it. Now, while that's happening the half an hour
time limit comes up. Now you want to start another download, but
you also want to continue to work with the old version. Voila,
concurrency.
 
Dennis Lee Bieber

No. This is the prelude to me writing a real-life Python program.

Lots of "real life python programs" don't need threading or other
spawned processes...

Your 56K dial-up is probably only running around 44kbps. No "56K"
modem in the US ever reaches its nominal speed: FCC limits on line
power cap the rate at around 53kbps, and since the actual speed depends
on the cleanliness of the signal on the line, it rarely hits even
50kbps. Assuming 44,000bps and no handshake/protocol overhead, that
comes to 5,500 bytes/sec => 330,000 bytes/min => 4,950,000 bytes in 15
minutes... call it 5MB. What type of processing are you planning that
would take any fairly recent computer 15 minutes to handle 5MB of data?
5MB is about 6 minutes of MP3 audio, or 3-4 3.5MP JPEGs.

Presuming your processing really does have the risk of running over
into the next download interval, I'd suggest at most two threads
(pseudo-code):

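import queue      # this module is named "Queue" in Python 2
import threading
import time

BASEFILENAME = "update-"   # placeholder prefix for the saved files
INTERVAL = 30 * 60         # seconds between download starts

worklist = queue.Queue()

def downloader():
    while True:
        startTime = time.time()
        # timestamp in the name, so each download gets its own file
        stamp = time.strftime("%Y%m%d-%H%M%S", time.localtime(startTime))
        filename = BASEFILENAME + stamp + ".zip"
        doDownload(filename)
        worklist.put(filename)
        # compute next download time taking into account elapsed time;
        # clamp at zero because time.sleep() rejects negative values
        time.sleep(max(0, startTime + INTERVAL - time.time()))

def processor():
    while True:
        filename = worklist.get()   # blocks until a file is queued
        doFileProcessing(filename)

# doDownload() and doFileProcessing() are yours to supply
threading.Thread(target=downloader).start()
threading.Thread(target=processor).start()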


This ensures that downloads start every 30 minutes regardless of the
processing duration. If a download runs over 30 minutes, the computed
sleep time is negative, so it should be clamped to zero (time.sleep()
raises ValueError for a negative argument in Python 3) and the next
download starts immediately. It also ensures that the files are
processed IN ORDER OF DOWNLOAD with NO OVERLAPS.

Threading is probably a good fit here, since the downloader spends
most of its time blocked on a sleep call, letting the processor run
full speed; and if the processor is fast, it will block waiting for the
next file to be available, meaning the downloader gets full CPU usage.
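
To see the in-order, no-overlap behaviour without waiting half an hour,
here is a cut-down, terminating variant of the same two-thread idea. It
is a sketch only: run_pipeline, the DONE sentinel, and the download and
process callables are invented for the demonstration, and the
30-minute sleep is left out.

```python
import queue
import threading


def run_pipeline(filenames, download, process):
    """Download in one thread, process in another; return results in queue order."""
    worklist = queue.Queue()
    processed = []
    DONE = object()  # sentinel telling the processor there is nothing more to do

    def downloader():
        for name in filenames:
            download(name)
            worklist.put(name)
        worklist.put(DONE)

    def processor():
        while True:
            name = worklist.get()  # blocks until the downloader hands over a file
            if name is DONE:
                break
            processed.append(process(name))

    threads = [threading.Thread(target=downloader),
               threading.Thread(target=processor)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return processed


print(run_pipeline(["a.zip", "b.zip", "c.zip"], lambda name: None, str.upper))
# ['A.ZIP', 'B.ZIP', 'C.ZIP']
```

Because a single processor thread takes names off a FIFO queue, files
are handled one at a time in the order they were downloaded.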

--
Wulfraed Dennis Lee Bieber KD6MOG
HTTP://wlfraed.home.netcom.com/
HTTP://www.bestiaria.com/
 
Diez B. Roggisch

CTO said:
Don't think that's relevant, if I'm understanding the OP correctly.
Let's say you've downloaded the file once and you're doing whatever
the app does with it. Now, while that's happening the half an hour
time limit comes up. Now you want to start another download, but
you also want to continue to work with the old version. Voila,
concurrency.

Which brings us back to the "20 questions" part of my earlier post. It
could be, but it could also be that processing takes seconds. Or it takes
so long that even concurrency won't help. Who knows?

Diez
 
CTO

Which brings us back to the "20 questions" part of my earlier post. It
could be, but it could also be that processing takes seconds. Or it takes
so long that even concurrency won't help. Who knows?

Probably the OP ;)

Geremy Condra
 
JanC

Gabriel said:
In addition, the zip file format stores the directory at the end of the
file. So you can't process it until it's completely downloaded.

Well, you *can* download the directory part first (if the HTTP server
supports range requests), and if you only need some files, you could
then download only those files out of the .zip, saving a lot in
download time...
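
A sketch of the first step of that approach: ask for only the tail of
the file with an HTTP Range header, then read the end-of-central-directory
record to find where the directory lives. Assumptions: the server
honours Range requests, the archive is not ZIP64, and the function names
are made up for the example.

```python
import struct
import urllib.request

EOCD_SIGNATURE = b"PK\x05\x06"
EOCD_SIZE = 22  # fixed part of the end-of-central-directory record


def fetch_tail(url, nbytes=65_536 + EOCD_SIZE):
    """Request only the last nbytes of the resource (suffix byte range)."""
    request = urllib.request.Request(url, headers={"Range": "bytes=-%d" % nbytes})
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read()


def locate_central_directory(tail):
    """Return (entry_count, directory_size, directory_offset), or None if not found."""
    pos = tail.rfind(EOCD_SIGNATURE)
    if pos < 0 or len(tail) - pos < EOCD_SIZE:
        return None
    # Layout: signature, two disk numbers, two entry counts, size, offset, comment length.
    fields = struct.unpack("<4s4H2LH", tail[pos:pos + EOCD_SIZE])
    entry_count, directory_size, directory_offset = fields[4], fields[5], fields[6]
    return entry_count, directory_size, directory_offset
```

With the offset and size in hand, a second Range request can fetch the
directory itself, and further ones can fetch just the members you need.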
 
