[rubyzip + open-uri] reading zipfiles from a url?

Janus Bor

Hi everyone!

I'm trying to read a zipfile directly from a url. However, open-uri and
rubyzip don't seem to cooperate very well:

require 'zip/zip'
require 'open-uri'

url = "http://www.cibiv.at/~phuong/vien/8a375.zip"
zip_file = Zip::ZipFile.open(url)

This code raises a ZipError saying it can't find the zip file.
Zip::ZipFile.open obviously doesn't go through open-uri. I'm pretty new to
Ruby (and programming in general); is there any other way to read the
zipfile directly?

Writing every zipfile out to a new file on the hard disk first would be a
nightmare performance-wise, as I only need a few very small files out of
every zipfile, but I need to process a few hundred zipfiles...
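
(For reference, the plain download-to-disk approach I'm trying to avoid would
look roughly like the sketch below -- fetch each archive into a temporary
file and let rubyzip seek in the local copy. The entry handling is only
illustrative.)

require 'zip/zip'
require 'open-uri'
require 'tempfile'

url = "http://www.cibiv.at/~phuong/vien/8a375.zip"

# Download the archive once into a temporary file, then let rubyzip
# seek around in the local copy; the tempfile is cleaned up afterwards.
Tempfile.open('remote-zip') do |tmp|
  tmp.binmode
  tmp.write(open(url).read)     # open-uri fetches the whole file over HTTP
  tmp.flush

  Zip::ZipFile.open(tmp.path) do |zip|
    zip.each do |entry|
      puts entry.name
      # data = zip.read(entry.name)   # contents of a single entry
    end
  end
end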

Thanks in advance!
Janus
 
David Masover

url = "http://www.cibiv.at/~phuong/vien/8a375.zip"
zip_file = Zip::ZipFile.open(url)

Maybe there is a right way to do this...

I'm going to argue that it would be difficult at best. Zip files store
metadata, such as the location of each compressed file, at the end of the
archive. After reading that, you'd want to seek back somewhere into the
middle. Since open-uri is probably meant to issue a single, straightforward
HTTP request (that is, ask for the whole file, from beginning to end), I'm
not sure this would work well.

But that's just an educated guess.
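
To make the guess concrete, seeking over HTTP would mean something like a
Range request for just the tail of the archive, where the end-of-central-
directory record lives -- a rough sketch with net/http (it assumes the
server honours Range requests, and feeding the partial data to rubyzip
would still be a separate problem):

require 'net/http'
require 'uri'

uri = URI.parse("http://www.cibiv.at/~phuong/vien/8a375.zip")

Net::HTTP.start(uri.host, uri.port) do |http|
  size = http.head(uri.path)['Content-Length'].to_i    # total archive size

  # Ask for roughly the last 64 KB, where the central directory usually sits.
  tail_start = [size - 65_536, 0].max
  resp = http.get(uri.path, 'Range' => "bytes=#{tail_start}-#{size - 1}")

  puts "HTTP #{resp.code}: fetched #{resp.body.size} of #{size} bytes"
end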

Depending on how portable you need this to be, you might consider the
FUSE-based HTTPFS:

http://httpfs.sourceforge.net/

When I last tried this, performance was pretty terrible -- no caching, and it
would fetch the file in blocks, so _many_ separate HTTP requests. But it
would probably do what you want.
 
Axel Etzold

-------- Original Message --------
Date: Sun, 13 Jul 2008 22:50:21 +0900
From: David Masover <[email protected]>
To: (e-mail address removed)
Subject: Re: [rubyzip + open-uri] reading zipfiles from a url?
url = "http://www.cibiv.at/~phuong/vien/8a375.zip"
zip_file = Zip::ZipFile.open(url)

Maybe there is a right way to do this...

I'm going to argue that it would be difficult at best. Zip files store
metadata, such as the location of each compressed file, at the end of the
archive. After reading that, you'd want to seek back somewhere into the
middle. Since open-uri is probably meant to issue a single, straightforward
HTTP request (that is, ask for the whole file, from beginning to end), I'm
not sure this would work well.

But that's just an educated guess.

Depending on how portable you need this to be, you might consider the
FUSE-based HTTPFS:

http://httpfs.sourceforge.net/

When I last tried this, performance was pretty terrible -- no caching, and it
would fetch the file in blocks, so _many_ separate HTTP requests. But it
would probably do what you want.


Hi ---

there is also rio (http://rio.rubyforge.org/):

# Copy a file from an FTP server into a local file, un-gzipping it
rio('ftp://host/afile.gz').gzip > rio('afile')

I have no idea what its performance would be.

Best regards,

Axel
 
Michal Suchanek

Maybe there is a right way to do this...

I'm going to argue that it would be difficult at best. Zip files store
metadata, such as the location of each compressed file, at the end of the
archive. After reading that, you'd want to seek back somewhere into the
middle. Since open-uri is probably meant to issue a single, straightforward
HTTP request (that is, ask for the whole file, from beginning to end), I'm
not sure this would work well.

But that's just an educated guess.

Depending on how portable you need this to be, you might consider the
FUSE-based HTTPFS:

http://httpfs.sourceforge.net/

When I last tried this, performance was pretty terrible -- no caching, and it
would fetch the file in blocks, so _many_ separate HTTP requests. But it
would probably do what you want.

This is how FUSE is designed. The caching is supposed to happen in the
upper layers of the kernel. The requests are sent exactly as they are
received from the kernel, so there is nothing httpfs can do about the
granularity. In practice the kernel requests chunks of up to ~16 KB, but
probably only when the application does large block reads.

I tried keep-alive, which should speed up subsequent requests. However,
the sockets can then hang, and it takes time to detect that. Still, this
should only happen when there are network problems anyway.

It is very nice for mounting CD or DVD images, and extracting a single
file from a zip could also be faster that way. If you want all the files
anyway, it's probably better to just download the zip.
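
As a sketch of that single-file case from Ruby (the mountpoint below is
made up), rubyzip can treat the mounted archive like any local file, and
only the blocks it actually touches get fetched:

require 'zip/zip'

# Hypothetical path where httpfs exposes the remote archive.
Zip::ZipFile.open('/mnt/httpfs/8a375.zip') do |zip|
  wanted = zip.find { |entry| entry.name =~ /\.txt\z/ }  # pick one small entry
  puts zip.read(wanted.name) if wanted
end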

Also, last time I looked, the httpfs on SourceForge was broken: there was
a bug around the SSL #ifdefs in read/write. You may need to define or
undefine USE_SSL, or fix the code.

FUSE is theoretically portable to *BSD, but this particular code is not,
because it relies on undefined behaviour of directory operations to keep
the underlying directory visible after the mount.

Thanks

Michal
 
