Download the "head" of a large file?

erikcw

I'm trying to figure out how to download just the first few lines of a
large (50 MB) text file from a server to save bandwidth. Can Python do
this?

Something like the Python equivalent of curl http://url.com/file.xml |
head -c 2048

Thanks!
Erik
 
Ben Charrow

erikcw said:
...download just the first few lines of a large (50 MB) text file from a
server to save bandwidth... Something like the Python equivalent of curl
http://url.com/file.xml | head -c 2048

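If you're OK calling curl and head from within Python:

from subprocess import Popen, PIPE
url = "http://docs.python.org/"
# -s silences curl's progress meter, so stderr does not need its own pipe
p1 = Popen(["curl", "-s", url], stdout=PIPE)
p2 = Popen(["head", "-c", "1024"], stdin=p1.stdout, stdout=PIPE)
p1.stdout.close()  # drop our copy so curl notices when head exits
data = p2.communicate()[0]  # the first 1024 bytes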

If you want a pure Python approach:

import urllib2
url = "http://docs.python.org/"
req = urllib2.Request(url)
f = urllib2.urlopen(req)
data = f.read(1024)  # read at most 1024 bytes of the response body
f.close()

HTH,
Ben
 
John Yeung

erikcw said:
I'm trying to figure out how to download just the first few lines of a
large (50 MB) text file from a server to save bandwidth. Can Python do
this?

Something like the Python equivalent of curl http://url.com/file.xml |
head -c 2048

urllib.urlopen gives you a file-like object, which you can then read
line by line or in fixed-size chunks. For example:

import urllib
chunk = urllib.urlopen('http://url.com/file.xml').read(2048)

At that point, chunk is just bytes, which you can write to a local
file, print, or whatever it is you want.
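
For the "first few lines" part of the question, here is a minimal sketch in Python 3, where urllib.urlopen lives at urllib.request.urlopen. The helper name head_lines is made up for illustration. It works on any binary file-like object, so an in-memory buffer stands in for the network response here and the snippet runs offline.

```python
import io
from itertools import islice

def head_lines(stream, n):
    """Return the first n lines of a binary file-like object as a list."""
    # Iterating a file-like object yields one line at a time, so islice
    # stops asking for data after n lines instead of reading everything.
    return list(islice(stream, n))

# Stand-in for a network response so the sketch runs offline.
fake_response = io.BytesIO(b"<?xml version='1.0'?>\n<root>\n<item/>\n</root>\n")
first_two = head_lines(fake_response, 2)
print(first_two)

# Against a real server (the placeholder URL from the question):
# from urllib.request import urlopen
# with urlopen("http://url.com/file.xml") as resp:
#     first_two = head_lines(resp, 2)
```

Keep in mind that the HTTP layer and the OS may already have buffered somewhat more than the lines you get back, so this limits what you read, not exactly what was sent.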

John
 
Gabriel Genellina

On Mon, 27 Jul 2009 19:40:25 -0300, John Yeung wrote:
urllib.urlopen gives you a file-like object, which you can then read
line by line or in fixed-size chunks. For example:

import urllib
chunk = urllib.urlopen('http://url.com/file.xml').read(2048)

At that point, chunk is just bytes, which you can write to a local
file, print, or whatever it is you want.

As the OP wants to save bandwidth, it's better to request exactly the amount
of data needed. That is, add a Range header field [1] to the request, and
inspect the response for a corresponding Content-Range header [2]. A server
that does not support ranges ignores the header and answers 200 with the
whole body.

py> import urllib2
py> url = "http://www.python.org/"
py> req = urllib2.Request(url)
py> req.add_header('Range', 'bytes=0-10239') # first 10K
py> f = urllib2.urlopen(req)
py> data = f.read()
py> print repr(data[-30:]), len(data)
'\t <a href="http://www.zope.' 10240
py> f.headers['Content-Range']
'bytes 0-10239/18196'
py> f.getcode()
206 # 206=Partial Content
py> f.close()

[1] http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35

[2] http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.16
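
The same idea as reusable functions, sketched for Python 3, where urllib2 became urllib.request. The names build_range_request, parse_content_range and fetch_head are hypothetical helpers, not part of any library. Since a server is free to ignore Range, fetch_head caps the read at nbytes regardless of the status it gets back.

```python
import re
from urllib.request import Request, urlopen

def build_range_request(url, nbytes):
    """Build a request asking for only the first nbytes bytes of url."""
    req = Request(url)
    # Byte ranges are inclusive, so the first nbytes bytes are 0..nbytes-1.
    req.add_header("Range", "bytes=0-%d" % (nbytes - 1))
    return req

def parse_content_range(value):
    """Parse 'bytes 0-10239/18196' into (first, last, total).

    total is None when the server reports the full length as '*'.
    """
    m = re.match(r"bytes (\d+)-(\d+)/(\d+|\*)$", value.strip())
    if m is None:
        raise ValueError("unrecognized Content-Range: %r" % value)
    first, last, total = m.groups()
    return int(first), int(last), (None if total == "*" else int(total))

def fetch_head(url, nbytes):
    """Return up to nbytes bytes from the start of url (needs network access)."""
    with urlopen(build_range_request(url, nbytes)) as f:
        # With a 206 reply the body is already limited to the range; with a
        # plain 200 the server is sending everything, so read(nbytes) is
        # what keeps us from pulling the whole file.
        return f.read(nbytes)

# Offline checks using the values from the session above.
req = build_range_request("http://www.python.org/", 10240)
print(req.get_header("Range"))
print(parse_content_range("bytes 0-10239/18196"))
```

Calling fetch_head("http://www.python.org/", 10240) would then do the actual transfer. It is left out of the runnable part because it needs a live server.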
 
Dennis Lee Bieber

erikcw said:
I'm trying to figure out how to download just the first few lines of a
large (50 MB) text file from a server to save bandwidth. Can Python do
this?

Something like the Python equivalent of curl http://url.com/file.xml |
head -c 2048

Presuming that | is a shell pipe operation, then doesn't that
command line use "curl" to download the entire file, and "head" to
display just the first 2k?
--
Wulfraed Dennis Lee Bieber KD6MOG
HTTP://wlfraed.home.netcom.com/
HTTP://www.bestiaria.com/
 
Ben Charrow

Dennis said:
Presuming that | is a shell pipe operation, then doesn't that
command line use "curl" to download the entire file, and "head" to
display just the first 2k?

No, the entire file is not downloaded. My understanding of why this is (which
could be wrong) is that the output of curl is piped to head, and once head gets
the first 2k it closes the pipe. Then, when curl tries to write to the pipe
again, it gets sent the SIGPIPE signal at which point it exits.
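
The pipe behavior can be reproduced from Python itself. This is a small sketch, assuming a POSIX system, with one end of an os.pipe() playing curl and the other playing head. Python ignores SIGPIPE by default, so the failed write shows up as BrokenPipeError instead of killing the process. Whether a given program dies from the signal or just sees a write error depends on how it handles SIGPIPE.

```python
import os

# r plays the role of head's stdin, w the role of the downloader's stdout.
r, w = os.pipe()

os.write(w, b"x" * 2048)        # the writer produces its first 2k
first_2k = os.read(r, 2048)     # the reader consumes it...
os.close(r)                     # ...and exits, closing its end of the pipe

try:
    os.write(w, b"more data")   # the next write has no reader left
    writer_stopped = False
except BrokenPipeError:
    # This is the point where the writer learns it should stop.
    writer_stopped = True
finally:
    os.close(w)

print(len(first_2k), writer_stopped)
```

The writer only finds out on its next write, so whatever it had already fetched and buffered before that point has still crossed the network.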

Cheers,
Ben
 
