Urllib's urlopen and urlretrieve

Q

qoresucks

I only just started Python, and given that I know nothing about network programming or internet programming of any kind really, I thought it would be interesting to try to write something that could create an archive of a website for myself. With this I started trying to use the urllib library; however, I am having a problem understanding why certain things won't work with urllib.urlretrieve, or with urllib.urlopen and then reading.

Why is it that when using urllib.urlopen and then reading, or urllib.urlretrieve, I only get parts of the sites, losing the formatting, images, etc.? How can I get around this?

Lastly, while it's a bit off topic, I lack a good understanding of network programming as a whole. From making programs communicate to simply extracting data from URLs, I don't know where to even begin, which has led me to learning Python to better understand it, and hopefully then carry it over to other languages I know. Can anyone give me some advice on where to begin learning this information? Even if it's in another language.
 
D

Dave Angel

I only just started Python, and given that I know nothing about network programming or internet programming of any kind really, I thought it would be interesting to try to write something that could create an archive of a website for myself.

Please send your emails as text, not html; this is a text-based mailing
list.

To archive your website, use the rsync command. No need to write any
code, as rsync will descend into all the directories as needed, and
it'll get the actual website data, not the stuff that the web server
feeds to the browsers.
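
For example, an invocation along these lines (host and paths here are placeholders, not from the original post):

rsync -avz --delete user@example.com:/var/www/mysite/ ~/mysite-backup/

The -a flag recurses and preserves attributes, -z compresses the transfer, and --delete drops local files that no longer exist on the server, so repeated runs stay in sync.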

If for some reason you don't have rsync, you could use scp. But it
doesn't seem to be able to preserve attributes. It's also not smart
enough to only copy stuff that's been changed, when you want to update
incrementally.
 
R

rh

To archive your website, use the rsync command. No need to write any
code, as rsync will descend into all the directories as needed, and
it'll get the actual website data, not the stuff that the web server
feeds to the browsers.

How many websites let you suck down their content using rsync???
The request was for creating their own copy of a website.
If for some reason you don't have rsync, you could use scp. But it
doesn't seem to be able to preserve attributes. It's also not smart
enough to only copy stuff that's been changed, when you want to
update incrementally.

Ditto of above.

And how does this help someone just learning the language?
 
R

rh

On Thu, 21 Feb 2013 04:12:52 -0800 (PST), qoresucks wrote:
I only just started Python, and given that I know nothing about
network programming or internet programming of any kind really, I
thought it would be interesting to try to write something that could
create an archive of a website for myself. With this I started trying
to use the urllib library; however, I am having a problem
understanding why certain things won't work with urllib.urlretrieve,
or with urllib.urlopen and then reading.

Why is it that when using urllib.urlopen and then reading, or
urllib.urlretrieve, I only get parts of the sites, losing the
formatting, images, etc.? How can I get around this?

In 2.7.3 the standard library module to use is urllib2; in 3.3 it is
urllib. Straight from the doc page:

import urllib2
f = urllib2.urlopen('http://www.python.org/')
print f.read(100)
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<?xml-stylesheet href="./css/ht2html
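
For reference, the Python 3 spelling of the same fetch (urlopen moved into urllib.request in 3.x):

# Python 3: urlopen lives in urllib.request
from urllib.request import urlopen

f = urlopen('http://www.python.org/')
print(f.read(100))  # first 100 bytes of the page; read() returns bytes in 3.x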

And so your journey begins, with recursing into links, etc., etc.

Lastly, while it's a bit off topic, I lack a good understanding of
network programming as a whole. From making programs communicate to
simply extracting data from URLs, I don't know where to even begin,
which has led me to learning Python to better understand it, and
hopefully then carry it over to other languages I know. Can anyone
give me some advice on where to begin learning this information? Even
if it's in another language.

Also, since you're new, you may want to work with Python 3, though
it's not a requirement.

There are lots of free books online; search this list for links.
(You can search this list at gmane and probably elsewhere.)
 
D

Dave Angel

How many websites let you suck down their content using rsync???
The request was for creating their own copy of a website.

Clearly this was his own website, since it's usually unethical to "suck
down" someone else's. And my message specifically said "To archive
*your* website...". As to the implied question of why, since he
presumably has the original sources, I can only relate my own
experience. I generate mine with a Python program, but over time
obsolete files are left behind. Additionally, an overzealous SEO person
hand-edited my files. And finally, I reinstalled my system from scratch
a couple of months ago. So in order to see exactly what's out there, I
used rsync, about two weeks ago.
 
D

Dave Angel

<snip>
Why is it that when using urllib.urlopen and then reading, or urllib.urlretrieve, I only get parts of the sites, losing the formatting, images, etc.? How can I get around this?

Start by telling us whether you're using Python 2 or Python 3, as this
library differs between versions. Also what OS, as there are lots of
useful utilities in Unix, and a different set in Windows or other
places. Even if the same program exists on both, it's likely to be
named differently.

My earlier reply assumed you were trying to get an accurate copy of your
website, presumably because your own local copy had gotten out of synch.
rh assumed differently, so I'll try again. If you're trying to
download someone else's, you should realize that you may be violating
copyright, and ought to get permission. It's one thing to extract a
file or two, but another entirely to try to capture the entire site.
And many sites consider all of the details proprietary. Others consider
the images proprietary, and enforce the individual copyrights.

You can indeed copy individual files with urllib or urllib2, but that's
just the start of the problem. A typical web page is written in html
(or xhtml, or ...), and displaying it is the job of a browser, not the
cat command. In addition, the page will generally refer to lots of
other files, with the most common being a css file and a few jpegs. So
you have to parse the page to find all those dependencies, and copy them
as well.
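
To make that concrete, here is a minimal sketch of that dependency-finding step, using only the standard library (Python 2, to match the urllib2 example above; which tags and attributes to follow is just a starting guess):

import urllib2
from HTMLParser import HTMLParser
from urlparse import urljoin

class DependencyFinder(HTMLParser):
    # Collect the URLs a page depends on: images, scripts, stylesheets.
    def __init__(self, base_url):
        HTMLParser.__init__(self)
        self.base_url = base_url
        self.deps = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ('img', 'script') and 'src' in attrs:
            self.deps.add(urljoin(self.base_url, attrs['src']))
        elif tag == 'link' and 'href' in attrs:
            self.deps.add(urljoin(self.base_url, attrs['href']))

url = 'http://www.python.org/'
finder = DependencyFinder(url)
finder.feed(urllib2.urlopen(url).read())
for dep in sorted(finder.deps):
    print dep

Bear in mind that HTMLParser can choke on sufficiently broken HTML, which is part of why Beautiful Soup gets recommended further down.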

Next, the page may contain code (e.g. php, javascript), or it may be code
(e.g. Python or perl). In each of those cases, what you'll get isn't
exactly what you'd expect. If you try to fetch a python program,
generally what happens is it gets run, and you fetch its stdout instead.
On the other hand javascript gets executed by the browser, and I don't
know where php gets executed, or by whom. Finally, the page may make
use of resources which simply won't be visible to you without becoming a
hacker. Like my rsync and scp examples, you'll probably need a userid
and password to get into the guts.

If you want to play with some of this without programming, you could go
to your favorite browser, and View->Source. The method of doing that
varies with browser brand, version & OS, but it should be there on some
menu someplace. In Chrome, it's Tools->ViewSource.

Examples below extracted from the main page at python.org

<title>Python Programming Language &ndash; Official Website</title>

That simply sets the title for the page. It is not even part of the
body, it's part of the header for the page. In this case, the header
continues for 77 lines, including meta tags, javascript stuff, css
stuff, etc.

You might observe that angle brackets are used to enclose explicit kinds
of data. In the above example, it's a "title" element. And it's
enclosed with <title> and </title>

In xhtml, these will always come in pairs, like curly braces in C
programming. However, most web pages are busted, so parsing it is
sometimes troublesome. Most people seem to recommend Beautiful Soup, in
part because it tolerates many kinds of errors.
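
For example, a minimal taste of it (this assumes the third-party bs4 package is installed, e.g. via pip install beautifulsoup4):

import urllib2
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

page = urllib2.urlopen('http://www.python.org/').read()
soup = BeautifulSoup(page)
print soup.title.string        # the <title> element's text
for a in soup.find_all('a'):   # every link, even in sloppy HTML
    print a.get('href')        # None if the <a> has no href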

I'd get a good book on html programming, making sure it covers xhtml and
css. But I don't know what to recommend, as everything in my arsenal is
thoroughly dated.

Much of the body is devoted to the complexity of setting up the page in
a browser of variable size, varying fonts, user-overrides, etc. The
following excerpt:
<div style="align:center; padding-top: 0.5em; padding-left: 1em">
<a href="/psf/donations/"><img width="116" height="42"
src="/images/donate.png" alt="" title="" /></a>
</div>

The whole thing is a "div" or division. It's an individual chunk of the
page that might be placed almost anywhere within a bigger div or the
page itself. It has a style attribute, which gives hints to the browser
about what it wants. More commonly, the style will be indirected
through a separate css page.

It has an "a" tag, which shows a link. The link may be underlined, but
the css or the browser may override that. The url for the link is
specified in the 'src' attribute, the tooltip is specified in the alt
attribute. This is enclosing an 'img' tag, which describes a png image
file to be displayed, and specifies the scaling for it.
<h4><a href="/about/help/">Help</a></h4>

The h4 tag refers to css which specifies various things about how
this'll display. It's usually used for making larger and smaller
versions of text for titles and such.
<link rel="stylesheet" type="text/css" media="screen"
id="screen-switcher-stylesheet"
href="/styles/screen-switcher-default.css" />

This points to a css file, which refers to another one, called
styles.css. That's where you can see the style definition for h4:

H1,H2,H3,H4,H5 {
    font-family: Georgia, "Bitstream Vera Serif",
                 "New York", Palatino, serif;
    font-weight: normal;
    line-height: 1em;
}

This defines the common attributes for all the Hn series. Then they are
refined and overridden by:

H4
{
    font-size: 125%;
    color: #366D9C;
    margin: 0.4em 0 0.0em 0;
}

So we see that H4 is 25% bigger than default. Similarly H3 is 35%, and
H2 is 40% bigger.

It's a very complicated topic, and I wish you luck on it. But it's not
clear that the first step should involve any Python programming. I got
all the above just with Chrome in its default setup. I haven't even
mentioned things like the Tools->DeveloperTools, or other stuff you
could get via plugins.

If you're copying these files with a view to running them locally,
realize that for most websites, you need lots of installed software to
support being a webserver. If you're writing your own, you can start
simple, and maybe never need any of the extra tools. For example, on my
own website, I only needed static pages. So the Python code I wrote was
used to generate the web pages, which are then uploaded as-is to the
site. They can be tested locally by simply making up a url which starts

file://

instead of

http://
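
For instance, from Python itself (the path here is a made-up placeholder):

import webbrowser

# Open a locally generated page straight from disk; no server involved.
webbrowser.open('file:///Users/me/mysite/index.html')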

But as soon as I want database features, or counters, or user accounts,
or data entry, or randomness, I might add code that runs on the server,
and that's a lot trickier. Probably someone who has done it can tell us
I'm all wet, though.
 
Q

qoresucks

Initially I was just trying the HTML, but later, when I attempted more complicated sites that weren't my own, I noticed that large chunks of the site were lost in the process. The urllib code above is essentially what I was trying, but it didn't work as I had expected.

To be more specific, after I got it working for my own little page, I attempted to take it further and get all the lessons from Learn Python The Hard Way. When I tried the same method on the first intro page to see if I was even getting it right, the HTML code was all there, but upon opening it I noticed the formatting was all wrong; the background colors were off, and the images, etc. were all missing. So clearly I misunderstood something, and it's something critical I need to understand.

As for the OS, I primarily use Mac OS, though I'm well versed in Linux and Windows if there is anything specific out there that might help.

As for which version of Python, I have been using Python 2 to learn on, as I heard that Python 3 was still largely unadopted due to a comparative lack of library support. Are people adopting it fast enough now that I should consider learning on 3 instead of 2?

Also, I'm not doing it so much for technical reasons; rather, I thought it would be something interesting and fun, to learn some form of internet/network programming. Granted, it's not the best approach, but I'm not really aware of too many others, and it does seem interesting to me.

Python programming probably isn't the best way to initially approach this, I agree, but I wasn't sure what to research to get a better grasp of network/internet/web programming, so I figured I would just dive in head first and figure things out. Reinforcing more programming while learning internet/network programming was my initial goal.

Thank you all for your responses though. :)
 
D

Dave Angel

Initially I was just trying the HTML, but later, when I attempted more complicated sites that weren't my own, I noticed that large chunks of the site were lost in the process. The urllib code above is essentially what I was trying, but it didn't work as I had expected.

To be more specific, after I got it working for my own little page, I attempted to take it further and get all the lessons from Learn Python The Hard Way. When I tried the same method on the first intro page to see if I was even getting it right, the HTML code was all there, but upon opening it I noticed the formatting was all wrong; the background colors were off, and the images, etc. were all missing.

So how are you opening this html? In a text editor that somehow added
colors? Or were you opening it in a browser? In order for a browser to
render a non-trivial page, it may need lots of files other than the
html. Colors for example can be specified inline, in the header, or in
an external css file. If the page was designed to use the external css,
and it's missing or not in the right location, then the browser is going
to get the colors wrong.

Further, if the location (url) is relative, then you can create a
similar directory structure, and the browser will find it. But if it's
absolute, then the browser is going to try to go out to the web to fetch
it. If it succeeds, then it's masking the fact that you haven't
downloaded the "whole web site."

The same is true for other external refs. It may be impossible to host
it elsewhere if there are any absolute urls.
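
One way to sort the two cases apart is with the standard urlparse module (Python 2; in 3.x the same function lives in urllib.parse):

from urlparse import urlparse

def is_absolute(url):
    # An absolute URL carries its own scheme or host.
    parts = urlparse(url)
    return bool(parts.scheme or parts.netloc)

print is_absolute('styles/screen.css')         # False: relative, safe to mirror
print is_absolute('http://python.org/about/')  # True: browser goes out to the web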
 
M

MRAB

[snip]
As for which version of Python, I have been using Python 2 to learn
on, as I heard that Python 3 was still largely unadopted due to a
comparative lack of library support. Are people adopting it fast
enough now that I should consider learning on 3 instead of 2?
[snip]
You should be concentrating on Python 3 unless you rely on a library
that hasn't been ported yet. Python 2 has stopped at Python 2.7. There
won't be a Python 2.8.
 
