Urllib's urlopen and urlretrieve

Discussion in 'Python' started by qoresucks@gmail.com, Feb 21, 2013.

  1. Guest

    I only just started Python, and given that I know nothing about network programming or internet programming of any kind really, I thought it would be interesting to try to write something that could create an archive of a website for myself. With this I started trying to use the urllib library; however, I am having a problem understanding why certain things won't work with urllib.urlretrieve, and with urllib.urlopen then reading.

    Why is it that when using urllib.urlopen then reading, or urllib.urlretrieve, I only get parts of the site, losing the formatting, images, etc.? How can I get around this?

    Lastly, while it's a bit off topic, I lack a good understanding of network programming as a whole. From making programs communicate to simply extracting data from URLs, I don't know where to even begin, which has led me to learning Python to better understand it and hopefully then carry it over to other languages I know. Can anyone give me some advice on where to begin learning this information? Even if it's in another language.
    Feb 21, 2013
    #1

  2. Dave Angel Guest

    On 02/21/2013 07:12 AM, wrote:
    > I only just started Python and given that I know nothing about network programming or internet programming of any kind really, I thought it would be interesting to try to write something that could create an archive of a website for myself.


    Please send your emails as text, not html; this is a text-based mailing
    list.

    To archive your website, use the rsync command. No need to write any
    code, as rsync will descend into all the directories as needed, and
    it'll get the actual website data, not the stuff that the web server
    feeds to the browsers.
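
    A sketch of that idea driven from Python (the host and paths here are
    made up; it assumes you have ssh access to the server):

    import subprocess

    # mirror the site's document root into ./site-backup; -a preserves
    # attributes and recurses, -v lists each file as it's copied
    subprocess.check_call([
        "rsync", "-av",
        "user@example.com:/var/www/mysite/",
        "site-backup/",
    ])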

    If for some reason you don't have rsync, you could use scp. But it
    doesn't seem to be able to preserve attributes. It's also not smart
    enough to only copy stuff that's been changed, when you want to update
    incrementally.


    --
    DaveA
    Dave Angel, Feb 21, 2013
    #2

  3. rh Guest

    On Thu, 21 Feb 2013 10:56:15 -0500
    Dave Angel <> wrote:
    > On 02/21/2013 07:12 AM, wrote:
    > > I only just started Python and given that I know nothing about
    > > network programming or internet programming of any kind really, I
    > > thought it would be interesting to try to write something that could
    > > create an archive of a website for myself.

    > To archive your website, use the rsync command. No need to write any
    > code, as rsync will descend into all the directories as needed, and
    > it'll get the actual website data, not the stuff that the web server
    > feeds to the browsers.


    How many websites let you suck down their content using rsync???
    The request was for creating their own copy of a website.

    >
    > If for some reason you don't have rsync, you could use scp. But it
    > doesn't seem to be able to preserve attributes. It's also not smart
    > enough to only copy stuff that's been changed, when you want to
    > update incrementally.


    Ditto of above.

    And how does this help someone just learning the language?
    rh, Feb 21, 2013
    #3
  4. rh Guest

    On Thu, 21 Feb 2013 04:12:52 -0800 (PST)
    wrote:

    > I only just started Python and given that I know nothing about
    > network programming or internet programming of any kind really, I
    > thought it would be interesting to try to write something that could
    > create an archive of a website for myself. With this I started trying
    > to use the urllib library; however, I am having a problem
    > understanding why certain things won't work with the
    > urllib.urlretrieve and urllib.urlopen then reading.
    >
    > Why is it that when using urllib.urlopen then reading or
    > urllib.urlretrieve, does it only give me parts of the sites, losing
    > the formatting, images, etc...? How can I get around this?


    In Python 2.7 the standard library module to use is urllib2; in 3.3 it
    is urllib.request. Straight from the doc page:

    import urllib2

    f = urllib2.urlopen('http://www.python.org/')
    print f.read(100)   # show the first 100 bytes of the page

    which prints:

    <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
    <?xml-stylesheet href="./css/ht2html
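
    For comparison, a minimal sketch of the Python 3 spelling of the same
    call (there urlopen lives in urllib.request and read() returns bytes,
    hence the b'...' prefix when printed):

    import urllib.request

    f = urllib.request.urlopen('http://www.python.org/')
    print(f.read(100))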

    And so your journey begins. With recursing into links, etc., etc.
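
    One way to take that first step (a minimal sketch using only the
    Python 2 stdlib; the start URL and the one-level depth are just
    illustrative choices):

    import urllib2
    from HTMLParser import HTMLParser
    from urlparse import urljoin

    class LinkCollector(HTMLParser):
        """Collect the href of every <a> tag seen while parsing."""
        def __init__(self):
            HTMLParser.__init__(self)
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == 'a':
                for name, value in attrs:
                    if name == 'href' and value:
                        self.links.append(value)

    start = 'http://www.python.org/'
    parser = LinkCollector()
    parser.feed(urllib2.urlopen(start).read())

    # relative hrefs must be resolved against the page they came from
    # before they can be fetched in turn
    for link in parser.links[:10]:
        print urljoin(start, link)
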
    >
    > Lastly, while its a bit off topic, I lack a good understanding of
    > network programming as a whole. From making programs communicate or
    > to simply extract data from URL's, I don't know where to even begin,
    > which has lead me to learning python to better understand it
    > hopefully then carry it over to other languages I know. Can anyone
    > give me some advice on where to begin learning this information? Even
    > if its in another language.


    Also, since you're new, you may want to work with Python 3, though
    that's not a requirement.

    There are lots of free books online; search this list for links.
    (You can search this list at gmane and probably elsewhere.)
    rh, Feb 21, 2013
    #4
  5. Dave Angel Guest

    On 02/21/2013 12:47 PM, rh wrote:
    > On Thu, 21 Feb 2013 10:56:15 -0500
    > Dave Angel <> wrote:
    >> On 02/21/2013 07:12 AM, wrote:
    >>> I only just started Python and given that I know nothing about
    >>> network programming or internet programming of any kind really, I
    >>> thought it would be interesting to try to write something that could
    >>> create an archive of a website for myself.

    >> To archive your website, use the rsync command. No need to write any
    >> code, as rsync will descend into all the directories as needed, and
    >> it'll get the actual website data, not the stuff that the web server
    >> feeds to the browsers.

    >
    > How many websites let you suck down their content using rsync???
    > The request was for creating their own copy of a website.
    >


    Clearly this was his own website, since it's usually unethical to "suck
    down" someone else's. And my message specifically said "To archive
    *your* website..." As to the implied question of why, since he
    presumably has the original sources, I can only relate my own
    experience. I generate mine with a Python program, but over time obsolete
    files are left behind. Additionally, an overzealous SEO person
    hand-edited my files. And finally, I reinstalled my system from scratch
    a couple of months ago. So in order to see exactly what's out there, I
    used rsync, about two weeks ago.


    --
    DaveA
    Dave Angel, Feb 21, 2013
    #5
  6. Dave Angel Guest

    On 02/21/2013 07:12 AM, wrote:
    >
    > <snip>
    > Why is it that when using urllib.urlopen then reading or urllib.urlretrieve, does it only give me parts of the sites, losing the formatting, images, etc...? How can I get around this?
    >


    Start by telling us if you're using Python2 or Python3, as this library
    is different for different versions. Also what OS, as there are lots of
    useful utilities in Unix, and a different set in Windows or other
    places. Even if the same program exists on both, it's likely to be
    named differently.

    My earlier reply assumed you were trying to get an accurate copy of your
    website, presumably because your own local copy had gotten out of synch.
    rh assumed differently, so I'll try again. If you're trying to
    download someone else's, you should realize that you may be violating
    copyright, and ought to get permission. It's one thing to extract a
    file or two, but another entirely to try to capture the entire site.
    And many sites consider all of the details proprietary. Others consider
    the images proprietary, and enforce the individual copyrights.

    You can indeed copy individual files with urllib or urllib2, but that's
    just the start of the problem. A typical web page is written in html
    (or xhtml, or ...), and displaying it is the job of a browser, not the
    cat command. In addition, the page will generally refer to lots of
    other files, with the most common being a css file and a few jpegs. So
    you have to parse the page to find all those dependencies, and copy them
    as well.

    Next, the page may contain code (e.g. PHP, JavaScript), or it may be code
    (e.g. Python or Perl). In each of those cases, what you'll get isn't
    exactly what you'd expect. If you try to fetch a python program,
    generally what happens is it gets run, and you fetch its stdout instead.
    On the other hand javascript gets executed by the browser, and I don't
    know where php gets executed, or by whom. Finally, the page may make
    use of resources which simply won't be visible to you without becoming a
    hacker. Like my rsync and scp examples, you'll probably need a userid
    and password to get into the guts.

    If you want to play with some of this without programming, you could go
    to your favorite browser, and View->Source. The method of doing that
    varies with browser brand, version & OS, but it should be there on some
    menu someplace. In Chrome, it's Tools->ViewSource.

    Examples below extracted from the main page at python.org

    <title>Python Programming Language &ndash; Official Website</title>

    That simply sets the title for the page. It is not even part of the
    body, it's part of the header for the page. In this case, the header
    continues for 77 lines, including meta tags, javascript stuff, css
    stuff, etc.

    You might observe that angle brackets are used to enclose explicit kinds
    of data. In the above example, it's a "title" element. And it's
    enclosed with <title> and </title>

    In xhtml, these will always come in pairs, like curly braces in C
    programming. However, most web pages are busted, so parsing it is
    sometimes troublesome. Most people seem to recommend Beautiful Soup, in
    part because it tolerates many kinds of errors.
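
    For instance, a small sketch of that dependency hunt with Beautiful
    Soup 4 (a third-party install; the URL is only an example):

    import urllib2
    from urlparse import urljoin
    from bs4 import BeautifulSoup  # pip install beautifulsoup4

    url = 'http://www.python.org/'
    soup = BeautifulSoup(urllib2.urlopen(url).read())

    # the external files a browser would also fetch: stylesheets,
    # scripts, and images
    deps = [tag.get('href') for tag in soup.find_all('link')]
    deps += [tag.get('src') for tag in soup.find_all(['script', 'img'])]

    for dep in deps:
        if dep:  # skip e.g. inline <script> blocks that have no src
            print urljoin(url, dep)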

    I'd get a good book on html programming, making sure it covers xhtml and
    css. But I don't know what to recommend, as everything in my arsenal is
    thoroughly dated.

    Much of the body is devoted to the complexity of setting up the page in
    a browser of variable size, varying fonts, user-overrides, etc. The
    following excerpt:

    > <div style="align:center; padding-top: 0.5em; padding-left: 1em">
    > <a href="/psf/donations/"><img width="116" height="42"
    > src="/images/donate.png" alt="" title="" /></a>
    > </div>


    The whole thing is a "div" or division. It's an individual chunk of the
    page that might be placed almost anywhere within a bigger div or the
    page itself. It has a style attribute, which gives hints to the browser
    about what it wants. More commonly, the style will be indirected
    through a separate css page.

    It has an "a" tag, which shows a link. The link may be underlined, but
    the css or the browser may override that. The url for the link is
    specified in the 'href' attribute, and the tooltip in the 'title'
    attribute. The "a" element encloses an 'img' tag, which names the png
    image file to display in its 'src' attribute, and specifies the
    scaling for it.

    > <h4><a href="/about/help/">Help</a></h4>


    The h4 tag refers to css which specifies various things about how
    this'll display. It's usually used for making larger and smaller
    versions of text for titles and such.

    > <link rel="stylesheet" type="text/css" media="screen"
    > id="screen-switcher-stylesheet"
    > href="/styles/screen-switcher-default.css" />


    This points to a css file, which refers to another one, called
    styles.css. That's where you can see the style definition for h4:


    > H1,H2,H3,H4,H5 {
    > font-family: Georgia, "Bitstream Vera Serif",
    > "New York", Palatino, serif;
    > font-weight:normal;
    > line-height: 1em;
    > }


    This defines the common attributes for all the Hn series. Then they are
    refined and overridden by:


    > H4
    > {
    > font-size: 125%;
    > color: #366D9C;
    > margin: 0.4em 0 0.0em 0;
    > }


    So we see that H4 is 25% bigger than default. Similarly H3 is 35%, and
    H2 is 40% bigger.

    It's a very complicated topic, and I wish you luck on it. But it's not
    clear that the first step should involve any Python programming. I got
    all the above just with Chrome in its default setup. I haven't even
    mentioned things like the Tools->DeveloperTools, or other stuff you
    could get via plugins.

    If you're copying these files with a view of being able to run them
    locally, realize that for most websites, you need lots of installed
    software to support being a webserver. If you're writing your own, you
    can start simple, and maybe never need any of the extra tools. For
    example, on my own website, I only needed static pages. So the Python
    code I used was to generate the web pages, which are then uploaded as is
    to the site. They can be tested locally by simply making up a url which
    starts

    file://

    instead of

    http://
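
    A related sketch, using only the Python 2 stdlib: if you do want a
    real http:// url for local testing, Python can serve a directory of
    static files in a few lines.

    import SimpleHTTPServer
    import SocketServer

    # serve the files under the current directory at http://localhost:8000/
    handler = SimpleHTTPServer.SimpleHTTPRequestHandler
    httpd = SocketServer.TCPServer(("", 8000), handler)
    httpd.serve_forever()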

    But as soon as I want database features, or counters, or user accounts,
    or data entry, or randomness, I might add code that runs on the server,
    and that's a lot trickier. Probably someone who has done it can tell us
    I'm all wet, though.

    --
    DaveA
    Dave Angel, Feb 21, 2013
    #6
  7. Guest

    Initially I was just trying the html, but later, when I attempted more complicated sites that weren't my own, I noticed that large parts of the site were lost in the process. The urllib code essentially looks like what I was trying, but it didn't work as I had expected.

    To be more specific, after I got it working for my own little page, I attempted to take it further and get all the lessons from Learn Python The Hard Way. When I tried the same method on the first intro page to see if I was even getting it right, the html code was all there, but upon opening it I noticed the formatting was all wrong: the background colors were off, and images, etc. were all missing. So clearly I ended up misunderstanding something, and it's something critical I need to understand.

    As for the OS, I primarily use Mac OS, though I'm well versed in Linux and Windows if there is anything specific out there that might help.

    As for which version of Python, I have been using Python 2 to learn on, as I heard that Python 3 was still largely unadopted due to a lack of library support etc. by comparison. Are people adopting it fast enough now that I should consider learning on 3 instead of 2?

    Also, it isn't so much that I'm doing it for technical reasons; rather, I thought it would be something interesting and fun to learn some form of internet/network programming. Granted, it's not the best approach, but I'm not really aware of too many others, and it does seem interesting to me.

    Python programming probably isn't the best way to initially approach this, I agree, but I wasn't sure what to research to get a better grasp of network/internet/web programming, so I figured I would just dive in head first and figure things out. Reinforcing more programming while learning internet/network programming was my initial goal.

    Thank you all for your responses though. :)


    On Thursday, February 21, 2013 7:59:26 AM UTC-5, Michael Herman wrote:
    > Are you just trying to get the html? If so, you can use this code:
    >
    > import urllib
    >
    > # fetch and download a webpage, naming it test.html
    > urllib.urlretrieve("http://www.web2py.com/", filename="test.html")
    >
    > I recommend using the requests library, as it's easier to use and more powerful:
    >
    > import requests
    >
    > # retrieve the webpage
    > r = requests.get("http://www.web2py.com/")
    >
    > # write the content to test_requests.html
    > with open("test_requests.html", "wb") as code:
    >     code.write(r.content)
    >
    > If you want to get up to speed quickly on internet programming, I have a course I am developing. It's on kickstarter - http://kck.st/VQj8hq. The first section of the book dives into web fundamentals and internet programming.
    >
    > On Thu, Feb 21, 2013 at 4:12 AM, <> wrote:
    >
    > > I only just started Python and given that I know nothing about network programming or internet programming of any kind really, I thought it would be interesting to try to write something that could create an archive of a website for myself. With this I started trying to use the urllib library; however, I am having a problem understanding why certain things won't work with urllib.urlretrieve and urllib.urlopen then reading.
    > >
    > > Why is it that when using urllib.urlopen then reading, or urllib.urlretrieve, does it only give me parts of the sites, losing the formatting, images, etc.? How can I get around this?
    > >
    > > Lastly, while it's a bit off topic, I lack a good understanding of network programming as a whole. From making programs communicate to simply extracting data from URLs, I don't know where to even begin, which has led me to learning Python to better understand it and hopefully then carry it over to other languages I know. Can anyone give me some advice on where to begin learning this information? Even if it's in another language.
    Feb 22, 2013
    #7
  9. Dave Angel Guest

    On 02/22/2013 12:09 AM, wrote:
    > Initially I was just trying the html, but later, when I attempted more complicated sites that weren't my own, I noticed that large parts of the site were lost in the process. The urllib code essentially looks like what I was trying, but it didn't work as I had expected.
    >
    > To be more specific, after I got it working for my own little page, I attempted to take it further and get all the lessons from Learn Python The Hard Way. When I tried the same method on the first intro page to see if I was even getting it right, the html code was all there, but upon opening it I noticed the formatting was all wrong: the background colors were off, and images, etc. were all missing.


    So how are you opening this html? In a text editor that somehow added
    colors? Or were you opening it in a browser? In order for a browser to
    render a non-trivial page, it may need lots of files other than the
    html. Colors for example can be specified inline, in the header, or in
    an external css file. If the page was designed to use the external css,
    and it's missing or not in the right location, then the browser is going
    to get the colors wrong.

    Further, if the location (url) is relative, then you can create a
    similar directory structure, and the browser will find it. But if it's
    absolute, then the browser is going to try to go out to the web to fetch
    it. If it succeeds, then it's masking the fact that you haven't
    downloaded the "whole web site."

    The same is true for other external refs. It may be impossible to host
    it elsewhere if there are any absolute urls.
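
    A small illustration of that distinction with the stdlib's urljoin
    (the page and file names are made up; in Python 3 the function lives
    in urllib.parse):

    from urlparse import urljoin

    page = 'http://example.com/lessons/intro.html'

    # a relative reference resolves against the page it appears on...
    print urljoin(page, 'css/style.css')
    # -> http://example.com/lessons/css/style.css

    # ...while an absolute reference ignores your local copy entirely
    print urljoin(page, 'http://example.com/css/style.css')
    # -> http://example.com/css/style.css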

    --
    DaveA
    Dave Angel, Feb 22, 2013
    #9
  10. MRAB Guest

    [snip]
    > As for which version of Python, I have been using Python 2 to learn on
    > as I heard that Python 3 was still largely unadopted due to a lack of
    > library support etc... by comparison. Are people adopting it fast
    > enough now that I should consider learning on 3 instead of 2?
    >

    [snip]
    You should be concentrating on Python 3 unless you rely on a library
    that hasn't been ported yet. Python 2 has stopped at Python 2.7. There
    won't be a Python 2.8.
    MRAB, Feb 22, 2013
    #10
