What is the best way to "get" a web page?

Pete · Sep 24, 2006

I have the following code:

The file "temp.html" is created, but it doesn't look like the page at
www.python.org. I'm guessing there are multiple frames and my code did
not get everything. Can anyone point me to a tutorial or other
reference on how to "get" all of the html contents at a particular
page?

Why did Python print the line after "file.close"?

Thanks,
Pete

Paul McGuire · Sep 24, 2006

Pete said:
I have the following code:

The file "temp.html" is created, but it doesn't look like the page at
www.python.org. I'm guessing there are multiple frames and my code did
not get everything. Can anyone point me to a tutorial or other
reference on how to "get" all of the html contents at a particular
page?

Why did Python print the line after "file.close"?

Thanks,
Pete

A. You didn't actually invoke the close method, you simply referenced it,
which is why you got the output line after file.close. Python is not VB.
To call close, you have to follow it with ()'s, as in:

file.close()

This will have the added benefit of flushing the output to temp.html,
probably containing the missing content you were looking for.

B. Don't name variables "file", or "list", "str", "dict", "int", etc. Doing
so masks global names of builtin data types. Try "tempFile" instead.

-- Paul
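
As a small illustration of both points, here is a minimal sketch in current Python 3. The original listing is not shown in the thread, so the path and the name temp_file are hypothetical. In the interactive interpreter a bare temp_file.close evaluates to the bound method and prints its repr, which is the extra line Pete saw.

```python
import os
import tempfile

# Hypothetical output location, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "temp.html")

temp_file = open(path, "w")   # "temp_file", not "file", so no builtin name is shadowed
temp_file.write("<html></html>")

temp_file.close               # only references the bound method; nothing is called
still_open = not temp_file.closed   # the file has not been closed yet

temp_file.close()             # the parentheses perform the call and flush the data
now_closed = temp_file.closed

# A with-block closes the file automatically, even if an error occurs inside it.
with open(path) as saved:
    contents = saved.read()
```

Using the with-block from the start avoids the problem entirely, since there is no close call to forget.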

Pete · Sep 24, 2006

Paul McGuire said:

A. You didn't actually invoke the close method, you simply referenced it,
which is why you got the output line after file.close. Python is not VB.
To call close, you have to follow it with ()'s, as in:

file.close()

Ahhhh. Thank you very much!

This will have the added benefit of flushing the output to temp.html,
probably containing the missing content you were looking for.

B. Don't name variables "file", or "list", "str", "dict", "int", etc. Doing
so masks global names of builtin data types. Try "tempFile" instead.

Oh. Thanks again!
The file "temp.html" is definitely different than the first run, but
still not anything close to www.python.org . Any other suggestions?

Thanks,
Pete

George Sakkis · Sep 24, 2006

Pete said:
The file "temp.html" is definitely different than the first run, but
still not anything close to www.python.org . Any other suggestions?

If you mean that the page looks different in a browser, for one thing
you have to download the css files too. Here's the relevant extract
from the main page:

<link media="screen" href="styles/screen-switcher-default.css"
type="text/css" id="screen-switcher-stylesheet" rel="stylesheet" />
<link media="scReen" href="styles/netscape4.css" type="text/css"
rel="stylesheet" />
<link media="print" href="styles/print.css" type="text/css"
rel="stylesheet" />
<link media="screen" href="styles/largestyles.css" type="text/css"
rel="alternate stylesheet" title="large text" />
<link media="screen" href="styles/defaultfonts.css" type="text/css"
rel="alternate stylesheet" title="default fonts" />

You may either hardcode the urls of the css files, or parse the page,
extract the css links and normalize them to absolute urls. The first is
simpler but the second is more robust, in case a new css is added or an
existing one is renamed or removed.

George
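
The second approach (parse the page, extract the css links, normalize them to absolute urls) could look roughly like this in Python 3. This is only a sketch using the standard library; the class and function names are made up for the example, and it assumes the page has already been downloaded into a string.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class StylesheetLinkParser(HTMLParser):
    """Collects absolute URLs of <link> tags whose rel includes "stylesheet"."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.stylesheet_urls = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <link ... /> also arrive here.
        if tag != "link":
            return
        attributes = dict(attrs)
        rel_values = (attributes.get("rel") or "").lower().split()
        href = attributes.get("href")
        if "stylesheet" in rel_values and href:
            # Resolve relative hrefs like "styles/print.css" against the page URL.
            self.stylesheet_urls.append(urljoin(self.base_url, href))


def find_stylesheets(html_text, base_url):
    parser = StylesheetLinkParser(base_url)
    parser.feed(html_text)
    return parser.stylesheet_urls
```

Note that this also picks up rel="alternate stylesheet" links; filter those out if only the default styles are wanted.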

Wolfgang Keller · Sep 24, 2006

Pete said:
Can anyone point me to a tutorial or other reference on how to "get" all
of the html contents at a particular page?

Why not use httrack?

http://www.satzbau-gmbh.de/staff/abel/httrack-py/

Sincerely,

Wolfgang Keller

Pete · Sep 24, 2006

George Sakkis said:
If you mean that the page looks different in a browser, for one thing
you have to download the css files too. Here's the relevant extract
from the main page:

<link media="screen" href="styles/screen-switcher-default.css"
type="text/css" id="screen-switcher-stylesheet" rel="stylesheet" />
<link media="scReen" href="styles/netscape4.css" type="text/css"
rel="stylesheet" />
<link media="print" href="styles/print.css" type="text/css"
rel="stylesheet" />
<link media="screen" href="styles/largestyles.css" type="text/css"
rel="alternate stylesheet" title="large text" />
<link media="screen" href="styles/defaultfonts.css" type="text/css"
rel="alternate stylesheet" title="default fonts" />

You may either hardcode the urls of the css files, or parse the page,
extract the css links and normalize them to absolute urls. The first is
simpler but the second is more robust, in case a new css is added or an
existing one is renamed or removed.

George

Thanks for the information on CSS. I'll look into that later, but now
my question is on the first two lines of HTML code. Here's my latest
python code:

Here are the first two lines of temp.html:

1 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
2 <html lang="en" xml:lang="en"
xmlns="http://www.w3.org/1999/xhtml">

Here are the first two lines of www.python.org as saved from Firefox:

1 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
2 <html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml"
lang="en"><head>

Line one is identical in both files, but line two is different. Why would
line two differ? Hmmmm...

Thanks,
Pete

Pete · Sep 24, 2006

Wolfgang Keller said:
Why not use httrack?

http://www.satzbau-gmbh.de/staff/abel/httrack-py/

Sincerely,

Wolfgang Keller

Thanks for the tip. I'll check that out. Is that your code?

Rainy · Sep 24, 2006

Pete said:
Thanks for the information on CSS. I'll look into that later, but now
my question is on the first two lines of HTML code. Here's my latest
python code:

Here are the first two lines of temp.html:

1 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
2 <html lang="en" xml:lang="en"
xmlns="http://www.w3.org/1999/xhtml">

Here are the first two lines of www.python.org as saved from Firefox:

1 <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
2 <html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml"
lang="en"><head>

Line one is identical in both files, but line two is different. Why would
line two differ? Hmmmm...

Functionally they are the same, except that Firefox's copy also has the
start of the third line (<head>) on line two. Opera's View Source command
produces the same result as Python. It looks like Firefox makes some
cosmetic changes to the source, but nothing that would change the way the
code works. Notice that the attributes in the second line are only
rearranged in order?
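
One way to check that the two tags differ only in attribute order is to parse each and compare the attributes as dictionaries, where order does not matter. A rough sketch in Python 3 using the standard library; the helper names are invented for this example, and the two strings are the line-two contents quoted above, joined onto one line.

```python
from html.parser import HTMLParser


class HtmlTagAttributes(HTMLParser):
    """Remembers the attributes of the first <html> tag it sees."""

    def __init__(self):
        super().__init__()
        self.attributes = None

    def handle_starttag(self, tag, attrs):
        if tag == "html" and self.attributes is None:
            self.attributes = dict(attrs)


def html_attributes(markup):
    parser = HtmlTagAttributes()
    parser.feed(markup)
    return parser.attributes


from_python = '<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">'
from_firefox = '<html xml:lang="en" xmlns="http://www.w3.org/1999/xhtml" lang="en"><head>'

# Dictionaries compare equal regardless of the order the keys were added in.
same_attributes = html_attributes(from_python) == html_attributes(from_firefox)
```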

Felipe Almeida Lessa · Sep 24, 2006

Rainy said:
Functionally they are the same, but third line included in Firefox.
Opera View Source command produces the same result as Python.

[snip]

It's better to compare with the result of a downloader-only (instead
of a parser), like wget on Unix. That way you'll get exactly the same
bytes (assuming the page is static).
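
A sketch of that comparison in Python 3: save the raw bytes with urllib.request, then compare them byte for byte with the file wget wrote. The function names are made up for the example, and the file name index.html is an assumption about what wget saved.

```python
import filecmp
import urllib.request


def save_page(url, path):
    """Write the response body to path exactly as received (binary mode)."""
    with urllib.request.urlopen(url) as response, open(path, "wb") as out:
        out.write(response.read())


def same_bytes(path_a, path_b):
    """True only if both files have identical contents."""
    # shallow=False forces a content comparison instead of trusting file metadata.
    return filecmp.cmp(path_a, path_b, shallow=False)


# Example use (needs network access, and assumes wget saved the page as index.html):
#   save_page("http://www.python.org/", "temp.html")
#   print(same_bytes("temp.html", "index.html"))
```

Opening the output file in binary mode matters here; text mode could translate line endings and make identical pages look different.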

Pete · Sep 24, 2006

Functionally they are the same, but third line included in Firefox.
Opera View Source command produces the same result as Python.

[snip]

It's better to compare with the result of a downloader-only (instead
of a parser), like wget on Unix. That way you'll get exactly the same
bytes (assuming the page is static).

Ahhhh. wget - most cool. My temp.html matches wget. Now to capture that
pesky css stuff...

Thanks,
Pete

Frithiof Andreas Jensen · Sep 25, 2006

Something like:

os.popen("wget -r -l 3 http://www.python.org/")

... wget understands the peculiarities of web pages so you don't have to.
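
In current Python 3, the same idea is usually written with subprocess rather than os.popen. A minimal sketch, assuming wget is installed and on the PATH; the function names are invented here, and the two extra options (--page-requisites to fetch css and images, --convert-links to make the saved copy browsable locally) are additions that the post above does not mention.

```python
import subprocess


def build_wget_command(url, depth=3):
    """Return the wget argument list for a recursive download limited to depth levels."""
    return [
        "wget",
        "--recursive",
        f"--level={depth}",
        "--page-requisites",   # also fetch stylesheets, images and similar files
        "--convert-links",     # rewrite links so the local copy works offline
        url,
    ]


def mirror(url, depth=3):
    """Run wget and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_wget_command(url, depth), check=True)


# Example use (needs network access and wget):
#   mirror("http://www.python.org/", depth=1)
```

Passing the arguments as a list avoids shell quoting problems if the URL contains characters such as & or ?.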
