Modification of a urllib2 object ?

vincehofmeister · Oct 10, 2008

I have tried several approaches to the following problem.

This is what I have:

....
import ClientForm
import BeautifulSoup from BeautifulSoup

request = urllib2.Request('http://form.com/)

self.first_object = urllib2.open(request)

soup = BeautifulSoup(self.first_object)

forms = ClienForm.ParseResponse(self.first_object)

Now, when I do this, forms gives an index error because no forms
are returned, but BeautifulSoup parses the page fine.

Now, when I switch the order to this:

import ClientForm
import BeautifulSoup from BeautifulSoup

request = urllib2.Request('http://form.com/)

self.first_object = urllib2.open(request)

forms = ClienForm.ParseResponse(self.first_object)

soup = BeautifulSoup(self.first_object)

Now the form is returned correctly, but the BeautifulSoup object
comes back empty.

So what I conclude from this is that both methods erase the contents of
the object, so I tried importing the copy module and using
copy.deepcopy(self.first_object)...

This didn't work either.

Does anyone have any idea about this, or what I should do so the object
does not get erased?

Thanks in advance for any advice.

George Sakkis · Oct 10, 2008

First off, please copy and paste working code; the above has several
syntax errors, so it can't raise IndexError (or anything else for that
matter).

> So what I can draw from this is both methods erase the properties of
> the object,

No, that's not the case. What happens is that the HTTP response object
returned by urllib2.urlopen() is read by ClientForm.ParseResponse or
BeautifulSoup, whichever runs first, and the second call has
nothing left to read.
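
The read-once behaviour is not specific to urllib2; any file-like stream works this way. A minimal sketch, using an in-memory io.BytesIO as a stand-in for the HTTP response (the markup is invented for illustration, and the code is written for Python 3 so it runs without a network connection):

```python
import io

# Stand-in for the object returned by urllib2.urlopen(); the HTML is invented.
response = io.BytesIO(b"<html><form action='/login'></form></html>")

first = response.read()    # the first consumer gets the whole document
second = response.read()   # the stream is now at EOF, so nothing is left

print(first)    # the full markup
print(second)   # b''
```

Whichever parser reads the stream first plays the role of `first` here; the other one sees only the empty result.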

The easiest solution is to save the request object and call
urllib2.urlopen twice. Alternatively, check whether ClientForm has a
parse method that accepts strings instead of urllib2 responses, and
then read and save the HTML text explicitly.
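
The "read and save the text" idea can be sketched roughly as follows. This is only an illustration in Python 3: io.BytesIO stands in for the real response, the markup is made up, and the names for_string_parser and for_file_parser are hypothetical placeholders for whatever each parsing library actually accepts.

```python
import io

# Stand-in for urllib2.urlopen(request); the markup is invented for illustration.
response = io.BytesIO(b"<html><form action='/register'></form></html>")

# Read the body exactly once and keep the bytes around.
html = response.read()

# Hand every consumer its own view of the saved bytes.
for_string_parser = html              # for a parser that accepts a string
for_file_parser = io.BytesIO(html)    # for a parser that wants a file-like object

print(for_file_parser.read() == for_string_parser)  # True
```

Because the bytes are kept in memory, the server is contacted only once no matter how many parsers consume them.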

HTH,
George

vincehofmeister · Oct 10, 2008

request = urllib2.Request(settings.register_page)

self.url_obj = urllib2.urlopen(request).read()

soup = BeautifulSoup(self.url_obj);

forms = ClientForm.ParseResponse(self.url_obj,
backwards_compat=False)

print forms

images = HtmlHelper.getCaptchaImages(soup)

self.webView.setHtml(str(soup))

#here we generate the popup dialog
Dialog = QtGui.QDialog()
ui = captcha_popup.Ui_Dialog()
ui.setupUi(Dialog, self)
ui.webView.setHtml(str(images[0]));
ui.webView_2.setHtml(str(images[1]));
Dialog.raise_()
Dialog.activateWindow()
Dialog.exec_()
Dialog.show()

Now I am getting this error:

Traceback (most recent call last):
  File "C:\Python25\Lib\site-packages\PyQt4\POS Pounder\Oct7\oct.py", line 1251, in createAccounts
    forms = ClientForm.ParseResponse(self.url_obj, backwards_compat=False)
  File "C:\Python25\lib\site-packages\clientform-0.2.9-py2.5.egg\ClientForm.py", line 1054, in ParseResponse
AttributeError: 'str' object has no attribute 'geturl'

George Sakkis · Oct 11, 2008

Did you read what I wrote? ClientForm.ParseResponse() expects a
response object, not a string. Browsing through its docs, it seems
there is an alternative parsing function, ClientForm.ParseFile(file,
base_uri, ...).
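
To see why a plain string fails, here is a small sketch in Python 3. The function parse_response_like is a hypothetical stand-in, not ClientForm's real code; it mimics only the one thing the traceback reveals, namely that the parser calls geturl() on whatever it is given.

```python
def parse_response_like(response):
    # Hypothetical stand-in: ask the object for its URL, then read its body.
    base_uri = response.geturl()
    return base_uri, response.read()

try:
    # Passing the already-read text instead of the response object.
    parse_response_like("<html></html>")
except AttributeError as exc:
    error = str(exc)
    print(error)
```

A str has neither geturl() nor read(), so the call fails on the first attribute lookup, which matches the error reported above.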

The following should work (untested):

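import urllib2
from cStringIO import StringIO

import ClientForm
from BeautifulSoup import BeautifulSoup

# settings is the poster's own module holding the page URL
request = urllib2.Request(settings.register_page)
response = urllib2.urlopen(request)
text = response.read()  # read the body once and keep it
soup = BeautifulSoup(text)
# ParseFile takes a file-like object plus the base URI
forms = ClientForm.ParseFile(StringIO(text), response.geturl(),
                             backwards_compat=False)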

HTH,
George

jowillia · Nov 22, 2008

Hello George,

I seem to be running into the same problem as Vince. Your solution
seems very good, but ClientForm gets a little bit more from the handle
than just the text.

When running your code in my program, which is doing something very
similar to Vince, I get:

AttributeError: 'cStringIO.StringI' object has no attribute 'geturl'

This makes perfect sense given the way ClientForm handles
requests. It seems that, short of figuring out how to deepcopy the
handle, you're going to be stuck making the request twice. But that
hits the URL (server) twice, which I would say is a bad idea.
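
One way to avoid the second request, sketched here in Python 3, is to read the body once and wrap the saved bytes in a small object that offers both read() and geturl(). ReplayableResponse and fresh_copy are hypothetical names invented for this illustration, and the URL and markup are made up. Whether ClientForm.ParseResponse needs more than these two methods (for example, header access) is an assumption that has not been checked here; the tracebacks in this thread only show geturl().

```python
import io

class ReplayableResponse:
    """Hypothetical helper: a body that was read once, plus the URL it came from."""

    def __init__(self, body, url):
        self._body = body
        self._url = url
        self._stream = io.BytesIO(body)

    def read(self, *args):
        return self._stream.read(*args)

    def geturl(self):
        return self._url

    def fresh_copy(self):
        # A new object over the same bytes, with its own read position.
        return ReplayableResponse(self._body, self._url)


# Stand-in for a single real network fetch; URL and markup are invented.
body = b"<html><form action='/register'></form></html>"
saved = ReplayableResponse(body, "http://example.com/register")

for_soup = saved.fresh_copy()
for_forms = saved.fresh_copy()

print(for_soup.read() == for_forms.read())  # True
print(for_forms.geturl())
```

Each consumer gets its own copy with an independent read position, so the order in which the parsers run no longer matters.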

I've been struggling with this issue for some time now, and this is
the first place I've found a solid discussion about it.

-Josh
