Modification of a urllib2 object?


vincehofmeister

I have tried several approaches to the following problem.

This is what I have:

....
import ClientForm
import BeautifulSoup from BeautifulSoup


request = urllib2.Request('http://form.com/)

self.first_object = urllib2.open(request)

soup = BeautifulSoup(self.first_object)

forms = ClienForm.ParseResponse(self.first_object)


Now, when I do this, indexing forms raises an index error because no
forms are returned, but the BeautifulSoup object is populated fine.

Now, when I switch the order to this:


import ClientForm
import BeautifulSoup from BeautifulSoup


request = urllib2.Request('http://form.com/)

self.first_object = urllib2.open(request)

forms = ClienForm.ParseResponse(self.first_object)

soup = BeautifulSoup(self.first_object)

Now the form is returned correctly, but the BeautifulSoup object
comes back empty.

So what I can draw from this is that both methods erase the properties
of the object, so I tried importing the copy module and using
copy.deepcopy(self.first_object)...

This didn't work either.

Does anyone have any idea about this, or what I should do so that the
object does not get erased?

Thanks in advance for any advice.
 

George Sakkis

First off, please copy and paste working code; the above has several
syntax errors, so it can't raise IndexError (or anything else for that
matter).

> So what I can draw from this is both methods erase the properties of
> the object,

No, that's not the case. What happens is that the HTTP response object
returned by urllib2.urlopen() is read to the end by either
ClientForm.ParseResponse or BeautifulSoup, whichever is called first,
and the second call has nothing left to read.
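
To illustrate the read-once behaviour without touching the network, here is a minimal sketch. It assumes io.BytesIO (Python 2.6 or later; cStringIO.StringIO plays the same role on 2.5) as a stand-in for the response object, and the HTML content is made up for the example.

```python
import io

# Stand-in for the HTTP response: a file-like stream of bytes.
# (Assumption: a real response object is consumed the same way.)
response = io.BytesIO(b"<html><form action='/go'></form></html>")

first = response.read()    # consumes the whole stream
second = response.read()   # the stream is already at its end

print(len(first))
print(repr(second))
```

The second read comes back empty, which is what the second parser sees when both are handed the same response.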

The easiest solution is to save the request object and call
urllib2.urlopen() twice. Alternatively, check whether ClientForm has a
parse function that accepts strings instead of urllib2 responses, and
then read and save the HTML text explicitly.
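
The "read once, reuse the saved text" idea could be sketched roughly as follows. This is only an illustration: io.BytesIO stands in for the response, and read_once and as_stream are hypothetical helper names, not part of urllib2, ClientForm, or BeautifulSoup.

```python
import io


def read_once(response):
    """Read the body a single time and return it as bytes."""
    return response.read()


def as_stream(body):
    """Wrap the saved bytes in a new file-like object positioned at the start."""
    return io.BytesIO(body)


# Assumption: io.BytesIO stands in for the object returned by urlopen().
response = io.BytesIO(b"<html><form action='/go'></form></html>")
body = read_once(response)

# Each consumer gets its own stream, so neither one starves the other.
for_soup = as_stream(body).read()
for_forms = as_stream(body).read()
```

A parser that accepts a string can be given body directly; one that wants a file-like object can be given as_stream(body).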

HTH,
George
 

vincehofmeister

request = urllib2.Request(settings.register_page)

self.url_obj = urllib2.urlopen(request).read()

soup = BeautifulSoup(self.url_obj);

forms = ClientForm.ParseResponse(self.url_obj,
backwards_compat=False)

print forms

images = HtmlHelper.getCaptchaImages(soup)

self.webView.setHtml(str(soup))

#here we generate the popup dialog
Dialog = QtGui.QDialog()
ui = captcha_popup.Ui_Dialog()
ui.setupUi(Dialog, self)
ui.webView.setHtml(str(images[0]));
ui.webView_2.setHtml(str(images[1]));
Dialog.raise_()
Dialog.activateWindow()
Dialog.exec_()
Dialog.show()


Now I am getting this error:

Traceback (most recent call last):
File "C:\Python25\Lib\site-packages\PyQt4\POS Pounder\Oct7\oct.py",
line 1251, in createAccounts
forms = ClientForm.ParseResponse(self.url_obj,
backwards_compat=False)
File "C:\Python25\lib\site-packages\clientform-0.2.9-py2.5.egg
\ClientForm.py", line 1054, in ParseResponse
AttributeError: 'str' object has no attribute 'geturl'
 

George Sakkis

Did you read what I wrote? ClientForm.ParseResponse() expects a
response object, not a string. Browsing through its docs, it seems
there is an alternative parsing function, ClientForm.ParseFile(file,
base_uri, ...).

The following should work (untested):

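import urllib2
from cStringIO import StringIO

import ClientForm
from BeautifulSoup import BeautifulSoup

request = urllib2.Request(settings.register_page)
response = urllib2.urlopen(request)
# Read the body once and reuse the saved text for both parsers.
text = response.read()
soup = BeautifulSoup(text)
# ParseFile takes a file-like object plus the base URI.
forms = ClientForm.ParseFile(StringIO(text), response.geturl(),
                             backwards_compat=False)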

HTH,
George
 

jowillia

Hello George,

I seem to be running into the same problem as Vince. Your solution
seems very good, but ClientForm gets a little bit more from the handle
than just the text.

Hello George,

When running your code in my program, which is doing something very
similar to Vince's, I get:

AttributeError: 'cStringIO.StringI' object has no attribute 'geturl'

This makes perfect sense given the way ClientForm handles responses.
It seems that, short of figuring out how to deepcopy the handle,
you're going to be stuck making the request twice. But this is going
to hit the URL (server) twice, which I would say is a bad idea.
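
One possible way around the second request, offered only as a sketch: keep the body in memory and wrap it in a small object that still answers geturl(), since the tracebacks in this thread show that is the attribute being looked up. ReplayableResponse is a hypothetical class, not part of any library, and whether ClientForm needs anything beyond read() and geturl() is an assumption that would have to be checked. It uses io.BytesIO (Python 2.6 or later).

```python
import io


class ReplayableResponse(object):
    """Hypothetical wrapper: an in-memory copy of a response body
    that still answers geturl() like the original response."""

    def __init__(self, body, url):
        self._stream = io.BytesIO(body)
        self._url = url

    def read(self, *args):
        # Delegate to the in-memory stream; an optional size is passed through.
        return self._stream.read(*args)

    def geturl(self):
        # Return the URL that was saved from the real response.
        return self._url

    def rewind(self):
        # Move back to the start so another consumer can read the body again.
        self._stream.seek(0)


# Made-up body and URL, for illustration only.
copy_a = ReplayableResponse(b"<html></html>", "http://example.com/")
data = copy_a.read()
copy_a.rewind()
```

With the real response, the body and URL would come from response.read() and response.geturl(), read once; each parser then gets its own wrapper (or the same one after rewind()), so the server is hit only a single time.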

I've been struggling with this issue for some time now, and this is
the first place I've found a solid discussion about it.

-Josh
 
