How can I use lxml with win32com?

elca

Hello...
If anyone knows, please help me!
I really want to know... I searched Google many times,
but couldn't find a clear solution, partly because of my lack of Python
knowledge.
I want to use the IE.navigate function with BeautifulSoup or lxml.
If anyone knows about this, or has a sample,
please help me!
Thanks in advance.
 
Stefan Behnel

Hi,

elca, 25.10.2009 02:35:
Hello...
If anyone knows, please help me!
I really want to know... I searched Google many times,
but couldn't find a clear solution, partly because of my lack of Python
knowledge.
I want to use the IE.navigate function with BeautifulSoup or lxml.
If anyone knows about this, or has a sample,
please help me!
Thanks in advance.

You wrote a message with nine lines, only one of which gives a tiny hint on
what you actually want to do. What about providing an explanation of what
you want to achieve instead? Try to answer questions like: Where does your
data come from? Is it XML or HTML? What do you want to do with it?

This might help:

http://www.catb.org/~esr/faqs/smart-questions.html

Stefan
 
elca

Hello,
I'm very sorry.
First, my source comes from a website which consists mainly of HTML,
and I want to make a web scraper.
I found a script on the internet; the following script is supposed to
make BeautifulSoup and PAMIE work together, but when I run it this
error happens:

File "C:\test12.py", line 7, in <module>
    bs = BeautifulSoup(ie.pageText())
AttributeError: PAMIE instance has no attribute 'pageText'

And the following is the original source as I found it on the internet:

from BeautifulSoup import BeautifulSoup
from PAM30 import PAMIE

url = 'http://www.cnn.com'
ie = PAMIE(url)
bs = BeautifulSoup(ie.pageText())

If possible I really want to make BeautifulSoup or lxml work together
with PAMIE.
Sorry for my bad English.
Thanks in advance.
 
User

i want to make a web scraper.
if possible i really want to make BeautifulSoup or lxml work together
with PAMIE.

Scraping information from webpages breaks down into two tasks:

1. Getting the HTML data
2. Extracting information from the HTML data

It looks like you want to use Internet Explorer for getting the HTML
data; is there any reason you can't use a simpler approach like using
urllib2.urlopen()?

Once you have the HTML data, you could feed it into BeautifulSoup or
lxml.
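A minimal sketch of that two-step split, using only the standard library (shown in modern Python 3, where urllib2 became urllib.request; the hand-rolled parser class is a stand-in for BeautifulSoup or lxml, and the data: URL lets the example run without a network connection):

```python
import urllib.request
from html.parser import HTMLParser

# Step 1: get the HTML data (urllib2.urlopen in the Python 2 of this
# thread; urllib.request.urlopen in Python 3).
def fetch(url):
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

# Step 2: extract information from the HTML -- here, every link target.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

def extract_links(html):
    collector = LinkCollector()
    collector.feed(html)
    return collector.links

# A data: URL stands in for http://www.cnn.com so this runs offline.
html = fetch('data:text/html,<a%20href="/shop">CNN%20Shop</a>')
links = extract_links(html)
print(links)  # ['/shop']
```

Keeping fetch() and extract_links() separate means either half can be swapped out, e.g. the fetch for a browser-driven one, or the parser for BeautifulSoup.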

Mixing up 1 and 2 into a single statement created some confusion for
you, I think.

Greetings,
 
elca

Hello,
yes, there is a reason why I have to insist on the Internet Explorer
interface: because of JavaScript, I'm trying to use PAMIE.
I tried some other solutions, urlopen, mechanize and so on,
but with those it is hard to handle JavaScript.
Can you show me a sample? :)
For example, extracting the text 'CNN Shop' and
'Site map' at the bottom of the CNN website page, using PAMIE.
Thanks for your help.
 
User

because of JavaScript, I'm trying to use PAMIE.

I see, your problem is not with lxml or BeautifulSoup, but getting the
raw data in the first place.

i want to extract some text from the CNN website, such as the 'CNN Shop'
and 'Site map' links at the bottom of the CNN page

What text? Can you give an example? I'd like to be able to reproduce
it manually in the webbrowser so I get a clear idea what exactly
you're trying to achieve.

Greetings,
 
elca

Hello,
On www.cnn.com's main page, for example, if you view www.cnn.com's HTML
source, you can find a line of HTML like this:

http://www.turnerstoreonline.com/ CNN Shop

For example, I want to extract the 'CNN Shop' text from that HTML
source, and I want to add such a function to the following script:

from BeautifulSoup import BeautifulSoup
from PAM30 import PAMIE
from time import sleep

url = 'http://www.cnn.com'
ie = PAMIE(url)
sleep(10)
bs = BeautifulSoup(ie.getTextArea())
# from here I want to add the text-extraction function, using PAMIE
# and lxml or BeautifulSoup.

Thanks for your help.
 
User

www.cnn.com's main page,
for example, if you view www.cnn.com's HTML source, you can find a
line of HTML like this:
http://www.turnerstoreonline.com/ CNN Shop
and for example I want to extract the 'CNN Shop' text from the HTML source.

So, if I understand you correctly, you want your program to do the
following:

1. Retrieve the http://cnn.com webpage
2. Look for a link identified by the text "CNN Shop"
3. Extract the URL for that link.

The result would be http://www.turnerstoreonline.com

Is that what you want?

Greetings,
 
Michiel Overtoom

elca said:
yes, I want to extract this text 'CNN Shop' and the linked page
'http://www.turnerstoreonline.com'.

Well then.
First, we'll get the page using urllib2:

doc=urllib2.urlopen("http://www.cnn.com")

Then we'll feed it into the HTML parser:

soup=BeautifulSoup(doc)

Next, we'll look at all the links in the page:

for a in soup.findAll("a"):

and when a link has the text 'CNN Shop', we have a hit,
and print the URL:

if a.renderContents() == "CNN Shop":
    print a["href"]


The complete program is thus:

import urllib2
from BeautifulSoup import BeautifulSoup

doc=urllib2.urlopen("http://www.cnn.com")
soup=BeautifulSoup(doc)
for a in soup.findAll("a"):
    if a.renderContents() == "CNN Shop":
        print a["href"]


The example above can be condensed because BeautifulSoup's find function
can also look for texts:

print soup.find("a",text="CNN Shop")

and since that's a navigable string, we can ascend to its parent and
display the href attribute:

print soup.find("a",text="CNN Shop").findParent()["href"]

So eventually the whole program could be collapsed into one line:

print BeautifulSoup(urllib2.urlopen("http://www.cnn.com")).find("a", text="CNN Shop").findParent()["href"]

...but I think this is very ugly!
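The same find-a-link-by-its-text idea can also be sketched in modern Python 3 with nothing but the standard library (the renderContents/findParent API above is BeautifulSoup 3, Python 2 era; the class below is a hand-rolled stand-in, not BeautifulSoup's API):

```python
from html.parser import HTMLParser

# Stand-in for soup.find("a", text="CNN Shop").findParent()["href"]:
# remember the href of the <a> we are currently inside, and keep it
# when the anchor's text matches the wanted string.
class LinkByText(HTMLParser):
    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.current_href = None
        self.result = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.current_href = dict(attrs).get("href")

    def handle_data(self, data):
        if self.current_href is not None and data.strip() == self.wanted:
            self.result = self.current_href

    def handle_endtag(self, tag):
        if tag == "a":
            self.current_href = None

page = '<div><a href="http://www.turnerstoreonline.com/">CNN Shop</a></div>'
parser = LinkByText("CNN Shop")
parser.feed(page)
print(parser.result)  # http://www.turnerstoreonline.com/
```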

> I'm very sorry for my English.

Your English is quite understandable. The hard part is figuring out what
exactly you wanted to achieve ;-)

I have a question too. Why did you think JavaScript was necessary to
arrive at this result?

Greetings,
 
Stefan Behnel

elca, 25.10.2009 08:46:
I'm very sorry for my English.

It's fairly common in this news-group that people do not have a good level
of English, so that's perfectly ok. But you should try to provide more
information in your posts. Be explicit about what you tried and what failed
(and how!), and provide short code examples and exact copies of failure
messages whenever possible. That will help others in understanding what is
going on on your side. Remember that we can't look at your screen, nor read
your mind.

Oh, and please don't top-post in replies.

Stefan
 
elca

Hello,
thanks for your reply.
Actually, the website I want to parse is in a different language,
so I quoted a common English website to make it easy to understand. :)
By the way, is it possible to make PAMIE and BeautifulSoup work
together?
Thanks a lot.


 
Michiel Overtoom

elca said:
actually, the website I want to parse is in a different language.

A different website? What website? What text? Please show your actual
use case, instead of smokescreens.

so I quoted a common English website to make it easy to understand. :)

And, did you learn something from it? Were you able to apply the
technique to the other website?

by the way, is it possible to make PAMIE and BeautifulSoup work
together?

If you define 'working together' as 'PAMIE produces the HTML text and
BeautifulSoup parses it', then yes, that should be possible.

Greetings,
 
elca

Hello,
Actually, what I want is this:
if you run my script you reach this page,
'http://news.search.naver.com/search.naver?sm=tab_hty&where=news&query=korea+times&x=0&y=0'.
That is a Korean portal site where I searched with the keyword 'korea times',
and I want to scrape the results into a text file named 'blogscrap_save.txt'.
If you run this script, you can see the following article:

"Yesan County: How do you like them apples?
Korea Herald |
carp fishing at the Yedang Reservoir -
Korea`s biggest - taking a nice stroll...
During the curator`s recitation of Yun`s life and times as a resistance
and freedom fighter,
he would emphasize random ...
"

and also the following article, and so on:
"
10,000 Nepalese Diaspora Emerging in Korea
Korea Times World | 2009.10.23 (Fri) 9:31 PM
Although the Nepalese community in Korea is worker dominated,
there are... yoga is popular among Nepalese. These festivals are the
times when expatriate Nepalese feel nostalgic for their... "

So the actual process to scrape the site is:
first I search with a keyword, then I want to save the resulting
articles as plain text.

I attached the script I am currently writing, but it doesn't work well;
especially the extraction part is really hard for a novice like me. :)
Thanks in advance.

http://www.nabble.com/file/p26046215/untitled-1.py untitled-1.py
 
paul

elca said:
Hello,

following is the script which should make BeautifulSoup and PAMIE work
together, but when I run it this error happens:

File "C:\test12.py", line 7, in <module>
    bs = BeautifulSoup(ie.pageText())
AttributeError: PAMIE instance has no attribute 'pageText'

Hi,
you could execute the script line by line in the Python console, then
after the line "ie = PAMIE(url)" look at the "ie" object with "dir(ie)"
to check if it really looks like a healthy instance. ...Got bored, just
tried it -- looks like pageText() has been renamed to getPageText().
Try:

text = PAMIE('http://www.cnn.com').getPageText()

cheers
Paul
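Paul's dir() technique can be sketched against a stand-in class (PAMIE itself is Windows-only and drives Internet Explorer, so a fake object with the renamed method is used here for illustration):

```python
# Stand-in for a PAMIE-like object whose method was renamed between
# versions: the old pageText() is gone, getPageText() replaced it.
class FakePAMIE:
    def getPageText(self):
        return "<html>...</html>"

ie = FakePAMIE()

# dir(ie) lists every attribute, so a renamed method is easy to spot.
public = [name for name in dir(ie) if not name.startswith("_")]
print(public)                      # ['getPageText']
print(hasattr(ie, "pageText"))     # False: the old name no longer exists
print(hasattr(ie, "getPageText"))  # True: call this one instead
```

The same dir()/hasattr() inspection works on any object at the interactive prompt, which is exactly how the renaming was discovered here.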
 
elca

Hi,
thanks a lot.
Studying alone is a tough thing. :)
How can I improve my skill?

 
paul

elca said:
Hi,
thanks a lot.
studying alone is tough thing :)
how can i improve my skill...

1. Stop top-posting.
2. Read documentation.
3. Use the interactive prompt.

cheers
Paul
 
elca

paul said:
Hi,
thanks a lot.
studying alone is tough thing :)
how can i improve my skill...

1. Stop top-posting.
2. Read documentation
3. Use the interactive prompt

cheers
Paul


hello,
I'm sorry, I'm also not familiar with newsgroups.
So this position is the bottom-posting position?
If I'm wrong, correct me.
Thanks. In addition, just before you sent that, I was testing

text = PAMIE('http://www.naver.com').getPageText()

I have a question...
How can I keep only one window open, instead of opening several windows?
The following is my scenario:
after opening www.cnn.com I want to go to
http://www.cnn.com/2009/US/10/24/teen.jane.doe/index.html
while keeping only one window open.

text = PAMIE('http://www.cnn.com').getPageText()
sleep(5)
text = PAMIE('http://www.cnn.com/2009/US/10/24/teen.jane.doe/index.html')

thanks in advance :)
 
Michiel Overtoom

elca said:
I'm sorry, I'm also not familiar with newsgroups.

It's not a newsgroup, but a mailing list. And if you're new to a certain
community you're not familiar with, it's best to lurk a few days to see
how it is used.

so this position is the bottom-posting position?

It is, but you should also cut away any quoted text that is not
directly related to the answer. Otherwise people have to scroll through
many screens of text before they can see the answer.

> how can I keep only one window open, instead of opening several windows?

The trick is not to instantiate multiple PAMIE objects, but only one,
and to reuse it.
Like:

import time
import PAM30

ie = PAM30.PAMIE()

ie.navigate("http://www.cnn.com")
text1 = ie.getPageText()

ie.navigate("http://www.nu.nl")
text2 = ie.getPageText()

ie.quit()
print len(text1), len(text2)


But I still think it's unnecessary to use Internet Explorer to get
simple web pages. The standard library's urllib2.urlopen() works just
as well, and doesn't rely on Internet Explorer being present.
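For comparison, a urllib-only version of the same two fetches (shown with Python 3's urllib.request, which replaced urllib2; data: URLs stand in for the two sites so this sketch runs without a network or a browser):

```python
import urllib.request

def get_page_text(url):
    # Plain-HTTP counterpart of PAMIE's getPageText().
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

# data: URLs instead of http://www.cnn.com / http://www.nu.nl.
text1 = get_page_text("data:text/html,page%20one")
text2 = get_page_text("data:text/html,page%20two%20is%20longer")
print(len(text1), len(text2))  # 8 18
```

No window management is needed at all: each call is just an HTTP request, with no browser instance to reuse or quit.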

Greetings,
 
Irmen de Jong

Michiel said:
It's not a newsgroup, but a mailing list. And if you're new to a certain
community you're not familiar with, it's best to lurk a few days to see
how it is used.

Pot. Kettle. Black.
comp.lang.python really is a Usenet newsgroup. There is a mailing list
that mirrors the newsgroup, though.

-irmen
 
