How can I use lxml with win32com?

elca

Hello...
If anyone knows, please help me!
I really want to know... I searched Google many times,
but couldn't find a clear solution, partly because of my lack of Python
knowledge.
I want to use the IE.navigate function with BeautifulSoup or lxml.
If anyone knows about this, or has a sample,
please help me!
Thanks in advance.
 
Stefan Behnel

Hi,

elca, 25.10.2009 02:35:
Hello...
If anyone knows, please help me!
I really want to know... I searched Google many times,
but couldn't find a clear solution, partly because of my lack of Python
knowledge.
I want to use the IE.navigate function with BeautifulSoup or lxml.
If anyone knows about this, or has a sample,
please help me!
Thanks in advance.

You wrote a message with nine lines, only one of which gives a tiny hint on
what you actually want to do. What about providing an explanation of what
you want to achieve instead? Try to answer questions like: Where does your
data come from? Is it XML or HTML? What do you want to do with it?

This might help:

http://www.catb.org/~esr/faqs/smart-questions.html

Stefan
 
elca

Hello,
I'm very sorry.
First, my source comes from a website which consists mainly of HTML,
and I want to make a web scraper.
I found a script on the internet; the following script is supposed to
make BeautifulSoup and PAMIE work together, but when I run it this
error happens:

File "C:\test12.py", line 7, in <module>
    bs = BeautifulSoup(ie.pageText())
AttributeError: PAMIE instance has no attribute 'pageText'

And the following is the original source as I found it on the internet:

from BeautifulSoup import BeautifulSoup
from PAM30 import PAMIE

url = 'http://www.cnn.com'
ie = PAMIE(url)
bs = BeautifulSoup(ie.pageText())

If possible I really want to make BeautifulSoup or lxml work together
with PAMIE.
Sorry for my bad English.
Thanks in advance.
 
User

i want to make a web scraper.
if possible i really want to make BeautifulSoup or lxml work together
with PAMIE.

Scraping information from webpages breaks down into two tasks:

1. Getting the HTML data
2. Extracting information from the HTML data

It looks like you want to use Internet Explorer for getting the HTML
data; is there any reason you can't use a simpler approach like using
urllib2.urlopen()?

Once you have the HTML data, you could feed it into BeautifulSoup or
lxml.
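A minimal sketch of that two-step split, using only the standard library (shown in modern Python 3, where urllib2 became urllib.request; the hand-rolled parser class is a stand-in for BeautifulSoup or lxml, and the data: URL lets the example run without a network connection):

```python
import urllib.request
from html.parser import HTMLParser

# Step 1: get the HTML data (urllib2.urlopen in the Python 2 of this
# thread; urllib.request.urlopen in Python 3).
def fetch(url):
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

# Step 2: extract information from the HTML -- here, every link target.
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

def extract_links(html):
    collector = LinkCollector()
    collector.feed(html)
    return collector.links

# A data: URL stands in for http://www.cnn.com so this runs offline.
html = fetch('data:text/html,<a%20href="/shop">CNN%20Shop</a>')
links = extract_links(html)
print(links)  # ['/shop']
```

Keeping fetch() and extract_links() separate means either half can be swapped out, e.g. the fetch for a browser-driven one, or the parser for BeautifulSoup.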

Mixing up 1 and 2 into a single statement created some confusion for
you, I think.

Greetings,
 
elca

Hello,
yes, there is a reason why I have to insist on the Internet Explorer
interface: because of JavaScript, I'm trying to use PAMIE.
I tried some other solutions, urlopen, mechanize and so on,
but with those it is hard to handle JavaScript.
Can you show me a sample? :)
For example, extracting the text 'CNN Shop' and
'Site map' at the bottom of the CNN website page, using PAMIE.
Thanks for your help.
 
User

because of JavaScript, I'm trying to use PAMIE.

I see, your problem is not with lxml or BeautifulSoup, but getting the
raw data in the first place.

i want to extract some text from the CNN website, such as the 'CNN Shop'
and 'Site map' links at the bottom of the CNN page

What text? Can you give an example? I'd like to be able to reproduce
it manually in the webbrowser so I get a clear idea what exactly
you're trying to achieve.

Greetings,
 
elca

Hello,
On www.cnn.com's main page, for example, if you view www.cnn.com's HTML
source, you can find a line of HTML like this:

http://www.turnerstoreonline.com/ CNN Shop

For example, I want to extract the 'CNN Shop' text from that HTML
source, and I want to add such a function to the following script:

from BeautifulSoup import BeautifulSoup
from PAM30 import PAMIE
from time import sleep

url = 'http://www.cnn.com'
ie = PAMIE(url)
sleep(10)
bs = BeautifulSoup(ie.getTextArea())
# from here I want to add the text-extraction function, using PAMIE
# and lxml or BeautifulSoup.

Thanks for your help.
 
User

www.cnn.com's main page,
for example, if you view www.cnn.com's HTML source, you can find a
line of HTML like this:
http://www.turnerstoreonline.com/ CNN Shop
and for example I want to extract the 'CNN Shop' text from the HTML source.

So, if I understand you correctly, you want your program to do the
following:

1. Retrieve the http://cnn.com webpage
2. Look for a link identified by the text "CNN Shop"
3. Extract the URL for that link.

The result would be http://www.turnerstoreonline.com

Is that what you want?

Greetings,
 
Michiel Overtoom

elca said:
yes, I want to extract this text 'CNN Shop' and the linked page
'http://www.turnerstoreonline.com'.

Well then.
First, we'll get the page using urllib2:

doc=urllib2.urlopen("http://www.cnn.com")

Then we'll feed it into the HTML parser:

soup=BeautifulSoup(doc)

Next, we'll look at all the links in the page:

for a in soup.findAll("a"):

and when a link has the text 'CNN Shop', we have a hit,
and print the URL:

if a.renderContents() == "CNN Shop":
    print a["href"]


The complete program is thus:

import urllib2
from BeautifulSoup import BeautifulSoup

doc=urllib2.urlopen("http://www.cnn.com")
soup=BeautifulSoup(doc)
for a in soup.findAll("a"):
    if a.renderContents() == "CNN Shop":
        print a["href"]


The example above can be condensed because BeautifulSoup's find function
can also look for texts:

print soup.find("a",text="CNN Shop")

and since that's a navigable string, we can ascend to its parent and
display the href attribute:

print soup.find("a",text="CNN Shop").findParent()["href"]

So eventually the whole program could be collapsed into one line:

print BeautifulSoup(urllib2.urlopen("http://www.cnn.com")).find("a", text="CNN Shop").findParent()["href"]

...but I think this is very ugly!
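The same find-a-link-by-its-text idea can also be sketched in modern Python 3 with nothing but the standard library (the renderContents/findParent API above is BeautifulSoup 3, Python 2 era; the class below is a hand-rolled stand-in, not BeautifulSoup's API):

```python
from html.parser import HTMLParser

# Stand-in for soup.find("a", text="CNN Shop").findParent()["href"]:
# remember the href of the <a> we are currently inside, and keep it
# when the anchor's text matches the wanted string.
class LinkByText(HTMLParser):
    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.current_href = None
        self.result = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.current_href = dict(attrs).get("href")

    def handle_data(self, data):
        if self.current_href is not None and data.strip() == self.wanted:
            self.result = self.current_href

    def handle_endtag(self, tag):
        if tag == "a":
            self.current_href = None

page = '<div><a href="http://www.turnerstoreonline.com/">CNN Shop</a></div>'
parser = LinkByText("CNN Shop")
parser.feed(page)
print(parser.result)  # http://www.turnerstoreonline.com/
```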

> I'm very sorry for my English.

Your English is quite understandable. The hard part is figuring out what
exactly you wanted to achieve ;-)

I have a question too. Why did you think JavaScript was necessary to
arrive at this result?

Greetings,
 
Stefan Behnel

elca, 25.10.2009 08:46:
I'm very sorry for my English.

It's fairly common in this news-group that people do not have a good level
of English, so that's perfectly ok. But you should try to provide more
information in your posts. Be explicit about what you tried and what failed
(and how!), and provide short code examples and exact copies of failure
messages whenever possible. That will help others in understanding what is
going on on your side. Remember that we can't look at your screen, nor read
your mind.

Oh, and please don't top-post in replies.

Stefan
 
elca

Hello,
thanks for your reply.
Actually, the website I want to parse is in a different language,
so I quoted a common English website to make it easy to understand. :)
By the way, is it possible to make PAMIE and BeautifulSoup work
together?
Thanks a lot.


 
Michiel Overtoom

elca said:
actually, the website I want to parse is in a different language.

A different website? What website? What text? Please show your actual
use case, instead of smokescreens.

so I quoted a common English website to make it easy to understand. :)

And, did you learn something from it? Were you able to apply the
technique to the other website?

by the way, is it possible to make PAMIE and BeautifulSoup work
together?

If you define 'working together' as 'PAMIE produces the HTML text and
BeautifulSoup parses it', then yes, that should be possible.

Greetings,
 
elca

Hello,
Actually, what I want is this:
if you run my script you reach this page,
'http://news.search.naver.com/search.naver?sm=tab_hty&where=news&query=korea+times&x=0&y=0'.
That is a Korean portal site where I searched with the keyword 'korea times',
and I want to scrape the results into a text file named 'blogscrap_save.txt'.
If you run this script, you can see the following article:

"Yesan County: How do you like them apples?
Korea Herald |
carp fishing at the Yedang Reservoir -
Korea`s biggest - taking a nice stroll...
During the curator`s recitation of Yun`s life and times as a resistance
and freedom fighter,
he would emphasize random ...
"

and also the following article, and so on:
"
10,000 Nepalese Diaspora Emerging in Korea
Korea Times World | 2009.10.23 (Fri) 9:31 PM
Although the Nepalese community in Korea is worker dominated,
there are... yoga is popular among Nepalese. These festivals are the
times when expatriate Nepalese feel nostalgic for their... "

So the actual process to scrape the site is:
first I search with a keyword, then I want to save the resulting
articles as plain text.

I attached the script I am currently writing, but it doesn't work well;
especially the extraction part is really hard for a novice like me. :)
Thanks in advance.

http://www.nabble.com/file/p26046215/untitled-1.py untitled-1.py
 
paul

elca said:
Hello,

following is the script which should make BeautifulSoup and PAMIE work
together, but when I run it this error happens:

File "C:\test12.py", line 7, in <module>
    bs = BeautifulSoup(ie.pageText())
AttributeError: PAMIE instance has no attribute 'pageText'

Hi,
you could execute the script line by line in the Python console, then
after the line "ie = PAMIE(url)" look at the "ie" object with "dir(ie)"
to check if it really looks like a healthy instance. ...Got bored, just
tried it -- looks like pageText() has been renamed to getPageText().
Try:

text = PAMIE('http://www.cnn.com').getPageText()

cheers
Paul
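Paul's dir() technique can be sketched against a stand-in class (PAMIE itself is Windows-only and drives Internet Explorer, so a fake object with the renamed method is used here for illustration):

```python
# Stand-in for a PAMIE-like object whose method was renamed between
# versions: the old pageText() is gone, getPageText() replaced it.
class FakePAMIE:
    def getPageText(self):
        return "<html>...</html>"

ie = FakePAMIE()

# dir(ie) lists every attribute, so a renamed method is easy to spot.
public = [name for name in dir(ie) if not name.startswith("_")]
print(public)                      # ['getPageText']
print(hasattr(ie, "pageText"))     # False: the old name no longer exists
print(hasattr(ie, "getPageText"))  # True: call this one instead
```

The same dir()/hasattr() inspection works on any object at the interactive prompt, which is exactly how the renaming was discovered here.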
 
elca

Hi,
thanks a lot.
Studying alone is a tough thing. :)
How can I improve my skill?

 
paul

elca said:
Hi,
thanks a lot.
studying alone is tough thing :)
how can i improve my skill...

1. Stop top-posting.
2. Read documentation.
3. Use the interactive prompt.

cheers
Paul
 
elca

paul said:
Hi,
thanks a lot.
studying alone is tough thing :)
how can i improve my skill...

1. Stop top-posting.
2. Read documentation
3. Use the interactive prompt

cheers
Paul


hello,
I'm sorry, I'm also not familiar with newsgroups.
So this position is the bottom-posting position?
If I'm wrong, correct me.
Thanks. In addition, just before you sent that, I was testing

text = PAMIE('http://www.naver.com').getPageText()

I have a question...
How can I keep only one window open, instead of opening several windows?
The following is my scenario:
after opening www.cnn.com I want to go to
http://www.cnn.com/2009/US/10/24/teen.jane.doe/index.html
while keeping only one window open.

text = PAMIE('http://www.cnn.com').getPageText()
sleep(5)
text = PAMIE('http://www.cnn.com/2009/US/10/24/teen.jane.doe/index.html')

thanks in advance :)
 
Michiel Overtoom

elca said:
I'm sorry, I'm also not familiar with newsgroups.

It's not a newsgroup, but a mailing list. And if you're new to a certain
community you're not familiar with, it's best to lurk a few days to see
how it is used.

so this position is the bottom-posting position?

It is, but you should also cut away any quoted text that is not
directly related to the answer. Otherwise people have to scroll through
many screens of text before they can see the answer.

> how can I keep only one window open, instead of opening several windows?

The trick is not to instantiate multiple PAMIE objects, but only one,
and to reuse it.
Like:

import time
import PAM30

ie = PAM30.PAMIE()

ie.navigate("http://www.cnn.com")
text1 = ie.getPageText()

ie.navigate("http://www.nu.nl")
text2 = ie.getPageText()

ie.quit()
print len(text1), len(text2)


But I still think it's unnecessary to use Internet Explorer to get
simple web pages. The standard library's urllib2.urlopen() works just
as well, and doesn't rely on Internet Explorer being present.
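For comparison, a urllib-only version of the same two fetches (shown with Python 3's urllib.request, which replaced urllib2; data: URLs stand in for the two sites so this sketch runs without a network or a browser):

```python
import urllib.request

def get_page_text(url):
    # Plain-HTTP counterpart of PAMIE's getPageText().
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

# data: URLs instead of http://www.cnn.com / http://www.nu.nl.
text1 = get_page_text("data:text/html,page%20one")
text2 = get_page_text("data:text/html,page%20two%20is%20longer")
print(len(text1), len(text2))  # 8 18
```

No window management is needed at all: each call is just an HTTP request, with no browser instance to reuse or quit.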

Greetings,
 
Irmen de Jong

Michiel said:
It's not a newsgroup, but a mailing list. And if you're new to a certain
community you're not familiar with, it's best to lurk a few days to see
how it is used.

Pot. Kettle. Black.
comp.lang.python really is a Usenet newsgroup. There is a mailing list
that mirrors the newsgroup, though.

-irmen
 
