Text over multiple lines

Rigga

Hi,

I am using HTMLParser to parse a web page. Part of the routine I need
to write (I am new to Python) involves looking for a particular tag and,
once I know the start and the end of the tag, assigning all the data
in between the tags to a variable. This is easy if the tag starts and ends
on the same line, but how would I go about doing it if it's split over
two or more lines?

Thanks

R
 
Pawel Kraszewski

Rigga said:
I am using the HTMLParser to parse a web page, part of the routine I need
to write (I am new to Python) involves looking for a particular tag and
once I know the start and the end of the tag then to assign all the data
in between the tags to a variable, this is easy if the tag starts and ends
on the same line however how would I go about doing it if it's split over
two or more lines?

Perhaps you should glue the whole text as a one long line?
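
A minimal sketch of the "one long line" idea, assuming the page is already held in a string. The sample markup is borrowed from later in this thread, and the variable names are illustrative only.

```python
# Sample text standing in for a page read from disk (an assumption for illustration).
html = ('<SPAN class=qv id=BusinessName\n'
        'title="Business Name">Some Shady Business\n'
        'Group Ltd.</SPAN>')

# Replace every line break with a single space so the whole tag sits on one line.
one_line = " ".join(html.splitlines())
print(one_line)
```

Joining with a space rather than an empty string keeps words on adjacent lines from running together.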
 
Nigel Rowe

Rigga said:
Hi,

I am using the HTMLParser to parse a web page, part of the routine I need
to write (I am new to Python) involves looking for a particular tag and
once I know the start and the end of the tag then to assign all the data
in between the tags to a variable, this is easy if the tag starts and ends
on the same line however how would I go about doing it if it's split over
two or more lines?

Thanks

R

Don't re-invent the wheel,
http://www.crummy.com/software/BeautifulSoup/
 
Diez B. Roggisch

Rigga said:
Hi,

I am using the HTMLParser to parse a web page, part of the routine I need
to write (I am new to Python) involves looking for a particular tag and
once I know the start and the end of the tag then to assign all the data
in between the tags to a variable, this is easy if the tag starts and ends
on the same line however how would I go about doing it if it's split over
two or more lines?


What difference does it make that the text is spread over more than one
line? Just collect the data in handle_data.
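
A rough sketch of what collecting the data in handle_data could look like. It uses the Python 3 module name (html.parser), and the class name SpanTextCollector and its attributes are made up for this example. The parser works on the text it is fed, not on lines, so a line break inside an element simply arrives as part of the data.

```python
from html.parser import HTMLParser  # Python 3; in Python 2 the module was named HTMLParser


class SpanTextCollector(HTMLParser):
    """Collect the text found inside each <span> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_span = False   # True while we are between <span> and </span>
        self.chunks = []       # text pieces of the span currently being read
        self.results = []      # finished text, one entry per span

    def handle_starttag(self, tag, attrs):
        if tag == "span":      # tag names arrive lower-cased
            self.in_span = True
            self.chunks = []

    def handle_data(self, data):
        if self.in_span:       # may be called more than once for one element
            self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self.in_span:
            self.results.append("".join(self.chunks))
            self.in_span = False


parser = SpanTextCollector()
parser.feed('<TD><SPAN class=qv id=BusinessName\n'
            'title="Business Name">Some Shady Business\n'
            'Group Ltd.</SPAN></TD></TR>')
parser.close()
print(parser.results)
```

The collected string still contains the original line break; whether to keep it or replace it with a space is up to the caller.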
 
Peter Hansen

Rigga said:
I want to do it manually as it will help with my understanding of Python,
any ideas how I go about it?

Wouldn't it help you a lot more then, to figure it out on your
own? ;-)

(If you are really looking for help improving your understanding
of Python, reading the source for BeautifulSoup and figuring out
how *it* does it will probably get you farther than figuring it
out yourself. If, on the other hand, it's a better understanding
of *programming* that you are after, then doing it yourself is
the best bet... IMHO )

-Peter
 
John Roth

Rigga said:
Hi,

I am using the HTMLParser to parse a web page, part of the routine I need
to write (I am new to Python) involves looking for a particular tag and
once I know the start and the end of the tag then to assign all the data
in between the tags to a variable, this is easy if the tag starts and ends
on the same line however how would I go about doing it if it's split over
two or more lines?

Thanks

Depending on exactly what I want to do, I frequently use <file>.read()
to pick up the entire file in one string, rather than <file>.readlines() to
create a list of strings. It works quite well when what I need to do
can be served by regexes (which is not always the case).
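
To illustrate the difference, here is a small sketch. io.StringIO stands in for an open file, the sample markup is taken from later in this thread, and the pattern is only an example.

```python
import io
import re

# io.StringIO stands in for an open file object (an assumption for illustration).
sample = '<SPAN class=qv id=EmployeeNo\ntitle="Employee Number">123456</SPAN>\n'
pattern = re.compile(r'<SPAN[^>]*>([^<]*)</SPAN>')

# Line by line: no single line contains the whole tag, so nothing matches.
lines = io.StringIO(sample).readlines()
per_line = [m.group(1) for line in lines for m in pattern.finditer(line)]

# Whole file as one string: the line break is just another character.
whole = io.StringIO(sample).read()
from_whole = pattern.findall(whole)

print(per_line)
print(from_whole)
```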

John Roth
 
Nelson Minar

Rigga said:
I am using the HTMLParser to parse a web page, part of the routine I need
to write (I am new to Python) involves looking for a particular tag and
once I know the start and the end of the tag then to assign all the data
in between the tags to a variable, this is easy if the tag starts and ends
on the same line however how would I go about doing it if it's split over
two or more lines?

I often have variants of this problem too. The simplest way to make it
work is to read all the HTML in at once with a single call to
file.read(), and then use a regular expression. Note that you probably
don't need re.MULTILINE, although you should take a look at what it
means in the docs just to know.

This works fine as long as you expect your files to be relatively
small (under a meg or so).
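
A short sketch of why re.MULTILINE is usually not the flag that matters here. The sample text and variable names are invented for the example. re.MULTILINE only changes where ^ and $ match, while re.DOTALL is the flag that lets "." cross a line break.

```python
import re

text = 'first line\n<b>bold\ntext</b>\nlast line'

# re.MULTILINE only changes ^ and $ so they also match at each line boundary.
starts_default = re.findall(r'^\w+', text)                  # start of string only
starts_multiline = re.findall(r'^\w+', text, re.MULTILINE)  # start of every line

# A negated character class such as [^<] already matches "\n", so no flag is needed.
with_class = re.search(r'<b>([^<]*)</b>', text).group(1)

# "." does not match "\n" unless re.DOTALL is given.
dot_default = re.search(r'<b>(.*?)</b>', text)
dot_all = re.search(r'<b>(.*?)</b>', text, re.DOTALL).group(1)

print(starts_default, starts_multiline)
print(repr(with_class), dot_default, repr(dot_all))
```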
 
Rigga

Nelson Minar said:
I often have variants of this problem too. The simplest way to make it
work is to read all the HTML in at once with a single call to
file.read(), and then use a regular expression. Note that you probably
don't need re.MULTILINE, although you should take a look at what it
means in the docs just to know.

This works fine as long as you expect your files to be relatively
small (under a meg or so).

I'm reading the entire file into a variable at the moment and passing it
through HTMLParser. I have run into another problem that I am having a
hard time working out. My data is in this format:

<TD><SPAN class=qv id=EmployeeNo
title="Employee Number">123456</SPAN></TD></TR>

Sometimes the data is spread over 3 lines like:

<TD><SPAN class=qv id=BusinessName
title="Business Name">Some Shady Business
Group Ltd.</SPAN></TD></TR></TBODY></TABLE></TD></TR>

The data I need is the text enclosed in quotes after title= and the
text after the > and before the </SPAN. In the case above that would
be: Some Shady Business
Group Ltd.

Running the file through HTMLParser, I discovered that the title= part
and the data I need are contained in a list, so I have done this:

Snippet of my code:

    from html.parser import HTMLParser  # in Python 2: from HTMLParser import HTMLParser

    class MyHTMLParser(HTMLParser):

        def handle_starttag(self, tag, attrs):
            print("Encountered the beginning of a %s tag" % tag)

        def handle_data(self, data):
            if "title=" in data:
                print("found title")

However I can not work out how to search through the data (which is in a
list) to pull out the data I need.
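
For reference, a hedged sketch of one way to read that list. HTMLParser passes the attributes to handle_starttag as a list of (name, value) tuples, so title= never shows up in handle_data. The class name TitleAndText and its attributes are hypothetical, and the module name is the Python 3 one.

```python
from html.parser import HTMLParser  # Python 3 name of the module


class TitleAndText(HTMLParser):
    """Pair each <span>'s title attribute with the text inside it (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.current_title = None  # title of the span being read, if any
        self.text = []             # text pieces seen inside that span
        self.found = {}            # maps title -> text

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            # attrs is a list of (name, value) tuples; names are lower-cased.
            self.current_title = dict(attrs).get("title")
            self.text = []

    def handle_data(self, data):
        if self.current_title is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self.current_title is not None:
            # Collapse line breaks and repeated spaces into single spaces.
            self.found[self.current_title] = " ".join("".join(self.text).split())
            self.current_title = None


parser = TitleAndText()
parser.feed('<TD><SPAN class=qv id=EmployeeNo\n'
            'title="Employee Number">123456</SPAN></TD></TR>\n'
            '<TD><SPAN class=qv id=BusinessName\n'
            'title="Business Name">Some Shady Business\n'
            'Group Ltd.</SPAN></TD></TR>')
parser.close()
print(parser.found)
```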

Sorry if this is a dumb question but hey I am learning!

Many thanks

Rigga
 
William Park

Rigga said:
I'm reading the entire file into a variable at the moment and passing
it through HTMLParser. I have run into another problem that I am
having a hard time working out. My data is in this format:

<TD><SPAN class=qv id=EmployeeNo
title="Employee Number">123456</SPAN></TD></TR>

Sometimes the data is spread over 3 lines like:

<TD><SPAN class=qv id=BusinessName
title="Business Name">Some Shady Business
Group Ltd.</SPAN></TD></TR></TBODY></TABLE></TD></TR>

The data I need is the text enclosed in quotes after title= and the
text after the > and before the </SPAN. In the case above that
would be: Some Shady Business Group Ltd.

Approach:

1. Extract '<SPAN ([^>]*)>([^<]*)</SPAN>' which is

<SPAN class=qv id=BusinessName
title="Business Name">Some Shady Business
Group Ltd.</SPAN>

with parenthesized groups giving

submatch[1]='class=qv id=BusinessName\ntitle="Business Name"'
submatch[2]='Some Shady Business\nGroup Ltd.'

2. Split submatch[1] into

class=qv
id=BusinessName
title="Business Name"

Homework:

Write a Python script.
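
One possible take on the homework, as a sketch rather than a definitive answer. It follows the two steps above using the same pattern. Using shlex.split for step 2 is my own choice and assumes the attribute values contain no stray quote characters.

```python
import re
import shlex

text = '''<TD><SPAN class=qv id=BusinessName
title="Business Name">Some Shady Business
Group Ltd.</SPAN></TD></TR></TBODY></TABLE></TD></TR>'''

# Step 1: same pattern as above; [^>] and [^<] also match line breaks.
match = re.search(r'<SPAN ([^>]*)>([^<]*)</SPAN>', text)
attr_text, body = match.group(1), match.group(2)

# Step 2: shlex.split breaks on whitespace, keeps quoted strings together
# and strips the quotes, leaving name=value items.
attrs = dict(item.split("=", 1) for item in shlex.split(attr_text))

title = attrs["title"]
name = " ".join(body.split())  # fold the line break into a single space
print(title)
print(name)
```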

Bash solution:

First, you need my patched Bash which can be found at

http://freshmeat.net/projects/bashdiff/

You need to patch the Bash shell and compile it. It has many Python-like
features, particularly regexes and arrays. The shell solution is

text='<TD><SPAN class=qv id=BusinessName
title="Business Name">Some Shady Business
Group Ltd.</SPAN></TD></TR></TBODY></TABLE></TD></TR>'

newf () { # Usage: newf match submatch1 submatch2
    eval $2 # --> class, id, title
    echo $title > title
    echo $3 > name
}
x=()
array -e '<SPAN ([^>]*)>([^<]*)</SPAN>' -E newf x "$text"
cat title
cat name

I can explain the steps, but it's rather long. :)
 
Rigga

Thanks for everyone's help. I have now worked out a way that works for
me. Your input has helped me immensely.

Many thanks

R
 
