Text over multiple lines

Rigga · Jun 20, 2004

Hi,

I am using HTMLParser to parse a web page. Part of the routine I need
to write (I am new to Python) involves looking for a particular tag and,
once I know where the tag starts and ends, assigning all the data
between the tags to a variable. This is easy if the tag starts and ends
on the same line, but how would I go about it if it's split over
two or more lines?

Thanks

R

Pawel Kraszewski · Jun 20, 2004

Perhaps you should glue the whole text as a one long line?
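
A minimal sketch of that idea, assuming the whole page is already in one string. The helper name glue_lines is made up for illustration, and the sample markup is the SPAN fragment quoted later in this thread:

```python
def glue_lines(text):
    """Collapse every run of whitespace, newlines included, into one space."""
    return " ".join(text.split())


html = ('<SPAN class=qv id=BusinessName\n'
        'title="Business Name">Some Shady Business\n'
        'Group Ltd.</SPAN>')

one_line = glue_lines(html)
print(one_line)
# <SPAN class=qv id=BusinessName title="Business Name">Some Shady Business Group Ltd.</SPAN>
```

Note that this also squeezes runs of spaces inside the text, which may or may not be what you want.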

Nigel Rowe · Jun 20, 2004

Don't re-invent the wheel,
http://www.crummy.com/software/BeautifulSoup/

Rigga · Jun 20, 2004

I want to do it manually as it will help with my understanding of Python,
any ideas how I go about it?

Diez B. Roggisch · Jun 20, 2004

What difference does it make that the text is spread over more than one
line? Just collect the data in handle_data.
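
A rough sketch of what collecting in handle_data could look like, written against Python 3's html.parser module. The class name SpanCollector and its attributes are invented for this example, and it assumes the tag of interest is a SPAN that is not nested inside another SPAN:

```python
from html.parser import HTMLParser


class SpanCollector(HTMLParser):
    """Collect the text of every <span>, however many lines it covers."""

    def __init__(self):
        super().__init__()
        self.in_span = False   # True while we are between <span> and </span>
        self.chunks = []       # pieces of text seen inside the current span
        self.results = []      # finished text, one entry per span

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag names, so <SPAN> arrives as "span".
        if tag == "span":
            self.in_span = True
            self.chunks = []

    def handle_data(self, data):
        # May be called more than once per span; just keep appending.
        if self.in_span:
            self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self.in_span:
            self.in_span = False
            self.results.append("".join(self.chunks))


parser = SpanCollector()
parser.feed('<TD><SPAN class=qv id=BusinessName\n'
            'title="Business Name">Some Shady Business\n'
            'Group Ltd.</SPAN></TD></TR>')
parser.close()
print(parser.results)  # ['Some Shady Business\nGroup Ltd.']
```

The parser never sees "lines" at all, only tags and the text between them, so the line breaks simply end up inside the collected string.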

Peter Hansen · Jun 20, 2004

Rigga said:
I want to do it manually as it will help with my understanding of Python,
any ideas how I go about it?

Wouldn't it help you a lot more then, to figure it out on your
own? ;-)

(If you are really looking for help improving your understanding
of Python, reading the source for BeautifulSoup and figuring out
how *it* does it will probably get you farther than figuring it
out yourself. If, on the other hand, it's a better understanding
of *programming* that you are after, then doing it yourself is
the best bet... IMHO )

-Peter

John Roth · Jun 20, 2004

Depending on exactly what I want to do, I frequently use <file>.read()
to pick up the entire file in one string, rather than <file>.readlines() to
create a list of strings. It works quite well when what I need to do
can be served by regexes (which is not always the case).

John Roth
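
To illustrate the difference, here is a small sketch. io.StringIO stands in for an open file, and the sample text and pattern are assumptions chosen to match the markup shown elsewhere in this thread:

```python
import io
import re

SAMPLE = ('<SPAN title="Business Name">Some Shady Business\n'
          'Group Ltd.</SPAN>\n')
pattern = re.compile(r"<SPAN [^>]*>([^<]*)</SPAN>")

# readlines(): a list of strings, one per line, searched one at a time.
lines = io.StringIO(SAMPLE).readlines()
per_line = [m.group(1) for line in lines for m in pattern.finditer(line)]

# read(): the whole file as a single string, searched in one go.
whole = io.StringIO(SAMPLE).read()
whole_text = pattern.findall(whole)

print(per_line)    # [] because no single line holds both <SPAN ...> and </SPAN>
print(whole_text)  # ['Some Shady Business\nGroup Ltd.']
```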

Nelson Minar · Jun 20, 2004

I often have variants of this problem too. The simplest way to make it
work is to read all the HTML in at once with a single call to
file.read(), and then use a regular expression. Note that you probably
don't need re.MULTILINE, although you should take a look at what it
means in the docs just to know.

This works fine as long as you expect your files to be relatively
small (under a meg or so).
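
For what it's worth, a short sketch of why re.MULTILINE is usually not the flag in question. The sample string is made up to resemble the markup in this thread. A negated character class such as [^<] matches newlines without any flag, "." only crosses newlines with re.DOTALL, and re.MULTILINE only changes what ^ and $ match:

```python
import re

html = ('<SPAN class=qv\n'
        'title="Business Name">Some Shady Business\n'
        'Group Ltd.</SPAN>')

# Negated class: newlines are "not <", so no flag is needed.
negated = re.search(r"<SPAN[^>]*>([^<]*)</SPAN>", html)

# "." stops at a newline by default, so this finds nothing...
dot_plain = re.search(r"<SPAN[^>]*>(.*?)</SPAN>", html)

# ...and re.MULTILINE does not change that, it only affects ^ and $.
dot_multiline = re.search(r"<SPAN[^>]*>(.*?)</SPAN>", html, re.MULTILINE)

# re.DOTALL is the flag that lets "." match newlines.
dot_all = re.search(r"<SPAN[^>]*>(.*?)</SPAN>", html, re.DOTALL)

print(negated.group(1))   # Some Shady Business / Group Ltd. (on two lines)
print(dot_plain)          # None
print(dot_multiline)      # None
print(dot_all.group(1))   # same text as the negated-class match
```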

Rigga · Jun 20, 2004

I'm reading the entire file into a variable at the moment and passing it
through HTMLParser. I have run into another problem that I am having a
hard time working out. My data is in this format:

<TD><SPAN class=qv id=EmployeeNo
title="Employee Number">123456</SPAN></TD></TR>

Sometimes the data is spread over 3 lines, like:

<TD><SPAN class=qv id=BusinessName
title="Business Name">Some Shady Business
Group Ltd.</SPAN></TD></TR></TBODY></TABLE></TD></TR>

The data I need is the text enclosed in quotes after title= and the
text after the > and before the </SPAN. In the case above that would
be: Some Shady Business
Group Ltd.

Running the file through HTMLParser, I discovered that the title= part
and the data I need are contained in a list, so I have done this:

Snippet of my code:

from html.parser import HTMLParser  # Python 3 module name

class MyHTMLParser(HTMLParser):

    def handle_starttag(self, tag, attrs):
        print("Encountered the beginning of a %s tag" % tag)

    def handle_data(self, data):
        # data is the text found between tags
        if "title=" in data:
            print("found title")

However, I cannot work out how to search through the data (which is in a
list) to pull out what I need.

Sorry if this is a dumb question but hey I am learning!

Many thanks

Rigga
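
One possible way to approach this, sketched for Python 3's html.parser: the attrs argument of handle_starttag is a list of (name, value) tuples, so the title can be looked up there rather than in handle_data. The class name TitleAndTextParser and the fields dictionary are invented for this example, and it assumes every SPAN of interest carries a title attribute:

```python
from html.parser import HTMLParser


class TitleAndTextParser(HTMLParser):
    """Map each <span>'s title attribute to the text inside that span."""

    def __init__(self):
        super().__init__()
        self.current_title = None  # title of the span being read, if any
        self.chunks = []           # text pieces seen inside that span
        self.fields = {}           # title -> collected text

    def handle_starttag(self, tag, attrs):
        # attrs looks like [('class', 'qv'), ('id', ...), ('title', ...)]
        if tag == "span":
            self.current_title = dict(attrs).get("title")
            self.chunks = []

    def handle_data(self, data):
        if self.current_title is not None:
            self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self.current_title is not None:
            self.fields[self.current_title] = "".join(self.chunks)
            self.current_title = None


page = ('<TD><SPAN class=qv id=EmployeeNo\n'
        'title="Employee Number">123456</SPAN></TD></TR>\n'
        '<TD><SPAN class=qv id=BusinessName\n'
        'title="Business Name">Some Shady Business\n'
        'Group Ltd.</SPAN></TD></TR></TBODY></TABLE></TD></TR>')

parser = TitleAndTextParser()
parser.feed(page)
parser.close()
print(parser.fields)
# {'Employee Number': '123456', 'Business Name': 'Some Shady Business\nGroup Ltd.'}
```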

William Park · Jun 21, 2004

Approach:

1. Extract '<SPAN ([^>]*)>([^<]*)</SPAN>' which is

<SPAN class=qv id=BusinessName
title="Business Name">Some Shady Business
Group Ltd.</SPAN>

with parenthesized groups giving

submatch[1]='class=qv id=BusinessName\ntitle="Business Name"'
submatch[2]='Some Shady Business\nGroup Ltd.'

2. Split submatch[1] into

class=qv
id=BusinessName
title="Business Name"

Homework:

Write a Python script.
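
One possible Python take on the two steps above, offered as a sketch rather than a definitive answer. The SPAN pattern is the one from step 1; the attribute pattern ATTR_RE and the function name parse_spans are assumptions made for this example, and the attribute pattern only copes with simple name=value or name="value" pairs:

```python
import re

# Step 1: the pattern from the approach above (attributes, then text).
SPAN_RE = re.compile(r"<SPAN ([^>]*)>([^<]*)</SPAN>")

# Step 2: name=value pairs, where value is either "quoted" or a bare word.
ATTR_RE = re.compile(r'(\w+)=("[^"]*"|\S+)')


def parse_spans(text):
    """Return a list of (attribute dict, enclosed text) pairs."""
    results = []
    for attr_text, body in SPAN_RE.findall(text):
        attrs = {name: value.strip('"')
                 for name, value in ATTR_RE.findall(attr_text)}
        results.append((attrs, body))
    return results


text = ('<TD><SPAN class=qv id=BusinessName\n'
        'title="Business Name">Some Shady Business\n'
        'Group Ltd.</SPAN></TD></TR></TBODY></TABLE></TD></TR>')

spans = parse_spans(text)
attrs, body = spans[0]
print(attrs["title"])  # Business Name
print(body)            # Some Shady Business / Group Ltd. (on two lines)
```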

Bash solution:

First, you need my patched Bash which can be found at

http://freshmeat.net/projects/bashdiff/

You need to patch the Bash shell and compile it. It has many Python-like
features, particularly regex and array. The shell solution is

text='<TD><SPAN class=qv id=BusinessName
title="Business Name">Some Shady Business
Group Ltd.</SPAN></TD></TR></TBODY></TABLE></TD></TR>'

newf () { # Usage: newf match submatch1 submatch2
    eval $2 # --> class, id, title
    echo $title > title
    echo $3 > name
}
x=()
array -e '<SPAN ([^>]*)>([^<]*)</SPAN>' -E newf x "$text"
cat title
cat name

I can explain the steps, but it's rather long.

Rigga · Jun 21, 2004

Thanks for everyone's help. I have now worked out a way that works for me;
your input has helped me immensely.

many thanks

R
