Text over multiple lines

Discussion in 'Python' started by Rigga, Jun 20, 2004.

  1. Rigga

    Rigga Guest

    Hi,

    I am using HTMLParser to parse a web page. Part of the routine I need
    to write (I am new to Python) involves looking for a particular tag and,
    once I know where the tag starts and ends, assigning all the data
    between the tags to a variable. This is easy if the tag starts and ends
    on the same line, but how would I go about it if it's split over
    two or more lines?

    Thanks

    R
     
    Rigga, Jun 20, 2004
    #1

  2. Rigga wrote:

    > I am using the HTMLParser to parse a web page, part of the routine I need
    > to write (I am new to Python) involves looking for a particular tag and
    > once I know the start and the end of the tag then to assign all the data
    > in between the tags to a variable, this is easy if the tag starts and ends
    > on the same line however how would I go about doing it if it's split over
    > two or more lines?


    Perhaps you should glue the whole text as a one long line?
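
A minimal sketch of that idea, assuming the page is already available as a list of lines. The helper name `glue_lines` and the sample input are made up for illustration:

```python
def glue_lines(lines):
    """Join lines into one string, turning each line break into a single space."""
    return " ".join(line.strip() for line in lines if line.strip())


# Hypothetical input: one element whose text is split over two lines.
lines = ["<b>Some Shady Business\n", "Group Ltd.</b>\n"]
glued = glue_lines(lines)
print(glued)
```

Once the text is one long string, a start tag and its end tag can no longer fall on different lines.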

    --
    Pawel Kraszewski FreeBSD/Linux

    E-Mail/Jabber Phone ICQ GG
    +48 604 777447 45615564 69381
     
    Pawel Kraszewski, Jun 20, 2004
    #2

  3. Nigel Rowe

    Nigel Rowe Guest

    Rigga wrote:

    > Hi,
    >
    > I am using the HTMLParser to parse a web page, part of the routine I need
    > to write (I am new to Python) involves looking for a particular tag and
    > once I know the start and the end of the tag then to assign all the data
    > in between the tags to a variable, this is easy if the tag starts and ends
    > on the same line however how would I go about doing it if it's split over
    > two or more lines?
    >
    > Thanks
    >
    > R


    Don't re-invent the wheel,
    http://www.crummy.com/software/BeautifulSoup/

    --
    Nigel Rowe
    A pox upon the spammers that make me write my address like..
    rho (snail) swiftdsl (stop) com (stop) au
     
    Nigel Rowe, Jun 20, 2004
    #3
  4. Rigga

    Rigga Guest

    On Sun, 20 Jun 2004 21:03:34 +1000, Nigel Rowe wrote:

    > Rigga wrote:
    >
    >> Hi,
    >>
    >> I am using the HTMLParser to parse a web page, part of the routine I need
    >> to write (I am new to Python) involves looking for a particular tag and
    >> once I know the start and the end of the tag then to assign all the data
    >> in between the tags to a variable, this is easy if the tag starts and ends
    >> on the same line however how would I go about doing it if it's split over
    >> two or more lines?
    >>
    >> Thanks
    >>
    >> R

    >
    > Don't re-invent the wheel,
    > http://www.crummy.com/software/BeautifulSoup/

    I want to do it manually as it will help with my understanding of Python,
    any ideas how I go about it?
     
    Rigga, Jun 20, 2004
    #4
  5. Rigga wrote:

    > Hi,
    >
    > I am using the HTMLParser to parse a web page, part of the routine I need
    > to write (I am new to Python) involves looking for a particular tag and
    > once I know the start and the end of the tag then to assign all the data
    > in between the tags to a variable, this is easy if the tag starts and ends
    > on the same line however how would I go about doing it if it's split over
    > two or more lines?



    What difference does it make that the text is spread over more than one
    line? Just collect the data in handle_data.
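
A minimal sketch of collecting the data in handle_data, assuming Python 3's `html.parser` module (in 2004 the module was named `HTMLParser`). The class name `TagTextCollector` and the sample markup are made up for illustration:

```python
from html.parser import HTMLParser  # Python 3 module name


class TagTextCollector(HTMLParser):
    """Collect the text found inside every element with the wanted tag name."""

    def __init__(self, wanted_tag):
        super().__init__()
        self.wanted_tag = wanted_tag.lower()  # the parser reports tag names in lower case
        self.inside = False   # True while between the start tag and the end tag
        self.chunks = []      # pieces of text for the element currently open
        self.results = []     # finished text, one entry per element

    def handle_starttag(self, tag, attrs):
        if tag == self.wanted_tag:
            self.inside = True
            self.chunks = []

    def handle_data(self, data):
        # Line breaks simply arrive as part of data, so nothing special is needed.
        if self.inside:
            self.chunks.append(data)

    def handle_endtag(self, tag):
        if tag == self.wanted_tag and self.inside:
            self.inside = False
            self.results.append("".join(self.chunks))


parser = TagTextCollector("p")
parser.feed("<p>first line\nsecond line</p>")
parser.close()
print(parser.results)
```

The parser works on the character stream rather than on lines, which is why the line break makes no difference here.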

    --
    Regards,

    Diez B. Roggisch
     
    Diez B. Roggisch, Jun 20, 2004
    #5
  6. Peter Hansen

    Peter Hansen Guest

    Rigga wrote:

    > On Sun, 20 Jun 2004 21:03:34 +1000, Nigel Rowe wrote:
    >>Don't re-invent the wheel,
    >>http://www.crummy.com/software/BeautifulSoup/

    >
    > I want to do it manually as it will help with my understanding of Python,
    > any ideas how I go about it?


    Wouldn't it help you a lot more then, to figure it out on your
    own? ;-)

    (If you are really looking for help improving your understanding
    of Python, reading the source for BeautifulSoup and figuring out
    how *it* does it will probably get you farther than figuring it
    out yourself. If, on the other hand, it's a better understanding
    of *programming* that you are after, then doing it yourself is
    the best bet... IMHO )

    -Peter
     
    Peter Hansen, Jun 20, 2004
    #6
  7. John Roth

    John Roth Guest

    "Rigga" <> wrote in message
    news:p...
    > Hi,
    >
    > I am using the HTMLParser to parse a web page, part of the routine I need
    > to write (I am new to Python) involves looking for a particular tag and
    > once I know the start and the end of the tag then to assign all the data
    > in between the tags to a variable, this is easy if the tag starts and ends
    > on the same line however how would I go about doing it if it's split over
    > two or more lines?
    >
    > Thanks


    Depending on exactly what I want to do, I frequently use <file>.read()
    to pick up the entire file in one string, rather than <file>.readlines() to
    create a list of strings. It works quite well when what I need to do
    can be served by regexes (which is not always the case).
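
A small sketch of the difference, assuming Python 3. `io.StringIO` stands in for an open file, and the sample content and pattern are made up for illustration:

```python
import io
import re

# io.StringIO stands in for an open file; the HTML content is hypothetical.
html = "<title>Text over\nmultiple lines</title>\n"

as_lines = io.StringIO(html).readlines()   # list of strings, one per line
as_text = io.StringIO(html).read()         # one string, line breaks included

pattern = re.compile(r"<title>([^<]*)</title>")

# No single line contains both the opening and the closing tag.
line_hits = [m.group(1) for line in as_lines for m in pattern.finditer(line)]
# The whole-file string does, so the pattern can match across the line break.
text_hits = pattern.findall(as_text)

print(line_hits)
print(text_hits)
```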

    John Roth
    >
    > R
     
    John Roth, Jun 20, 2004
    #7
  8. Nelson Minar

    Nelson Minar Guest

    Rigga <> writes:
    > I am using the HTMLParser to parse a web page, part of the routine I need
    > to write (I am new to Python) involves looking for a particular tag and
    > once I know the start and the end of the tag then to assign all the data
    > in between the tags to a variable, this is easy if the tag starts and ends
    > on the same line however how would I go about doing it if it's split over
    > two or more lines?


    I often have variants of this problem too. The simplest way to make it
    work is to read all the HTML in at once with a single call to
    file.read(), and then use a regular expression. Note that you probably
    don't need re.MULTILINE, although you should take a look at what it
    means in the docs just to know.

    This works fine as long as you expect your files to be relatively
    small (under a meg or so).
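
A short sketch of what the flags do, assuming Python 3 and made-up sample strings. re.MULTILINE only changes where `^` and `$` match; the flag that lets `.` cross a line break is re.DOTALL, and a negated character class such as `[^<]` crosses line breaks without any flag:

```python
import re

lines = "first line\nsecond line"   # hypothetical sample text
html = "<b>first\nsecond</b>"       # hypothetical sample markup

# re.MULTILINE makes ^ match at the start of every line, not just the string.
start_default = re.findall(r"^\w+", lines)
start_multiline = re.findall(r"^\w+", lines, re.MULTILINE)

# "." stops at a newline unless re.DOTALL is given.
dot_default = re.search(r"<b>(.*)</b>", html)
dot_dotall = re.search(r"<b>(.*)</b>", html, re.DOTALL)

# A negated character class matches newlines with no flag at all.
negated = re.search(r"<b>([^<]*)</b>", html)

print(start_default, start_multiline)
print(dot_default, dot_dotall.group(1), negated.group(1))
```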
     
    Nelson Minar, Jun 20, 2004
    #8
  9. Rigga

    Rigga Guest

    On Sun, 20 Jun 2004 17:22:53 +0000, Nelson Minar wrote:

    > Rigga <> writes:
    >> I am using the HTMLParser to parse a web page, part of the routine I need
    >> to write (I am new to Python) involves looking for a particular tag and
    >> once I know the start and the end of the tag then to assign all the data
    >> in between the tags to a variable, this is easy if the tag starts and ends
    >> on the same line however how would I go about doing it if it's split over
    >> two or more lines?

    >
    > I often have variants of this problem too. The simplest way to make it
    > work is to read all the HTML in at once with a single call to
    > file.read(), and then use a regular expression. Note that you probably
    > don't need re.MULTILINE, although you should take a look at what it
    > means in the docs just to know.
    >
    > This works fine as long as you expect your files to be relatively
    > small (under a meg or so).


    I'm reading the entire file into a variable at the moment and passing it
    through HTMLParser. I have run into another problem that I am having a
    hard time working out. My data is in this format:

    <TD><SPAN class=qv id=EmployeeNo
    title="Employee Number">123456</SPAN></TD></TR>

    Sometimes the data is spread over 3 lines like:

    <TD><SPAN class=qv id=BusinessName
    title="Business Name">Some Shady Business
    Group Ltd.</SPAN></TD></TR></TBODY></TABLE></TD></TR>

    The data I need to get is the data enclosed in quotes after the word
    title= and data after the > and before the </SPAN; in the case above that would
    be: Some Shady Business
    Group Ltd.

    Running the file through HTMLParser, I discovered that the title= part
    and the data part I need are contained in a list, so I have done this:

    snippet of my code:

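    class MyHTMLParser(HTMLParser):

        def handle_starttag(self, tag, attrs):
            # Called once for every opening tag; attrs holds the tag's attributes.
            print("Encountered the beginning of a %s tag" % tag)

        def handle_data(self, data):
            # Called with the text found between tags.
            if "title=" in data:
                print("found title")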

    However, I cannot work out how to search through the data (which is in a
    list) to pull out the data I need.
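
A minimal sketch of one way to get at the title, assuming Python 3's `html.parser`. The parser passes a tag's attributes to handle_starttag as a list of `(name, value)` tuples, so the `title=` text never shows up in handle_data. The class name `SpanTitles` is made up for illustration:

```python
from html.parser import HTMLParser  # Python 3 module name


class SpanTitles(HTMLParser):
    """Record the id and title attributes of every <span> start tag."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive in lower case.
        if tag == "span":
            attributes = dict(attrs)  # turn the list into a lookup table
            self.found.append((attributes.get("id"), attributes.get("title")))


sample = ('<TD><SPAN class=qv id=BusinessName\n'
          'title="Business Name">Some Shady Business\n'
          'Group Ltd.</SPAN></TD></TR>')
parser = SpanTitles()
parser.feed(sample)
parser.close()
print(parser.found)
```

Converting the list to a dict is a convenience here; looping over the pairs and comparing each name to "title" would work just as well.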

    Sorry if this is a dumb question but hey I am learning!

    Many thanks

    Rigga
     
    Rigga, Jun 20, 2004
    #9
  10. William Park

    William Park Guest

    Rigga <> wrote:
    > On Sun, 20 Jun 2004 17:22:53 +0000, Nelson Minar wrote:
    >
    > > Rigga <> writes:
    > >> I am using the HTMLParser to parse a web page, part of the routine
    > >> I need to write (I am new to Python) involves looking for a
    > >> particular tag and once I know the start and the end of the tag
    > >> then to assign all the data in between the tags to a variable, this
    > >> is easy if the tag starts and ends on the same line however how
    > >> would I go about doing it if it's split over two or more lines?

    > >
    > > I often have variants of this problem too. The simplest way to make
    > > it work is to read all the HTML in at once with a single call to
    > > file.read(), and then use a regular expression. Note that you
    > > probably don't need re.MULTILINE, although you should take a look at
    > > what it means in the docs just to know.
    > >
    > > This works fine as long as you expect your files to be relatively
    > > small (under a meg or so).

    >
    > I'm reading the entire file into a variable at the moment and passing
    > it through HTMLParser. I have run into another problem that I am
    > having a hard time working out, my data is in this format:
    >
    > <TD><SPAN class=qv id=EmployeeNo
    > title="Employee Number">123456</SPAN></TD></TR>
    >
    > Sometimes the data is spread over 3 lines like:
    >
    > <TD><SPAN class=qv id=BusinessName
    > title="Business Name">Some Shady Business
    > Group Ltd.</SPAN></TD></TR></TBODY></TABLE></TD></TR>
    >
    > The data I need to get is the data enclosed in quotes after the word
    > title= and data after the > and before the </SPAN, in the case above
    > would be: Some Shady Business Group Ltd.


    Approach:

    1. Extract '<SPAN ([^>]*)>([^<]*)</SPAN>' which is

    <SPAN class=qv id=BusinessName
    title="Business Name">Some Shady Business
    Group Ltd.</SPAN>

    with parenthesized groups giving

    submatch[1]='class=qv id=BusinessName\ntitle="Business Name"'
    submatch[2]='Some Shady Business\nGroup Ltd.'

    2. Split submatch[1] into

    class=qv
    id=BusinessName
    title="Business Name"

    Homework:

    Write a Python script.
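
One possible Python version of the two steps above, as a sketch assuming Python 3 and the sample markup from this thread. The function name `extract_spans` and the attribute pattern are made up for illustration, and the patterns only cope with markup as simple as the sample:

```python
import re

text = '''<TD><SPAN class=qv id=BusinessName
title="Business Name">Some Shady Business
Group Ltd.</SPAN></TD></TR></TBODY></TABLE></TD></TR>'''

# Step 1: group 1 is the attribute text, group 2 the element content.
span_re = re.compile(r'<SPAN ([^>]*)>([^<]*)</SPAN>')
# Step 2: name=value pairs, where the value is either "quoted" or a bare word.
attr_re = re.compile(r'(\w+)=(?:"([^"]*)"|(\S+))')


def extract_spans(source):
    """Return a list of (attribute dict, content) pairs, one per SPAN element."""
    spans = []
    for match in span_re.finditer(source):
        attr_text, content = match.groups()
        attrs = {name: quoted or bare
                 for name, quoted, bare in attr_re.findall(attr_text)}
        # Collapse the line break (and any other whitespace run) into one space.
        spans.append((attrs, " ".join(content.split())))
    return spans


for attrs, content in extract_spans(text):
    print(attrs["title"], "->", content)
```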

    Bash solution:

    First, you need my patched Bash which can be found at

    http://freshmeat.net/projects/bashdiff/

    You need to patch the Bash shell, and compile. It has many Python
    features, particularly regex and array. Shell solution is

    text='<TD><SPAN class=qv id=BusinessName
    title="Business Name">Some Shady Business
    Group Ltd.</SPAN></TD></TR></TBODY></TABLE></TD></TR>'

    newf () { # Usage: newf match submatch1 submatch2
        eval $2 # --> class, id, title
        echo $title > title
        echo $3 > name
    }
    x=()
    array -e '<SPAN ([^>]*)>([^<]*)</SPAN>' -E newf x "$text"
    cat title
    cat name

    I can explain the steps, but it's rather long. :)

    --
    William Park, Open Geometry Consulting, <>
    No, I will not fix your computer! I'll reformat your harddisk, though.
     
    William Park, Jun 21, 2004
    #10
  11. Rigga

    Rigga Guest

    On Mon, 21 Jun 2004 05:06:50 +0000, William Park wrote:

    > Rigga <> wrote:
    >> On Sun, 20 Jun 2004 17:22:53 +0000, Nelson Minar wrote:
    >>
    >> > Rigga <> writes:
    >> >> I am using the HTMLParser to parse a web page, part of the routine
    >> >> I need to write (I am new to Python) involves looking for a
    >> >> particular tag and once I know the start and the end of the tag
    >> >> then to assign all the data in between the tags to a variable, this
    >> >> is easy if the tag starts and ends on the same line however how
    >> >> would I go about doing it if it's split over two or more lines?
    >> >
    >> > I often have variants of this problem too. The simplest way to make
    >> > it work is to read all the HTML in at once with a single call to
    >> > file.read(), and then use a regular expression. Note that you
    >> > probably don't need re.MULTILINE, although you should take a look at
    >> > what it means in the docs just to know.
    >> >
    >> > This works fine as long as you expect your files to be relatively
    >> > small (under a meg or so).

    >>
    >> I'm reading the entire file into a variable at the moment and passing
    >> it through HTMLParser. I have run into another problem that I am
    >> having a hard time working out, my data is in this format:
    >>
    >> <TD><SPAN class=qv id=EmployeeNo
    >> title="Employee Number">123456</SPAN></TD></TR>
    >>
    >> Sometimes the data is spread over 3 lines like:
    >>
    >> <TD><SPAN class=qv id=BusinessName
    >> title="Business Name">Some Shady Business
    >> Group Ltd.</SPAN></TD></TR></TBODY></TABLE></TD></TR>
    >>
    >> The data I need to get is the data enclosed in quotes after the word
    >> title= and data after the > and before the </SPAN, in the case above
    >> would be: Some Shady Business Group Ltd.

    >
    > Approach:
    >
    > 1. Extract '<SPAN ([^>]*)>([^<]*)</SPAN>' which is
    >
    > <SPAN class=qv id=BusinessName
    > title="Business Name">Some Shady Business
    > Group Ltd.</SPAN>
    >
    > with parenthesized groups giving
    >
    > submatch[1]='class=qv id=BusinessName\ntitle="Business Name"'
    > submatch[2]='Some Shady Business\nGroup Ltd.'
    >
    > 2. Split submatch[1] into
    >
    > class=qv
    > id=BusinessName
    > title="Business Name"
    >
    > Homework:
    >
    > Write a Python script.
    >
    > Bash solution:
    >
    > First, you need my patched Bash which can be found at
    >
    > http://freshmeat.net/projects/bashdiff/
    >
    > You need to patch the Bash shell, and compile. It has many Python
    > features, particularly regex and array. Shell solution is
    >
    > text='<TD><SPAN class=qv id=BusinessName
    > title="Business Name">Some Shady Business
    > Group Ltd.</SPAN></TD></TR></TBODY></TABLE></TD></TR>'
    >
    > newf () { # Usage: newf match submatch1 submatch2
    >     eval $2 # --> class, id, title
    >     echo $title > title
    >     echo $3 > name
    > }
    > x=()
    > array -e '<SPAN ([^>]*)>([^<]*)</SPAN>' -E newf x "$text"
    > cat title
    > cat name
    >
    > I can explain the steps, but it's rather long. :)


    Thanks for everyone's help. I have now worked out a way that works for me;
    your input has helped me immensely.

    many thanks

    R
     
    Rigga, Jun 21, 2004
    #11
