all the text (including tags) between <body> .. </body>

Discussion in 'Perl Misc' started by tarakparekh@yahoo.com, Sep 7, 2005.

  1. Guest

    Hello,

    I am not well-versed in Perl, so I would like to ask for
    suggestions/help.

    My goal is to merge 2 HTML files - particularly, to get everything
    between <body> C1 </body> in file1.html and <body> C2 </body>
    in file2.html to create file3.html that has:
    <body> C1 C2 </body>

    C1, C2 can contain anything that can fall within the "body" tags.

    I have taken a look at HTML::Parser, as well as HTML::TokeParser,
    but on initial tryouts was unable to get the tags themselves.

    Some postings indicated reg. expressions are not good for HTML
    parsing, but what I am doing in terms of merging is pretty dumb.

    Would appreciate any help.
    thanks,
    tarak
    , Sep 7, 2005
    #1

  2. Simon Taylor Guest

    Hello Tarak,

    > I am not well-versed in Perl, so I would like to ask for
    > suggestions/help.
    >
    > My goal is to merge 2 HTML files - particularly, to get everything
    > between <body> C1 </body> in file1.html and <body> C2 </body>
    > in file2.html to create file3.html that has:
    > <body> C1 C2 </body>
    >
    > C1, C2 can contain anything that can fall within the "body" tags.
    >
    > I have taken a look at HTML::Parser, as well as HTML::TokeParser,
    > but on initial tryouts was unable to get the tags themselves.
    >
    > Some postings indicated reg. expressions are not good for HTML
    > parsing, but what I am doing in terms of merging is pretty dumb.
    >
    > Would appreciate any help.


    This may help shed some light on using HTML::Parser:

    http://www.perlmeme.org/tutorials/html_parser.html

    And in general, google for "using HTML::Parser".

    Regards,

    Simon Taylor
    --
    www.perlmeme.org
    Simon Taylor, Sep 7, 2005
    #2

  3. Paul Lalli Guest

    wrote:
    > I am not well-versed in Perl, so I would like to ask for
    > suggestions/help.
    >
    > My goal is to merge 2 HTML files - particularly, to get everything
    > between <body> C1 </body> in file1.html and <body> C2 </body>
    > in file2.html to create file3.html that has:
    > <body> C1 C2 </body>
    >
    > C1, C2 can contain anything that can fall within the "body" tags.
    >
    > I have taken a look at HTML::Parser, as well as HTML::TokeParser,
    > but on initial tryouts was unable to get the tags themselves.


    So what were those initial tryouts? And what were the results? No one
    can help you fix your program if you don't show us your program.

    Please read the posting guidelines for this group. Then please post a
    short-but-complete script that demonstrates the errors you're having.
    Then we can help you correct those errors.

    > Some postings indicated reg. expressions are not good for HTML
    > parsing,


    Correct.

    > but what I am doing in terms of merging is pretty dumb.


    I have no idea what 'dumb' means in this context.

    Paul Lalli
    Paul Lalli, Sep 7, 2005
    #3
  4. Guest

    Paul,

    Sorry for not posting the script earlier. status.html is a small HTML
    file containing some links to pictures, and some to other HTML
    documents.

    #!/usr/pkg/bin/perl

    package my_parser;
    use base 'HTML::Parser';

    $in_body = 0;
    $body = "";

    sub start {
        my ($self, $tag) = @_;

        if ($tag eq 'body') {
            $in_body = 1;
        }
    }

    sub end {
        my ($self, $tag) = @_;

        if ($tag eq 'body') {
            $in_body = 0;
        }
    }

    sub text {
        my ($self, $text) = @_;

        if ($in_body) {
            $body .= $text;
        }
    }

    my $p = my_parser->new();
    $p->parse_file('status.html');

    print "BODY=$body\n";


    --- Results:
    BODY=Project: P1Status: Owner: owner1 Issues/Comments:
    Issue 1
    ----

    I missed all the links and image tags, as expected, but I don't know
    how to retain them.

    What I meant by "dumb" was that I want nothing but all the text
    between the <body> .. </body> tags. No other processing.

    thanks,
    tarak
    , Sep 7, 2005
    #4
  5. Scott Bryce Guest

    wrote:

    > Sorry for not posting the script earlier. status.html is a small html
    > file containing some links to pictures, and some to other html
    > documents.


    <code snipped>

    Since you only want to know what is between the <body> and the </body>
    tags, ask the parser to only report on those tags.


    #!/usr/bin/perl
    use strict;
    use warnings;
    use HTML::Parser ();

    my $content;

    # Report only the body tags; everything skipped between them is
    # passed to the end handler as 'skipped_text'.
    my $p = HTML::Parser->new( api_version => 3,
                               start_h     => [\&start],
                               end_h       => [\&end, 'skipped_text'],
                               report_tags => ['body']
                             );

    $p->parse_file('status.html') or die "Cannot parse status.html -- $!";

    print $content;

    sub start
    {
        # Nothing needs to happen here
    }

    sub end
    {
        $content = shift;
    }


    In Win98SE this is putting an extra CRLF at the end of each line. I
    don't know if this is a Windows specific thing, or if I am missing
    something in the docs that explains why this is happening.
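
    One possible explanation, offered as an assumption rather than something
    confirmed in this thread: if the file is read in binary mode, each CRLF
    from a Windows file reaches $content unchanged, and printing to a
    text-mode handle on Windows then expands the LF again. A minimal sketch
    of normalizing the line endings before printing follows;
    normalize_newlines is a hypothetical helper name, and the sample string
    stands in for the $content captured by the end handler above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: collapse each CRLF pair to a bare LF so that a
# text-mode output handle has only LFs left to translate.
sub normalize_newlines {
    my ($text) = @_;
    $text =~ s/\r\n/\n/g;
    return $text;
}

# Stand-in for the $content captured by the end handler.
my $content = "<p>line one</p>\r\n<p>line two</p>\r\n";

print normalize_newlines($content);
```

    Calling binmode(STDOUT) before printing might be another way to leave
    the bytes untouched, if the output is meant to keep its original line
    endings.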
    Scott Bryce, Sep 7, 2005
    #5
  6. A. Sinan Unur

    wrote in news:1126116717.585011.288760@g43g2000cwa.googlegroups.com:

    > Paul,
    >
    > Sorry for not posting the script earlier. status.html is a small HTML
    > file containing some links to pictures, and some to other HTML
    > documents.
    >
    > #!/usr/pkg/bin/perl
    >
    > package my_parser;
    > use base 'HTML::Parser';


    I see Scott Bryce has already posted a solution to your problem, but
    here is another way to do it:

    #!/usr/bin/perl

    use strict;
    use warnings;

    use HTML::TokeParser::Simple;
    my $p = HTML::TokeParser::Simple->new(
        url => 'http://www.yahoo.com/'
    );

    my $in_body;

    while( my $token = $p->get_token ) {
        if( $token->is_start_tag('body') ) {
            $in_body = 1;
            next;
        } elsif( $token->is_end_tag('body') ) {
            $in_body = 0;
            next;
        }
        print $token->as_is if $in_body;
    }
    __END__
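
    The same token loop can be applied to the original goal of merging two
    bodies. What follows is only a sketch under stated assumptions:
    body_content is a hypothetical helper name, and the two inline strings
    stand in for file1.html and file2.html so that the example runs on its
    own. With real files, the helper could be called as
    body_content( file => 'file1.html' ) instead.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TokeParser::Simple;

# Hypothetical helper: return everything between <body> and </body>,
# tags included. Takes the same source pair as the constructor,
# e.g. ( file => 'file1.html' ) or ( string => $html ).
sub body_content {
    my (%source) = @_;
    my $p = HTML::TokeParser::Simple->new(%source)
        or die "Cannot create parser";

    my ( $in_body, $content ) = ( 0, '' );
    while ( my $token = $p->get_token ) {
        if    ( $token->is_start_tag('body') ) { $in_body = 1; next }
        elsif ( $token->is_end_tag('body') )   { $in_body = 0; next }
        # as_is returns the token exactly as it appeared in the source.
        $content .= $token->as_is if $in_body;
    }
    return $content;
}

# Inline sample documents (assumed stand-ins for file1.html and file2.html).
my $c1 = body_content(
    string => '<html><body><p>C1 <a href="a.html">link</a></p></body></html>' );
my $c2 = body_content(
    string => '<html><body><img src="b.png"> C2</body></html>' );

# The merged body that would be written to file3.html.
my $merged = "<body>$c1$c2</body>";
print "$merged\n";
```

    The sketch emits only the merged body element; a complete file3.html
    would still need its own <html> and <head> wrapper around it.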


    --
    A. Sinan Unur <>
    (reverse each component and remove .invalid for email address)

    comp.lang.perl.misc guidelines on the WWW:
    http://mail.augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
    A. Sinan Unur, Sep 8, 2005
    #6
