HTML::Parser not stripping out comments

Discussion in 'Perl Misc' started by Jay, Jun 14, 2004.

  1. Jay

    Jay Guest

    I'm trying to get HTML::Parser to strip out the comments using some of the
    sample code from the man page. I'm using ignore_elements and I still
    get comments in the dtext output. Am I doing something wrong?

    Tia,
    Jay

    CODE:
    use HTML::Parser ();

    # Create parser object
    $p = HTML::Parser->new( api_version => 3,
    start_h => [\&start, "tagname, attr"],
    end_h => [\&end, "tagname"],
    comment_h => [\&comment, "self,text"],
    text_h => [\&dtext, "self,text"],
    marked_sections => 1,
    );

    $p->ignore_elements( qw(script, comment, style) );
    $p->strict_comment( [1] );
    # Parse directly from file
    $p->parse_file("0");


    sub start {
    my($self, $tagname, $attr, $attrseq, $origtext) = @_;
    #...
    }

    sub end {
    my($self, $tagname, $origtext) = @_;
    #...
    }

    sub text {
    my($self, $origtext, $is_cdata) = @_;
    #...
    }
    sub comment{
    #my($self, $origtext, $is_cdata) = @_;
    #...
    }

    sub dtext {
    my($self, $dtext ) = @_;
    $dtext=~s/\s+/ /g;
    print "DTEXT: $dtext\n";
    }

    Example of some of the output from parsing some web page:

    DTEXT: <!-- /* You may give each page an identifying name, server, and
    channel on the next lines. */ var s_pageName="buy"; var s_server="CWEB15";
    var s_channel="buy"; var s_pageTyp
    e=""; var s_prop1="Autoweb Direct to Site"; var s_prop2="Autoweb Direct to
    Site 10714"; var s_prop3=""; var s_prop4=""; var s_prop5=""; var s_prop6="";
    var s_prop7="buy|"; var s_pr
    op8=""; var s_prop9="buy|Autoweb Direct to Site|10714"; var s_prop10="buy|";
    var s_prop11="Autoweb Direct to Site|10714|taweb"; var s_prop12="||"; var
    s_prop13="||||||buy||No"; var
    s_prop14="Autoweb Direct to Site|10714|taweb|||||buy||No"; var s_prop15="No
    Article|No Article"; var s_prop16=""; var s_prop17=""; var s_prop18="Autoweb
    Direct to Site|10714|buy";
    var s_prop19="Autoweb Direct to Site|10714||buy"; var
    s_prop20="buy||||sky|ban|Autoweb Direct to Site"; /* E-commerce Variables */
    var s_campaign="10714"; var s_state=""; var s_zi
    p=""; var s_events=""; var s_products=""; var s_purchaseID=""; var
    s_eVar1="Autoweb Direct to Site"; var s_eVar2="Autoweb Direct to Site
    10714"; var s_eVar3="NT-sky-ban"; var s_eVa
    r4=""; var s_eVar5=""; /********* INSERT THE DOMAIN AND PATH TO YOUR CODE
    BELOW ************/ /********** DO NOT ALTER ANYTHING ELSE BELOW THIS LINE!
    *************/ var s_code=' '/
    /-->
    DTEXT:
    DTEXT:
    Jay, Jun 14, 2004
    #1

  2. Gisle Aas

    Gisle Aas Guest

    "Jay" <> writes:

    > I'm trying to get HTML::Parser to strip out the comments using some of the
    > sample code from the man page. I'm using the ignore_elements and I still
    > get comments in the dtext. Am I doing something wrong?


    Probably :)

    > CODE:
    > use HTML::Parser ();
    >
    > # Create parser object
    > $p = HTML::Parser->new( api_version => 3,
    > start_h => [\&start, "tagname, attr"],
    > end_h => [\&end, "tagname"],
    > comment_h => [\&comment, "self,text"],
    > text_h => [\&dtext, "self,text"],
    > marked_sections => 1,


    You really want marked_sections to be enabled?

    > );
    >
    > $p->ignore_elements( qw(script, comment, style) );


    You need to remove all the "," here. Otherwise they end up as part of
    the strings passed to ignore_elements. Also, there is no <comment> tag
    in HTML, so there is no comment element either.

    The extra commas probably explain why you get the JavaScript comment
    reported to your &dtext callback. Anything between <script> and
    </script> is reported as text. Even if it looks like a comment to you,
    it's not really one.
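
    A minimal sketch of the comma problem described above, using only core
    Perl. The variable names are illustrative. qw() splits on whitespace
    only, so the commas become part of the words, and the parser would be
    asked to ignore an element literally named "script," rather than
    "script".

```perl
use strict;
use warnings;

# qw() splits on whitespace only; commas are kept as part of each word.
# Under "use warnings" Perl normally warns about this ("Possible attempt
# to separate words with commas"), so silence it for the demonstration.
my @with_commas;
{
    no warnings 'qw';
    @with_commas = qw(script, comment, style);
}

# The intended list: no commas, and no "comment" (HTML has no such element).
my @without_commas = qw(script style);

print join('|', @with_commas), "\n";     # script,|comment,|style
print join('|', @without_commas), "\n";  # script|style
```

    Note that the last word, "style", has no trailing comma, so only the
    earlier names are mangled.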

    > $p->strict_comment( [1] );


    I don't think you actually want strict_comment enabled either. A
    plain 1 is also a perfectly fine true boolean.
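
    As a small illustration of the boolean point above (plain Perl, nothing
    specific to HTML::Parser): an array reference is always true, whatever
    it contains, so wrapping the flag in [ ] only obscures the intent.

```perl
use strict;
use warnings;

# An array reference is true in boolean context regardless of its contents,
# so [1] and [0] both count as "on"; a plain 1 or 0 says what is meant.
my @candidates = ( [1], [0], 1, 0 );
my @truth = map { $_ ? 'true' : 'false' } @candidates;

print "@truth\n";   # true true true false
```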

    > # Parse directly from file
    > $p->parse_file("0");


    That's a strange file name.
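
    Putting the suggestions above together, a hedged sketch of what a
    corrected script might look like. This is not the poster's final code:
    the sample HTML, the @chunks array, and the inline handler are
    assumptions made for illustration. It assumes HTML::Parser is
    installed, registers only a text handler (so comment events are never
    reported), and passes ignore_elements a comma-free list.

```perl
use strict;
use warnings;
use HTML::Parser ();

my @chunks;   # text collected by the handler (illustrative name)

my $p = HTML::Parser->new(
    api_version => 3,
    # "dtext" asks for the text with HTML entities decoded.
    text_h      => [ sub { push @chunks, shift }, "dtext" ],
);

# No commas inside qw(); everything inside these elements is suppressed.
$p->ignore_elements(qw(script style));

# Hypothetical sample input, standing in for the real page.
my $html = <<'HTML';
<html><head>
<script><!-- var s_pageName="buy"; //--></script>
<style>p { color: red }</style>
</head><body>
<!-- a real HTML comment -->
<p>Hello, world</p>
</body></html>
HTML

$p->parse($html);
$p->eof;

# Collapse runs of whitespace and trim the ends before printing.
(my $text = join '', @chunks) =~ s/\s+/ /g;
$text =~ s/^ | $//g;
print "TEXT: $text\n";
```

    To read from a file instead, parse_file takes a file name or an open
    filehandle in place of the parse/eof pair.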

    Regards,
    Gisle
    Gisle Aas, Jun 15, 2004
    #2

  3. Jay

    Jay Guest

    Thanks Gisle,
    I will look at this some more with your recommendations and post the
    results.
    Yes, a strange filename.

    Jay
    Jay, Jun 15, 2004
    #3
  4. Jay

    Jay Guest

    That did the trick, thanks a lot.

    Jay
    Jay, Jun 15, 2004
    #4