Extracting HTML Content

Discussion in 'Perl Misc' started by masterGaurav, May 1, 2006.

  1. masterGaurav

    masterGaurav Guest

    Hi,

    I have some HTML content. I want to strip off the HTML tags and
    retain the raw text.
    That's simple:

    $data =~ s/<.+?>//gsm

    However, now I have a condition... I want to strip off the tags only
    if the content has at least one '$' character.

    For example:

    <p><a href='#'>This is low priced at $50</a></p>

    Should return the raw content, however

    <p><a href='#'>This is low priced at 50</a></p>

    should return nothing.

    Can it be done using one regex, maybe in a loop?


    Cheers,
    Gaurav
    masterGaurav, May 1, 2006
    #1

  2. masterGaurav

    robic0 Guest

    On 30 Apr 2006 20:04:05 -0700, "masterGaurav" <> wrote:

    >Hi,
    >
    > I have some HTML content. I want to strip off the HTML tags and
    >retain the raw text.
    > That's simple:
    >
    > $data =~ s/<.+?>//gsm
    >
    > However, now I have a condition... I want to strip off the tags only
    >if the content has at least one '$' character.
    >
    >For example:
    >
    ><p><a href='#'>This is low priced at $50</a></p>
    >
    > Should return the raw content, however
    >
    ><p><a href='#'>This is low priced at 50</a></p>
    >
    > should return nothing.
    >
    > Can it be done using one regex, maybe in a loop?
    >
    >
    >Cheers,
    >Gaurav


    Submit your request to robic0's RXParse. Do a search on it.
    robic0, May 1, 2006
    #2

  3. masterGaurav

    masterGaurav Guest

    masterGaurav, May 1, 2006
    #3
  4. masterGaurav wrote:
    > Hi,
    >
    > I have some HTML content. I want to strip off the HTML tags and
    > retain the raw text.
    > That's simple:


    No, it isn't. Contrary to popular belief, parsing HTML correctly is quite
    difficult.

    > $data =~ s/<.+?>//gsm


    Which fails in many, many circumstances.
    Please see the FAQ (perldoc -q html) for why it fails and how to parse HTML correctly.

    BTW: this is a Very FAQ.

    jue
    Jürgen Exner, May 1, 2006
    #4
  5. masterGaurav <> wrote:

    > I have some HTML content. I want to strip off the HTML tags and
    > retain the raw text.



    perldoc -q HTML

    How do I remove HTML from a string?


    > That's simple:



    If you think it is simple, then you haven't been thinking
    about it long enough.


    > $data =~ s/<.+?>//gsm



    The FAQ answer has a half dozen snippets of valid HTML that
    make that trip on its face.
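
To make the point above concrete, here is a minimal sketch that uses only core Perl. It runs the substitution from the question (with the missing `s` added) over three inputs. The inputs are illustrative and are not the snippets from the FAQ; `naive_strip` is a hypothetical helper name. In each case the `want` field holds the text a browser would display, and the regex produces something else.

```perl
use strict;
use warnings;

# The substitution from the question, wrapped in a sub for testing.
sub naive_strip {
    my ($html) = @_;
    $html =~ s/<.+?>//gsm;
    return $html;
}

# Each input is valid HTML; "want" is the text a browser would show.
my @cases = (
    # A '>' inside a quoted attribute value ends the match too early.
    { html => q{<img src="x.png" alt="a > b">done}, want => 'done' },
    # Markup inside a comment is treated as if it were real markup.
    { html => q{<!-- <b>old</b> --> new}, want => ' new' },
    # Script content is neither markup nor displayed text.
    { html => q{<script>if (i < 10 && j > 20) f();</script>ok}, want => 'ok' },
);

for my $case (@cases) {
    my $got = naive_strip($case->{html});
    printf "%-4s got=[%s] want=[%s]\n",
        ($got eq $case->{want} ? 'ok' : 'FAIL'), $got, $case->{want};
}
```

All three cases report FAIL, while simple markup such as the example in the question happens to work. That is why the regex looks fine until it meets real pages.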


    > However, now I have a condition... I want to strip off the tags only
    > if the content has at least one '$' character.



    > Can it be done using one regex,



    Why do you care what form the answer takes, don't you want an
    answer if it is not in the form of a regex?

    Use a module that understands HTML for processing HTML data.
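
As one possible reading of that advice, the following is a rough sketch built on HTML::TokeParser (part of the HTML-Parser distribution on CPAN, which is assumed to be installed). The sub name `text_if_priced` is hypothetical. It collects the text tokens and returns them only when they contain a '$'. It does not decode entities and does not skip the contents of script or style elements, so treat it as a starting point, not a finished answer.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Return the text content of $html if it contains a '$', else ''.
sub text_if_priced {
    my ($html) = @_;
    my $p = HTML::TokeParser->new(\$html)
        or die "could not create parser";
    $p->unbroken_text(1);    # deliver each run of text as one token

    my $text = '';
    while (my $token = $p->get_token) {
        next unless $token->[0] eq 'T';    # keep text tokens only
        $text .= $token->[1];
    }
    return index($text, '$') >= 0 ? $text : '';
}

print text_if_priced(q{<p><a href='#'>This is low priced at $50</a></p>}), "\n";
print text_if_priced(q{<p><a href='#'>This is low priced at 50</a></p>}), "\n";
```

The first call prints the sentence with the price; the second prints an empty line, matching the two examples in the original question.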


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, May 1, 2006
    #5
  6. masterGaurav

    robic0 Guest

    On 30 Apr 2006 20:32:40 -0700, "masterGaurav" <> wrote:

    >I am new to this forum. I searched for "robic0's RXParse" and found only
    >one result:
    >
    > http://www.codecomments.com/archive235-2006-2-806819.html
    >
    >Can you please explain...
    >
    >
    >Cheers,
    >Gaurav


    I meant in this forum. As many will tell you, you can't strip/alter HTML yourself.
    It's much more complicated than you think. What robic0 proposes is to create
    generalized XHTML/XML modification method(s) that are safe and guaranteed compliant.
    You can peruse his code in this group, and try out what's there on your HTML.
    Submit your requests to him (within this group).
    robic0, May 1, 2006
    #6
  7. masterGaurav

    Keith Keller Guest

    On 2006-05-01, masterGaurav <> wrote:
    > I am new to this forum. I searched for "robic0's RXParse" and found only
    > one result:


    Whatever you choose to do, do not use robic0's RXParse module. It is
    a good example of how not to code. Instead, do as others have suggested
    and read the perldocs on the subject, and then use one of the standard
    and well-coded HTML modules available from CPAN.

    --keith

    --
    -francisco.ca.us
    (try just my userid to email me)
    AOLSFAQ=http://wombat.san-francisco.ca.us/cgi-bin/fom
    see X- headers for PGP signature information
    Keith Keller, May 1, 2006
    #7
  8. masterGaurav

    robic0 Guest

    On Sun, 30 Apr 2006 21:31:31 -0700, Keith Keller <-francisco.ca.us> wrote:

    >On 2006-05-01, masterGaurav <> wrote:
    >> I am new to this forum. I searched for "robic0's RXParse" and found only
    >> one result:

    >
    >Whatever you choose to do, do not use robic0's RXParse module.

    And if he does use it, what are the consequences?
    You never used it and don't even know what 'it' is.
    >a good example of how not to code.

    The module is not a code example. It's a working 1.1 standard parser
    that you know as much about as how the foreskin appeared on your limp dick.
    >Instead, do as others have suggested

    that you never ever did or have ever done parsing
    >and read the perldocs on the subject,

    you don't know the subject. You don't know what the phrase parsing means at all..
    >and then use one of the standard

    there is no standard other than the w3c ones. Can you name one standard?
    >and well-coded HTML modules available from CPAN.

    and how would you know? You're an idiot, and dumb!
    >
    >--keith
    robic0, May 1, 2006
    #8
  9. masterGaurav

    masterGaurav Guest

    Thanks everybody for your help and pointers.

    I was 120% sure that it's impossible to write an HTML parser in one regex.
    TagStripper is ok... my code would work for scenarios except for cases
    where, say, Javascript exists (e.g.: i < 10 && j > 20).

    Well, anyway... I found some workaround for my special case.

    Thanks once again for your time.


    Cheers,
    Gaurav Vaish
    http://mastergaurav.org
    -----------------
    masterGaurav, May 1, 2006
    #9
  10. masterGaurav

    robic0 Guest

    On 1 May 2006 01:13:41 -0700, "masterGaurav" <> wrote:

    >Thanks everybody for your help and pointers.
    >
    >I was 120% sure that it's impossible to write an HTML parser in one regex.
    >TagStripper is ok... my code would work for scenarios except for cases
    >where, say, Javascript exists (e.g.: i < 10 && j > 20).
    >
    >Well, anyway... I found some workaround for my special case.
    >
    >Thanks once again for your time.
    >
    >
    >Cheers,
    >Gaurav Vaish
    >http://mastergaurav.org
    >-----------------


    Good that you understand. Neither JavaScript nor anything else you can think of
    will make a simple regexp work for parsing.

    I didn't mean to refer you to RXParse. Sorry if that was a problem.
    Keep learning, many years ahead of you. The shortcuts don't really work
    until you're old. You have a long time to get old. Unfortunately, while
    you're young, it's hard to know which way to go, hard to know what's real,
    hard to cut out wasted time. No fear though, there's only one path,
    most travel it.
    robic0, May 1, 2006
    #10
  11. "masterGaurav" <> wrote in news::

    > I am new to this forum.


    In that case, please read the posting guidelines.

    > I searched for "robic0's RXParse"


    Why did you search for that?

    > and found only one result:
    >
    > http://www.codecomments.com/archive235-2006-2-806819.html


    You seem to be under the illusion that comp.lang.perl.misc is a
    web based forum. It is not. It is a UseNet group.

    http://en.wikipedia.org/wiki/Usenet

    > Can you please explain...


    You have already been pointed to perldoc -q HTML. I am going
    to recommend that you skim through the entire FAQ list at least once.

    There are well established CPAN modules you can use to parse HTML.

    http://search.cpan.org/~gaas/HTML-Parser-3.54/
    http://search.cpan.org/~ovid/HTML-TokeParser-Simple-3.15/

    Sinan

    --
    A. Sinan Unur <>
    (remove .invalid and reverse each component for email address)

    comp.lang.perl.misc guidelines on the WWW:
    http://augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
    A. Sinan Unur, May 1, 2006
    #11
  12. masterGaurav

    robic0 Guest

    On Mon, 01 May 2006 10:18:44 GMT, "A. Sinan Unur" <> wrote:

    >"masterGaurav" <> wrote in news::
    >
    >> I am new to this forum.

    >
    >In that case, please read the posting guidelines.
    >
    >> I searched for "robic0's RXParse"

    >
    >Why did you search for that?


    I told him to as you well know. Is that a problem?
    >
    >> and found only one result:
    >>
    >> http://www.codecomments.com/archive235-2006-2-806819.html

    >
    >You seem to be under the illusion that comp.lang.perl.misc is a
    >web based forum. It is not. It is a UseNet group.
    >

    There is no illusion that web based news readers abound.
    Or maybe you don't know that?

    >http://en.wikipedia.org/wiki/Usenet
    >
    >> Can you please explain...


    Something wrong with his web hit?
    >
    >You have already been pointed to perldoc -q HTML. I am going
    >to recommend that you skim through the entire FAQ list at least once.
    >
    >There are well established CPAN modules you can use to parse HTML.
    >

    There are NO, I repeat NO modules that do what he wants to do.
    He never intended to parse html. Show me where big boy ...

    >http://search.cpan.org/~gaas/HTML-Parser-3.54/
    >http://search.cpan.org/~ovid/HTML-TokeParser-Simple-3.15/
    >
    >Sinan


    Stay clear of Sinan's advice. He means to bust your balls is all.
    robic0, May 1, 2006
    #12
  13. masterGaurav

    DJ Stunks Guest

    robic0 wrote:
    > On 30 Apr 2006 20:04:05 -0700, "masterGaurav" <> wrote:
    >
    > >Hi,
    > >
    > > I have some HTML content. I want to strip off the HTML tags and
    > >retain the raw text.
    > >I want to strip off the tags only
    > >if the content has at least one '$' character.
    > >

    > Submit your request to robic0's RXParse. Do a search on it.


    I hereby suggest that you, masterGaurav post some sample HTML and you,
    robic0 give us a sample script which demonstrates the use of your
    crummy parser to parse it.

    and robic, don't give us any bullshit about how expensive your time is
    and you can't afford to show a sample, a solid product demonstration
    will pay for itself 10-fold.

    with bated breath,
    -jp
    DJ Stunks, May 1, 2006
    #13
  14. > <robic0> wrote in message
    > news:...
    >> On Mon, 01 May 2006 10:18:44 GMT, "A. Sinan Unur"
    >> <> wrote:
    >>
    >>>"masterGaurav" <> wrote in
    >>>news::
    >>>>
    >>>> I have some HTML content. I want to strip off the HTML tags and
    >>>> retain the raw text.


    ....

    >>>There are well established CPAN modules you can use to parse HTML.
    >>>

    >> There are NO, I repeat NO modules that do what he wants to do.
    >> He never intended to parse html. Show me where big boy ...


    It must be hard to live with an IQ below room temperature.

    http://search.cpan.org/src/GAAS/HTML-Parser-3.54/eg/hstrip

    Sinan

    --
    A. Sinan Unur <>
    (remove .invalid and reverse each component for email address)

    comp.lang.perl.misc guidelines on the WWW:
    http://augustmail.com/~tadmc/clpmisc/clpmisc_guidelines.html
    A. Sinan Unur, May 1, 2006
    #14
  15. masterGaurav

    robic0 Guest

    On Mon, 01 May 2006 17:00:02 GMT, "Todd W" <> wrote:

    >
    ><robic0> wrote in message news:...
    >> On Mon, 01 May 2006 10:18:44 GMT, "A. Sinan Unur"
    >> <> wrote:
    >>
    >>>"masterGaurav" <> wrote in
    >>>news::
    >>>>
    >>>> I have some HTML content. I want to strip off the HTML tags and
    >>>> retain the raw text.
    >>>> That's simple:
    >>>>
    >>>> $data =~ s/<.+?>//gsm
    >>>>
    >>>> However, now I have a condition... I want to strip off the tags only
    >>>> if the content has at least one '$' character.
    >>>
    >>>There are well established CPAN modules you can use to parse HTML.
    >>>

    >> There are NO, I repeat NO modules that do what he wants to do.
    >> He never intended to parse html. Show me where big boy ...
    >>
    >>>http://search.cpan.org/~gaas/HTML-Parser-3.54/
    >>>http://search.cpan.org/~ovid/HTML-TokeParser-Simple-3.15/
    >>>

    >
    >$ perl toke.pl
    >TEXT: $100.00 Dollars
    >TEXT: discount: $75.25
    >
    >$ cat toke.pl
    >use warnings;
    >use strict;
    >
    >use HTML::TokeParser;
    >
    >my $p = HTML::TokeParser->new( \ join('', <DATA>) );
    >$p->unbroken_text(1);
    >
    >while (my $token = $p->get_token) {
    > if ( $token->[0] eq 'T' && $token->[1] =~ m|\$| ) {
    > print('TEXT: ' . $token->[1] . "\n");
    > }
    >}
    >
    >__DATA__
    ><html>
    > <head>
    > <title>$100.00 Dollars</title>
    > </head>
    > <body>
    > <img src="foo.img" alt="$100.00 USD">
    > <div>size: 50 x 50</div>
    > <div>discount: $75.25</div>
    > </body>
    ></html>
    >


    This is a debug output from RXParse using your data.
    Seems you're missing a closing tag somewhere?


    SCALAR ref
    --------------------
    char _:

    --------------------
    start _: html
    --------------------
    char _:

    --------------------
    start _: head
    --------------------
    char _:

    --------------------
    start _: title
    --------------------
    char _: $100.00 Dollars
    --------------------
    end _: /title
    --------------------
    char _:

    --------------------
    end _: /head
    --------------------
    char _:

    --------------------
    start _: body
    --------------------
    char _:

    --------------------
    start _: img
    src = foo.img
    alt = $100.00 USD
    --------------------
    char _:

    --------------------
    start _: div
    --------------------
    char _: size: 50 x 50
    --------------------
    end _: /div
    --------------------
    char _:

    --------------------
    start _: div
    --------------------
    char _: discount: $75.25
    --------------------
    end _: /div
    --------------------
    char _:

    --------------------
    rp_error_05, expected closing tag '/img' (line 10, col 9)
    robic0, May 5, 2006
    #15
  16. masterGaurav

    robic0 Guest

    On 1 May 2006 10:49:03 -0700, "DJ Stunks" <> wrote:

    >
    >robic0 wrote:
    >> On 30 Apr 2006 20:04:05 -0700, "masterGaurav" <> wrote:
    >>
    >> >Hi,
    >> >
    >> > I have some HTML content. I want to strip off the HTML tags and
    >> >retain the raw text.
    >> >I want to strip off the tags only
    >> >if the content has at least one '$' character.
    >> >

    >> Submit your request to robic0's RXParse. Do a search on it.

    >
    >I hereby suggest that you, masterGaurav post some sample HTML and you,
    >robic0 give us a sample script which demonstrates the use of your
    >crummy parser to parse it.
    >
    >and robic, don't give us any bullshit about how expensive your time is
    >and you can't afford to show a sample, a solid product demonstration
    >will pay for itself 10-fold.
    >
    >with bated breath,
    >-jp


    Hey, that's fair. The RXParse code (on this forum) needs to have the top usage examples,
    before the package declaration, commented out and a "1;" added before the __END__.
    Add a 'strict' statement after the package name. The file should be named RXParse.pm.

    Using Todd W's data sample within this thread, just using the default handlers with debug off
    (turning debug off just does syntax checking) yields this nice error:

    ======================================================
    use strict;
    use warnings;
    use RXParse;

    my $parse_ln = '
    <html>
    <head>
    <title>$100.00 Dollars</title>
    </head>
    <body>
    <img src="foo.img" alt="$100.00 USD">
    <div>size: 50 x 50</div>
    <div>discount: $75.25</div>
    </body>
    </html>
    ';

    my $p = new RXParse();

    #$p->setDebugMode(1);
    $p->parse(\$parse_ln);

    __END__
    rp_error_05, expected closing tag '/img' (line 10, col 9)

    ==========================================================

    There, now that didn't take a lot of my time.
    Un-commenting the debug line above yields this:


    SCALAR ref
    --------------------
    char _:

    --------------------
    start _: html
    --------------------
    char _:

    --------------------
    start _: head
    --------------------
    char _:

    --------------------
    start _: title
    --------------------
    char _: $100.00 Dollars
    --------------------
    end _: /title
    --------------------
    char _:

    --------------------
    end _: /head
    --------------------
    char _:

    --------------------
    start _: body
    --------------------
    char _:

    --------------------
    start _: img
    src = foo.img
    alt = $100.00 USD
    --------------------
    char _:

    --------------------
    start _: div
    --------------------
    char _: size: 50 x 50
    --------------------
    end _: /div
    --------------------
    char _:

    --------------------
    start _: div
    --------------------
    char _: discount: $75.25
    --------------------
    end _: /div
    --------------------
    char _:

    --------------------
    rp_error_05, expected closing tag '/img' (line 10, col 9)

    =============================================================

    You can expect the same data to be passed to your user defined handlers.
    The code that sets the user defined handlers goes like this:

    sub setHandlers {
        my ($self, @args) = @_;
        my %oldh = ();
        # Process name/handler pairs only when some were actually passed.
        if (scalar(@args)) {
            while (my ($name, $val) = splice(@args, 0, 2)) {
                $name =~ s/^\s+//s; $name =~ s/\s+$//s;
                my $hname = "h" . lc($name);
                if (exists $self->{$hname}) {
                    $oldh{$name} = $self->{$hname};
                    if (ref($val) eq 'CODE') {
                        $self->{$hname} = $val;
                    } else {
                        # if it's not a CODE ref,
                        # just set the default handler
                        $self->setDfltHandlers($name);
                    }
                }
            }
        }
        return %oldh;
    }

    I'll make up a sample user friendly template for you in a little bit.
    The parameters passed to the handlers, as well as setting the handlers
    are mostly those typical of Expat. There are many ways this can parse a block.
    See the parse method. I wouldn't be so quick to call this a 'crappy' parser, boy!

    robic0
    robic0, May 5, 2006
    #16
  17. masterGaurav

    l v Guest

    robic0 wrote:

    [snip]

    > >__DATA__
    > ><html>
    > > <head>
    > > <title>$100.00 Dollars</title>
    > > </head>
    > > <body>
    > > <img src="foo.img" alt="$100.00 USD">
    > > <div>size: 50 x 50</div>
    > > <div>discount: $75.25</div>
    > > </body>
    > ></html>
    > >

    >
    > This is a debug output from RXParse using your data.
    > Seems you're missing a closing tag somewhere?
    >


    [snip]

    > rp_error_05, expected closing tag '/img' (line 10, col 9)



    From http://www.w3schools.com/tags/tag_img.asp


    HTML <img> tag
    Definition and Usage

    The img element defines an image.

    Differences Between HTML and XHTML

    In HTML the <img> tag has no end tag.

    In XHTML the <img> tag must be properly closed.
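
The difference can be sketched with a small core-Perl balance check. This is an illustration only: `first_balance_error` is a hypothetical helper, the regex tokeniser is safe only because the sample markup is trivial, and HTML's optional end tags (such as `</p>`) are not handled. The `%VOID` set lists the elements that HTML 4.01 declares EMPTY.

```perl
use strict;
use warnings;

# Elements that HTML 4.01 declares EMPTY (no end tag).
my %VOID = map { $_ => 1 }
    qw(area base basefont br col frame hr img input isindex link meta param);

# Return a message for the first balance problem, or '' if none is found.
# $mode is 'html' or 'xhtml'.
sub first_balance_error {
    my ($markup, $mode) = @_;
    my @open;
    while ($markup =~ m{<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*?(/?)>}g) {
        my ($closing, $name, $self_closed) = ($1, lc $2, $3);
        if ($closing) {
            my $expected = @open ? pop @open : '';
            return "unexpected closing tag '/$name'" if $expected eq '';
            return "expected closing tag '/$expected' before '/$name'"
                if $expected ne $name;
        }
        elsif ($self_closed || ($mode eq 'html' && $VOID{$name})) {
            next;    # nothing to balance for <x/> or an HTML void element
        }
        else {
            push @open, $name;
        }
    }
    return @open ? "unclosed element '$open[-1]'" : '';
}

my $sample = <<'HTML';
<html><body>
<img src="foo.img" alt="$100.00 USD">
<div>discount: $75.25</div>
</body></html>
HTML

for my $mode (qw(html xhtml)) {
    my $err = first_balance_error($sample, $mode);
    print "$mode: ", ($err eq '' ? 'balanced' : $err), "\n";
}
```

Read as HTML, the sample is balanced; read as XHTML, the check stops at `</body>` because `img` is still open. That is consistent with the `rp_error_05` message quoted above if RXParse applies XML rules to the input.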

    Len
    l v, May 5, 2006
    #17
  18. masterGaurav

    l v Guest

    robic0 wrote:

    [big snip]

    > <html>
    > <head>
    > <title>$100.00 Dollars</title>
    > </head>
    > <body>
    > <img src="foo.img" alt="$100.00 USD">
    > <div>size: 50 x 50</div>
    > <div>discount: $75.25</div>
    > </body>
    > </html>
    > rp_error_05, expected closing tag '/img' (line 10, col 9)
    >


    [snip]

    > I'll make up a sample user friendly template for you in a little bit.
    > The parameters passed to the handlers, as well as setting the handlers
    > are those typical Expat, mostly. There are many ways this can parse a block.
    > See the parse method. I wouldn't be so quick to call this a 'crappy' parser boy!
    >
    > robic0



    From http://www.w3schools.com/tags/tag_img.asp


    HTML <img> tag
    Definition and Usage

    The img element defines an image.

    Differences Between HTML and XHTML

    In HTML the <img> tag has no end tag.

    In XHTML the <img> tag must be properly closed.
    l v, May 5, 2006
    #18
  19. masterGaurav

    robic0 Guest

    On 4 May 2006 19:31:11 -0700, "l v" <> wrote:

    >robic0 wrote:
    >
    >[snip]
    >
    >> >__DATA__
    >> ><html>
    >> > <head>
    >> > <title>$100.00 Dollars</title>
    >> > </head>
    >> > <body>
    >> > <img src="foo.img" alt="$100.00 USD">
    >> > <div>size: 50 x 50</div>
    >> > <div>discount: $75.25</div>
    >> > </body>
    >> ></html>
    >> >

    >>
    >> This is a debug output from RXParse using your data.
    >> Seems you're missing a closing tag somewhere?
    >>

    >
    >[snip]
    >
    >> rp_error_05, expected closing tag '/img' (line 10, col 9)

    >
    >
    >From http://www.w3schools.com/tags/tag_img.asp

    >
    >HTML <img> tag
    >Definition and Usage
    >
    >The img element defines an image.
    >
    >Differences Between HTML and XHTML
    >
    >In HTML the <img> tag has no end tag.
    >
    >In XHTML the <img> tag must be properly closed.
    >
    >Len


    Don't know what to say. DOCTYPE? Invoke with a force flag?
    Namespace, xmlns? Can't look at <html attr's>. Avoiding dtd imports
    so won't give <html> power, if I've halted at parsing ENTITY, ATTRIB, ELEMENT
    contents for now.

    A flag can force html, xhtml, xml. Minor regexp modifications (3 separate). Standards
    changing. I haven't gotten into loading namespace yet. Avoiding that at this stage
    trying to unitize the outer constructs. Wizz around w3c site a while.
    robic0, May 6, 2006
    #19
  20. masterGaurav

    robic0 Guest

    On Fri, 05 May 2006 18:18:36 -0700, robic0 wrote:

    >On 4 May 2006 19:31:11 -0700, "l v" <> wrote:
    >
    >>robic0 wrote:
    >>
    >>[snip]
    >>
    >>> >__DATA__
    >>> ><html>
    >>> > <head>
    >>> > <title>$100.00 Dollars</title>
    >>> > </head>
    >>> > <body>
    >>> > <img src="foo.img" alt="$100.00 USD">
    >>> > <div>size: 50 x 50</div>
    >>> > <div>discount: $75.25</div>
    >>> > </body>
    >>> ></html>
    >>> >
    >>>
    >>> This is a debug output from RXParse using your data.
    >>> Seems you're missing a closing tag somewhere?
    >>>

    >>
    >>[snip]
    >>
    >>> rp_error_05, expected closing tag '/img' (line 10, col 9)

    >>
    >>
    >>From http://www.w3schools.com/tags/tag_img.asp

    >>
    >>HTML <img> tag
    >>Definition and Usage
    >>
    >>The img element defines an image.
    >>
    >>Differences Between HTML and XHTML
    >>
    >>In HTML the <img> tag has no end tag.
    >>
    >>In XHTML the <img> tag must be properly closed.
    >>
    >>Len

    >
    >Don't know what to say. DOCTYPE? Invoke with a force flag?
    >Namespace, xmlns? Can't look at <html attr's>. Avoiding dtd imports
    >so won't give <html> power, if I've halted at parsing ENTITY, ATTRIB, ELEMENT
    >contents for now.
    >
    >A flag can force html, xhtml, xml. Minor regexp modifications (3 separate). Standards
    >changing. I haven't gotten into loading namespace yet. Avoiding that at this stage
    >trying to unitize the outer constructs. Wizz around w3c site a while.


    Some pages for reference:

    http://www.w3.org/TR/html4/strict.dtd
    http://www.w3schools.com/tags/default.asp
    http://www.w3.org/TR/xml11/

    Certainly I'm all over this.
    Then there's that SGML thing too...
    robic0, May 6, 2006
    #20