The definitive statement on parsing HTML with regular expressions

Discussion in 'Perl Misc' started by Tim McDaniel, Jan 29, 2013.

  1. Tim McDaniel

    Tim McDaniel Guest

    I'd have to say that at
    http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
    the first answer is definitive. I know that The Pony is real, for I
    have fed carrots to His Effulgent Face. And I don't even know what
    "Effulgent" means, except that it means His Face.

    Actually, I just saw it on the Cheezburger Network and thought it was
    funny.

    And yes, if you *know* that your HTML is simple and limited (for
    example, generated by a known program), you may be able to parse those
    particular files with regexps.
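
As an illustration of that "simple and limited" case, here is a minimal sketch, written in Python rather than Perl and not taken from the thread. The generator output, the `data-id` attribute, and the names `GENERATED`, `ITEM_RE`, and `extract_items` are all hypothetical. It assumes the generating program always emits one `<li>` per line with the attributes in a fixed order.

```python
import re

# Hypothetical output of a known generator: one <li> per line,
# attributes always in the same order, no nested markup in the label.
GENERATED = """\
<ul>
<li class="item" data-id="1">apple</li>
<li class="item" data-id="2">banana</li>
</ul>
"""

# Matches only the exact line shape the assumed generator emits.
ITEM_RE = re.compile(
    r'^<li class="item" data-id="(\d+)">([^<]*)</li>$',
    re.MULTILINE,
)


def extract_items(text):
    """Return (id, label) pairs from generator output of the assumed shape."""
    return [(int(m.group(1)), m.group(2)) for m in ITEM_RE.finditer(text)]


items = extract_items(GENERATED)
```

The pattern is deliberately rigid: if the generator ever reorders the attributes or wraps a line, the match silently returns nothing for that item, which is the usual way this approach breaks.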

    --
    Tim McDaniel,
    Tim McDaniel, Jan 29, 2013
    #1

  2. Tim McDaniel

    Tim McDaniel Guest

    In article <>,
    Ben Morrow <> wrote:
    >
    >Quoth :
    >> I'd have to say that at
    >> http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

    >
    >That was posted a *long* time ago...


    March 2012 now counts as "a *long* time ago" in Interweb Time.
    In any event, I wrote,
    >> Actually, I just saw it on the Cheezburger Network and thought it was
    >> funny.


    >> the first answer is definitive. I know that The Pony is real, for I
    >> have fed carrots to His Effulgent Face. And I don't even know what
    >> "Effulgent" means, except that it means His Face.

    >
    >(A BtVS reference?)


    If so, only by accident. Looking up "effulgent", I should have
    written "Darkly Effulgent" for better effect.

    >It is, in fact, possible to parse HTML correctly with Perl regexen

    [...]
    >Below is a pattern which matches valid XML


    Um, your goalposts seem to be moving.

    >However, it's currently rather difficult to modify it to do
    >anything *useful* with the result, most importantly because of the
    >limitations on both (?(DEFINE)) and (?{}).


    Bit of a drawback, eh wot? as few people want to merely recognize XML.

    In any event, I think it's difficult to parse HTML or XML *correctly*
    with *any* technology, due to corner cases and features. In general,
    a better answer is usually to use an existing module.
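
To make the corner-case point concrete, here is a small sketch (in Python, not from the thread) comparing a naive pattern with the standard library's `html.parser` module. The input string and the names `NAIVE_RE`, `HrefCollector`, `hrefs_by_regex`, and `hrefs_by_parser` are illustrative assumptions. The input contains two common traps: a tag inside a comment, and a `>` inside a quoted attribute value.

```python
import re
from html.parser import HTMLParser

# Illustrative input: a commented-out link, then a real link whose
# title attribute contains a literal ">".
HTML = '<!-- <a href="ignored"> --><a title="1 > 0" href="/real">link</a>'

# Naive pattern: assumes no ">" inside attribute values and no comments.
NAIVE_RE = re.compile(r'<a\s[^>]*href="([^"]*)"')


class HrefCollector(HTMLParser):
    """Collects href values from <a> start tags seen by the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.hrefs.append(value)


def hrefs_by_regex(text):
    return NAIVE_RE.findall(text)


def hrefs_by_parser(text):
    collector = HrefCollector()
    collector.feed(text)
    collector.close()
    return collector.hrefs
```

On this input the naive pattern picks up the commented-out link and misses the real one, while the parser skips the comment and reports only `/real`. The regex could be patched for these two cases, but each patch tends to expose the next corner case, which is the argument for a module that many people have already exercised.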

    --
    Tim McDaniel,
    Tim McDaniel, Jan 30, 2013
    #2

  3. (Tim McDaniel) writes:

    [...]

    >>However, it's currently rather difficult to modify it to do
    >>anything *useful* with the result, most importantly because of the
    >>limitations on both (?(DEFINE)) and (?{}).

    >
    > Bit of a drawback, eh wot? as few people want to merely recognize XML.
    >
    > In any event, I think it's difficult to parse HTML or XML *correctly*
    > with *any* technology, due to corner cases and features. In general,
    > a better answer is usually to use an existing module.


    The conclusion "it is difficult" => "everybody else must have solved
    it correctly already" seems a little flimsy to me ...
    Rainer Weikusat, Jan 30, 2013
    #3
  4. >>>>> "RW" == Rainer Weikusat <> writes:

    RW> (Tim McDaniel) writes: [...]

    >> In any event, I think it's difficult to parse HTML or XML
    >> *correctly* with *any* technology, due to corner cases and
    >> features. In general, a better answer is usually to use an
    >> existing module.


    RW> The conclusion "it is difficult" => "everybody else must have
    RW> solved it correctly already" seems a little flimsy to me ...

    How the hell do you make that leap?

    It is difficult, so it is better to use a mature code package that many
    people have used (and thus tested) than it is to roll your own.

    Charlton



    --
    Charlton Wilbur
    Charlton Wilbur, Jan 30, 2013
    #4
  5. Tim McDaniel

    brian d foy Guest

    brian d foy, Jan 31, 2013
    #5
  6. >>>>> "bdf" == brian d foy <> writes:

    bdf> In article <ke9gk0$9vd$>, Tim McDaniel
    bdf> <> wrote:

    >> I'd have to say that at
    >> http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
    >> the first answer is definitive.


    bdf> It's certainly funny, and was dogma until tchrist actually
    bdf> solved it with a recursive regex in a different Stackoverflow
    bdf> answer:

    bdf> http://stackoverflow.com/questions/4231382/regular-expression-pattern-not-matching-anywhere-in-string/4234491#4234491

    To be honest, before tchrist's answer it was dogma that was known to be
    false by those of us who either understand the theory of computation
    (since Perl's regular expressions stopped being strictly regular some
    time ago) or who had to update or maintain a dog's breakfast of HTML
    "parsing" using regular expressions.
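
The theory-of-computation point can be sketched briefly. A strictly regular pattern has no memory for nesting depth, so it cannot check that arbitrarily nested tags are balanced; something stack-like is needed, which is roughly what the recursive constructs in Perl patterns provide. The Python sketch below is not from the thread. It uses a regex only as a tokenizer and keeps the nesting state in an explicit stack. The names `TAG_RE` and `is_balanced` are hypothetical, and the tag syntax is a deliberately tiny subset (lowercase names, no attributes, no comments, no self-closing tags).

```python
import re

# Tokenizer only: finds <name> or </name>. All nesting logic lives in
# the stack below, not in the pattern.
TAG_RE = re.compile(r"<(/?)([a-z]+)>")


def is_balanced(text):
    """True if every <x> is closed by a matching </x> in nested order."""
    stack = []
    for closing, name in TAG_RE.findall(text):
        if not closing:
            # Opening tag: remember it until its close arrives.
            stack.append(name)
        elif not stack or stack.pop() != name:
            # Closing tag with nothing open, or closing the wrong tag.
            return False
    # Anything left on the stack was never closed.
    return not stack
```

For example, `is_balanced("<a><b></b></a>")` is true, while `is_balanced("<a><b></a></b>")` and `is_balanced("<a>")` are false. The stack is what lets the check follow nesting to any depth.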

    tchrist does continue to say that even though you CAN parse HTML with
    Perl regular expressions, you probably SHOULDN'T, because the larger and
    more sophisticated the problem, the better it is to use a real parser.
    Which is wisdom, and I am not just saying that because I have been
    saying it for 10+ years at this point.

    Charlton


    --
    Charlton Wilbur
    Charlton Wilbur, Jan 31, 2013
    #6