The definitive statement on parsing HTML with regular expressions

Discussion in 'Perl Misc' started by Tim McDaniel, Jan 29, 2013.

  1. Tim McDaniel

    Tim McDaniel Guest

    I'd have to say that at
    http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
    the first answer is definitive. I know that The Pony is real, for I
    have fed carrots to His Effulgent Face. And I don't even know what
    "Effulgent" means, except that it means His Face.

    Actually, I just saw it on the Cheezburger Network and thought it was
    funny.

    And yes, if you *know* that your HTML is simple and limited (for
    example, generated by a known program), you may be able to parse those
    particular files with regexps.
     
    Tim McDaniel, Jan 29, 2013
    #1

  2. Tim McDaniel

    Tim McDaniel Guest

    March 2012 now counts as "a *long* time ago" in Interweb Time.
    In any event, I wrote,

    [...]

    If so, only by accident. Looking up "effulgent", I should have
    written "Darkly Effulgent" for better effect.

    [...]

    Um, your goalposts seem to be moving.

    [...]

    Bit of a drawback, eh wot? as few people want to merely recognize XML.

    In any event, I think it's difficult to parse HTML or XML *correctly*
    with *any* technology, due to corner cases and features. In general,
    a better answer is usually to use an existing module.
     
    Tim McDaniel, Jan 30, 2013
    #2

  3. Rainer Weikusat

    (Tim McDaniel) writes:

    [...]
    The conclusion "it is difficult" => "everybody else must have solved
    it correctly already" seems a little flimsy to me ...
     
    Rainer Weikusat, Jan 30, 2013
    #3
  4. Charlton Wilbur

    RW> (Tim McDaniel) writes: [...]

    RW> The conclusion "it is difficult" => "everybody else must have
    RW> solved it correctly already" seems a little flimsy to me ...

    How the hell do you make that leap?

    It is difficult, so it is better to use a mature code package that many
    people have used (and thus tested) than it is to roll your own.

    Charlton
     
    Charlton Wilbur, Jan 30, 2013
    #4
  5. brian d foy

    brian d foy Guest

    brian d foy, Jan 31, 2013
    #5
  6. Charlton Wilbur

    bdf> It's certainly funny, and was dogma until tchrist actually
    bdf> solved it with a recursive regex in a different Stackoverflow
    bdf> answer:

    bdf> http://stackoverflow.com/questions/4231382/regular-expression-pattern-not-matching-anywhere-in-string/4234491#4234491

    To be honest, before tchrist's answer it was dogma that was known to be
    false by those of us who either understand the theory of computation
    (since Perl's regular expressions stopped being strictly regular some
    time ago) or have had to update or maintain a dog's breakfast of HTML
    "parsing" using regular expressions.

    tchrist does continue to say that even though you CAN parse HTML with
    Perl regular expressions, you probably SHOULDN'T, because the larger and
    more sophisticated the problem, the better it is to use a real parser.
    Which is wisdom, and I am not just saying that because I have been
    saying it for 10+ years at this point.

    Charlton
     
    Charlton Wilbur, Jan 31, 2013
    #6
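
The point the thread converges on, that a regexp is fine for HTML you *know* to be simple while a real parser copes better with corner cases, can be sketched roughly as below. This is only an illustration under stated assumptions: it is written in Python with the standard library's `html.parser` rather than in Perl (on the Perl side, a CPAN module such as HTML::Parser would play the parser's role), and `HREF_RE`, `links_by_regex`, `LinkCollector`, `links_by_parser`, and both sample strings are hypothetical names invented for the example, not anything posted in the thread.

```python
import re
from html.parser import HTMLParser

# Naive pattern (an assumption for illustration): lowercase <a>, with a
# double-quoted href as its first attribute.
HREF_RE = re.compile(r'<a\s+href="([^"]*)"')


def links_by_regex(html):
    """Return href values found by the naive regexp."""
    return HREF_RE.findall(html)


class LinkCollector(HTMLParser):
    """Collect href values from <a> start tags using a real tokenizer."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def links_by_parser(html):
    """Return href values found by the parser-based collector."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


# HTML "generated by a known program": the regexp is good enough here.
SIMPLE = '<p><a href="/one">1</a> <a href="/two">2</a></p>'

# Corner cases: a link inside a comment, an uppercase tag, single quotes,
# and href that is not the first attribute.
MESSY = ('<!-- <a href="/commented-out">x</a> -->'
         "<A class='x' HREF='/three'>3</A>")

if __name__ == "__main__":
    for label, sample in (("simple", SIMPLE), ("messy", MESSY)):
        print(label, links_by_regex(sample), links_by_parser(sample))
```

On `SIMPLE` both functions return the same two links. On `MESSY` the regexp reports the commented-out link and misses the real one, while the parser returns only `/three`. The regexp could be patched for each of these cases, but every patch is one more corner case to maintain by hand.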
