FAQ 6.4 How do I match XML, HTML, or other nasty, ugly things with a regex?

Discussion in 'Perl Misc' started by PerlFAQ Server, Feb 24, 2011.

  1. This is an excerpt from the latest version of perlfaq6.pod, which
    comes with the standard Perl distribution. These postings aim to
    reduce the number of repeated questions as well as allow the community
    to review and update the answers. The latest version of the complete
    perlfaq is at http://faq.perl.org .


    6.4: How do I match XML, HTML, or other nasty, ugly things with a regex?

    (contributed by brian d foy)

    If you just want to get work done, use a module and forget about the
    regular expressions. The "XML::Parser" and "HTML::Parser" modules are
    good starts, although each namespace has other parsing modules
    specialized for certain tasks and different ways of doing it. Start at
    CPAN Search ( http://search.cpan.org ) and wonder at all the work people
    have done for you already! :)
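
    As a rough illustration of the module route, here is a minimal
    sketch that pulls link targets out of a fragment with HTML::Parser.
    It assumes the module is installed from CPAN (it is not part of core
    Perl); the helper name "extract_links" and the sample markup are
    made up for this example.

```perl
use strict;
use warnings;
use HTML::Parser ();    # CPAN module, not part of core Perl

# Hypothetical helper: collect the href of every <a> start tag.
sub extract_links {
    my ($html) = @_;
    my @links;

    my $parser = HTML::Parser->new(
        api_version => 3,
        # Call the handler for each start tag, passing the tag name
        # and a hash reference of its attributes.
        start_h => [
            sub {
                my ($tagname, $attr) = @_;
                push @links, $attr->{href}
                    if $tagname eq 'a' && defined $attr->{href};
            },
            'tagname, attr',
        ],
    );

    $parser->parse($html);
    $parser->eof;    # signal the end of the document
    return @links;
}

my $html = q{<p>See <a href="http://faq.perl.org">the FAQ</a> and }
         . q{<A HREF='http://search.cpan.org'>CPAN</A>.<br/></p>};
print "$_\n" for extract_links($html);
```

    By default the parser lower-cases tag and attribute names, so the
    upper-case "<A HREF=...>" is picked up without any extra work in the
    handler, and either quoting style is accepted.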

    The problem with things such as XML is that they have balanced text
    containing multiple levels of balanced text, but sometimes it isn't
    balanced text, as in an empty tag ("<br/>", for instance). Even then,
    things can occur out-of-order. Just when you think you've got a pattern
    that matches your input, someone throws you a curveball.
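
    To make the nesting problem concrete, here is a small sketch of an
    obvious first attempt and how it goes wrong. The pattern, the helper
    name "naive_div", and the sample string are invented for
    illustration.

```perl
use strict;
use warnings;

# Naive pattern: a <div>, as little text as possible, then a </div>.
sub naive_div {
    my ($text) = @_;
    return $text =~ m{(<div>.*?</div>)}s ? $1 : undef;
}

my $nested = '<div>outer <div>inner</div> tail</div>';
print naive_div($nested), "\n";
# The match ends at the first </div>, so the outer element is cut short:
# <div>outer <div>inner</div>
```

    Making the quantifier greedy only moves the problem: it would then
    run to the last "</div>" in the input, which is wrong as soon as
    there are two sibling elements.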

    If you'd like to do it the hard way, scratching and clawing your way
    toward a right answer but constantly being disappointed, besieged by bug
    reports, and weary from the inordinate amount of time you have to spend
    reinventing a triangular wheel, then there are several things you can
    try before you give up in frustration:

    * Solve the balanced text problem from another question in perlfaq6

    * Try the recursive regex features in Perl 5.10 and later. See perlre.

    * Try defining a grammar using Perl 5.10's "(?DEFINE)" feature.

    * Break the problem down into sub-problems instead of trying to use a
    single regex

    * Convince everyone not to use XML or HTML in the first place
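
    The recursive regex and "(?DEFINE)" suggestions might look something
    like the sketch below. It is a toy grammar under heavy assumptions:
    no attributes, comments, CDATA, or entities, and it only checks that
    tags nest, not that the opening and closing names agree. The rule
    names and the helper "is_well_nested" are made up for this example.

```perl
use strict;
use warnings;
use 5.010;    # named recursion and (?(DEFINE)...) need Perl 5.10 or later

my $well_nested = qr{
    \A (?&content) \z

    (?(DEFINE)
        (?<name>    [A-Za-z][\w.-]* )
        (?<text>    [^<>]++ )                   # possessive: never gives text back
        (?<empty>   < (?&name) \s* / > )        # an empty tag such as <br/>
        (?<element> < (?&name) > (?&content) </ (?&name) > )
        (?<content> (?: (?&text) | (?&empty) | (?&element) )* )
    )
}x;

# Hypothetical helper: true if the text fits the toy grammar above.
sub is_well_nested {
    my ($text) = @_;
    return $text =~ $well_nested ? 1 : 0;
}

for my $sample ('<p>Hello <b>world</b><br/></p>', '<p>Hello <b>world</p>') {
    printf "%-32s %s\n", $sample,
        is_well_nested($sample) ? 'nested ok' : 'not balanced';
}
```

    The first sample passes and the second fails because one element is
    never closed. Something like "<a><b></a></b>" would still slip
    through, since the names are not compared, which is one more reason
    the modules are the easier path.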

    Good luck!
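
    As one possible reading of the "sub-problems" suggestion in the list
    above, the sketch below uses a small regex only to find tags and
    leaves the nesting check to ordinary Perl code with a stack. The
    helper name "check_tags" is invented here, and the tag pattern
    assumes that attribute values never contain "<" or ">" and that the
    input has no comments or CDATA sections.

```perl
use strict;
use warnings;

# Hypothetical helper: returns an empty string when every tag is closed
# in the right order, otherwise a description of the first problem.
sub check_tags {
    my ($text) = @_;
    my @open;    # names of the elements that are still open

    # Sub-problem 1: find the tags one at a time with a small regex.
    while ($text =~ m{<(/?)([A-Za-z][\w.-]*)[^<>]*?(/?)>}g) {
        my ($closing, $name, $self_closing) = ($1, $2, $3);

        # Sub-problem 2: track the nesting in plain code.
        if ($closing) {
            my $expected = pop @open;
            return "unexpected </$name>" unless defined $expected;
            return "expected </$expected> but found </$name>"
                if $expected ne $name;
        }
        elsif (!$self_closing) {
            push @open, $name;
        }
    }
    return @open ? "unclosed <$open[-1]>" : '';
}

for my $sample ('<p>Hello <b>world</b><br/></p>', '<a><b></a></b>') {
    my $problem = check_tags($sample);
    print $sample, ': ', ($problem || 'ok'), "\n";
}
```

    Because the names live on a stack rather than inside the pattern,
    mismatched closing tags are easy to report, and each piece can be
    tested on its own.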


    The perlfaq-workers, a group of volunteers, maintain the perlfaq. They
    are not necessarily experts in every domain where Perl might show up,
    so please include as much information as possible and relevant in any
    corrections. The perlfaq-workers also don't have access to every
    operating system or platform, so please include relevant details for
    corrections to examples that do not work on particular platforms.
    Working code is greatly appreciated.

    If you'd like to help maintain the perlfaq, see the details in
    PerlFAQ Server, Feb 24, 2011