RegEx Help Needed

Discussion in 'Perl Misc' started by DeepDiver, Dec 4, 2004.

  1. DeepDiver

    DeepDiver Guest

    I'm trying to parse a string of HTML that contains a mix of tags and text.
    My goal is to match and replace double quote marks in the text (but not
    within the tags) and replace them with the equivalent html character entity
    (i.e., &quot;).

    For example, this string:
    The "slow" red fox.<div class="test">The "quick" brown fox.</div>

    would become this:
    The &quot;slow&quot; red fox.<div class="test">The &quot;quick&quot;
    brown fox.</div>

    TIA!!!
    DeepDiver, Dec 4, 2004
    #1

  2. Sherm Pendley, Dec 4, 2004
    #2

  3. DeepDiver

    DeepDiver Guest

    "Sherm Pendley" <> wrote in message
    news:...
    >
    > Have a look at HTML::Parser on CPAN.
    >


    Thanks, but I'm in need of a pure RegEx solution.
    DeepDiver, Dec 4, 2004
    #3
  4. Lars Eighner

    Lars Eighner Guest

    In our last episode, <JZbsd.9270$>, the lovely
    and talented DeepDiver broadcast on comp.lang.perl.misc:

    > I'm trying to parse a string of HTML that contains a mix of tags and text.
    > My goal is to match and replace double quote marks in the text (but not
    > within the tags) and replace them with the equivalent html character entity
    > (i.e., &quot;).


    > For example, this string:
    > The "slow" red fox.<div class="test">The "quick" brown fox.</div>


    > would become this:
    > The &quot;slow&quot; red fox.<div class="test">The &quot;quick&quot;
    > brown fox.</div>


    > TIA!!!


    I can't do it in one, but --

    WARNING! Those offended by brute force ugliness should look away now!
    WARNING!

    goodwill~/test$ perl -0777 -wpi -e '1 while s/"([^<>]*<)/&quot;$1/g' test.html

    This won't work if you have unbalanced <s and/or > anywhere in the
    document such as a script with something like document.write("<")
    or simply unclosed tags. If you actually run this as a one-liner,
    beware of what your shell may do with $1 if you double quote the
    executable.
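
For what it's worth, the repeat-until-done loop in the one-liner above can probably be avoided with a lookahead, which Perl and .NET both support. The following is only a sketch in Python, not something posted in the thread: the pattern and the name `escape_text_quotes` are illustrative assumptions, and the same caveats about stray `<` and `>` (scripts, comments, unclosed tags) still apply.

```python
import re

# A double quote is treated as text when the next angle bracket after it
# is "<" (a tag is about to open) or when no angle bracket follows at all.
# A quote inside a tag meets ">" first, so the lookahead fails there.
TEXT_QUOTE = re.compile(r'"(?=[^<>]*(?:<|\Z))')


def escape_text_quotes(html: str) -> str:
    """Replace double quotes outside tags with &quot; in a single pass."""
    return TEXT_QUOTE.sub("&quot;", html)


sample = 'The "slow" red fox.<div class="test">The "quick" brown fox.</div>'
print(escape_text_quotes(sample))
```

Because the lookahead consumes nothing, every text quote is found in one substitution pass; the price is that the regex rescans the text up to the next tag for each quote.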


    --
    Lars Eighner -finger for geek code- http://www.io.com/~eighner/
    War on Terrorism: Camp Follower
    "I am ... a total sucker for the guys ... with all the ribbons on and stuff,
    and they say it's true and I'm ready to believe it. -Cokie Roberts,_ABC_
    Lars Eighner, Dec 4, 2004
    #4
  5. On 2004-12-04, DeepDiver <> wrote:
    > "Sherm Pendley" <> wrote in message
    > news:...
    >>
    >> Have a look at HTML::Parser on CPAN.
    >>

    >
    > Thanks, but I'm in need of a pure RegEx solution.


    This of course raises the question: Why?

    We can probably help you better if we have some idea of why you reject
    the generally accepted solution...

    dha

    --
    David H. Adler - <> - http://www.panix.com/~dha/
    [Insert Angus Prune Tune here]
    David H. Adler, Dec 4, 2004
    #5
  6. DeepDiver

    DeepDiver Guest

    "David H. Adler" <> wrote in message
    news:...
    > On 2004-12-04, DeepDiver <> wrote:
    > > "Sherm Pendley" <> wrote in message
    > > news:...
    > >>
    > >> Have a look at HTML::Parser on CPAN.
    > >>

    > >
    > > Thanks, but I'm in need of a pure RegEx solution.

    >
    > This of course raises the question: Why?



    A few reasons:

    1. I'm not programming in Perl. In fact, my experience with Perl was a long
    time ago (and not very extensive even then). I came here because I believe
    that Perl programmers are generally the most proficient with regular
    expressions.

    2. I'm writing the current routine in C#. But I would still prefer a "pure"
    RegEx solution so that I have something that is concise and (higher-level)
    language independent.

    3. I'm trying to improve my RegEx skills, so the more I can learn how to do
    things like this in RegEx (without "massaging" in a higher-level language)
    the better.

    I hope this addresses your concerns.

    Thanks,
    Michael
    DeepDiver, Dec 4, 2004
    #6
  7. DeepDiver wrote:

    > 1. I'm not programming in Perl.
    >
    > 2. I'm writing the current routine in C#.


    This is a Perl group. The C# group is down the hall to the left. Don't
    let the door hit you on the way out.

    sherm--

    --
    Cocoa programming in Perl: http://camelbones.sourceforge.net
    Hire me! My resume: http://www.dot-app.org
    Sherm Pendley, Dec 4, 2004
    #7
  8. Joe Smith

    Joe Smith Guest

    DeepDiver wrote:

    > 1. I came here because I believe
    > that Perl programmers are generally the most proficient with regular
    > expressions.


    Regular expressions as implemented in other languages are not the same.

    Using just a regular expression won't cut it; correct parsing usually
    requires program logic as well.
    -Joe
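
As a rough illustration of what "program logic" could look like here, the sketch below uses Python's standard `html.parser` module instead of the Perl module suggested earlier in the thread. The class name `QuoteEscaper` and the function `escape_quotes_with_parser` are made up for this example. It is a sketch under simplifying assumptions, not a faithful round-trip: entities in text are decoded and re-escaped (so `&nbsp;` comes out as a literal non-breaking space), and end tags are re-emitted in normalized lowercase form.

```python
from html import escape
from html.parser import HTMLParser


class QuoteEscaper(HTMLParser):
    """Re-emit the document, turning " into &quot; in text nodes only."""

    RAW_TEXT_TAGS = ("script", "style")  # content that must stay untouched

    def __init__(self):
        super().__init__()  # default convert_charrefs=True decodes entities
        self.out = []
        self.in_raw = False

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the tag exactly as written in the input
        self.out.append(self.get_starttag_text())
        if tag in self.RAW_TEXT_TAGS:
            self.in_raw = True

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")
        if tag in self.RAW_TEXT_TAGS:
            self.in_raw = False

    def handle_data(self, data):
        if self.in_raw:
            self.out.append(data)  # leave script/style content alone
        else:
            # Re-escape &, <, > first, then the double quotes.
            self.out.append(escape(data, quote=False).replace('"', "&quot;"))

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def escape_quotes_with_parser(html: str) -> str:
    parser = QuoteEscaper()
    parser.feed(html)
    parser.close()  # flush any text still buffered after the last tag
    return "".join(parser.out)


sample = 'The "slow" red fox.<div class="test">The "quick" brown fox.</div>'
print(escape_quotes_with_parser(sample))
```

Unlike a bare pattern, the parser keeps track of whether it is inside a tag, a comment, or a script, which is the kind of state a single regex has trouble carrying around.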
    Joe Smith, Dec 4, 2004
    #8
  9. Also sprach DeepDiver:

    > "David H. Adler" <> wrote in message
    > news:...
    >> On 2004-12-04, DeepDiver <> wrote:
    >> > "Sherm Pendley" <> wrote in message
    >> > news:...
    >> >>
    >> >> Have a look at HTML::Parser on CPAN.
    >> >>
    >> >
    >> > Thanks, but I'm in need of a pure RegEx solution.

    >>
    >> This of course raises the question: Why?

    >
    >
    > A few reasons:
    >
    > 1. I'm not programming in Perl. In fact, my experience with Perl was a long
    > time ago (and not very extensive even then). I came here because I believe
    > that Perl programmers are generally the most proficient with regular
    > expressions.


    This nonetheless makes your posting rather off-topic in this group. Perl
    did not invent regular expressions. Also, Perl regular expressions are
    likely to be more powerful than regular expressions found in other
    languages. This means you probably couldn't use a regex solution
    from this group in your program.

    > 2. I'm writing the current routine in C#. But I would still prefer a "pure"
    > RegEx solution so that I have something that is concise and (higher-level)
    > language independent.


    I have my doubts as to the conciseness of a pure regex solution.
    Classical regular expressions aren't even remotely powerful enough to
    parse HTML (and there's not much to argue about: it can be proven with
    the famous pumping lemma). Perl's regular expressions might be powerful
    enough as they have some non-regular extensions (they allow
    back-references, they can be recursive etc.). Still, a regex solution
    could hardly be robust, not to mention that .NET regular expressions
    lack many of the Perl features.

    Tassilo
    --
    $_=q#",}])!JAPH!qq(tsuJ[{@"tnirp}3..0}_$;//::niam/s~=)]3[))_$-3(rellac(=_$({
    pam{rekcahbus})(rekcah{lrePbus})(lreP{rehtonabus})!JAPH!qq(rehtona{tsuJbus#;
    $_=reverse,s+(?<=sub).+q#q!'"qq.\t$&."'!#+sexisexiixesixeseg;y~\n~~dddd;eval
    Tassilo v. Parseval, Dec 4, 2004
    #9
  10. On Sat, 4 Dec 2004, Tassilo v. Parseval wrote:

    > Perl regular expressions are likely to be more powerful than regular
    > expressions found in other languages.


    Would this be a moment to mention PCRE, http://www.pcre.org/ ?

    "Perl Compatible Regular Expressions" library.

    I often use its diagnostic command, "pcretest", to explore the
    behaviour of some complex regex that I'm working with when fed
    various data, whether the regex is meant for Perl or for the ACLs
    of the same author's excellent MTA, exim.

    (Of course, that has nothing to do with attempting to use regexes for
    parsing arbitrary HTML - which is ultimately hopeless.)
    Alan J. Flavell, Dec 4, 2004
    #10
  11. DeepDiver wrote:
    [About parsing HTML]
    > "Sherm Pendley" <> wrote in message
    > news:...
    >>
    >> Have a look at HTML::Parser on CPAN.
    >>

    >
    > Thanks, but I'm in need of a pure RegEx solution.


    Forget it. Nobody with a sane mind would try parsing HTML using pure REs.
    Contrary to popular belief, parsing HTML is non-trivial, and while it is not
    decided yet whether Perl's advanced REs are powerful enough to do it, it
    would most certainly be _way_ too complex to be of any real use.
    As this has been discussed many times before, please see the FAQ and Google
    for further details.

    jue
    Jürgen Exner, Dec 4, 2004
    #11
  12. DeepDiver wrote:

    > "Sherm Pendley" <> wrote in message
    > news:...
    >>
    >> Have a look at HTML::Parser on CPAN.
    >>

    >
    > Thanks, but I'm in need of a pure RegEx solution.


    No, you aren't. You may think you are, but you aren't.
    --
    Christopher Mattern

    "Which one you figure tracked us?"
    "The ugly one, sir."
    "...Could you be more specific?"
    Chris Mattern, Dec 4, 2004
    #12
  13. DeepDiver wrote:

    > "David H. Adler" <> wrote in message
    > news:...
    >> On 2004-12-04, DeepDiver <> wrote:
    >> > "Sherm Pendley" <> wrote in message
    >> > news:...
    >> >>
    >> >> Have a look at HTML::Parser on CPAN.
    >> >>
    >> >
    >> > Thanks, but I'm in need of a pure RegEx solution.

    >>
    >> This of course raises the question: Why?

    >
    >
    > A few reasons:
    >
    > 1. I'm not programming in Perl. In fact, my experience with Perl was a
    > long time ago (and not very extensive even then). I came here because I
    > believe that Perl programmers are generally the most proficient with
    > regular expressions.


    Regular expressions differ subtly but significantly between the languages
    that implement them. Solutions formulated for Perl regular expressions
    would have a good chance of not working in your language. Ask in a
    forum that deals with your language.
    >
    > 2. I'm writing the current routine in C#. But I would still prefer a
    > "pure" RegEx solution so that I have something that is concise and
    > (higher-level) language independent.


    See above about the portability of regular expressions.
    >
    > 3. I'm trying to improve my RegEx skills, so the more I can learn how to
    > do things like this in RegEx (without "massaging" in a higher-level
    > language) the better.


    Regular expressions are a very poor tool for parsing HTML. Depending
    on your task, using them to do so will range from hair-tearingly frustrating
    to simply impossible. Parsing HTML is not a trivial task. The main
    lesson you would learn trying to parse HTML with regular expressions would
    be, if you were paying attention, "don't parse HTML with regular
    expressions".
    >
    > I hope this addresses your concerns.


    Hope these address yours.
    >
    > Thanks,
    > Michael


    --
    Christopher Mattern

    "Which one you figure tracked us?"
    "The ugly one, sir."
    "...Could you be more specific?"
    Chris Mattern, Dec 4, 2004
    #13