RegEx Help Needed

DeepDiver

I'm trying to parse a string of HTML that contains a mix of tags and text.
My goal is to match and replace double quote marks in the text (but not
within the tags) and replace them with the equivalent html character entity
(i.e., &quot;).

For example, this string:
The "slow" red fox.<div class="test">The "quick" brown fox.</div>

would become this:
The &quot;slow&quot; red fox.<div class="test">The &quot;quick&quot;
brown fox.</div>

TIA!!!
 
Lars Eighner

DeepDiver said:
I'm trying to parse a string of HTML that contains a mix of tags and text.
My goal is to match and replace double quote marks in the text (but not
within the tags) and replace them with the equivalent html character entity
(i.e., &quot;).
For example, this string:
The "slow" red fox.<div class="test">The "quick" brown fox.</div>
would become this:
The &quot;slow&quot; red fox.<div class="test">The &quot;quick&quot;
brown fox.</div>

I can't do it in one, but --

WARNING! Those offended by brute force ugliness should look away now!
WARNING!

goodwill~/test$ perl -wpi -e '$/=undef;while( s/\"([^<>]*<)/&quot\;$1/g ){};' test.html

This won't work if you have an unbalanced < or > anywhere in the
document, such as a script with something like document.write("<"),
or simply unclosed tags. If you actually run this as a one-liner,
beware of what your shell may do with $1 if you double-quote the
Perl code.
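
For readers who want to see the same idea outside Perl, here is a minimal sketch in Python (chosen only for illustration; nobody in the thread used it). It makes the same assumption as the one-liner above, namely that every < opens a tag and the next > closes it. The name escape_text_quotes is made up for this example. The pattern matches either a whole tag or a lone double quote, and the callback rewrites only the quotes:

```python
import re

# Match either a complete tag (copied through) or a single double quote.
# Assumption: a tag is "<", then anything except ">", then ">".
TAG_OR_QUOTE = re.compile(r'<[^>]*>|"')


def escape_text_quotes(html):
    """Replace double quotes outside tags with &quot; (naive sketch)."""

    def fix(match):
        token = match.group(0)
        # Anything starting with "<" is a tag and is left unchanged.
        return token if token.startswith("<") else "&quot;"

    return TAG_OR_QUOTE.sub(fix, html)


sample = 'The "slow" red fox.<div class="test">The "quick" brown fox.</div>'
print(escape_text_quotes(sample))
```

On the sample string this should print the target string from the original question. It is no more robust than the one-liner: an attribute value containing >, as in <a title="x > y">, ends the "tag" early, so the quote that closes the attribute gets rewritten as well.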
 
David H. Adler

DeepDiver said:
Thanks, but I'm in need of a pure RegEx solution.

This of course raises the question: Why?

We can probably help you better if we have some idea of why you reject
the generally accepted solution...

dha
 
DeepDiver

David H. Adler said:
This of course raises the question: Why?


A few reasons:

1. I'm not programming in Perl. In fact, my experience with Perl was a long
time ago (and not very extensive even then). I came here because I believe
that Perl programmers are generally the most proficient with regular
expressions.

2. I'm writing the current routine in C#. But I would still prefer a "pure"
RegEx solution so that I have something that is concise and (higher-level)
language independent.

3. I'm trying to improve my RegEx skills, so the more I can learn how to do
things like this in RegEx (without "massaging" in a higher-level language)
the better.

I hope this addresses your concerns.

Thanks,
Michael
 
Sherm Pendley

DeepDiver said:
1. I'm not programming in Perl.

2. I'm writing the current routine in C#.

This is a Perl group. The C# group is down the hall to the left. Don't
let the door hit you on the way out.

sherm--
 
Joe Smith

DeepDiver said:
1. I came here because I believe
that Perl programmers are generally the most proficient with regular
expressions.

Regular expressions as implemented in other languages are not the same.

Using just a regular expression won't cut it; correct parsing usually
requires program logic as well.
-Joe
 
Tassilo v. Parseval

Thus spoke DeepDiver:
A few reasons:

1. I'm not programming in Perl. In fact, my experience with Perl was a long
time ago (and not very extensive even then). I came here because I believe
that Perl programmers are generally the most proficient with regular
expressions.

This nonetheless makes your posting rather off-topic in this group. Perl
did not invent regular expressions. Also, Perl regular expressions are
likely to be more powerful than regular expressions found in other
languages. This means you probably couldn't use a regex solution
from this group in your program.

2. I'm writing the current routine in C#. But I would still prefer a "pure"
RegEx solution so that I have something that is concise and (higher-level)
language independent.

I have my doubts as to the conciseness of a pure regex solution.
Classical regular expressions aren't even remotely powerful enough to
parse HTML (and there's not much to argue about: it can be proven with
the famous pumping lemma). Perl's regular expressions might be powerful
enough, as they have some non-regular extensions (they allow
back-references, they can be recursive, etc.). Still, a regex solution
could hardly be robust, not to mention that .NET regular expressions
lack many of the Perl features.
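
To illustrate what leaning on a parser instead of a lone regex might look like, here is a rough sketch using Python's standard html.parser module (Python is used purely for illustration; it is not the language of this group or of the original poster). The class QuoteEscaper and the function escape_quotes_with_parser are hypothetical names invented for this example. The parser decides what is a tag and what is text, and only text gets its quotes replaced:

```python
from html.parser import HTMLParser

# Elements whose content is raw text and must not be escaped.
RAW_TEXT_TAGS = {"script", "style"}


class QuoteEscaper(HTMLParser):
    """Re-emit the input, escaping double quotes only in ordinary text."""

    def __init__(self):
        # convert_charrefs=False keeps existing entities such as &amp; intact.
        super().__init__(convert_charrefs=False)
        self.out = []
        self.in_raw_text = False

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # tag exactly as written
        if tag in RAW_TEXT_TAGS:
            self.in_raw_text = True

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")  # note: tag name comes back lowercased
        if tag in RAW_TEXT_TAGS:
            self.in_raw_text = False

    def handle_data(self, data):
        if not self.in_raw_text:
            data = data.replace('"', "&quot;")
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")


def escape_quotes_with_parser(html):
    parser = QuoteEscaper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(escape_quotes_with_parser('<a title="x > y">"hi"</a>'))
```

This is longer than a one-line substitution, but the attribute containing > and the document.write("<") script case are handled by the parser rather than by an ever-growing pattern. It is still only a sketch: it normalises the case of end tags and makes no promise about badly broken markup.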

Tassilo
 
Alan J. Flavell

Tassilo v. Parseval said:
Perl regular expressions are likely to be more powerful than regular
expressions found in other languages.

Would this be a moment to mention PCRE, http://www.pcre.org/ ?

"Perl Compatible Regular Expressions" library.

I often use its diagnostic command, "pcretest", to explore the
behaviour of some complex regex that I'm working with when it is fed
various data, whether the regex is meant for Perl or for the ACLs of
the same author's excellent MTA, exim.

(Of course, that has nothing to do with attempting to use regexes for
parsing arbitrary HTML - which is ultimately hopeless.)
 
Jürgen Exner

DeepDiver wrote:
[About parsing HTML]
Thanks, but I'm in need of a pure RegEx solution.

Forget it. Nobody with a sane mind would try parsing HTML using pure REs.
Contrary to popular belief, parsing HTML is non-trivial, and while it is not
decided yet whether Perl's advanced REs are powerful enough to do it, it
would most certainly be _way_ too complex to be of any real use.
As this has been discussed many times before, please see the FAQ and Google
for further details.

jue
 
Chris Mattern

DeepDiver said:
Thanks, but I'm in need of a pure RegEx solution.

No, you aren't. You may think you are, but you aren't.
--
Christopher Mattern

"Which one you figure tracked us?"
"The ugly one, sir."
"...Could you be more specific?"
 
Chris Mattern

DeepDiver said:
A few reasons:

1. I'm not programming in Perl. In fact, my experience with Perl was a
long time ago (and not very extensive even then). I came here because I
believe that Perl programmers are generally the most proficient with
regular expressions.

Regular expressions differ subtly but significantly between the languages
that implement them. Solutions formulated for Perl regular expressions
would have a good chance of not working in your language. Ask in a
forum that deals with your language.

2. I'm writing the current routine in C#. But I would still prefer a
"pure" RegEx solution so that I have something that is concise and
(higher-level) language independent.

See above about the portability of regular expressions.

3. I'm trying to improve my RegEx skills, so the more I can learn how to
do things like this in RegEx (without "massaging" in a higher-level
language) the better.

Regular expressions are a very poor tool for parsing HTML. Depending
on your task, using them to do so will range from hair-tearingly
frustrating to simply impossible. Parsing HTML is not a trivial task.
The main lesson you would learn from trying to parse HTML with regular
expressions would be, if you were paying attention, "don't parse HTML
with regular expressions".

I hope this addresses your concerns.

Hope these address yours.

Thanks,
Michael

--
Christopher Mattern

"Which one you figure tracked us?"
"The ugly one, sir."
"...Could you be more specific?"
 
