Regex Replacement: Replacing text with an empty string

Hal Vaughan

I'm trying to clean up some comments in web pages. I'm using regexes to do
a lot of the work, but I've run into a problem. Toward the end of the
process, I'm trying to replace any remaining HTML tags with an empty
string, as in no spaces, nothing, just "". If I replace the HTML tags with
a space or other characters it works, but it won't work with an empty
string. (I also tried at mindprod.com, one of the first places for Java
info, but the site is down.)

Here's a snippet to explain what I'm doing:

//sDesc is the string with the text I'm working on
String sTag = "<.*?>";
Pattern pTag = Pattern.compile(sTag);
Matcher lineMatch = pTag.matcher(sDesc);
sDesc = lineMatch.replaceAll("");

If I use " " in that last line, it works fine, but whenever I use "", the
HTML tags are NOT replaced.

I know it deeply offends people if any code is posted that isn't ready to be
compiled and run as is, but I think this is more about how regexes work
than a specific piece of code. I've searched for "empty string" in
connection with regex replacement (and using different terms), but I
haven't found anything about this. In most cases, I find something talking
about accidentally matching empty strings. I would also think there's a
better term than empty string to apply to this. Is there?

Why is it that a replace with a space works but with an empty string it
doesn't?

Thanks!

Hal
 
Patricia Shanahan

Hal Vaughan wrote:
....
Here's a snippet to explain what I'm doing:

//sDesc is the string with the text I'm working on
String sTag = "<.*?>";
Pattern pTag = Pattern.compile(sTag);
Matcher lineMatch = pTag.matcher(sDesc);
sDesc = lineMatch.replaceAll("");

If I use " " in that last line, it works fine, but whenever I use "", the
HTML tags are NOT replaced.

I know it deeply offends people if any code is posted that isn't ready to be
compiled and run as is, but I think this is more about how regexes work
than a specific piece of code.
....

I'm afraid you do need to prepare a test case that is a complete
program. I took your code and attempted to reproduce the problem, but it
works perfectly. There is something else about your program that is not
inherent in the snippet that is causing the problem.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexReplaceTest {
    public static void main(String[] args) {
        String sDesc = "XXX<I'm a Tag>YYY";
        System.out.println(sDesc);
        String sTag = "<.*?>";
        Pattern pTag = Pattern.compile(sTag);
        Matcher lineMatch = pTag.matcher(sDesc);
        sDesc = lineMatch.replaceAll("");
        System.out.println(sDesc);
    }
}

output:

XXX<I'm a Tag>YYY
XXXYYY

Patricia
 
Hal Vaughan

Patricia said:
Hal Vaughan wrote:
...
...

I'm afraid you do need to prepare a test case that is a complete
program. I took your code and attempted to reproduce the problem, but it
works perfectly. There is something else about your program that is not
inherent in the snippet that is causing the problem.

Okay. No problem. I haven't touched regexes in Java until this past week
and, try as I might, the empty string just would not work. I've found that
there are a LOT of things in any language that are often taken as
understood by people working in it but can trip up someone who hasn't
worked with that feature before. I figured this was one of them. All the
matching I did and experimented with worked great, until I used empty
strings.

What I can't see is how other code would affect a regex, but I'll play
around and see what I get.

Thanks!

Hal
 
SadRed

Hal Vaughan wrote:
this is more about how regexes work than a specific piece of code

It is the other way around. That statement has a flavor of arrogance.
replaceAll() with an empty string works flawlessly. The fault is in your
code or input, not in the Java regex. Post an SSCCE with a small example
input. See: http://homepage1.nifty.com/algafield/sscce.html
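
As one hypothetical illustration of how the input, rather than the replacement string, can defeat this pattern: by default `.` does not match line terminators, so a tag that spans two lines is never matched by `<.*?>`, whatever the replacement is. This is only a sketch. The sample input and the names `TagStripDemo`, `DEFAULT_TAG`, `DOTALL_TAG` and `strip` are assumptions, and nothing in the thread confirms that this was the cause of Hal's problem.

```java
import java.util.regex.Pattern;

class TagStripDemo {
    // Default mode: '.' does not match line terminators.
    static final Pattern DEFAULT_TAG = Pattern.compile("<.*?>");
    // DOTALL mode: '.' matches line terminators as well.
    static final Pattern DOTALL_TAG = Pattern.compile("<.*?>", Pattern.DOTALL);

    // Replaces every match of the given pattern with the empty string.
    static String strip(Pattern tag, String text) {
        return tag.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String oneLine = "XXX<I'm a Tag>YYY";
        String twoLines = "XXX<a\nhref='x'>YYY"; // hypothetical tag split across lines

        System.out.println(strip(DEFAULT_TAG, oneLine));                   // XXXYYY
        System.out.println(strip(DEFAULT_TAG, twoLines).equals(twoLines)); // true: nothing matched
        System.out.println(strip(DOTALL_TAG, twoLines));                   // XXXYYY
    }
}
```

A small test input like `twoLines` is the kind of thing an SSCCE would expose quickly.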
 
Hal Vaughan

Patricia said:
Hal Vaughan wrote:
...
...

I'm afraid you do need to prepare a test case that is a complete
program. I took your code and attempted to reproduce the problem, but it
works perfectly. There is something else about your program that is not
inherent in the snippet that is causing the problem.

All I needed, and this was a BIG help, was to find out there was no issue
with using a null.

Believe it or not, I just added this line:

String newDesc = sDesc;

Then I used newDesc in every place where I used sDesc before.

It works now. No idea why that makes a difference and, honestly, I don't
have time to pursue it.

Thanks for the verification this isn't just some obscure point I had never
heard of.

Hal
 
Roedy Green

Hal Vaughan wrote:
I would also think there's a
better term than empty string to apply to this.

A String of 0 chars is called an "empty String", as distinct from
null. The difference is the source of all manner of bugs in
professionally written code. Programmers writing Javadoc tend to be
fuzzy about whether a method can accept/produce an empty/null String.

Since either flavour of String is often rare, the bug won't show up in
routine testing.
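
A minimal sketch of the distinction, assuming a hypothetical helper named `describe` (it is not from the thread or the JDK): `null` means there is no String object at all, while an empty String is a real object whose length is 0.

```java
class EmptyVersusNull {
    // Hypothetical helper: classifies a String reference three ways.
    static String describe(String s) {
        if (s == null) {
            return "null";   // no String object at all
        }
        if (s.isEmpty()) {
            return "empty";  // a real String object with length 0
        }
        return s.length() + " chars";
    }

    public static void main(String[] args) {
        System.out.println(describe(null));     // null
        System.out.println(describe(""));       // empty
        System.out.println(describe("XXXYYY")); // 6 chars
    }
}
```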

Eiffel has design by contract to formally describe assertions on
method inputs and outputs. See
http://mindprod.com/jgloss/designbycontract.html


I would like to have formal ways of describing String types with
whether they can be null or empty, with exceptions if they are when
they shouldn't be.
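
Java has no contract syntax of the Eiffel kind, but an explicit precondition check at the top of a method is one way to approximate it. The sketch below assumes a hypothetical helper `requireNonEmpty` (not part of the JDK) built on the standard `Objects.requireNonNull`. The `outcome` method exists only to show which exception each input produces.

```java
import java.util.Objects;

class StringChecks {
    // Hypothetical precondition check; not part of the JDK.
    // Throws NullPointerException for null, IllegalArgumentException for "".
    static String requireNonEmpty(String value, String name) {
        Objects.requireNonNull(value, name + " must not be null");
        if (value.isEmpty()) {
            throw new IllegalArgumentException(name + " must not be empty");
        }
        return value;
    }

    // Returns the simple name of the exception the check throws, or "ok".
    static String outcome(String value) {
        try {
            requireNonEmpty(value, "value");
            return "ok";
        } catch (NullPointerException | IllegalArgumentException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(outcome(null)); // NullPointerException
        System.out.println(outcome(""));   // IllegalArgumentException
        System.out.println(outcome("x"));  // ok
    }
}
```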
 
Hal Vaughan

Roedy said:
a String of 0 chars is called an "empty String", as distinct from
null. The difference is the source of all manner of bugs in
professionally written code. Programmers writing Javadoc tend to be
fuzzy about whether a method can accept/produce an empty/null String.

I knew null was the wrong term, since that's an entirely different thing (I
can do (if myString == null) but not (if myString == "")). I was not sure
if "empty string" was the actual technical term. I've heard people
say "null string" and I know what they mean, but I also know that's wrong.
I wasn't sure if there was a better term to Google than "empty string."
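
To illustrate the comparison point: `myString == ""` does compile, but it compares object references rather than contents, so it can be false for a String that really has zero chars. A minimal sketch, with the class and method names being assumptions for illustration only:

```java
class EmptyStringCheck {
    // Unreliable: true only if s is the very same object as the literal "".
    static boolean isEmptyBySameObject(String s) {
        return s == "";
    }

    // Reliable: checks the content, and guards against null first.
    static boolean isEmptyByContent(String s) {
        return s != null && s.isEmpty();
    }

    public static void main(String[] args) {
        String built = new String(""); // a distinct object holding zero chars

        System.out.println(isEmptyBySameObject(built)); // false
        System.out.println(isEmptyByContent(built));    // true
        System.out.println(isEmptyByContent(null));     // false
    }
}
```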

Roedy said:
Since either flavour of String is often rare, the bug won't show up in
routine testing.

Eiffel has design by contract to formally describe assertions on
method inputs and outputs. See
http://mindprod.com/jgloss/designbycontract.html

I don't know why, but for several hours yesterday, your site was down, or at
least inaccessible from my location. It was the first place I tried
looking for an answer. I figured if there were a quirk I needed to know
about with regexes and empty strings, I'd find it mentioned there.

Roedy said:
I would like to have formal ways of describing String types with
whether they can be null or empty, with exceptions if they are when
they shouldn't be.

I can certainly see the need for that!

Thanks! As always, you have some good and useful information!

Hal
 
Daniel Pitts

Hal said:
All I needed, and this was a BIG help, was to find out there was no issue
with using a null.

Believe it or not, I just added this line:

String newDesc = sDesc;

Then I used newDesc in every place where I used sDesc before.

It works now. No idea why that makes a difference and, honestly, I don't
have time to pursue it.

Thanks for the verification this isn't just some obscure point I had never
heard of.

Hal

I doubt that was all you needed to do, and I suspect that it didn't
actually fix your problem. My guess is that somewhere along the line
you fixed the underlying problem, and didn't know it.
 
Hal Vaughan

Daniel said:
Hal Vaughan wrote: ....
I doubt that was all you needed to do, and I suspect that it didn't
actually fix your problem. My guess is that somewhere along the line
you fixed the underlying problem, and didn't know it.

I wouldn't be surprised, but this is "time off" programming to mess with a
few things I've never tried before (regexes being just one of them). If it
were for work, I'd be busting my tail to go through every line and see what
fixed it, but since this project will be cut up into pieces for other ones,
I'll see what happens later.

Hal
 
James

[snip]

In my limited experience, when changing a variable name solves a problem,
the cause tends to be related to scope.
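
A hypothetical sketch of such a scope problem, not taken from Hal's program: a local variable with the same name as a field shadows it, so the result of the replacement lands in the local and the field keeps its tags. The class `ScopeDemo` and its methods are assumptions for illustration.

```java
class ScopeDemo {
    private String sDesc = "XXX<b>YYY";

    // Bug: the local sDesc shadows the field, so only the local is updated.
    void stripTagsShadowed() {
        String sDesc = this.sDesc;
        sDesc = sDesc.replaceAll("<.*?>", "");
    }

    // Correct: assigns the stripped text back to the field itself.
    void stripTags() {
        sDesc = sDesc.replaceAll("<.*?>", "");
    }

    String getDesc() {
        return sDesc;
    }

    // Runs one of the two variants on a fresh object and returns the field.
    static String run(boolean shadowed) {
        ScopeDemo demo = new ScopeDemo();
        if (shadowed) {
            demo.stripTagsShadowed();
        } else {
            demo.stripTags();
        }
        return demo.getDesc();
    }

    public static void main(String[] args) {
        System.out.println(run(true));  // XXX<b>YYY (field untouched)
        System.out.println(run(false)); // XXXYYY
    }
}
```

Renaming the local variable removes the shadowing, which would fit a fix that appears to come from a name change alone.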
 
