Need Help With Java Regex

DartmanX

Hi, I'm trying to write a regex to match text in the following format:

Bld Rm
(several lines)
Page ## of ##

However, the following code fragment doesn't work:

fullDocument = fullDocument.replaceAll("Bld (\\w)+Page [0-1000] of [0-1000]","");

(fullDocument is of type String)

I'm absolutely horrible with regex's, so any help would be appreciated.

Jason
 
hiwa

DartmanX said:
Hi, I'm trying to write a regex to match text in the following format:

Bld Rm
(several lines)
Page ## of ##

However, the following code fragment doesn't work:

fullDocument = fullDocument.replaceAll("Bld (\\w)+Page [0-1000] of [0-1000]","");

(fullDocument is of type String)

I'm absolutely horrible with regex's, so any help would be appreciated.

Jason

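public class DotAll {
    public static void main(String[] args) {
        // Sample document; the block from "Bld Rm" through the "Page" line should be removed.
        String fulldoc = "Never Let Me Go\n" +
                         "Small Crimes\n" +
                         "Bld Rm\n" +
                         "The Faceless System\n" +
                         "Market For The Death\n" +
                         "Page 201 of 1380\n" +
                         "Age of Abundance\n" +
                         "Medici Money\n";

        // (?s:...) switches on DOTALL inside the group, so '.' also matches '\n'.
        // \\d+ matches the page numbers, whatever their length.
        String regex = "(?s:Bld.+Page \\d+ of \\d+\\n)";

        // Prints the document without the Bld ... Page block.
        System.out.println(fulldoc.replaceAll(regex, ""));
    }
}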
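
One caveat with the pattern above: .+ is greedy, so if the document contains several Bld ... Page blocks, a single match runs from the first "Bld" to the last "Page" line and takes everything in between with it. Here is a minimal sketch of the difference, assuming a made-up two-block document; the class name, constants and strip() helper are only for illustration.

```java
class GreedyVsReluctant {
    // Same idea as the pattern above, once greedy (.+) and once reluctant (.+?).
    static final String GREEDY = "(?s)Bld.+Page \\d+ of \\d+\\n";
    static final String RELUCTANT = "(?s)Bld.+?Page \\d+ of \\d+\\n";

    // Made-up sample with two blocks that should be removed separately.
    static final String DOC = "Title A\n"
            + "Bld Rm\n" + "x\n" + "Page 1 of 2\n"
            + "Title B\n"
            + "Bld Rm\n" + "y\n" + "Page 2 of 2\n"
            + "Title C\n";

    // Hypothetical helper: removes every match of regex from doc.
    static String strip(String regex, String doc) {
        return doc.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        // Greedy: one match from the first "Bld" to the last "Page" line,
        // so "Title B" disappears as well.
        System.out.print(strip(GREEDY, DOC));
        System.out.println("---");
        // Reluctant: each block is matched on its own, "Title B" survives.
        System.out.print(strip(RELUCTANT, DOC));
    }
}
```

With only one block in the document, as in the sample above, both versions give the same result.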
 
Dale King

DartmanX said:
Hi, I'm trying to write a regex to match text in the following format:

Bld Rm
(several lines)
Page ## of ##

However, the following code fragment doesn't work:

fullDocument = fullDocument.replaceAll("Bld (\\w)+Page [0-1000] of [0-1000]","");

The [0-1000] in a regex does not mean any number between 0 and 1000. The
[] expression is a class of characters where you specify the characters
you want to search for and can use ranges. For example, [A-Za-z] will
specify any letter A-Z in either case.

In your case the expression [0-1000] is the range 0 to 1 followed by the
single characters 0, 0 and 0. So it matches exactly one character, and
only if that character is 0 or 1.

What you probably want is something like: Page \\d+ of \\d+

\d+ (written \\d+ inside a Java string literal) says to find one or more
digits 0-9
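
A quick way to check this yourself is a small sketch like the following. The class and the matchesWhole() helper are made up for the demonstration; Pattern.matches() is the standard java.util.regex call and tests whether the whole input matches.

```java
import java.util.regex.Pattern;

class CharClassDemo {
    // Hypothetical helper: true when the whole input is matched by the regex.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // [0-1000] is a character class: the range 0-1 plus the literal 0, three times.
        System.out.println(matchesWhole("[0-1000]", "1"));   // true
        System.out.println(matchesWhole("[0-1000]", "7"));   // false
        System.out.println(matchesWhole("[0-1000]", "201")); // false, a class matches one character

        // \d+ is written "\\d+" in a Java string literal.
        System.out.println(matchesWhole("Page \\d+ of \\d+", "Page 201 of 1380")); // true
    }
}
```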
 
Harald

DartmanX said:
Hi, I'm trying to write a regex to match text in the following format:

Bld Rm
(several lines)
Page ## of ##

However, the following code fragment doesn't work:

fullDocument = fullDocument.replaceAll("Bld (\\w)+Page [0-1000] of [0-1000]","");

Several things need to be changed:
1) With "[0-1000]" you seem to intend to match integers in the given
range, but this is not what "[]" does. Rather, "[]" describes a set of
characters, one of which is matched. What you want instead is "[0-9]+",
which matches a sequence of digits.

2) (\\w)+ matches a single word, but you want a repetition of
(whitespace, word) pairs. This should be covered by "(\\s+\\w+)+". In
particular, \s also matches line terminators, so the pattern can run
across lines.

3) This does not yet account for the last whitespace in front of "Page",
so you add another \\s+: "(\\s+\\w+)+\\s+Page"

4) This still does not do what you want, because the quantifier is greedy
and the longest match is always sought. If you have many records of the
described structure in your text, the match will run from the start of
the first to the end of the last of those. You may have luck with
non-greedy matching:
"(\\s+\\w+)+?\\s+Page"

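And a note on flags: "MULTILINE" only changes where ^ and $ match, so
you do not need it when calling compile() here; \s matches line
terminators anyway. "DOTALL" is the flag that matters, and only if you
use "." to get across line ends.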
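
Putting steps 1) to 4) together, the whole thing might look like the sketch below. It is only an illustration: the class name, the findRecords() helper and the sample text are made up, and it assumes the lines between "Bld" and "Page" contain nothing but word characters and whitespace (punctuation would stop \w+ from matching). The pattern is compiled without any flags, since \s matches line terminators on its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RecordMatcher {
    // "Bld", then as few (whitespace, word) pairs as possible, then the page line.
    static final Pattern RECORD =
            Pattern.compile("Bld(\\s+\\w+)+?\\s+Page [0-9]+ of [0-9]+");

    // Made-up sample text containing two records.
    static final String TEXT = "Intro\n"
            + "Bld Rm\nalpha beta\nPage 1 of 2\n"
            + "Middle\n"
            + "Bld Rm\ngamma\nPage 2 of 2\n"
            + "Outro\n";

    // Hypothetical helper: collects every match found in the text.
    static List<String> findRecords(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = RECORD.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // The reluctant +? keeps each record separate, so two matches are found.
        System.out.println(findRecords(TEXT).size()); // 2
    }
}
```

Note that this pattern does not consume the line break after the page line, so a replaceAll() with it leaves an empty line behind for each record.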

And don't believe a single word I wrote because normally I prefer my
own regex-package (see sig:)

Harald.
 
