Splitting a String with a Regex

stevengarcia

I have multiple root XML documents in a String that looks like

"<?xml...><response .../><?xml...><response .../><?xml...><response
..../>"

There are three valid XML documents above. Unfortunately, I have all of
them in one String, so (as far as I can tell) XML parsing with dom4j
will not give me three Document objects.

I am trying to write a method that will split the above String into
three separate strings that are all valid XML and can be parsed by an
XML parser. First I tried String.split(), but there is no good
delimiter. Then I tried writing a regular expression, and I think a
regex will work here, but I'm not proficient at this advanced topic.

The other complication is that the real XML has carriage returns and other
random characters between each XML document. The XML within each document
is assured to be valid, however.

Is a regex a good way to do this? Your help would be appreciated.
 
stevengarcia

I guess I could use a StringTokenizer, and the token would be "<?xml",
and also tell the StringTokenizer to return the delimiter along with
each token.

That should work.
 
Frank Seidinger

stevengarcia said:
I have multiple root XML documents in a String that looks like

"<?xml...><response .../><?xml...><response .../><?xml...><response
.../>"

There are three valid XML documents above, unfortunately I have all of
them in one String so (as far as I can tell) XML parsing with dom4j
will not give me three Document objects.

Did you try to parse the first document with either a DOM or a SAX parser?
All XML parsers read from input streams and don't care whether the
document is split over several lines or comes in one line.

Therefore creating an input stream from a string using
StringBufferInputStream and feeding this stream to a parser should consume
as many characters as needed to parse the first valid XML document.

Using the same input stream again for the parser should get you the next
document. You can repeat this until your string is completely consumed.

stevengarcia said:
I am trying to write a method that will split the above String into
three separate strings that are all valid XML, and can be parsed by an
XML parser. First I tried String.split()...but there is no good
delimiter. Then I tried writing a regular expression, and I think
regex's will work here, but I'm not proficient at this advanced topic.

For that, you can simply use the indexOf(String str) method of the String
class itself. With indexOf("<?xml"), for example, you can find the index
where your first document starts.

With indexOf("<?xml", firstIndex + 1) you will find the start of the second
document (searching from firstIndex itself would just return firstIndex
again). The text between firstIndex and secondIndex is the content of
your first document.
 
stevengarcia

I guess I could use a StringTokenizer, and the token would be "<?xml",
and also tell the StringTokenizer to return the delimiter along with
each token.

That should work.

Nope, it doesn't. StringTokenizer treats each character in the
delimiter string as a separate delimiter. I want "<?xml" to be one
delimiter as a whole, not five individual characters.

Maybe it's back to regex.
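
For anyone following along, here is a minimal sketch of the behaviour described above. The class name TokenizerDemo and the sample input are made up for illustration; only java.util.StringTokenizer itself is standard. With returnDelims set to true, every delimiter character comes back as its own one-character token, so "<?xml" never appears as a single token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo class, not part of the original posts.
class TokenizerDemo {
    // Collects every token, including the returned delimiter characters.
    static List<String> tokens(String input) {
        // "<?xml" is read as five separate delimiters: '<', '?', 'x', 'm', 'l'.
        StringTokenizer st = new StringTokenizer(input, "<?xml", true);
        List<String> out = new ArrayList<>();
        while (st.hasMoreTokens()) {
            out.add(st.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample input; prints the individual tokens.
        System.out.println(tokens("<?xml?><a/>"));
    }
}
```

Note that element names containing 'x', 'm' or 'l' would be chopped up as well, which is why this approach does not fit the problem.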
 
Danno

Try:

String s = "<?xml...><response ..../><?xml...><response.../><?xml...><response.../>";
String[] tokens = s.split("<\\?xml[.]*>");
for (String token : tokens) {
    System.out.println(token);
}

Just a guess, I haven't tried it, so there may be errors.
 
stevengarcia

Frank said:
Did you try to parse the first document with either a dom or a sax parser?
All xml parsers are reading from input streams and don't care if the
document is split over several lines or just comes in one line.

Therefore creating an input stream from a string using
StringBufferInputStream and feeding this stream to a parser should consume
as many characters as needed to parse the first valid xml document.

Using the same input stream again for the parser should get you the next
document. You can repeat this, until your string is completely consumed.

I got excited by this idea, so I tried it. It didn't work; I got
the following exception:

The processing instruction target matching "[xX][mM][lL]" is not
allowed.

I think that means you can't have more than one <?xml declaration in a
document.

Good suggestion though.
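
A small sketch that reproduces this outcome, with some assumptions: it uses the JDK's built-in JAXP DOM parser rather than dom4j, reads from a StringReader instead of StringBufferInputStream, and the class name ConcatenatedParseDemo and the sample XML are made up. The point is only that a parser treats the whole input as one document, so a second XML declaration (or a second root element) makes it not well-formed.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class; JAXP is swapped in for dom4j here.
class ConcatenatedParseDemo {
    // Returns true if the parser accepts the text as a single document.
    static boolean parses(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Not well-formed, e.g. a second "<?xml" after the first root element.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String one = "<?xml version=\"1.0\"?><response/>";
        System.out.println(parses(one));        // a single document
        System.out.println(parses(one + one));  // two documents back to back
    }
}
```

The default error handler may also print a "[Fatal Error]" line to standard error for the rejected input.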
 
Smilodon

Would you please try this one?

public class MultiXMLSplit {
    private static final String xmlStr =
        "<?xml><root>hello1</root><?xml><root>hello2</root><?xml><root>hello3</root>";

    public static void main(String[] args) {
        int index1 = xmlStr.indexOf("<?xml");
        int index2;
        while (index1 != -1 && index1 < xmlStr.length() - 1) {
            index2 = xmlStr.indexOf("<?xml", index1 + 1);
            if (index2 != -1 && index2 < xmlStr.length()) {
                System.out.println(xmlStr.substring(index1, index2));
            } else break;
            index1 = index2;
        }
        // Deal with the last xml doc
        if (index1 != -1 && index1 < xmlStr.length() - 1)
            System.out.println(xmlStr.substring(index1));
    }
}

You may need to add code to trim the whitespace at the start of each
XML document. As far as I know, if the document text starts with
whitespace before the <?xml declaration, the XML parser will not parse it
correctly. You will get an error message like this:

The processing instruction target matching "[xX][mM][lL]" is not
allowed.
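
One possible way to fold the trimming into the loop above is sketched below. The class name XmlSplitter and the method split are hypothetical, and the sketch assumes that "<?xml" occurs only at the start of each document and that only whitespace (not arbitrary junk) sits between documents, since String.trim() removes nothing else.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, a variation on the indexOf loop above.
class XmlSplitter {
    // Cuts the text at each "<?xml" and trims whitespace around every piece.
    // Anything before the first "<?xml" is dropped.
    static List<String> split(String text) {
        List<String> docs = new ArrayList<>();
        int start = text.indexOf("<?xml");
        while (start != -1) {
            // Search from start + 1 so the current declaration is not found again.
            int next = text.indexOf("<?xml", start + 1);
            String doc = (next == -1)
                    ? text.substring(start)
                    : text.substring(start, next);
            docs.add(doc.trim());
            start = next;
        }
        return docs;
    }

    public static void main(String[] args) {
        // Made-up input with line breaks between the documents.
        for (String doc : split("<?xml?><a/>\r\n <?xml?><b/>\n")) {
            System.out.println("[" + doc + "]");
        }
    }
}
```

Each returned string can then be handed to the parser separately. Non-whitespace characters after a root element would still end up inside the piece and break parsing.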
 
Oliver Wong

Danno said:
Try:

String s = "<?xml...><response
.../><?xml...><response.../><?xml...><response.../>";
String[] tokens = s.split("<\\?xml[.]*>");
for (String token : tokens) {
System.out.println(token);
}

Just a guess, I haven't tried it, so there may be errors.

Probably won't work. XML is a context-free language, not a regular
language.

- Oliver
 
Jussi Piitulainen

Oliver said:
Danno said:
Try:

String s = "<?xml...><response
.../><?xml...><response.../><?xml...><response.../>";
String[] tokens = s.split("<\\?xml[.]*>");
for (String token : tokens) {
System.out.println(token);
}

Just a guess, I haven't tried it, so there may be errors.

Probably won't work. XML is a context-free language, not a
regular language.

It might well work (maybe better with "<[?]xml.*?>" or so) for a
particular kind of input sequence where any <?xml...?> thing only
appears in the beginning of each individual part and nowhere else,
and the ... in any of them doesn't contain >.

Just looping to find each string "<?xml" would then also work.
 
Oliver Wong

Jussi Piitulainen said:
Oliver said:
Danno said:
Try:

String s = "<?xml...><response
.../><?xml...><response.../><?xml...><response.../>";
String[] tokens = s.split("<\\?xml[.]*>");
for (String token : tokens) {
System.out.println(token);
}

Just a guess, I haven't tried it, so there may be errors.

Probably won't work. XML is a context-free language, not a
regular language.

It might well work (maybe better with "<[?]xml.*?>" or so) for a
particular kind of input sequence where any <?xml...?> thing only
appears in the beginning of each individual part and nowhere else,
and the ... in any of them doesn't contain >.

Just looping to find each string "<?xml" would then also work.

Oops, I had thought that the regular expression Danno wrote was meant to
match the content of the strings themselves, rather than the delimiters. So
actually, Danno's code will probably work, as long as the "[.]*" part isn't
greedy, along with the other qualifications you gave.

- Oliver
 
Jussi Piitulainen

Oliver said:
Jussi said:
Oliver said:
Danno wrote: ....
String s = "<?xml...><response
.../><?xml...><response.../><?xml...><response.../>";
String[] tokens = s.split("<\\?xml[.]*>"); ....
Probably won't work. XML is a context-free language, not a
regular language.

It might well work (maybe better with "<[?]xml.*?>" or so) for a
particular kind of input sequence where any <?xml...?> thing only
appears in the beginning of each individual part and nowhere else,
and the ... in any of them doesn't contain >.

Just looping to find each string "<?xml" would then also work.

Oops, I had thought that the regular expression Danno wrote was
to get the content of the strings themselves, rather than the
delimiters. So actually, Danno's code may probably work, as long as
the "[.]*" part isn't greedy, along with the other qualifications
you gave.

Yes, the pattern in .split() is just the delimiter.

Greed is one fault. Character class brackets are another: the pattern
"[.]*" matches any number of dots only, while ".*" matches any number
of almost any characters. Both faults are easily fixed.
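
A quick sketch of how the three delimiter variants behave with String.split, using a made-up sample input and a hypothetical class name DelimiterPatternDemo. The literal-dot class never matches the declaration, the greedy form swallows everything up to the last ">", and the reluctant form matches each declaration separately.

```java
import java.util.Arrays;

// Hypothetical demo class with a made-up two-document input.
class DelimiterPatternDemo {
    static final String INPUT = "<?xml 1?><r 1/><?xml 2?><r 2/>";

    // Splits the sample input using the given delimiter regex.
    static String[] splitWith(String regex) {
        return INPUT.split(regex);
    }

    public static void main(String[] args) {
        // "[.]*" matches literal dots only, so the delimiter is never found.
        System.out.println(Arrays.toString(splitWith("<\\?xml[.]*>")));
        // Greedy ".*" runs to the last '>', so the whole input is one delimiter.
        System.out.println(Arrays.toString(splitWith("<\\?xml.*>")));
        // Reluctant ".*?" stops at the first '>', matching each declaration.
        System.out.println(Arrays.toString(splitWith("<\\?xml.*?>")));
    }
}
```

With the reluctant pattern the result starts with an empty string, because the first delimiter sits at the very beginning of the input.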

The method does not return the actual delimiters, so the text that was
matched by the delimiter pattern (the <?xml...?> declarations matched via
".*?") would be lost. If all the other conditions are right,
then "(<[?]xml.*?)((?=<[?]xml)|\\z)" should match exactly the wanted
parts of the document: from "<?xml" up to another "<?xml" or the end
of all input. Let me see. I have shortened the tags a bit to keep the line
lengths under control:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Split {
    public static void main(String[] args) {
        // Group 1 is one document; group 2 is the empty lookahead or end of input.
        Matcher m = Pattern
            .compile("(<[?]x.*?)((?=<[?]x)|\\z)")
            .matcher("<?x 1?><r 1/><?x 2?><r 2/><?x 3?><r 3/>");
        while (m.find()) {
            System.out.println("(" + m.group(1) + ")(" + m.group(2) + ")");
        }
    }
}

Ok, it appears to work - if all the conditions about the input are
true.
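
One of those conditions matters for the original question: by default "." does not match line terminators, so with line breaks inside or between the documents the pattern above would skip any part that contains one. A sketch of a variation that compiles the same idea with Pattern.DOTALL and trims each match follows. The class name DotallSplit, the method split and the sample input are made up, and the other assumptions (no stray "<?xml" inside a document, only whitespace between documents) still apply.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical variation on the pattern above, allowing line breaks.
class DotallSplit {
    // DOTALL lets '.' match line terminators, so a document may span lines.
    private static final Pattern DOC =
            Pattern.compile("<[?]xml.*?(?=<[?]xml|\\z)", Pattern.DOTALL);

    // Returns each "<?xml ..." chunk, with surrounding whitespace trimmed.
    static List<String> split(String text) {
        List<String> docs = new ArrayList<>();
        Matcher m = DOC.matcher(text);
        while (m.find()) {
            docs.add(m.group().trim());
        }
        return docs;
    }

    public static void main(String[] args) {
        // Made-up input with line breaks inside and between documents.
        String text = "<?xml 1?>\n<r 1/>\r\n<?xml 2?><r 2/>\n";
        for (String doc : split(text)) {
            System.out.println("[" + doc + "]");
        }
    }
}
```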
 
