Java Pattern Matching / Reg Ex Question

Domenick

I would like to parse an HTML file and retrieve certain pieces of
information from the page. I have already read the file into a string,
and I have been using a very inefficient way to get the data I am
looking for. From what I remember in school, regular expressions
should do the trick, but I am not very familiar with them in Java and I
am not very familiar with Patterns in Java either. Given the string
below:

<html>
<head>
<title>The Title</title>
</head>
<body>

<span id="user1Control" class="user">Player 1 Name</span>
<span id="user2Control" class="user">Player 2 Name</span>
..
..
..
<span id="usernControl" class="user">Player n Name</span>

</body>
</html>

What would be the best way to get an array of Strings containing
Player 1 Name
Player 2 Name
Player 3 Name?

Also, in the HTML file I would like to parse, the actual "Player 1
Name" would be John Smith, and "Player 2 Name" would be Bill Jones,
etc....

Finally, the above HTML Example is a trimmed down version of the HTML
file that I will be parsing, and I would like to avoid calling "split"
on the string to try to retrieve the information that I am looking for.
I would like to learn how to do this using regular expressions and
patterns.

Any help or direction would be appreciated. Thanks in advance.
- Domenick
 
klynn47

The documentation for java.util.regex.Pattern has a lot of information.
I'm not sure why you're uncomfortable with split. Is it for
performance? Many times when I parse these files, I strip out most of
the HTML, replace whatever lies between the pieces I want with a
character such as ! (or any other character that doesn't appear in the
String), and then split on that character.
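
A minimal sketch of that replace-then-split idea, assuming every tag can simply be swapped for the marker and that "!" never occurs in the page text. The class and method names (SplitOnMarker, textPieces) are made up for illustration and are not from the post above.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitOnMarker {
    // Replaces every tag with '!' and splits on it, keeping the non-blank pieces.
    // Assumes '!' does not appear anywhere in the page text.
    static List<String> textPieces(String html) {
        String marked = html.replaceAll("<[^>]*>", "!");
        List<String> pieces = new ArrayList<String>();
        for (String piece : marked.split("!")) {
            String trimmed = piece.trim();
            if (!trimmed.isEmpty()) {
                pieces.add(trimmed);
            }
        }
        return pieces;
    }

    public static void main(String[] args) {
        String html = "<body>\n"
                + "<span id=\"user1Control\" class=\"user\">John Smith</span>\n"
                + "<span id=\"user2Control\" class=\"user\">Bill Jones</span>\n"
                + "</body>";
        System.out.println(textPieces(html)); // prints [John Smith, Bill Jones]
    }
}
```

Note that this keeps every piece of text on the page (the title included, if the whole document is passed in), so the caller still has to pick out the player names.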
 
Domenick

I'll take another look at Pattern and see what I can learn from it. I
am actually using split right now to get the information that I want,
but the code readability is an absolute mess. Also, if the layout of
the HTML gets changed, I'd have to go through and redo the parsing
again. Another problem with the split is that it takes a regex as a
parameter, and I'm not sure what characters I need to escape to get the
split to work as I want it to. I just thought that using a regex or
pattern would be much cleaner. Like some way to say:
1. Get the end index of the first occurrence <span id="user*Control"
class="user"> (where * is any number of any characters)
2. From that index, get the beginning index of </span>
3. From index #1 to index #2 is the string I want.
4. Get the end index of the second occurrence of <span
id="user*Control" class="user">
5. Repeat #2
6. Repeat #3
Something like that.
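
Those steps map fairly directly onto Matcher.find(), Matcher.end() and String.indexOf(). Here is a rough sketch, assuming the opening tag always has exactly the id and class attributes shown, in that order; the class name SpanSteps and the method extract are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SpanSteps {
    // Opening tag; [^"]* plays the role of the "*" wildcard in the steps.
    private static final Pattern OPEN_TAG =
            Pattern.compile("<span id=\"user[^\"]*Control\"\\s+class=\"user\">");

    static List<String> extract(String html) {
        List<String> names = new ArrayList<String>();
        Matcher m = OPEN_TAG.matcher(html);
        while (m.find()) {                              // steps 1 and 4: next opening tag
            int start = m.end();                        // end index of the opening tag
            int end = html.indexOf("</span>", start);   // step 2: start of the closing tag
            if (end < 0) {
                break;                                  // unclosed span: stop looking
            }
            names.add(html.substring(start, end));      // step 3: the text in between
        }
        return names;
    }

    public static void main(String[] args) {
        String html = "<span id=\"user1Control\" class=\"user\">John Smith</span>\n"
                + "<span id=\"user2Control\" class=\"user\">Bill Jones</span>";
        System.out.println(extract(html)); // prints [John Smith, Bill Jones]
    }
}
```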

I could still use some more help, and thanks in advance again.
- Domenick
 
Sébastien Auvray

Domenick wrote:
I would like to parse an HTML file and retrieve certain pieces of
information from the page.
Any help or direction would be appreciated. Thanks in advance.
- Domenick

Take a look at SAX. It includes a parser that reads a source
(an io.InputStream or io.Reader, maybe a StringReader in your case)
and calls your handler's methods: startElement (tag start),
endElement (tag end) and characters (plain text between tags).
Just raise a flag when startElement is called with the tag you're
interested in (qName.equals("span") && atts.getIndex("class") != -1
&& atts.getValue("class").equals("user")), then store the text passed
to characters while the flag is set. Don't forget to reset the flag to
false in endElement.
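
As a rough sketch of the handler described above, using the JDK's built-in SAX parser. One assumption to flag loudly: the input must be well-formed XML (XHTML-style markup like the sample), because a SAX XML parser rejects anything else. The class name UserSpanHandler and the helper parse are made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class UserSpanHandler extends DefaultHandler {
    private final List<String> names = new ArrayList<String>();
    private final StringBuilder text = new StringBuilder();
    private boolean inUserSpan = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // Raise the flag on <span class="user" ...>; getValue returns null if there is no class.
        if ("span".equals(qName) && "user".equals(atts.getValue("class"))) {
            inUserSpan = true;
            text.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times for one text run, so accumulate.
        if (inUserSpan) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Lower the flag on </span>. A span nested inside a user span would end it early.
        if (inUserSpan && "span".equals(qName)) {
            names.add(text.toString());
            inUserSpan = false;
        }
    }

    // Assumes well-formed input; anything else is reported as an IllegalArgumentException.
    static List<String> parse(String xhtml) {
        UserSpanHandler handler = new UserSpanHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse input", e);
        }
        return handler.names;
    }

    public static void main(String[] args) {
        String page = "<html><head><title>The Title</title></head><body>"
                + "<span id=\"user1Control\" class=\"user\">John Smith</span>"
                + "<span id=\"user2Control\" class=\"user\">Bill Jones</span>"
                + "</body></html>";
        System.out.println(parse(page)); // prints [John Smith, Bill Jones]
    }
}
```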

Sorry for my English; I hope it helps.

Sebastien.
 
Alan Moore

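import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Group 1 captures the text between the opening tag and the next '<'.
Pattern p = Pattern.compile(
    "<span id=\"user\\d+Control\"[^>]*+>([^<]*)");

// htmlString is the page contents you have already read into a String.
Matcher m = p.matcher(htmlString);
List<String> playerNames = new ArrayList<String>();
while (m.find())
{
    playerNames.add(m.group(1));
}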

That regex assumes that the "id" attribute will always be the first
one listed, and that there will be precisely one space between it and
the element name. If you can't count on those things being true, you
can use the more general (but less readable) regex:

"<span[^>]+?id=\"user\\d+Control\"[^>]*+>([^<]*)"
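
To make the difference between the two regexes concrete, here is a small sketch that runs both against markup where one span lists "class" before "id". The class name GeneralSpanRegex, the constants and the helper names are invented for the example; the two patterns are the ones given above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class GeneralSpanRegex {
    // Requires "id" to be the first attribute, one space after "span".
    static final Pattern STRICT =
            Pattern.compile("<span id=\"user\\d+Control\"[^>]*+>([^<]*)");

    // Allows other attributes before "id".
    static final Pattern GENERAL =
            Pattern.compile("<span[^>]+?id=\"user\\d+Control\"[^>]*+>([^<]*)");

    // Collects group 1 of every match of p in html.
    static List<String> names(Pattern p, String html) {
        List<String> result = new ArrayList<String>();
        Matcher m = p.matcher(html);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<span id=\"user1Control\" class=\"user\">John Smith</span>\n"
                + "<span class=\"user\" id=\"user2Control\">Bill Jones</span>";
        System.out.println(names(STRICT, html));  // prints [John Smith]
        System.out.println(names(GENERAL, html)); // prints [John Smith, Bill Jones]
    }
}
```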


There's a pretty good regex tutorial at
http://www.regular-expressions.info/


BTW, when you're using methods like split() that take regex arguments,
but you don't want to mess with regexes, just escape all punctuation
characters with backslashes. If a character doesn't have any special
meaning, it won't hurt anything.
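
A short sketch of that escaping advice, alongside Pattern.quote(), which is available from Java 5 onward and does the escaping for you. The class and method names (SplitEscaping, splitLiteral) are made up for illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

final class SplitEscaping {
    // Splits on the delimiter taken literally, whatever characters it contains.
    static String[] splitLiteral(String input, String delimiter) {
        return input.split(Pattern.quote(delimiter));
    }

    public static void main(String[] args) {
        // '|' and '.' are regex metacharacters, so they need a backslash.
        System.out.println(Arrays.toString("a|b|c".split("\\|")));  // prints [a, b, c]
        System.out.println(Arrays.toString("1.2.3".split("\\."))); // prints [1, 2, 3]

        // '!' has no special meaning, but escaping it is harmless.
        System.out.println(Arrays.toString("x!y".split("\\!")));   // prints [x, y]

        // Pattern.quote avoids hand-escaping altogether.
        System.out.println(Arrays.toString(splitLiteral("a|b|c", "|"))); // prints [a, b, c]
    }
}
```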
 
Virgil Green

Alan said:
"<span[^>]+?id=\"user\\d+Control\"[^>]*+>([^<]*)"

What about "<span id="user.Control" class="user">([^>]*)</span>" as the
regex? I realize this depends a lot upon the consistent construction of the
span tag, but other than that, what do you think?

- Virgil
 
Alan Moore

"<span[^>]+?id=\"user\\d+Control\"[^>]*+>([^<]*)"

What about "<span id="user.Control" class="user">([^>]*)</span>" as the
regex? I realize this depends a lot upon the consistent construction of the
span tag, but other than that, what do you think?

Well, you would have to escape all the quotes in the regex, as I did.
Also, you're assuming there will only be one digit in the "id"
attribute, but I understood there could be more than one. Otherwise,
and with the caveat you mentioned, that regex looks good to me.
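
For reference, a sketch of that regex with both corrections applied (quotes escaped for a Java string literal, \d+ in place of the single "."), next to the single-character version for comparison. The class name FixedSpanRegex and its members are invented for the example, and it still assumes the span tag is always written exactly this way.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FixedSpanRegex {
    // As proposed, with the quotes escaped: '.' allows exactly one character in the id.
    static final Pattern SINGLE_CHAR = Pattern.compile(
            "<span id=\"user.Control\" class=\"user\">([^>]*)</span>");

    // With \d+ the id may contain one or more digits.
    static final Pattern MULTI_DIGIT = Pattern.compile(
            "<span id=\"user\\d+Control\" class=\"user\">([^>]*)</span>");

    // Collects group 1 of every match of p in html.
    static List<String> names(Pattern p, String html) {
        List<String> result = new ArrayList<String>();
        Matcher m = p.matcher(html);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<span id=\"user1Control\" class=\"user\">John Smith</span>\n"
                + "<span id=\"user12Control\" class=\"user\">Bill Jones</span>";
        System.out.println(names(SINGLE_CHAR, html)); // prints [John Smith]
        System.out.println(names(MULTI_DIGIT, html)); // prints [John Smith, Bill Jones]
    }
}
```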
 
