Java Regex problem - Please Help

Daku

Could some Java guru please help? I am using java.util.regex, but I am
having serious problems parsing strings.
The regex string is:
<div\\b[^>]+?id=\""+keyValue+"\\b*\">(?:[^<]++|<(?!/?div>))*+</div>
And I am trying to match strings as:
<div class="taggedInput" id="Tools"
name="toolsets">&nbsp;&nbsp;wrench,hammer,screwdriver</div>
The keyValue is replaced with Tools.
I always get an error saying there is no match, although this
regex has been known to match strings like this before.
I tried it online at:
http://www.fileformat.info/tool/regex.htm,
but am confused by what I need to put in the text box labelled
'Replacement'.

Any hints, suggestions would be invaluable. Thanks in advance for your
help.
 
markspace

Daku said:
Could some Java guru please help ? I am using java.util.regex, but
having extreme problems with parsing strings.
The regex string is:
<div\\b[^>]+?id=\""+keyValue+"\\b*\">(?:[^<]++|<(?!/?div>))*+</div>
And I am trying to match strings as:
<div class="taggedInput" id="Tools"
name="toolsets">&nbsp;&nbsp;wrench,hammer,screwdriver</div>


It breaks down for me after the id="Tools" part. I can't see what you
are trying to do with \b* (lots of word boundaries?) and "> won't match
anything if there's a space in between the " and the >.

Compile your regex with the Pattern.COMMENTS option, and use this:


#opening <div
<div
# anything, reluctant
.*?
# word boundary
\b
# find "id="
id="Tools"
# match anything, reluctant, until the first >
.*?>
# capture, anything, reluctant
(.*?)
# until the </div>
</div>

I don't know if this is perfect, but it worked for me. It makes heavy
use of the reluctant modifier to parse through the <div> without
skipping parts it's looking for.

You'll have to write each \ escape as \\ in a Java String literal, and
you'll have to add your keyValue back into the string. You may also need
to turn on Pattern.DOTALL if you have a <div> that spans lines, like the
one you show.
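
For reference, here is a minimal sketch of how the commented pattern above might look as Java source, with the escapes doubled and both flags set. The class name DivMatcher and the method extractDivBody are made up for illustration, and Pattern.quote is my addition so that a keyValue containing regex metacharacters is taken literally; it assumes keyValue is a simple id such as Tools.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DivMatcher {

    // Returns the text between the matching <div ...> and the first </div>,
    // or null when no div with the given id is found.
    public static String extractDivBody(String html, String keyValue) {
        // With Pattern.COMMENTS, whitespace in the pattern is ignored and
        // everything from # to the end of the line is a comment.
        String regex =
              "<div       # opening <div\n"
            + ".*?        # anything, reluctant\n"
            + "\\b        # word boundary\n"
            + "id=\"" + Pattern.quote(keyValue) + "\"   # the id attribute\n"
            + ".*?>       # anything, reluctant, until the first >\n"
            + "(.*?)      # capture anything, reluctant\n"
            + "</div>     # until the closing </div>\n";

        // DOTALL lets . match line terminators, so a tag may span lines.
        Pattern pattern = Pattern.compile(regex, Pattern.COMMENTS | Pattern.DOTALL);
        Matcher matcher = pattern.matcher(html);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<div class=\"taggedInput\" id=\"Tools\"\n"
                    + "name=\"toolsets\">&nbsp;&nbsp;wrench,hammer,screwdriver</div>";
        System.out.println(extractDivBody(html, "Tools"));
    }
}
```

Because the capture is reluctant, it stops at the first </div>, so a <div> nested inside the target would cut the match short.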
 
markspace

Roedy said:
<div\\b[^>]+?id=\""+keyValue+"\\b*\">(?:[^<]++|<(?!/?div>))*+</div>
And I am trying to match strings as:
<div class="taggedInput" id="Tools"

I think you are using the wrong tool. See
http://mindprod.com/jgloss/xml.html


This is good advice, generally. The regex could get confused by CDATA
sections, embedded JavaScript, comments, etc.

OTOH, HTML is not XML. The OP will need an HTML parser, since most HTML
won't validate as XML.
 
Daniel Pitts

Daku said:
Could some Java guru please help ? I am using java.util.regex, but
having extreme problems with parsing strings.
The regex string is:
<div\\b[^>]+?id=\""+keyValue+"\\b*\">(?:[^<]++|<(?!/?div>))*+</div>
And I am trying to match strings as:
<div class="taggedInput" id="Tools"
name="toolsets">&nbsp;&nbsp;wrench,hammer,screwdriver</div>
The keyValue is replaced with Tools.
I am always getting an error that there is no match, although this
regex has been know to match strings like this before.
I tried online at :
http://www.fileformat.info/tool/regex.htm,
but am confused by what I need to put in the text box labelled
'Replacement'.

Any hints, suggestions would be invaluable. Thanks in advance for your
help.

Don't use regex to parse HTML:

<http://www.codinghorror.com/blog/archives/001311.html>

Look at the HTMLParser library; it will probably serve your needs, and
do so more cleanly and more quickly.
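
To give a rough idea of the parser approach, here is a sketch that uses the HTML parser bundled with the JDK (javax.swing.text.html) as a stand-in, not the HTMLParser library itself, whose API is not shown here. The class name DivTextExtractor and the method textOfDiv are hypothetical. It collects the text inside the div with the given id and counts nested divs so it knows when the target div ends.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class DivTextExtractor {

    // Returns the text found inside every <div> whose id equals the given
    // value, or an empty string if there is no such div.
    public static String textOfDiv(String html, final String id) throws IOException {
        final StringBuilder text = new StringBuilder();
        final int[] depth = {0}; // 0 means we are outside the target div

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag != HTML.Tag.DIV) {
                    return;
                }
                if (depth[0] > 0) {
                    depth[0]++;              // a div nested inside the target
                } else if (id.equals(attrs.getAttribute(HTML.Attribute.ID))) {
                    depth[0] = 1;            // entering the target div
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.DIV && depth[0] > 0) {
                    depth[0]--;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (depth[0] > 0) {
                    text.append(data);
                }
            }
        };

        // The third argument tells the parser to ignore charset declarations.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return text.toString();
    }
}
```

Calling DivTextExtractor.textOfDiv(html, "Tools") on the sample markup should return the tool list. The parser decodes entities, so the &nbsp; entities should come back as non-breaking space characters rather than as literal text.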
 
