Regex challenge

Roedy Green

I have been beating my head against a wall trying to get a regex to
work that spans several lines. The sales tax data for the cities in
Avoyelles Parish, Louisiana, looks like this:

<p class=MsoNormal align=center style='text-align:center'>Bunkie</p>
</td>
<td style='padding:.75pt .75pt .75pt .75pt'>
<p class=MsoNormal align=center style='text-align:center'>9 %</p>
</td>
<td style='padding:.75pt .75pt .75pt .75pt'>
<p class=MsoNormal align=center style='text-align:center'>4%</p>
</td>
<td style='padding:.75pt .75pt .75pt .75pt'>
<p class=MsoNormal align=center style='text-align:center'>5%</p>
</td>
</tr>
<tr style='mso-yfti-irow:2'>
<td style='padding:.75pt .75pt .75pt .75pt'>
<p class=MsoNormal align=center
style='text-align:center'>Hessmer</p>
</td>
<td style='padding:.75pt .75pt .75pt .75pt'>
<p class=MsoNormal align=center style='text-align:center'>8%</p>
</td>
<td style='padding:.75pt .75pt .75pt .75pt'>
<p class=MsoNormal align=center style='text-align:center'>4%</p>
</td>
<td style='padding:.75pt .75pt .75pt .75pt'>
<p class=MsoNormal align=center style='text-align:center'>4%</p>
</td>

I want to find a regex that will match a pattern for each row so I can
extract the fields
Bunkie,9
and
Hessmer,8

Here is one of my many failed attempts:

(?m)center'>([a-zA-Z ]+)</p>$.+$.+$.+?center'>([0-9\.]+)([ %]+)</p>

This is the string passed to Pattern.compile after fudging to get it
past the command interpreter.

Ideally I don't want to have to specify all the bubblegum. I would
like to skip over most of it with dot. You might recommend I use an
HTML parser instead of regex, but the HTML is rife with syntax
errors.

A bit of background on the problem. I am updating sales tax tables
for every county and city in the USA for the American Sales Tax
calculator. See http://mindprod.com/applet/americantax.html. Louisiana
makes gathering this data particularly difficult. They don't even have
a PDF document to describe the rules, much less something sane like
CSV format. They told me they hope to get organised some time this
October. They have privatised sales tax collecting. Each
parish (county) is handled by a different business or private
individual. Each parish has a web page with the rules, usually in
table form, but every one is different. I have downloaded all the web
pages and I am trying to develop regex for each parish to extract
the raw data, one regex group for city and one for tax rate.

I wrote a utility that takes a regex and a file and extracts the
groups it finds to a CSV file for further processing. I have no
problem when all the data are on a single line. I think I must have
some misconception about how multiline REGEXes work.
 
Daniele Futtorovic

I have been beating my head against a wall trying to get a regex to
work that spans several lines. The sales tax data for the cities in
Avoyelles Parish, Louisiana, looks like this:

I wrote a utility that takes a regex and a file and extracts the
groups it finds to a CSV file for further processing. I have no
problem when all the data are on a single line. I think I must have
some misconception about how multiline REGEXes work.

Have you tried the java.util.regex.Pattern.DOTALL and similar flags?
 
Roedy Green

Have you tried the java.util.regex.Pattern.DOTALL and similar flags?

In my example, I use (?m) which is supposed to turn on the multiline
feature to treat $ as end of line rather than end of string.

I suppose I could experiment with turning it on explicitly.
 
Tom Anderson

I have been beating my head against a wall trying to get a regex to
work that spans several lines. The sales tax data for the cities in
Avoyelles Parish, Louisiana, looks like this:

<p class=MsoNormal align=center style='text-align:center'>Bunkie</p>
</td>
<td style='padding:.75pt .75pt .75pt .75pt'>
<p class=MsoNormal align=center style='text-align:center'>9 %</p>
</td>
<td style='padding:.75pt .75pt .75pt .75pt'>
<p class=MsoNormal align=center style='text-align:center'>4%</p>
</td>
<td style='padding:.75pt .75pt .75pt .75pt'>
<p class=MsoNormal align=center style='text-align:center'>5%</p>
</td>
</tr>
<tr style='mso-yfti-irow:2'>
<td style='padding:.75pt .75pt .75pt .75pt'>
<p class=MsoNormal align=center
style='text-align:center'>Hessmer</p>
</td>
<td style='padding:.75pt .75pt .75pt .75pt'>
<p class=MsoNormal align=center style='text-align:center'>8%</p>
</td>
<td style='padding:.75pt .75pt .75pt .75pt'>
<p class=MsoNormal align=center style='text-align:center'>4%</p>
</td>
<td style='padding:.75pt .75pt .75pt .75pt'>
<p class=MsoNormal align=center style='text-align:center'>4%</p>
</td>

I want to find a regex that will match a pattern for each row so I can
extract the fields
Bunkie,9
and
Hessmer,8

Here is one of my many failed attempts:

(?m)center'>([a-zA-Z ]+)</p>$.+$.+$.+?center'>([0-9\.]+)([ %]+)</p>

This is the string passed to Pattern.compile after fudging to get it
past the command interpreter.

Ideally I don't want to have to specify all the bubblegum. I would
like to skip over most of it with dot. You might recommend I use an
HTML parser instead of regex, but the HTML is rife with syntax
errors.

I wrote a utility that takes a regex and a file and extracts the groups
it finds to a CSV file for further processing. I have no problem when
all the data are on a single line. I think I must have some
misconception about how multiline REGEXes work.

ISTR from Perl that . doesn't match newlines, even in multiline mode,
unless you do something special. I think the same applies in Java, and you
have to specify the DOTALL flag. In your regexp, you use MULTILINE (?m)
but not DOTALL (?s).

tom
 
Roedy Green

In your regexp, you use MULTILINE (?m)
but not DOTALL (?s).

As I understand it (?m) makes $ match line end instead of end of
string.

(?s) makes . match any character INCLUDING end of line. Normally it
excludes end of line chars.

So I would think you would use either (?m) and $ or (?s) and . to get
over line ends, if you don't handle them with explicit \n \r.
 
Daniele Futtorovic

In my example, I use (?m) which is supposed to turn on the multiline
feature to treat $ as end of line rather than end of string.

I suppose I could experiment with turning it on explicitly.

There's one problem you haven't addressed, AFAICS. You have your
"Bunkie" and "Hessmer", each followed by a row of numbers/percentages.
The trouble is the 1->N relationship. It's not clear to me whether you
want to have all these percentages or but one of them. If you want to
have only one you can get it to work. If you want to have them all AND
their number is always the same, then you can get it to work. But unless
I'm mistaken, if you want them all AND their number is NOT always the
same, you won't get it to work with one regex only. For, again: unless
I'm mistaken, undetermined capturing within quantified expressions
doesn't work. That was a terrible formulation of the problem, but I hope
you see what I mean.
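
To illustrate the point about capturing inside a quantified expression, here is a minimal sketch. The class and method names are made up for the illustration and are not part of anything posted in this thread. In java.util.regex a capturing group that sits inside a repeated group retains only the text from its last iteration, so a variable number of values is usually collected by calling find() in a loop instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedGroupDemo {

    // One regex with the capture inside a quantifier:
    // group(1) only remembers the LAST repetition.
    static String lastCaptureOnly(String row) {
        Matcher m = Pattern.compile("(?:(\\d+)\\s*%\\s*)+").matcher(row);
        return m.find() ? m.group(1) : null;
    }

    // Workaround: match one value per find() call and collect them.
    static List<String> allCaptures(String row) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile("(\\d+)\\s*%").matcher(row);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String row = "9 % 4% 5%";
        System.out.println(lastCaptureOnly(row)); // 5
        System.out.println(allCaptures(row));     // [9, 4, 5]
    }
}
```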

Here's an example that works with a fixed number of percentages (three,
as in your input). I've modified the input a bit, giving it a proper HTML
structure.

<sscce>
package scratch;

import java.util.regex.*;

public class Scratch {

    public static void main(String[] ss) {
        String term = System.getProperty("line.separator");
        // Sample input: two table rows, each holding a city name and three rates.
        String input = "<tr><td><p class=MsoNormal align=center style='text-align:center'>Bunkie</p>" + term +
            "</td>" + term +
            "<td style='padding:.75pt .75pt .75pt .75pt'>" + term +
            "<p class=MsoNormal align=center style='text-align:center'>9 %</p>" + term +
            " </td>" + term +
            " <td style='padding:.75pt .75pt .75pt .75pt'>" + term +
            " <p class=MsoNormal align=center style='text-align:center'>4%</p>" + term +
            " </td>" + term +
            " <td style='padding:.75pt .75pt .75pt .75pt'>" + term +
            " <p class=MsoNormal align=center style='text-align:center'>5%</p>" + term +
            " </td>" + term +
            " </tr>" + term +
            " <tr style='mso-yfti-irow:2'>" + term +
            " <td style='padding:.75pt .75pt .75pt .75pt'>" + term +
            " <p class=MsoNormal align=center" + term +
            "style='text-align:center'>Hessmer</p>" + term +
            " </td>" + term +
            " <td style='padding:.75pt .75pt .75pt .75pt'>" + term +
            " <p class=MsoNormal align=center style='text-align:center'>8%</p>" + term +
            " </td>" + term +
            " <td style='padding:.75pt .75pt .75pt .75pt'>" + term +
            " <p class=MsoNormal align=center style='text-align:center'>4%</p>" + term +
            " </td>" + term +
            " <td style='padding:.75pt .75pt .75pt .75pt'>" + term +
            " <p class=MsoNormal align=center style='text-align:center'>4%</p>" + term +
            " </td></tr>";

        // (?s) is DOTALL: '.' also matches line terminators.
        // Each of the four lines below captures the <p> content of one cell.
        Pattern p = Pattern.compile("(?s)<tr(?:" +
            ".*?<td.*?<p .*?>(.*?)</p>.*?</td>" +
            ".*?<td.*?<p .*?>(.*?)</p>.*?</td>" +
            ".*?<td.*?<p .*?>(.*?)</p>.*?</td>" +
            ".*?<td.*?<p .*?>(.*?)</p>.*?</td>" +
            ").*?</tr>");

        // One match per table row: group 1 is the city, groups 2..4 are the rates.
        for (Matcher m = p.matcher(input); m.find(); ) {
            System.out.print(m.group(1) + ": ");

            for (int ii = 2; ii <= m.groupCount(); ii++) {
                System.out.print(m.group(ii));

                System.out.print((ii < m.groupCount()) ? ", " : term);
            }
        }
    }
}
</sscce>
 
Roedy Green

There's one problem you haven't addressed, AFAICS. You have your
"Bunkie" and "Hessmer", each followed by a row of numbers/percentages.
The trouble is the 1->N relationship. It's not clear to me whether you
want to have all these percentages or but one of them. If you want to
have only one you can get it to work. If you want to have them all AND
their number is always the same, then you can get it to work. But unless
I'm mistaken, if you want them all AND their number is NOT always the
same, you won't get it to work with one regex only. For, again: unless
I'm mistaken, undetermined capturing within quantified expressions
doesn't work. That was a terrible formulation of the problem, but I hope
you see what I mean.

I just want the first number after the name.

These are city names and the total tax. The other two columns are
local tax and state tax which sum to the first number -- at least that
is how many of the parishes do it.

Thankfully I am not trying to pluck a variable number of related
values.
 
Roedy Green

Pattern p = Pattern.compile("(?s)<tr(?:" +
".*?<td.*?<p .*?>(.*?)</p>.*?</td>" +
".*?<td.*?<p .*?>(.*?)</p>.*?</td>" +
".*?<td.*?<p .*?>(.*?)</p>.*?</td>" +
".*?<td.*?<p .*?>(.*?)</p>.*?</td>" +
").*?</tr>");

It seems to me I tried a pattern similar to that before without
success, but this time it worked perfectly. Thanks very much.

pluck "(?s)<tr.*?>.*?<td.*?>*.?<p.*?>([a-zA-Z ]+)</p>.*?</td>.*?<td.*?>.*?<p.*?>([0-9\.]+)[ %%]+</p>" "Avoyelles.csv" "Avoyelles.html"

The % is doubled to foil the command processor.

It extracted the data:

Avoyelles.html,Bunkie,9
Avoyelles.html,Hessmer,8
Avoyelles.html,Cottonport,8
Avoyelles.html,Mansura,9
Avoyelles.html,Marksville,9

For this project the (?s) method will work better than (?m) since the
line breaks are not consistent.
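
For readers who want to try the (?s) approach without the pluck utility, here is a rough Java sketch. The class name, the SAMPLE string and the extract method are invented for the illustration, and the pattern is a tidied variant of the one on the command line above rather than the exact string that was run. It captures the city name and the first number that follows it, and it tolerates a line break in the middle of a tag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstRateDemo {

    // Cut-down stand-in for the parish page (hypothetical test data).
    static final String SAMPLE = String.join("\n",
        "<tr>",
        "<td style='padding:.75pt'>",
        "<p class=MsoNormal align=center style='text-align:center'>Bunkie</p>",
        "</td>",
        "<td style='padding:.75pt'>",
        "<p class=MsoNormal align=center style='text-align:center'>9 %</p>",
        "</td>",
        "</tr>",
        "<tr style='mso-yfti-irow:2'>",
        "<td style='padding:.75pt'>",
        "<p class=MsoNormal align=center",
        "style='text-align:center'>Hessmer</p>",
        "</td>",
        "<td style='padding:.75pt'>",
        "<p class=MsoNormal align=center style='text-align:center'>8%</p>",
        "</td>",
        "</tr>");

    // (?s) lets '.' cross line terminators; every .*? is reluctant,
    // so each step prefers the nearest tag that lets the match succeed.
    static final Pattern ROW = Pattern.compile(
        "(?s)<tr.*?>.*?<td.*?>.*?<p.*?>([a-zA-Z ]+)</p>.*?</td>"
        + ".*?<td.*?>.*?<p.*?>([0-9.]+)[ %]+</p>");

    // Returns one "city,rate" string per table row found.
    static List<String> extract(String html) {
        List<String> rows = new ArrayList<>();
        Matcher m = ROW.matcher(html);
        while (m.find()) {
            rows.add(m.group(1) + "," + m.group(2));
        }
        return rows;
    }

    public static void main(String[] args) {
        for (String row : extract(SAMPLE)) {
            System.out.println(row);
        }
    }
}
```

On the sample above this prints Bunkie,9 and Hessmer,8. Reluctant quantifiers only prefer short matches, they do not forbid long ones, so on messier pages a row whose first cell is not a plain name could cause the match to run on into the next row.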
 
Jussi Piitulainen

Roedy said:
As I understand it (?m) makes $ match line end instead of end of
string.

(?s) makes . match any character INCLUDING end of line. Normally it
excludes end of line chars.

So I would think you would use either (?m) and $ or (?s) and . to get
over line ends, if you don't handle them with explicit \n \r.

^ and $ do not match any character. In multiline mode, they match an
empty string right after or before a line terminator. Just use (?s)
and let . eat terminator characters and other whitespace.

class T {
    // matches() must consume the whole input, here two line terminators.
    // The parameter is named args because _ is no longer a legal identifier.
    public static void main(String[] args) {
        System.out.println("\n\n".matches("(?m)..")); // false
        System.out.println("\n\n".matches("(?s)..")); // true
        System.out.println("\n\n".matches("(?m)$$")); // false
        System.out.println("\n\n".matches("(?s)$$")); // false
        System.out.println("\n\n".matches("(?m)\n$")); // false
        System.out.println("\n\n".matches("(?s)..$")); // true
    }
}
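
The same point can be tied back to the original failed attempt with a small sketch. It uses find() rather than matches(), and the class name and test string are made up. In multiline mode $ succeeds just before the line terminator but consumes nothing, so the following .+ is left facing a "\n" that it cannot match unless (?s) is on or the terminator is matched explicitly.

```java
import java.util.regex.Pattern;

class AnchorDemo {

    // True if the regex matches anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String text = "name</p>\n</td>\n<p>9 %</p>";

        // $ is zero-width: after it, '.' still has to get past "\n" and cannot.
        System.out.println(found("(?m)</p>$.+9", text));          // false

        // Matching each terminator explicitly works, but is brittle
        // when the line breaks are inconsistent.
        System.out.println(found("(?m)</p>$\\n.+\\n.+9", text));  // true

        // With DOTALL the dot simply eats the terminators.
        System.out.println(found("(?s)</p>.+9", text));           // true
    }
}
```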
 
Daniele Futtorovic

Pattern p = Pattern.compile("(?s)<tr(?:" +
".*?<td.*?<p .*?>(.*?)</p>.*?</td>" +
".*?<td.*?<p .*?>(.*?)</p>.*?</td>" +
".*?<td.*?<p .*?>(.*?)</p>.*?</td>" +
".*?<td.*?<p .*?>(.*?)</p>.*?</td>" +
").*?</tr>");

It seems to me I tried a pattern similar to that before without
success, but this time it worked perfectly. Thanks very much.

pluck "(?s)<tr.*?>.*?<td.*?>*.?<p.*?>([a-zA-Z ]+)</p>.*?</td>.*?<td.*?>.*?<p.*?>([0-9\.]+)[ %%]+</p>" "Avoyelles.csv" "Avoyelles.html"

The % is doubled to foil the command processor.

It extracted the data:

Avoyelles.html,Bunkie,9
Avoyelles.html,Hessmer,8
Avoyelles.html,Cottonport,8
Avoyelles.html,Mansura,9
Avoyelles.html,Marksville,9

For this project the (?s) method will work better than (?m) since the
line breaks are not consistent.

Glad I could help.

BTW, it's not necessary to quote the dot in a character class.
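
A quick check of that, as a sketch with an invented class name: inside [...] the dot is already a literal, so both spellings accept the same strings.

```java
class DotInClassDemo {

    // Unescaped dot inside a character class: a literal '.'.
    static boolean plain(String s) {
        return s.matches("[0-9.]+");
    }

    // Escaped dot inside a character class: also a literal '.'.
    static boolean escaped(String s) {
        return s.matches("[0-9\\.]+");
    }

    public static void main(String[] args) {
        System.out.println(plain("8.5"));   // true
        System.out.println(escaped("8.5")); // true
        System.out.println(plain("8x5"));   // false: '.' in a class is not a wildcard
    }
}
```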
 
Roedy Green

Roedy, has this become an intellectual challenge, or do
you just want to get the job done? I

I was hoping the exercise would generate some rules of thumb and
recipes that would be useful for general screenscraping.

Perhaps I could add a flag to pluck to translate all control chars to
space before you start scanning to simplify the regexes.
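
A sketch of what such a flag might do, assuming it amounts to replacing every control character with a space before matching. The names flatten and firstRate are hypothetical and are not options of the real pluck utility. Once the line terminators are gone, the pattern needs neither (?s) nor (?m).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NormalizeDemo {

    // No flags: '.' does not need to cross line terminators
    // once they have been turned into spaces.
    static final Pattern ROW = Pattern.compile(
        "<p.*?>([a-zA-Z ]+)</p>.*?<p.*?>([0-9.]+)[ %]+</p>");

    // Replace control characters (including \r, \n and \t) with spaces.
    static String flatten(String html) {
        return html.replaceAll("\\p{Cntrl}", " ");
    }

    // Returns "city,rate" for the first row found, or null if none matches.
    static String firstRate(String html) {
        Matcher m = ROW.matcher(html);
        return m.find() ? m.group(1) + "," + m.group(2) : null;
    }

    public static void main(String[] args) {
        String raw = "<p\nclass=x>Hessmer</p>\r\n</td>\r\n<td>\r\n<p>8%</p>";
        System.out.println(firstRate(raw));          // null
        System.out.println(firstRate(flatten(raw))); // Hessmer,8
    }
}
```

The trade-off is that any structure carried by the line breaks is lost, which probably does not matter for HTML like this.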
 
BTDTGTTS

Roedy said:
I was hoping the exercise would generate some rules of thumb and
recipes that would be useful for general screenscraping.

Perhaps I could add a flag to pluck to translate all control chars to
space before you start scanning to simplify the regexes.

FWIW I wouldn't dream of using regex for this kind of stuff. I'd use
something like TagSoup to turn the HTML into something approximating a well
formed XML document and either an XSLT stylesheet or XPath to extract the
data.

Regards
 
Roedy Green

FWIW I wouldn't dream of using regex for this kind of stuff. I'd use
something like TagSoup to turn the HTML into something approximating a well
formed XML document and either an XSLT stylesheet or XPath to extract the
data.

I think I will give that a try. There is a general principle to first
try the tool most specialised for the job.
 
BTDTGTTS

Roedy said:
I think I will give that a try. There is a general principle to first
try the tool most specialised for the job.

When screen scraping, I've always found it conceptually easier to be able to
say that I want, for example, "columns 3 and 4 from row 2 of the third
table". Of course, this doesn't help if they change the layout of the page
on you, but depending on how good the HTML is and how proficient you are
with XSLT & XPath you can usually be more precise.
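
As a rough illustration of that kind of positional addressing, here is a sketch that uses only the XPath support bundled with the JDK. It assumes the page has already been tidied into well-formed XML with no namespace on the elements (for instance by a tool such as TagSoup, which is not used here). The class name, the cell method and the sample document are invented for the example.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class XPathCellsDemo {

    // Tiny, already well-formed stand-in for a cleaned-up page (hypothetical data).
    static final String SAMPLE =
        "<html><body><table>"
        + "<tr><td><p>Bunkie</p></td><td><p>9 %</p></td></tr>"
        + "<tr><td><p>Hessmer</p></td><td><p>8%</p></td></tr>"
        + "</table></body></html>";

    // Text of the given column in the given row of the n-th table (all 1-based).
    static String cell(String xml, int table, int row, int col) {
        String expr = "normalize-space((//table)[" + table + "]//tr[" + row
            + "]/td[" + col + "])";
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(expr, new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed: " + expr, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(cell(SAMPLE, 1, 2, 1)); // Hessmer
        System.out.println(cell(SAMPLE, 1, 2, 2)); // 8%
    }
}
```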
 
