Regex challenge

Roedy Green · Jun 4, 2008

I have been beating my head against a wall trying to get a regex to
work that spans several lines. The sales tax data for the cities in
Avoyelles Parish, Louisiana looks like this:

Bunkie
</td>
<td style='padding:.75pt .75pt .75pt .75pt'>
9 %
</td>
<td style='padding:.75pt .75pt .75pt .75pt'>
4%
</td>
<td style='padding:.75pt .75pt .75pt .75pt'>
5%
</td>
</tr>
<tr style='mso-yfti-irow:2'>
<td style='padding:.75pt .75pt .75pt .75pt'>
Hessmer
</td>
<td style='padding:.75pt .75pt .75pt .75pt'>
8%
</td>
<td style='padding:.75pt .75pt .75pt .75pt'>
4%
</td>
<td style='padding:.75pt .75pt .75pt .75pt'>
4%
</td>

I want to find a regex that will match a pattern for each row so I can
extract the fields
Bunkie,9
and
Hessmer,8

Here is one of my many failed attempts:

(?m)center'>([a-zA-Z ]+)$.+$.+$.+?center'>([0-9\.]+)([ %]+)

This is the string passed to Pattern.compile after fudging to get it
past the command interpreter.

Ideally I don't want to have to specify all the bubblegum. I would
like to skip over most of it with dot. You might recommend I use an
HTML parser instead of regex, but the HTML is rife with syntax
errors.

A bit of background on the problem. I am updating sales tax tables
for every county and city in the USA for the American Sales Tax
calculator. See http://mindprod.com/applet/americantax.html. Louisiana
makes gathering this data particularly difficult. They don't even have
a PDF document to describe the rules, much less something sane like
CSV format. They told me they hope to get organised some time this
October. They have privatised sales tax collecting. Each
parish (county) is handled by a different business or private
individual. Each parish has a web page with the rules, usually in
table form, but every one is different. I have downloaded all the web
pages and I am trying to develop a regex for each parish to extract
the raw data, one regex group for city and one for tax rate.

I wrote a utility that takes a regex and a file and extracts the
groups it finds to a CSV file for further processing. I have no
problem when all the data are on a single line. I think I must have
some misconception about how multiline REGEXes work.

Daniele Futtorovic · Jun 4, 2008

Have you tried the java.util.regex.Pattern.DOTALL and similar flags?

Roedy Green · Jun 4, 2008

In my example, I use (?m) which is supposed to turn on the multiline
feature to treat $ as end of line rather than end of string.

I suppose I could experiment with turning it on explicitly.

Tom Anderson · Jun 4, 2008

ISTR from Perl that . doesn't match newlines, even in multiline mode,
unless you do something special. I think the same applies in Java, and you
have to specify the DOTALL flag. In your regexp, you use MULTILINE (?m)
but not DOTALL (?s).

tom

Roedy Green · Jun 4, 2008

As I understand it (?m) makes $ match line end instead of end of
string.

(?s) makes . match any character INCLUDING end of line. Normally it
excludes end of line chars.

So I would think you would either use (?m) and $ or (?s) and . to get
over line ends, if you don't handle them with explicit \n \r.

Daniele Futtorovic · Jun 4, 2008

There's one problem you haven't addressed, AFAICS. You have your
"Bunkie" and "Hessmer", each followed by a row of numbers/percentages.
The trouble is the 1->N relationship. It's not clear to me whether you
want to have all these percentages or but one of them. If you want to
have only one you can get it to work. If you want to have them all AND
their number is always the same, then you can get it to work. But unless
I'm mistaken, if you want them all AND their number is NOT always the
same, you won't get it to work with one regex only. For, again: unless
I'm mistaken, undetermined capturing within quantified expressions
doesn't work. That was a terrible formulation of the problem, but I hope
you see what I mean.

Here's an example that works with a fixed number of percentages (three,
as in your input). I've modified the input a bit, giving it a proper HTML
structure.

<sscce>
package scratch;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Scratch {

    public static void main(String[] ss) {
        String term = System.getProperty("line.separator");
        String input = "<tr><td>Bunkie" + term +
            "</td>" + term +
            "<td style='padding:.75pt .75pt .75pt .75pt'>" + term +
            "9 %" + term +
            " </td>" + term +
            " <td style='padding:.75pt .75pt .75pt .75pt'>" + term +
            " 4%" + term +
            " </td>" + term +
            " <td style='padding:.75pt .75pt .75pt .75pt'>" + term +
            " 5%" + term +
            " </td>" + term +
            " </tr>" + term +
            " <tr style='mso-yfti-irow:2'>" + term +
            " <td style='padding:.75pt .75pt .75pt .75pt'>" + term +
            " Hessmer" + term +
            " </td>" + term +
            " <td style='padding:.75pt .75pt .75pt .75pt'>" + term +
            " 8%" + term +
            " </td>" + term +
            " <td style='padding:.75pt .75pt .75pt .75pt'>" + term +
            " 4%" + term +
            " </td>" + term +
            " <td style='padding:.75pt .75pt .75pt .75pt'>" + term +
            " 4%" + term +
            " </td></tr>";

        // (?s) lets . match line terminators as well.
        // Each cell is: "<td ...>", optional whitespace, the captured
        // text, optional whitespace, "</td>".
        Pattern p = Pattern.compile("(?s)<tr(?:" +
            ".*?<td.*?>\\s*(.*?)\\s*</td>" +
            ".*?<td.*?>\\s*(.*?)\\s*</td>" +
            ".*?<td.*?>\\s*(.*?)\\s*</td>" +
            ".*?<td.*?>\\s*(.*?)\\s*</td>" +
            ").*?</tr>");

        // One match per table row: group 1 is the name, groups 2..4 the rates.
        for (Matcher m = p.matcher(input); m.find(); ) {
            System.out.print(m.group(1) + ": ");

            for (int ii = 2; ii <= m.groupCount(); ii++) {
                System.out.print(m.group(ii));

                System.out.print((ii < m.groupCount()) ? ", " : term);
            }
        }
    }
}
</sscce>
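
The remark above about capturing inside quantified expressions can be checked directly. In java.util.regex, a capturing group that sits inside a repeated group keeps only the text from its last iteration, so a single pattern cannot hand back a variable number of percentages. One common workaround, sketched here on the assumption that the row text has already been isolated, is to run a second, simpler pattern over the row and collect every hit. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedGroupDemo {

    // A capturing group inside a quantified group keeps only its last iteration.
    static String lastIterationOnly(String row) {
        Matcher m = Pattern.compile("(?:(\\d+)%\\s*)+").matcher(row);
        return m.find() ? m.group(1) : null;
    }

    // Workaround: loop with find() over the row and collect every percentage.
    static List<String> allPercentages(String row) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile("(\\d+)\\s*%").matcher(row);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String row = "9% 4% 5%";
        System.out.println(lastIterationOnly(row)); // prints 5
        System.out.println(allPercentages(row));    // prints [9, 4, 5]
    }
}
```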

Roedy Green · Jun 4, 2008

I just want the first number after the name.

These are city names and the total tax. The other two columns are
local tax and state tax which sum to the first number -- at least that
is how many of the parishes do it.

Thankfully I am not trying to pluck a variable number of related
values.

Roedy Green · Jun 4, 2008

("(?s)<tr(?:"

I have never seen ?: before. What does it mean?

Daniele Futtorovic · Jun 4, 2008

Non-capturing group. Wasn't needed in the version I posted. I forgot to
remove it.
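
As a small illustration of the difference, (?:...) groups a sub-pattern so a quantifier can apply to it, but it does not get a group number. The sketch below uses made-up patterns and a hypothetical helper name, and only compares group counts.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NonCapturingDemo {

    // Returns how many capturing groups the pattern defines.
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        System.out.println(groupCount("(ab)+(\\d)"));   // prints 2
        System.out.println(groupCount("(?:ab)+(\\d)")); // prints 1

        // With the non-capturing form, the digit is group 1.
        Matcher m = Pattern.compile("(?:ab)+(\\d)").matcher("abab7");
        if (m.matches()) {
            System.out.println(m.group(1)); // prints 7
        }
    }
}
```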

Roedy Green · Jun 4, 2008

It seems to me I tried a pattern similar to yours before without
success, but this time it worked perfectly. Thanks very much.

pluck "(?s)<tr.*?>.*?<td.*?>*.?<p.*?>([a-zA-Z ]+).*?</td>.*?<td.*?>.*?<p.*?>([0-9\.]+)[ %%]+" "Avoyelles.csv" "Avoyelles.html"

The % is doubled to foil the command processor.

It extracted the data:

Avoyelles.html,Bunkie,9
Avoyelles.html,Hessmer,8
Avoyelles.html,Cottonport,8
Avoyelles.html,Mansura,9
Avoyelles.html,Marksville,9

For this project the (?s) method will work better than (?m) since the
line breaks are not consistent.
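
For readers without the pluck utility, here is a rough Java sketch of the same idea. The source of pluck is not shown in this thread, so the class name, method name and pattern below are assumptions. The pattern is written against the simplified table rows quoted at the top of the thread, which have no <p> tags, rather than against the real Avoyelles page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PluckSketch {

    // (?s) lets the lazy .*? gaps run across line breaks.
    // Group 1 is the city name, group 2 the first rate that follows it.
    static final Pattern ROW = Pattern.compile(
            "(?s)<tr.*?<td[^>]*>\\s*([a-zA-Z ]+?)\\s*</td>"
            + ".*?<td[^>]*>\\s*([0-9.]+)\\s*%");

    // Returns one "city,rate" line per table row found in the text.
    static List<String> pluck(String html) {
        List<String> lines = new ArrayList<>();
        Matcher m = ROW.matcher(html);
        while (m.find()) {
            lines.add(m.group(1) + "," + m.group(2));
        }
        return lines;
    }

    public static void main(String[] args) {
        String html = String.join("\n",
                "<tr>",
                "<td style='padding:.75pt .75pt .75pt .75pt'>",
                "Bunkie",
                "</td>",
                "<td style='padding:.75pt .75pt .75pt .75pt'>",
                "9 %",
                "</td>",
                "</tr>",
                "<tr style='mso-yfti-irow:2'>",
                "<td style='padding:.75pt .75pt .75pt .75pt'>",
                "Hessmer",
                "</td>",
                "<td style='padding:.75pt .75pt .75pt .75pt'>",
                "8%",
                "</td>",
                "</tr>");
        for (String line : pluck(html)) {
            System.out.println(line); // prints Bunkie,9 then Hessmer,8
        }
    }
}
```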

Jussi Piitulainen · Jun 4, 2008

Roedy said:
As I understand it (?m) makes $ match line end instead of end of
string.

(?s) makes . match any character INCLUDING end of line. Normally it
excludes end of line chars.

So I would think you would either use (?m) and $ or (?s) and . to get
over line ends, if you don't handle them with explicit \n \r.

^ and $ do not match any character. In multiline mode, they match an
empty string right after or before a line terminator. Just use (?s)
and let . eat terminator characters and other whitespace.

class T {
    public static void main(String[] args) {
        // matches() must consume the whole string "\n\n".
        System.out.println("\n\n".matches("(?m)..")); // false: . skips line terminators
        System.out.println("\n\n".matches("(?s)..")); // true: DOTALL lets . match \n
        System.out.println("\n\n".matches("(?m)$$")); // false: $ consumes nothing
        System.out.println("\n\n".matches("(?s)$$")); // false
        System.out.println("\n\n".matches("(?m)\n$")); // false: second \n is left over
        System.out.println("\n\n".matches("(?s)..$")); // true
    }
}

Daniele Futtorovic · Jun 4, 2008

Glad I could help.

BTW, it's not necessary to quote the dot in a character class.
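
A quick check of that point, as a sketch with hypothetical names: inside a character class the dot is already a literal, so [0-9\.] and [0-9.] accept the same strings.

```java
import java.util.regex.Pattern;

class DotInClassDemo {

    // True if the unescaped class accepts the whole input.
    static boolean plainMatches(String input) {
        return Pattern.matches("[0-9.]+", input);
    }

    // True if the escaped and unescaped classes agree on the input.
    static boolean sameResult(String input) {
        return Pattern.matches("[0-9\\.]+", input) == plainMatches(input);
    }

    public static void main(String[] args) {
        System.out.println(plainMatches("8.75")); // prints true
        System.out.println(plainMatches("8x75")); // prints false: the dot is literal here
        System.out.println(sameResult("8.75"));   // prints true
        System.out.println(sameResult("8x75"));   // prints true
    }
}
```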

Roedy Green · Jun 4, 2008

Roedy, has this become an intellectual challenge, or do
you just want to get the job done? I

I was hoping the exercise would generate some rules of thumb and
recipes that would be useful for general screenscraping.

Perhaps I could add a flag to pluck to translate all control chars to
space before you start scanning to simplify the regexes.
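
A minimal sketch of what such a flag might do, assuming the whole file is already in a String. The names are hypothetical and this is not taken from pluck itself. Once every run of control characters has become a single space, the text is one line and the pattern no longer needs (?s).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FlattenDemo {

    // Replace each run of control characters (CR, LF, tab, ...) with one space.
    static String flatten(String text) {
        return text.replaceAll("\\p{Cntrl}+", " ");
    }

    // No (?s) here: after flattening there are no line terminators left.
    static final Pattern CITY_AND_RATE = Pattern.compile(
            "<td[^>]*>\\s*([a-zA-Z ]+?)\\s*</td>.*?<td[^>]*>\\s*([0-9.]+)\\s*%");

    // Returns "city,rate" for the first pair found, or null if there is none.
    static String firstCityAndRate(String html) {
        Matcher m = CITY_AND_RATE.matcher(flatten(html));
        return m.find() ? m.group(1) + "," + m.group(2) : null;
    }

    public static void main(String[] args) {
        String html = "<td>\r\nBunkie\r\n</td>\r\n<td style='x'>\r\n9 %\r\n</td>";
        System.out.println(flatten("a\r\nb\tc"));   // prints a b c
        System.out.println(firstCityAndRate(html)); // prints Bunkie,9
    }
}
```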

BTDTGTTS · Jun 4, 2008

FWIW I wouldn't dream of using regex for this kind of stuff. I'd use
something like TagSoup to turn the HTML into something approximating a well
formed XML document and either an XSLT stylesheet or XPath to extract the
data.

Regards

Roedy Green · Jun 5, 2008

I think I will give that a try. There is a general principle to first
try the tool most specialised for the job.

BTDTGTTS · Jun 5, 2008

When screen scraping, I've always found it conceptually easier to be able to
say that I want, for example, "columns 3 and 4 from row 2 of the third
table". Of course, this doesn't help if they change the layout of the page
on you, but depending on how good the HTML is and how proficient you are
with XSLT & XPath you can usually be more precise.
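
To make the "row 2 of the table" style of addressing concrete, here is a sketch using only the XPath support in the JDK. It assumes the page has already been turned into well-formed XML by a tidying step such as the one described above; that step is not shown, and the class name, method name and sample markup are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathCellDemo {

    // Returns the trimmed text of every node selected by the XPath expression.
    static List<String> select(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance()
                    .newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
            List<String> cells = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                cells.add(nodes.item(i).getTextContent().trim());
            }
            return cells;
        } catch (Exception e) {
            throw new IllegalStateException("could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        String xml = "<html><body><table>"
                + "<tr><td>Bunkie</td><td>9 %</td><td>4%</td><td>5%</td></tr>"
                + "<tr><td>Hessmer</td><td>8%</td><td>4%</td><td>4%</td></tr>"
                + "</table></body></html>";
        // Columns 1 and 2 from row 2 of the first table.
        System.out.println(select(xml, "(//table)[1]/tr[2]/td[position() <= 2]"));
        // prints [Hessmer, 8%]
    }
}
```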
