Regular expression to structure HTML

504crank · Oct 2, 2009

I'm kind of new to regular expressions, and I've spent hours trying to
finesse a regular expression to build a substitution.

What I'd like to do is extract data elements from HTML and structure
them so that they can more readily be imported into a database.

No -- sorry -- I don't want to use BeautifulSoup (though I have for
other projects). Humor me, please -- I'd really like to see if this
can be done with just regular expressions.

Note that the output is referenced using named groups.

My challenge is successfully matching the HTML tags in between the
first table row, and the second table row.

I'd appreciate any suggestions to improve the approach.

import re

rText = "<tr><td valign=top>8583</td><td valign=top><a href=lic_details.asp?lic_number=8583>New Horizon Technical Academy, Inc #4</a></td><td valign=top>Jefferson</td><td valign=top>70114</td></tr><tr><td valign=top>9371</td><td valign=top><a href=lic_details.asp?lic_number=9371>Career Learning Center</a></td><td valign=top>Jefferson</td><td valign=top>70113</td></tr>"

rText = re.compile(r'(<tr><td valign=top>)(?P<zlicense>\d+)(</td>)(<td valign=top>)(<a href=lic_details.asp)(\?lic_number=\d+)(>)(?P<zname>[A-Za-z0-9#\s\S\W]+)(</.*?>).+$').sub(r'LICENSE:\g<zlicense>|NAME:\g<zname>\n', rText)

print rText

LICENSE:8583|NAME:New Horizon Technical Academy, Inc #4</a></td><td
valign=top>Jefferson</td><td valign=top>70114</td></tr><tr><td
valign=top>9371</td><td valign=top><a href=lic_details.asp?
lic_number=9371>Career Learning Center|PARISH:Jefferson|ZIP:70113

Bruno Desthuilliers · Oct 2, 2009

(e-mail address removed) wrote:

I'm kind of new to regular expressions, and I've spent hours trying to
finesse a regular expression to build a substitution.

What I'd like to do is extract data elements from HTML and structure
them so that they can more readily be imported into a database.

No -- sorry -- I don't want to use BeautifulSoup (though I have for
other projects). Humor me, please -- I'd really like to see if this
can be done with just regular expressions.

I'm kind of new to hammers, and I've spent hours trying to find out how
to drive a screw with a hammer. No -- sorry -- I don't want to use a
screwdriver.

<g>

Paul McGuire · Oct 2, 2009

I'm kind of new to regular expressions, and I've spent hours trying to
finesse a regular expression to build a substitution.

What I'd like to do is extract data elements from HTML and structure
them so that they can more readily be imported into a database.

Oy! If I had a nickel for every misguided coder who tried to scrape
HTML with regexes...

Some reasons why RE's are no good at parsing HTML:
- tags can be mixed case
- tags can have whitespace in many unexpected places
- tags with no body can combine opening and closing tag with a '/'
before the closing '>', as in "<BR/>"
- tags can have attributes that you did not expect (like "<BR
CLEAR=ALL>")
- attributes can occur in any order within the tag
- attribute names can also be in unexpected upper/lower case
- attribute values can be enclosed in double quotes, single quotes, or
even (surprise!) NO quotes
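
A small sketch of the points in this list, assuming Python's re module. The variant strings are made up for illustration and are not taken from the page being scraped; each one is markup a browser would treat as the same table cell:

```python
import re

# A rigid pattern written for one exact spelling of the cell markup.
RIGID = re.compile(r'<td valign=top>(\d+)</td>')

# Hypothetical variants of the same cell.
variants = [
    '<td valign=top>8583</td>',          # the spelling the pattern expects
    '<TD VALIGN=top>8583</TD>',          # mixed case
    '<td  valign=top >8583</td>',        # extra whitespace inside the tag
    '<td valign="top">8583</td>',        # double-quoted attribute value
    "<td valign='top'>8583</td>",        # single-quoted attribute value
    '<td class=x valign=top>8583</td>',  # unexpected extra attribute
]

matched = [bool(RIGID.search(v)) for v in variants]
for v, ok in zip(variants, matched):
    print(ok, v)
```

Only the first variant matches. Flags such as re.I or a looser pattern can absorb some of these cases, but each fix covers one variation at a time.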

For HTML that is machine-generated, you *may* be able to make some
page-specific assumptions. But if edited by human hands, or if you
are trying to make a generic page scraper, RE's will never cut it.

-- Paul

Stefan Behnel · Oct 2, 2009

Paul said:
Oy! If I had a nickel for every misguided coder who tried to scrape
HTML with regexes...

Some reasons why RE's are no good at parsing HTML:
- tags can be mixed case
- tags can have whitespace in many unexpected places
- tags with no body can combine opening and closing tag with a '/'
before the closing '>', as in "<BR/>"
- tags can have attributes that you did not expect (like "<BR
CLEAR=ALL>")
- attributes can occur in any order within the tag
- attribute names can also be in unexpected upper/lower case
- attribute values can be enclosed in double quotes, single quotes, or
even (surprise!) NO quotes

BTW, BeautifulSoup's parser also uses regexes, so if the OP used it, he/she
could claim to have solved the problem "with regular expressions" without
even lying.

Stefan

John · Oct 2, 2009

Some suggestions to start off with:

* triple-quote your multiline strings
* consider using the re.X, re.M, and re.S options for re.compile()
* save your re object after you compile it
* note that re.sub() returns a new string
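
A rough sketch of three of these suggestions together (triple-quoted input, a saved pattern object, and sub() returning a new string). The shortened two-cell rows and the name LICENSE_CELL are assumptions made for the example, not the real page:

```python
import re

# Triple quotes let the sample span several lines without backslashes.
html = """<tr><td valign=top>8583</td><td valign=top>Jefferson</td></tr>
<tr><td valign=top>9371</td><td valign=top>Jefferson</td></tr>"""

# Compile once, keep the object, and reuse it for every operation.
# re.M makes ^ match at the start of each line, not just the string.
LICENSE_CELL = re.compile(r'^<tr><td valign=top>(?P<zlicense>\d+)</td>', re.M)

licenses = LICENSE_CELL.findall(html)

# sub() builds and returns a new string; html itself is left untouched.
labelled = LICENSE_CELL.sub(r'LICENSE:\g<zlicense>|', html)

print(licenses)
print(labelled)
```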

Also, it sounds like you want to replace the first 2 <td> elements for
each <tr> element with their content separated by a pipe (throwing
away the <td> tags themselves), correct?

---John

Brian D · Oct 2, 2009

Yes, John, that's correct. I'm trying to trap and discard the <tr> row
<td> elements, re-formatting with pipes so that I can more readily
import the data into a database. The tags are, of course, initially
useful for pattern discovery. But there are other approaches -- I
could just replace the tags and capture the data as an array.

I'm well aware of the problems using regular expressions for html
parsing. This isn't merely a question of knowing when to use the right
tool. It's a question about how to become a better developer using
regular expressions.

I'm trying to figure out where the regular expression fails. The
structure of the page I'm scraping is uniform in the production of
tags -- it's an old ASP page that pulls data from a database.

What's different in the first <tr> row is the appearance of a comma, a
# pound symbol, and a number (", Inc #4"). I'm making the assumption
that's what's throwing off the remainder of the regular expression --
because (despite the snark by others above) the expression is working
for every other data row. But I could be wrong. Of course, if I could
identify the problem, I wouldn't be asking. That's why I posted the
question for other eyes to review.

I discovered that I may actually be properly parsing the data from the
tags when I tried this test in a Python interpreter:

>>> s = "New Horizon Technical Academy, Inc #4</a>"
>>> p = re.compile(r'([\s\S\WA-Za-z0-9]*)(</.*?>)')
>>> m = p.match(s)
>>> m.group(0)
'New Horizon Technical Academy, Inc #4</a>'
>>> m.group(1)
'New Horizon Technical Academy, Inc #4'
>>> m.group(2)
'</a>'

I found it curious that I was capturing the groups as sequences, but I
didn't understand how to use this knowledge in named groups -- or
maybe I am merely mis-identifying the source of the regular expression
problem.
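
For what it's worth, a minimal sketch of how numbered and named groups relate, using the same test string. The pattern is simplified to [^<]* for readability, and zclose is a made-up group name used only here:

```python
import re

s = "New Horizon Technical Academy, Inc #4</a>"

# Two named groups: the text before the tag, and the closing tag itself.
p = re.compile(r'(?P<zname>[^<]*)(?P<zclose></[^>]*>)')
m = p.match(s)

by_name = m.group('zname')   # look a group up by its name
by_index = m.group(1)        # the same group is still reachable by number
as_dict = m.groupdict()      # every named group at once, as a dict

print(by_name)
print(as_dict)
```

Naming a group does not change what it captures; it only adds a second way to refer to it, both on the match object and in a replacement string via \g<zname>.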

It's a puzzle. I'm hoping someone will want to share the wisdom of
their experience, not criticize for the attempt to learn. Maybe one
shouldn't learn how to use a hammer on a screw, but I wouldn't say
that I have never hammered a screw into a piece of wood just because I
only had a hammer.

Thanks,
Brian

Brian D · Oct 2, 2009

The other thought I had was that I may not be properly trapping the
end of the first <tr> row, and the beginning of the next <tr> row.

504crank · Oct 2, 2009

Screw:

s = """<tr>

<td valign=top>14313
</td>

<td valign=top><a href=lic_details.asp?lic_number=14313>Python
Hammer Institute #2</a>
</td>

<td valign=top>Jefferson
</td>

<td valign=top>70114
</td>

</tr>

<tr>

<td valign=top>8583
</td>

<td valign=top><a href=lic_details.asp?lic_number=8583>New
Screwdriver Technical Academy, Inc #4</a>
</td>

<td valign=top>Jefferson
</td>

<td valign=top>70114
</td>

</tr>

<tr>

<td valign=top>9371
</td>

<td valign=top><a href=lic_details.asp?lic_number=9371>Career
RegEx Center</a>
</td>

<td valign=top>Jefferson
</td>

<td valign=top>70113
</td>

</tr>"""

Hammer:

First remove line returns.
Then remove extra spaces.
Then insert a line return to restore logical rows on each </tr><tr>
combination. For more information, see: http://www.qc4blog.com/?p=55

<tr><td valign=top>14313</td><td valign=top><a href=lic_details.asp?lic_number=14313>Python Hammer Institute #2</a></td><td valign=top>Jefferson</td><td valign=top>70114</td></tr>
<tr><td valign=top>8583</td><td valign=top><a href=lic_details.asp?lic_number=8583>New Screwdriver Technical Academy, Inc #4</a></td><td valign=top>Jefferson</td><td valign=top>70114</td></tr>
<tr><td valign=top>9371</td><td valign=top><a href=lic_details.asp?lic_number=9371>Career RegEx Center</a></td><td valign=top>Jefferson</td><td valign=top>70113</td></tr>
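
The three clean-up steps could be sketched roughly as follows. This is one possible reading of them, not necessarily the code actually used; normalize_rows is a hypothetical helper name, and line returns are replaced by a space so that a name wrapped across two lines keeps its word break:

```python
import re

def normalize_rows(html):
    """Flatten wrapped table markup to one <tr>...</tr> per line."""
    # Step 1: replace line returns with a single space.
    flat = re.sub(r'[\r\n]+', ' ', html)
    # Step 2: collapse runs of whitespace, then drop whitespace that
    # sits directly before or after a tag.
    flat = re.sub(r'\s{2,}', ' ', flat)
    flat = re.sub(r'\s*(<[^>]+>)\s*', r'\1', flat)
    # Step 3: put each logical row back on its own line.
    return flat.replace('</tr><tr>', '</tr>\n<tr>').strip()

sample = """<tr>

<td valign=top>14313
</td>

<td valign=top><a href=lic_details.asp?lic_number=14313>Python
Hammer Institute #2</a>
</td>

</tr>

<tr>

<td valign=top>9371
</td>

<td valign=top><a href=lic_details.asp?lic_number=9371>Career
RegEx Center</a>
</td>

</tr>"""

rows = normalize_rows(sample).split('\n')
for row in rows:
    print(row)
```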

p = re.compile(r"(<tr><td valign=top>)(?P<zlicense>\d+)(</td>)(<td valign=top>)(<a href=lic_details\.asp)(\?lic_number=\d+)(>)(?P<zname>[\s\S\WA-Za-z0-9]*?)(</a>)(</td>)(?:<td valign=top>)(?P<zparish>[\s\WA-Za-z]+)(</td>)(<td valign=top>)(?P<zzip>\d+)(</td>)(</tr>)$", re.M)
n = p.sub(r'LICENSE:\g<zlicense>|NAME:\g<zname>|PARISH:\g<zparish>|ZIP:\g<zzip>', s)
print n

LICENSE:14313|NAME:Python Hammer Institute #2|PARISH:Jefferson|ZIP:70114
LICENSE:8583|NAME:New Screwdriver Technical Academy, Inc #4|PARISH:Jefferson|ZIP:70114
LICENSE:9371|NAME:Career RegEx Center|PARISH:Jefferson|ZIP:70113

The solution was to escape the period in the ".asp" string, e.g.,
"\.asp". I also had to limit the pattern in the <zname> grouping by
using a "?" qualifier to limit the "greediness" of the "*" pattern
metacharacter.
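
To illustrate the greediness point in isolation, here is a cut-down sketch. The shortened row markup and the href values x and y are placeholders invented for the example:

```python
import re

# Two anchors in one string, as happens when rows are not split by line.
row = ('<a href=x>New Horizon Technical Academy, Inc #4</a></td>'
       '<td>Jefferson</td><a href=y>Career Learning Center</a></td>')

# Greedy: + grabs as much as possible, so the match runs to the LAST </a>.
greedy = re.search(r'<a href=\w+>(?P<zname>[\s\S]+)</a>', row).group('zname')

# Non-greedy: +? stops at the FIRST </a> that lets the rest match.
lazy = re.search(r'<a href=\w+>(?P<zname>[\s\S]+?)</a>', row).group('zname')

print(greedy)
print(lazy)
```

The greedy version swallows the tags between the two names, which is the same symptom as in the output posted at the top of the thread.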

Now, who would like to turn that re.compile pattern into a MULTILINE
expression, combining the re.M and re.X flags?

Documentation says that one should be able to use the bitwise OR
operator (e.g., re.M | re.X), but I sure couldn't get it to work.
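
A possible verbose version, offered as a sketch rather than a drop-in replacement. One likely stumbling block (a guess, since the failing attempt isn't shown): under re.X, unescaped whitespace in the pattern is ignored and # starts a comment, so the literal space in "td valign=top" has to be written as [ ] or as a backslash-escaped space. The groups are also simplified here (.*? and [^<]+ instead of the original character classes), and the name ROW is made up:

```python
import re

ROW = re.compile(r"""
    ^<tr>
    <td[ ]valign=top>(?P<zlicense>\d+)</td>      # license number
    <td[ ]valign=top>
        <a[ ]href=lic_details\.asp\?lic_number=\d+>
        (?P<zname>.*?)                           # school name, non-greedy
        </a>
    </td>
    <td[ ]valign=top>(?P<zparish>[^<]+)</td>     # parish
    <td[ ]valign=top>(?P<zzip>\d+)</td>          # ZIP code
    </tr>$
""", re.M | re.X)  # flags are combined with bitwise OR

# Two rows, already normalized to one row per line.
rows = (
    "<tr><td valign=top>14313</td><td valign=top>"
    "<a href=lic_details.asp?lic_number=14313>Python Hammer Institute #2</a>"
    "</td><td valign=top>Jefferson</td><td valign=top>70114</td></tr>\n"
    "<tr><td valign=top>8583</td><td valign=top>"
    "<a href=lic_details.asp?lic_number=8583>New Screwdriver Technical Academy, Inc #4</a>"
    "</td><td valign=top>Jefferson</td><td valign=top>70114</td></tr>"
)

out = ROW.sub(
    r'LICENSE:\g<zlicense>|NAME:\g<zname>|PARISH:\g<zparish>|ZIP:\g<zzip>',
    rows,
)
print(out)
```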

Sometimes a hammer actually is the right tool if you hit the screw
long and hard enough.

I think I'll try to hit some more screws with my new hammer.

Good day.

greg · Oct 3, 2009

Brian said:
This isn't merely a question of knowing when to use the right
tool. It's a question about how to become a better developer using
regular expressions.

It could be said that if you want to learn how to use a
hammer, it's better to practise on nails rather than
screws.

504crank · Oct 3, 2009

It could be said that if you want to learn how to use a
hammer, it's better to practise on nails rather than
screws.

It could be said that the bandwidth in technical forums should be
reserved for on-topic exchanges, not flaming intelligent people who
might have something to contribute to the forum. The truth is, I found
a solution where others were ostensibly either too lazy to attempt, or
too eager grandstanding their superiority to assist. Who knows --
maybe I'll provide an alternative to BeautifulSoup one day.

Nobody · Oct 4, 2009

I'm kind of new to regular expressions

The most important thing to learn about regular expressions is to learn
what they can do, what they can't do, and what they can do in theory but
can't do in practice (usually because of exponential or combinatorial
growth).

One thing they can't do is to match any kind of construct which has
arbitrary nesting. E.g. you can't match any class of HTML element which
can self-nest or whose children can self-nest. In practice, this means you
can only match a handful of elements which are either empty (e.g. <img>)
or which can only contain CDATA (e.g. <script>, <style>).
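
A short illustration of the nesting problem, using made-up <div> snippets. Neither the lazy nor the greedy form of the same pattern gets both inputs right, because a plain pattern has no way to count open tags:

```python
import re

nested = '<div>outer <div>inner</div> tail</div>'
siblings = '<div>a</div><div>b</div>'

# Lazy: stops at the first </div>, which closes the INNER element.
lazy_nested = re.search(r'<div>(.*?)</div>', nested).group(1)

# Greedy: fine for the nested case...
greedy_nested = re.search(r'<div>(.*)</div>', nested).group(1)

# ...but it runs across element boundaries when elements are siblings.
greedy_siblings = re.search(r'<div>(.*)</div>', siblings).group(1)

print(lazy_nested)
print(greedy_nested)
print(greedy_siblings)
```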

You can match individual tags, although getting it right is quite hard;

What I'd like to do is extract data elements from HTML and structure
them so that they can more readily be imported into a database.

If you want to extract entire elements from arbitrary HTML, you have to
use a real parser which can handle recursion, e.g. a recursive-descent
parser or a push-down automaton.
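
For comparison, a minimal parser-based sketch using html.parser from the Python 3 standard library (a tolerant event-driven tokenizer rather than a full recursive-descent parser, but enough for flat table rows). The class name RowCollector is invented for the example, and it assumes tables are not nested:

```python
from html.parser import HTMLParser

class RowCollector(HTMLParser):
    """Collect the text of each <td>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # text pieces of the open <td>, or None

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attributes are ignored here.
        if tag == 'tr':
            self.rows.append([])
        elif tag == 'td':
            if not self.rows:       # tolerate a <td> with no <tr>
                self.rows.append([])
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == 'td' and self._cell is not None:
            self.rows[-1].append(''.join(self._cell).strip())
            self._cell = None

html = (
    "<tr><td valign=top>8583</td><td valign=top>"
    "<a href=lic_details.asp?lic_number=8583>New Horizon Technical Academy, Inc #4</a>"
    "</td><td valign=top>Jefferson</td><td valign=top>70114</td></tr>"
    "<TR><TD VALIGN='top'>9371</TD><td valign=\"top\">"
    "<a href=lic_details.asp?lic_number=9371>Career Learning Center</a>"
    "</td><td>Jefferson</td><td valign=top>70113</td></tr>"
)

collector = RowCollector()
collector.feed(html)
collector.close()

for row in collector.rows:
    print('|'.join(row))
```

The second row deliberately mixes case and quoting styles to show that the parser absorbs variations a fixed pattern would have to spell out.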

You can use regexps to match individual tags. If you only need to parse a
very specific subset of HTML (i.e. the pages are all generated from a
common template), you may even be able to match some entire elements using
regexps.

Stefan Behnel · Oct 5, 2009

No -- sorry -- I don't want to use BeautifulSoup (though I have for
other projects). Humor me, please -- I'd really like to see if this
can be done with just regular expressions.

I think the reason why people are giving funny comments here is that you
failed to provide a reason for the above requirement. That makes it sound
like a typical "How can I use X to do Y?" question.

http://www.catb.org/~esr/faqs/smart-questions.html#id383188

Stefan
