Python fast HTML data extraction library


Filip

Hello,

Some time ago I was searching for a library that would simplify mass
data scraping/extraction from web pages. A Python XPath
implementation seemed like the way to go. The problem is that most of
the HTML on the net doesn't conform to XML standards, not even the
XHTML pages (including those advertised as valid XHTML).

I tried to fix that with BeautifulSoup plus regexp filtering of
particular broken cases I encountered. That was slow, and after
running my data scraper for a while, new problems (exceptions from
the XPath parser) kept showing up. Not to mention that BeautifulSoup
stripped almost all of the content from some heavily broken pages (a
50+ KiB page reduced to a few hundred bytes). Character encoding
conversion was hell too: even UTF-8 pages contained non-standard
characters that caused issues.


Cutting to the chase: that's when I decided to take the matter into
my own hands, and I hacked together a solution with a completely new
approach overnight. It's called htxpath, a small, lightweight (and
dependency-free) Python library that lets you extract specific tag(s)
from an HTML document using a path string whose syntax is very
similar to XPath (but is more convenient in some cases). It has done
a very good job for me.

My library, rather than parsing the whole input into a tree,
processes it as a character stream with regular expressions.
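
To illustrate the general idea, here is a minimal sketch of that
technique (mine, for illustration only; not htxpath's actual code or
API): scan the document with re.finditer and track nesting depth to
pull out matching elements.

import re

# Sketch of regex-based stream extraction (not htxpath's API).
# Collects the inner HTML of every <name> element without building
# a tree.
TAG_RE = re.compile(r'<(/?)(\w+)[^>]*>', re.UNICODE)

def extract(html, name):
    results, start, depth = [], None, 0
    for m in TAG_RE.finditer(html):
        closing, tag = m.group(1), m.group(2).lower()
        if tag != name:
            continue
        if not closing:
            if depth == 0:
                start = m.end()
            depth += 1
        elif depth > 0:
            depth -= 1
            if depth == 0:
                results.append(html[start:m.start()])
    return results

print(extract('<html><title>Hello</title></html>', 'title'))  # ['Hello']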

I decided to share it with everyone, so here it is:
http://code.google.com/p/htxpath/
I am aware that it is not beautifully coded, as my experience with
Python is rather brief, but I am curious whether it will be useful to
anyone (it's also my first potentially [real-world ;)] useful project
to go public). If it is, I promise to keep developing it. It's
probably full of bugs, but I can't catch them all by myself.

regards,
Filip Sobalski
 

Paul McGuire

My library, rather than parsing the whole input into a tree,
processes it as a character stream with regular expressions.

Filip -

In general, parsing HTML with re's is fraught with easily-overlooked
deviations from the norm. But since you have stepped up to the task,
here are some comments on your re's:

# You should use raw string literals throughout, as in:
# blah_re = re.compile(r'sljdflsflds')
# (note the leading r before the string literal). Raw string literals
# really help keep your re expressions clean, so that you don't ever
# have to double up any '\' characters.

# Attributes might be enclosed in single quotes, or not enclosed in
# any quotes at all.
attr_re = re.compile('([\da-z]+?)\s*=\s*\"(.*?)\"',
                     re.DOTALL | re.UNICODE | re.IGNORECASE)

# Needs re.IGNORECASE, and can have tag attributes, such as
# <BR CLEAR="ALL">
line_break_re = re.compile('<br\/?>', re.UNICODE)

# What about HTML entities defined using hex syntax, such as &#xA0;?
amp_re = re.compile('\&(?![a-z]+?\;)', re.UNICODE | re.IGNORECASE)

How would you extract data from a table? For instance, how would you
extract the data entries from the table at this URL:
http://tf.nist.gov/tf-cgi/servers.cgi ? This would be a good example
snippet for your module documentation.
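
For the record, here is roughly what that looks like with plain re's
(a sketch of the general approach, not htxpath; the table data below
is made up for illustration): pull the rows, then the cells.

import re

# Hypothetical sketch: extract table cells row by row.
row_re = re.compile(r'<tr[^>]*>(.*?)</tr>', re.DOTALL | re.IGNORECASE)
cell_re = re.compile(r'<t[dh][^>]*>(.*?)</t[dh]>',
                     re.DOTALL | re.IGNORECASE)
strip_tags_re = re.compile(r'<[^>]+>')

def table_rows(html):
    for row in row_re.finditer(html):
        yield [strip_tags_re.sub('', c.group(1)).strip()
               for c in cell_re.finditer(row.group(1))]

html = ('<table><tr><th>Name</th><th>IP</th></tr>'
        '<tr><td>server-1</td><td>192.0.2.1</td></tr></table>')
for cells in table_rows(html):
    print(cells)
# ['Name', 'IP']
# ['server-1', '192.0.2.1']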

Try extracting all of the <a href=...>sldjlsfjd</a> links from
yahoo.com, and see how much of what you expect actually gets matched.
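
To make the pitfall concrete, here is a sketch of mine (not from
htxpath) contrasting the naive pattern people usually start with
against a more forgiving one that tolerates attribute order,
single/double quotes, and unquoted values:

import re

# A deliberately naive pattern; real-world pages defeat it quickly
# (attribute order, single quotes, unquoted values, extra attributes).
naive_link_re = re.compile(r'<a href="([^"]*)">(.*?)</a>',
                           re.IGNORECASE | re.DOTALL)

# A more forgiving variant: any attribute order and quoting style.
forgiving_link_re = re.compile(
    r'<a\b[^>]*\bhref\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s>]+))',
    re.IGNORECASE | re.DOTALL)

html = '<a class=nav href=/index.html>home</a> <a href="/faq">FAQ</a>'
print(naive_link_re.findall(html))   # [('/faq', 'FAQ')] -- misses one
for m in forgiving_link_re.finditer(html):
    print(m.group(1) or m.group(2) or m.group(3))
# /index.html
# /faq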

Good luck!

-- Paul
 

Aahz

I tried to fix that with BeautifulSoup plus regexp filtering of
particular broken cases I encountered. That was slow, and after
running my data scraper for a while, new problems (exceptions from
the XPath parser) kept showing up. Not to mention that BeautifulSoup
stripped almost all of the content from some heavily broken pages (a
50+ KiB page reduced to a few hundred bytes). Character encoding
conversion was hell too: even UTF-8 pages contained non-standard
characters that caused issues.

Have you tried lxml?
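
For anyone who hasn't: lxml.html copes with badly broken markup and
gives you real XPath. A minimal example:

import lxml.html

# lxml's HTML parser (libxml2) is fast and tolerant of broken markup.
doc = lxml.html.fromstring('<p>broken <b>page <a href="/x">link</p>')
print(doc.xpath('//a/@href'))   # ['/x']
print(doc.text_content())       # broken page link
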
--
Aahz ([email protected]) <*> http://www.pythoncraft.com/

"At Resolver we've found it useful to short-circuit any doubt and just
refer to comments in code as 'lies'. :)"
--Michael Foord paraphrases Christian Muirhead on python-dev, 2009-03-22
 

John Machin

On Jul 22, 5:43 pm, Filip <[email protected]> wrote:
# Needs re.IGNORECASE, and can have tag attributes, such as
# <BR CLEAR="ALL">
line_break_re = re.compile('<br\/?>', re.UNICODE)

Just in case somebody actually uses valid XHTML :) it might be a good
idea to allow for whitespace before the '/', as in <br />.
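
For instance, a sketch combining that with the re.IGNORECASE Paul
suggested (again, not htxpath's actual code):

import re

# Accepts <br>, <br/>, <br />, and <BR CLEAR="ALL">.
line_break_re = re.compile(r'<br(\s[^>]*)?/?\s*>',
                           re.UNICODE | re.IGNORECASE)
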
# What about HTML entities defined using hex syntax, such as &#xA0;?
amp_re = re.compile('\&(?![a-z]+?\;)', re.UNICODE | re.IGNORECASE)

What about the decimal-syntax ones? E.g. not only &nbsp; and &#xA0;
but also &#160;.

Also, entity names can contain digits, e.g. &sup1; and &frac34;.
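
Putting those points together, a sketch of an amp_re that leaves
named entities (digits included), decimal references, and hex
references alone (not necessarily what htxpath should adopt as-is):

import re

# Leaves named entities (&frac34;), decimal (&#160;) and hex (&#xA0;)
# references alone; rewrites only bare ampersands.
amp_re = re.compile(
    r'&(?!(?:[a-zA-Z][a-zA-Z0-9]*|#[0-9]+|#[xX][0-9a-fA-F]+);)')

print(amp_re.sub('&amp;', 'A & B, &nbsp;, &frac34;, &#160;, &#xA0;'))
# A &amp; B, &nbsp;, &frac34;, &#160;, &#xA0;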
 

Filip

# You should use raw string literals throughout, as in:
# blah_re = re.compile(r'sljdflsflds')
# (note the leading r before the string literal). Raw string literals
# really help keep your re expressions clean, so that you don't ever
# have to double up any '\' characters.

Thanks, I didn't know about that; I updated my code.

# Attributes might be enclosed in single quotes, or not enclosed in
# any quotes at all.
attr_re = re.compile('([\da-z]+?)\s*=\s*\"(.*?)\"',
                     re.DOTALL | re.UNICODE | re.IGNORECASE)

Of course, you mean the attribute's *value* can be enclosed in
single/double quotes?
Truth be told, I haven't seen the single-quote variant in HTML
lately, but I checked and it is indeed in the specs, and it can even
be quite useful (one learns something new every day).
Thank you for pointing that out; I updated the code accordingly
(and just realized that the condition-check REs need an update
too :/).

As far as the lack of value quoting is concerned, I am not so sure I
need it: it would significantly obfuscate my REs, and the practice is
rather deprecated and considered unsafe; I've only seen it on very
old websites.
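
For reference, a pattern covering all three quoting forms might look
like this sketch (whether the extra complexity is worth it is exactly
the trade-off described above; this is not htxpath's code):

import re

# Attribute values: "double-quoted", 'single-quoted', or unquoted.
attr_re = re.compile(
    r'([\w-]+)\s*=\s*(?:"([^"]*)"|\'([^\']*)\'|([^\s>]+))',
    re.DOTALL | re.UNICODE)

tag = '<input type=text name="q" value=\'hi\'>'
for m in attr_re.finditer(tag):
    print(m.group(1), '=', m.group(2) or m.group(3) or m.group(4))
# type = text
# name = q
# value = hi
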
How would you extract data from a table? For instance, how would you
extract the data entries from the table at this URL:
http://tf.nist.gov/tf-cgi/servers.cgi ? This would be a good example
snippet for your module documentation.

This really seems like a nice example. I'll be sure to explain it in
my docs (examples are certainly needed there ;)).
Try extracting all of the <a href=...>sldjlsfjd</a> links from
yahoo.com, and see how much of what you expect actually gets matched.

The library was used in my humble production environment, processing
a few hundred thousand pages and spitting out about 10,000 SQL
records, so it copes quite well with a simple task like extracting
all links. However, I can't really say that the task introduced
enough diversity (there were only 9 different page templates) to call
the library 'tested'...

On Jul 22, 5:43 pm, Filip <[email protected]> wrote:
# Needs re.IGNORECASE, and can have tag attributes, such as
# <BR CLEAR="ALL">
line_break_re = re.compile('<br\/?>', re.UNICODE)

Just in case somebody actually uses valid XHTML :) it might be a good
idea to allow for whitespace before the '/', as in <br />.

# What about HTML entities defined using hex syntax, such as &#xA0;?
amp_re = re.compile('\&(?![a-z]+?\;)', re.UNICODE | re.IGNORECASE)

What about the decimal-syntax ones? E.g. not only &nbsp; and &#xA0;
but also &#160;.

Also, entity names can contain digits, e.g. &sup1; and &frac34;.

Thanks for pointing this out; I fixed it. Although it has very little
impact on how the library performs its main task (I'd like to see
some comments on that ;)).
 
