Beginner Q. interrogate html object OR file search?

Mark G · Dec 3, 2009

Hi all,

I am new to python and don't yet know the libraries well. What would
be the best way to approach this problem: I have a html file parsing
script - the file sits on my harddrive. I want to extract the date
modified from the meta-data. Should I read through lines of the file
doing a string.find to look for the character patterns of the meta-
tag, or should I use a DOM type library to retrieve the html element I
want? Which is best practice? which occupies least code?

Regards, Mark

inhahe · Dec 3, 2009

or i guess you could go the middle-way and just use regex.
people generally say don't use regex for html (regex can't do the
nesting), but it's what i would do in this case.
though i don't exactly understand the question, re the html file
parsing script you say you have already, or how the date is 'modified
from' the meta-data.
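For reference, the regex route could look something like the sketch below. It is only a sketch under assumptions: the sample markup is made up, the tag is taken to be a `<meta>` element with `name` before `content`, and both attributes are assumed to use double quotes. The names `SAMPLE`, `META_DATE` and `find_modified_date` are hypothetical.

```python
import re

# Hypothetical sample; the real file's markup may differ.
SAMPLE = '<html><head><meta name="DC.Date.Modified" content="2009-11-17" /></head></html>'

# Assumes name= comes before content= in the same tag, both double-quoted.
META_DATE = re.compile(
    r'<meta\s+name="DC\.Date\.Modified"\s+content="([^"]*)"',
    re.IGNORECASE,
)

def find_modified_date(html):
    """Return the content value of the DC.Date.Modified meta tag, or None."""
    match = META_DATE.search(html)
    return match.group(1) if match else None

print(find_modified_date(SAMPLE))
```

If the attribute order or quoting varies between files, the pattern has to grow to cope, which is roughly where the usual advice against regex for HTML starts to bite.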

Mark G · Dec 3, 2009

inhahe said:
or i guess you could go the middle-way and just use regex.
people generally say don't use regex for html (regex can't do the
nesting), but it's what i would do in this case.
though i don't exactly understand the question, re the html file
parsing script you say you have already, or how the date is 'modified
from' the meta-data.

I'm tempted to use regex too. I have done a bit of perl & bash, and
that is how I would do it with those.

However, I thought there would be a smarter way to do it with
libraries. I have done some digging through the libraries and think I
can do it with xml.dom.minidom. Here is what I want to try...

    import xml.dom.minidom

    # if html file already exists, inherit the created date
    # 'output' is the filename for the parsed file
    html_xml = xml.dom.minidom.parse(output)
    for node in html_xml.getElementsByTagName('meta'):  # visit every <meta /> node
        #debug print(node.toxml())
        metatag_type = node.getAttribute("name")  # "" if the attribute is missing
        if metatag_type == "DC.Date.Modified":
            date_created = node.getAttribute("content")
            print(date_created)

I haven't quite got up to here in my debugging. I'll let you know if
it works...

RE: your questions.
1) I already have the script in bash - wanting to convert to Python.
I'm halfway through.
2) I want to extract the value of the tag <metadata
name="DC.Date.Modified" value="2009-11-17">
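One caveat with xml.dom.minidom is that it is an XML parser, so it only accepts well-formed markup. The sketch below illustrates that with two made-up strings: one XHTML-style document with a self-closed `<meta />`, and one where the tag is left open as ordinary HTML allows. The element name `meta`, the `content` attribute, and the helper name `modified_date` are assumptions for illustration.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError

# Hypothetical inputs: the first is well-formed XHTML, the second leaves <meta> unclosed.
XHTML = '<html><head><meta name="DC.Date.Modified" content="2009-11-17" /></head></html>'
TAG_SOUP = '<html><head><meta name="DC.Date.Modified" content="2009-11-17"></head></html>'

def modified_date(markup):
    """Return the DC.Date.Modified content, or None if absent or unparseable."""
    try:
        doc = xml.dom.minidom.parseString(markup)
    except ExpatError:
        return None  # not well-formed XML, so minidom cannot build a tree
    for node in doc.getElementsByTagName("meta"):
        if node.getAttribute("name") == "DC.Date.Modified":
            return node.getAttribute("content")
    return None

print(modified_date(XHTML))     # the date string
print(modified_date(TAG_SOUP))  # None, because parsing fails
```

So the minidom approach is probably fine if the parsed files are always written out as well-formed XHTML, and fragile otherwise.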

r0g · Dec 3, 2009

Mark said:
Hi all,

I am new to python and don't yet know the libraries well. What would
be the best way to approach this problem: I have a html file parsing
script - the file sits on my harddrive. I want to extract the date
modified from the meta-data. Should I read through lines of the file
doing a string.find to look for the character patterns of the meta-
tag, or should I use a DOM type library to retrieve the html element I
want? Which is best practice? which occupies least code?

Regards, Mark

Beautiful Soup is the HTML parser of choice, partly because it handles
badly formed HTML well.

http://www.crummy.com/software/BeautifulSoup/

If the date info occurs at a consistent offset from the start of the tag
then you can use simple string slicing to snip out the date. If not
then, as others suggest, regex is your friend.
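The slicing idea might be sketched as follows. The sample line and the `MARKER` text are assumptions about how the tag is laid out in the file, and `slice_date` is a hypothetical helper name; the only fixed fact relied on is that a YYYY-MM-DD date is 10 characters long.

```python
# Hypothetical line from the file; the marker text is an assumption about its layout.
LINE = '<meta name="DC.Date.Modified" content="2009-11-17" />'
MARKER = 'name="DC.Date.Modified" content="'

def slice_date(line):
    """Return the 10 characters after MARKER, or None if MARKER is absent."""
    start = line.find(MARKER)
    if start == -1:
        return None
    start += len(MARKER)
    return line[start:start + 10]  # YYYY-MM-DD is always 10 characters

print(slice_date(LINE))
```

This only holds up while the tag is written exactly the same way every time, which may be a fair assumption when the same script also generates the file.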

If you need to convert a date/time string back into a Unix-style
timestamp, chop the string into bits, put them into a tuple of length 9
and give that to time.mktime()...

    import time

    def time_to_timestamp(t):
        # t looks like "2009/11/17 14:30:00"; slice out each field
        return time.mktime((int(t[0:4]), int(t[5:7]), int(t[8:10]),
            int(t[11:13]), int(t[14:16]), int(t[17:19]), 0, 0, 0))

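Note the last 3 values are hardcoded to 0. They are the weekday,
day-of-year and DST flag fields; mktime() ignores the first two, and a
DST flag of 0 means daylight saving time is not in effect. The slicing
assumes strings of the form YYYY/MM/DD hh:mm:ss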

Roger.
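As an alternative to slicing by hand, the standard library's time.strptime can parse the string from a format specification and hand the result straight to time.mktime. The sketch below assumes the two formats shown; the helper name `to_timestamp` and the sample strings are hypothetical.

```python
import time

def to_timestamp(text, fmt="%Y/%m/%d %H:%M:%S"):
    """Parse text with time.strptime and convert it to seconds since the epoch."""
    parsed = time.strptime(text, fmt)  # raises ValueError if text does not match fmt
    return time.mktime(parsed)         # interprets the fields as local time

print(to_timestamp("2009/11/17 14:30:00"))
print(to_timestamp("2009-11-17", "%Y-%m-%d"))
```

The second call covers a date-only string such as the one in the meta tag, which the fixed-offset slicing would reject because the hour, minute and second slices come back empty.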

Steven D'Aprano · Dec 3, 2009

Mark said:
Hi all,

I am new to python and don't yet know the libraries well. What would be
the best way to approach this problem: I have a html file parsing script
- the file sits on my harddrive. I want to extract the date modified
from the meta-data. Should I read through lines of the file doing a
string.find to look for the character patterns of the meta-tag,

That will probably be the fastest, simplest, and easiest to develop. But
the downside is that it will be subject to false positives, if some tag
happens to include text which by chance looks like your meta-data. So,
strictly speaking, this approach is incorrect.

Mark said:
or should I use a DOM type library to retrieve the html element I want?
Which is best practice?

"Best practice" would imply DOM.

As for which you use, you need to weigh up the risks of a false positive
versus the convenience and speed of string matching versus the
correctness of a DOM approach.
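To make the false-positive risk concrete, here is a small sketch comparing a plain string search with a real parser. It uses html.parser from the Python 3 standard library (the module was named HTMLParser in Python 2) in place of a full DOM library; the sample page, the `meta`/`content` markup, and the names `MetaDateFinder`, `parsed_date` and `naive_date` are all assumptions for illustration.

```python
from html.parser import HTMLParser

# Hypothetical page: the first "meta tag" sits inside a comment and should be ignored.
PAGE = """<html><head>
<!-- <meta name="DC.Date.Modified" content="1999-01-01"> -->
<meta name="DC.Date.Modified" content="2009-11-17">
</head><body></body></html>"""

class MetaDateFinder(HTMLParser):
    """Record the content of the first real DC.Date.Modified meta tag."""

    def __init__(self):
        super().__init__()
        self.date = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and attributes.get("name") == "DC.Date.Modified":
            if self.date is None:
                self.date = attributes.get("content")

def parsed_date(html):
    """Feed the page through the parser and return the date it recorded."""
    finder = MetaDateFinder()
    finder.feed(html)
    finder.close()
    return finder.date

def naive_date(html):
    """Plain string search: takes whatever follows the first matching text."""
    marker = 'name="DC.Date.Modified" content="'
    start = html.find(marker)
    if start == -1:
        return None
    start += len(marker)
    return html[start:html.find('"', start)]

print(naive_date(PAGE))   # picks up the commented-out date
print(parsed_date(PAGE))  # picks up the live tag
```

The string search returns the stale date from the comment, while the parser skips comments and reports the live tag. Whether that risk matters depends on how much control there is over the files being read.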

Mark said:
which occupies least code?

Unless you're writing for an embedded system, or the difference is
vast (e.g. 300 lines versus 30), that's premature optimization.

Personally, I'd use string matching or a regex, and feel guilty about it.
