Beginner Q. interrogate html object OR file search?

Mark G

Hi all,

I am new to Python and don't yet know the libraries well. What would
be the best way to approach this problem: I have an HTML file parsing
script - the file sits on my hard drive. I want to extract the date
modified from the meta-data. Should I read through the lines of the file
doing a string.find to look for the character patterns of the meta-tag,
or should I use a DOM-type library to retrieve the HTML element I
want? Which is best practice? Which occupies the least code?

Regards, Mark
 
inhahe

Or I guess you could go the middle way and just use a regex.
People generally say don't use regex for HTML (regexes can't handle
nesting), but it's what I would do in this case.
Though I don't exactly understand the question, re: the HTML file
parsing script you say you already have, or how the date is 'modified
from' the meta-data.
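
For illustration, a minimal sketch of the regex route. It assumes the tag has the form <meta name="DC.Date.Modified" content="...">, with name before content and double-quoted values; the sample string and the function name are made up for the example.

```python
import re

# Hypothetical sample input; the real file may order or quote attributes differently.
SAMPLE = '<head><meta name="DC.Date.Modified" content="2009-11-17" /></head>'

# Assumes name precedes content and both values are double-quoted.
META_DATE = re.compile(
    r'<meta\s+name="DC\.Date\.Modified"\s+content="([^"]*)"',
    re.IGNORECASE,
)


def find_modified_date(html):
    """Return the content value of the DC.Date.Modified meta tag, or None."""
    match = META_DATE.search(html)
    return match.group(1) if match else None


print(find_modified_date(SAMPLE))  # prints 2009-11-17
```

A pattern this strict misses tags whose attributes come in a different order, which is part of why regex is usually discouraged for HTML.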
 
Mark G

inhahe said:
> Or I guess you could go the middle way and just use a regex.
> People generally say don't use regex for HTML (regexes can't handle
> nesting), but it's what I would do in this case.
> Though I don't exactly understand the question, re: the HTML file
> parsing script you say you already have, or how the date is 'modified
> from' the meta-data.

I'm tempted to use regex too. I have done a bit of Perl and bash, and
that is how I would do it with those.

However, I thought there would be a smarter way to do it with
libraries. I have done some digging through the libraries and think I
can do it with xml.dom.minidom. Here is what I want to try...

    import xml.dom.minidom

    # if html file already exists, inherit the created date
    # 'output' is the filename for the parsed file
    html_xml = xml.dom.minidom.parse(output)
    for node in html_xml.getElementsByTagName('meta'):  # visit every <meta /> node
        # debug: print(node.toxml())
        # getAttribute() returns '' when the attribute is missing
        if node.getAttribute("name") == "DC.Date.Modified":
            date_created = node.getAttribute("content")
            print(date_created)

I haven't quite got up to here in my debugging. I'll let you know if
it works...

Re: your questions. 1) I already have the script in bash and want to
convert it to Python :) I'm halfway through.
2) I want to extract the value of the tag <metadata
name="DC.Date.Modified" value="2009-11-17">
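
One thing worth checking with the minidom route: it is an XML parser, so it only accepts well-formed input. The sketch below assumes the tag is written <meta name="..." content="..."> as in the script above; both sample strings and the function name are invented for the example.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError


def modified_date_from_xml(text):
    """Return the DC.Date.Modified content value, or None if absent or unparseable."""
    try:
        doc = xml.dom.minidom.parseString(text)
    except ExpatError:
        return None  # input is not well-formed XML, e.g. an unclosed <meta>
    for node in doc.getElementsByTagName("meta"):
        if node.getAttribute("name") == "DC.Date.Modified":
            return node.getAttribute("content")
    return None


# Hypothetical inputs: the first is well-formed XHTML, the second leaves <meta> unclosed.
XHTML = '<html><head><meta name="DC.Date.Modified" content="2009-11-17" /></head></html>'
PLAIN_HTML = '<html><head><meta name="DC.Date.Modified" content="2009-11-17"></head></html>'

print(modified_date_from_xml(XHTML))       # prints 2009-11-17
print(modified_date_from_xml(PLAIN_HTML))  # prints None
```

If the files are ordinary HTML rather than XHTML, a more forgiving parser is likely to be the safer choice.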
 
r0g

Mark said:
> Hi all,
>
> I am new to Python and don't yet know the libraries well. What would
> be the best way to approach this problem: I have an HTML file parsing
> script - the file sits on my hard drive. I want to extract the date
> modified from the meta-data. Should I read through the lines of the file
> doing a string.find to look for the character patterns of the meta-tag,
> or should I use a DOM-type library to retrieve the HTML element I
> want? Which is best practice? Which occupies the least code?
>
> Regards, Mark


Beautiful Soup is the HTML parser of choice, partly because it handles
badly formed HTML well.

http://www.crummy.com/software/BeautifulSoup/
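
Beautiful Soup is a third-party install, so as a stand-in here is a rough sketch of the same idea using only the standard library's html.parser, which also copes with unclosed tags. This is not Beautiful Soup's API; the class name, function name and sample markup are invented for the example.

```python
from html.parser import HTMLParser


class MetaDateParser(HTMLParser):
    """Remember the content of the first <meta name="DC.Date.Modified"> seen."""

    def __init__(self):
        super().__init__()
        self.modified = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        attributes = dict(attrs)
        if tag == "meta" and attributes.get("name") == "DC.Date.Modified":
            if self.modified is None:
                self.modified = attributes.get("content")


def modified_date(html):
    parser = MetaDateParser()
    parser.feed(html)
    parser.close()
    return parser.modified


# Hypothetical sloppy input: uppercase tag, unclosed <meta>, <p> and <head>.
MESSY = ('<html><head><META NAME="DC.Date.Modified" CONTENT="2009-11-17">'
         '<body><p>unclosed paragraph<br>')

print(modified_date(MESSY))  # prints 2009-11-17
```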


If the date info occurs at a consistent offset from the start of the tag
then you can use simple string slicing to snip out the date. If not
then, as others suggest, regex is your friend.
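
A small sketch of the slicing idea, assuming the date is always a 10-character YYYY-MM-DD value sitting directly after a fixed marker string. The marker, sample and function name are assumptions made for the example.

```python
# Assumed fixed text that immediately precedes the date inside the tag.
MARKER = 'name="DC.Date.Modified" content="'

# Hypothetical sample input.
SAMPLE = '<meta name="DC.Date.Modified" content="2009-11-17" />'


def slice_modified_date(html):
    """Return the 10 characters following MARKER, or None if MARKER is absent."""
    start = html.find(MARKER)
    if start == -1:
        return None
    start += len(MARKER)
    return html[start:start + 10]  # 'YYYY-MM-DD' is 10 characters long


print(slice_modified_date(SAMPLE))  # prints 2009-11-17
```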

If you need to convert a date/time string back into a Unix-style
timestamp, chop the string into bits, put them into a tuple of length 9
and give that to time.mktime()...

    import time

    def time_to_timestamp(t):
        # expects fixed-width fields, e.g. 'YYYY-MM-DD hh:mm:ss'
        return time.mktime((int(t[0:4]), int(t[5:7]), int(t[8:10]),
                            int(t[11:13]), int(t[14:16]), int(t[17:19]), 0, 0, 0))

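Note the last 3 values are hardcoded to 0. They are the weekday, day of
year and DST flag fields, not sub-second information; mktime() ignores
the first two, and a DST flag of 0 means the time is read as standard
(non-DST) local time (pass -1 to let the library decide). The date/time
strings I deal with only encode YYYY/MM/DD hh:mm:ss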
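
As an alternative to slicing by hand, time.strptime() can build the 9-field struct directly. A minimal sketch, assuming a date-only string such as the 2009-11-17 value mentioned earlier in the thread; the function name and default format are made up for the example.

```python
import time


def date_to_timestamp(text, fmt="%Y-%m-%d"):
    """Parse text with fmt and return a Unix timestamp (interpreted as local time)."""
    parsed = time.strptime(text, fmt)  # struct_time; missing fields default to 0
    return time.mktime(parsed)


stamp = date_to_timestamp("2009-11-17")
# Round-trip through localtime to check the date survived the conversion.
print(time.strftime("%Y-%m-%d", time.localtime(stamp)))  # prints 2009-11-17
```

The numeric timestamp itself depends on the machine's time zone, so only the round-tripped date is shown.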


Roger.
 
Steven D'Aprano

> Hi all,
>
> I am new to Python and don't yet know the libraries well. What would be
> the best way to approach this problem: I have an HTML file parsing script
> - the file sits on my hard drive. I want to extract the date modified
> from the meta-data. Should I read through the lines of the file doing a
> string.find to look for the character patterns of the meta-tag,

That will probably be the fastest, simplest, and easiest to develop. But
the downside is that it will be subject to false positives, if some tag
happens to include text which by chance looks like your meta-data. So,
strictly speaking, this approach is incorrect.
> or should I use a DOM-type library to retrieve the HTML element I want?
> Which is best practice?

"Best practice" would imply DOM.

As for which you use, you need to weigh up the risks of a false positive
versus the convenience and speed of string matching versus the
correctness of a DOM approach.
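
To make the false-positive risk concrete, here is a small sketch where a commented-out tag fools plain string matching but not a parser. The page text, marker and function names are invented for the example, and the parser is the standard library's html.parser rather than a full DOM.

```python
from html.parser import HTMLParser

# Hypothetical page: an old, commented-out tag precedes the live one.
PAGE = (
    '<head>\n'
    '<!-- <meta name="DC.Date.Modified" content="2001-01-01"> -->\n'
    '<meta name="DC.Date.Modified" content="2009-11-17">\n'
    '</head>'
)
MARKER = 'name="DC.Date.Modified" content="'


def by_string_match(html):
    """Return the value after the first MARKER occurrence, comment or not."""
    start = html.find(MARKER)
    if start == -1:
        return None
    start += len(MARKER)
    return html[start:html.find('"', start)]


class _MetaFinder(HTMLParser):
    found = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and attributes.get("name") == "DC.Date.Modified":
            self.found = self.found or attributes.get("content")


def by_parser(html):
    """Return the value from the first real <meta> tag; comments are skipped."""
    finder = _MetaFinder()
    finder.feed(html)
    return finder.found


print(by_string_match(PAGE))  # prints 2001-01-01 (the commented-out value)
print(by_parser(PAGE))        # prints 2009-11-17
```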

> Which occupies the least code?

Unless you're writing for an embedded system, or the difference is
vast (e.g. 300 lines versus 30), that's premature optimization.

Personally, I'd use string matching or a regex, and feel guilty about it.
 
