Beginner Q. interrogate html object OR file search?

Discussion in 'Python' started by Mark G, Dec 3, 2009.

  1. Mark G

    Mark G Guest

    Hi all,

    I am new to Python and don't yet know the libraries well. What would
    be the best way to approach this problem: I have an HTML file parsing
    script, and the file sits on my hard drive. I want to extract the
    date modified from the metadata. Should I read through the lines of
    the file doing a string.find to look for the character patterns of the
    meta tag, or should I use a DOM-type library to retrieve the HTML
    element I want? Which is best practice? Which takes the least code?

    Regards, Mark
    Mark G, Dec 3, 2009
    #1

  2. inhahe

    inhahe Guest

    or i guess you could go the middle-way and just use regex.
    people generally say don't use regex for html (regex can't do the
    nesting), but it's what i would do in this case.
    though i don't exactly understand the question, re the html file
    parsing script you say you have already, or how the date is 'modified
    from' the meta-data.
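
    A minimal sketch of the regex route, assuming the tag looks like
    <meta name="DC.Date.Modified" content="2009-11-17"> with name= written
    before content=. The tag shape, the META_DATE pattern, and the
    find_modified_date helper are illustrative assumptions, not something
    confirmed in this thread.

```python
import re

# Assumed tag shape (not confirmed by the thread):
#   <meta name="DC.Date.Modified" content="2009-11-17">
# The pattern requires name= to come before content= inside the same tag.
META_DATE = re.compile(
    r'<meta\s+name="DC\.Date\.Modified"\s+content="([^"]*)"',
    re.IGNORECASE,
)

def find_modified_date(html_text):
    """Return the content value of the first matching meta tag, or None."""
    match = META_DATE.search(html_text)
    return match.group(1) if match else None

sample = '<html><head><meta name="DC.Date.Modified" content="2009-11-17"></head></html>'
print(find_modified_date(sample))
```

    If the attributes can appear in a different order, or the tag sits
    inside a comment, a pattern like this will miss it or match the wrong
    thing, which is the usual argument against regex for HTML.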

    inhahe, Dec 3, 2009
    #2

  3. Mark G

    Mark G Guest

    I'm tempted to use regex too. I have done a bit of perl & bash, and
    that is how I would do it with those.

    However, I thought there would be a smarter way to do it with
    libraries. I have done some digging through the libraries and think I
    can do it with xml.dom.minidom. Here is what I want to try...

    import xml.dom.minidom

    # if html file already exists, inherit the created date
    # 'output' is the filename for the parsed file
    # (minidom is an XML parser, so the file must be well-formed XHTML)
    html_xml = xml.dom.minidom.parse(output)
    for node in html_xml.getElementsByTagName('meta'):  # visit every <meta /> node
        # debug: print(node.toxml())
        # getAttribute() returns '' when the attribute is missing
        if node.getAttribute("name") == "DC.Date.Modified":
            date_created = node.getAttribute("content")
            print(date_created)

    I haven't quite got up to here in my debugging. I'll let you know if
    it works...

    RE: your questions. 1) I already have the script in bash and want to
    convert it to Python :) I'm halfway through.
    2) I want to extract the value of the tag <metadata
    name="DC.Date.Modified" value="2009-11-17">
    Mark G, Dec 3, 2009
    #3
  4. r0g

    r0g Guest

    Beautiful Soup is the HTML parser of choice, partly because it handles
    badly formed HTML well.

    http://www.crummy.com/software/BeautifulSoup/


    If the date info occurs at a consistent offset from the start of the tag
    then you can use simple string slicing to snip out the date. If not
    then, as others suggest, regex is your friend.
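
    The slicing idea can be sketched roughly as follows, assuming the tag
    text up to the date is always exactly the same and the date is always
    10 characters wide. PREFIX, DATE_LENGTH, and slice_modified_date are
    hypothetical names chosen for illustration.

```python
# Assumed tag layout (hypothetical; adjust PREFIX to match the real file).
PREFIX = '<meta name="DC.Date.Modified" content="'
DATE_LENGTH = len("YYYY-MM-DD")  # 10 characters

def slice_modified_date(text):
    """Return the fixed-width date that follows PREFIX, or None if PREFIX is absent."""
    start = text.find(PREFIX)
    if start == -1:
        return None
    begin = start + len(PREFIX)
    return text[begin:begin + DATE_LENGTH]

sample = '<head><meta name="DC.Date.Modified" content="2009-11-17"></head>'
print(slice_modified_date(sample))
```

    This only holds up while the offset really is consistent; any change in
    spacing, quoting, or attribute order breaks it.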

    If you need to convert a date/time string back into a Unix-style
    timestamp, chop the string into bits, put them into a tuple of length 9
    and give that to time.mktime()...

    import time

    def time_to_timestamp(t):
        # t looks like "YYYY/MM/DD hh:mm:ss"
        return time.mktime((int(t[0:4]), int(t[5:7]), int(t[8:10]),
                            int(t[11:13]), int(t[14:16]), int(t[17:19]), 0, 0, 0))

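    Note the last 3 values are hardcoded to 0. They are the weekday,
    day-of-year and DST-flag fields of the time tuple, none of which the
    date/time strings I deal with encode; those only carry
    YYYY/MM/DD h:m:s. mktime() ignores the first two, and you can pass -1
    as the DST flag to let the library work out daylight saving itself.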
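
    As an alternative to slicing by hand, the same conversion can be
    sketched with time.strptime, which parses the string against a format
    and produces the 9-field tuple itself. The function name and the
    default format string are assumptions for illustration.

```python
import time

def time_to_timestamp_strptime(t, fmt="%Y/%m/%d %H:%M:%S"):
    """Parse t with time.strptime and convert the result with time.mktime."""
    # strptime fills in weekday and day-of-year, and leaves the DST flag
    # at -1 so that mktime decides whether daylight saving applies.
    return time.mktime(time.strptime(t, fmt))

stamp = time_to_timestamp_strptime("2009/12/03 16:19:00")
print(stamp)
```

    For a date-only string such as 2009-11-17, the format would be
    "%Y-%m-%d". The printed value depends on the local time zone.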


    Roger.
    r0g, Dec 3, 2009
    #4
  5. Steven D'Aprano

    Steven D'Aprano Guest

    On Wed, 02 Dec 2009 19:24:07 -0800, Mark G wrote:

    > Hi all,
    >
    > I am new to python and don't yet know the libraries well. What would be
    > the best way to approach this problem: I have a html file parsing script
    > - the file sits on my harddrive. I want to extract the date modified
    > from the meta-data. Should I read through lines of the file doing a
    > string.find to look for the character patterns of the meta-tag,


    That will probably be the fastest, simplest, and easiest to develop. But
    the downside is that it will be subject to false positives, if some tag
    happens to include text which by chance looks like your meta-data. So,
    strictly speaking, this approach is incorrect.

    > or
    > should I use a DOM type library to retrieve the html element I want?
    > Which is best practice?


    "Best practice" would imply DOM.

    As for which you use, you need to weigh up the risks of a false positive
    versus the convenience and speed of string matching versus the
    correctness of a DOM approach.
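
    To make the trade-off concrete, here is a rough sketch of a
    parser-based middle ground. It is not a full DOM: it swaps in the
    event-based html.parser module from the Python 3 standard library
    (the module was named HTMLParser in Python 2). The tag shape, the
    MetaDateParser class, and the extract_modified_date helper are
    assumptions for illustration. Because the parser only reports real
    start tags, a look-alike tag inside a comment is not a false positive.

```python
from html.parser import HTMLParser  # Python 3 standard library

class MetaDateParser(HTMLParser):
    """Collect the content attribute of <meta name="DC.Date.Modified">."""

    def __init__(self):
        super().__init__()
        self.date_modified = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and attribute names
        # arrive lowercased, attribute values keep their original case.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if attributes.get("name") == "DC.Date.Modified" and self.date_modified is None:
            self.date_modified = attributes.get("content")

def extract_modified_date(html_text):
    """Return the first DC.Date.Modified value found in real markup, or None."""
    parser = MetaDateParser()
    parser.feed(html_text)
    parser.close()
    return parser.date_modified

# The commented-out tag would fool a plain string search, but not the parser.
sample = (
    '<html><head>'
    '<!-- <meta name="DC.Date.Modified" content="1999-01-01"> -->'
    '<meta name="DC.Date.Modified" content="2009-11-17">'
    '</head><body></body></html>'
)
print(extract_modified_date(sample))
```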


    > which occupies least code?


    Unless you're writing for an embedded system, or the difference is
    vast (e.g. 300 lines versus 30), that's premature optimization.

    Personally, I'd use string matching or a regex, and feel guilty about it.



    --
    Steven
    Steven D'Aprano, Dec 3, 2009
    #5