Newbie ? -- SGML metadata extraction

Discussion in 'Python' started by ProvoWallis, Jan 16, 2006.

  1. ProvoWallis

    ProvoWallis Guest

    Hi,

    I'm trying to write a script that will extract the value of an
    attribute from an element using the attribute value of another element
    as the basis for extraction.

    For example, I have a pre-defined list of main sections, and I want to
    extract the id attribute of each form element and build a dictionary of
    graphic ID and section number pairs, but only for the sections on my
    list. The id values from any section that is not on the list should be
    excluded. I.e., I want to know the id value for the forms that appear
    in sections 1 and 3 but not in 2.

    Boiled down my SGML looks something like this:

    <main-section no="1">

    <form id="graphic_1.tif">
    <form id="graphic_2.tif">

    <main-section no="2">

    <form id="graphic_3.tif">

    <main-section no="3">

    <form id="graphic_4.tif">
    <form id="graphic_5.tif">
    <form id="graphic_6.tif">

    This is what I have come up with on my own so far. My problem is that I
    can't seem to pick up the value of the id attribute.

    Any advice appreciated.

    Greg

    ###

    import os, re, csv

    root = raw_input("Enter the path where the program should run: ")
    fname = raw_input("Enter name of the CSV file containing the section numbers: ")
    sgmlname = raw_input("Enter name of the SGML file to search: ")
    print

    given,ext = os.path.splitext(fname)
    root_name = os.path.join(root,fname)
    n = given + '.new'
    outputName = os.path.join(root,n)

    reader = csv.reader(open(root_name, 'r'), delimiter=',')

    sections = []

    for row in reader:
        sections.append(row[0])


    inputFile = open(os.path.join(root,sgmlname), 'r')

    illoList ={}

    while 1:
        lines = inputFile.readlines()
        if not lines:
            break
        for line in lines:

            main = re.search(r'(?i)(?m)(?s)<main-section no=\"(\w+)\"', line)
            id = re.search(r'(?i)id=\"(.*?tif)\"', line)
            if main is not None and main.group(1) in sections:

                if id is not None:

                    illoList[id.group(1)] = main.group(1)
    ProvoWallis, Jan 16, 2006
    #1
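
A note on why the script above never picks up an id: each line is tested for both a `<main-section>` tag and an `id` attribute, but in the sample those sit on different lines, so the two matches never coincide. One way around that is to remember the most recent section number while scanning. The following is a minimal sketch, assuming Python 3 (the thread's code is Python 2) and that each tag fits on a single line. The function name `map_graphics_to_sections` and its `wanted` parameter are hypothetical, introduced only for illustration.

```python
import re

# Patterns modelled on the ones in the post; assumes one tag per line.
SECTION_RE = re.compile(r'<main-section\s+no="(\w+)"', re.IGNORECASE)
FORM_RE = re.compile(r'<form\s+id="([^"]*?\.tif)"', re.IGNORECASE)


def map_graphics_to_sections(lines, wanted):
    """Return {graphic id: section number} for forms in the wanted sections."""
    wanted = set(wanted)
    current = None  # number of the most recent <main-section> seen
    result = {}
    for line in lines:
        section = SECTION_RE.search(line)
        if section:
            current = section.group(1)
        # Record every form id on this line if its section is wanted.
        for form_id in FORM_RE.findall(line):
            if current in wanted:
                result[form_id] = current
    return result


sample = """<main-section no="1">
<form id="graphic_1.tif">
<form id="graphic_2.tif">
<main-section no="2">
<form id="graphic_3.tif">
<main-section no="3">
<form id="graphic_4.tif">
""".splitlines()

print(map_graphics_to_sections(sample, ["1", "3"]))
# {'graphic_1.tif': '1', 'graphic_2.tif': '1', 'graphic_4.tif': '3'}
```

The section numbers are compared as strings, which matches what `csv.reader` would return for the first column of the section list.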

  2. ProvoWallis

    Adonis Guest

    ProvoWallis wrote:

    <snip>

    From what I gather, here is a quick example. There are probably better
    solutions on the way, but I think this accomplishes the idea.

    Some helpful links:
    http://docs.python.org/lib/module-sgmllib.html
    http://docs.python.org/lib/module-HTMLParser.html
    http://docs.python.org/lib/module-htmllib.html

    ---

    from HTMLParser import HTMLParser

    data = """<main-section no="1">

    <form id="graphic_1.tif">
    <form id="graphic_2.tif">

    <main-section no="2">

    <form id="graphic_3.tif">

    <main-section no="3">

    <form id="graphic_4.tif">
    <form id="graphic_5.tif">
    <form id="graphic_6.tif">
    """

    class ParseForms(HTMLParser):

        def handle_starttag(self, tag, attrs):
            if tag == "form":
                # attrs argument is a list of tuples [(attribute, value)]
                # converted it to a dictionary to access attribute easier
                print "form id: %s" % dict(attrs).get('id')

    if __name__ == "__main__":
        parser = ParseForms()
        parser.feed(data)
    Adonis, Jan 17, 2006
    #2
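
For readers on Python 3, where the `HTMLParser` module lives at `html.parser`, the same idea might look like the sketch below. It collects the ids in a list instead of printing them. The class name `FormIdCollector` and its `form_ids` attribute are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class FormIdCollector(HTMLParser):
    """Collect the id attribute of every <form> start tag."""

    def __init__(self):
        super().__init__()
        self.form_ids = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples; a dict is easier to query.
        if tag == "form":
            form_id = dict(attrs).get("id")
            if form_id is not None:
                self.form_ids.append(form_id)


collector = FormIdCollector()
collector.feed('<main-section no="1">\n'
               '<form id="graphic_1.tif">\n'
               '<form id="graphic_2.tif">\n')
collector.close()
print(collector.form_ids)
# ['graphic_1.tif', 'graphic_2.tif']
```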

  3. ProvoWallis

    ProvoWallis Guest

    Thanks. One more question, though.

    I'm not sure how to limit the scope of my search so that I'm just
    extracting the id attribute from the sections that I want. I.e., I want
    the id attributes from the forms in sections 1 and 3 but not from 2.

    Maybe I'm missing something.
    ProvoWallis, Jan 17, 2006
    #3
  4. ProvoWallis

    Adonis Guest

    ProvoWallis wrote:
    > Thanks. One more question, though.
    >
    > I'm not sure how to limit the scope of my search so that I'm just
    > extracting the id attribute from the sections that I want. I.e., I want
    > the id attributes from the forms in sections 1 and 3 but not from 2.
    >
    > Maybe I'm missing something.
    >


    If the data has closing tags, this is easily achieved using a DOM or SAX
    parser, but here is a slightly modified version, very ugly but simple.

    Hope this helps.

    Adonis

    ---

    from HTMLParser import HTMLParser

    data = """<main-section no="1">

    <form id="graphic_1.tif">
    <form id="graphic_2.tif">

    <main-section no="2">

    <form id="graphic_3.tif">

    <main-section no="3">

    <form id="graphic_4.tif">
    <form id="graphic_5.tif">
    <form id="graphic_6.tif">
    """

    class ParseForms(HTMLParser):

        _section = None
        _secDict = dict()

        def getSection(self, key):
            return self._secDict.get(str(key))

        def handle_starttag(self, tag, attrs):
            if tag == "form":
                if not self._secDict.has_key(self._section):
                    self._secDict[self._section] = [dict(attrs).get('id')]
                else:
                    self._secDict[self._section].append(dict(attrs).get('id'))

            if tag == "main-section":
                self._section = dict(attrs).get('no')

    if __name__ == "__main__":
        parser = ParseForms()
        parser.feed(data)
        print parser.getSection(1)
        print parser.getSection(3)
    Adonis, Jan 17, 2006
    #4
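
To get from per-section lists to the dictionary of graphic ID and section number pairs asked for in the first post, one option is to hand the wanted sections to the parser and keep the state on the instance rather than on the class (a class-level dict such as `_secDict` is shared by every `ParseForms` instance). The following is a sketch under those assumptions, written for Python 3's `html.parser`. The class name `SectionFormParser` and its `wanted` argument are hypothetical.

```python
from html.parser import HTMLParser


class SectionFormParser(HTMLParser):
    """Map form ids to section numbers, limited to the wanted sections."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)  # section numbers, as strings
        self.current = None        # section of the last <main-section> tag
        self.graphics = {}         # graphic id -> section number

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "main-section":
            self.current = attrs.get("no")
        elif tag == "form" and self.current in self.wanted:
            form_id = attrs.get("id")
            if form_id is not None:
                self.graphics[form_id] = self.current


data = """<main-section no="1">
<form id="graphic_1.tif">
<form id="graphic_2.tif">
<main-section no="2">
<form id="graphic_3.tif">
<main-section no="3">
<form id="graphic_4.tif">
<form id="graphic_5.tif">
<form id="graphic_6.tif">
"""

parser = SectionFormParser(["1", "3"])
parser.feed(data)
parser.close()
print(parser.graphics)
```

With the sample data this leaves out `graphic_3.tif`, since section 2 is not in the wanted list.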
  5. ProvoWallis

    ProvoWallis Guest

    Thanks very much for your help. It's greatly appreciated.

    It took a couple of tries to see what was happening but I've figured
    it out.

    Greg
    ProvoWallis, Jan 18, 2006
    #5