Newbie ? -- SGML metadata extraction

ProvoWallis · Jan 16, 2006

Hi,

I'm trying to write a script that will extract the value of an
attribute from an element using the attribute value of another element
as the basis for extraction.

For example, I have a pre-defined list of main sections. I want to
extract the id attribute of each form element and build a dictionary of
graphic ID and section number pairs, but only for the sections on my
list; the id values from any other section should be left out. I.e., I
want to know the id value for the forms that appear in sections 1 and 3
but not in 2.

Boiled down my SGML looks something like this:

<main-section no="1">

<form id="graphic_1.tif">
<form id="graphic_2.tif">

<main-section no="2">

<form id="graphic_3.tif">

<main-section no="3">

<form id="graphic_4.tif">
<form id="graphic_5.tif">
<form id="graphic_6.tif">

This is what I have come up with on my own so far. My problem is that I
can't seem to pick up the value of the id attribute.

Any advice appreciated.

Greg

###

import os, re, csv

root = raw_input("Enter the path where the program should run: ")
fname = raw_input("Enter name of the CSV file containing the section numbers: ")
sgmlname = raw_input("Enter name of the SGML file to search: ")
print

given, ext = os.path.splitext(fname)
root_name = os.path.join(root, fname)
n = given + '.new'
outputName = os.path.join(root, n)

# Read the wanted section numbers from the first column of the CSV file
reader = csv.reader(open(root_name, 'r'), delimiter=',')

sections = []

for row in reader:
    sections.append(row[0])

inputFile = open(os.path.join(root, sgmlname), 'r')

# Maps graphic id -> section number
illoList = {}

while 1:
    lines = inputFile.readlines()
    if not lines:
        break
    for line in lines:

        main = re.search(r'(?i)(?m)(?s)<main-section no=\"(\w+)\"', line)
        id = re.search(r'(?i)id=\"(.*?tif)\"', line)
        if main is not None and main.group(1) in sections:

            if id is not None:

                illoList[id.group(1)] = main.group(1)
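
One likely reason no id is ever recorded: in the sample, the main-section
and form tags sit on different lines, so the two searches never both
succeed for the same line. Below is a minimal sketch of the same regex idea
that remembers the most recent section instead. It is written for Python 3,
and SAMPLE, wanted and collect_ids are illustrative names, with wanted
standing in for the list read from the CSV file.

```python
import re

SAMPLE = """<main-section no="1">

<form id="graphic_1.tif">
<form id="graphic_2.tif">

<main-section no="2">

<form id="graphic_3.tif">

<main-section no="3">

<form id="graphic_4.tif">
<form id="graphic_5.tif">
<form id="graphic_6.tif">
"""

SECTION_RE = re.compile(r'<main-section\s+no="(\w+)"', re.IGNORECASE)
FORM_RE = re.compile(r'<form\s+id="([^"]*?\.tif)"', re.IGNORECASE)


def collect_ids(lines, wanted):
    """Map each graphic id to its section number, for wanted sections only."""
    current = None  # number of the most recent <main-section> seen so far
    result = {}
    for line in lines:
        section = SECTION_RE.search(line)
        if section:
            current = section.group(1)
        form = FORM_RE.search(line)
        # Record the form only while inside one of the wanted sections
        if form and current in wanted:
            result[form.group(1)] = current
    return result


# 'wanted' is a stand-in for the section numbers read from the CSV file
illo_list = collect_ids(SAMPLE.splitlines(), wanted={"1", "3"})
for graphic, section in illo_list.items():
    print(graphic, section)
```

The section number is kept as a string here because that is how both the
regex and the csv module deliver it.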

Adonis · Jan 17, 2006

ProvoWallis wrote:

<snip>

From what I gather, here is a quick solution. There are probably better
ones on the way, but I think this accomplishes the idea.

Some helpful links:
http://docs.python.org/lib/module-sgmllib.html
http://docs.python.org/lib/module-HTMLParser.html
http://docs.python.org/lib/module-htmllib.html

---

from HTMLParser import HTMLParser

data = """<main-section no="1">

<form id="graphic_1.tif">
<form id="graphic_2.tif">

<main-section no="2">

<form id="graphic_3.tif">

<main-section no="3">

<form id="graphic_4.tif">
<form id="graphic_5.tif">
<form id="graphic_6.tif">
"""

class ParseForms(HTMLParser):

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            # attrs argument is a list of tuples [(attribute, value)]
            # converted it to a dictionary to access attribute easier
            print "form id: %s" % dict(attrs).get('id')

if __name__ == "__main__":
    parser = ParseForms()
    parser.feed(data)
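
For readers on Python 3, where the module lives at html.parser and print is
a function, roughly the same idea can be sketched as follows. Collecting the
ids in a list instead of printing them is a choice made here so the result
can be reused; FormIdCollector and form_ids are illustrative names, not part
of the library.

```python
from html.parser import HTMLParser

data = """<main-section no="1">

<form id="graphic_1.tif">
<form id="graphic_2.tif">

<main-section no="2">

<form id="graphic_3.tif">

<main-section no="3">

<form id="graphic_4.tif">
<form id="graphic_5.tif">
<form id="graphic_6.tif">
"""


class FormIdCollector(HTMLParser):
    """Collect the id attribute of every <form> start tag, in document order."""

    def __init__(self):
        super().__init__()
        self.form_ids = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            # attrs is a list of (name, value) tuples; dict() makes lookup easy
            self.form_ids.append(dict(attrs).get("id"))


parser = FormIdCollector()
parser.feed(data)
parser.close()
print(parser.form_ids)
```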

ProvoWallis · Jan 17, 2006

Thanks. One more question, though.

I'm not sure how to limit the scope of my search so that I'm just
extracting the id attribute from the sections that I want. I.e., I want
the id attributes from the forms in sections 1 and 3 but not from 2.

Maybe I'm missing something.

Adonis · Jan 17, 2006

ProvoWallis said:
Thanks. One more question, though.

I'm not sure how to limit the scope of my search so that I'm just
extracting the id attribute from the sections that I want. I.e., I want
the id attributes from the forms in sections 1 and 3 but not from 2.

Maybe I'm missing something.

If the data has closing tags, this is easily achieved using a DOM or SAX
parser. Otherwise, here is a slightly modified version: very ugly, but
simple.

Hope this helps.

Adonis

---

from HTMLParser import HTMLParser

data = """<main-section no="1">

<form id="graphic_1.tif">
<form id="graphic_2.tif">

<main-section no="2">

<form id="graphic_3.tif">

<main-section no="3">

<form id="graphic_4.tif">
<form id="graphic_5.tif">
<form id="graphic_6.tif">
"""

class ParseForms(HTMLParser):

    _section = None
    _secDict = dict()

    def getSection(self, key):
        return self._secDict.get(str(key))

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            if not self._secDict.has_key(self._section):
                self._secDict[self._section] = [dict(attrs).get('id')]
            else:
                self._secDict[self._section].append(dict(attrs).get('id'))

        if tag == "main-section":
            self._section = dict(attrs).get('no')

if __name__ == "__main__":
    parser = ParseForms()
    parser.feed(data)
    print parser.getSection(1)
    print parser.getSection(3)
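
To get the mapping asked for in the original post (graphic id to section
number, restricted to a given list of sections), the same state-tracking
idea can be pushed a little further. This is only a sketch for Python 3:
SectionFormParser, wanted and ids_by_graphic are illustrative names, the
wanted list stands in for the one read from the CSV file, and the state is
kept on the instance so that separate parsers do not share results.

```python
from html.parser import HTMLParser

DATA = """<main-section no="1">

<form id="graphic_1.tif">
<form id="graphic_2.tif">

<main-section no="2">

<form id="graphic_3.tif">

<main-section no="3">

<form id="graphic_4.tif">
<form id="graphic_5.tif">
<form id="graphic_6.tif">
"""


class SectionFormParser(HTMLParser):
    """Map form ids to section numbers, keeping only the wanted sections."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)   # section numbers to keep, as strings
        self.section = None         # most recent <main-section no="..."> seen
        self.ids_by_graphic = {}    # graphic id -> section number

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "main-section":
            self.section = attrs.get("no")
        elif tag == "form" and self.section in self.wanted:
            self.ids_by_graphic[attrs.get("id")] = self.section


parser = SectionFormParser(wanted=["1", "3"])
parser.feed(DATA)
parser.close()
print(parser.ids_by_graphic)
```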

ProvoWallis · Jan 18, 2006

Thanks very much for your help. It's greatly appreciated.

It took a couple of tries to see what was happening, but I've figured
it out.

Greg
