identifying and parsing string in text file


Bryan.Fodness

I have a large file that has many lines like this,

<element tag="300a,0014" vr="CS" vm="1" len="4"
name="DoseReferenceStructureType">SITE</element>

I would like to identify the line by the tag (300a,0014) and then grab
the name (DoseReferenceStructureType) and value (SITE).

I would like to create a file that would have the structure,

DoseReferenceStructureType = Site
...
...

Also, there is a possibility that there are multiple lines with the
same tag, but different values. These all need to be recorded.

So far, I have a little bit of code to look at everything that is
available,

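import sys

for line in open(str(sys.argv[1])):
    i_line = line.split()
    if i_line:
        if i_line[0] == "<element":
            # indexing by position assumes the whole element sits on one line
            a = i_line[1]  # tag="..."
            b = i_line[5]  # name="...">value</element>
            print("%s | %s" % (a, b))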

but do not see a clever way of doing what I would like.

Any help or guidance would be appreciated.

Bryan
 

Bernard

Hey Bryan,

It seems the text you are trying to parse is similar to XML/HTML.
So I'd use BeautifulSoup[1] if I were you :)

Here's some sample code for your scraping case:

<python>

# BeautifulSoup 3 import; the current bs4 package uses
# "from bs4 import BeautifulSoup" instead
from BeautifulSoup import BeautifulSoup

# assume the s variable has your text
s = "whatever xml or html here"
# turn it into a tasty & parsable soup :)
soup = BeautifulSoup(s)
# for every element tag in the soup
for el in soup.findAll("element"):
    # print out its tag & name attribute plus its inner value!
    print("%s %s %s" % (el["tag"], el["name"], el.string))

</python>

that's it!

[1] http://www.crummy.com/software/BeautifulSoup/
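
If installing BeautifulSoup is not an option, roughly the same lenient traversal can be sketched with the standard library's html.parser module. This is a swapped-in technique, not BeautifulSoup itself: it assumes Python 3, the class name ElementCollector is made up for illustration, and the sample text is borrowed from elsewhere in this thread.

```python
from html.parser import HTMLParser


class ElementCollector(HTMLParser):
    """Collect (tag, name, value) triples for every <element> in the input."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._attrs = None  # attributes of the <element> currently open, if any
        self._text = []     # text chunks seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "element":
            self._attrs = dict(attrs)
            self._text = []

    def handle_data(self, data):
        # only keep text that appears inside an open <element>
        if self._attrs is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "element" and self._attrs is not None:
            value = "".join(self._text).strip()
            self.rows.append(
                (self._attrs.get("tag"), self._attrs.get("name"), value)
            )
            self._attrs = None


sample = """
<blah>
<element tag="300a,0014" vr="CS" vm="1" len="4"
name="DoseReferenceStructureType">SITE</element>
<element tag="300Z,0019" vr="CS" vm="1" len="4"
name="DoseReferenceStructureType">SITEXXX</element>
<element tag="300a,0014" vr="CS" vm="1" len="4"
name="DoseReferenceStructureType">SITE2</element>
<blahblah>
"""

parser = ElementCollector()
parser.feed(sample)
parser.close()
for tag, name, value in parser.rows:
    print("%s %s %s" % (tag, name, value))
```

Filtering on one tag is then a plain comparison on the first item of each row, for example row[0] == "300a,0014".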
 

Nemesis

I have a large file that has many lines like this,

<element tag="300a,0014" vr="CS" vm="1" len="4"
name="DoseReferenceStructureType">SITE</element>

I would like to identify the line by the tag (300a,0014) and then grab
the name (DoseReferenceStructureType) and value (SITE).

I would like to create a file that would have the structure,

DoseReferenceStructureType = Site
...
...

You should try regular expressions, or if it is something like XML there
is for sure a library you can use to parse it ...
Anyway, you can try something simpler like this:

import sys

elem_dic = dict()
for line in open(str(sys.argv[1])):
    line_splitted = line.split()
    for item in line_splitted:
        # split each whitespace-separated token into key and value
        item_splitted = item.split("=")
        if len(item_splitted) > 1:
            elem_dic[item_splitted[0]] = item_splitted[1]

Then you have to retrieve the items you need from the dict. For example,
the line you posted splits into these items:

['<element']
['tag', '"300a,0014"']
['vr', '"CS"']
['vm', '"1"']
['len', '"4"']
['name', '"DoseReferenceStructureType">SITE</element>']

and elem_dic will contain the last five, with the keys
'tag', 'vr', 'vm', 'len', 'name' and the values "300a,0014" etc.
(still wrapped in their double quotes), i.e. this:

{'vr': '"CS"', 'tag': '"300a,0014"', 'vm': '"1"', 'len': '"4"', 'name': '"DoseReferenceStructureType">SITE</element>'}
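
Because elem_dic is overwritten on every line, only the last element survives, so the repeated tags mentioned in the question would be lost. A minimal regular-expression sketch that keeps every match might look like the following. It is written for Python 3, and the names ELEMENT_RE and find_values are made up for illustration. It assumes that tag is the first attribute, that name is the last one, and that the value contains no nested markup; input that does not follow this layout will simply not match.

```python
import re

# One <element ...>value</element>. [^>]* and \s also match newlines, so an
# element whose attributes wrap onto a second line is still found.
ELEMENT_RE = re.compile(
    r'<element\s+tag="(?P<tag>[^"]*)"[^>]*?\sname="(?P<name>[^"]*)"\s*>'
    r'(?P<value>[^<]*)</element>'
)


def find_values(text, wanted_tag):
    """Return (name, value) pairs for every element whose tag equals wanted_tag."""
    return [
        (m.group("name"), m.group("value").strip())
        for m in ELEMENT_RE.finditer(text)
        if m.group("tag") == wanted_tag
    ]


sample = '''<element tag="300a,0014" vr="CS" vm="1" len="4"
name="DoseReferenceStructureType">SITE</element>
<element tag="300Z,0019" vr="CS" vm="1" len="4"
name="DoseReferenceStructureType">SITEXXX</element>
<element tag="300a,0014" vr="CS" vm="1" len="4"
name="DoseReferenceStructureType">SITE2</element>
'''

pairs = find_values(sample, "300a,0014")
for name, value in pairs:
    print("%s = %s" % (name, value))
```

Returning a list of pairs rather than a dict is deliberate: a dict keyed on the name would again keep only one value per name.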
 

Paul McGuire

You should try regular expressions, or if it is something like XML there
is for sure a library you can use to parse it ...
<snip>

When it comes to parsing HTML or XML of uncontrolled origin, regular
expressions are an iffy proposition. You'd be amazed what kind of
junk shows up inside an XML (or worse, HTML) tag.

Pyparsing includes a builtin method for constructing tag matching
parsing patterns, which you can then use to scan through the XML or
HTML source:

from pyparsing import makeXMLTags, withAttribute, SkipTo

testdata = """
<blah>
<element tag="300a,0014" vr="CS" vm="1" len="4"
name="DoseReferenceStructureType">SITE</element>
<element tag="300Z,0019" vr="CS" vm="1" len="4"
name="DoseReferenceStructureType">SITEXXX</element>
<element tag="300a,0014" vr="CS" vm="1" len="4"
name="DoseReferenceStructureType">SITE2</element>
<blahblah>
"""

elementStart,elementEnd = makeXMLTags("element")
elementStart.setParseAction(withAttribute(tag="300a,0014"))
search = elementStart + SkipTo(elementEnd)("body")

for t in search.searchString(testdata):
    print(t.name)
    print(t.body)

Prints:

DoseReferenceStructureType
SITE
DoseReferenceStructureType
SITE2

In this case, the parse action withAttribute filters <element> tag
matches, accepting *only* those with the attribute "tag" and the value
"300a,0014". The pattern search adds on the body of the
<element></element> tag, and gives it the name "body" so it is easily
accessed after parsing is completed.

-- Paul
(More about pyparsing at http://pyparsing.wikispaces.com.)
 

bruno.desthuilliers

I have a large file that has many lines like this,

<element tag="300a,0014" vr="CS" vm="1" len="4"
name="DoseReferenceStructureType">SITE</element>

I would like to identify the line by the tag (300a,0014) and then grab
the name (DoseReferenceStructureType) and value (SITE).

It's obviously an XML file, so use an XML parser - there are SAX and
DOM parsers in the stdlib, as well as the ElementTree module.
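
A minimal ElementTree sketch of this suggestion, assuming Python 3 and a well-formed file with a single root element. The thread only shows individual <element> lines, so the <root> wrapper below is an assumption, and the function name name_value_lines is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Sample data from this thread, wrapped in an assumed <root> element.
SAMPLE = """<root>
<element tag="300a,0014" vr="CS" vm="1" len="4"
name="DoseReferenceStructureType">SITE</element>
<element tag="300Z,0019" vr="CS" vm="1" len="4"
name="DoseReferenceStructureType">SITEXXX</element>
<element tag="300a,0014" vr="CS" vm="1" len="4"
name="DoseReferenceStructureType">SITE2</element>
</root>"""


def name_value_lines(xml_text, wanted_tag):
    """Return 'name = value' strings for every <element> with the wanted tag attribute."""
    root = ET.fromstring(xml_text)
    lines = []
    # iter() walks the whole tree, so nesting depth does not matter
    for el in root.iter("element"):
        if el.get("tag") == wanted_tag:
            value = (el.text or "").strip()
            lines.append("%s = %s" % (el.get("name"), value))
    return lines


lines = name_value_lines(SAMPLE, "300a,0014")
print("\n".join(lines))
```

For a file on disk, ET.parse(path).getroot() would replace ET.fromstring, and the resulting lines can be written out with an ordinary open(..., "w") call. Every matching element produces its own line, so repeated tags with different values are all recorded.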
 
