identifying and parsing string in text file

Bryan.Fodness · Mar 8, 2008

I have a large file that has many lines like this,

<element tag="300a,0014" vr="CS" vm="1" len="4"
name="DoseReferenceStructureType">SITE</element>

I would like to identify the line by the tag (300a,0014) and then grab
the name (DoseReferenceStructureType) and value (SITE).

I would like to create a file that would have the structure,

DoseReferenceStructureType = Site
...
...

Also, there is a possibility that there are multiple lines with the
same tag, but different values. These all need to be recorded.

So far, I have a little bit of code to look at everything that is
available,

for line in open(str(sys.argv[1])):
i_line = line.split()
if i_line:
if i_line[0] == "<element":
a = i_line[1]
b = i_line[5]
print "%s | %s" %(a, b)

but do not see a clever way of doing what I would like.

Any help or guidance would be appreciated.

Bryan

Bernard · Mar 8, 2008

Hey Brian,

It seems the text you are trying to parse is similar to XML/HTML.
So I'd use BeautifulSoup[1] if I were you

here's a sample code for your scraping case:

from BeautifulSoup import BeautifulSoup

<python>

# assume the s variable has your text
s = "whatever xml or html here"
# turn it into a tasty & parsable soup

soup = BeautifulSoup(s)
# for every element tag in the soup
for el in soup.findAll("element"):
# print out its tag & name attribute plus its inner value!
print el["tag"], el["name"], el.string

</python>

that's it!

[1] http://www.crummy.com/software/BeautifulSoup/

Nemesis · Mar 8, 2008

I have a large file that has many lines like this,

<element tag="300a,0014" vr="CS" vm="1" len="4"
name="DoseReferenceStructureType">SITE</element>

I would like to identify the line by the tag (300a,0014) and then grab
the name (DoseReferenceStructureType) and value (SITE).

I would like to create a file that would have the structure,

DoseReferenceStructureType = Site
...
...

You should try with Regular Expressions or if it is something like xml there
is for sure a library you can you to parse it ...
anyway you can try something simpler like this:

elem_dic=dict()
for line in open(str(sys.argv[1])):
line_splitted=line.split()
for item in line_splitted:
item_splitted=item.split("=")
if len(item_splitted)>1:
elem_dic[item_splitted[0]]=item_splitted[1]

.... then you have to retrieve from the dict the items you need, for example,
with the line you posted you obtain these items splitted:

['<element']
['tag', '"300a,0014"']
['vr', '"CS"']
['vm', '"1"']
['len', '"4"']
['name', '"DoseReferenceStructureType">SITE</element>']

and elem_dic will contain the last five, with the keys
'tag','vr','vm','len','name' and teh values 300a,0014 etc etc
i.e. this:

{'vr': '"CS"', 'tag': '"300a,0014"', 'vm': '"1"', 'len': '"4"', 'name': '"DoseReferenceStructureType">SITE</element>'}

Paul McGuire · Mar 8, 2008

You should try with Regular Expressions or if it is something like xml there
is for sure a library you can you to parse it ...

<snip>

When it comes to parsing HTML or XML of uncontrolled origin, regular
expressions are an iffy proposition. You'd be amazed what kind of
junk shows up inside an XML (or worse, HTML) tag.

Pyparsing includes a builtin method for constructing tag matching
parsing patterns, which you can then use to scan through the XML or
HTML source:

from pyparsing import makeXMLTags, withAttribute, SkipTo

testdata = """
<blah>
<element tag="300a,0014" vr="CS" vm="1" len="4"
name="DoseReferenceStructureType">SITE</element>
<element tag="300Z,0019" vr="CS" vm="1" len="4"
name="DoseReferenceStructureType">SITEXXX</element>
<element tag="300a,0014" vr="CS" vm="1" len="4"
name="DoseReferenceStructureType">SITE2</element>
<blahblah>
"""

elementStart,elementEnd = makeXMLTags("element")
elementStart.setParseAction(withAttribute(tag="300a,0014"))
search = elementStart + SkipTo(elementEnd)("body")

for t in search.searchString(testdata):
print t.name
print t.body

Prints:

DoseReferenceStructureType
SITE
DoseReferenceStructureType
SITE2

In this case, the parse action withAttribute filters <element> tag
matches, accepting *only* those with the attribute "tag" and the value
"300a,0014". The pattern search adds on the body of the <element></
element> tag, and gives it the name "body" so it is easily accessed
after parsing is completed.

-- Paul
(More about pyparsing at http://pyparsing.wikispaces.com.)

bruno.desthuilliers · Mar 9, 2008

I have a large file that has many lines like this,

<element tag="300a,0014" vr="CS" vm="1" len="4"
name="DoseReferenceStructureType">SITE</element>

I would like to identify the line by the tag (300a,0014) and then grab
the name (DoseReferenceStructureType) and value (SITE).

It's obviously an XML file, so use a XML parser - there are SAX and
DOM parsers in the stdlib, as well as the ElementTree module.

Php combine identical lines in text file	4	Oct 11, 2023
Rearranging .ply file via C++ String Parsing	0	Dec 14, 2019
File content in descending order	0	Nov 8, 2022
Sort and count word pairs in a string	6	Jan 29, 2023
parsley parsing question	0	Jun 2, 2014
Help needed with nested parsing of file into objects	12	Jun 4, 2012
parsing text from a file	1	Jan 29, 2009
parsing tab and newline delimited text	6	Aug 4, 2010

identifying and parsing string in text file

Bryan.Fodness

Bernard

Nemesis

Paul McGuire

bruno.desthuilliers

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads