identifying and parsing string in text file

Discussion in 'Python' started by Bryan.Fodness@gmail.com, Mar 8, 2008.

  1. Guest

    I have a large file that has many lines like this,

    <element tag="300a,0014" vr="CS" vm="1" len="4"
    name="DoseReferenceStructureType">SITE</element>

    I would like to identify the line by the tag (300a,0014) and then grab
    the name (DoseReferenceStructureType) and value (SITE).

    I would like to create a file that would have the structure,

    DoseReferenceStructureType = Site
    ...
    ...

    Also, there is a possibility that there are multiple lines with the
    same tag, but different values. These all need to be recorded.

    So far, I have a little bit of code to look at everything that is
    available,

    import sys

    for line in open(sys.argv[1]):
        i_line = line.split()
        if i_line:
            if i_line[0] == "<element":
                a = i_line[1]
                b = i_line[5]
                print("%s | %s" % (a, b))

    but do not see a clever way of doing what I would like.

    Any help or guidance would be appreciated.

    Bryan
    , Mar 8, 2008
    #1

  2. Bernard Guest

    Hey Bryan,

    It seems the text you are trying to parse is similar to XML/HTML.
    So I'd use BeautifulSoup[1] if I were you :)

    Here's some sample code for your scraping case:

    <python>

    from BeautifulSoup import BeautifulSoup

    # assume the s variable has your text
    s = "whatever xml or html here"
    # turn it into a tasty & parsable soup :)
    soup = BeautifulSoup(s)
    # for every element tag in the soup
    for el in soup.findAll("element"):
        # print out its tag & name attribute plus its inner value!
        print el["tag"], el["name"], el.string

    </python>

    that's it!

    [1] http://www.crummy.com/software/BeautifulSoup/

    Bernard, Mar 8, 2008
    #2

  3. Nemesis Guest

    wrote:

    > I have a large file that has many lines like this,
    >
    > <element tag="300a,0014" vr="CS" vm="1" len="4"
    > name="DoseReferenceStructureType">SITE</element>
    >
    > I would like to identify the line by the tag (300a,0014) and then grab
    > the name (DoseReferenceStructureType) and value (SITE).
    >
    > I would like to create a file that would have the structure,
    >
    > DoseReferenceStructureType = Site
    > ...
    > ...


    You should try regular expressions, or, if it is something like XML, there
    is surely a library you can use to parse it.
    Anyway, you can try something simpler like this:

    import sys

    elem_dic = dict()
    for line in open(sys.argv[1]):
        line_splitted = line.split()
        for item in line_splitted:
            item_splitted = item.split("=")
            if len(item_splitted) > 1:
                elem_dic[item_splitted[0]] = item_splitted[1]

    Then you have to retrieve the items you need from the dict. For example,
    with the line you posted, the split produces these items:

    ['<element']
    ['tag', '"300a,0014"']
    ['vr', '"CS"']
    ['vm', '"1"']
    ['len', '"4"']
    ['name', '"DoseReferenceStructureType">SITE</element>']

    and elem_dic will contain the last five, with the keys
    'tag', 'vr', 'vm', 'len', 'name' and the values '"300a,0014"' etc.
    (the quotes and trailing markup are kept, so they still need stripping),
    i.e. this:

    {'vr': '"CS"', 'tag': '"300a,0014"', 'vm': '"1"', 'len': '"4"', 'name': '"DoseReferenceStructureType">SITE</element>'}
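
    As a rough sketch of the regular-expression route mentioned above: the
    pattern below assumes every <element> sits on a single line, with the
    "tag" attribute first and "name" somewhere after it, as in the posted
    sample. ELEMENT_RE and extract() are made-up names for illustration,
    and a layout that differs from the sample will not match.

```python
import re

# Assumption: one <element> per line, "tag" attribute first, "name" later on.
ELEMENT_RE = re.compile(
    r'<element\s+tag="(?P<tag>[^"]*)"[^>]*\bname="(?P<name>[^"]*)"[^>]*>'
    r'(?P<value>[^<]*)</element>'
)

def extract(lines, wanted_tag):
    """Return (name, value) pairs for every line whose tag equals wanted_tag."""
    found = []
    for line in lines:
        m = ELEMENT_RE.search(line)
        if m and m.group("tag") == wanted_tag:
            found.append((m.group("name"), m.group("value")))
    return found

# Sample lines in the same shape as the one posted in this thread.
sample = [
    '<element tag="300a,0014" vr="CS" vm="1" len="4" name="DoseReferenceStructureType">SITE</element>',
    '<element tag="300Z,0019" vr="CS" vm="1" len="4" name="DoseReferenceStructureType">SITEXXX</element>',
    '<element tag="300a,0014" vr="CS" vm="1" len="4" name="DoseReferenceStructureType">SITE2</element>',
]

pairs = extract(sample, "300a,0014")
for name, value in pairs:
    print("%s = %s" % (name, value))
```

    Collecting the pairs in a list rather than a dict keeps every value when
    the same tag shows up on several lines.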




    --
    Age is not a particularly interesting subject. Anyone can get old. All
    you have to do is live long enough.
    Nemesis, Mar 8, 2008
    #3
  4. Paul McGuire Guest

    On Mar 8, 2:02 pm, Nemesis <> wrote:
    > wrote:
    > > I have a large file that has many lines like this,
    > >
    > > <element tag="300a,0014" vr="CS" vm="1" len="4"
    > > name="DoseReferenceStructureType">SITE</element>
    > >
    > > I would like to identify the line by the tag (300a,0014) and then grab
    > > the name (DoseReferenceStructureType) and value (SITE).
    >
    > You should try with Regular Expressions or if it is something like xml there
    > is for sure a library you can use to parse it ...

    <snip>

    When it comes to parsing HTML or XML of uncontrolled origin, regular
    expressions are an iffy proposition. You'd be amazed what kind of
    junk shows up inside an XML (or worse, HTML) tag.

    Pyparsing includes a builtin method for constructing tag matching
    parsing patterns, which you can then use to scan through the XML or
    HTML source:

    from pyparsing import makeXMLTags, withAttribute, SkipTo

    testdata = """
    <blah>
    <element tag="300a,0014" vr="CS" vm="1" len="4"
    name="DoseReferenceStructureType">SITE</element>
    <element tag="300Z,0019" vr="CS" vm="1" len="4"
    name="DoseReferenceStructureType">SITEXXX</element>
    <element tag="300a,0014" vr="CS" vm="1" len="4"
    name="DoseReferenceStructureType">SITE2</element>
    <blahblah>
    """

    elementStart,elementEnd = makeXMLTags("element")
    elementStart.setParseAction(withAttribute(tag="300a,0014"))
    search = elementStart + SkipTo(elementEnd)("body")

    for t in search.searchString(testdata):
        print(t.name)
        print(t.body)

    Prints:

    DoseReferenceStructureType
    SITE
    DoseReferenceStructureType
    SITE2

    In this case, the parse action withAttribute filters <element> tag
    matches, accepting *only* those with the attribute "tag" and the value
    "300a,0014". The pattern search adds on the body of the
    <element></element> tag, and gives it the name "body" so it is easily
    accessed after parsing is completed.

    -- Paul
    (More about pyparsing at http://pyparsing.wikispaces.com.)
    Paul McGuire, Mar 8, 2008
    #4
  5. Guest

    On 8 mar, 20:49, "" <>
    wrote:
    > I have a large file that has many lines like this,
    >
    > <element tag="300a,0014" vr="CS" vm="1" len="4"
    > name="DoseReferenceStructureType">SITE</element>
    >
    > I would like to identify the line by the tag (300a,0014) and then grab
    > the name (DoseReferenceStructureType) and value (SITE).


    It's obviously an XML file, so use an XML parser - there are SAX and
    DOM parsers in the stdlib, as well as the ElementTree module.
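
    A minimal sketch of the ElementTree route, assuming the file is
    well-formed XML with a single root element. The <root> wrapper, the
    SAMPLE text and the collect() helper are invented for illustration,
    since the thread does not show the rest of the file.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the <root> wrapper is an assumption, the <element>
# lines follow the shape of the sample posted in this thread.
SAMPLE = """<root>
<element tag="300a,0014" vr="CS" vm="1" len="4" name="DoseReferenceStructureType">SITE</element>
<element tag="300Z,0019" vr="CS" vm="1" len="4" name="DoseReferenceStructureType">SITEXXX</element>
<element tag="300a,0014" vr="CS" vm="1" len="4" name="DoseReferenceStructureType">SITE2</element>
</root>"""

def collect(xml_text, wanted_tag):
    """Return 'name = value' strings for each <element> with a matching tag attribute."""
    root = ET.fromstring(xml_text)
    lines = []
    # iter() walks the whole tree, so nesting depth does not matter.
    for el in root.iter("element"):
        if el.get("tag") == wanted_tag:
            lines.append("%s = %s" % (el.get("name"), (el.text or "").strip()))
    return lines

result = collect(SAMPLE, "300a,0014")
print("\n".join(result))
```

    For a real file, ET.parse(filename).getroot() would take the place of
    ET.fromstring(), and the resulting lines can be written out with an
    ordinary file object.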
    , Mar 9, 2008
    #5