Tina Li
Hello,
I've been struggling with a regular expression for parsing XML files, which keeps giving the runtime error "maximum
recursion limit exceeded". Here is the pattern string:
r'<code>(?P<c>.*?)</code>.*?<targetSeq name="(?P<tn>.*?)">.*?<target>(?P<t>.*?)</target>.*?<align>(?P<a>.*?)</align>.*?<template>(?P<temp>.*?)</template>.*?<anotherTag>(?P<at>.*?)</anotherTag>.*?<yetAnotherTag>(?P<yat>.*?)</yetAnotherTag>'
The file format is straightforward. Here is a sample:
<code>1cg2</code>
<chain>a</chain>
<settings>abcde</settings>
<scoreInfo>12345</scoreInfo>
<targetSeq name="1onc">blah
</targetSeq>
<alignment size="335">
<target>WLTFQKKHITNTRDVDCDNIMS</target>
<align> :| ..| : . | . |. . :</align>
<template>QKRDNVLFQAATDEQPAVIKTLEKL</template>
<anotherTag>foobarfoobar</anotherTag>
<yetAnotherTag>barfoobarfoo</yetAnotherTag>
# this group of tags then repeats in the file multiple times
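
For reference, a minimal sketch of how the pattern might be run against this sample. It assumes Python's re module with the re.DOTALL flag (the flag is not stated above, but without it .*? would not cross line breaks). The names SAMPLE, PATTERN and records are only illustrative, and the record is repeated three times to stand in for a longer file.

```python
import re

# One record, copied from the sample above.
SAMPLE = """<code>1cg2</code>
<chain>a</chain>
<settings>abcde</settings>
<scoreInfo>12345</scoreInfo>
<targetSeq name="1onc">blah
</targetSeq>
<alignment size="335">
<target>WLTFQKKHITNTRDVDCDNIMS</target>
<align> :| ..| : . | . |. . :</align>
<template>QKRDNVLFQAATDEQPAVIKTLEKL</template>
<anotherTag>foobarfoobar</anotherTag>
<yetAnotherTag>barfoobarfoo</yetAnotherTag>
"""

# The pattern from the post, split over several string literals for readability.
# re.DOTALL is an assumption: it lets "." match newlines between the tags.
PATTERN = re.compile(
    r'<code>(?P<c>.*?)</code>.*?'
    r'<targetSeq name="(?P<tn>.*?)">.*?'
    r'<target>(?P<t>.*?)</target>.*?'
    r'<align>(?P<a>.*?)</align>.*?'
    r'<template>(?P<temp>.*?)</template>.*?'
    r'<anotherTag>(?P<at>.*?)</anotherTag>.*?'
    r'<yetAnotherTag>(?P<yat>.*?)</yetAnotherTag>',
    re.DOTALL,
)

# Simulate a file in which the group of tags repeats several times.
records = [m.groupdict() for m in PATTERN.finditer(SAMPLE * 3)]

for rec in records:
    print(rec["c"], rec["tn"], rec["t"], rec["temp"])
```

On a short input like this the search is cheap; the error described here shows up on the real, much longer files.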
If I search for the pattern only up to "</template>" (i.e. without <anotherTag> onwards), it works fine. As soon as I add
the later parts to the pattern, it gives the error.
I heard that non-greedy matching (*?) is inefficient, so I tried replacing every .*? with (?!<target>) etc., which means
"if the next piece of text doesn't match the <target> tag, keep going". But it gives the same error.
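
To make that substitution concrete, here is a rough sketch of what it presumably looks like. A bare (?!<target>) consumes no characters by itself, so it has to be paired with a dot and repeated, as in (?:(?!<target>).)*. The helper until() and the shortened TEXT are hypothetical, re.DOTALL is again an assumption, and only the first three fields are shown. This illustrates the construct only; it is not claimed to avoid the recursion error.

```python
import re


def until(literal):
    """Hypothetical helper: match any run of characters that stops just before `literal`."""
    # At each position, the negative lookahead checks that `literal` does not
    # start here; only then does "." consume one character.
    return r'(?:(?!%s).)*' % re.escape(literal)


# Shortened record based on the sample in the post (first three fields only).
TEXT = (
    '<code>1cg2</code>\n'
    '<chain>a</chain>\n'
    '<targetSeq name="1onc">blah\n'
    '</targetSeq>\n'
    '<alignment size="335">\n'
    '<target>WLTFQKKHITNTRDVDCDNIMS</target>\n'
)

# Same structure as the original pattern, with every .*? replaced by an
# explicit "anything up to the next delimiter" piece.
PATTERN = re.compile(
    r'<code>(?P<c>' + until('</code>') + r')</code>'
    + until('<targetSeq') + r'<targetSeq name="(?P<tn>' + until('"') + r')">'
    + until('<target>') + r'<target>(?P<t>' + until('</target>') + r')</target>',
    re.DOTALL,  # assumption: "." must be able to cross line breaks
)

match = PATTERN.search(TEXT)
if match:
    print(match.group('c'), match.group('tn'), match.group('t'))
```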
So my question is: what is the bottleneck in this pattern? Could someone more experienced in REs give some hints here?
Your help is greatly appreciated!
Tina