extracting substrings from a file

sofiafig · Sep 11, 2006

Hi,

I have a file with several entries in the form:

AFFX-BioB-5_at E. coli /GEN=bioB /gb:J04423.1 NOTE=SIF
corresponding to nucleotides 2032-2305 of /gb:J04423.1 DEF=E.coli
7,8-diamino-pelargonic acid (bioA), biotin synthetase (bioB),
7-keto-8-amino-pelargonic acid synthetase (bioF), bioC protein, and
dethiobiotin synthetase (bioD), complete cds.

1415785_a_at /gb:NM_009840.1 /DB_XREF=gi:6753327 /GEN=Cct8 /FEA=FLmRNA
/CNT=482 /TID=Mm.17989.1 /TIER=FL+Stack /STK=281 /UG=Mm.17989 /LL=12469
/DEF=Mus musculus chaperonin subunit 8 (theta) (Cct8), mRNA.
/PROD=chaperonin subunit 8 (theta) /FL=/gb:NM_009840.1 /gb:BC009007.1

and I would like to create a file that has only the following:

AFFX-BioB-5_at /GEN=bioB /gb:J04423.1

1415785_a_at /gb:NM_009840.1 /GEN=Cct8

Could anyone please tell me how I can do it?

Many thanks in advance
Sofia

Tim Chase · Sep 11, 2006

The following seems to do it for me...

outfile = open('out.txt', 'w')
for line in open('in.txt'):
    # only look at lines that carry both qualifiers
    if '/GEN' in line and '/gb:' in line:
        newline = []
        for index, item in enumerate(line.split()):
            # keep the first word (the ID) plus the wanted qualifiers
            if (index == 0 or item.startswith('/GEN')
                    or item.startswith('/gb:')):
                newline.append(item)
        outfile.write('\t'.join(newline))
        outfile.write('\n')
outfile.close()

There are some underdefined conditions...I presume that both the
GEN and gb: have to appear in the line. If only one of them is
required, change the "and" to an "or".

-tkc
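
To make the "and" versus "or" choice concrete, here is a minimal sketch of the same per-line filter wrapped in a function. The names filter_line and require_all are made up for illustration and are not from the post above; the logic is otherwise the same line-by-line test.

```python
def filter_line(line, require_all=True):
    """Return the first word plus any /GEN and /gb: words, or None to skip.

    require_all is a hypothetical flag: True mirrors the "and" test,
    False mirrors the "or" variant.
    """
    wanted = ('/GEN', '/gb:')
    present = [prefix in line for prefix in wanted]
    keep_line = all(present) if require_all else any(present)
    if not keep_line:
        return None
    words = line.split()
    # the first word is always kept; later words only if they match a prefix
    return [words[0]] + [w for w in words[1:] if w.startswith(wanted)]
```

One thing to watch with the "or" variant on this data: continuation lines that happen to contain /gb: would then be treated as records of their own, with whatever word starts the line standing in for the ID.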

John Machin · Sep 11, 2006

Here's my first iteration:
C:\junk>type sofia.py
prefixes = ['/GEN=', '/gb:']

def extract(fname):
    f = open(fname, 'r')
    chunks = [[]]
    for line in f:
        words = line.split()
        if words:
            chunks[-1].extend(words)
        else:
            chunks.append([])
    for chunk in chunks:
        if not chunk:
            continue
        output = [chunk[0]]
        for word in chunk:
            for prefix in prefixes:
                if word.startswith(prefix):
                    output.append(word)
                    break
        print(' '.join(output))

if __name__ == "__main__":
    import sys
    extract(sys.argv[1])

C:\junk>sofia.py sofia.txt
AFFX-BioB-5_at /GEN=bioB /gb:J04423.1 /gb:J04423.1
1415785_a_at /gb:NM_009840.1 /GEN=Cct8 /gb:BC009007.1

Before I fix the duplicate in the first line, you need to say whether
you really want the /gb:BC009007.1 in the second line thrown away. In
other words, what's the rule? For each prefix, either (1) get the first
"word" that starts with that prefix or (2) get all unique such words.
You choose.

Cheers,
John
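
For reference, the two candidate rules could be sketched roughly as below. This is only an illustration operating on one entry's word list (a "chunk" in the script above); the names select and first_only are invented here and are not part of sofia.py.

```python
PREFIXES = ('/GEN=', '/gb:')

def select(words, first_only=True):
    """Pick qualifier words from the word list of a single entry.

    first_only=True  -> rule (1): first word seen for each prefix
    first_only=False -> rule (2): every distinct matching word
    """
    output = [words[0]]          # the ID always comes first
    seen_prefixes = set()
    seen_words = set()
    for word in words[1:]:
        for prefix in PREFIXES:
            if not word.startswith(prefix):
                continue
            if first_only and prefix in seen_prefixes:
                break            # rule (1): this prefix is already filled
            if word in seen_words:
                break            # exact duplicate, skip under either rule
            seen_prefixes.add(prefix)
            seen_words.add(word)
            output.append(word)
            break
    return output
```

On the two sample entries, rule (1) reproduces the output requested in the original question, while rule (2) keeps /gb:BC009007.1 in the second entry and only drops the repeated /gb:J04423.1 from the first.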

Larry Bates · Sep 11, 2006

What have you tried so far?

Hint: split the line on spaces. The first piece is the first item you
want; then iterate over the pieces looking for the /GEN= and /gb: pieces
that you are interested in keeping. I am assuming that the /GEN= and /gb:
data don't have any spaces in them. If they do, you will need to use
regular expressions instead of split.

-Larry Bates
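
As a rough idea of what the regular-expression route might look like, here is a small sketch. It still assumes the values contain no spaces (a value with spaces would need a rule for where it ends, which the sample data does not define), and the names QUALIFIER and extract_qualifiers are made up for this example.

```python
import re

# A whole whitespace-delimited word starting with /GEN= or /gb:.
# The (?<!\S) lookbehind stops it matching the "/gb:..." buried
# inside a word such as "/FL=/gb:NM_009840.1".
QUALIFIER = re.compile(r'(?<!\S)/(?:GEN=|gb:)\S+')

def extract_qualifiers(entry):
    """Return the first word of entry followed by its /GEN= and /gb: words."""
    words = entry.split()
    if not words:
        return []
    return [words[0]] + QUALIFIER.findall(entry)
```

Because the pattern works on the whole entry text, it can be handed a multi-line entry directly, with no need to split it into lines first.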

Paul McGuire · Sep 11, 2006

Here's a pyparsing solution that will address your immediate question, and
also gives you some leeway for adding other "/" options to your search.
Pyparsing's home page is at pyparsing.wikispaces.com.

-- Paul

data = """
AFFX-BioB-5_at E. coli /GEN=bioB /gb:J04423.1 NOTE=SIF
corresponding to nucleotides 2032-2305 of /gb:J04423.1 DEF=E.coli
7,8-diamino-pelargonic acid (bioA), biotin synthetase (bioB),
7-keto-8-amino-pelargonic acid synthetase (bioF), bioC protein, and
dethiobiotin synthetase (bioD), complete cds.

1415785_a_at /gb:NM_009840.1 /DB_XREF=gi:6753327 /GEN=Cct8 /FEA=FLmRNA
/CNT=482 /TID=Mm.17989.1 /TIER=FL+Stack /STK=281 /UG=Mm.17989 /LL=12469
/DEF=Mus musculus chaperonin subunit 8 (theta) (Cct8), mRNA.
/PROD=chaperonin subunit 8 (theta) /FL=/gb:NM_009840.1 /gb:BC009007.1
"""

from pyparsing import *

# create expression we are looking for:
# name [ junk word... ] /qualifier...
name = Word(alphanums,printables).setResultsName("name")
junkWord = ~(Literal("/")) + Word(printables)
qualifier = ("/" + Word(alphas+"_-.").setResultsName("key") + \
             oneOf("= :") + \
             Word(printables).setResultsName("value"))
expr = name + ZeroOrMore(junkWord) + \
       Dict(ZeroOrMore(qualifier)).setResultsName("quals")

# use parse action to repackage qualifier data to support "dict"-like
# access to qualifiers
qualifier.setParseAction( lambda t: (t.key,"".join(t)) )

# use this parse action instead if you just want whatever is
# after the '=' or ':' delimiter in the qualifier
# qualifier.setParseAction( lambda t: (t.key,t.value) )

# parse data strings, showing returned data structure
# (just to show what pyparsing results structure looks like)
for d in data.split("\n\n"):
    res = expr.parseString(d)
    print(res.dump())
    print("\n")

# now just do what the OP wanted in the first place
for d in data.split("\n\n"):
    res = expr.parseString(d)
    print("%s %s %s" % (res.name, res.quals["gb"], res.quals["GEN"]))

Gives these results:
['AFFX-BioB-5_at', 'E.', 'coli', [('GEN', '/GEN=bioB'), ('gb',
'/gb:J04423.1')]]
- name: AFFX-BioB-5_at
- quals: [('GEN', '/GEN=bioB'), ('gb', '/gb:J04423.1')]
- GEN: /GEN=bioB
- gb: /gb:J04423.1

['1415785_a_at', [('gb', '/gb:NM_009840.1'), ('DB_XREF',
'/DB_XREF=gi:6753327'), ('GEN', '/GEN=Cct8'), ('FEA', '/FEA=FLmRNA'),
('CNT', '/CNT=482'), ('TID', '/TID=Mm.17989.1'), ('TIER', '/TIER=FL+Stack'),
('STK', '/STK=281'), ('UG', '/UG=Mm.17989'), ('LL', '/LL=12469'), ('DEF',
'/DEF=Mus')]]
- name: 1415785_a_at
- quals: [('gb', '/gb:NM_009840.1'), ('DB_XREF', '/DB_XREF=gi:6753327'),
('GEN', '/GEN=Cct8'), ('FEA', '/FEA=FLmRNA'), ('CNT', '/CNT=482'), ('TID',
'/TID=Mm.17989.1'), ('TIER', '/TIER=FL+Stack'), ('STK', '/STK=281'), ('UG',
'/UG=Mm.17989'), ('LL', '/LL=12469'), ('DEF', '/DEF=Mus')]
- CNT: /CNT=482
- DB_XREF: /DB_XREF=gi:6753327
- DEF: /DEF=Mus
- FEA: /FEA=FLmRNA
- GEN: /GEN=Cct8
- LL: /LL=12469
- STK: /STK=281
- TID: /TID=Mm.17989.1
- TIER: /TIER=FL+Stack
- UG: /UG=Mm.17989
- gb: /gb:NM_009840.1

AFFX-BioB-5_at /gb:J04423.1 /GEN=bioB
1415785_a_at /gb:NM_009840.1 /GEN=Cct8
