Extracting lines from text files - script with a couple of 'side effects'

mstagliamonte · Sep 25, 2013

Dear All,

Here I am, with another newbie question. I am trying to extract some lines from a fasta (text) file which match the headers in another file. i.e:
Fasta file:

header1|info1:info2_info3 general text
header2|info1:info2_info3

general text

headers file:
header1|info1:info2_info3
header2|info1:info2_info3

I want to create a third file, similar to the first one, but only containing headers and text of what is listed in the second file. Also, I want to print out how many headers were actually found from the second file to match the first.

I have done a script which seems to work, but with a couple of 'side effects'
Here is my script:
-------------------------------------------------------------------
import re
class Extractor():

def __init__(self,headers_file, fasta_file,output_file):
with open(headers_file,'r') as inp0:
counter0=0
container=''
inp0_bis=inp0.read().split('\n')
for x in inp0_bis:
container+=x.replace(':','_').replace('|','_')
with open(fasta_file,'r') as inp1:
inp1_bis=inp1.read().split('>')
for i in inp1_bis:
i_bis= i.split('\n')
match = re.search(i_bis[0].replace(':','_').replace('|','_'),container)
if match:
counter0+=1
with open(output_file,'at') as out0:
out0.write('>'+i)
print '{} sequences were found'.format(counter0)

-------------------------------------------------------------------
Side effects:
1) The very first header is written as >>header1 rather than >header1
2) the number of sequences found is 1 more than the ones actually found!

Have you got any thoughts about causes/solutions?

Thanks for your time!
P.S.: I think I have removed the double posting... not sure...
Max

Dave Angel · Sep 25, 2013

Dear All,

Here I am, with another newbie question. I am trying to extract some lines from a fasta (text) file which match the headers in another file. i.e:
Fasta file:

header1|info1:info2_info3 general text
header2|info1:info2_info3

Click to expand...

general text

headers file:
header1|info1:info2_info3
header2|info1:info2_info3

I want to create a third file, similar to the first one, but only containing headers and text of what is listed in the second file. Also, I want to print out how many headers were actually found from the second file to match the first.

I have done a script which seems to work, but with a couple of 'side effects'
Here is my script:
-------------------------------------------------------------------
import re
class Extractor():

def __init__(self,headers_file, fasta_file,output_file):
with open(headers_file,'r') as inp0:
counter0=0
container=''
inp0_bis=inp0.read().split('\n')
for x in inp0_bis:
container+=x.replace(':','_').replace('|','_')
with open(fasta_file,'r') as inp1:
inp1_bis=inp1.read().split('>')
for i in inp1_bis:
i_bis= i.split('\n')
match = re.search(i_bis[0].replace(':','_').replace('|','_'),container)
if match:
counter0+=1
with open(output_file,'at') as out0:
out0.write('>'+i)
print '{} sequences were found'.format(counter0)

-------------------------------------------------------------------
Side effects:
1) The very first header is written as >>header1 rather than >header1
2) the number of sequences found is 1 more than the ones actually found!

Have you got any thoughts about causes/solutions?

The cause is the same. The first line in the file starts with a "<",
and you're splitting on the same. So the first item of inp1_bis is the
empty string. That string is certainly contained within container, so
it matches, and produces a result of ">"

You can "fix" this by adding a line after the "for i in inp1_bis" line
if not i: continue

But it also seems to me you're doing the search backwards. if the Fasta
file has a line like: >der

it would be considered a match! Seems to me you'd want to only match
lines which contain an entire header.

MRAB · Sep 25, 2013

Dear All,

Here I am, with another newbie question. I am trying to extract some lines from a fasta (text) file which match the headers in another file. i.e:
Fasta file:

header1|info1:info2_info3 general text
header2|info1:info2_info3

Click to expand...

general text

headers file:
header1|info1:info2_info3
header2|info1:info2_info3

I want to create a third file, similar to the first one, but only containing headers and text of what is listed in the second file. Also, I want to print out how many headers were actually found from the second file to match the first.

I have done a script which seems to work, but with a couple of 'side effects'
Here is my script:
-------------------------------------------------------------------
import re
class Extractor():

def __init__(self,headers_file, fasta_file,output_file):
with open(headers_file,'r') as inp0:
counter0=0
container=''
inp0_bis=inp0.read().split('\n')
for x in inp0_bis:
container+=x.replace(':','_').replace('|','_')
with open(fasta_file,'r') as inp1:
inp1_bis=inp1.read().split('>')
for i in inp1_bis:
i_bis= i.split('\n')
match = re.search(i_bis[0].replace(':','_').replace('|','_'),container)
if match:
counter0+=1
with open(output_file,'at') as out0:
out0.write('>'+i)
print '{} sequences were found'.format(counter0)

-------------------------------------------------------------------
Side effects:
1) The very first header is written as >>header1 rather than >header1
2) the number of sequences found is 1 more than the ones actually found!

Have you got any thoughts about causes/solutions?

Thanks for your time!

Here's my version:

class Extractor():
def __init__(self, headers_file, fasta_file, output_file):
with open(headers_file) as inp:
headers = set('>' + line for line in inp)

counter = 0
accept = False

with open(fasta_file) as inp, open(output_file, 'w') as out:
for line in inp:
if line.startswith('>'):
accept = line in headers
if accept:
counter += 1

if accept:
out.write(line)

print '{} sequences were found'.format(counter)

mstagliamonte · Sep 26, 2013

Hi,

Thanks for the answers! And Dave, thanks for explaining the cause of the problem I will keep that in mind for the future. You're right, I am doing thesearch backward, it just seemed easier for me to do it in that way. Looks like I need to keep practising...

Both your suggestions work, I will try and learn from them.

Have a nice day
Max

Help with importing from multiple files and printing lines in designated spot to spit out one file.	1	Jan 16, 2023
Insert replace text based on a name in other file python script	4	Mar 5, 2025
Need help with this script	4	Mar 12, 2023
Handling text lines from files with some (few) starnge chars	6	Jun 5, 2010
Help with code	0	Jun 11, 2022
Extracting links from a html table	1	May 19, 2008
Identify IP address from a text file and replace it with new address.	4	Nov 8, 2010
Perl script to return the number of occurences of multiple lines in afile	8	Dec 29, 2007

Extracting lines from text files - script with a couple of 'side effects'

mstagliamonte

Dave Angel

MRAB

mstagliamonte

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads