Issue values dictionary

  • Thread starter claire morandin

claire morandin

I have two text files with a bunch of transcript names and their corresponding lengths; they look like this:
ERCC-00002 1061
ERCC-00003 1023
ERCC-00004 523
ERCC-00009 984
ERCC-00012 994
ERCC-00013 808
ERCC-00014 1957
ERCC-00016 844
ERCC-00017 1136
ERCC-00019 644
ERCC-00002 1058
ERCC-00003 1017
ERCC-00004 519
ERCC-00009 977
ERCC-00019 638
ERCC-00022 746
ERCC-00024 134
ERCC-00024 126
ERCC-00024 98
ERCC-00025 445

I want to compare the lengths of the transcripts and see whether the length in blast.txt is at least 90% of the length in ERCC.txt for the corresponding transcript name (I hope I am clear!).
So I wrote the following script:
ercctranscript_size = {}
for line in open('ERCC.txt'):
    columns = line.strip().split()
    transcript = columns[0]
    size = columns[1]
    ercctranscript_size[transcript] = int(size)

unknown_transcript = open('Not_sequenced_ERCC_transcript.txt', 'w')
blast_file = open('blast.txt')
out_file = open ('out.txt', 'w')

blast_transcript = {}
for line in blast_file:
    columns = line.strip().split()
    blasttranscript = columns[0].strip()
    blastsize = columns[1].strip()
    blast_transcript[blasttranscript] = int(blastsize)

blastsize = blast_transcript[blasttranscript]
size = ercctranscript_size[transcript]
print size
if transcript not in blast_transcript:
    print >> unknown_transcript, transcript
else:
    size = ercctranscript_size[transcript]
    if blastsize >= 0.9*size:
        print >> out_file, transcript, True
    else:
        print >> out_file, transcript, False

But I have a problem storing all the size lengths in the value size, as it always comes back with the last entry.
Could anyone explain to me what I am doing wrong and how I should set the values for each dictionary? I am really new to Python and this is my first script.

Thanks for your help everybody!


alex23

claire morandin said:
But I have a problem storing all the size lengths in the value size, as it always comes back with the last entry.
Could anyone explain to me what I am doing wrong and how I should set the values for each dictionary?

Your code has two for loops, one that reads ERCC.txt into a dict, and
one that reads blast.txt into a dict. The first assigns to
`transcript`, the second to `blasttranscript`. When the loops are
finished, you're using the _last_ value set for both `transcript` and
`blasttranscript`. So, really, you want _three_ loops: two to load the
files into dicts, then another to compare the two of them. If the
transcripts in blast.txt are guaranteed to be a subset of ERCC.txt,
then you could get away with two loops:

# convenience function for splitting lines into values
def get_transcript_and_size(line):
    columns = line.strip().split()
    return columns[0].strip(), int(columns[1].strip())

# read in blast_file
blast_transcripts = {}
with open('transcript_blast.txt') as blast_file:
    # this is a context manager, it'll close the file when the block ends
    for line in blast_file:
        blasttranscript, blastsize = get_transcript_and_size(line)
        blast_transcripts[blasttranscript] = blastsize

# read in ERCC and compare to blast
with open('transcript_ERCC.txt') as ercc_file, \
     open('Not_sequenced_ERCC_transcript.txt', 'w') as unknown_transcript, \
     open('transcript_out.txt', 'w') as out_file:
    # this is called a _nested_ context manager, and requires 2.7+ or 3.1+
    for line in ercc_file:
        ercctranscript, erccsize = get_transcript_and_size(line)
        if ercctranscript not in blast_transcripts:
            print >> unknown_transcript, ercctranscript
        else:
            is_ninety_percent = blast_transcripts[ercctranscript] >= 0.9*erccsize
            print >> out_file, ercctranscript, is_ninety_percent
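
For the general case, where blast.txt may also contain names that are not in ERCC.txt, the three-loop version mentioned above could look roughly like the sketch below. The names `load_sizes` and `compare_sizes` are made up for illustration, and the sketch returns its results instead of printing them so that it runs on both Python 2.7 and 3. As with any dict-based approach, a name that is repeated in a file (like ERCC-00024 in the sample) keeps only its last size.

```python
def load_sizes(path):
    """Loops 1 and 2: read 'name length' lines from a file into a dict."""
    sizes = {}
    with open(path) as handle:
        for line in handle:
            columns = line.split()
            if len(columns) < 2:
                continue  # skip blank or malformed lines
            sizes[columns[0]] = int(columns[1])
    return sizes


def compare_sizes(ercc_sizes, blast_sizes, threshold=0.9):
    """Loop 3: compare every ERCC transcript against the blast dict."""
    results = {}  # name -> True/False for transcripts found in blast
    missing = []  # ERCC names that never appear in blast
    for name, ercc_size in ercc_sizes.items():
        if name not in blast_sizes:
            missing.append(name)
        else:
            results[name] = blast_sizes[name] >= threshold * ercc_size
    return results, missing
```

It would be called as `results, missing = compare_sizes(load_sizes('ERCC.txt'), load_sizes('blast.txt'))`, after which the two results can be written to whichever output files are needed. Names that occur only in blast.txt are simply ignored here.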

I've cleaned up your code a bit, using more similar naming schemes and
the same open/write procedures for all file access. Generally, any
time you're repeating code, you should stick it into a function and
use that instead, like the `get_transcript_and_size` func. If the
columns in your two files are separated by tabs, or always by the same
number of spaces, you can simplify this even further by using the csv module.
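
As a rough illustration of the csv idea, assuming the columns are separated by a single tab (for a single space you would pass `delimiter=' '` instead), reading one file into a dict might look like this. `read_sizes` is a hypothetical name, not something from the code above.

```python
import csv


def read_sizes(path, delimiter='\t'):
    """Read 'name<delimiter>length' rows into a dict using csv.reader."""
    sizes = {}
    with open(path) as handle:
        for row in csv.reader(handle, delimiter=delimiter):
            if len(row) < 2:
                continue  # skip empty or short rows
            sizes[row[0]] = int(row[1])
    return sizes
```

Unlike `split()`, `csv.reader` takes a one-character delimiter and does not collapse runs of whitespace, so this only works if the separator really is consistent.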

Hope this helps.

claire morandin

@alex23 I can't thank you enough. This really helped me so much, not only in fixing my issue but also in understanding where my original error was.

Thanks a lot

Peter Otten

alex23 said:
def get_transcript_and_size(line):
    columns = line.strip().split()
    return columns[0].strip(), int(columns[1].strip())

You can remove all strip() methods here as split() already strips off any
whitespace from the columns.
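
A quick check with a made-up sample line shows the behaviour: called with no argument, `split()` splits on any run of whitespace and drops leading and trailing whitespace, including the newline.

```python
line = "  ERCC-00002 \t 1061\n"  # made-up line with messy whitespace

with_strip = [column.strip() for column in line.strip().split()]
without_strip = line.split()

# Both approaches give the same clean columns.
assert with_strip == without_strip == ["ERCC-00002", "1061"]
```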

Not really important, but the nitpicker in me keeps nagging ;)


alex23

Peter Otten said:
You can remove all strip() methods here as split() already strips off any
whitespace from the columns.

Not really important, but the nitpicker in me keeps nagging ;)

Thanks, I really should have checked, but I just pushed the OP's code into
a function; I didn't want to startle them with completely different
code :)

As I mentioned, I would've used the csv module for this anyway, which
is why I never remember the split/strip behaviour.

Nitpickery can be a virtue in this field :)
