extracting substrings from a file

Discussion in 'Python' started by sofiafig@gmail.com, Sep 11, 2006.

  1. Guest

    Hi,

    I have a file with several entries in the form:

    AFFX-BioB-5_at E. coli /GEN=bioB /gb:J04423.1 NOTE=SIF
    corresponding to nucleotides 2032-2305 of /gb:J04423.1 DEF=E.coli
    7,8-diamino-pelargonic acid (bioA), biotin synthetase (bioB),
    7-keto-8-amino-pelargonic acid synthetase (bioF), bioC protein, and
    dethiobiotin synthetase (bioD), complete cds.

    1415785_a_at /gb:NM_009840.1 /DB_XREF=gi:6753327 /GEN=Cct8 /FEA=FLmRNA
    /CNT=482 /TID=Mm.17989.1 /TIER=FL+Stack /STK=281 /UG=Mm.17989 /LL=12469
    /DEF=Mus musculus chaperonin subunit 8 (theta) (Cct8), mRNA.
    /PROD=chaperonin subunit 8 (theta) /FL=/gb:NM_009840.1 /gb:BC009007.1

    and I would like to create a file that has only the following:

    AFFX-BioB-5_at /GEN=bioB /gb:J04423.1

    1415785_a_at /gb:NM_009840.1 /GEN=Cct8

    Could anyone please tell me how can I do it?

    Many thanks in advance
    Sofia
    , Sep 11, 2006
    #1

  2. Tim Chase Guest

    > I have a file with several entries in the form:
    >
    > AFFX-BioB-5_at E. coli /GEN=bioB /gb:J04423.1 NOTE=SIF
    > corresponding to nucleotides 2032-2305 of /gb:J04423.1 DEF=E.coli
    > 7,8-diamino-pelargonic acid (bioA), biotin synthetase (bioB),
    > 7-keto-8-amino-pelargonic acid synthetase (bioF), bioC protein, and
    > dethiobiotin synthetase (bioD), complete cds.
    >
    > 1415785_a_at /gb:NM_009840.1 /DB_XREF=gi:6753327 /GEN=Cct8 /FEA=FLmRNA
    > /CNT=482 /TID=Mm.17989.1 /TIER=FL+Stack /STK=281 /UG=Mm.17989 /LL=12469
    > /DEF=Mus musculus chaperonin subunit 8 (theta) (Cct8), mRNA.
    > /PROD=chaperonin subunit 8 (theta) /FL=/gb:NM_009840.1 /gb:BC009007.1
    >
    > and I would like to create a file that has only the following:
    >
    > AFFX-BioB-5_at /GEN=bioB /gb:J04423.1
    >
    > 1415785_a_at /gb:NM_009840.1 /GEN=Cct8
    >
    > Could anyone please tell me how can I do it?


    The following seems to do it for me...

    outfile = open('out.txt', 'w')
    for line in open('in.txt'):
        # keep only lines that carry both a /GEN and a /gb: field
        if '/GEN' in line and '/gb:' in line:
            newline = []
            for index, item in enumerate(line.split()):
                # keep the first word plus every /GEN and /gb: item
                if (index == 0 or item.startswith('/GEN')
                        or item.startswith('/gb:')):
                    newline.append(item)
            outfile.write('\t'.join(newline))
            outfile.write('\n')
    outfile.close()


    There are some underdefined conditions...I presume that both the
    GEN and gb: have to appear in the line. If only one of them is
    required, change the "and" to an "or".
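
To make the "and" versus "or" choice concrete, here is a minimal sketch written for Python 3. The function name `filter_line` and the `require_all` flag are made up for illustration and are not part of the code posted above.

```python
def filter_line(line, prefixes=("/GEN", "/gb:"), require_all=True):
    """Return the first word plus every word starting with a prefix.

    Returns None when the line does not qualify. `require_all` is a
    hypothetical switch: True mirrors the "and" test, False the "or" test.
    """
    words = line.split()
    if not words:
        return None
    check = all if require_all else any
    if not check(prefix in line for prefix in prefixes):
        return None
    kept = [words[0]]
    # str.startswith accepts a tuple of candidate prefixes
    kept.extend(w for w in words[1:] if w.startswith(tuple(prefixes)))
    return "\t".join(kept)
```

Note that with `require_all=False` a continuation line such as "corresponding to nucleotides ... /gb:J04423.1" also qualifies, and its first word is not an entry ID, so the "or" variant probably needs entries grouped by blank lines first.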

    -tkc
    Tim Chase, Sep 11, 2006
    #2

  3. John Machin Guest

    wrote:
    > Hi,
    >
    > I have a file with several entries in the form:
    >
    > AFFX-BioB-5_at E. coli /GEN=bioB /gb:J04423.1 NOTE=SIF
    > corresponding to nucleotides 2032-2305 of /gb:J04423.1 DEF=E.coli
    > 7,8-diamino-pelargonic acid (bioA), biotin synthetase (bioB),
    > 7-keto-8-amino-pelargonic acid synthetase (bioF), bioC protein, and
    > dethiobiotin synthetase (bioD), complete cds.
    >
    > 1415785_a_at /gb:NM_009840.1 /DB_XREF=gi:6753327 /GEN=Cct8 /FEA=FLmRNA
    > /CNT=482 /TID=Mm.17989.1 /TIER=FL+Stack /STK=281 /UG=Mm.17989 /LL=12469
    > /DEF=Mus musculus chaperonin subunit 8 (theta) (Cct8), mRNA.
    > /PROD=chaperonin subunit 8 (theta) /FL=/gb:NM_009840.1 /gb:BC009007.1
    >
    > and I would like to create a file that has only the following:
    >
    > AFFX-BioB-5_at /GEN=bioB /gb:J04423.1
    >
    > 1415785_a_at /gb:NM_009840.1 /GEN=Cct8
    >
    > Could anyone please tell me how can I do it?
    >
    > Many thanks in advance
    > Sofia


    Here's my first iteration:
    C:\junk>type sofia.py
    prefixes = ['/GEN=', '/gb:']

    def extract(fname):
        f = open(fname, 'r')
        chunks = [[]]
        for line in f:
            words = line.split()
            if words:
                chunks[-1].extend(words)
            else:
                chunks.append([])
        for chunk in chunks:
            if not chunk:
                continue
            output = [chunk[0]]
            for word in chunk:
                for prefix in prefixes:
                    if word.startswith(prefix):
                        output.append(word)
                        break
            print ' '.join(output)

    if __name__ == "__main__":
        import sys
        extract(sys.argv[1])

    C:\junk>sofia.py sofia.txt
    AFFX-BioB-5_at /GEN=bioB /gb:J04423.1 /gb:J04423.1
    1415785_a_at /gb:NM_009840.1 /GEN=Cct8 /gb:BC009007.1

    Before I fix the duplicate in the first line, you need to say whether
    you really want the /gb:BC009007.1 in the second line thrown away -- IOW,
    what's the rule? For each prefix, either (1) get the first "word" that
    starts with that prefix or (2) get all unique such words. You choose.
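
The two candidate rules could be sketched as follows, in Python 3. The helper names `read_chunks`, `first_per_prefix` and `all_unique` are invented for this illustration. It assumes, as the script above does, that entries are separated by blank lines and that the first word of each entry is its ID.

```python
PREFIXES = ("/GEN=", "/gb:")

def read_chunks(lines):
    """Group whitespace-separated words into blank-line-delimited entries."""
    chunk = []
    for line in lines:
        words = line.split()
        if words:
            chunk.extend(words)
        elif chunk:
            yield chunk
            chunk = []
    if chunk:
        yield chunk

def first_per_prefix(chunk, prefixes=PREFIXES):
    """Rule (1): keep only the first word seen for each prefix."""
    seen = set()
    output = [chunk[0]]
    for word in chunk[1:]:
        for prefix in prefixes:
            if word.startswith(prefix):
                if prefix not in seen:
                    seen.add(prefix)
                    output.append(word)
                break
    return output

def all_unique(chunk, prefixes=PREFIXES):
    """Rule (2): keep every distinct matching word, in order of appearance."""
    output = [chunk[0]]
    for word in chunk[1:]:
        if word.startswith(tuple(prefixes)) and word not in output[1:]:
            output.append(word)
    return output
```

On the sample data, rule (1) reproduces the output Sofia asked for, while rule (2) drops the duplicate in the first entry but keeps /gb:BC009007.1 in the second.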

    Cheers,
    John
    John Machin, Sep 11, 2006
    #3
  4. Larry Bates Guest

    wrote:
    > Hi,
    >
    > I have a file with several entries in the form:
    >
    > AFFX-BioB-5_at E. coli /GEN=bioB /gb:J04423.1 NOTE=SIF
    > corresponding to nucleotides 2032-2305 of /gb:J04423.1 DEF=E.coli
    > 7,8-diamino-pelargonic acid (bioA), biotin synthetase (bioB),
    > 7-keto-8-amino-pelargonic acid synthetase (bioF), bioC protein, and
    > dethiobiotin synthetase (bioD), complete cds.
    >
    > 1415785_a_at /gb:NM_009840.1 /DB_XREF=gi:6753327 /GEN=Cct8 /FEA=FLmRNA
    > /CNT=482 /TID=Mm.17989.1 /TIER=FL+Stack /STK=281 /UG=Mm.17989 /LL=12469
    > /DEF=Mus musculus chaperonin subunit 8 (theta) (Cct8), mRNA.
    > /PROD=chaperonin subunit 8 (theta) /FL=/gb:NM_009840.1 /gb:BC009007.1
    >
    > and I would like to create a file that has only the following:
    >
    > AFFX-BioB-5_at /GEN=bioB /gb:J04423.1
    >
    > 1415785_a_at /gb:NM_009840.1 /GEN=Cct8
    >
    > Could anyone please tell me how can I do it?
    >
    > Many thanks in advance
    > Sofia
    >

    What have you tried so far?

    Hint: split the line on spaces; the first piece is the first item you want.
    Then iterate over the pieces looking for the /GEN= and /gb: pieces that
    you are interested in keeping. I am assuming that the /GEN= and /gb: data
    don't have any spaces in them. If they do, you will need to use
    regular expressions instead of split.
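
For comparison, a regular-expression version of the hint might look like the sketch below (Python 3.7+, since it relies on dicts preserving insertion order). The names `QUALIFIER` and `extract_entry` are made up here. The pattern still assumes the values contain no spaces; allowing spaces would require knowing what terminates a value, which the sample data does not make clear.

```python
import re

# A whitespace-delimited token that starts with /GEN= or /gb:.
# (?<!\S) requires the token to start at the beginning of a word,
# so the embedded "/gb:" inside "/FL=/gb:NM_009840.1" is skipped.
QUALIFIER = re.compile(r"(?<!\S)/(?:GEN=|gb:)\S+")

def extract_entry(entry):
    """Return the entry's first word plus its first /GEN= and /gb: tokens.

    `entry` is the full text of one blank-line-delimited record.
    """
    words = entry.split()
    if not words:
        return ""
    first = {}
    for token in QUALIFIER.findall(entry):
        key = "GEN" if token.startswith("/GEN=") else "gb"
        # setdefault keeps the first token seen for each key
        first.setdefault(key, token)
    # dict order is insertion order, so tokens appear as they did in the entry
    return " ".join([words[0]] + list(first.values()))
```

Applied to each record of a file split on blank lines (for example `text.split("\n\n")`), this keeps the ID and the first occurrence of each qualifier in the order they appear.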

    -Larry Bates
    Larry Bates, Sep 11, 2006
    #4
  5. Paul McGuire Guest

    <> wrote in message
    news:...
    > Hi,
    >
    > I have a file with several entries in the form:
    >
    > AFFX-BioB-5_at E. coli /GEN=bioB /gb:J04423.1 NOTE=SIF
    > corresponding to nucleotides 2032-2305 of /gb:J04423.1 DEF=E.coli
    > 7,8-diamino-pelargonic acid (bioA), biotin synthetase (bioB),
    > 7-keto-8-amino-pelargonic acid synthetase (bioF), bioC protein, and
    > dethiobiotin synthetase (bioD), complete cds.
    >
    > 1415785_a_at /gb:NM_009840.1 /DB_XREF=gi:6753327 /GEN=Cct8 /FEA=FLmRNA
    > /CNT=482 /TID=Mm.17989.1 /TIER=FL+Stack /STK=281 /UG=Mm.17989 /LL=12469
    > /DEF=Mus musculus chaperonin subunit 8 (theta) (Cct8), mRNA.
    > /PROD=chaperonin subunit 8 (theta) /FL=/gb:NM_009840.1 /gb:BC009007.1
    >
    > and I would like to create a file that has only the following:
    >
    > AFFX-BioB-5_at /GEN=bioB /gb:J04423.1
    >
    > 1415785_a_at /gb:NM_009840.1 /GEN=Cct8


    Here's a pyparsing solution that will address your immediate question, and
    also gives you some leeway for adding other "/" options to your search.
    Pyparsing's home page is at pyparsing.wikispaces.com.

    -- Paul


    data = """
    AFFX-BioB-5_at E. coli /GEN=bioB /gb:J04423.1 NOTE=SIF
    corresponding to nucleotides 2032-2305 of /gb:J04423.1 DEF=E.coli
    7,8-diamino-pelargonic acid (bioA), biotin synthetase (bioB),
    7-keto-8-amino-pelargonic acid synthetase (bioF), bioC protein, and
    dethiobiotin synthetase (bioD), complete cds.

    1415785_a_at /gb:NM_009840.1 /DB_XREF=gi:6753327 /GEN=Cct8 /FEA=FLmRNA
    /CNT=482 /TID=Mm.17989.1 /TIER=FL+Stack /STK=281 /UG=Mm.17989 /LL=12469
    /DEF=Mus musculus chaperonin subunit 8 (theta) (Cct8), mRNA.
    /PROD=chaperonin subunit 8 (theta) /FL=/gb:NM_009840.1 /gb:BC009007.1
    """

    from pyparsing import *

    # create expression we are looking for:
    # name [ junk word... ] /qualifier...
    name = Word(alphanums,printables).setResultsName("name")
    junkWord = ~(Literal("/")) + Word(printables)
    qualifier = ("/" + Word(alphas+"_-.").setResultsName("key") + \
                 oneOf("= :") + \
                 Word(printables).setResultsName("value"))
    expr = name + ZeroOrMore(junkWord) + \
           Dict(ZeroOrMore(qualifier)).setResultsName("quals")

    # use parse action to repackage qualifier data to support "dict"-like
    # access to qualifiers
    qualifier.setParseAction( lambda t: (t.key,"".join(t)) )

    # use this parse action instead if you just want whatever is
    # after the '=' or ':' delimiter in the qualifier
    # qualifier.setParseAction( lambda t: (t.key,t.value) )

    # parse data strings, showing returned data structure
    # (just to show what pyparsing results structure looks like)
    for d in data.split("\n\n"):
        res = expr.parseString(d)
        print res.dump()
        print
    print

    # now just do what the OP wanted in the first place
    for d in data.split("\n\n"):
        res = expr.parseString(d)
        print res.name, res.quals["gb"], res.quals["GEN"]


    Gives these results:
    ['AFFX-BioB-5_at', 'E.', 'coli', [('GEN', '/GEN=bioB'), ('gb',
    '/gb:J04423.1')]]
    - name: AFFX-BioB-5_at
    - quals: [('GEN', '/GEN=bioB'), ('gb', '/gb:J04423.1')]
    - GEN: /GEN=bioB
    - gb: /gb:J04423.1

    ['1415785_a_at', [('gb', '/gb:NM_009840.1'), ('DB_XREF',
    '/DB_XREF=gi:6753327'), ('GEN', '/GEN=Cct8'), ('FEA', '/FEA=FLmRNA'),
    ('CNT', '/CNT=482'), ('TID', '/TID=Mm.17989.1'), ('TIER', '/TIER=FL+Stack'),
    ('STK', '/STK=281'), ('UG', '/UG=Mm.17989'), ('LL', '/LL=12469'), ('DEF',
    '/DEF=Mus')]]
    - name: 1415785_a_at
    - quals: [('gb', '/gb:NM_009840.1'), ('DB_XREF', '/DB_XREF=gi:6753327'),
    ('GEN', '/GEN=Cct8'), ('FEA', '/FEA=FLmRNA'), ('CNT', '/CNT=482'), ('TID',
    '/TID=Mm.17989.1'), ('TIER', '/TIER=FL+Stack'), ('STK', '/STK=281'), ('UG',
    '/UG=Mm.17989'), ('LL', '/LL=12469'), ('DEF', '/DEF=Mus')]
    - CNT: /CNT=482
    - DB_XREF: /DB_XREF=gi:6753327
    - DEF: /DEF=Mus
    - FEA: /FEA=FLmRNA
    - GEN: /GEN=Cct8
    - LL: /LL=12469
    - STK: /STK=281
    - TID: /TID=Mm.17989.1
    - TIER: /TIER=FL+Stack
    - UG: /UG=Mm.17989
    - gb: /gb:NM_009840.1


    AFFX-BioB-5_at /gb:J04423.1 /GEN=bioB
    1415785_a_at /gb:NM_009840.1 /GEN=Cct8
    Paul McGuire, Sep 11, 2006
    #5