parsing tab and newline delimited text

Discussion in 'Python' started by elsa, Aug 4, 2010.

  1. elsa

    elsa Guest

    Hi,

    I have a large file of text I need to parse. Individual 'entries' are
    separated by newline characters, while fields within each entry are
    separated by tab characters.

    So, an individual entry might have this form (in printed form):

    Title date position data

    with each field separated by tabs, and a newline at the end of data.
    So, I thought I could simply open a file, read each line in in turn,
    and parse it....

    f=open('MyFile')
    line=f.readline()
    parts=line.split('\t')

    etc...

    However, 'data' is a fairly random string of characters. Because the
    files I'm processing are large, there is a good chance that in every
    file, there is a data field that might look like this:

    899998dlKKlS\lk3#kdf\nllllKK99

    or like this:

    LLLSDKJJJdkkf334$\ttttks)))K99

    so, you see the random strings '\n' and '\t' are stopping me from
    being able to parse my file correctly. Any
    suggestions on how to overcome this problem would be greatly
    appreciated.

    Many thanks,

    Elsa
     
    elsa, Aug 4, 2010
    #1

  2. elsa

    James Mills Guest

    On Wed, Aug 4, 2010 at 12:14 PM, elsa <> wrote:
    > I have a large file of text I need to parse. Individual 'entries' are
    > separated by newline characters, while fields within each entry are
    > separated by tab characters.


    Sounds to me like a job for the csv module.

    cheers
    James

    --
    -- James Mills
    --
    -- "Problems are solved by method"
     
    James Mills, Aug 4, 2010
    #2

  3. elsa

    Tim Chase Guest

    On 08/03/10 21:14, elsa wrote:
    > I have a large file of text I need to parse. Individual 'entries' are
    > separated by newline characters, while fields within each entry are
    > separated by tab characters.
    >
    > So, an individual entry might have this form (in printed form):
    >
    > Title date position data
    >
    > with each field separated by tabs, and a newline at the end of data.
    > So, I thought I could simply open a file, read each line in in turn,
    > and parse it....
    >
    > f=open('MyFile')
    > line=f.readline()
    > parts=line.split('\t')
    >
    > etc...
    >
    > However, 'data' is a fairly random string of characters. Because the
    > files I'm processing are large, there is a good chance that in every
    > file, there is a data field that might look like this:
    >
    > 899998dlKKlS\lk3#kdf\nllllKK99


    My first question is whether the line contains actual newline/tab
    characters within the field data, or the string-representation of
    the line. For one of the lines in question, what does

    print repr(line)

    (or "print line.encode('hex')") produce? If the line has extra
    literal tabs, then you may be stuck; if the line has escaped text
    (a backslash followed by an "n" or "t", i.e. 2 characters) then
    it's pretty straight-forward. Ideally, you'd see something like

    >>> print repr(line)
    'MyTitle\t2010-08-02\t42\t89998dlKKlS\\lk3#kdf\\nlllKK99'
            ^tab        ^tab^tab         ^backslash^

    where the backslashes are literal.
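
A minimal sketch of that check in Python 3 (the snippets in this thread are Python 2, where `print` is a statement and `str.encode('hex')` exists). The two sample strings are made up for illustration; they are not taken from the real file.

```python
# Two made-up field values that are easy to confuse when written down,
# but are different strings.
escaped = 'kdf\\nlll'   # backslash + letter n: two characters
literal = 'kdf\nlll'    # one real newline character

print(repr(escaped))    # repr doubles the backslash
print(repr(literal))    # repr shows the newline as a single-backslash escape

# Python 3 replacement for the Python 2 idiom line.encode('hex')
print(escaped.encode('ascii').hex())
print(literal.encode('ascii').hex())
```

In the hex dump a literal backslash shows up as `5c` followed by `6e` for the "n", while a real newline is the single byte `0a`.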

    If you know that only the last ("data") field can contain such
    characters, you can at least keep embedded tabs inside that field
    by limiting the number of splits:

    parts = line.split('\t', 3)

    That doesn't solve the newline problem, though. Your file's
    definition gives you no way to interpret input such as

    filedata = 'title1\tdate1\tpos1\tdata1\nxxxx\tyyyy\tzzzz\twwww\n'

    Would xxxx/yyyy/zzzz/wwww be a continuation of data1 or are they
    the items in the next row?
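
Both points can be tried out quickly. This is only a sketch in Python 3 with made-up sample strings, assuming four fields per record as described in the original post.

```python
# Made-up record whose last field contains a real tab character.
line = 'title1\tdate1\tpos1\tda\tta1\n'

# A plain split cuts the data field in two ...
print(line.rstrip('\n').split('\t'))

# ... while a maxsplit of 3 stops after the first three tabs.
parts = line.rstrip('\n').split('\t', 3)
print(parts)

# The newline case stays ambiguous: this input can be read as two records,
# or as one record whose data field holds a newline and three tabs.
filedata = 'title1\tdate1\tpos1\tdata1\nxxxx\tyyyy\tzzzz\twwww\n'
rows = [row.split('\t', 3) for row in filedata.rstrip('\n').split('\n')]
print(rows)
```

The code can only pick one of the two readings (here: two records); nothing in the data itself says which one was intended.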

    -tkc
     
    Tim Chase, Aug 4, 2010
    #3
  4. elsa

    MRAB Guest

    elsa wrote:
    > Hi,
    >
    > I have a large file of text I need to parse. Individual 'entries' are
    > separated by newline characters, while fields within each entry are
    > separated by tab characters.
    >
    > So, an individual entry might have this form (in printed form):
    >
    > Title date position data
    >
    > with each field separated by tabs, and a newline at the end of data.
    > So, I thought I could simply open a file, read each line in in turn,
    > and parse it....
    >
    > f=open('MyFile')
    > line=f.readline()
    > parts=line.split('\t')
    >
    > etc...
    >
    > However, 'data' is a fairly random string of characters. Because the
    > files I'm processing are large, there is a good chance that in every
    > file, there is a data field that might look like this:
    >
    > 899998dlKKlS\lk3#kdf\nllllKK99
    >
    > or like this:
    >
    > LLLSDKJJJdkkf334$\ttttks)))K99
    >
    > so, you see the random strings '\n' and '\t' are stopping me from
    > being able to parse my file correctly. Any
    > suggestions on how to overcome this problem would be greatly
    > appreciated.
    >

    When you say random strings '\n', etc, are they the backslash character
    \ followed by the letter n? If so, then you don't have a problem. They
    are \ followed by n.

    If, on the other hand, by '\n' you mean the newline character, then,
    well, that's a newline character, and there's (probably) nothing you can
    do about it.
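
To see which of the two cases a file is in, one option is to count the fields on each physical line. A rough sketch in Python 3, assuming four fields per record as in the original post; `irregular_lines` and the sample lines are made up for illustration.

```python
def irregular_lines(lines, expected_fields=4):
    """Return (line_number, field_count) for every line that does not
    split into exactly expected_fields tab-separated fields."""
    bad = []
    for number, line in enumerate(lines, start=1):
        count = len(line.rstrip('\n').split('\t'))
        if count != expected_fields:
            bad.append((number, count))
    return bad


sample = [
    'title1\tdate1\tpos1\tabc\\ndef\n',   # backslash + n: harmless
    'title2\tdate2\tpos2\tabc\n',         # data cut short by a real newline
    'def\n',                              # ... and its stray remainder
    'title3\tdate3\tpos3\tab\tcd\n',      # real tab inside the data field
]
print(irregular_lines(sample))
```

An open file object can be passed in place of the list, since it also yields one line at a time. Note that a record cut short by a real newline can still look complete (line 2 of the sample), so the report flags the stray remainder rather than the line that caused it.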
     
    MRAB, Aug 4, 2010
    #4
  5. elsa

    elsa Guest

    On Aug 4, 12:49 pm, Tim Chase <> wrote:
    > On 08/03/10 21:14, elsa wrote:
    >
    > > I have a large file of text I need to parse. Individual 'entries' are
    > > separated by newline characters, while fields within each entry are
    > > separated by tab characters.
    >
    > > So, an individual entry might have this form (in printed form):
    >
    > > Title    date   position   data
    >
    > > with each field separated by tabs, and a newline at the end of data.
    > > So, I thought I could simply open a file, read each line in in turn,
    > > and parse it....
    >
    > > f=open('MyFile')
    > > line=f.readline()
    > > parts=line.split('\t')
    >
    > > etc...
    >
    > > However, 'data' is a fairly random string of characters. Because the
    > > files I'm processing are large, there is a good chance that in every
    > > file, there is a data field that might look like this:
    >
    > > 899998dlKKlS\lk3#kdf\nllllKK99
    >
    > My first question is whether the line contains actual newline/tab
    > characters within the field data, or the string-representation of
    > the line.  For one of the lines in question, what does
    >
    >    print repr(line)


    here is what I get at the interactive prompt:

    >>> line = """IIIIIIIIIIIIIIIIIIIIIG=4448>IIIIIIIIIIIIIIIIIIIIIIIIIIIIIG666HIIIIII;;;IIIIIIEIIII??55
    ... :E?IEEEEFHGCACIIIII699;66IG11G???IIIIIIIIIIIIG???GGGII@@@@GG?;;9>CCIIIIIIIIIIICCCCGHHIIIGEEDBB?9951//////6=ABB=EEGII98AEIECCC>>;A=F@;;44//11::=<<?ADECCCEEEEEIIIIHHHIIGCCCEI99"""
    >>> line
    'IIIIIIIIIIIIIIIIIIIIIG=4448>IIIIIIIIIIIIIIIIIIIIIIIIIIIIIG666HIIIIII;;;IIIIIIEIIII??55\n:E?IEEEEFHGCACIIIII699;66IG11G???IIIIIIIIIIIIG???GGGII@@@@GG?;;9>CCIIIIIIIIIIICCCCGHHIIIGEEDBB?9951//////6=ABB=EEGII98AEIECCC>>;A=F@;;44//11::=<<?ADECCCEEEEEIIIIHHHIIGCCCEI99'
    >>> print repr(line)
    'IIIIIIIIIIIIIIIIIIIIIG=4448>IIIIIIIIIIIIIIIIIIIIIIIIIIIIIG666HIIIIII;;;IIIIIIEIIII??55\n:E?IEEEEFHGCACIIIII699;66IG11G???IIIIIIIIIIIIG???GGGII@@@@GG?;;9>CCIIIIIIIIIIICCCCGHHIIIGEEDBB?9951//////6=ABB=EEGII98AEIECCC>>;A=F@;;44//11::=<<?ADECCCEEEEEIIIIHHHIIGCCCEI99'

    basically this is numeric values encoded into ASCII symbols. So '\' is
    a value, 'n' is a value, 'E' is a value etc... it's
    all part of the same data field. It's just unfortunate that '\' and
    'n' have ended up together. (I didn't design this file,
    btw, I'm just expected to process it!)

    Elsa.
     
    elsa, Aug 4, 2010
    #5
  6. On Tue, 3 Aug 2010 20:35:34 -0700 (PDT), elsa <>
    declaimed the following in gmane.comp.python.general:

    >
    > here is what I get at the interactive prompt:
    >
    > >>> line = """IIIIIIIIIIIIIIIIIIIIIG=4448>IIIIIIIIIIIIIIIIIIIIIIIIIIIIIG666HIIIIII;;;IIIIIIEIIII??55
    > ... :E?IEEEEFHGCACIIIII699;66IG11G???IIIIIIIIIIIIG???GGGII@@@@GG?;;9>CCIIIIIIIIIIICCCCGHHIIIGEEDBB?9951//////6=ABB=EEGII98AEIECCC>>;A=F@;;44//11::=<<?ADECCCEEEEEIIIIHHHIIGCCCEI99"""
    >

    In this you CREATED a <newline> character when you hit enter after
    the 55 on the first line; so yes, that will appear as \n if you
    display it.

    How do I know you CREATED a <newline>? The presence of the ...
    continuation prompt from the interpreter.
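
The same effect can be reproduced with a much shorter string. A small sketch in Python 3; both values are made up and only borrow a few characters from the data above.

```python
# Pressing Enter inside a triple-quoted literal puts a real newline in it.
typed = """??55
:E?I"""

# A raw string keeps backslash and n as two separate characters, which is
# what reading a file that contains those two characters would produce.
from_file = r'??55\n:E?I'

print(repr(typed))
print(repr(from_file))
print(len(typed), len(from_file))
```

The typed literal is one character shorter, because its newline is a single character rather than a backslash followed by "n".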

    > basically this is numeric values encoded into ASCII symbols. So '\' is
    > a value, 'n' is a value, 'E' is a value etc... it's
    > all part of the same data field. It's just unfortunate that '\' and
    > 'n' have ended up together. (I didn't design this file,
    > btw, I'm just expected to process it!)


    If the file physically has the character \ and the character n next
    to each other, there is no problem -- they are two characters! They are
    not a representation of the single character <newline>

    -=-=-=-=-=-=-= Data.txt
    IIIIIIIIIIIIIIIIIIIIIG=4448>IIIIIIIIIIIIIIIIIIIIIIIIIIIIIG666HIIIIII;;;IIIIIIEIIII??55:E?IEEEEFHGCACIIIII699;66IG11G???IIIIIIIIIIIIG???GGGII@@@@GG?;;9>CCIIIIIIIIIIICCCCGHHIIIGEEDBB?9951//////6=ABB=EEGII98AEIECCC>>;A=F@;;44//11::=<<?ADECCCEEEEEIIIIHHHIIGCCCEI99
    slash and n\nslash and t\ttab newline
    observe
    -=-=-=-=-=-=-=

    Contains three lines, the long gibberish, a line that contains the
    characters \ n and \ t, along with a tab and, of course, the newlines
    that end each line.

    -=-=-=-=-=-=-=-
    >>> import os
    >>> os.chdir("Python Progs")
    >>> dta = open("data.txt")
    >>> data = dta.read()
    >>> print data
    IIIIIIIIIIIIIIIIIIIIIG=4448>IIIIIIIIIIIIIIIIIIIIIIIIIIIIIG666HIIIIII;;;IIIIIIEIIII??55:E?IEEEEFHGCACIIIII699;66IG11G???IIIIIIIIIIIIG???GGGII@@@@GG?;;9>CCIIIIIIIIIIICCCCGHHIIIGEEDBB?9951//////6=ABB=EEGII98AEIECCC>>;A=F@;;44//11::=<<?ADECCCEEEEEIIIIHHHIIGCCCEI99
    slash and n\nslash and t\ttab newline
    observe
    >>> print repr(data)
    'IIIIIIIIIIIIIIIIIIIIIG=4448>IIIIIIIIIIIIIIIIIIIIIIIIIIIIIG666HIIIIII;;;IIIIIIEIIII??55:E?IEEEEFHGCACIIIII699;66IG11G???IIIIIIIIIIIIG???GGGII@@@@GG?;;9>CCIIIIIIIIIIICCCCGHHIIIGEEDBB?9951//////6=ABB=EEGII98AEIECCC>>;A=F@;;44//11::=<<?ADECCCEEEEEIIIIHHHIIGCCCEI99\nslash and n\\nslash and t\\ttab\tnewline\nobserve'
    >>>

    -=-=-=-=-=-=-=-

    Notice how the \ n and \ t pairs DID NOT CAUSE ANY PROBLEM when read
    and printed... Also notice how, in the repr() print, the \ character is
    doubled, while the newlines and tabs show up with single \.
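
The same round trip can be applied to a full four-field record. This is a sketch in Python 3 under the assumption that the file really contains backslash-plus-letter pairs; the record and the file name `data.txt` are made up.

```python
import os
import tempfile

# Made-up record: the data field holds backslash+n and backslash+t as
# ordinary two-character sequences (note the doubled backslashes here).
record = 'Title\t2010-08-04\t42\t8999\\lk3#kdf\\nlll$\\tttks\n'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'data.txt')
    with open(path, 'w', newline='') as out:
        out.write(record)
    # Read the file back line by line and split each line on real tabs.
    with open(path, newline='') as src:
        rows = [line.rstrip('\n').split('\t') for line in src]

print(rows)
```

The record comes back as a single row with four fields, and the backslash pairs stay inside the data field untouched.
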
    --
    Wulfraed Dennis Lee Bieber AF6VN
    HTTP://wlfraed.home.netcom.com/
     
    Dennis Lee Bieber, Aug 4, 2010
    #6
  7. elsa

    alex23 Guest

    On Aug 4, 12:14 pm, elsa <> wrote:
    > So, an individual entry might have this form (in printed form):
    >
    > Title    date   position   data
    >
    > with each field separated by tabs, and a newline at the end of data.


    As James posted, the csv module is ideal for this sort of thing.
    Dealing with delimited text seems obvious but, as with most things,
    there are some edge cases that can bite you, so it's generally best to
    use utility code that has already dealt with them.

    If you're using Python 2.6+ you can use it in conjunction with
    namedtuple for some very easy record retrieval:

    >>> import csv
    >>> from collections import namedtuple
    >>> Record = namedtuple('Record', 'title date position data')
    >>> tabReader = csv.reader(open('test.txt','rb'), delimiter='\t')
    >>> for record in (Record(*row) for row in tabReader):
    ...     print record.title, record.data
    ...
    title1 data1\t\n\n\t
    title2 data2\t\t\t\t
    title3 data3\n\n\n\n
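
For Python 3 the same idea needs two small changes: the csv module wants a text-mode file opened with `newline=''` instead of `'rb'`, and `print` is a function. The sketch below also passes `quoting=csv.QUOTE_NONE`, on the assumption that the data field can contain `"` characters that should not be treated as quoting. `read_records` and the sample rows are made up for illustration.

```python
import csv
import io
from collections import namedtuple

Record = namedtuple('Record', 'title date position data')


def read_records(stream):
    """Yield one Record per tab-separated line of stream."""
    # QUOTE_NONE stops csv from treating '"' inside a field as a quote mark.
    reader = csv.reader(stream, delimiter='\t', quoting=csv.QUOTE_NONE)
    for row in reader:
        yield Record(*row)


# In-memory stand-in for a file opened with open(path, newline='').
sample = io.StringIO(
    'title1\tdate1\tpos1\tII"G=44\\n8>I\n'
    'title2\tdate2\tpos2\t99\\t;;"?5\n',
    newline='',
)
records = list(read_records(sample))
for record in records:
    print(record.title, record.data)
```

`Record(*row)` raises a `TypeError` for any row that does not have exactly four fields, which is a cheap way to notice a record broken by a real tab or newline.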

    Hope this helps.
     
    alex23, Aug 4, 2010
    #7
