parsing tab and newline delimited text

elsa · Aug 4, 2010

Hi,

I have a large file of text I need to parse. Individual 'entries' are
separated by newline characters, while fields within each entry are
separated by tab characters.

So, an individual entry might have this form (in printed form):

Title date position data

with each field separated by tabs, and a newline at the end of data.
So, I thought I could simply open a file, read each line in in turn,
and parse it....

f=open('MyFile')
line=f.readline()
parts=line.split('\t')

etc...

However, 'data' is a fairly random string of characters. Because the
files I'm processing are large, there is a good chance that in every
file, there is a data field that might look like this:

899998dlKKlS\lk3#kdf\nllllKK99

or like this:

LLLSDKJJJdkkf334$\ttttks)))K99

so, you see the random strings '\n' and '\t' are stopping me from
being able to parse my file correctly. Any
suggestions on how to overcome this problem would be greatly
appreciated.

Many thanks,

Elsa

James Mills · Aug 4, 2010

I have a large file of text I need to parse. Individual 'entries' are
separated by newline characters, while fields within each entry are
separated by tab characters.

Sounds to me like a job of the csv module.

cheers
James

Tim Chase · Aug 4, 2010

I have a large file of text I need to parse. Individual 'entries' are
separated by newline characters, while fields within each entry are
separated by tab characters.

So, an individual entry might have this form (in printed form):

Title date position data

with each field separated by tabs, and a newline at the end of data.
So, I thought I could simply open a file, read each line in in turn,
and parse it....

f=open('MyFile')
line=f.readline()
parts=line.split('\t')

etc...

However, 'data' is a fairly random string of characters. Because the
files I'm processing are large, there is a good chance that in every
file, there is a data field that might look like this:

899998dlKKlS\lk3#kdf\nllllKK99

My first question is whether the line contains actual newline/tab
characters within the field data, or the string-representation of
the line. For one of the lines in question, what does

print repr(line)

(or "print line.encode('hex')") produce? If the line has extra
literal tabs, then you may be stuck; if the line has escaped text
(a backslash followed by an "n" or "t", i.e. 2 characters) then
it's pretty straight-forward. Ideally, you'd see something like
'MyTitle\t2010-08-02\t42\t89998dlKKlS\\lk3#kdf\\nlllKK99'
        ^tab        ^tab^tab         ^backslash^

where the backslashes are literal.

If you know that it's the last ("data") field that can contain
such characters, you can at least catch non-newline characters by
only splitting the first N splits:

parts = line.split('\t', 3)

That doesn't solve the newline problem: your file's definition
gives you no way to tell how to interpret data such as

filedata = 'title1\tdate1\tpos1\tdata1\nxxxx\tyyyy\tzzzz\twwww\n'

Would xxxx/yyyy/zzzz/wwww be a continuation of data1 or are they
the items in the next row?
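
The limited split can be sketched as follows. This is a minimal
illustration in Python 3, assuming the layout is exactly four fields
with only the last one allowed to contain tabs; the sample line and
the variable names are made up for the example.

```python
# Hypothetical sample line: three well-behaved fields, then a data field
# that itself contains a literal tab character.
line = "MyTitle\t2010-08-02\t42\tLLLSDKJJJdkkf334$\ttttks)))K99\n"

# Strip only the trailing newline, then split on at most the first 3 tabs,
# so any later tabs stay inside the final field.
title, date, position, data = line.rstrip("\n").split("\t", 3)

print(repr(data))
```

A real newline inside the data field would still end the line early,
so this only helps with embedded tabs.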

-tkc

MRAB · Aug 4, 2010

elsa said:
Hi,

I have a large file of text I need to parse. Individual 'entries' are
separated by newline characters, while fields within each entry are
separated by tab characters.

So, an individual entry might have this form (in printed form):

Title date position data

with each field separated by tabs, and a newline at the end of data.
So, I thought I could simply open a file, read each line in in turn,
and parse it....

f=open('MyFile')
line=f.readline()
parts=line.split('\t')

etc...

However, 'data' is a fairly random string of characters. Because the
files I'm processing are large, there is a good chance that in every
file, there is a data field that might look like this:

899998dlKKlS\lk3#kdf\nllllKK99

or like this:

LLLSDKJJJdkkf334$\ttttks)))K99

so, you see the random strings '\n' and '\t' are stopping me from
being able to parse my file correctly. Any
suggestions on how to overcome this problem would be greatly
appreciated.

When you say random strings '\n', etc, are they the backslash character
\ followed by the letter n? If so, then you don't have a problem. They
are \ followed by n.

If, on the other hand, by '\n' you mean the newline character, then,
well, that's a newline character, and there's (probably) nothing you can
do about it.

elsa · Aug 4, 2010

My first question is whether the line contains actual newline/tab
characters within the field data, or the string-representation of
the line. For one of the lines in question, what does

print repr(line)

here is what I get at the interactive prompt:
.... :E?IEEEEFHGCACIIIII699;66IG11G???IIIIIIIIIIIIG???GGGII@@@@GG?;;
9>CCIIIIIIIIIIICCCCGHHIIIGEEDBB?9951//////6=ABB=EEGII98AEIECCC>>;A=F@;;
44//11::=<<?ADECCCEEEEEIIIIHHHIIGCCCEI99"""

'IIIIIIIIIIIIIIIIIIIIIG=4448>IIIIIIIIIIIIIIIIIIIIIIIIIIIIIG666HIIIIII;;;IIIIIIEIIII??
55\n:E?IEEEEFHGCACIIIII699;66IG11G???IIIIIIIIIIIIG???GGGII@@@@GG?;;
9>CCIIIIIIIIIIICCCCGHHIIIGEEDBB?9951//////6=ABB=EEGII98AEIECCC>>;A=F@;;
44//11::=<<?ADECCCEEEEEIIIIHHHIIGCCCEI99'

'IIIIIIIIIIIIIIIIIIIIIG=4448>IIIIIIIIIIIIIIIIIIIIIIIIIIIIIG666HIIIIII;;;IIIIIIEIIII??
55\n:E?IEEEEFHGCACIIIII699;66IG11G???IIIIIIIIIIIIG???GGGII@@@@GG?;;
9>CCIIIIIIIIIIICCCCGHHIIIGEEDBB?9951//////6=ABB=EEGII98AEIECCC>>;A=F@;;
44//11::=<<?ADECCCEEEEEIIIIHHHIIGCCCEI99'

basically this is numeric values encoded into ASCII symbols. So '\' is
a value, 'n' is a value, 'E' is a value etc... it's
all part of the same data field. It's just unfortunate that '\' and
'n' have ended up together. (I didn't design this file,
btw, I'm just expected to process it!)

Elsa.

Dennis Lee Bieber · Aug 4, 2010

here is what I get at the interactive prompt:

... :E?IEEEEFHGCACIIIII699;66IG11G???IIIIIIIIIIIIG???GGGII@@@@GG?;;
9>CCIIIIIIIIIIICCCCGHHIIIGEEDBB?9951//////6=ABB=EEGII98AEIECCC>>;A=F@;;
44//11::=<<?ADECCCEEEEEIIIIHHHIIGCCCEI99"""

In this you CREATED a <newline> character when you hit enter after
the 55 on the first line; so yes, that will appear as \n if you
display it.

basically this is numeric values encoded into ASCII symbols. So '\' is
a value, 'n' is a value, 'E' is a value etc... it's
all part of the same data field. It's just unfortunate that '\' and
'n' have ended up together. (I didn't design this file,
btw, I'm just expected to process it!)

If the file physically has the character \ and the character n next
to each other, there is no problem -- they are two characters! They are
not a representation of the single character <newline>.

-=-=-=-=-=-=-= Data.txt
IIIIIIIIIIIIIIIIIIIIIG=4448>IIIIIIIIIIIIIIIIIIIIIIIIIIIIIG666HIIIIII;;;IIIIIIEIIII??55:E?IEEEEFHGCACIIIII699;66IG11G???IIIIIIIIIIIIG???GGGII@@@@GG?;;9>CCIIIIIIIIIIICCCCGHHIIIGEEDBB?9951//////6=ABB=EEGII98AEIECCC>>;A=F@;;44//11::=<<?ADECCCEEEEEIIIIHHHIIGCCCEI99
slash and n\nslash and t\ttab newline
observe
-=-=-=-=-=-=-=

Data.txt contains three lines: the long gibberish; a line that
contains the characters \ n and \ t, along with a real tab; and the
word "observe". Real newlines, of course, end the first two lines.

-=-=-=-=-=-=-=-
IIIIIIIIIIIIIIIIIIIIIG=4448>IIIIIIIIIIIIIIIIIIIIIIIIIIIIIG666HIIIIII;;;IIIIIIEIIII??55:E?IEEEEFHGCACIIIII699;66IG11G???IIIIIIIIIIIIG???GGGII@@@@GG?;;9>CCIIIIIIIIIIICCCCGHHIIIGEEDBB?9951//////6=ABB=EEGII98AEIECCC>>;A=F@;;44//11::=<<?ADECCCEEEEEIIIIHHHIIGCCCEI99
slash and n\nslash and t\ttab newline
observe
'IIIIIIIIIIIIIIIIIIIIIG=4448>IIIIIIIIIIIIIIIIIIIIIIIIIIIIIG666HIIIIII;;;IIIIIIEIIII??55:E?IEEEEFHGCACIIIII699;66IG11G???IIIIIIIIIIIIG???GGGII@@@@GG?;;9>CCIIIIIIIIIIICCCCGHHIIIGEEDBB?9951//////6=ABB=EEGII98AEIECCC>>;A=F@;;44//11::=<<?ADECCCEEEEEIIIIHHHIIGCCCEI99\nslash and n\\nslash and t\\ttab\tnewline\nobserve'
-=-=-=-=-=-=-=-

Notice how the \ n and \ t pairs DID NOT CAUSE ANY PROBLEM when read
and printed... Also notice how, in the repr() print, the \ character is
doubled, while the newlines and tabs show up with single \.
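
The same check can be run against an actual entry. What follows is
only a sketch in Python 3, not the script used above: the sample
entry, the temporary file and the variable names are assumptions made
for illustration. It writes one entry whose data field contains a
backslash followed by "n", reads it back, and splits it.

```python
import os
import tempfile

# Hypothetical entry: the data field holds a backslash followed by "n"
# (two ordinary characters), not a real newline.
entry = "MyTitle\t2010-08-02\t42\t899998dlKKlS\\lk3#kdf\\nllllKK99\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "Data.txt")
    # newline="" turns off newline translation in both directions.
    with open(path, "w", newline="") as f:
        f.write(entry)
    with open(path, newline="") as f:
        lines = f.readlines()

# The file yields a single line, and the backslash-n pair stays in the field.
fields = lines[0].rstrip("\n").split("\t")
print(len(lines), len(fields))
print(repr(fields[3]))
```

If this prints more lines or fields than expected for a real input
file, the file contains genuine newline or tab characters rather than
backslash pairs.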

alex23 · Aug 4, 2010

So, an individual entry might have this form (in printed form):

Title date position data

with each field separated by tabs, and a newline at the end of data.

As James posted, the csv module is ideal for this sort of thing.
Dealing with delimited text seems obvious but, as with most things,
there are some edge cases that can bite you, so it's generally best to
use utility code that has already dealt with them.

If you're using Python 2.6+ you can use it in conjunction with
namedtuple for some very easy record retrieval:
.... print record.title, record.data
....
title1 data1\t\n\n\t
title2 data2\t\t\t\t
title3 data3\n\n\n\n
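
A minimal sketch of the csv plus namedtuple approach, written for
Python 3 rather than the 2.6 of the time. The field names, the Record
type and the in-memory sample are assumptions chosen to match the
"Title date position data" layout; the data fields hold backslash
pairs as ordinary text.

```python
import csv
import io
from collections import namedtuple

# Field names are assumed from the "Title date position data" layout.
Record = namedtuple("Record", "title date position data")

# Stand-in for the real file. The backslash sequences in the data fields
# are ordinary two-character text, not real tabs or newlines.
sample = io.StringIO(
    "title1\tdate1\tpos1\tdata1\\t\\n\\n\\t\n"
    "title2\tdate2\tpos2\tdata2\\t\\t\\t\\t\n"
    "title3\tdate3\tpos3\tdata3\\n\\n\\n\\n\n",
    newline="",
)

# QUOTE_NONE makes the reader treat quote characters in the data as text.
reader = csv.reader(sample, delimiter="\t", quoting=csv.QUOTE_NONE)
records = [Record._make(row) for row in reader]

for record in records:
    print(record.title, record.data)
```

Record._make raises TypeError if a row does not have exactly four
fields, which flags any entry whose data contains a real tab.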

Hope this helps.
