parsing tab and newline delimited text

elsa

Hi,

I have a large file of text I need to parse. Individual 'entries' are
separated by newline characters, while fields within each entry are
separated by tab characters.

So, an individual entry might have this form (in printed form):

Title date position data

with each field separated by tabs, and a newline at the end of data.
So, I thought I could simply open a file, read each line in in turn,
and parse it....

f=open('MyFile')
line=f.readline()
parts=line.split('\t')

etc...

However, 'data' is a fairly random string of characters. Because the
files I'm processing are large, there is a good chance that in every
file, there is a data field that might look like this:

899998dlKKlS\lk3#kdf\nllllKK99

or like this:

LLLSDKJJJdkkf334$\ttttks)))K99

so, you see the random strings '\n' and '\t' are stopping me from
being able to parse my file correctly. Any
suggestions on how to overcome this problem would be greatly
appreciated.

Many thanks,

Elsa
 
James Mills

I have a large file of text I need to parse. Individual 'entries' are
separated by newline characters, while fields within each entry are
separated by tab characters.

Sounds to me like a job for the csv module.
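A minimal sketch of that suggestion, in Python 3 syntax. The sample text and the helper name read_records are made up for illustration, and quoting=csv.QUOTE_NONE is an assumption: it keeps the reader from treating a '"' inside the random data as a quote character.

```python
import csv
import io

# Hypothetical sample: two records, four tab-separated fields each.
# The data fields contain a literal backslash followed by "n" or "t".
sample = (
    "title1\t2010-08-02\t42\t899998dlKKlS\\lk3#kdf\\nllllKK99\n"
    "title2\t2010-08-03\t43\tLLLSDKJJJdkkf334$\\ttttks)))K99\n"
)

def read_records(stream):
    """Return a list of rows, each row a list of field strings."""
    # QUOTE_NONE: treat '"' as ordinary data rather than as a quote character.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)

rows = read_records(io.StringIO(sample, newline=""))
for row in rows:
    print(len(row), row[0], row[3])
```

With a real file, the same function could be given a file object opened with newline="" instead of the StringIO stand-in.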

cheers
James
 
Tim Chase

I have a large file of text I need to parse. Individual 'entries' are
separated by newline characters, while fields within each entry are
separated by tab characters.

So, an individual entry might have this form (in printed form):

Title date position data

with each field separated by tabs, and a newline at the end of data.
So, I thought I could simply open a file, read each line in in turn,
and parse it....

f=open('MyFile')
line=f.readline()
parts=line.split('\t')

etc...

However, 'data' is a fairly random string of characters. Because the
files I'm processing are large, there is a good chance that in every
file, there is a data field that might look like this:

899998dlKKlS\lk3#kdf\nllllKK99

My first question is whether the line contains actual newline/tab
characters within the field data, or the string-representation of
the line. For one of the lines in question, what does

print repr(line)

(or "print line.encode('hex')") produce? If the line has extra
literal tabs, then you may be stuck; if the line has escaped text
(a backslash followed by an "n" or "t", i.e. 2 characters) then
it's pretty straight-forward. Ideally, you'd see something like
'MyTitle\t2010-08-02\t42\t89998dlKKlS\\lk3#kdf\\nlllKK99'
        ^tab        ^tab^tab         ^backslash^

where the backslashes are literal.
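To make the distinction concrete, here is a small sketch in Python 3 syntax (where the check above would be written print(repr(line))). The two short strings are invented stand-ins for a data field, and .encode("ascii").hex() is used as a Python 3 substitute for the Python 2 encode('hex') call.

```python
# Two hypothetical data fields that sound alike when described in prose.
escaped = "kdf\\nllll"   # a backslash followed by the letter n: two characters
literal = "kdf\nllll"    # one real newline character

for name, value in (("escaped", escaped), ("literal", literal)):
    # repr() doubles a literal backslash and shows a real newline as \n.
    print(name, len(value), repr(value), "\n" in value)
    # The hex dump shows 5c 6e for backslash + n, and 0a for a newline.
    print(name, value.encode("ascii").hex())
```

Only the second string would be broken in two by line-based reading.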

If you know that only the last ("data") field can contain such
characters, you can at least cope with stray tabs by limiting the
number of splits to the first three:

parts = line.split('\t', 3)

That doesn't solve the newline problem, though: your file's
definition makes it impossible to interpret data like this unambiguously:

filedata = 'title1\tdate1\tpos1\tdata1\nxxxx\tyyyy\tzzzz\twwww\n'

Would xxxx/yyyy/zzzz/wwww be a continuation of data1 or are they
the items in the next row?
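Both points can be seen in a short sketch (Python 3 syntax). The first line is a made-up record whose data field contains a real tab; filedata is the string from the post.

```python
# Hypothetical record whose last field contains a real tab character.
line = "title1\tdate1\tpos1\tLLLSDKJJJdkkf334$\tttks)))K99\n"

naive = line.rstrip("\n").split("\t")       # also splits inside the data field
limited = line.rstrip("\n").split("\t", 3)  # at most 3 splits, so 4 parts

print(len(naive), len(limited))
print(repr(limited[3]))

# The ambiguous case: nothing distinguishes "data1 continues on the next
# line" from "a second record starts here", so this yields two 4-field rows.
filedata = "title1\tdate1\tpos1\tdata1\nxxxx\tyyyy\tzzzz\twwww\n"
rows = [row.split("\t", 3) for row in filedata.split("\n") if row]
print(rows)
```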

-tkc
 
MRAB

elsa said:
Hi,

I have a large file of text I need to parse. Individual 'entries' are
separated by newline characters, while fields within each entry are
separated by tab characters.

So, an individual entry might have this form (in printed form):

Title date position data

with each field separated by tabs, and a newline at the end of data.
So, I thought I could simply open a file, read each line in in turn,
and parse it....

f=open('MyFile')
line=f.readline()
parts=line.split('\t')

etc...

However, 'data' is a fairly random string of characters. Because the
files I'm processing are large, there is a good chance that in every
file, there is a data field that might look like this:

899998dlKKlS\lk3#kdf\nllllKK99

or like this:

LLLSDKJJJdkkf334$\ttttks)))K99

so, you see the random strings '\n' and '\t' are stopping me from
being able to parse my file correctly. Any
suggestions on how to overcome this problem would be greatly
appreciated.

When you say random strings '\n', etc., are they the backslash character
\ followed by the letter n? If so, then you don't have a problem. They
are just \ followed by n.

If, on the other hand, by '\n' you mean the newline character, then,
well, that's a newline character, and there's (probably) nothing you can
do about it.
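One possible way to find out which case applies is to count the fields on every line. This is only a sketch in Python 3 syntax: the helper name field_count_histogram and the sample lines are invented, and it assumes every well-formed record has exactly four fields. Lines with any other count would point to real tab or newline characters inside the data.

```python
from collections import Counter

def field_count_histogram(lines, delimiter="\t"):
    """Return a Counter mapping fields-per-line to the number of such lines."""
    return Counter(line.rstrip("\n").count(delimiter) + 1 for line in lines)

# Hypothetical sample; with a real file, pass the open file object instead.
sample = [
    "title1\tdate1\tpos1\tabc\\ndef\n",   # backslash + n in data: still 4 fields
    "title2\tdate2\tpos2\tabc\n",         # data cut short by a real newline ...
    "def\n",                              # ... which leaves a 1-field fragment
    "title3\tdate3\tpos3\tabc\tdef\n",    # real tab in data: 5 fields
]

histogram = field_count_histogram(sample)
print(sorted(histogram.items()))
```

A histogram with a single key of 4 would suggest, without proving it, that the data fields contain no real tabs or newlines.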
 
elsa

My first question is whether the line contains actual newline/tab
characters within the field data, or the string-representation of
the line.  For one of the lines in question, what does

   print repr(line)

here is what I get at the interactive prompt:
.... :E?IEEEEFHGCACIIIII699;66IG11G???IIIIIIIIIIIIG???GGGII@@@@GG?;;
9>CCIIIIIIIIIIICCCCGHHIIIGEEDBB?9951//////6=ABB=EEGII98AEIECCC>>;A=F@;;
44//11::=<<?ADECCCEEEEEIIIIHHHIIGCCCEI99"""
'IIIIIIIIIIIIIIIIIIIIIG=4448>IIIIIIIIIIIIIIIIIIIIIIIIIIIIIG666HIIIIII;;;IIIIIIEIIII??
55\n:E?IEEEEFHGCACIIIII699;66IG11G???IIIIIIIIIIIIG???GGGII@@@@GG?;;
9>CCIIIIIIIIIIICCCCGHHIIIGEEDBB?9951//////6=ABB=EEGII98AEIECCC>>;A=F@;;
44//11::=<<?ADECCCEEEEEIIIIHHHIIGCCCEI99'
'IIIIIIIIIIIIIIIIIIIIIG=4448>IIIIIIIIIIIIIIIIIIIIIIIIIIIIIG666HIIIIII;;;IIIIIIEIIII??
55\n:E?IEEEEFHGCACIIIII699;66IG11G???IIIIIIIIIIIIG???GGGII@@@@GG?;;
9>CCIIIIIIIIIIICCCCGHHIIIGEEDBB?9951//////6=ABB=EEGII98AEIECCC>>;A=F@;;
44//11::=<<?ADECCCEEEEEIIIIHHHIIGCCCEI99'

basically this is numeric values encoded into ASCII symbols. So '\' is
a value, 'n' is a value, 'E' is a value etc... it's
all part of the same data field. It's just unfortunate that '\' and
'n' have ended up together. (I didn't design this file,
btw, I'm just expected to process it!)

Elsa.
 
Dennis Lee Bieber

here is what I get at the interactive prompt:

... :E?IEEEEFHGCACIIIII699;66IG11G???IIIIIIIIIIIIG???GGGII@@@@GG?;;
9>CCIIIIIIIIIIICCCCGHHIIIGEEDBB?9951//////6=ABB=EEGII98AEIECCC>>;A=F@;;
44//11::=<<?ADECCCEEEEEIIIIHHHIIGCCCEI99"""
In this you CREATED a <newline> character when you hit enter after
the 55 on the first line; so yes, that will appear as \n if you
display it.

basically this is numeric values encoded into ASCII symbols. So '\' is
a value, 'n' is a value, 'E' is a value etc... it's
all part of the same data field. It's just unfortunate that '\' and
'n' have ended up together. (I didn't design this file,
btw, I'm just expected to process it!)

If the file physically has the character \ and the character n next
to each other, there is no problem -- they are two characters! They are
not a representation of the single character <newline>.

-=-=-=-=-=-=-= Data.txt
IIIIIIIIIIIIIIIIIIIIIG=4448>IIIIIIIIIIIIIIIIIIIIIIIIIIIIIG666HIIIIII;;;IIIIIIEIIII??55:E?IEEEEFHGCACIIIII699;66IG11G???IIIIIIIIIIIIG???GGGII@@@@GG?;;9>CCIIIIIIIIIIICCCCGHHIIIGEEDBB?9951//////6=ABB=EEGII98AEIECCC>>;A=F@;;44//11::=<<?ADECCCEEEEEIIIIHHHIIGCCCEI99
slash and n\nslash and t\ttab newline
observe
-=-=-=-=-=-=-=

Contains three lines, the long gibberish, a line that contains the
characters \ n and \ t, along with a tab and, of course, the newlines
that end each line.

-=-=-=-=-=-=-=-
IIIIIIIIIIIIIIIIIIIIIG=4448>IIIIIIIIIIIIIIIIIIIIIIIIIIIIIG666HIIIIII;;;IIIIIIEIIII??55:E?IEEEEFHGCACIIIII699;66IG11G???IIIIIIIIIIIIG???GGGII@@@@GG?;;9>CCIIIIIIIIIIICCCCGHHIIIGEEDBB?9951//////6=ABB=EEGII98AEIECCC>>;A=F@;;44//11::=<<?ADECCCEEEEEIIIIHHHIIGCCCEI99
slash and n\nslash and t\ttab newline
observe
'IIIIIIIIIIIIIIIIIIIIIG=4448>IIIIIIIIIIIIIIIIIIIIIIIIIIIIIG666HIIIIII;;;IIIIIIEIIII??55:E?IEEEEFHGCACIIIII699;66IG11G???IIIIIIIIIIIIG???GGGII@@@@GG?;;9>CCIIIIIIIIIIICCCCGHHIIIGEEDBB?9951//////6=ABB=EEGII98AEIECCC>>;A=F@;;44//11::=<<?ADECCCEEEEEIIIIHHHIIGCCCEI99\nslash and n\\nslash and t\\ttab\tnewline\nobserve'
-=-=-=-=-=-=-=-

Notice how the \ n and \ t pairs DID NOT CAUSE ANY PROBLEM when read
and printed... Also notice how, in the repr() print, the \ character is
doubled, while the newlines and tabs show up with single \.
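The same experiment can be repeated with a short sketch in Python 3 syntax. The file is written to a temporary directory, and the first line is a shortened stand-in for the long gibberish line, so the exact content is an assumption rather than a copy of Data.txt.

```python
import os
import tempfile

# Three lines: a stand-in for the gibberish, a line holding backslash + n and
# backslash + t as ordinary characters plus one real tab, and a final word.
content = (
    "IIIG=4448>II??55:E?IEEEE\n"
    "slash and n\\nslash and t\\ttab\tnewline\n"
    "observe"
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "Data.txt")
    with open(path, "w", newline="") as f:
        f.write(content)
    with open(path, newline="") as f:
        data = f.read()

print(data)        # the backslash pairs print unchanged
print(repr(data))  # literal backslashes are doubled; newlines and tabs are not

lines = data.split("\n")
fields = lines[1].split("\t")  # only the one real tab splits the line
print(len(lines), fields)
```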
 
alex23

So, an individual entry might have this form (in printed form):

Title    date   position   data

with each field separated by tabs, and a newline at the end of data.

As James posted, the csv module is ideal for this sort of thing.
Dealing with delimited text seems obvious but, as with most things,
there are some edge cases that can bite you, so it's generally best to
use utility code that has already dealt with them.

If you're using Python 2.6+ you can use it in conjunction with
namedtuple for some very easy record retrieval:
.... print record.title, record.data
....
title1 data1\t\n\n\t
title2 data2\t\t\t\t
title3 data3\n\n\n\n
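The input lines of that session are not shown, so the following is only a sketch of how csv and namedtuple might be combined, not the original code. It uses Python 3 syntax; the Record field names follow the "Title date position data" layout from the question, and the sample text and quoting=csv.QUOTE_NONE are assumptions.

```python
import csv
import io
from collections import namedtuple

# Field names follow the "Title date position data" layout from the question.
Record = namedtuple("Record", "title date position data")

# Hypothetical stand-in for the file; the data fields hold literal
# backslash + t and backslash + n pairs, not real tabs or newlines.
sample = (
    "title1\tdate1\tpos1\tdata1\\t\\n\\n\\t\n"
    "title2\tdate2\tpos2\tdata2\\t\\t\\t\\t\n"
    "title3\tdate3\tpos3\tdata3\\n\\n\\n\\n\n"
)

reader = csv.reader(
    io.StringIO(sample, newline=""),
    delimiter="\t",
    quoting=csv.QUOTE_NONE,  # treat '"' as ordinary data
)
records = [Record._make(row) for row in reader]

for record in records:
    print(record.title, record.data)
```

Each row becomes a Record, so fields can be read by name (record.title, record.data) rather than by index.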

Hope this helps.
 
