Recommended data structure for newbie

manstey

Hi,

I have a text file with about 450,000 lines. Each line has 4-5 fields,
separated by various delimiters (spaces, @, etc).

I want to load in the text file and then run routines on it to produce
2-3 additional fields.

I am a complete newbie to Python, but I have the docs and have done some
experimenting. I'm not sure what sort of data structure I should be
creating.

Example:
gee fre asd[234
ger dsf asd[243
gwer af as.:^25a

What I want is
'gee' 'fre' 'asd' '[' '234'
'ger' 'dsf' 'asd' '[' '243'
'gwer' 'af' 'as.:' '^' '25a'

etc for 450,000 lines.

Then on the basis of information in 1 or more lines, I want to append
to each line new data such as

'gee' 'fre' 'asd' '[' '234' 'geefre2' '2344f'
'ger' 'dsf' 'asd' '[' '243' 'gerdsd' '2431'
'gwer' 'af' 'as.:' '^' '25a' 'gweraf' 'sfd2'

What do you recommend? I think it is some sort of list inside a list,
but I'm not sure. I know how to generate the new data but not how to
store everything both in memory and on disk.

Regards,
Matthew
 
Paddy

Hi Matthew,
From your example, it is hard to work out what character or character
string is a separator, and what string needs to become a separate word
when seen in the original file.

The example below relies on regular expressions, so you will need to
learn a little about them. The split is based on the two REs held in the
variables 'separators' and 'otherwords': the first is used to split a
line into words, the second to extract sub-words from a word.
The output is printed in two formats that could be redirected to a file
for later re-use: as a Python list, then as space-separated lines of words.
The csv module could be used to create a csv file for reading into
spreadsheets and databases. No doubt an XML formatted output is just as
straight-forward, (but XML is not, yet, my cup of tea).
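
For the csv route mentioned above, a minimal sketch might look like the following. It assumes the lines have already been split into lists of fields; the helper names save_rows and load_rows are made up for illustration, and an in-memory buffer stands in for a real file so the example runs as-is.

```python
import csv
import io

# Rows as produced by the splitting step (copied from the thread's sample).
rows = [
    ['gee', 'fre', 'asd', '[', '234'],
    ['ger', 'dsf', 'asd', '[', '243'],
    ['gwer', 'af', 'as.:', '^', '25a'],
]

def save_rows(rows, stream):
    """Write one CSV record per row; csv quotes fields only where needed."""
    csv.writer(stream, lineterminator='\n').writerows(rows)

def load_rows(stream):
    """Read the records back as a list of lists of strings."""
    return [row for row in csv.reader(stream)]

# An in-memory buffer stands in for a real file here. With a real file you
# would pass open('words.csv', 'w', newline='') instead; 'words.csv' is a
# made-up name.
buffer = io.StringIO()
save_rows(rows, buffer)
buffer.seek(0)
restored = load_rows(buffer)
print(restored == rows)
```

Every field comes back as a string, so any numeric fields would need converting after loading.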

The prog: word_up.py:

import re
import pprint

instring = '''gee fre asd[234
ger dsf asd[243
gwer af as.:^25a
'''
# words are separated by runs of whitespace
separators = r'''[ \t\r\f\v]+'''
# a word containing '[' or '^' is split into (before, delimiter, after)
otherwords = r'''(?x)
    (.*)
    (
        \[
      | \^
    )
    (.*)
'''

def word_up(instring, separators, otherwords):
    r""" for less confusing names substitute
    line for cameo, and w for jockstrap in the function body :)

    # doctest
    >>> from pprint import pprint as pp
    >>> i = 'gee fre asd[234\nger dsf asd[243\ngwer af as.:^25a\n'
    >>> print(i, end='')
    gee fre asd[234
    ger dsf asd[243
    gwer af as.:^25a
    >>> s = r'''[ \t\r\f\v]+'''
    >>> o = '(?x)\n (.*)\n (\n \\[\n | \\^\n )\n (.*)\n'
    >>> print(o, end='')
    (?x)
     (.*)
     (
     \[
     | \^
     )
     (.*)
    >>> pp(word_up(i, s, o))
    [['gee', 'fre', 'asd', '[', '234'],
     ['ger', 'dsf', 'asd', '[', '243'],
     ['gwer', 'af', 'as.:', '^', '25a']]
    """
    line_words = []
    for cameo in instring.splitlines():
        # some words are separated by separator chars
        word_split = re.split(separators, cameo)
        # extract separate sub_words
        word_extracts = []
        for word in word_split:
            matched = re.match(otherwords, word)
            if matched:
                # keep only the non-empty groups
                word_extracts += [jockstrap for jockstrap in
                                  matched.groups() if jockstrap]
            else:
                word_extracts.append(word)
        line_words.append(word_extracts)
    return line_words

line_words = word_up(instring, separators, otherwords)

print('\n# Python format extracted words as list of lists')
pprint.pprint(line_words)

print('\n# Unix friendly space separated words')
for l in line_words:
    for w in l:
        print(w, end=' ')
    print()
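
If the Python-list output is redirected to a file, one way to load it back later is ast.literal_eval from the standard library. A minimal sketch, with the file replaced by an in-memory string so it runs as-is (the variable names saved_text and restored are only illustrative):

```python
import ast
import pprint

line_words = [
    ['gee', 'fre', 'asd', '[', '234'],
    ['ger', 'dsf', 'asd', '[', '243'],
    ['gwer', 'af', 'as.:', '^', '25a'],
]

# pformat returns the same text that pprint.pprint would print
saved_text = pprint.pformat(line_words)

# literal_eval parses Python literals only (lists, strings, numbers, ...);
# unlike eval it will not run arbitrary code found in the file
restored = ast.literal_eval(saved_text)

print(restored == line_words)
```

This has to parse the whole text in one go, so for a file of 450,000 lines a line-oriented format is probably the more practical choice.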


-- Paddy
 

Paul McGuire

manstey said:
Hi,

I have a text file with about 450,000 lines. Each line has 4-5 fields,
separated by various delimiters (spaces, @, etc).

I want to load in the text file and then run routines on it to produce
2-3 additional fields.

<snip>

Matthew -

If you find re's to be a bit cryptic, here is a pyparsing version that may
be a bit more readable, and will easily scan through your input file:

================
from pyparsing import OneOrMore, Word, alphas, oneOf, restOfLine, lineno

data = """gee fre asd[234
ger dsf asd[243
gwer af as.:^25a"""

# define format of input line, that is:
# - one or more words, composed of alphabetic characters, periods, and
#   colons
# - one of the characters '[' or '^'
# - the rest of the line
entry = OneOrMore( Word(alphas+".:") ) + oneOf("[ ^") + restOfLine

# scan for matches in input data - for each match, scanString will
# report the matching tokens, and start and end locations
for toks,start,end in entry.scanString(data):
    print(toks)
print()

# scan again, this time generating additional fields
for toks,start,end in entry.scanString(data):
    tokens = list(toks)
    # change these lines to implement your
    # desired generation code - couldn't guess
    # what you wanted from your example
    tokens.append( toks[0]+toks[1] )
    tokens.append( toks[-1] + toks[-1][-1] )
    tokens.append( str( lineno(start, data) ) )
    print(tokens)

================
prints:
['gee', 'fre', 'asd', '[', '234']
['ger', 'dsf', 'asd', '[', '243']
['gwer', 'af', 'as.:', '^', '25a']

['gee', 'fre', 'asd', '[', '234', 'geefre', '2344', '1']
['ger', 'dsf', 'asd', '[', '243', 'gerdsf', '2433', '2']
['gwer', 'af', 'as.:', '^', '25a', 'gweraf', '25aa', '3']


You asked about data structures specifically. The core collections in
Python are lists, dicts, and more recently, sets. Pyparsing returns tokens
from its matching process using a pyparsing-defined class called
ParseResults. Fortunately, using Python's "duck-typing" model, you can
treat ParseResults objects just like a list, or like a dict if you have
assigned names to the fields in the parsing expression.
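
For readers without pyparsing installed, the list-or-dict style of access can be approximated with named groups in the standard re module. This is only an analogue, not pyparsing's ParseResults API, and the field names (word1, word2, word3, delim, rest) are hypothetical, chosen for illustration:

```python
import re

# Hypothetical field names chosen for illustration only.
LINE = re.compile(
    r'(?P<word1>\S+)\s+(?P<word2>\S+)\s+'
    r'(?P<word3>[^\[^]+)(?P<delim>[\[^])(?P<rest>.*)'
)

match = LINE.match('gee fre asd[234')
as_list = list(match.groups())     # positional access, like a list
as_dict = match.groupdict()        # access by field name, like a dict

print(as_list[0], as_list[-1])
print(as_dict['delim'], as_dict['rest'])
```

The pattern assumes exactly three words before the delimiter, which holds for the sample lines but may not hold for the whole file.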

Download pyparsing at http://pyparsing.sourceforge.net.

-- Paul
 
Paul McGuire

You asked about data structures specifically. The core collections in
Python are lists, dicts, and more recently, sets.

(oops, also forgot tuples - very much like lists, but immutable)
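
As a rough sketch of how each of those collections could be applied to the sample data: the derived field below is a placeholder, and the dict index assumes the first field is unique per line, which may not be true of the real file.

```python
# The sample lines from the thread, already split into fields.
records = [
    ['gee', 'fre', 'asd', '[', '234'],
    ['ger', 'dsf', 'asd', '[', '243'],
    ['gwer', 'af', 'as.:', '^', '25a'],
]

# list of lists: mutable, keeps file order, easy to append new fields to
for fields in records:
    fields.append(fields[0] + fields[1])    # placeholder derived field

# dict: look a record up by its first field (assumes first fields are unique)
by_first = {fields[0]: fields for fields in records}

# set: the distinct delimiters that actually occur
delimiters = {fields[3] for fields in records}

# tuple: freeze a finished record so it cannot be changed by accident
frozen = tuple(records[0])

print(by_first['ger'])
print(sorted(delimiters))
print(frozen)
```

The dict holds references to the same inner lists, so a field appended through records is also visible through by_first.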
 
Paul McGuire

Paul McGuire said:
<snip>

Matthew -

If you find re's to be a bit cryptic, here is a pyparsing version that may
be a bit more readable, and will easily scan through your input file:
<snip>

Lest I be accused of pushing pyparsing where it isn't appropriate, here is a
non-pyparsing version of the same program.

The biggest hangup with your sample data is that you can't predict what the
separator is going to be - sometimes it's '[', sometimes it's '^'. If the
separator character were more predictable, you could use simple split()
calls, as in:

data = "blah blah blah^more blah".split("^")
elements = data[0].split() + [data[1]]
print(elements)

['blah', 'blah', 'blah', 'more blah']

Note that this also discards the separator. Since you had something which
goes beyond simple string split()'s I thought you might find pyparsing to be
a simple alternative to re's.
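
If keeping the separator matters, one stdlib option is re.split with a capturing group, which returns the matched delimiter as its own item. A minimal sketch, assuming '[' and '^' are the only delimiters of interest; the function name split_keep_delimiter is made up for illustration:

```python
import re

# Character class listing the delimiters seen in the sample data;
# extend it if the real file uses others (an assumption, not a full list).
DELIMITER = re.compile(r'([\[^])')

def split_keep_delimiter(line):
    """Split on the first '[' or '^' and keep it as its own field."""
    parts = DELIMITER.split(line, maxsplit=1)
    if len(parts) == 1:          # no delimiter on this line
        return line.split()
    head, sep, tail = parts
    return head.split() + [sep, tail]

print(split_keep_delimiter('gee fre asd[234'))
print(split_keep_delimiter('gwer af as.:^25a'))
```

Because maxsplit is 1, any further '[' or '^' characters stay inside the last field.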

Here is a version that tries different separators, then builds the
appropriate list of pieces, including the matching separator. I've also
shown an example of a generator, since you are likely to want one when
parsing hundreds of thousands of lines.

-- Paul

=================
data = """gee fre asd[234
ger dsf asd[243
gwer af as.:^25a"""

# generator to process each line of data
# call using processData(listOfLines)
def processData(d):
    separators = "[^" #expand this string if need other separators
    for line in d:
        for s in separators:
            if s in line:
                # split only on the first occurrence of the separator
                parts = line.split(s, 1)
                # return the first element of parts, split on whitespace
                # followed by the separator
                # followed by whatever was after the separator
                yield parts[0].split() + [ s, parts[1] ]
                break
        else:
            # no separator found - fall back to whitespace-split fields
            yield line.split()

# to call this for a text file, use something like
# for lineParts in processData( open("xyzzy.txt").read().splitlines() )
for lineParts in processData( data.split("\n") ):
    print(lineParts)

print()

# rerun processData, augmenting extracted values with additional
# computed values
for lineParts in processData( data.split("\n") ):
    toks = lineParts
    tokens = toks[:]
    tokens.append( toks[0]+toks[1] )
    tokens.append( toks[-1] + toks[-1][-1] )
    #~ tokens.append( str( lineno(start, data) ) )
    print(tokens)

====================
prints:

['gee', 'fre', 'asd', '[', '234']
['ger', 'dsf', 'asd', '[', '243']
['gwer', 'af', 'as.:', '^', '25a']

['gee', 'fre', 'asd', '[', '234', 'geefre', '2344']
['ger', 'dsf', 'asd', '[', '243', 'gerdsf', '2433']
['gwer', 'af', 'as.:', '^', '25a', 'gweraf', '25aa']
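
For the full 450,000-line file, the same generator idea can be chained so that only one line is held in memory at a time and results are written straight back to disk. A minimal sketch under a few assumptions: the derived fields are the same placeholders used above, the output is tab-separated, and io.StringIO objects stand in for real files (the names parse_lines and augment, and the file names in the comments, are hypothetical).

```python
import io

def parse_lines(lines, separators="[^"):
    """Yield a list of fields for each non-empty line, one at a time."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        for sep in separators:
            if sep in line:
                head, tail = line.split(sep, 1)
                yield head.split() + [sep, tail]
                break
        else:
            # no separator on this line: plain whitespace split
            yield line.split()

def augment(records):
    """Append placeholder derived fields; swap in the real rules here."""
    for fields in records:
        yield fields + [fields[0] + fields[1], fields[-1] + fields[-1][-1]]

# io.StringIO objects stand in for open('input.txt') and
# open('output.txt', 'w'); both file names are hypothetical.
source = io.StringIO("gee fre asd[234\nger dsf asd[243\ngwer af as.:^25a\n")
sink = io.StringIO()
for fields in augment(parse_lines(source)):
    sink.write("\t".join(fields) + "\n")   # tab-separated, one record per line

print(sink.getvalue(), end="")
```

A rule that depends on more than one line would need to keep the relevant earlier records around, for example in a list or a dict keyed on one of the fields.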
 
