Recommended data structure for newbie

manstey · May 3, 2006

Hi,

I have a text file with about 450,000 lines. Each line has 4-5 fields,
separated by various delimiters (spaces, @, etc).

I want to load in the text file and then run routines on it to produce
2-3 additional fields.

I am a complete newbie to Python but I have the docs and have done some
experimenting. But I'm not sure what sort of data structure I should be
creating.

Example:
gee fre asd[234
ger dsf asd[243
gwer af as.:^25a

What I want is
'gee' 'fre' 'asd' '[' '234'
'ger' 'dsf' 'asd' '[' '243'
'gwer' 'af' 'as.:' '^' '25a'

etc for 450,000 lines.

Then on the basis of information in 1 or more lines, I want to append
to each line new data such as

'gee' 'fre' 'asd' '[' '234' 'geefre2' '2344f'
'ger' 'dsf' 'asd' '[' '243' 'gerdsd' '2431'
'gwer' 'af' 'as.:' '^' '25a' 'gweraf' 'sfd2'

What do you recommend? I think it is some sort of list inside a list,
but I'm not sure. I know how to generate the new data but not how to
store everything both in memory and on disk.

Regards,
Matthew

Paddy · May 3, 2006

Hi Matthew,

From your example, it is hard to work out what character or character
string is a separator, and what string needs to become a separate word
when seen in the original file.

To follow the example below you need to learn about regular expressions. The
split is based on the two REs held in the variables 'separators' and
'otherwords': the first is used to split a line, the second is used to
extract sub-words.
The output is printed in two formats that could be piped to a file for
later re-use: as a Python list, then as space-separated lines of words.
The csv module could be used to create a CSV file for reading into
spreadsheets and databases. No doubt an XML-formatted output is just as
straightforward (but XML is not, yet, my cup of tea).

The prog: word_up.py:

import re
import pprint

instring = '''gee fre asd[234
ger dsf asd[243
gwer af as.:^25a
'''
separators = r'''[ \t\r\f\v]+'''
otherwords = r'''(?x)
    (.*)
    (
        \[
      | \^
    )
    (.*)
'''

def word_up(instring, separators, otherwords):
    r""" for less confusing names substitute
    line for cameo, and w for jockstrap in the function body

    # doctest

    >>> from pprint import pprint as pp
    >>> i = 'gee fre asd[234\nger dsf asd[243\ngwer af as.:^25a\n'
    >>> print i
    gee fre asd[234
    ger dsf asd[243
    gwer af as.:^25a
    <BLANKLINE>
    >>> s = r'''[ \t\r\f\v]+'''
    >>> o = '(?x)\n (.*)\n (\n \\[\n | \\^\n )\n (.*)\n'
    >>> print o
    (?x)
     (.*)
     (
     \[
     | \^
     )
     (.*)
    <BLANKLINE>
    >>> pp(word_up(i, s, o))
    [['gee', 'fre', 'asd', '[', '234'],
     ['ger', 'dsf', 'asd', '[', '243'],
     ['gwer', 'af', 'as.:', '^', '25a']]
    """
    line_words = []
    for cameo in instring.splitlines():
        # some words are separated by separator chars
        word_split = re.split(separators, cameo)
        # extract separate sub_words
        word_extracts = []
        for word in word_split:
            matched = re.match(otherwords, word)
            if matched:
                word_extracts += [jockstrap for jockstrap in
                                  matched.groups() if jockstrap]
            else:
                word_extracts.append(word)
        line_words.append(word_extracts)
    return line_words

line_words = word_up(instring, separators, otherwords)

print '\n# Python format extracted words as list of lists'
pprint.pprint(line_words)

print '\n# Unix friendly space separated words'
for l in line_words:
    for w in l:
        print w,
    print
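
As a rough sketch of the csv route mentioned above (not part of word_up.py), the list of lists can be written to disk and read back with the standard csv module. This is written for Python 3, unlike the Python 2 code in this thread; the file name words.csv and the helpers save_rows and load_rows are made up for illustration.

```python
import csv
import os
import tempfile

# Parsed rows in the list-of-lists shape produced by word_up() above.
rows = [
    ['gee', 'fre', 'asd', '[', '234'],
    ['ger', 'dsf', 'asd', '[', '243'],
    ['gwer', 'af', 'as.:', '^', '25a'],
]

def save_rows(path, rows):
    """Write each inner list as one CSV record."""
    # newline='' lets the csv module control line endings itself.
    with open(path, 'w', newline='') as f:
        csv.writer(f).writerows(rows)

def load_rows(path):
    """Read the file back as a list of lists of strings."""
    with open(path, newline='') as f:
        return [row for row in csv.reader(f)]

# Hypothetical file name, placed in a temporary directory for the demo.
path = os.path.join(tempfile.mkdtemp(), 'words.csv')
save_rows(path, rows)
restored = load_rows(path)
print(restored == rows)
```

Every field comes back as a string, which is what the parsed words already are, so the round trip should leave the structure unchanged.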

-- Paddy

Paul McGuire · May 3, 2006

manstey said:
Hi,

I have a text file with about 450,000 lines. Each line has 4-5 fields,
separated by various delimiters (spaces, @, etc).

I want to load in the text file and then run routines on it to produce
2-3 additional fields.

<snip>

Matthew -

If you find re's to be a bit cryptic, here is a pyparsing version that may
be a bit more readable, and will easily scan through your input file:

================
from pyparsing import OneOrMore, Word, alphas, oneOf, restOfLine, lineno

data = """gee fre asd[234
ger dsf asd[243
gwer af as.:^25a"""

# define format of input line, that is:
# - one or more words, composed of alphabetic characters, periods, and colons
# - one of the characters '[' or '^'
# - the rest of the line
entry = OneOrMore( Word(alphas+".:") ) + oneOf("[ ^") + restOfLine

# scan for matches in input data - for each match, scanString will
# report the matching tokens, and start and end locations
for toks,start,end in entry.scanString(data):
    print toks
print

# scan again, this time generating additional fields
for toks,start,end in entry.scanString(data):
    tokens = list(toks)
    # change these lines to implement your
    # desired generation code - couldn't guess
    # what you wanted from your example
    tokens.append( toks[0]+toks[1] )
    tokens.append( toks[-1] + toks[-1][-1] )
    tokens.append( str( lineno(start, data) ) )
    print tokens

================
prints:
['gee', 'fre', 'asd', '[', '234']
['ger', 'dsf', 'asd', '[', '243']
['gwer', 'af', 'as.:', '^', '25a']

['gee', 'fre', 'asd', '[', '234', 'geefre', '2344', '1']
['ger', 'dsf', 'asd', '[', '243', 'gerdsf', '2433', '2']
['gwer', 'af', 'as.:', '^', '25a', 'gweraf', '25aa', '3']

You asked about data structures specifically. The core collections in
python are lists, dicts, and more recently, sets. Pyparsing returns tokens
from its matching process using a pyparsing-defined class called
ParseResults. Fortunately, using Python's "duck-typing" model, you can
treat ParseResults objects just like a list, or like a dict if you have
assigned names to the fields in the parsing expression.

Download pyparsing at http://pyparsing.sourceforge.net.

-- Paul

Paul McGuire · May 3, 2006

You asked about data structures specifically. The core collections in
python are lists, dicts, and more recently, sets.

(oops, also forgot tuples - very much like lists, but immutable)
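
To make that list concrete, here is a small sketch of how the four core collections might fit this data, assuming a list of lists for the records as the original question guessed. It uses Python 3 syntax, and all variable names are illustrative only.

```python
# list of lists: one mutable inner list per input line, easy to extend
records = [
    ['gee', 'fre', 'asd', '[', '234'],
    ['ger', 'dsf', 'asd', '[', '243'],
    ['gwer', 'af', 'as.:', '^', '25a'],
]

# append a computed field to every record in place
for rec in records:
    rec.append(rec[0] + rec[1])

# tuple: a frozen copy of a record, usable as a dict key or set member
frozen = tuple(records[0])

# dict: look a record up by its first field
# (assumes first fields are unique, which may not hold for the real file)
by_first = {rec[0]: rec for rec in records}

# set: the distinct separator characters seen in the data
separators_seen = {rec[3] for rec in records}

print(by_first['gwer'])
print(sorted(separators_seen))
```

The dict values are the same list objects as in records, so a field appended through one is visible through the other.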

Paul McGuire · May 3, 2006

Paul McGuire said:
<snip>

Matthew -

If you find re's to be a bit cryptic, here is a pyparsing version that may
be a bit more readable, and will easily scan through your input file:

<snip>

Lest I be accused of pushing pyparsing where it isn't appropriate, here is a
non-pyparsing version of the same program.

The biggest hangup with your sample data is that you can't predict what the
separator is going to be - sometimes it's '[', sometimes it's '^'. If the
separator character were more predictable, you could use simple split()
calls, as in:

data = "blah blah blah^more blah".split("^")
elements = data[0].split() + [data[1]]
print elements

['blah', 'blah', 'blah', 'more blah']

Note that this also discards the separator. Since you had something which
goes beyond simple string split()'s I thought you might find pyparsing to be
a simple alternative to re's.
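
If the separator should be kept, str.partition (available since Python 2.5) is one alternative to split(): it returns the text before the separator, the separator itself, and the text after it. A small sketch in Python 3 syntax; the helper name split_keep_separator is made up for illustration.

```python
def split_keep_separator(line, separators="[^"):
    """Split on the first separator found, keeping it as its own field."""
    for sep in separators:
        head, found, tail = line.partition(sep)
        if found:  # partition returns '' for the separator when it is absent
            return head.split() + [found, tail]
    return line.split()  # no separator present: whitespace split only

print(split_keep_separator("blah blah blah^more blah"))
print(split_keep_separator("gee fre asd[234"))
```

The separators are tried in the order given, so a line containing both '[' and '^' would be split on '['.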

Here is a version that tries different separators, then builds the
appropriate list of pieces, including the matching separator. I've also
shown an example of a generator, since you are likely to want one, parsing
100's of thousands of lines as you are.

-- Paul

=================
data = """gee fre asd[234
ger dsf asd[243
gwer af as.:^25a"""

# generator to process each line of data
# call using processData(listOfLines)
def processData(d):
    separators = "[^" #expand this string if need other separators
    for line in d:
        for s in separators:
            if s in line:
                parts = line.split(s)
                # return the first element of parts, split on whitespace
                # followed by the separator
                # followed by whatever was after the separator
                yield parts[0].split() + [ s, parts[1] ]
                break
        else:
            # no separator found - yield the whitespace-split words so
            # that every result is a list
            yield line.split()

# to call this for a text file, use something like
# for lineParts in processData( file("xyzzy.txt").readlines() )
for lineParts in processData( data.split("\n") ):
    print lineParts

print

# rerun processData, augmenting extracted values with additional
# computed values
for lineParts in processData( data.split("\n") ):
    toks = lineParts
    tokens = toks[:]
    tokens.append( toks[0]+toks[1] )
    tokens.append( toks[-1] + toks[-1][-1] )
    #~ tokens.append( str( lineno(start, data) ) )
    print tokens

====================
prints:

['gee', 'fre', 'asd', '[', '234']
['ger', 'dsf', 'asd', '[', '243']
['gwer', 'af', 'as.:', '^', '25a']

['gee', 'fre', 'asd', '[', '234', 'geefre', '2344']
['ger', 'dsf', 'asd', '[', '243', 'gerdsf', '2433']
['gwer', 'af', 'as.:', '^', '25a', 'gweraf', '25aa']
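
For a file of 450,000 lines, the same idea can be applied without reading everything into memory first, because a file object yields its lines one at a time. The following is only a sketch in Python 3 syntax: parse_line and augmented_records are hypothetical helper names, the computed fields are the same placeholders as above, and each line is assumed to have at least two fields.

```python
import os
import tempfile

def parse_line(line, separators="[^"):
    """Same splitting rule as processData above, for a single line."""
    for s in separators:
        if s in line:
            head, tail = line.split(s, 1)
            return head.split() + [s, tail]
    return line.split()

def augmented_records(path):
    """Yield one augmented record per line without loading the whole file."""
    with open(path) as f:
        for number, line in enumerate(f, 1):  # file objects iterate lazily
            toks = parse_line(line.rstrip("\n"))
            if not toks:
                continue  # skip blank lines
            # placeholder computed fields, copied from the example above
            yield toks + [toks[0] + toks[1],
                          toks[-1] + toks[-1][-1],
                          str(number)]

# Demo on a small temporary file standing in for the real input.
path = os.path.join(tempfile.mkdtemp(), "xyzzy.txt")
with open(path, "w") as f:
    f.write("gee fre asd[234\nger dsf asd[243\ngwer af as.:^25a\n")

for record in augmented_records(path):
    print(record)
```

Because augmented_records is a generator, each record can be written out or inspected as it is produced, and only one line needs to be held in memory at a time.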
