How to efficiently extract information from a structured text file

Imaginationworks · Feb 16, 2010

Hi,

I am trying to read object information from a text file (approx.
30,000 lines) in the format shown below, where each line corresponds
to a line in the text file. Currently I read the whole file into a
list of strings using readlines(), then use a for loop to search for
"= {" and "};" to determine the Object, SubObject, and SubSubObject.
My questions are:

1) Is there an efficient way to search the whole string list to find
the locations of the tokens (such as '= {' or '};')?

2) Is there an efficient way to extract the object information that
you would suggest?

Thanks,

- Jeremy

===== Structured text file =================
Object1 = {

....

SubObject1 = {
.....

SubSubObject1 = {
....
};
};

SubObject2 = {
.....

SubSubObject21 = {
....
};
};

SubObjectN = {
.....

SubSubObjectN = {
....
};
};
};

Gary Herron · Feb 17, 2010

Imaginationworks said:

1) Is there an efficient way to search the whole string list to find
the locations of the tokens (such as '= {' or '};')?

Yes. Read the *whole* file into a single string using the file.read()
method, and then search through the string using string methods (for
simple things) or re, the regular expression module (for more
complex searches).

Note: There is a point where a file becomes large enough that reading
the whole file into memory at once (either as a single string or as a
list of strings) is foolish. However, 30,000 lines doesn't push that
boundary.
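
A minimal sketch of that suggestion, assuming the tokens look exactly
like the sample in the original post. The short sample string stands in
for the result of open(filename).read(), and the names OPEN_RE and
CLOSE_RE are illustrative, not part of any library:

```python
import re

# Stand-in for: data = open(filename).read()
data = """Object1 = {
....
SubObject1 = {
.....
};
};
"""

# Simple case: plain string methods locate a fixed token.
first_open = data.find("= {")   # index of the first '= {', or -1 if absent
n_close = data.count("};")      # number of '};' tokens in the text

# More complex case: a regular expression also captures the object name.
# [ \t]* (rather than \s*) keeps each match on a single line.
OPEN_RE = re.compile(r"^[ \t]*(\w+)[ \t]*=[ \t]*\{", re.MULTILINE)
CLOSE_RE = re.compile(r"^[ \t]*\};", re.MULTILINE)

# (name, character offset) of every opener, and the offset of every closer.
opens = [(m.group(1), m.start(1)) for m in OPEN_RE.finditer(data)]
closes = [m.start() for m in CLOSE_RE.finditer(data)]

print(first_open, n_close)
print(opens)
print(closes)
```

This only locates the tokens; pairing each opener with its matching
closer is a separate step.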

2) Is there an efficient way to extract the object information that
you would suggest?

Again, the re module has nice ways to find a pattern and parse out
pieces of it. Building a good regular expression takes time,
experience, and a bit of black magic... To do so for this case, we
might need more knowledge of your format. Also, regular expressions
have their limits. For instance, if the sub objects can nest to any
level, then regular expressions alone can't solve the whole problem,
and you'll need a more robust parser.
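
One way to get past that limit without a full parser, sketched under
the assumption that every "Name = {" and every "};" sits on its own
line as in the sample: let regular expressions recognise single lines
and keep a stack to track the nesting. The function name find_objects
is hypothetical:

```python
import re

OPEN_RE = re.compile(r"^\s*(\w+)\s*=\s*\{\s*$")   # e.g. 'SubObject1 = {'
CLOSE_RE = re.compile(r"^\s*\};\s*$")             # e.g. '};'

def find_objects(lines):
    """Return (path, first_line, last_line) per object, 0-based line numbers."""
    stack = []   # currently open objects as (name, line number)
    found = []
    for lineno, line in enumerate(lines):
        m = OPEN_RE.match(line)
        if m:
            stack.append((m.group(1), lineno))
        elif CLOSE_RE.match(line):
            if not stack:
                raise ValueError("unmatched '};' on line %d" % lineno)
            # The path is built from every object that is still open.
            path = "/".join(name for name, _ in stack)
            _, start = stack.pop()
            found.append((path, start, lineno))
    if stack:
        raise ValueError("unclosed object %r" % stack[-1][0])
    return found

sample = """Object1 = {
....
SubObject1 = {
.....
SubSubObject1 = {
....
};
};
};
""".splitlines()

for entry in find_objects(sample):
    print(entry)
```

Objects are reported in the order they close, so the innermost object
comes first. The stack depth is what a single regular expression cannot
keep track of.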

Imaginationworks · Feb 17, 2010

Gary and Rhodri, thank you for the suggestions.

Paul McGuire · Feb 17, 2010

If you open(filename).read() this file into a variable named data, the
following pyparsing parser will pick out your nested brace
expressions:

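from pyparsing import *

# Punctuation is matched but suppressed from the results
EQ,LBRACE,RBRACE,SEMI = map(Suppress,"={};")
ident = Word(alphas, alphanums)
# contents is recursive, so declare it here and define it below
contents = Forward()
defn = Group(ident + EQ + Group(LBRACE + contents + RBRACE + SEMI))

# contents: nested definitions, or any other token that is not a brace
contents << ZeroOrMore(defn | ~(LBRACE|RBRACE) + Word(printables))

results = defn.parseString(data)

print(results)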

Prints:

[
 ['Object1',
  ['...',
   ['SubObject1',
    ['....',
     ['SubSubObject1',
      ['...']
     ]
    ]
   ],
   ['SubObject2',
    ['....',
     ['SubSubObject21',
      ['...']
     ]
    ]
   ],
   ['SubObjectN',
    ['....',
     ['SubSubObjectN',
      ['...']
     ]
    ]
   ]
  ]
 ]
]
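
A possible follow-up step, sketched on plain nested lists of the shape
printed above (for example what results.asList() returns), so it runs
without pyparsing. The helper names to_tree and walk and the dictionary
keys are hypothetical choices, not part of pyparsing:

```python
def to_tree(defn):
    """Turn [name, [item, ...]] into a dict with 'values' and 'children'."""
    name, contents = defn
    node = {"name": name, "values": [], "children": []}
    for item in contents:
        if isinstance(item, list):
            # A nested [name, [...]] definition becomes a child node.
            node["children"].append(to_tree(item))
        else:
            # Any other token is kept as a plain value of this object.
            node["values"].append(item)
    return node

def walk(node, prefix=""):
    """Yield (path, values) for the node and all of its descendants."""
    path = prefix + node["name"]
    yield path, node["values"]
    for child in node["children"]:
        yield from walk(child, path + "/")

# Hand-written sample in the same shape as the output above.
parsed = [
    ["Object1",
     ["...",
      ["SubObject1", ["....", ["SubSubObject1", ["..."]]]],
      ["SubObject2", ["....", ["SubSubObject21", ["..."]]]],
     ]],
]

tree = to_tree(parsed[0])
for path, values in walk(tree):
    print(path, values)
```

Each object is then addressable by a path such as
Object1/SubObject1/SubSubObject1.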

-- Paul

Imaginationworks · Feb 17, 2010

Wow, that is great! Thanks

Jonathan Gardner · Feb 18, 2010

Parse it!

Go full-bore with a real parser. You may want to consider one of the
many fine Pythonic implementations of modern parsers, or break out
more traditional parsing tools.

This format is nested, meaning that you can't use regexes to parse
what you want out of it. You're going to need a real, full-bore,
no-holds-barred parser for this.
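
For a format this small, a hand-written recursive-descent parser may
already be enough. The following is a rough sketch rather than a
definitive implementation: it assumes tokens are separated by
whitespace as in the sample, does no error reporting for unbalanced
braces, and the names tokenize, parse_body and parse are made up for
illustration:

```python
import re

# One token per match: an opener (capturing its name), a closer,
# or any other run of non-whitespace characters.
TOKEN_RE = re.compile(r"(?P<open>\w+)\s*=\s*\{|(?P<close>\};)|(?P<word>\S+)")

def tokenize(text):
    """Yield (kind, value) pairs; kind is 'open', 'close' or 'word'."""
    for m in TOKEN_RE.finditer(text):
        yield m.lastgroup, m.group(m.lastgroup)

def parse_body(tokens):
    """Collect items until the matching '};' or the end of the input."""
    items = []
    for kind, value in tokens:
        if kind == "open":
            # Recurse: the nested call consumes tokens from the same
            # iterator up to and including the matching '};'.
            items.append((value, parse_body(tokens)))
        elif kind == "close":
            return items
        else:
            items.append(value)
    return items

def parse(text):
    """Return a list of (name, body) tuples for the top-level objects."""
    return parse_body(tokenize(text))

sample = """Object1 = {
....
SubObject1 = {
.....
};
};
"""
print(parse(sample))
```

The recursion in parse_body mirrors the nesting of the format, which is
the part a regular expression alone cannot express.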

Don't worry, the road is not easy but the destination is absolutely
worth it.

Once you come to appreciate and understand parsing, you have earned
the right to call yourself a red-belt programmer. To get your black
belt, you'll need to write your own compiler. Having mastered these
two tasks, there is no problem you cannot tackle.

And once you realize that every program is really a compiler, then you
have truly mastered the Zen of Programming in Any Programming Language
That Will Ever Exist.

With this understanding, you will judge programming language utility
based solely on how hard it is to write a compiler in it, and
complexity based on how hard it is to write a compiler for it. (Notice
that there are quite a few parsers written in Python, as well as
Jython, PyPy, and others written for Python!)

Steven D'Aprano · Feb 18, 2010

Jonathan Gardner said:

And once you realize that every program is really a compiler, then you
have truly mastered the Zen of Programming in Any Programming Language
That Will Ever Exist.

In the same way that every tool is really a screwdriver.

Paul McGuire · Feb 18, 2010

Steven D'Aprano said:

In the same way that every tool is really a screwdriver.

The way I learned this was:
- Use the right tool for the right job.
- Every tool is a hammer.

-- Paul
