How to efficiently extract information from a structured text file

Imaginationworks

Hi,

I am trying to read object information from a text file (approx.
30,000 lines) with the format shown below; each line corresponds to a
line in the text file. Currently, the whole file is read into a
list of strings using readlines(), and a for loop then searches for "= {"
and "};" to determine the Object, SubObject, and SubSubObject. My
questions are:

1) Is there an efficient way to search the whole string list to find
the locations of the tokens (such as '= {' or '};')?

2) Is there an efficient way to extract the object information that
you could suggest?

Thanks,

- Jeremy



===== Structured text file =================
Object1 = {

....

SubObject1 = {
.....

SubSubObject1 = {
....
};
};

SubObject2 = {
.....

SubSubObject21 = {
....
};
};

SubObjectN = {
.....

SubSubObjectN = {
....
};
};
};
 
Gary Herron

Imaginationworks said:
1) Is there any efficient method that I can search the whole string
list to find the location of the tokens(such as '= {' or '};'

Yes. Read the *whole* file into a single string using the file.read()
method, and then search through the string using string methods (for
simple things) or re, the regular expression module (for more
complex searches).

Note: There is a point where a file becomes large enough that reading
the whole file into memory at once (either as a single string or as a
list of strings) is foolish. However, 30,000 lines doesn't push that
boundary.
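
As a rough sketch of that single-string approach: the sample text, the variable names, and the exact pattern below are made up for illustration and assume the tokens look like the ones in the question.

```python
import re

# Hypothetical sample standing in for open(filename).read()
data = """Object1 = {
    value 1
    SubObject1 = {
        value 2
    };
};
"""

# Simple case: str.find() returns the offset of the first match, or -1
first_open = data.find("= {")

# General case: one compiled pattern finds both tokens in a single pass
token_re = re.compile(r"=\s*\{|\};")
tokens = [(m.start(), m.group()) for m in token_re.finditer(data)]

for offset, text in tokens:
    print(offset, repr(text))
```

The finditer() loop walks the string once, so it avoids rescanning a list of lines for each token.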

Imaginationworks said:
2) Is there any efficient ways to extract the object information you
may suggest?

Again, the re module has nice ways to find a pattern and parse
out pieces of it. Building a good regular expression takes time,
experience, and a bit of black magic... To do so for this case, we
might need more knowledge of your format. Also, regular expressions have
their limits. For instance, if the sub objects can nest to any level,
then regular expressions alone can't solve the whole problem,
and you'll need a more robust parser.
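
One way to get past that limit without a parsing library is to let regular expressions recognise single lines and keep the nesting on a stack. This is only a sketch: it assumes every "Name = {" and every "};" sits on a line of its own, as in the sample, and the function name and dict layout are invented here.

```python
import re

OPEN_RE = re.compile(r"^\s*(\w+)\s*=\s*\{\s*$")   # e.g. "SubObject1 = {"
CLOSE_RE = re.compile(r"^\s*\};\s*$")              # e.g. "};"


def parse_objects(text):
    """Return a list of top-level nodes: {"name", "lines", "children"}."""
    root = {"name": None, "lines": [], "children": []}
    stack = [root]                      # innermost open object is stack[-1]
    for line_no, line in enumerate(text.splitlines(), 1):
        opened = OPEN_RE.match(line)
        if opened:
            node = {"name": opened.group(1), "lines": [], "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        elif CLOSE_RE.match(line):
            if len(stack) == 1:
                raise ValueError("unmatched '};' on line %d" % line_no)
            stack.pop()
        elif line.strip():
            # Any other non-blank line is kept as raw content
            stack[-1]["lines"].append(line.strip())
    if len(stack) != 1:
        raise ValueError("missing '};' for %r" % stack[-1]["name"])
    return root["children"]


sample = """Object1 = {
    a
    SubObject1 = {
        b
        SubSubObject1 = {
            c
        };
    };
};
"""
tree = parse_objects(sample)
print(tree[0]["name"], [child["name"] for child in tree[0]["children"]])
```

Because the depth lives in the stack rather than in the pattern, this handles any nesting level in one pass over the file.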
 
Imaginationworks

Gary and Rhodri, thank you for the suggestions.
 
Paul McGuire

If you open(filename).read() this file into a variable named data, the
following pyparsing parser will pick out your nested brace
expressions:

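from pyparsing import (Forward, Group, Suppress, Word, ZeroOrMore,
                       alphanums, alphas, printables)

# Punctuation is matched but left out of the results
EQ, LBRACE, RBRACE, SEMI = map(Suppress, "={};")
ident = Word(alphas, alphanums)
# contents is recursive, so declare it now and define it below
contents = Forward()
defn = Group(ident + EQ + Group(LBRACE + contents + RBRACE + SEMI))

# A body holds nested definitions or any other non-brace tokens
contents <<= ZeroOrMore(defn | ~(LBRACE | RBRACE) + Word(printables))

results = defn.parseString(data)

print(results)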

Prints:

[
 ['Object1',
   ['...',
    ['SubObject1',
      ['....',
        ['SubSubObject1',
          ['...']
        ]
      ]
    ],
    ['SubObject2',
      ['....',
       ['SubSubObject21',
         ['...']
       ]
      ]
    ],
    ['SubObjectN',
      ['....',
       ['SubSubObjectN',
         ['...']
       ]
      ]
    ]
   ]
 ]
]
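
If named lookups are more convenient than nested lists, a structure like the one printed above can be walked recursively. This is a sketch on plain Python lists, which is the shape results.asList() should return; the to_tree name and the dict keys are invented for illustration.

```python
def to_tree(defn):
    """Convert one [name, body] pair into a dict, recursing into children."""
    name, body = defn
    node = {"name": name, "values": [], "children": []}
    for item in body:
        if isinstance(item, str):
            node["values"].append(item)            # plain content token
        else:
            node["children"].append(to_tree(item))  # nested [name, body]
    return node


# Hand-written stand-in for the parsed output shown above
results = [
    ["Object1",
     ["...",
      ["SubObject1", ["....", ["SubSubObject1", ["..."]]]],
      ["SubObject2", ["....", ["SubSubObject21", ["..."]]]]]]
]

tree = to_tree(results[0])
print(tree["name"], [child["name"] for child in tree["children"]])
```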

-- Paul
 
Imaginationworks

Wow, that is great! Thanks
 
Jonathan Gardner

Parse it!

Go full-bore with a real parser. You may want to consider one of the
many fine Pythonic implementations of modern parsers, or break out
more traditional parsing tools.

This format is nested, meaning that you can't use regexes to parse
what you want out of it. You're going to need a real, full-bore,
no-holds-barred parser for this.
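
For a taste of what a hand-written parser looks like, here is a minimal recursive-descent sketch. The grammar (NAME '=' '{' item* '}' ';') is guessed from the sample in the question, and every function name is hypothetical.

```python
import re


def tokenize(text):
    """Split into punctuation tokens and runs of other non-space text."""
    return re.findall(r"[{}=;]|[^\s{}=;]+", text)


def expect(tokens, pos, wanted):
    """Raise SyntaxError unless tokens[pos] is the wanted token."""
    found = tokens[pos] if pos < len(tokens) else "end of input"
    if found != wanted:
        raise SyntaxError(
            "expected %r at token %d, found %r" % (wanted, pos, found))


def parse_definition(tokens, pos=0):
    """Parse NAME '=' '{' item* '}' ';' and return (node, next_pos)."""
    name = tokens[pos]
    expect(tokens, pos + 1, "=")
    expect(tokens, pos + 2, "{")
    pos += 3
    values, children = [], []
    while pos < len(tokens) and tokens[pos] != "}":
        if tokens[pos + 1:pos + 3] == ["=", "{"]:
            # A nested definition: recurse, then continue after it
            child, pos = parse_definition(tokens, pos)
            children.append(child)
        else:
            values.append(tokens[pos])
            pos += 1
    expect(tokens, pos, "}")
    expect(tokens, pos + 1, ";")
    return {"name": name, "values": values, "children": children}, pos + 2


sample = "Object1 = { .... SubObject1 = { ..... SubSubObject1 = { .... }; }; };"
tokens = tokenize(sample)
tree, end = parse_definition(tokens)
print(tree["name"], tree["children"][0]["name"])
```

The recursion in parse_definition mirrors the nesting in the file, which is exactly the part a single regular expression cannot express.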

Don't worry, the road is not easy but the destination is absolutely
worth it.

Once you come to appreciate and understand parsing, you have earned
the right to call yourself a red-belt programmer. To get your
black belt, you'll need to write your own compiler. Having mastered these
two tasks, there is no problem you cannot tackle.

And once you realize that every program is really a compiler, then you
have truly mastered the Zen of Programming in Any Programming Language
That Will Ever Exist.

With this understanding, you will judge programming language utility
based solely on how hard it is to write a compiler in it, and
complexity based on how hard it is to write a compiler for it. (Notice
there are not a few parsers written in Python, as well as Jython and
PyPy and others written for Python!)
 
Steven D'Aprano

And once you realize that every program is really a compiler, then you
have truly mastered the Zen of Programming in Any Programming Language
That Will Ever Exist.

In the same way that every tool is really a screwdriver.
 
Paul McGuire

In the same way that every tool is really a screwdriver.

The way I learned this was:
- Use the right tool for the right job.
- Every tool is a hammer.

-- Paul
 