Angelic Devil
I'm building a file parser but I have a problem I'm not sure how to
solve. The files this will parse have the potential to be huge
(multiple GBs). There are distinct sections of the file that I
want to read into separate dictionaries to perform different
operations on. Each section has specific begin and end statements
like the following:
KEYWORD
..
..
..
END KEYWORD
The very first thing I do is read the entire file contents into a
string. I then store the contents in a list, splitting on line ends
as follows:
file_lines = file_contents.split('\n')
Next, I build smaller lists from the different sections using the
begin and end keywords:
begin_index = file_lines.index(begin_keyword)
end_index = file_lines.index(end_keyword)
small_list = file_lines[begin_index + 1:end_index]
I then plan on parsing each list to build the different dictionaries.
The problem is that one begin statement is a substring of another
begin statement as in the following example:
BAR
END BAR
FOOBAR
END FOOBAR
I can't just look for the line in the list that contains BAR because
FOOBAR might come first in the list. My list would then look like
[foobar_1, foobar_2, ..., foobar_n, ..., bar_1, bar_2, ..., bar_m]
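To make the collision concrete, here is a minimal sketch. The sample lines are made up, and it assumes each keyword sits alone on its own line with no surrounding whitespace. A substring test matches FOOBAR when searching for BAR, while an exact comparison does not:

```python
# Hypothetical sample data; assumes each keyword sits alone on its own line.
file_lines = ["FOOBAR", "x = 1", "END FOOBAR", "BAR", "y = 2", "END BAR"]

# Substring search: "BAR" occurs inside "FOOBAR", so the wrong line matches first.
substring_hit = next(i for i, line in enumerate(file_lines) if "BAR" in line)

# Exact comparison: list.index() compares with ==, so only the line "BAR" matches.
exact_hit = file_lines.index("BAR")

print(substring_hit, exact_hit)
```

If the real lines carry trailing whitespace or extra tokens, the exact comparison would need a strip() or a split() first; that detail is not known from the example above.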
I don't really want to use regular expressions, but I don't see a way
to get around this without doing so. Does anyone have any suggestions
on how to accomplish this? If regexps are the way to go, is there an
efficient way to parse the contents of a potentially large list using
regular expressions?
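One regex-free direction might look like the sketch below. It is only an illustration under assumptions: the function name read_sections is hypothetical, each begin and end statement is assumed to be the only content on its line, and sections are assumed not to nest. It reads the input one line at a time and compares whole lines, so BAR cannot match FOOBAR and the file never has to be held in memory at once:

```python
import io

def read_sections(lines, keywords):
    """Yield (keyword, body_lines) for each KEYWORD ... END KEYWORD block.

    Hypothetical helper: assumes keywords sit alone on their lines
    and that sections do not nest.
    """
    keywords = set(keywords)
    current = None  # keyword of the section currently being read, if any
    body = []
    for raw in lines:
        line = raw.strip()
        if current is None:
            # Whole-line comparison, so "BAR" never matches "FOOBAR".
            if line in keywords:
                current, body = line, []
        elif line == "END " + current:
            yield current, body
            current = None
        else:
            body.append(line)

# Stand-in for a real file; an open file object iterates line by line the same way.
sample = io.StringIO("FOOBAR\na\nb\nEND FOOBAR\nBAR\nc\nEND BAR\n")
sections = dict(read_sections(sample, ["BAR", "FOOBAR"]))
print(sections)
```

With a real file, the object returned by open() could be passed in place of the StringIO. Each section body is still collected as a list here, so a single multi-GB section would need to be processed as it streams rather than stored.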
Any help is appreciated!
Thanks,
Aaron