Parsing problems: A journey from a text file to a directory tree

Martin M.

Hi everybody,

Some of my colleagues want me to write a script for easy folder and
subfolder creation on the Mac.

The script is supposed to scan a text file containing directory trees
in the following format:

[New client]
|-Invoices
|-Offers
|--Denied
|--Accepted
|-Delivery notes

As you can see, the folder hierarchy is expressed by the number of
minus signs, and each section header is framed by brackets (as in
Windows config files).

After the scan process, the script is supposed to show a dialog, where
the user can choose from the different sections (e.g. 'Alphabet',
'Months', 'New client' etc.). Then the script will create the
corresponding folder hierarchy in the currently selected folder (done
via AppleScript).

But currently I simply don't know how to parse these folder lists and
how to save them in an array accordingly.

First I thought of an array like this:

dirtreedb = {'New client': {'Invoices': {}, 'Offers': {'Denied': {},
'Accepted': {}}, 'Delivery notes': {}}}

But this doesn't do the trick, as I also have to save the hierarchy
level of the current folder as well...

Argh, I really can't get my head around this problem and I need your
help. I have the feeling that the answer is not that complicated, but
I just don't see it right now...

Yours desperately,

Martin
 
Neil Cerutti

Martin M. said:
The script is supposed to scan a text file containing directory trees
in the following format:

[New client]
|-Invoices
|-Offers
|--Denied
|--Accepted
|-Delivery notes

Would it make sense to store it like this?

[('New client',
  [('Invoices', []),
   ('Offers', [('Denied', []), ('Accepted', [])]),
   ('Delivery notes', [])])]

Martin M. said:
First I thought of an array like this:

dirtreedb = {'New client': {'Invoices': {}, 'Offers': {'Denied': {},
'Accepted': {}}, 'Delivery notes': {}}}

A dictionary approach is fine if it's OK for the directories to
be unordered, which doesn't appear to be the case.

Martin M. said:
But this doesn't do the trick, as I also have to save the
hierarchy level of the current folder as well...

The structure above does store the hierarchy, as the number of nesting
levels.

dirtreedb['New client']['Offers']['Denied']
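
To illustrate the list-of-tuples layout suggested above, here is a minimal sketch that walks such a structure and yields one path per folder, parents first. The helper name iter_paths and the '/' separator are assumptions made for this example, not part of the original post.

```python
def iter_paths(nodes, prefix=()):
    """Yield each folder as a tuple of path components, parents first.

    `nodes` is a list of (name, children) tuples, where `children`
    is again a list of (name, children) tuples.
    """
    for name, children in nodes:
        path = prefix + (name,)
        yield path
        # Recurse into the subfolders, keeping the order of the list.
        yield from iter_paths(children, path)


tree = [('New client',
         [('Invoices', []),
          ('Offers', [('Denied', []), ('Accepted', [])]),
          ('Delivery notes', [])])]

# Join the components only for display; '/' is an arbitrary choice here.
paths = ['/'.join(p) for p in iter_paths(tree)]
for p in paths:
    print(p)
```

Because lists keep their order, the folders come out in the same sequence as in the text file, and the length of each tuple is the hierarchy level.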
 
Larry Bates

Since you are going to need a dialog anyway, I would use the wxWindows
tree control. It already knows how to do what you describe. Then you can
just walk all the branches and create the folders.

-Larry
 
Michael J. Fromberger

Hello, Martin,

A good way to approach this problem is to recognize that each section of
your proposed configuration represents a kind of depth-first traversal
of the tree structure you propose to create. Thus, you can reconstruct
the tree by keeping track at all times of the path from the "root" of
the tree to the "current location" in the tree.

Below is one possible implementation of this idea in Python. In short,
the function keeps track of a stack of dictionaries, each of which
represents the contents of some directory in your hierarchy. As you
encounter "|--" lines, entries are pushed to or popped from the stack
according to whether the nesting level has increased or decreased.

This code is not heavily tested, but hopefully it should be clear:

import re

def parse_folders(input):
    """Read input from a file-like object that describes directory
    structures to be created. The input format is:

    [Top-level name]
    |-Subdirectory1
    |--SubSubDirectory1
    |--SubSubDirectory2
    |---SubSubSubDirectory1
    |-Subdirectory2
    |-Subdirectory3

    The input may consist of any number of such groups. The result is
    a dictionary structure in which each key names a directory, and
    the corresponding value is a dictionary structure showing the
    contents of that directory, possibly empty.
    """

    # This expression matches "header" lines, defining a new section.
    new_re = re.compile(r'\[([\w ]+)\]\s*$')

    # This expression matches "nesting" lines, defining subdirectories.
    more_re = re.compile(r'(\|-+)([\w ]+)$')

    out = {}       # Root: Maps section names to subtrees.
    state = [out]  # Stack of dictionaries, current path.

    for line in input:
        m = new_re.match(line)
        if m:  # New section begins here...
            key = m.group(1).strip()
            out[key] = {}
            state = [out, out[key]]
            continue

        m = more_re.match(line)
        if m:  # Add a directory to an existing section
            assert state

            new_level = len(m.group(1))
            key = m.group(2).strip()

            while new_level < len(state):
                state.pop()

            state[-1][key] = {}
            state.append(state[-1][key])

    return out

To call this, pass a file-like object to parse_folders(), e.g.:

test1 = '''
[New client].
|-Invoices
|-Offers
|--Denied
|--Accepted
|---Reasons
|---Rhymes
|-Delivery notes
'''

from io import StringIO
result = parse_folders(StringIO(test1))

As the documentation suggests, the result is a nested dictionary
structure, representing the folder structure you encoded. I hope this
helps.

Cheers,
-M
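
As a follow-up to the nested-dictionary result described above, here is a minimal sketch of how such a structure could be turned into real folders. The function name make_tree and the use of a temporary directory as the target are assumptions for this example; the original poster planned to take the target folder from AppleScript. On Python 3.7 and later, dictionaries preserve insertion order, so the folders are visited in the order they were parsed.

```python
import os
import tempfile


def make_tree(tree, root):
    """Create one directory per key of `tree` under `root`, recursively.

    Returns the list of created paths, parents before children.
    """
    created = []
    for name, children in tree.items():
        path = os.path.join(root, name)
        # exist_ok=True makes the call harmless if the folder already exists.
        os.makedirs(path, exist_ok=True)
        created.append(path)
        created.extend(make_tree(children, path))
    return created


# Same shape as the dictionary returned for the sample section.
result = {'New client': {'Invoices': {},
                         'Offers': {'Denied': {}, 'Accepted': {}},
                         'Delivery notes': {}}}

# A throwaway directory stands in for the user's selected folder.
with tempfile.TemporaryDirectory() as root:
    created = make_tree(result, root)
    relative = [os.path.relpath(p, root) for p in created]

for p in relative:
    print(p)
```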
 
John Machin

Michael J. Fromberger said:
# This expression matches "header" lines, defining a new section.
new_re = re.compile(r'\[([\w ]+)\]\s*$')

Directory names can contain more different characters than those which
match [\w ] ... and which ones depends on the OS; might as well just
allow anything, and leave it to the OS to complain. Also consider
using line.rstrip() (usually a handy precaution on ANY input text
file) instead of having \s*$ at the end of your regex.
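
The two suggestions above (accept any characters in a name, and call rstrip() on each line) could look roughly like this. The pattern names, the classify helper and its return values are assumptions made for this sketch; whether a given name is legal is left to the operating system.

```python
import re

# Assumed patterns: any non-empty text counts as a name.
HEADER_RE = re.compile(r'\[(.+)\]$')
ENTRY_RE = re.compile(r'\|(-+)(.+)$')


def classify(line):
    """Return ('header', name), ('entry', depth, name), or None."""
    line = line.rstrip()  # drops the newline and any trailing whitespace
    m = HEADER_RE.match(line)
    if m:
        return ('header', m.group(1).strip())
    m = ENTRY_RE.match(line)
    if m:
        # depth is the number of '-' characters after the '|'
        return ('entry', len(m.group(1)), m.group(2).strip())
    return None


print(classify('[New client]   \n'))
print(classify('|--Q3 & Q4 (2005)\n'))
print(classify('[New client].\n'))
```

With rstrip() applied first, the patterns no longer need \s* before the final $.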

Michael J. Fromberger said:
while new_level < len(state):
    state.pop()

Hmmm ... consider rewriting that as the slightly less obfuscatory

while len(state) > new_level:
    state.pop()

If you really want to make the reader slow down and think, try this:

del state[new_level:]
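
A small check that the loop and the slice deletion leave the stack in the same state. The two wrapper functions and the string placeholders standing in for the dictionaries are assumptions for this example only.

```python
def truncate_with_loop(state, new_level):
    """Pop entries until at most new_level remain."""
    while len(state) > new_level:
        state.pop()
    return state


def truncate_with_slice(state, new_level):
    """Delete everything from index new_level onwards, in place."""
    del state[new_level:]
    return state


# Strings stand in for the nested dictionaries on the real stack.
stack = ['out', 'section', 'Offers', 'Denied']

a = truncate_with_loop(list(stack), 2)
b = truncate_with_slice(list(stack), 2)
print(a, b)

# Both are no-ops when new_level is not smaller than the stack length.
c = truncate_with_loop(list(stack), 10)
d = truncate_with_slice(list(stack), 10)
print(c == d == stack)
```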

A warning message if there are too many "-" characters might be a good
idea:

[foo]
|-bar
|-zot
|---plugh
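
One way such a warning could be produced is to compare each entry's dash count with the depth of the previous entry. This is a sketch under stated assumptions: the function name find_level_jumps is invented here, and after a jump the entry is treated as one level below its predecessor, which mirrors how the posted parser behaves.

```python
def find_level_jumps(lines):
    """Return (line_number, text) for entries that skip a nesting level."""
    problems = []
    depth = 0  # depth of the previous entry; 0 means directly under a header
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip()
        if line.startswith('['):
            depth = 0
        elif line.startswith('|-'):
            rest = line[1:]
            dashes = len(rest) - len(rest.lstrip('-'))
            if dashes > depth + 1:
                problems.append((number, line))
            # An entry can be at most one level deeper than the previous one.
            depth = min(dashes, depth + 1)
    return problems


sample = ['[foo]', '|-bar', '|-zot', '|---plugh']
for number, line in find_level_jumps(sample):
    print('warning: line %d skips a level: %s' % (number, line))
```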

Michael J. Fromberger said:
state[-1][key] = {}
state.append(state[-1][key])

And if the input line matches neither regex?

Michael J. Fromberger said:
return out

To call this, pass a file-like object to parse_folders(), e.g.:

test1 = '''
[New client].

Won't work with the dot on the end.
 
Michael J. Fromberger

Hi, John,

Your comments below are all reasonable. However, I would like to point
out that the purpose of my example was to provide a demonstration of an
algorithm, not an industrial-grade solution to every aspect of the
original poster's problem. I am confident the original poster can deal
with these aspects of his problem space on his own.

John Machin said:
[...]
while new_level < len(state):
    state.pop()

Hmmm ... consider rewriting that as the slightly less obfuscatory

while len(state) > new_level:
    state.pop()

This seems to me to be an aesthetic consideration only; I'm not sure I
understand your rationale for reversing the sense of the comparison.
Since it does not change the functionality, it's hardly worthy of
complaint, but I don't see any improvement, either.

John Machin said:
A warning message if there are too many "-" characters might be a good
idea:

[foo]
|-bar
|-zot
|---plugh

Perhaps so. Again, the original poster will have to decide what should
be the correct response to input of this sort; at present, the
implementation is tolerant of such variations, without loss of
generality.

John Machin said:
And if the input line matches neither regex?

I believe it should be clear that such lines are ignored. Again, this
is an opportunity for the original poster to determine an alternative
response -- perhaps an exception could be raised, if that is his desire.
The problem specification did not constrain this case.
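
For readers who prefer the stricter behaviour mentioned above, here is a minimal sketch of a variant that raises an exception instead of ignoring unrecognised lines. The name parse_folders_strict, the choice of ValueError, and the decision to still skip blank lines are all assumptions for this example; the stack handling follows the algorithm already described in the thread.

```python
import re
from io import StringIO

HEADER_RE = re.compile(r'\[([\w ]+)\]$')
ENTRY_RE = re.compile(r'(\|-+)([\w ]+)$')


def parse_folders_strict(lines):
    """Like the posted parser, but reject lines it does not understand."""
    out = {}
    state = [out]  # stack of dictionaries along the current path
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip()
        if not line:
            continue  # assumption: blank lines remain acceptable
        m = HEADER_RE.match(line)
        if m:
            key = m.group(1).strip()
            out[key] = {}
            state = [out, out[key]]
            continue
        m = ENTRY_RE.match(line)
        if m and len(state) > 1:  # entries are only valid inside a section
            del state[len(m.group(1)):]
            key = m.group(2).strip()
            state[-1][key] = {}
            state.append(state[-1][key])
            continue
        raise ValueError('line %d not understood: %r' % (number, line))
    return out


good = StringIO('[New client]\n|-Invoices\n|-Offers\n|--Denied\n')
print(parse_folders_strict(good))

try:
    parse_folders_strict(StringIO('[New client].\n|-Invoices\n'))
except ValueError as exc:
    print(exc)
```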

John Machin said:
To call this, pass a file-like object to parse_folders(), e.g.:

test1 = '''
[New client].

Won't work with the dot on the end.

My mistake. The period was a copy-and-paste artifact, which I missed.

Cheers,
-M
 
