Henry
Folks:
I've got a bunch of fixed-format text files (< 100k bytes each) to sniff.
Each file is divided into paragraphs. Each para is preceded by at least
three blank lines, and is introduced by a section number of 1 to 6 digits
followed by a period and two spaces, OR, 1 to 6 digits followed by a period
and at least one digit, followed by a period and two spaces, e.g.
------------------------------------------------------
......
<empty>
<empty>
<empty>
12034. Blah, blah, blah, blah. Blah. Blah Blah Blah. Blah. Blah, blah,
blah, blah. Blah. Blah Blah Blah. Blah. Blah, blah, blah, blah. Blah.
Blah Blah Blah. Blah. Blah, blah, blah, blah. Blah. Blah Blah Blah.
Blah. ...
------------------------------------------------------
Or, the second format:
------------------------------------------------------
......
<empty>
<empty>
<empty>
12034.1. Blah, blah, blah, blah. Blah. Blah Blah Blah. Blah. Blah, blah,
blah, blah. Blah. Blah Blah Blah. Blah. Blah, blah, blah, blah. Blah.
Blah Blah Blah. Blah. Blah, blah, blah, blah. Blah. Blah Blah Blah. ...
.....
------------------------------------------------------
Yes, if you are wondering, these are legal blah-blah-blahs.
Seems the best way to deal with this is to slurp, and use "split" with the
appropriate regexp. Wrinkle: I need to retain the section numbers in the
return strings.
Right! I've been writing trial regular expressions all day, and I have come
to the conclusion that I'm not very good at it. I've also examined examples
and help pages until I'm ... really tired and no wiser.
Well, I _can_ split based on assuming that the three empty lines _always_
appear before a new section, but this doesn't seem very robust. Seems like
I really ought to be able to recognize at least two empties followed by
these two fixed-format alternatives.
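For illustration only, here is a minimal sketch of how the two header styles described above might be written as a single pattern and checked one line at a time. The sample lines and the names $header, @lines and @is_header are made up for this sketch, and it assumes "two spaces" means two literal space characters.

```perl
use strict;
use warnings;

# One pattern for both header styles, written with /x so it can be commented.
my $header = qr{
    ^ \d{1,6}        # section number: 1 to 6 digits
    (?: \. \d+ )?    # optional subsection: period plus at least one digit
    \. [ ]{2}        # closing period followed by two spaces
}x;

# Hypothetical sample lines, not taken from the real files.
my @lines = (
    '12034.  Blah',
    '12034.1.  Blah',
    '7.  Blah',
    '1234567.  Blah',   # seven digits, so not a header
    'Blah. Blah',
);

# 1 if the line starts with a section header, 0 otherwise.
my @is_header = map { /$header/ ? 1 : 0 } @lines;
print "$lines[$_] => $is_header[$_]\n" for 0 .. $#lines;
```

The (?: ... ) group is the non-capturing kind: it only groups the optional ".digits" part so that the ? applies to all of it.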
Best I've figured out takes a common subset of the two cases:
#@sections = split /\n\n\n[0-9][0-9]+\./,<>;
This works ok, but it eats the match string. Non-capturing parentheses? I
wish I could make heads or tails of this syntax. Look-ahead assertion?
Even more cryptic.
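As a hedged illustration of the look-ahead idea, the sketch below splits on runs of newlines but only where a section header follows, without consuming the header. The sample text in $text is invented, and the pattern assumes a header is 1 to 6 digits, an optional ".digits" part, a period and two spaces.

```perl
use strict;
use warnings;

# Hypothetical sample text standing in for one slurped file.
my $text = "Preamble\n\n\n\n"
         . "12034.  First section.\nMore text.\n\n\n\n"
         . "12034.1.  Second section.\n";

# (?= ... ) is a zero-width look-ahead: the header must be present at
# that point, but it is not part of the separator, so split leaves it
# at the start of the next chunk.
my @sections = split /\n{3,}(?=\d{1,6}(?:\.\d+)?\.  )/, $text;

print "[$_]\n" for @sections;
```

If the header were placed in plain capturing parentheses instead, split would return the captured text too, but as separate list elements rather than attached to the chunk that follows.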
I can't even figure out why I seem to need "[0-9][0-9]+" for my 5 digit test
case when it seems "[0-9]+" ought to suffice. (Yeah, I know my solution
will fail if there's only 1 digit --i.e. the first 9 sections-- but that's
obviously the least of my problems).
Could some wizard teach me to fish? Please don't give me a solution; merely
tell me where I'm going wrong and put me back on the right path.
Or should I go back to my awk hack that works and which I actually
understand?
Thanks,
Henry