Extracting strings delimited by other strings

Scott Bass · May 7, 2005

Hi,

I need to write some code that will allow embedded, specially formatted
comments to document test cases within a program (SAS code). The code will
process the programs, pulling out the test case information. AFAIK this is
similar to how Javadoc works, embedding documentation alongside code.

The syntax will look like:

/*
<testcase>
TESTID: TEST1
OBJECTIVE: The objective of the test
PROCEDURE: The procedure that the test uses
Continuation line from the above
RESULTS: The expected results of the test
Continuation line
Another "continuation" line
</testcase>
*/

The syntax can also be embedded in title statements:

/* <testcase> */
title3 "TESTID: TEST1";
title4 "OBJECTIVE: The objective of the test";
title5 "PROCEDURE: The procedure that the test uses";
title6 " Continuation line from the above";
title6 "RESULTS: The expected results of the test";
title7 " Continuation line";
title8 ' Another "continuation" line';
/* </testcase> */

After processing the program, the desired output is a tab-delimited string
containing filename, testid, objective, procedure, and results. For those
lines that were continued, I would like an embedded CR/LF. Leading spaces
should be removed, as well as any title statements, "outer" quotation marks
(preserving inner quotation marks), and trailing semi-colons.
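
To make the target format concrete, here is a rough sketch of the single record the sample above should produce. The filename test1.sas is made up for illustration; the other values come straight from the sample.

```perl
use strict;
use warnings;

# Hypothetical filename; the remaining field values come from the sample.
my @fields = (
    'test1.sas',
    'TEST1',
    'The objective of the test',
    "The procedure that the test uses\r\nContinuation line from the above",
    "The expected results of the test\r\nContinuation line\r\nAnother \"continuation\" line",
);

# One record per test case: fields separated by tabs, continuation
# lines kept inside a field as embedded CR/LF pairs.
my $record = join "\t", @fields;
print $record, "\n";
```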

Are there any modules that I could use as a starting point for this? If you
have any code that does something similar, could you either post it or email
it to me? It will be easier to modify an existing example than to start from
scratch.

Kind Regards,
Scott

A. Sinan Unur · May 7, 2005

I need to write some code
....

After processing the program, the desired output is a tab-delimited
string containing filename, testid, objective, procedure, and results.
For those lines that were continued, I would like an embedded CR/LF.

Surely, you do not expect people here to write a program to your specs.

Are there any modules that I could use as a starting point for this?

Did you find anything on CPAN? Did you try Google?

The surest way to get quality help here is to post what you have
attempted so far, and specific questions regarding specific issues you
have encountered.

Sinan

Tad McClellan · May 7, 2005

Scott Bass said:
/*
<testcase>
TESTID: TEST1
OBJECTIVE: The objective of the test
PROCEDURE: The procedure that the test uses
Continuation line from the above
RESULTS: The expected results of the test
Continuation line
Another "continuation" line
</testcase>
*/

The syntax can also be embedded in title statements:

/* <testcase> */
title3 "TESTID: TEST1";
title4 "OBJECTIVE: The objective of the test";
title5 "PROCEDURE: The procedure that the test uses";
title6 " Continuation line from the above";
title6 "RESULTS: The expected results of the test";
title7 " Continuation line";
title8 ' Another "continuation" line';
/* </testcase> */

After processing the program, the desired output is a tab-delimited string

I'm going to use commas because they are easier to see.

containing filename, testid, objective, procedure, and results. For those
lines that were continued, I would like an embedded CR/LF. Leading spaces
should be removed, as well as any title statements, "outer" quotation marks
(preserving inner quotation marks), and trailing semi-colons.

Are there any modules that I could use as a starting point for this?

I dunno.

Hardly seems worth modularization when it only takes about
20 lines of regular ol' Perl.

Assuming the whole file is slurped into $_ :

------------------------------
while ( m#<testcase>(.*?)</testcase>#gs ) {
    my $record = normalize($1);
    my(undef,@parts) = split /(?:TESTID|OBJECTIVE|PROCEDURE|RESULTS):\s+/,
                             $record;
    chomp @parts;
    print join( ',', @parts), "\n";
}

sub normalize {
    my($r) = @_;

    $r =~ s#^\s*\*/##; # snip bits of comment delimiters
    $r =~ s#/\*\s*##;

    $r =~ s#^title\d+ ##gm; # remove title statement cruft
    $r =~ s#^(['"])(.*?)\1;#$2#gm;

    $r =~ s#\n\s\s+#\n#g; # strip the indent from every continuation line

    $r =~ s#^\s*##; # trim spaces
    $r =~ s#\s*$##;
    return $r;
}
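
For anyone who wants to run the idea end to end, below is a minimal sketch that follows the same approach but produces the output asked for in the original post: tab-delimited, filename first, continuation lines joined with an embedded CR/LF. The name extract_testcases and the command-line driver are assumptions made for illustration, not part of the code above, and the regexes have only been checked against the two sample layouts in this thread.

```perl
use strict;
use warnings;

# Hypothetical helper: returns one tab-delimited record per <testcase> block.
sub extract_testcases {
    my ($filename, $text) = @_;
    my @records;
    while ( $text =~ m#<testcase>(.*?)</testcase>#gs ) {
        my $r = $1;
        $r =~ s#^\s*\*/##;                     # leftover "*/" of "/* <testcase> */"
        $r =~ s#/\*\s*$##;                     # leftover "/*" of "/* </testcase> */"
        $r =~ s#^\s*title\d*\s+##gm;           # title statement keyword
        $r =~ s#^(['"])(.*)\1;[ \t]*$#$2#gm;   # outer quotes and trailing semicolon
        $r =~ s#^\s+|\s+\z##g;                 # trim the block as a whole
        my (undef, @parts) =
            split /^(?:TESTID|OBJECTIVE|PROCEDURE|RESULTS):\s*/m, $r;
        for (@parts) {
            s#\s+\z##;             # drop the newline before the next keyword
            s#\s*\n\s*#\r\n#g;     # continuation lines become embedded CR/LF
        }
        push @records, join "\t", $filename, @parts;
    }
    return @records;
}

# Driver: slurp each file named on the command line and print its records.
for my $file (@ARGV) {
    open my $fh, '<', $file or die "Cannot open $file: $!";
    my $text = do { local $/; <$fh> };   # undefined $/ reads the whole file
    close $fh;
    print "$_\n" for extract_testcases($file, $text);
}
```

One caveat: on Win32, text-mode output already turns "\n" into CR/LF, so the embedded "\r\n" may come out as CR CR LF unless binmode is set on the output handle or the sketch embeds a plain "\n" instead.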

Scott Bass · May 12, 2005

I didn't know if maybe one of the XML modules could be coaxed to parse this.
But, yes, I need to do my homework on CPAN...

Tad McClellan said:
I dunno.

Hardly seems worth modularization when it only takes about
20 lines of regular ol' Perl.

Tad, thanks a lot for the above code. Much and sincerely appreciated. I
was actually just wanting architectural ideas (which I now realize I didn't
make clear), so the above code was above and beyond...

And thank you too, Sinan. Your chastisement is noted - I've read Tad's
posting guidelines, and will attempt to comply in the future. Thanks for
pointing me to CPAN and perldoc. Since I'm coding on Win32 with ActiveState
Perl, I'm just using the online (HTML) doc, which looks to be equivalent to
perldoc. I'm also Googling and using the O'Reilly _Programming Perl_ and
_Perl Cookbook_ books.

Lastly, I have made good progress on my script. However, once finished, it
would be nice if someone could give pointers on tightening it up. I'm
tempted to post the entire script (about 100 lines) so it's all in context,
and identify key areas where I have questions. However, I do want to be a
good net citizen, and don't want to piss off the other readers of the list
by not asking specific questions.

I welcome any posting guidelines you may have for this situation.

Kind Regards,
Scott

Tad McClellan · May 12, 2005

Scott Bass said:
I didn't know if maybe one of the XML modules could be coaxed to parse this.

But it is not XML data.

It has all kinds of cruft surrounding XML-looking tags.

(but maybe that's what you meant by "coaxed"?)

Lastly, I have made good progress on my script. However, once finished, it
would be nice if someone could give pointers on tightening it up. I'm
tempted to post the entire script (about 100 lines) so it's all in context,
and identify key areas where I have questions. However, I do want to be a
good net citizen, and don't want to piss off the other readers of the list
by not asking specific questions.

I welcome any posting guidelines you may have for this situation.

Include the specific parts in your post, along with a link to
someplace that has the entire program.
