Parsing: Help on ignoring quoted tokens.

paktsardines

Hi all,

I am writing a (hopefully) simple parser to parse the contents of a
text file and turn it into some sort of html form. Here's a small
example:

forms.txt contains something like:

# Registration Form
registration {
numcols:2
[heading: Account Details] [ ]
[label:"User Name:"] [textbox:username:amcnab:mandatory]
[label:"First Name:"] [textbox:first_name:Andy]
[label:"Last Name:"] [textbox:last_name:McNab]
[label:"Password:"] [passbox:passwd::mandatory]
}

# Error form
error {
numcols:2
[heading:Explosion Error!][]
[label:"Vent Gas?:"] [select:vent:yes|no:no]
}

where:
[.*] denotes an html table cell.

Then, later in my perl code I want to be able to do:

show_form("registration"), or show_form("error") and have it render
the appropriate form layout.

Now, my question is: what is the best way to approach the parsing of
this file? Perhaps more importantly, how can I structure the file to
make the parsing as easy and practical as possible?

Also, can anyone please suggest how to ignore tokens (like ':') that
occur within quoted strings?

Bonus points if your answer makes no reference to lex or yacc. :)

Thank you for any suggestions!

pakt.
 
Brian McCauley

Also, can anyone please suggest how to ignore tokens (like ':') that
occur within quoted strings?

This is very closely related to the FAQ "How can I split a [character]
delimited string except when inside [character]?"
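
One way to apply that FAQ to the cell syntax above is the core module Text::ParseWords. The sketch below is only an illustration: split_cell() is a hypothetical helper name, and it assumes that cell text uses double quotes only. parse_line() also treats single quotes and backslashes specially, so a label containing an apostrophe would need extra care.

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);

# Hypothetical helper: split one cell on ':' but leave colons that sit
# inside double quotes alone. The second argument (0) tells parse_line
# to drop the quote characters from the returned fields.
sub split_cell {
    my ($cell) = @_;
    return parse_line(':', 0, $cell);
}

# Show the fields of two cells from the sample file, one per line.
for my $cell ('label:"User Name:"', 'passbox:passwd::mandatory') {
    print join(' | ', map { "<$_>" } split_cell($cell)), "\n";
}
```

The empty field between the two colons in the passbox cell is kept, so positional fields such as the default value stay in their expected slots.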
 
Martijn Lievaart

Now, my question is: what is the best way to approach the parsing of
this file?

Use Parse::RecDescent. A bit of a learning curve, but very, very powerful.

HTH,
M4
 
Klaus

I am writing a (hopefully) simple parser to parse the contents of a
text file and turn it into some sort of html form. Here's a small
example:

forms.txt contains something like:

# Registration Form
registration {
numcols:2
[heading: Account Details] [ ]
[label:"User Name:"] [textbox:username:amcnab:mandatory]
[label:"First Name:"] [textbox:first_name:Andy]
[label:"Last Name:"] [textbox:last_name:McNab]
[label:"Password:"] [passbox:passwd::mandatory]

}

# Error form
error {
numcols:2
[heading:Explosion Error!][]
[label:"Vent Gas?:"] [select:vent:yes|no:no]

}

where:
[.*] denotes an html table cell.
[...snip...]

Now, my question is: what is the best way to approach the parsing of
this file?

If you say "parse a text file", you are usually dealing with brackets
and/or nested { ... } constructs and I can clearly see the
"registration { ... }" - and "error { ... }" - structure in your
file.

I strongly recommend first reading perlfaq4: "How do I find
matching/nesting anything?"

However, in order to keep this simple, I would suggest making a few
assumptions about the structure of your file, thereby effectively
eliminating the inherent nested structure.

Those assumptions would be, for example:
- there are no nested { ... } constructs.
- each { ... } construct begins with a single line of the form
/^\w+\s*\{$/ and ends with a single line /^}$/
- inside a { ... } construct, each line begins with whitespace, /^\s+/,
and its cells are of the form /\s*\[.*?\]/g
- the first row inside a { ... } construct is of the form
/^\s+\[heading:.*?\]\s+\[\s*\]$/

This would allow you to process the file line by line using only
regexes, while still producing valid HTML. At first this solution seems
oversimplified, but as long as you can keep away from nested
structures, you can easily add, remove or modify regexes in a
trial-and-error approach as you develop your Perl program from the
bottom up.

Here is how I would start the bottom-up approach with your test-file:

==============================
use strict;
use warnings;

my $inputfile = 'forms.txt';
open my $inp, '<', $inputfile
    or die "Error 0010: open < '$inputfile': $!";

# Text of the most recent '#' comment line; it becomes part of the
# title of the table that follows it.
my $comment = '';
while (<$inp>) {
    chomp;
    if (m{^\#\s*(.*)$}xms) {
        $comment = $1;
    }
    # A row line: leading whitespace, then one or more [...] cells.
    if (m{^\s+\[}xms) {
        my @td = m{\[(.*?)\]}gxms;
        # The first row after a comment must be the [heading:...] [ ] row.
        if ($comment ne '') {
            if (@td != 2
                or $td[0] !~ m{^heading:(.*)$}xms) {
                die "Error 0020: unexpected '$_'";
            }
            # $1 is the heading text captured by the check above.
            print "<h2>$1 ($comment)</h2>\n";
            print "<table>\n";
            $comment = '';
            next;
        }
        print " <tr>\n";
        for my $element (@td) {
            if ($element =~ m{^\s*$}xms) {
                print " <td>&nbsp;</td>\n";
            }
            else {
                print " <td>$element</td>\n";
            }
        }
        print " </tr>\n";
        next;
    }
    # A closing brace in column one ends the current table.
    if (/^}/xms) {
        print "</table>\n";
        $comment = '';
        next;
    }
}

close $inp;
==============================

This approach is very flexible and scales well. I have already used it
successfully to transform a plain old schema listing of a mainframe
database from basic ASCII format into HTML.

Bonus points if your answer makes no reference to lex or yacc. :)

Thanks for the bonus points :)
 
Xicheng Jia

Hi all,

I am writing a (hopefully) simple parser to parse the contents of a
text file and turn it into some sort of html form. Here's a small
example:

forms.txt contains something like:

# Registration Form
registration {
numcols:2
[heading: Account Details] [ ]
[label:"User Name:"] [textbox:username:amcnab:mandatory]
[label:"First Name:"] [textbox:first_name:Andy]
[label:"Last Name:"] [textbox:last_name:McNab]
[label:"Password:"] [passbox:passwd::mandatory]

}

# Error form
error {
numcols:2
[heading:Explosion Error!][]
[label:"Vent Gas?:"] [select:vent:yes|no:no]

}

where:
[.*] denotes an html table cell.

Then, later in my perl code I want to be able to do:

show_form("registration"), or show_form("error") and have it render
the appropriate form layout.

Now, my question is: what is the best way to approach the parsing of
this file? Perhaps more importantly, how can I structure the file to
make the parsing as easy and practical as possible?

I think your input data format is just fine, so:

1) Use paragraph mode to separate the tables, and make sure there is no
empty line within a single table block.
* Specify the number of columns and an optional table caption, each on
a single line (no embedded newline). (You could spread the caption over
multiple lines, though. :))
* Keep each table row on a single line, with each column enclosed in
square brackets.
* If you have embedded square brackets, make a rule for them and leave
that to a Perl regex. :)

I guess you've done all of the above. :)

2) Then you need a data structure, or possibly a database. For a data
structure, I would use a hash to organize the tables and an array of
arrays to define each table.
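
To make the shape concrete, here is a hand-written sketch of what such a structure might hold for the error form. It assumes the caption is stored as a plain string ahead of the row arrays; row_count() is a hypothetical helper added only to show how to walk the structure.

```perl
use strict;
use warnings;

# Hand-written stand-in for what a parser might build from forms.txt.
my %tables = (
    error => [
        'Error form',                                            # optional caption (plain string)
        [ 'heading:Explosion Error!', '' ],                      # one array ref per table row
        [ 'label:"Vent Gas?:"',       'select:vent:yes|no:no' ],
    ],
);

# Count the rows of a form, skipping the caption if one is present.
sub row_count {
    my ($name) = @_;
    return 0 unless exists $tables{$name};
    return scalar grep { ref $_ eq 'ARRAY' } @{ $tables{$name} };
}

print "error form rows: ", row_count('error'), "\n";
```

Because the caption is the only element that is not a reference, a `ref` test is enough to tell it apart from the rows.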

Here is a sample:

#!/usr/local/bin/perl
use warnings;
use strict;

my %tables = ();
local $/ = "\n\n";    # read one blank-line-separated block at a time

# build the data structure
while (my $tbl = <DATA>) {
    # find table name
    next if not $tbl =~ /^(\w+)\s*\{\s*$/m;
    my $table = $1;

    # get the number of columns (skip blocks that do not declare it)
    my ($numCol) = $tbl =~ /^\s*numcols:(\d+)/m;
    next if not defined $numCol;

    # find caption if there is any (note: it parses only the first line)
    my ($caption) = $tbl =~ /^#[ \t]*(.*)/m;
    push @{$tables{$table}}, $caption if defined $caption;

    # check each line and find table rows
    foreach my $row (split "\n", $tbl) {
        # adjust the following regex if you have embedded square brackets
        my @cols = ($row =~ /\[([^][]*)\]/g);
        push @{$tables{$table}}, [ @cols ] if scalar @cols == $numCol;
    }
}

print "Check registration form\n";
show_form('registration');

print "\n\nCheck error form\n";
show_form('error');

##### subroutines #####
sub show_form {
    my $tbl = shift;
    my @form = @{$tables{$tbl}};
    print "<table>\n";
    # a leading plain string (not an array ref) is the caption
    if (not ref $form[0]) {
        print " <caption>$form[0]</caption>\n";
        shift @form;
    }
    foreach my $row (@form) {
        print " <tr>\n";
        foreach my $col (@{$row}) {
            $col = '&nbsp;' if $col =~ /^\s*$/;
            my $var = mkCol($col);
            # the first cell of the row decides between <th> and <td>
            print " <th>$var</th>\n" if $row->[0] =~ /^heading:/;
            print " <td>$var</td>\n" if $row->[0] =~ /^label:/;
        }
        print " </tr>\n";
    }
    print "</table>\n";
}

##### subroutine to parse table cell #####
sub mkCol {
    my $col = shift;
    return $1 if $col =~ /^label:"([^"]*?):?"$/;
    return $1 if $col =~ /^heading:\s*(.*)/;
    return $col;
}

__DATA__
# Registration Form
registration {
numcols:2
[heading: Account Details] [ ]
[label:"User Name:"] [textbox:username:amcnab:mandatory]
[label:"First Name:"] [textbox:first_name:Andy]
[label:"Last Name:"] [textbox:last_name:McNab]
[label:"Password:"] [passbox:passwd::mandatory]
}

# Error form
error {
numcols:2
[heading:Explosion Error!][]
[label:"Vent Gas?:"] [select:vent:yes|no:no]
}

(You will need to do more testing yourself, though.)

Also, can anyone please suggest how to ignore tokens (like ':') that
occur within quoted strings?

I don't know your final goal, but you can probably leave that to the
handling of each cell (i.e. the subroutine mkCol() in my test code).
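
As a rough sketch of what that per-cell handling could look like: the code below splits off only the cell type at the first ':' and then decides how to treat the rest, so a colon inside a quoted label is never seen as a separator. The names render_cell() and escape_html() are hypothetical, the select type is left out, and mapping "mandatory" to the HTML required attribute is my assumption about what the flag means.

```perl
use strict;
use warnings;

# Escape the characters that are special in HTML text and attribute values.
sub escape_html {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# Hypothetical helper: turn the text of one [...] cell into HTML.
sub render_cell {
    my ($cell) = @_;
    return '&nbsp;' if $cell =~ /^\s*$/;

    # Only the first ':' separates the type from the rest of the cell.
    my ($type, $rest) = split /:/, $cell, 2;
    $rest = '' unless defined $rest;

    if ($type eq 'label' or $type eq 'heading') {
        $rest =~ s/^\s*"?//;    # drop leading space and an opening quote
        $rest =~ s/"?\s*$//;    # drop a closing quote and trailing space
        return escape_html($rest);
    }
    if ($type eq 'textbox' or $type eq 'passbox') {
        # name:default:flag, none of which is quoted in the sample data
        my ($name, $default, $flag) = split /:/, $rest, -1;
        $name    = '' unless defined $name;
        $default = '' unless defined $default;
        my $input_type = $type eq 'passbox' ? 'password' : 'text';
        my $html = sprintf '<input type="%s" name="%s" value="%s"',
            $input_type, escape_html($name), escape_html($default);
        # assumption: "mandatory" corresponds to the HTML required attribute
        $html .= ' required' if defined $flag and $flag eq 'mandatory';
        return "$html>";
    }
    return escape_html($cell);    # unknown types are shown verbatim
}

print render_cell($_), "\n"
    for 'label:"User Name:"', 'textbox:username:amcnab:mandatory';
```

Only label and heading cells carry free text in the sample file, which is why treating them separately is enough here. A format with quoted text in other cell types would need a real quote-aware split.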

Good luck,
Xicheng
 
Xicheng Jia

I am writing a (hopefully) simple parser to parse the contents of a
text file and turn it into some sort of html form. Here's a small
example:
forms.txt contains something like:
# Registration Form
registration {
numcols:2
[heading: Account Details] [ ]
[label:"User Name:"] [textbox:username:amcnab:mandatory]
[label:"First Name:"] [textbox:first_name:Andy]
[label:"Last Name:"] [textbox:last_name:McNab]
[label:"Password:"] [passbox:passwd::mandatory]

# Error form
error {
numcols:2
[heading:Explosion Error!][]
[label:"Vent Gas?:"] [select:vent:yes|no:no]

where:
[.*] denotes an html table cell.
Then, later in my perl code I want to be able to do:
show_form("registration"), or show_form("error") and have it render
the appropriate form layout.
Now, my question is: what is the best way to approach the parsing of
this file? Perhaps more importantly, how can I structure the file to
make the parsing as easy and practical as possible?

I think your input data format is just fine, so:

1) use paragraph-mode to separate between tables, make sure no empty
line within a single table block. [..snip..]
local $/ = "\n\n";

In fact, there is no need to use paragraph mode to read your data; just
set $/ = "\n}". I guess this should work for you, as long as the closing
curly bracket of every table block is the first character on its
line. :)
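
A small sketch of that record separator in action, reading from an in-memory string instead of forms.txt so that it is self-contained. block_names() is a hypothetical helper, and the sample text is a cut-down version of the original file.

```perl
use strict;
use warnings;

# Hypothetical helper: split the form text into blocks that each end at a
# closing brace in column one, and return the block names in file order.
sub block_names {
    my ($text) = @_;
    open my $fh, '<', \$text or die "cannot read in-memory text: $!";
    local $/ = "\n}";    # a record now ends at a newline followed by '}'
    my @names;
    while (my $block = <$fh>) {
        # the newline after the last '}' comes back as a final record
        # without a name, which this test skips
        next unless $block =~ /^(\w+)\s*\{\s*$/m;
        push @names, $1;
    }
    close $fh;
    return @names;
}

my $sample = join "\n",
    '# Registration Form', 'registration {', 'numcols:2',
    '[label:"User Name:"] [textbox:username:amcnab:mandatory]', '}',
    '',
    '# Error form', 'error {', 'numcols:2',
    '[heading:Explosion Error!][]', '}', '';

print join(', ', block_names($sample)), "\n";
```

With this separator, blank lines inside a block no longer matter, because only a newline followed by '}' ends a record.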
[..snip..]

##### subroutines #####
sub show_form {
    my $tbl = shift;

add at least this block:

    if (not exists $tables{$tbl}) {
        print "table '$tbl' does not exist\n";
        return;
    }
    my @form = @{$tables{$tbl}};
    print "<table>\n";
    if (not ref $form[0]) {
        print " <caption>$form[0]</caption>\n";
        shift @form;
    }
    foreach my $row (@form) {
        print " <tr>\n";
        foreach my $col (@{$row}) {
            $col = '&nbsp;' if $col =~ /^\s*$/;
            my $var = mkCol($col);
            print " <th>$var</th>\n" if $row->[0] =~ /^heading:/;
            print " <td>$var</td>\n" if $row->[0] =~ /^label:/;
        }
        print " </tr>\n";
    }
    print "</table>\n";
}

[..cut..]

Good luck,
Xicheng
 
