Parsing table in rtf file

Peter Jamieson · Dec 29, 2007

I am trying to extract data from the table in a large number of RTF files.
I tried RTF::Tokenizer and RTF::Parser but could not make progress,
so I have decided to try regular expressions.

My project is to get the tabular data into a db for further analysis.
My problem is that I cannot see how to parse the data rows so
that they match the correct field headings.

Any advice or suggestions appreciated!

###########################################
# Perl code to parse table in rtf files #
###########################################

#!/usr/bin/perl
use strict;
use warnings;

use Time::Local;
use Win32::ODBC;
# use RTF::Tokenizer; # unsuccessful
# use RTF::Parser;    # unsuccessful
use DBI;
use Getopt::Long;

my $ett = localtime();
print "\n Time : $ett \n";

my $file_ = 'BURN_RDX_01.rtf';

open(my $info, '<', $file_) or die "Unable to open $file_: $!";
my @lines = <$info>;
close($info);

# get the useful line data: keep only lines that start a table paragraph
# (cells on continuation lines without \pard\intbl are skipped)
my $useful_data = '';

foreach my $line (@lines) {
    if ($line =~ /\\pard\\intbl/) {
        $useful_data .= $line;
    }
}
print "useful_data are: $useful_data \n";

Inspection of the table headings reveals they may vary (sometimes there is no
telemetry data for a particular range, or the table has different
ranges), but typical headings look like this:

\pard\intbl {\b\f1\fs24\qc Propellant Burn Times \cell }\pard\intbl
{\f1\fs20\qc 22000m\par 20000m\cell
20000m\par 18000m\cell 18000m\par 16000m\cell 16000m\par 14000m\cell
14000m\par 12000m\cell 12000m\par
10000m\cell 10000m\par 8000m\cell 8000m\par 6000m\cell 6000m\par 4000m\cell
4000m\par 2000m\cell
2000m\par BURN CUT OFF\cell }\pard\intbl {\b\f1\qc 17812\cell }\pard\intbl
{\row }
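
One way to turn a heading row like the one above into plain-text cells is to split on \cell and strip the remaining control words. The following is only a minimal sketch: rtf_row_cells is a hypothetical helper name, the choice to join the two lines of a heading cell with " to " is arbitrary, and it assumes the row contains nothing beyond simple control words and braces (no hex escapes, no nested tables).

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): split one RTF table row
# into plain-text cells. Handles only simple control words; escapes
# such as \'e9 or \{ are NOT handled.
sub rtf_row_cells {
    my ($row) = @_;
    $row =~ s/\r?\n/ /g;                    # a row may wrap over several lines
    my @cells = split /\\cell\b/, $row, -1; # \b keeps \cellx from matching
    pop @cells;                             # trailing piece holds only \row
    for my $cell (@cells) {
        $cell =~ s/\\par\b\s*/ to /g;       # "22000m\par 20000m" -> "22000m to 20000m"
        $cell =~ s/\\[a-z]+-?\d*\s?//g;     # drop remaining control words
        $cell =~ s/[{}]//g;                 # drop group braces
        $cell =~ s/^\s+|\s+$//g;            # trim surrounding whitespace
    }
    return @cells;
}

# The heading row quoted in the post, copied verbatim.
my $heading_row = <<'RTF';
\pard\intbl {\b\f1\fs24\qc Propellant Burn Times \cell }\pard\intbl
{\f1\fs20\qc 22000m\par 20000m\cell
20000m\par 18000m\cell 18000m\par 16000m\cell 16000m\par 14000m\cell
14000m\par 12000m\cell 12000m\par
10000m\cell 10000m\par 8000m\cell 8000m\par 6000m\cell 6000m\par 4000m\cell
4000m\par 2000m\cell
2000m\par BURN CUT OFF\cell }\pard\intbl {\b\f1\qc 17812\cell }\pard\intbl
{\row }
RTF

my @heading_cells = rtf_row_cells($heading_row);
print scalar(@heading_cells), " cells\n";
print "  [$_]\n" for @heading_cells;
```

Because the number of ranges varies between files, counting the cells returned for each heading row is a cheap way to spot tables that differ from the typical layout.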

There may be 6 to 30 data rows in the table; a typical row looks like this:

\pard\intbl {\b\f1\fs20\qc 1\cell 40\cell Composition (RDX1)\cell \b0\fs16
\cell \b \cell \cell
1319\cell [90]\cell 1293\cell [90]\cell 1321\cell [90]\cell 1273\cell
[90]\cell 1245\cell [90]\cell
1173\cell [90]\cell 1117\cell [100]\cell 1102\cell [70]\cell 1119\cell
[10]\cell 1218\cell [10]\cell
17817 \cell }\pard\intbl {\row }
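
The alignment problem can be sketched once both rows are reduced to plain cell lists (typed in by hand below from the two samples). This rests on an assumption made purely by counting cells, not on the RTF itself: the first heading cell is taken to sit over the first four data cells, each range heading over two data cells (a value and a bracketed figure), and the last heading cell over the last data cell. The authoritative column boundaries would be the \cellx values in each row definition, which are not shown here. align_row and $lead are hypothetical names.

```perl
use strict;
use warnings;

# Pair each range heading with a (value, bracketed figure) couple.
# $lead is the ASSUMED number of data cells under the first heading cell.
sub align_row {
    my ($headings, $cells, $lead) = @_;
    my @h = @$headings;
    my @c = @$cells;
    shift @h;                              # first heading cell (table title)
    pop @h;                                # last heading cell
    my @lead_cells = splice @c, 0, $lead;  # row number, code, name, ...
    my $last = pop @c;                     # last data cell
    die "layout mismatch: " . scalar(@c) . " cells for " . scalar(@h) . " ranges\n"
        unless @c == 2 * @h;               # refuse to guess on odd tables
    my %by_range;
    for my $range (@h) {
        my ($value, $bracket) = splice @c, 0, 2;
        $bracket =~ s/^\[(.*)\]$/$1/;      # "[90]" -> "90"
        $by_range{$range} = { value => $value, bracket => $bracket };
    }
    return { lead => \@lead_cells, last => $last, ranges => \%by_range };
}

my @headings = ('Propellant Burn Times',
    '22000m to 20000m', '20000m to 18000m', '18000m to 16000m',
    '16000m to 14000m', '14000m to 12000m', '12000m to 10000m',
    '10000m to 8000m',  '8000m to 6000m',   '6000m to 4000m',
    '4000m to 2000m',   '2000m to BURN CUT OFF', '17812');

my @cells = ('1', '40', 'Composition (RDX1)', '', '', '',
    '1319', '[90]', '1293', '[90]', '1321', '[90]', '1273', '[90]',
    '1245', '[90]', '1173', '[90]', '1117', '[100]', '1102', '[70]',
    '1119', '[10]', '1218', '[10]', '17817');

my $row = align_row(\@headings, \@cells, 4);
for my $range (@headings[1 .. $#headings - 1]) {
    my $r = $row->{ranges}{$range};
    printf "%-22s value=%-5s bracket=%s\n", $range, $r->{value}, $r->{bracket};
}
```

Under that assumption the first range (22000m to 20000m) comes out empty for this row, which would match the remark that some ranges carry no telemetry data. The count check makes the sub die rather than misalign when a table does not fit the assumed layout.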

Skye Shaw!@#$ · Dec 30, 2007

Peter said:
I am trying to extract data from the table in a large number of rtf files.
I tried RTF::Tokenizer and RTF::Parser but could not make progress
so have decided to try regular expressions.

What problem(s) were you having with the RTF modules?

I know looking at RTF can be fun and all, but why hammer out some
regexes to parse RTF
when a module already exists for this?

My project is to get the tabular data into a db for further analysis.
My problem is that I cannot see how to parse the data rows so
that they match the correct field headings.

Any advice or suggestions appreciated!

Not familiar with the format's tokens, but from looking at it quickly,
it appears as though the type of token is given after the text
portion, so you can try something like:

# your subclass of RTF::Parser
# not tested

my $tables = [];
my $cells = [];
my $rows = [];

my $token;

# define tokens ($CELL_END, $ROW_END, $TABLE_END)...

# remember the most recent run of text
sub text {
    $token = $_[1];
}

my %do_on_control = (

    '__DEFAULT__' => sub {

        my ( $self, $type, $arg ) = @_;

        if ($arg) {
            if ($arg eq $CELL_END) {
                push @$cells, $token;
            }
            elsif ($arg eq $ROW_END) {
                push @$rows, $cells;
                $cells = [];
            }
            elsif ($arg eq $TABLE_END) {
                push @$tables, $rows;
                $rows = [];
            }
        }
    },
);

sub parse
{
    my ($self, $file) = @_;
    $self->control_definition( \%do_on_control );
    open(my $IN, '<', $file) or die $!;
    $self->parse_stream($IN);
    close($IN);

    return $tables;
}

Peter Jamieson · Dec 30, 2007

Thanks for the input, Skye!
I read up all I could find on the RTF parsing and tokenizing modules
and came to the conclusion that they were good for text data but
not well suited to tabular data. However, I would be more than happy
to be proven wrong! I can get the header and footer info from the
RTF files into a db OK but could not make progress with the tabular
data. The sticking point was getting the data rows to line up with the
field headings. I had previously used VBA code in MS Excel and MS Word
for this project, but file bloat and unreliability have me searching for a
Perl solution.
I will have a close look at your suggestions asap.
Thanks for your help...very much appreciated!...all the best for 2008!
....cheers, Peter
