Reading in data until I have a full structure

pwaring

I've got a text file which is full of questions in a format similar to
the following:

QUESTION_ID "QUESTION_META_DATA
FULL_QUESTION"
/"SHORT_QUESTION"
(ANSWER_1,
ANSWER_2,
...
ANSWER_N)

At the moment I can parse each individual question into its component
parts without any problems (it's not the most pleasant regex in the
world, but it works). However, I'm having trouble turning the whole
file into an array of questions which I can then parse individually.
Each question is separated from the next by at least two newlines, but
unfortunately there are sometimes two newlines between SHORT_QUESTION
and (ANSWER_1, so I can't assume that two newlines indicate the end of
a question, which is what I've been doing so far.
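
To make the failure concrete, here is a minimal sketch. It assumes the
file is split with Perl's paragraph mode ($/ = ''), which is only one
common way of splitting on blank lines; the post does not say how the
file is actually read. The sample text and the split_on_blank_lines
helper are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: split text into records at runs of blank lines,
# using paragraph mode ($/ = '').
sub split_on_blank_lines {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Cannot read from string: $!";
    local $/ = '';
    my @records = <$fh>;
    close $fh;
    s/\s+\z// for @records;    # drop the trailing newlines on each record
    return @records;
}

# Made-up sample: one question with a blank line before its answers.
my $sample = <<'END';
Q1 "META
FULL"
/"SHORT"

(A1,
A2)
END

my @records = split_on_blank_lines($sample);

# One question comes back as two records, with the answers on their own.
print scalar(@records), " records\n";    # prints "2 records"
```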

I was wondering if anyone could point me in the right direction for a
way to get around this problem - basically I need to read in data
until I know I've got a full question with answers (assuming this ends
at two newlines often means I get the answers separately, which causes
problems when I try to split this into smaller parts), parse that
(which I can already do), save the results somewhere (already done as
well) and then carry on to read in the next question.

If anyone has any ideas as to how I can get around this, I'd be very
grateful.

Thanks in advance,

Paul
 
Mark Clements

I'm sure someone here who knows far more about regular expressions than
I do will come up with a workable solution, but personally I'd be
tempted to use a lexer instead.

http://www.perl.com/pub/a/2006/01/05/parsing.html
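
As a rough illustration of the lexer idea, here is a hand-rolled sketch
using \G with /gc. It is not taken from the linked article. The names
tokenize and group_questions are hypothetical, and the grouping rule (a
new question starts at a bare word that is immediately followed by a
quoted string) is an assumption about the format, as is the idea that
answers are bare words.

```perl
use strict;
use warnings;

# Hypothetical tokenizer: returns a list of [TYPE, value] pairs.
sub tokenize {
    my ($text) = @_;
    my @tokens;
    for ($text) {
        while (1) {
            if    (m{\G\s+}gc)             { next }    # skip whitespace
            elsif (m{\G"([^"]*)"}gc)       { push @tokens, [ STRING => $1 ] }
            elsif (m{\G([/(),])}gc)        { push @tokens, [ PUNCT  => $1 ] }
            elsif (m{\G([^\s"/(),]+)}gc)   { push @tokens, [ WORD   => $1 ] }
            elsif (m{\G\z}gc)              { last }
            else {
                # Only an unterminated quote can get here.
                die "Unexpected character at offset " . ( pos($_) || 0 ) . "\n";
            }
        }
    }
    return @tokens;
}

# Group tokens into questions. Assumption: a WORD directly followed by
# a STRING is a question ID; every other WORD is an answer.
sub group_questions {
    my @tokens = @_;
    my @questions;
    for my $i ( 0 .. $#tokens ) {
        my ( $type, $value ) = @{ $tokens[$i] };
        my $next = $tokens[ $i + 1 ];
        if ( $type eq 'WORD' && $next && $next->[0] eq 'STRING' ) {
            push @questions, { id => $value, strings => [], answers => [] };
        }
        elsif ( !@questions ) { next }    # nothing to attach to yet
        elsif ( $type eq 'STRING' ) { push @{ $questions[-1]{strings} }, $value }
        elsif ( $type eq 'WORD' )   { push @{ $questions[-1]{answers} }, $value }
    }
    return @questions;
}

# Made-up sample: a blank line before the answers, and a second
# question with a single answer and no brackets.
my $sample = <<'END';
Q1 "META_1
FULL_1"
/"SHORT_1"

(A1,
A2)

Q2 "META_2
FULL_2"
/"SHORT_2"
A3
END

for my $q ( group_questions( tokenize($sample) ) ) {
    print "$q->{id}: @{ $q->{answers} }\n";
}
```

Because whitespace is skipped between tokens, blank lines no longer
matter, and the brackets become ordinary punctuation that can be
present or absent.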

Mark
 
A. Sinan Unur

Please read the posting guidelines for this group before posting again.

You might want to read perldoc perlvar, especially about $/ :

#!/usr/bin/perl

use strict;
use warnings;

# Treat ")" followed by a blank line as the end of a record.
local $/ = ")\n\n";

my %questions;

while ( my $chunk = <DATA> ) {
    chomp $chunk;    # removes the trailing ")\n\n"

    $chunk =~ s/\A\s+//;
    $chunk =~ s/\s+\z//;

    if ( $chunk =~ m{
            \A
            \s*
            (\w+)       # QUESTION_ID
            \s+"
            (\w+)       # QUESTION_META_DATA
            \s+
            (\w+)       # FULL_QUESTION
            "\s+/"
            (\w+)       # SHORT_QUESTION
            "\s+\(
            (.+)        # ANSWERS
        }xms
    )
    {
        my %q;
        @q{ qw( qmeta qfull qshort ) } = ( $2, $3, $4 );
        # Answers are separated by a comma and any amount of whitespace,
        # so blank lines between them are harmless.
        $q{answers} = [ split /,\s+/, $5 ];
        $questions{$1} = \%q;
    }
}

use Data::Dumper;
print Dumper \%questions;

__DATA__

QUESTION_1 "QUESTION_META_DATA
FULL_QUESTION"
/"SHORT_QUESTION"
(ANSWER_1,
ANSWER_2,

ANSWER_3,

ANSWER_4,
ANSWER_N)


QUESTION_2 "QUESTION_META_DATA
FULL_QUESTION"
/"SHORT_QUESTION"

(ANSWER_1,
ANSWER_2,
ANSWER_N)

QUESTION_3 "QUESTION_META_DATA
FULL_QUESTION"
/"SHORT_QUESTION"
(ANSWER_1,
ANSWER_2,
ANSWER_X,
ANSWER_N)

C:\DOCUME~1\asu1\LOCALS~1\Temp\2> t
$VAR1 = {
          'QUESTION_3' => {
                            'qfull' => 'FULL_QUESTION',
                            'qshort' => 'SHORT_QUESTION',
                            'answers' => [
                                           'ANSWER_1',
                                           'ANSWER_2',
                                           'ANSWER_X',
                                           'ANSWER_N'
                                         ],
                            'qmeta' => 'QUESTION_META_DATA'
                          },
          'QUESTION_1' => {
                            'qfull' => 'FULL_QUESTION',
                            'qshort' => 'SHORT_QUESTION',
                            'answers' => [
                                           'ANSWER_1',
                                           'ANSWER_2',
                                           'ANSWER_3',
                                           'ANSWER_4',
                                           'ANSWER_N'
                                         ],
                            'qmeta' => 'QUESTION_META_DATA'
                          },
          'QUESTION_2' => {
                            'qfull' => 'FULL_QUESTION',
                            'qshort' => 'SHORT_QUESTION',
                            'answers' => [
                                           'ANSWER_1',
                                           'ANSWER_2',
                                           'ANSWER_N'
                                         ],
                            'qmeta' => 'QUESTION_META_DATA'
                          }
        };
 
pwaring

You might want to read perldoc perlvar, especially about $/ :

#!/usr/bin/perl

use strict;
use warnings;

local $/ = ")\n\n";

That looks almost like what I want, but I should have mentioned in my
original post that the brackets are optional if there is only one
answer, so I don't think that looking for )\n\n would work.

Paul
 
A. Sinan Unur

Well, here's your last fish:

#!/usr/bin/perl

use strict;
use warnings;

my %questions;
my $chunk;

# Every question starts on a line beginning with "QUESTION", so keep
# collecting lines until the next such line (or end of input) turns up.
while ( my $line = <DATA> ) {
    if ( $line =~ /\AQUESTION/ ) {
        parse_chunk($chunk) if defined $chunk;
        $chunk = $line;
    }
    elsif ( defined $chunk ) {
        $chunk .= $line;
    }
}
parse_chunk($chunk) if defined $chunk;    # the last question in the file

sub parse_chunk {
    my ($chunk) = @_;

    $chunk =~ s/\A\s+//;
    $chunk =~ s/\s+\z//;

    if ( $chunk =~ m{
            \A
            \s*
            (\w+)       # QUESTION_ID
            \s+"
            (\w+)       # QUESTION_META_DATA
            \s+
            (\w+)       # FULL_QUESTION
            "\s+/"
            (\w+)       # SHORT_QUESTION
            "\s+\(
            (.+)        # ANSWERS (the closing bracket stays on the last one)
        }xms
    )
    {
        my %q;
        @q{ qw( qmeta qfull qshort ) } = ( $2, $3, $4 );
        $q{answers} = [ split /,\s+/, $5 ];
        $questions{$1} = \%q;
    }
}

use Data::Dumper;
print Dumper \%questions;

__DATA__

QUESTION_1 "QUESTION_META_DATA
FULL_QUESTION"
/"SHORT_QUESTION"
(ANSWER_1,
ANSWER_2,

ANSWER_3,

ANSWER_4,
ANSWER_N)


QUESTION_2 "QUESTION_META_DATA
FULL_QUESTION"
/"SHORT_QUESTION"

(ANSWER_1,
ANSWER_2,
ANSWER_N)

QUESTION_3 "QUESTION_META_DATA
FULL_QUESTION"
/"SHORT_QUESTION"
(ANSWER_1,
ANSWER_2,
ANSWER_X,
ANSWER_N)

$VAR1 = {
          'QUESTION_3' => {
                            'qfull' => 'FULL_QUESTION',
                            'qshort' => 'SHORT_QUESTION',
                            'answers' => [
                                           'ANSWER_1',
                                           'ANSWER_2',
                                           'ANSWER_X',
                                           'ANSWER_N)'
                                         ],
                            'qmeta' => 'QUESTION_META_DATA'
                          },
          'QUESTION_1' => {
                            'qfull' => 'FULL_QUESTION',
                            'qshort' => 'SHORT_QUESTION',
                            'answers' => [
                                           'ANSWER_1',
                                           'ANSWER_2',
                                           'ANSWER_3',
                                           'ANSWER_4',
                                           'ANSWER_N)'
                                         ],
                            'qmeta' => 'QUESTION_META_DATA'
                          },
          'QUESTION_2' => {
                            'qfull' => 'FULL_QUESTION',
                            'qshort' => 'SHORT_QUESTION',
                            'answers' => [
                                           'ANSWER_1',
                                           'ANSWER_2',
                                           'ANSWER_N)'
                                         ],
                            'qmeta' => 'QUESTION_META_DATA'
                          }
        };
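
The output above still carries the closing bracket on the last answer,
and the pattern insists on an opening bracket, so a single answer
without brackets would not match. One possible adjustment is sketched
below. The sub name parse_question and the hash-ref return value are
hypothetical, and it assumes an unbracketed answer follows
SHORT_QUESTION in the same way a bracketed list does.

```perl
use strict;
use warnings;

# Hypothetical variant of parse_chunk: returns a hash ref, or nothing
# if the chunk does not look like a question.
sub parse_question {
    my ($chunk) = @_;

    $chunk =~ s/\A\s+//;
    $chunk =~ s/\s+\z//;

    my ( $id, $meta, $full, $short, $answers ) = $chunk =~ m{
        \A
        (\w+)       # QUESTION_ID
        \s+"
        (\w+)       # QUESTION_META_DATA
        \s+
        (\w+)       # FULL_QUESTION
        "\s+/"
        (\w+)       # SHORT_QUESTION
        "\s+
        \(?         # opening bracket, optional
        (.+?)       # ANSWERS
        \)?         # closing bracket, optional and not captured
        \z
    }xms or return;

    return {
        id      => $id,
        qmeta   => $meta,
        qfull   => $full,
        qshort  => $short,
        answers => [ split /,\s*/, $answers ],
    };
}

# Made-up chunks: a bracketed list and a single unbracketed answer.
for my $chunk (
    qq{Q1 "META\nFULL"\n/"SHORT"\n\n(A1,\nA2)\n},
    qq{Q2 "META\nFULL"\n/"SHORT"\nA3\n},
    )
{
    my $q = parse_question($chunk) or next;
    print "$q->{id}: ", join( ' | ', @{ $q->{answers} } ), "\n";
}
```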


Sinan
 
