Reading in data until I have a full structure

pwaring · Feb 25, 2007

I've got a text file which is full of questions in a format similar to
the following:

QUESTION_ID "QUESTION_META_DATA
FULL_QUESTION"
/"SHORT_QUESTION"
(ANSWER_1,
ANSWER_2,
...
ANSWER_N)

At the moment I can parse each individual question into its component
parts without any problems (it's not the most pleasant regex in the
world, but it works). However, I'm having trouble turning the whole
file into an array of questions which I can then parse individually.
Each question is separated from the next by at least two newlines, but
unfortunately there are sometimes two newlines between SHORT_QUESTION
and (ANSWER_1, so I can't assume that two newlines indicate the end of
a question, which is what I've been doing so far.

I was wondering if anyone could point me in the right direction for a
way to get around this problem - basically I need to read in data
until I know I've got a full question with answers (assuming this ends
at two newlines often means I get the answers separately, which causes
problems when I try to split this into smaller parts), parse that
(which I can already do), save the results somewhere (already done as
well) and then carry on to read in the next question.

If anyone has any ideas as to how I can get around this, I'd be very
grateful.

Thanks in advance,

Paul

Mark Clements · Feb 25, 2007

I'm sure someone here who knows far more about regular expressions than
I do will come up with a workable solution, but personally I'd be
tempted to use a lexer instead.

http://www.perl.com/pub/a/2006/01/05/parsing.html

Mark
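
For what it's worth, a hand-rolled lexer for this format can be quite
small. The sketch below is only a guess at the token set (quoted
strings, the slash, brackets, commas and bare words), and tokenize is a
made-up name, not something taken from the linked article. It assumes
IDs and answers contain no whitespace or punctuation and that quoted
strings have no escaped quotes. Because whitespace is skipped between
tokens, stray blank lines stop mattering; a parser on top of it would
expect WORD STRING SLASH STRING followed by either a bracketed list or
a single WORD.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical token set. Assumes IDs and answers contain no whitespace,
# quotes, slashes, brackets or commas, and strings have no escaped quotes.
sub tokenize {
    my ($text) = @_;
    my @tokens;
    for ($text) {    # alias $_ so /\G.../gc tracks the position in $text
        while (1) {
            if    (/\G\s+/gc)            { next }
            elsif (/\G"([^"]*)"/gc)      { push @tokens, [ STRING => $1 ] }
            elsif (/\G\//gc)             { push @tokens, [ SLASH  => '/' ] }
            elsif (/\G\(/gc)             { push @tokens, [ LPAREN => '(' ] }
            elsif (/\G\)/gc)             { push @tokens, [ RPAREN => ')' ] }
            elsif (/\G,/gc)              { push @tokens, [ COMMA  => ',' ] }
            elsif (/\G([^\s"\/(),]+)/gc) { push @tokens, [ WORD   => $1 ] }
            else                         { last }    # end of input, or an unmatched quote
        }
    }
    return @tokens;
}

# Print just the token types for one sample question.
my @tokens = tokenize(qq{QUESTION_1 "META\nFULL"\n/"SHORT"\n\n(ANSWER_1,\nANSWER_2)\n});
print join( ' ', map { $_->[0] } @tokens ), "\n";
```

Note that the blank line before the opening bracket in the sample
produces no token at all, which is the point of going this way.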

A. Sinan Unur · Feb 25, 2007

> I've got a text file which is full of questions in a format similar to
> the following:

Please read the posting guidelines for this group before posting again.

> QUESTION_ID "QUESTION_META_DATA
> FULL_QUESTION"
> /"SHORT_QUESTION"
> (ANSWER_1,
> ANSWER_2,
> ...
> ANSWER_N)

....

> I was wondering if anyone could point me in the right direction for a
> way to get around this problem - basically I need to read in data
> until I know I've got a full question with answers (assuming this ends
> at two newlines often means I get the answers separately, which causes
> problems when I try to split this into smaller parts), parse that
> (which I can already do), save the results somewhere (already done as
> well) and then carry on to read in the next question.

You might want to read perldoc perlvar, especially about $/ :

#!/usr/bin/perl

use strict;
use warnings;

# Treat ")\n\n" as the record separator so each read returns one question.
local $/ = ")\n\n";

my %questions;

while ( my $chunk = <DATA> ) {
    chomp $chunk;    # removes the trailing ")\n\n"

    $chunk =~ s/\A\s+//;
    $chunk =~ s/\s+\z//;

    # \s* after each newline lets continuation lines be indented or not.
    if ( $chunk =~ m{
            \A
            \s*
            (\w+)       # QUESTION_ID
            \s+"
            (\w+)       # QUESTION_META_DATA
            \n+\s*
            (\w+)       # FULL_QUESTION
            "\n\s*/"
            (\w+)       # SHORT_QUESTION
            "\n+\s*\(
            (.+)        # ANSWERS
        }xms
    )
    {
        my %q;
        @q{ qw( qmeta qfull qshort ) } = ( $2, $3, $4 );
        $q{answers} = [ split /,\s+/, $5 ];
        $questions{$1} = \%q;
    }
}

use Data::Dumper;
print Dumper \%questions;

__DATA__

QUESTION_1 "QUESTION_META_DATA
FULL_QUESTION"
/"SHORT_QUESTION"
(ANSWER_1,
ANSWER_2,

ANSWER_3,

ANSWER_4,
ANSWER_N)

QUESTION_2 "QUESTION_META_DATA
FULL_QUESTION"
/"SHORT_QUESTION"

(ANSWER_1,
ANSWER_2,
ANSWER_N)

QUESTION_3 "QUESTION_META_DATA
FULL_QUESTION"
/"SHORT_QUESTION"
(ANSWER_1,
ANSWER_2,
ANSWER_X,
ANSWER_N)

C:\DOCUME~1\asu1\LOCALS~1\Temp\2> t
$VAR1 = {
          'QUESTION_3' => {
                            'qfull' => 'FULL_QUESTION',
                            'qshort' => 'SHORT_QUESTION',
                            'answers' => [
                                           'ANSWER_1',
                                           'ANSWER_2',
                                           'ANSWER_X',
                                           'ANSWER_N'
                                         ],
                            'qmeta' => 'QUESTION_META_DATA'
                          },
          'QUESTION_1' => {
                            'qfull' => 'FULL_QUESTION',
                            'qshort' => 'SHORT_QUESTION',
                            'answers' => [
                                           'ANSWER_1',
                                           'ANSWER_2',
                                           'ANSWER_3',
                                           'ANSWER_4',
                                           'ANSWER_N'
                                         ],
                            'qmeta' => 'QUESTION_META_DATA'
                          },
          'QUESTION_2' => {
                            'qfull' => 'FULL_QUESTION',
                            'qshort' => 'SHORT_QUESTION',
                            'answers' => [
                                           'ANSWER_1',
                                           'ANSWER_2',
                                           'ANSWER_N'
                                         ],
                            'qmeta' => 'QUESTION_META_DATA'
                          }
        };

pwaring · Feb 25, 2007

> You might want to read perldoc perlvar, especially about $/ :
>
> #!/usr/bin/perl
>
> use strict;
> use warnings;
>
> local $/ = ")\n\n";

That looks almost like what I want, but I should have mentioned in my
original post that the brackets are optional if there is only one
answer, so I don't think that looking for )\n\n would work.

Paul

A. Sinan Unur · Feb 25, 2007

> That looks almost like what I want, but I should have mentioned in my
> original post that the brackets are optional if there is only one
> answer, so I don't think that looking for )\n\n would work.

Well, here's your last fish:

#!/usr/bin/perl

use strict;
use warnings;

my %questions;

# Accumulate lines into $chunk until the next line starting with
# "QUESTION" (or end of input), then hand the chunk to parse_chunk.
LINE: while ( my $line = <DATA> ) {
    next LINE unless $line =~ /\AQUESTION/;

    NEW_QUESTION: my $chunk = $line;

    do {
        $line = <DATA>;

        unless ( defined $line ) {    # end of input: flush the last chunk
            parse_chunk($chunk);
            last LINE;
        }

        if ( $line =~ /\AQUESTION/ ) {    # next question begins
            parse_chunk($chunk);
            goto NEW_QUESTION;
        }

        $chunk .= $line;
    } while (1);
}

sub parse_chunk {
    my ($chunk) = @_;

    $chunk =~ s/\A\s+//;
    $chunk =~ s/\s+\z//;

    # \s* after each newline lets continuation lines be indented or not.
    if ( $chunk =~ m{
            \A
            \s*
            (\w+)       # QUESTION_ID
            \s+"
            (\w+)       # QUESTION_META_DATA
            \n+\s*
            (\w+)       # FULL_QUESTION
            "\n\s*/"
            (\w+)       # SHORT_QUESTION
            "\n+\s*\(
            (.+)        # ANSWERS
        }xms
    )
    {
        my %q;
        @q{ qw( qmeta qfull qshort ) } = ( $2, $3, $4 );
        $q{answers} = [ split /,\s+/, $5 ];
        $questions{$1} = \%q;
    }
}

use Data::Dumper;
print Dumper \%questions;

__DATA__

QUESTION_1 "QUESTION_META_DATA
FULL_QUESTION"
/"SHORT_QUESTION"
(ANSWER_1,
ANSWER_2,

ANSWER_3,

ANSWER_4,
ANSWER_N)

QUESTION_2 "QUESTION_META_DATA
FULL_QUESTION"
/"SHORT_QUESTION"

(ANSWER_1,
ANSWER_2,
ANSWER_N)

QUESTION_3 "QUESTION_META_DATA
FULL_QUESTION"
/"SHORT_QUESTION"
(ANSWER_1,
ANSWER_2,
ANSWER_X,
ANSWER_N)

$VAR1 = {
          'QUESTION_3' => {
                            'qfull' => 'FULL_QUESTION',
                            'qshort' => 'SHORT_QUESTION',
                            'answers' => [
                                           'ANSWER_1',
                                           'ANSWER_2',
                                           'ANSWER_X',
                                           'ANSWER_N)'
                                         ],
                            'qmeta' => 'QUESTION_META_DATA'
                          },
          'QUESTION_1' => {
                            'qfull' => 'FULL_QUESTION',
                            'qshort' => 'SHORT_QUESTION',
                            'answers' => [
                                           'ANSWER_1',
                                           'ANSWER_2',
                                           'ANSWER_3',
                                           'ANSWER_4',
                                           'ANSWER_N)'
                                         ],
                            'qmeta' => 'QUESTION_META_DATA'
                          },
          'QUESTION_2' => {
                            'qfull' => 'FULL_QUESTION',
                            'qshort' => 'SHORT_QUESTION',
                            'answers' => [
                                           'ANSWER_1',
                                           'ANSWER_2',
                                           'ANSWER_N)'
                                         ],
                            'qmeta' => 'QUESTION_META_DATA'
                          }
        };

Sinan
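
Note that parse_chunk above still requires an opening bracket and, as
the dump shows, leaves the closing one on the last answer. The
following is a rough sketch of one way to make the brackets optional,
not a drop-in fix tested against the real file. The sub names
(split_questions, parse_answers, parse_question) are made up for
illustration, and it assumes that a line starting with "QUESTION"
begins a record, that answers contain no commas or brackets, and that
the quoted parts contain no embedded double quotes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Group lines into one chunk per question. Assumption carried over from
# the code above: a line that starts with "QUESTION" begins a new record.
sub split_questions {
    my ($text) = @_;
    my @chunks;
    for my $line ( split /^/, $text ) {    # split /^/ keeps the newlines
        if ( $line =~ /\AQUESTION/ ) {
            push @chunks, $line;           # start a new question
        }
        elsif (@chunks) {
            $chunks[-1] .= $line;          # continuation or blank line
        }
    }
    return @chunks;
}

# Strip the optional brackets, then split on commas.
sub parse_answers {
    my ($raw) = @_;
    $raw =~ s/\A\s*\(?\s*//;
    $raw =~ s/\s*\)?\s*\z//;
    return split /\s*,\s*/, $raw;
}

# Returns ( id => \%fields ) on success, or an empty list on failure.
sub parse_question {
    my ($chunk) = @_;
    return unless $chunk =~ m{
        \A (\w+)             # QUESTION_ID
        \s+ " ([^"]*) "      # QUESTION_META_DATA and FULL_QUESTION together
        \s* / " ([^"]*) "    # SHORT_QUESTION
        \s* (.*) \z          # answers, bracketed or not
    }xms;
    my ( $id, $quoted, $short, $answers ) = ( $1, $2, $3, $4 );
    my ( $meta, $full ) = split /\s*\n\s*/, $quoted, 2;
    return ( $id => {
        qmeta   => $meta,
        qfull   => $full,
        qshort  => $short,
        answers => [ parse_answers($answers) ],
    } );
}

my $sample = <<'END';
QUESTION_1 "QUESTION_META_DATA
FULL_QUESTION"
/"SHORT_QUESTION"

(ANSWER_1,
ANSWER_2)

QUESTION_2 "QUESTION_META_DATA
FULL_QUESTION"
/"SHORT_QUESTION"
ANSWER_1
END

my %questions = map { parse_question($_) } split_questions($sample);
```

Collecting the chunks first also avoids the goto; each chunk can then
be parsed on its own, as the original post wanted.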
