Parse transcripts on speaker's name and grab subsequent paragraphs

perchance

Here's the sort of text I'm looking at that's driving me nuts.

####

JOE: Hello, Jane.

How are you?

Has it been a good day?

JANE: Hey, Joe.

It's been good for me.

JOE: Great.

####

I'd like to parse the transcripts into an ordered hash that would have

[speaker => name,
statement => concatenation of multiple lines of text spoken by that
person
order => For instance, Joe's first statement is 1, Jane's 2, et
cetera.
]

I've tried stepping through the text file with a foreach $line, or as
a total string, with split()s and regexes built around /[A-Z]+:/, but
I can't get it to line up. I fear the regex is beyond me. Can anyone
help?

Thanks.
 
Tad J McClellan

> I'd like to parse the transcripts into an ordered hash that would have


There is no such thing as an "ordered hash"...

> [speaker => name,
> statement => concatenation of multiple lines of text spoken by that
> person
> order => For instance, Joe's first statement is 1, Jane's 2, et
> cetera.
> ]
>
> I've tried stepping through the text file with a foreach $line, or as
> a total string, with split()'s and regexes built around /[A-Z]+:/ but


BILLY BOB: But what about matching my name Perchance?

> I can't get it to line up. I fear the regex is beyond me.


The regex is of "Hello World" complexity, it must be something
else that is beyond you.

:)

> Can anyone
> help?


You simply need a better data structure.

If you want ordering, then you want an array.

If you want to save several attributes in each array element,
then you want a hash.

If you want ordering and named attributes, you want a LoH.

(List of Hashes, really an array containing hash references.)
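To make the LoH idea concrete, here is a minimal sketch with the sample dialog typed in by hand. The key names `speaker`, `statement` and `order` are taken from the original question; nothing here parses a file yet. It also suggests that `order` may not need storing at all, since it is just the array index plus one.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A List of Hashes: the array keeps the order, each hash reference
# holds the named attributes of one statement.
# The data is typed in by hand for illustration.
my @stmts = (
    { speaker => 'JOE',  statement => 'Hello, Jane. How are you? Has it been a good day?' },
    { speaker => 'JANE', statement => "Hey, Joe. It's been good for me." },
    { speaker => 'JOE',  statement => 'Great.' },
);

# "order" is implied by the position in the array (index + 1).
# Storing it explicitly, as the question asked, is a one-liner.
$stmts[$_]{order} = $_ + 1 for 0 .. $#stmts;

for my $s (@stmts) {
    print "$s->{order}. $s->{speaker}: $s->{statement}\n";
}
```

Each element is reached as `$stmts[$i]{speaker}`, which is shorthand for `$stmts[$i]->{speaker}`.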

See:
perldoc perlreftut
etc...

--------------------------------
#!/usr/bin/perl
use warnings;
use strict;

my($speaker, $stmt);
my @stmts;
while ( <DATA> ) {
    next if /^\s+$/;

    if ( /^([A-Z ]+):\s+(.*)/ ) { # new speaker
        push @stmts, { speaker => $speaker, stmt => $stmt} if $stmt;
        $speaker = $1;
        $stmt = $2;
    }
    else { # more dialog
        chomp;
        $stmt .= " $_";
    }
}
push @stmts, { speaker => $speaker, stmt => $stmt};

foreach ( 0 .. $#stmts ) { # Hash Slice to get attributes out
    my($speaker, $stmt) = @{ $stmts[$_] }{ qw/ speaker stmt / };
    print "$_: $speaker\n $stmt\n\n";
}

__DATA__
JOE: Hello, Jane.

How are you?

Has it been a good day?

JANE: Hey, Joe.

It's been good for me.

JOE: Great.
--------------------------------
 
Ilya Zakharevich

[A complimentary Cc of this posting was sent to
Tad J McClellan
> my($speaker, $stmt);
> my @stmts;
> while ( <DATA> ) {
>     next if /^\s+$/;

Do not see a switch to a paragraph mode.
>     if ( /^([A-Z ]+):\s+(.*)/ ) { # new speaker
>         push @stmts, { speaker => $speaker, stmt => $stmt} if $stmt;
>         $speaker = $1;
>         $stmt = $2;
>     }
>     else { # more dialog
>         chomp;
>         $stmt .= " $_";

Chomp()ing looks suspicious... I would remove NL from each paragraph,
and would separate same-speaker paragraphs by a double-NL (if this is
what the OP wanted).

Hope this helps,
Ilya
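
One possible reading of the paragraph-mode suggestion above, as a sketch rather than anyone's posted code: setting `$/` to the empty string makes each read return a whole blank-line-separated paragraph. The sample text is held in a string and read through an in-memory filehandle so the example is self-contained; joining same-speaker paragraphs with a double newline is an assumption about what the original poster wanted.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $text = <<'END';
JOE: Hello, Jane.

How are you?

Has it been a good day?

JANE: Hey, Joe.

It's been good for me.

JOE: Great.
END

my @stmts;
{
    local $/ = '';                  # paragraph mode: blank lines separate records
    open my $fh, '<', \$text or die "Cannot read sample text: $!";
    while ( my $para = <$fh> ) {
        $para =~ s/\n+\z//;         # drop the newlines that end the record
        $para =~ tr/\n/ /;          # join wrapped lines inside one paragraph

        if ( $para =~ /^([A-Z ]+):\s+(.*)/s ) {   # new speaker
            push @stmts, { speaker => $1, stmt => $2 };
        }
        elsif (@stmts) {            # same speaker continues
            $stmts[-1]{stmt} .= "\n\n$para";
        }
    }
    close $fh;
}

for my $i ( 0 .. $#stmts ) {
    printf "%d. %s: %s\n\n", $i + 1, $stmts[$i]{speaker}, $stmts[$i]{stmt};
}
```

Because a whole paragraph arrives at once, no blank-line skipping is needed, and the statement is pushed as soon as its speaker is seen, so there is no trailing push after the loop.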
 
Dr.Ruud

Tad J McClellan schreef:
> __DATA__
> JOE: Hello, Jane.
>
> How are you?
>
> Has it been a good day?
>
> JANE: Hey, Joe.
>
> It's been good for me.
>
> JOE: Great.

Yesterday I asked

BOB: How are you?

;)
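
The counterexample above would make /^([A-Z ]+):/ treat BOB as a new speaker, although Joe is only quoting himself. One possible mitigation, sketched here under the assumption that the cast is known in advance, is to accept a name only if it appears in a list of known speakers. The `%known` hash and its contents are hypothetical and not from any post in this thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical cast list; a real transcript would need its own.
my %known = map { $_ => 1 } ( 'JOE', 'JANE' );

# The tail of the transcript, with the counterexample appended.
my @lines = (
    'JOE: Great.',
    'Yesterday I asked',
    'BOB: How are you?',
);

my @stmts;
for my $line (@lines) {
    # A new speaker must match the pattern AND be in the cast list.
    if ( $line =~ /^([A-Z ]+):\s+(.*)/ and $known{$1} ) {
        push @stmts, { speaker => $1, stmt => $2 };
    }
    elsif (@stmts) {                # anything else is more dialog
        $stmts[-1]{stmt} .= " $line";
    }
}

printf "%s: %s\n", $_->{speaker}, $_->{stmt} for @stmts;
```

The trade-off is that a genuine speaker missing from the list is silently folded into the previous statement, so this only helps when the cast really is known beforehand.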
 
