Parsing multi-line text

keith

Hi all,

I have a data file structured something like this:

------------------8<-----------------------
Chunk 01
NAME: "Alice"
Description: "Some other string"
Age: 37
Chunk 02
NAME: "Bob"
Description: "Some other string"
Age: 28
Chunk 03
FIRST: "Carol"
Description: "Some other string"
Age: 32
Chunk 04
FIRST: "Dave"
Description: "Some other string"
Age: 22
------------------8<-----------------------

and I want to extract from it to produce output something like this:

------------------8<-----------------------
01 NAME: Alice -> 37
02 NAME: Bob -> 28
03 NAME: Carol -> 32
04 NAME: Dave -> 22
------------------8<-----------------------

I've read and re-read the section in perlfaq6 (no, really, I have!)
about multi-line matching, but I can't see how to adapt what is there
to this.

Can someone please point me in the right direction?
Thx!
 
Gunnar Hjalmarsson

local $/ = 'Chunk';
while (<>) {
    if ( /(\d+).+[A-Z]+:\s+"([^"]*)".+Age:\s+(\d+)/s ) {
        printf "%02d NAME: %-10s -> %d\n", $1, $2, $3;
    }
}
 
keith

Gotta love this place - thanks!

Now, let's see if I can decipher (no point in asking if I don't learn
from the answer)...

You make the text 'Chunk' the record delimiter. Then inside each
record you look for digits (store in $1). Skip anything followed by
uppercase text followed by colon followed by space followed by double-
quote. Now grab everything up to next double quote (store in $2).
Skip double-quote, then anything then the text 'Age:' then spaces,
then grab digits (store in $3), and we're done.

Is that close?!
 
Gunnar Hjalmarsson

Yep, that's about it.

Since each chunk spans multiple lines, the /s modifier is important
(it makes . match newlines as well).
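
For reference, here is the same pattern spelled out with /x so that each step of the walkthrough sits next to the piece of regex it describes. This is only a sketch: the sub name parse_record and the sample record are made up for illustration, and the record is what one read would look like with 'Chunk' as the separator.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse one record (the text between two 'Chunk' separators).
# Returns (number, name, age), or an empty list if nothing matches.
# parse_record is a hypothetical helper name, not part of the thread.
sub parse_record {
    my ($record) = @_;
    return $record =~ m{
        (\d+)              # capture 1: the chunk number
        .+                 # anything, newlines included because of /s
        [A-Z]+ : \s+       # an all-uppercase label such as NAME: or FIRST:
        " ([^"]*) "        # capture 2: everything up to the next double quote
        .+                 # anything again, up to the last possible Age:
        Age: \s+ (\d+)     # capture 3: the digits after Age:
    }xs;
}

my $record = qq{ 01\nNAME: "Alice"\nDescription: "Some other string"\nAge: 37\n};
my ($num, $name, $age) = parse_record($record);
print "$num NAME: $name -> $age\n";
```

Without /s the first .+ would stop at the end of the "01" line, so the label on the next line could never be reached and the match would fail.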
 
Peter J. Holzer

Sure about that? In the input you have sometimes "FIRST" and sometimes
"NAME", but in the output it is always NAME. Assuming this is
intentional:


#!/usr/bin/perl
use strict;
use warnings;

my $s = <<EOS;
Chunk 01
NAME: "Alice"
Description: "Some other string"
Age: 37
Chunk 02
NAME: "Bob"
Description: "Some other string"
Age: 28
Chunk 03
FIRST: "Carol"
Description: "Some other string"
Age: 32
Chunk 04
FIRST: "Dave"
Description: "Some other string"
Age: 22
EOS

while ($s =~ m{
        ^Chunk \s (\d+) \n
        \s* (NAME|FIRST): \s "(.*?)" \n    # \s* accepts indented or unindented fields
        \s* Description: \s "(.*?)" \n
        \s* Age: \s (\d+) \n
    }xmg
) {
    print "$1 NAME: $3 -> $5\n";
}

hp
 
Peter J. Holzer

In the sample data, change the Description line of Chunk 02 to

Description: "Some Chunky string"

and then run this script again:

local $/ = 'Chunk';
while (<>) {
    if ( /(\d+).+[A-Z]+:\s+"([^"]*)".+Age:\s+(\d+)/s ) {
        printf "%02d NAME: %-10s -> %d\n", $1, $2, $3;
    }
}

hp
 
Gunnar Hjalmarsson

Well, what's the likelihood that that would happen? At least the OP
didn't object to the idea with 'Chunk' as record separator.
 
Peter J. Holzer

Gunnar said:
Well, what's the likelihood that that would happen?

How would I know? The OP didn't say much about the contents of the
fields. But I'd say it is non-zero. "Chunk" is an English word which
might occur in a description, and it might even be the first 5
characters of a name. Finally, we don't know where data comes from -
somebody might deliberately try to sabotage the script.

Gunnar said:
At least the OP didn't object to the idea with 'Chunk' as record
separator.

I was under the impression that he was glad to understand your solution
at all and wasn't trying to find flaws in it. Far too few people think
about the edge-cases of possible input.
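
To make the edge case concrete: with $/ = 'Chunk' and the "Some Chunky string" description, Bob's record is cut in two and neither half matches, so he silently disappears from the output. One possible way around that, sketched below, is to split only at lines that consist entirely of "Chunk" followed by digits. The sub names split_chunks and parse_chunk are made up for this example, and it assumes that no field value ever contains such a line on its own.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split the text at the start of every line that is exactly "Chunk <digits>".
# Assumption: field values never contain a line of that form.
sub split_chunks {
    my ($text) = @_;
    return split /^(?=Chunk \d+[ \t]*$)/m, $text;
}

# Extract (number, name, age) from one chunk; empty list on failure.
sub parse_chunk {
    my ($chunk) = @_;
    return $chunk =~ m{
        \A Chunk [ ] (\d+) [ \t]* \n     # header line, capture 1: number
        \s* [A-Z]+ : \s+ " ([^"]*) "     # first field, capture 2: name
        .*?                              # skip the remaining fields
        ^ \s* Age: \s+ (\d+)             # capture 3: age
    }xms;
}

my $text = <<'EOS';
Chunk 01
NAME: "Alice"
Description: "Some other string"
Age: 37
Chunk 02
NAME: "Bob"
Description: "Some Chunky string"
Age: 28
EOS

for my $chunk (split_chunks($text)) {
    my ($num, $name, $age) = parse_chunk($chunk) or next;
    print "$num NAME: $name -> $age\n";
}
```

A word inside a description can no longer end a record this way, although input crafted to contain a fake header line would still fool it.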

A word of warning about the solution I posted in a different message: It
doesn't handle embedded quotes - that would be quite easy to add, but
there are different systems of escaping quotes and one would need to
know which one to use - the OP didn't tell us.
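
Purely as an illustration of what "quite easy to add" could look like, here is a sketch that assumes backslash escaping, i.e. \" for a quote and \\ for a backslash inside a value. That scheme is a guess, since nothing in this thread says how the real data escapes quotes, and the names $quoted and field_value are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A double-quoted value in which a backslash escapes the next character.
# ASSUMPTION: backslash escaping; the data format is not specified anywhere.
my $quoted = qr{ " ( (?: [^"\\] | \\. )* ) " }xs;

# Return the unescaped value of a 'Label: "value"' line,
# or nothing if the line does not look like that.
sub field_value {
    my ($line) = @_;
    my ($raw) = $line =~ m{ : \s* $quoted }x or return;
    (my $value = $raw) =~ s/\\(.)/$1/gs;    # drop the escaping backslashes
    return $value;
}

print field_value('NAME: "Dave \"The Rave\" Smith"'), "\n";
```

If the data doubled its quotes instead ("" inside a value), the alternation would have to change accordingly, which is why the escaping convention needs to be known first.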

hp
 
