Parsing multi-line text

keith · Feb 18, 2008

Hi all,

I have a data file structured something like this:

------------------8<-----------------------
Chunk 01
NAME: "Alice"
Description: "Some other string"
Age: 37
Chunk 02
NAME: "Bob"
Description: "Some other string"
Age: 28
Chunk 03
FIRST: "Carol"
Description: "Some other string"
Age: 32
Chunk 04
FIRST: "Dave"
Description: "Some other string"
Age: 22
------------------8<-----------------------

and I want to extract from it to produce output something like this:

------------------8<-----------------------
01 NAME: Alice -> 37
02 NAME: Bob -> 28
03 NAME: Carol -> 32
04 NAME: Dave -> 22
------------------8<-----------------------

I've read and re-read the section in perlfaq6 (no, really, I have!)
about multi-line matching, but I can't see how to adapt what is there
to this.

Can someone please point me in the right direction?
Thx!

Gunnar Hjalmarsson · Feb 18, 2008

local $/ = 'Chunk';
while (<>) {
    if ( /(\d+).+[A-Z]+:\s+"([^"]*)".+Age:\s+(\d+)/s ) {
        printf "%02d NAME: %-10s -> %d\n", $1, $2, $3;
    }
}

keith · Feb 18, 2008

Gotta love this place - thanks!

Now, let's see if I can decipher (no point in asking if I don't learn
from the answer)...

You make the text 'Chunk' the record delimiter. Then inside each
record you look for digits (store in $1). Skip anything followed by
uppercase text followed by colon followed by space followed by double-
quote. Now grab everything up to next double quote (store in $2).
Skip double-quote, then anything then the text 'Age:' then spaces,
then grab digits (store in $3), and we're done.

Is that close?!
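
The walk-through above maps onto the pattern piece by piece. As a rough sketch, here is the same regex spread out with the /x modifier so that each step carries its own comment. The sub name parse_record and the sample record are made up for illustration and do not come from the thread.

```perl
use strict;
use warnings;

# parse_record is a hypothetical helper name used only for this sketch.
# It applies the pattern from the thread to one 'Chunk'-delimited record.
sub parse_record {
    my ($record) = @_;
    return unless $record =~ m{
        (\d+)          # chunk number (capture 1)
        .+             # skip ahead; with /s this crosses newlines
        [A-Z]+ : \s+   # an all-uppercase label, a colon, whitespace
        " ([^"]*) "    # quoted value up to the next quote (capture 2)
        .+             # skip ahead again
        Age: \s+       # the literal text Age: and whitespace
        (\d+)          # the age digits (capture 3)
    }xs;
    return ( $1, $2, $3 );
}

# One record as it looks after the sample data is cut at 'Chunk'.
my $record = qq{ 01\nNAME: "Alice"\nDescription: "Some other string"\nAge: 37\n};
my ( $num, $name, $age ) = parse_record($record);
printf "%02d NAME: %s -> %d\n", $num, $name, $age;
```

Note that "Description:" and "Age:" are not picked up as the label, because [A-Z]+ needs an uppercase letter directly before the colon.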

Gunnar Hjalmarsson · Feb 18, 2008

Yep, that's about it.

Since each chunk spans multiple lines, the /s modifier is important
(it makes . match newlines as well).
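
A small sketch of the difference the /s modifier makes, using a cut-down two-line record (the record text is made up for illustration):

```perl
use strict;
use warnings;

# A two-line record like the ones produced by the 'Chunk' separator.
my $record = qq{ 01\nAge: 37\n};

# Without /s, '.' stops at a newline, so '.+' cannot get from the
# chunk number on line one to 'Age:' on line two.
my $without_s = $record =~ /(\d+).+Age:\s+(\d+)/  ? "match" : "no match";

# With /s, '.' also matches "\n" and the pattern can span both lines.
my $with_s    = $record =~ /(\d+).+Age:\s+(\d+)/s ? "match" : "no match";

print "without /s: $without_s\n";
print "with /s:    $with_s\n";
```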

Peter J. Holzer · Mar 1, 2008

and I want to extract from it to produce output something like this:

------------------8<-----------------------
01 NAME: Alice -> 37
02 NAME: Bob -> 28
03 NAME: Carol -> 32
04 NAME: Dave -> 22
------------------8<-----------------------

Sure about that? In the input you have sometimes "FIRST" and sometimes
"NAME", but in the output it is always NAME. Assuming this is
intentional:

#!/usr/bin/perl
use strict;
use warnings;

my $s = <<EOS;
Chunk 01
NAME: "Alice"
Description: "Some other string"
Age: 37
Chunk 02
NAME: "Bob"
Description: "Some other string"
Age: 28
Chunk 03
FIRST: "Carol"
Description: "Some other string"
Age: 32
Chunk 04
FIRST: "Dave"
Description: "Some other string"
Age: 22
EOS

# \s* (rather than \s+) lets the field lines match with or without
# leading indentation.
while ($s =~ m{
        ^Chunk \s (\d+) \n
        \s* (NAME|FIRST): \s "(.*?)" \n
        \s* Description: \s "(.*?)" \n
        \s* Age: \s (\d+) \n
    }xmg
) {
    print "$1 NAME: $3 -> $5\n";
}

hp

Peter J. Holzer · Mar 1, 2008

In the sample data, change the Description line of chunk 02 to

Description: "Some Chunky string"

and then run Gunnar's script again.

hp
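
To see what a stray 'Chunk' inside a field does to the record-separator approach, here is a small sketch. The helper name extract_rows is hypothetical, and it uses split on the literal string 'Chunk' to mimic reading with $/ = 'Chunk', so that the example runs without an input file.

```perl
use strict;
use warnings;

# extract_rows is a hypothetical helper for this sketch: it cuts the
# text at every literal 'Chunk' (mimicking $/ = 'Chunk') and applies
# the regex from the thread to each piece.
sub extract_rows {
    my ($text) = @_;
    my @rows;
    for my $record ( split /Chunk/, $text ) {
        if ( $record =~ /(\d+).+[A-Z]+:\s+"([^"]*)".+Age:\s+(\d+)/s ) {
            push @rows, sprintf "%02d NAME: %s -> %d", $1, $2, $3;
        }
    }
    return @rows;
}

my $data = <<'EOS';
Chunk 01
NAME: "Alice"
Description: "Some other string"
Age: 37
Chunk 02
NAME: "Bob"
Description: "Some Chunky string"
Age: 28
EOS

# Prints only the row for chunk 01: the record for chunk 02 is cut in
# two at 'Chunky', and neither half matches the pattern.
print "$_\n" for extract_rows($data);
```

The record for Bob is dropped silently, with no warning or error.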

Gunnar Hjalmarsson · Mar 2, 2008

Well, what's the likelihood that that would happen? At least the OP
didn't object to the idea with 'Chunk' as record separator.

Peter J. Holzer · Mar 2, 2008

Gunnar Hjalmarsson said:
Well, what's the likelihood that that would happen?

How would I know? The OP didn't say much about the contents of the
fields. But I'd say it is non-zero. "Chunk" is an English word which
might occur in a description, and it might even be the first 5
characters of a name. Finally, we don't know where data comes from -
somebody might deliberately try to sabotage the script.
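
One way to reduce that risk, sketched under the assumption that every record header is a line of the form "Chunk" followed by whitespace and a number, is to split only where that header starts a line. The helper name split_chunks and the sample values are made up for illustration.

```perl
use strict;
use warnings;

# split_chunks is a hypothetical helper for this sketch. It splits only
# where 'Chunk' starts a line and is followed by whitespace and a digit,
# so the word appearing inside a quoted field does not end a record.
sub split_chunks {
    my ($text) = @_;
    return grep { length } split /^Chunk(?=\s+\d)/m, $text;
}

my $data = <<'EOS';
Chunk 01
NAME: "Alice"
Description: "Some Chunky string"
Age: 37
Chunk 02
NAME: "Chunkley"
Description: "Some other string"
Age: 28
EOS

for my $record ( split_chunks($data) ) {
    if ( $record =~ /(\d+).+[A-Z]+:\s+"([^"]*)".+Age:\s+(\d+)/s ) {
        printf "%02d NAME: %s -> %d\n", $1, $2, $3;
    }
}
```

This is still not proof against hostile input: a quoted field that contains a newline followed by "Chunk 9" would fool it as well.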

Gunnar Hjalmarsson said:
At least the OP didn't object to the idea with 'Chunk' as record
separator.

I was under the impression that he was glad to understand your solution
at all and wasn't trying to find flaws in it. Far too few people think
about the edge-cases of possible input.

A word of warning about the solution I posted in a different message: It
doesn't handle embedded quotes - that would be quite easy to add, but
there are different systems of escaping quotes and one would need to
know which one to use - the OP didn't tell us.
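
As a sketch of what that could look like, the variant below ASSUMES that a quote inside a value is escaped with a backslash (\"). Nothing in the thread says this is the convention the data uses; a format that doubles the quote ("") would need a different pattern. The names $quoted and parse_chunks are made up for illustration.

```perl
use strict;
use warnings;

# ASSUMPTION: quotes inside a value are escaped with a backslash (\").
# The thread does not say which convention the data actually uses.
my $quoted = qr{ " ( (?: [^"\\] | \\. )* ) " }x;

# parse_chunks is a hypothetical name for this sketch.
sub parse_chunks {
    my ($text) = @_;
    my @rows;
    while ( $text =~ m{
            ^Chunk \s (\d+) \n
            \s* (?:NAME|FIRST): \s $quoted \n
            \s* Description:    \s $quoted \n
            \s* Age: \s (\d+) \n
        }xmg
    ) {
        # Captures: 1 = chunk number, 2 = name, 3 = description, 4 = age.
        my ( $num, $name, $age ) = ( $1, $2, $4 );
        $name =~ s/\\(.)/$1/g;    # drop the escaping backslashes
        push @rows, "$num NAME: $name -> $age";
    }
    return @rows;
}

my $data = <<'EOS';
Chunk 01
NAME: "Alice \"Al\" Smith"
Description: "Some other string"
Age: 37
EOS

print "$_\n" for parse_chunks($data);
```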

hp
