Remembering part of last matched string

Chandramohan Neelakantan · Oct 10, 2003

Hello

I have output from the 'pdftotext' command and I need to parse the
information in the text.
The text is in a table format.
It is a hardware design rules file, and the tables hold the associated
information for every rule.

To make the layout easier to see, I have marked all newlines with \n and
all spaces with ^

Also
- Each rule begins with a number like 23.1 or 23.1.1
- The 'Description' column can span several lines/rows
with blank lines in between
- The 'Devices' column could be empty
- It is not possible to ascertain the width of each column

-------------------------------------------------
Rule Description Dimensions Devices Fig
-------------------------------------------------
23.1^^^^^^^^^Text ^^^^^^^^^^^^2.0 x 4.9^^^^^^^ All ^^^^^^^^^^^2,F1\n

except NPN\n
\n
^^^^^^^^^^^^Text Here\n

23.1.1^^^^^^Text ^^^^^^^^^^^^^2.0x4.9^^^^^^^^^^^^^^^^^^^^^^^ 1,F1\n
\n
\n
^^^^^^^^^^^^Text Here\n
^^^^^^^^^^^^Text Here\n

-------------------------------------------------

I want all the columns of each rule, including the additional text
that continues each column on the following lines.

This is what I have done:

- Match the start of each rule with \d+\.\d+
or \d+\.\d+\.\d+ for the rule column
- Use \s+(.*?)\s+ for the second column
- Use \d+ | \d+ x \d+ for the dimensions column
and so on.

To parse all the rules I use /g together with /^\d+\.\d+/,
but this scans only the lines beginning
with rule numbers. The text on the following lines for the description
and devices columns is lost.

The problems are
- this is a row-wise pattern matching approach, and I do not know where
a rule ends until I see the start of the next one
- both the 'Description' and 'Devices' column contents could be spread
over several lines with blank lines in between

I need to extract the information in all the columns for every rule
and save it in a separate file.

I would appreciate it if someone could suggest improvements to my
approach or offer a fresh approach altogether.

Many thanks!!
CM

Gunnar Hjalmarsson · Oct 10, 2003

How about storing the values in a hash of hashes? This may be a start,
assuming the output is in $_:

    my %rules = ();

    while (
        /(\d+(?:\.\d+)+)            # Rule
         \s+
         ([\w\-\s]+[\w\-])          # Description
         \s+
         (\d+\.\d\s?x\s?\d+\.\d)    # Dimensions
         \s+
         ([A-Z]*)                   # Devices (may be empty)
         \s+
         (\d,[A-Z]\d)               # Fig
         \s+
         ([\w\-\s]+[\w\-])          # continuation lines, e.g. "except NPN"
         \s+
         (?=\d+(?:\.\d+)+|$)        # next rule or end
        /gix
    ) {
        $rules{ $1 } = {
            desc   => $2,
            dim    => $3,
            dev    => $4,
            fig    => $5,
            except => $6,
        };
    }
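
Once the hash of hashes is filled, getting the rules back out in order might look roughly like the sketch below. The sample contents of %rules, the '|' separator, and the helper names by_rule_number and format_rule are assumptions made for illustration only; they are not part of the code above.

```perl
use strict;
use warnings;

# Hypothetical contents of %rules after the matching loop has run.
my %rules = (
    '23.1' => {
        desc => 'Text', dim => '2.0 x 4.9', dev => 'All', fig => '2,F1',
        except => "except NPN\n\n   Text Here",
    },
    '23.1.1' => {
        desc => 'Text', dim => '2.0x4.9', dev => '', fig => '1,F1',
        except => "Text Here\nText Here",
    },
);

# Compare rule numbers component by component, so 23.2 sorts before 23.10
# and 23.1 sorts before 23.1.1.
sub by_rule_number {
    my @x = split /\./, $a;
    my @y = split /\./, $b;
    while (@x && @y) {
        my $cmp = shift(@x) <=> shift(@y);
        return $cmp if $cmp;
    }
    return @x <=> @y;    # the shorter number sorts first
}

# Turn one rule into a single line of output.
sub format_rule {
    my ($id, $r) = @_;
    (my $extra = $r->{except}) =~ s/\s+/ /g;    # collapse newlines and space runs
    return join '|', $id, @{$r}{qw(desc dim dev fig)}, $extra;
}

my @lines = map { format_rule($_, $rules{$_}) } sort by_rule_number keys %rules;
print "$_\n" for @lines;
```

Printing to a filehandle opened for writing instead of STDOUT would save the result in a separate file.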

Matija Papec · Oct 10, 2003

Chandramohan said:
The problems are
- this is a row-wise pattern matching approach, and I do not know where
a rule ends until I see the start of the next one
- both the 'Description' and 'Devices' column contents could be spread
over several lines with blank lines in between

I need to extract the information in all the columns for every rule
and save it in a separate file.

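    my $s = ...;    # the rules text goes here

    # Split into one chunk per rule: a rule starts at a line beginning with a digit.
    for my $rule (split /(?:^|\n)(?=\d)/, $s) {

        # First line holds the columns; the rest is continuation text.
        my ($line, $restdesc) = split /\n+/, $rule, 2;
        my @attr = split /\s\s+/, $line;
        $attr[1] .= $restdesc if defined $restdesc;

        print join '~', @attr;
        print "\n";
    }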

I've assumed that every rule begins on a new line and starts with at least
one digit. Also, on the first line of the rule, two or more whitespace
characters are treated as the field separator. You'll only have to check the
size of @attr to see whether "Devices" is defined for a particular rule.
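
A self-contained run of this split-based idea might look like the sketch below. The sample text is modelled on the table in the question (with real spaces and newlines), and the whitespace clean-up of the continuation text and the @records array are additions for illustration, not part of the snippet above. Note that all continuation lines, including "except NPN" from the Devices column, end up appended to the description.

```perl
use strict;
use warnings;

# Sample input modelled on the question's table (an assumption, not real data).
my $s = <<'END';
23.1         Text             2.0 x 4.9        All            2,F1

             except NPN

             Text Here

23.1.1       Text              2.0x4.9                        1,F1


             Text Here
             Text Here
END

my @records;
for my $rule (split /(?:^|\n)(?=\d)/, $s) {
    # First line carries the columns; everything after it is continuation text.
    my ($line, $rest) = split /\n+/, $rule, 2;
    my @attr = split /\s\s+/, $line;

    if (defined $rest) {
        $rest =~ s/^\s+|\s+$//g;    # trim leading and trailing whitespace
        $rest =~ s/\s+/ /g;         # collapse newlines and runs of spaces
        $attr[1] .= " $rest" if length $rest;
    }

    # Five fields means "Devices" was present; four means it was empty.
    push @records, join '~', @attr;
}

print "$_\n" for @records;
```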

Chandramohan Neelakantan · Oct 13, 2003

Could you explain the line

(?=\d+(?:\.\d+)+|$) # next rule or end

please?

CM

Gunnar Hjalmarsson · Oct 14, 2003

Chandramohan said:
could you explain the line

(?=\d+(?:\.\d+)+|$) # next rule or end

I guess you mean the ?= part of it. Maybe I could, but I'm sure that
this would explain it much better:
http://www.perldoc.com/perl5.8.0/pod/perlretut.html#Looking-ahead-and-looking-behind

Please read that, and possibly also the relevant part of
"perldoc perlre", and come back here if there is something in those
parts of the docs that you don't understand.
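
In the meantime, a small sketch of what a lookahead does may help. The sample string and the variable names are made up for illustration. The (?= ... ) group only checks that a rule number (or the end of the string) comes next; it does not consume it, so the following match can start at that same rule number.

```perl
use strict;
use warnings;

# Made-up input: three rules run together on one line.
my $text = "23.1 first rule 23.1.1 second rule 24.2 third rule";

# Capture a rule number plus the text after it, stopping just before
# the next rule number or at the end of the string.
my @chunks = $text =~ /(\d+(?:\.\d+)+.*?)\s*(?=\d+(?:\.\d+)+|$)/g;

print "[$_]\n" for @chunks;
```

One caveat with this simplified pattern: the lookahead cannot tell a rule number from any other dotted number, so a value such as 2.0 in the text would also end a chunk. The full pattern earlier in the thread avoids that by matching the dimensions column explicitly.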

Chandramohan Neelakantan · Oct 17, 2003

Many thanks

I read the docs and learned about the lookahead operators. The docs
were very helpful.

Many thanks for the tips!
