Remembering part of last matched string

  • Thread starter Chandramohan Neelakantan

Chandramohan Neelakantan

Hello

I have output from 'pdftotext' command and I need to parse the
information in the text.
The text is in a table format.
This is a hardware design rules file and has associated information
in these tables for every rule.

To make the layout clearer, I have marked all newlines with \n and all
spaces with ^

Also
- Each rule begins with a number like 23.1 or 23.1.1
- The 'Description' column can span several lines/rows,
with blank lines in between
- The 'Devices' column can be empty
- It is not possible to ascertain the width of each column


-------------------------------------------------
Rule Description Dimensions Devices Fig
-------------------------------------------------
23.1^^^^^^^^^Text ^^^^^^^^^^^^2.0 x 4.9^^^^^^^ All ^^^^^^^^^^^2,F1\n

except NPN\n
\n
^^^^^^^^^^^^Text Here\n




23.1.1^^^^^^Text ^^^^^^^^^^^^^2.0x4.9^^^^^^^^^^^^^^^^^^^^^^^ 1,F1\n
\n
\n
^^^^^^^^^^^^Text Here\n
^^^^^^^^^^^^Text Here\n



-------------------------------------------------


I want all the columns for each rule, including the additional text on
the following lines that belongs to each column.

This is what I have done:

- Match the start of each rule with \d+\.\d+
or \d+\.\d+\.\d+ for the rule column
- Use \s+(.*?)\s+ for the second column
- Use \d+ | \d+ x \d+ for the dimensions column
and so on.

To parse all the rules I use /g together with /^\d+\.\d+/,
but this only scans lines that begin
with rule numbers. The text for the description and devices
columns on the following lines is lost.


The problems are
- This is a row-wise pattern matching approach, and I do not know where
a rule ends until I see the start of the next one
- Both the 'Description' and 'Devices' column contents can be spread over
several lines with blank lines in between


I need to extract the information in all the columns for every rule
and save it in a separate file.


I would appreciate it if someone could suggest improvements to my
approach or propose a fresh approach altogether.


Many thanks!!
CM
 
Gunnar Hjalmarsson

How about storing the values in a hash of hashes? This may be a start,
assuming the output is in $_:

my %rules = ();

while (
    /(\d+(?:\.\d+)+)            # Rule
     \s+
     ([\w\-\s]+[\w\-])          # Description
     \s+
     (\d+\.\d\s?x\s?\d+\.\d)    # Dimensions
     \s+
     ([A-Z]*)                   # Devices
     \s+
     (\d,[A-Z]\d)               # Fig
     \s+
     ([\w\-\s]+[\w\-])          # except NPN
     \s+
     (?=\d+(?:\.\d+)+|$)        # next rule or end
    /gix
) {
    $rules{ $1 } = {
        desc   => $2,
        dim    => $3,
        dev    => $4,
        fig    => $5,
        except => $6,
    };
}
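
Once a hash of hashes like %rules is filled, the "save it in a separate file" step could look roughly like the sketch below. It is only an illustration: the sample values are typed in by hand rather than parsed, and the helper names by_rule_number and write_rules, the tab separator and the field order are assumptions, not something given in the thread.

```perl
use strict;
use warnings;

# Hand-written sample shaped like the %rules hash above (illustrative values).
my %rules = (
    '23.1' => {
        desc => 'Text', dim => '2.0 x 4.9', dev => 'All',
        fig  => '2,F1', except => 'except NPN',
    },
    '23.1.1' => {
        desc => 'Text', dim => '2.0x4.9', dev => '',
        fig  => '1,F1', except => 'Text Here',
    },
);

# Hypothetical helper: compare dotted rule numbers component by component,
# numerically, so that 23.2 sorts before 23.10.
sub by_rule_number {
    my @x = split /\./, $a;
    my @y = split /\./, $b;
    while (@x && @y) {
        my $cmp = shift(@x) <=> shift(@y);
        return $cmp if $cmp;
    }
    return @x <=> @y;    # the shorter number (23.1) sorts before 23.1.1
}

# Hypothetical helper: write one tab-separated line per rule to a filehandle.
sub write_rules {
    my ($fh, $rules) = @_;
    for my $id (sort by_rule_number keys %$rules) {
        my $r = $rules->{$id};
        print {$fh} join("\t", $id,
            map { $r->{$_} // '' } qw(desc dim dev fig except)), "\n";
    }
}

write_rules(\*STDOUT, \%rules);
```

To write to a file instead of STDOUT, pass a handle opened for writing, for example one obtained with open(my $out, '>', 'rules.txt'), where the file name is just a placeholder.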
 
Matija Papec

Chandramohan Neelakantan said:
The problems are
- this is a row-wise pattern matching approach, and I do not know where
a rule ends until I see the start of the next one
- Both the 'Description' and 'Devices' column contents can be spread over
several lines with blank lines in between

I need to extract the information in all the columns for every rule
and save it in a separate file.

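# assumption: the rules text is slurped from STDIN or a file argument
my $s = do { local $/; <> };

for my $rule (split /(?:^|\n)(?=\d)/, $s) {

    # first line holds the columns, the rest is continuation text
    my ($line, $restdesc) = split /\n+/, $rule, 2;
    my @attr = split /\s\s+/, $line;
    # append the continuation text, if any, to the description
    $attr[1] .= " $restdesc" if defined $restdesc;

    print join '~', @attr;
    print "\n";
}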

I've assumed that every rule starts on a new line and begins with a digit.
Also, on the first line of a rule, two or more whitespace characters are
treated as the field separator. You only have to check the size of @attr
to see whether "Devices" is defined for a particular rule.
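
The @attr size check can be sketched as follows, building on the split approach above. This is a rough illustration under stated assumptions: the sample text is retyped from the table in the question with real spaces and newlines, the sub name parse_rules and the hash keys are made up for the example, and any column value containing two consecutive spaces would break the field split.

```perl
use strict;
use warnings;

# Sample retyped from the question, with real spaces instead of ^ marks.
my $sample = join "\n",
    '23.1         Text            2.0 x 4.9       All           2,F1',
    '',
    'except NPN',
    '',
    '            Text Here',
    '',
    '23.1.1      Text             2.0x4.9                       1,F1',
    '',
    '            Text Here',
    '            Text Here',
    '';

# Hypothetical helper: returns one hash reference per rule.
sub parse_rules {
    my ($text) = @_;
    my @rules;
    # A new chunk starts at every line that begins with a digit.
    for my $chunk (split /(?:^|\n)(?=\d)/, $text) {
        my ($line, $rest) = split /\n+/, $chunk, 2;
        my @attr = split /\s\s+/, $line;

        # Five fields: Devices is present. Four fields: Devices was empty,
        # so insert an empty string in its place.
        splice @attr, 3, 0, '' if @attr == 4;

        # Collapse the continuation lines into one space-separated string.
        $rest = '' unless defined $rest;
        $rest =~ s/\s+/ /g;
        $rest =~ s/^ | $//g;

        push @rules, {
            rule => $attr[0], desc => $attr[1], dim   => $attr[2],
            dev  => $attr[3], fig  => $attr[4], extra => $rest,
        };
    }
    return @rules;
}

for my $r (parse_rules($sample)) {
    print join('~', @{$r}{qw(rule desc dim dev fig extra)}), "\n";
}
```

Note that the continuation text is kept in a single field here, because the line layout alone does not say whether a continuation line belongs to 'Description' or to 'Devices'.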
 
Chandramohan Neelakantan

Could you explain this line, please?

(?=\d+(?:\.\d+)+|$) # next rule or end




CM
 
Gunnar Hjalmarsson

Chandramohan Neelakantan

Many thanks

I read the docs and found the lookahead operators. The docs
were very helpful.

Many thanks for the tips!
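
For anyone landing on this thread with the same question, here is a small sketch of what the lookahead in (?=\d+(?:\.\d+)+|$) buys in a /g loop. A lookahead checks that the next rule number (or the end of the string) follows, without consuming it, so the next iteration can still start on that number. The sample text, the variable names and the count_rules helper are invented for this illustration.

```perl
use strict;
use warnings;

# Made-up sample: two rules, the first with a continuation line.
my $text = "23.1 first rule\nmore text\n23.1.1 second rule\n";

my $num = qr/\d+(?:\.\d+)+/;

# Lookahead version: asserts that a rule number or the end follows.
my $with_lookahead = qr/($num)\s+(.*?)\s+(?=$num|$)/s;

# Same pattern, but the next rule number is consumed by the match.
my $consuming = qr/($num)\s+(.*?)\s+(?:$num|$)/s;

# Hypothetical helper: count how many times a pattern matches under /g.
sub count_rules {
    my ($string, $re) = @_;
    my $n = 0;
    $n++ while $string =~ /$re/g;
    return $n;
}

printf "with lookahead: %d rules\n", count_rules($text, $with_lookahead);
printf "consuming:      %d rules\n", count_rules($text, $consuming);
```

With the lookahead both rules are found. In the consuming version the first match swallows "23.1.1", so the second rule has no rule number left to start from and is skipped.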
 
