Chandramohan Neelakantan
Hello
I have output from the 'pdftotext' command and I need to parse the
information in the text.
The text is in a table format.
This is a hardware design rules file and has associated information
in these tables for every rule.
For clarity, I have marked all newlines with \n and all
spaces with ^ in the sample.
Also
- Each rule begins with a number like 23.1 or 23.1.1
- The 'Description' column can span several lines/rows
with blank lines in between
- Devices column could be empty
- It is not possible to ascertain the width of each column
-------------------------------------------------
Rule Description Dimensions Devices Fig
-------------------------------------------------
23.1^^^^^^^^^Text ^^^^^^^^^^^^2.0 x 4.9^^^^^^^ All ^^^^^^^^^^^2,F1\n
except NPN\n
\n
^^^^^^^^^^^^Text Here\n
23.1.1^^^^^^Text ^^^^^^^^^^^^^2.0x4.9^^^^^^^^^^^^^^^^^^^^^^^ 1,F1\n
\n
\n
^^^^^^^^^^^^Text Here\n
^^^^^^^^^^^^Text Here\n
-------------------------------------------------
I want all the columns for each rule, including the additional text on
the following lines, for each column.
This is what I have done:
- Match each rule number at the start of a line with \d+\.\d+
or \d+\.\d+\.\d+ for the rule column
- using \s+(.*?)\s+ for the second column
- \d+ | \d+ x \d+ for the dimensions column
and so on.
To parse all the rules I use the /g modifier along with the pattern
/^\d+\.\d+/, but this scans only lines beginning
with rule numbers. The text for the Description and Devices
columns on the following lines is lost.
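The line-by-line approach described above can be sketched roughly as
follows. This is an illustration only, not the exact script: the sample
lines, the variable names (@lines, @rules) and the slightly tightened
patterns are assumptions made for the sketch. It shows why the
continuation lines ("except NPN", "Text Here") never reach the output:
they do not start with a rule number, so they are skipped.

```perl
use strict;
use warnings;

# Sample lines modelled on the table above (real spaces instead of ^).
my @lines = (
    "23.1         Text             2.0 x 4.9        All            2,F1",
    "                                               except NPN",
    "",
    "            Text Here",
    "23.1.1      Text              2.0x4.9                         1,F1",
    "",
    "            Text Here",
);

my @rules;
for my $line (@lines) {
    # Only lines that start with a rule number can match; every
    # continuation line is silently dropped by this 'next'.
    next unless $line =~ /^(\d+(?:\.\d+)+)\s+(.*?)\s{2,}(\d+(?:\.\d+)?\s*x\s*\d+(?:\.\d+)?)/;
    push @rules, { rule => $1, description => $2, dimensions => $3 };
}

# Print what survived: one row per rule, first line of each column only.
printf "%s | %s | %s\n", @{$_}{qw(rule description dimensions)} for @rules;
```

Only two rows come out, one per rule number, and the extra
Description and Devices text is missing from both.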
The problems are
- This is a row-wise pattern matching approach, and I do not know where
a rule ends until I see the start of the next one
- Both the 'Description' and 'Devices' column contents could be spread
over several lines with blank lines in between
I need to extract the information in all the columns for every rule
and save it in a separate file.
I would appreciate it if someone could suggest improvements to my
approach or propose a fresh approach altogether.
Many thanks!!
CM