Regex to extract row data from text

TimBenz · Oct 22, 2003

I need a RegEx that I can use to scroll through textual data to extract
lines in a semi-regular format. The original data is a form something like
this:

AAA AAAAA BBBB BB CCCCC DDDDD EEEEEE FFFFFFF

Note, there are zero or more spaces in the "A" entity and the "B" entity,
and the rest of the entities have no spaces. Second, there is no fixed
length for any of the entities. They can be any non-zero length. About the
only point of consistency is that the "B" entity has a finite number of
forms, about fifteen. So far my attempt has been like this:

(.*)(COM|COMMON SHARES|Domestic Common)\s{1,}(.*?)\s{1,}(.*?)\s{1,}(.*?)\s

From which I extract $1, $3, and $5.

How do I spool through the whole text file and extract every line for which
the above holds? Are there better ways of doing this without the arduous
part where I have to detail all the variants of the B entity?

Thanks.

Anno Siegel · Oct 22, 2003

TimBenz said:
I need a RegEx that I can use to scroll through textual data to extract
lines in a semi-regular format. The original data is a form something like
this:

AAA AAAAA BBBB BB CCCCC DDDDD EEEEEE FFFFFFF

Note, there are zero or more spaces in the "A" entity and the "B" entity,
and the rest of the entities have no spaces. Second, there is no fixed
length for any of the entities. They can be any non-zero length. About the
only point of consistency is that the "B" entity has a finite number of
forms, about fifteen. So far my attempt has been like this:

(.*)(COM|COMMON SHARES|Domestic Common)\s{1,}(.*?)\s{1,}(.*?)\s{1,}(.*?)\s

Which is the part that is supposed to catch the "B" entry? The one
starting "(COM..." has only three alternatives.

From which I extract $1, $3, and $5.

What about $2?

How do I spool through the whole text file and extract every line for which
the above holds?

my @extract;
while ( <FILE> ){
push @extract, $_ if /.../;
}

Are there better ways of doing this without the arduous
part where I have to detail all the variants of the B entity?

No. From what you say, it is only possible to delimit the "A" record
after having identified the "B" record.

Anno

David Oswald · Oct 22, 2003

TimBenz said:
I need a RegEx that I can use to scroll through textual data to extract
lines in a semi-regular format. The original data is a form something like
this:

AAA AAAAA BBBB BB CCCCC DDDDD EEEEEE FFFFFFF

Note, there are zero or more spaces in the "A" entity and the "B" entity,
and the rest of the entities have no spaces. Second, there is no fixed
length for any of the entities. They can be any non-zero length. About the
only point of consistency is that the "B" entity has a finite number of
forms, about fifteen. So far my attempt has been like this:

(.*)(COM|COMMON SHARES|Domestic Common)\s{1,}(.*?)\s{1,}(.*?)\s{1,}(.*?)\s

From which I extract $1, $3, and $5.

The biggest problem is: how are you planning on delimiting the A segment
from the B segment, if the A segment itself can contain one or more
characters, spaces included, and yet it's a space that separates A from B?
The only way to solve that problem IS to enumerate, through alternation,
all the forms that B can take, so that you can use B as an anchor point.

Fortunately, you don't have to do it in quite so ugly a way.

Try something like this:

my $re_alternates = join "|", @alternates_list;   # build the alternation once
while ( my $line = <DATA> ) {
    # captures A, C, and E; B is matched but not captured
    if ( my ($first, $third, $fifth) = $line =~
        m/^(.+?)(?:$re_alternates)\s+(\w+)\s+\w+\s+(\w+)\s/ ) {
        # do your stuff...
    }
}

To explain: you said you only want to capture the first, third, and fifth
groupings, so I only used capturing parentheses on those portions of the
match. I used non-capturing parens to confine the alternation, and all of
the alternates are built up into $re_alternates.

Finally, instead of using $1, $2, $3, I just used the regexp in list context
so that the scalars $first, $third, and $fifth would be populated in case of
a match.
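
A minimal, self-contained sketch of the same idea, for anyone who wants to run it. The three entries in @alternates_list are just the forms from the original attempt (the real list would have about fifteen), the sample line is made up, and extract_fields is a hypothetical helper name. It also uses \S+ rather than \w+ for the space-free fields and applies quotemeta to the alternates, which are my assumptions about what the data may contain, not something the post specifies.

```perl
use strict;
use warnings;

# Assumed list of "B" forms; the real list would be longer.
my @alternates_list = ('COM', 'COMMON SHARES', 'Domestic Common');

# Escape any regex metacharacters and put longer forms first, so a long
# form is tried before a shorter form that happens to be its prefix.
my $re_alternates = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) } @alternates_list;

# Compile once, outside any loop. Captures A, C, and E.
my $re = qr/^(.+?)\s+(?:$re_alternates)\s+(\S+)\s+\S+\s+(\S+)/;

# Returns (A, C, E) on a match, or an empty list otherwise.
sub extract_fields {
    my ($line) = @_;
    my @fields = $line =~ $re;
    return @fields;
}

my @got = extract_fields("ACME WIDGET CO COMMON SHARES X1 Y2 Z3 W4\n");
print join('|', @got), "\n";
```

For the made-up line above this prints ACME WIDGET CO|X1|Z3. Putting \s+ between the lazy (.+?) and the alternation keeps the trailing space out of the first capture.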

Good luck...

Tore Aursand · Oct 22, 2003

The original data is a form something like this:
[...]

Why don't you post a bit of the _exact_ data you're trying to parse, thus
making it a lot easier for us?

The chance is that you'll get a few answers to your original post, and then
you'll go "yeah, but the data could also include...blah...blah...".

Tassilo v. Parseval · Oct 22, 2003

Also sprach Tore Aursand:

The original data is a form something like this:
[...]

Why don't you post a bit of the _exact_ data you're trying to parse, thus
making it a lot easier for us?

The chance is that you'll get a few answers to your original post, and then
you'll go "yeah, but the data could also include...blah...blah...".

This chance is even higher when he posts a sample of exact data.

Tassilo

Chris Mattern · Oct 22, 2003

Tassilo said:
Also sprach Tore Aursand:

The original data is a form something like this:
[...]

Why don't you post a bit of the _exact_ data you're trying to parse, thus
making it a lot easier for us?

The chance is that you'll get a few answers to your original post, and then
you'll go "yeah, but the data could also include...blah...blah...".

This chance is even higher when he posts a sample of exact data.

When you're parsing input data, what is necessary is a true understanding
of its syntax, not samples which will almost invariably fail to cover
certain cases. "The data looks like such-and-so" or "The data is in
a form like this" is usually a red flag that the speaker doesn't understand
his input data well enough to parse it properly.

Chris Mattern

TimBenz · Oct 22, 2003

Thanks for all the replies. Sorry for having been remiss in not posting the
exact data, but it's proprietary trading data for our money management
firm, so I didn't know what I could post. Here is a representative piece,
however, that I don't think should worry anyone:

NAME OF ISSUER TITLE OF CUSIP MARKET AMOUNT SH/PRINV
DISC OTHER VOTING AUTHORITY

21ST CENTURY INS GRP COMMON 90130N103 974 70700 SH SOLE
70700 0 0
3COM CORP COMMON 885535104 5156 873949 SH SOLE
873949 0 0
3M COMPANY COMMON 88579Y101 36846 533460 SH SOLE
527760 0 5700
3M COMPANY COMMON 88579Y101 2735 39596 SH
OTHER 39596 0 0
IBM CORP COMMON 88179Y101 735 35110 SH SOLE
35110 0 0

As you can see, the structure is fairly open, and even the tab/space
structure changes depending on the size of entry in the first column.

Glenn Jackman · Oct 22, 2003

TimBenz said:
Here is a representative piece,

NAME OF ISSUER TITLE OF CUSIP MARKET AMOUNT SH/PRINV
DISC OTHER VOTING AUTHORITY

21ST CENTURY INS GRP COMMON 90130N103 974 70700 SH SOLE
70700 0 0
3COM CORP COMMON 885535104 5156 873949 SH SOLE
873949 0 0
3M COMPANY COMMON 88579Y101 36846 533460 SH SOLE
527760 0 5700
3M COMPANY COMMON 88579Y101 2735 39596 SH OTHER
39596 0 0
IBM CORP COMMON 88179Y101 735 35110 SH SOLE
35110 0 0

Looks like fixed width fields, as opposed to delimited.
Does the "COMMON" always start at the 31st character?
If so, use substr() to extract the data.

TimBenz · Oct 22, 2003

Looks like fixed width fields, as opposed to delimited.
Does the "COMMON" always start at the 31st character?
If so, use substr() to extract the data.

Sadly, the field widths aren't fixed. It really depends on who filed the
trading report how wide the fields are -- they vary all over the map. So
the substr() method doesn't work. Following advice here, I have written a
regex that keys on the 10 or so variants of the second column and hinges
around that. Irritating, but that seems to be the only thing that works for
me.

Tad McClellan · Oct 22, 2003

Glenn Jackman said:
Looks like fixed width fields, as opposed to delimited.

If so, use substr() to extract the data.

unpack() is the Right Tool for fixed width fields.
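
For illustration only, here is roughly what that looks like. The column widths in the template (30, 8, 9) are invented for the example, and split_fixed is a hypothetical helper; this only applies if the widths really are fixed, which the original poster says is not the case for his files.

```perl
use strict;
use warnings;

# Assumed layout: 30-char issuer, 8-char title, 9-char CUSIP.
# The "A" format strips trailing spaces from each field.
my $template = 'A30 A8 A9';

# Split one fixed-width line into its fields.
sub split_fixed {
    my ($line) = @_;
    return unpack($template, $line);
}

# Build a sample line with exactly the assumed widths.
my $line = sprintf('%-30s%-8s%-9s', '3COM CORP', 'COMMON', '885535104');
my ($issuer, $title, $cusip) = split_fixed($line);
print "$issuer / $title / $cusip\n";
```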

Tore Aursand · Oct 23, 2003

When you're parsing input data, what is necessary is a true understanding
of its syntax, not samples which will almost invariably fail to cover
certain cases. "The data looks like such-and-so" or "The data is in
a form like this" is usually a red flag that the speaker doesn't understand
his input data well enough to parse it properly.

Isn't that why Perl was created?

Tore Aursand · Oct 23, 2003

Thanks for all the replies. Sorry for having been remiss in not posting the
exact data, but it's proprietary trading data for our money management
firm, so I didn't know what I could post. Here is a representative piece,
however, that I don't think should worry anyone:
[...]

The data was wrapped, so I still don't know the original format. It
seems, however, that it's quite hard to parse this data.

But! If you're sure that you know the text on the first line, and that
the following lines are formatted as that line, you could always "cheat":

1. Get the first line.
2. Get the position of each column from that line.
3. Iterate through the "remaining" lines, gathering the data
based on the format of the first line.
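
The three steps above could be sketched like this. It assumes every header label is a single token with no spaces inside it and that each data field starts in the same column as its label; neither is guaranteed for the real files, and the header, the sample line, and the names column_spans and slice_line are all made up for the example.

```perl
use strict;
use warnings;

# Steps 1 and 2: find where each header label starts and how wide its
# column is. The last column has an undefined width (runs to end of line).
sub column_spans {
    my ($header) = @_;
    my @starts;
    while ($header =~ /\S+/g) {
        push @starts, $-[0];    # offset where this label begins
    }
    my @spans;
    for my $i (0 .. $#starts) {
        my $width = $i < $#starts ? $starts[$i + 1] - $starts[$i] : undef;
        push @spans, [ $starts[$i], $width ];
    }
    return @spans;
}

# Step 3: cut one data line at the header-derived positions.
sub slice_line {
    my ($line, @spans) = @_;
    my @fields;
    for my $span (@spans) {
        my ($start, $width) = @$span;
        my $field = $start >= length($line) ? ''
                  : defined($width)         ? substr($line, $start, $width)
                  :                           substr($line, $start);
        $field =~ s/^\s+|\s+$//g;    # trim padding on both sides
        push @fields, $field;
    }
    return @fields;
}

my $header = sprintf('%-16s%-8s%s', 'ISSUER', 'TITLE', 'CUSIP');
my @spans  = column_spans($header);
my @fields = slice_line(sprintf('%-16s%-8s%s', '3COM CORP', 'COMMON', '885535104'), @spans);
print join('|', @fields), "\n";
```

A header with multi-word labels, such as "NAME OF ISSUER", would need the column starts found some other way, for example by hand or by splitting on runs of two or more spaces.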

Not a clever solution, but it would work.

Chris Mattern · Oct 23, 2003

Tore said:
Isn't that why Perl was created?

Heh. Yeah, I guess so. Often, the best way to get a handle on understanding
your data is to go through several iterations of failing to parse it correctly.
That's not something a newsgroup can really do for you, of course...

Chris Mattern
