Regex to extract row data from text

TimBenz

I need a RegEx that I can use to scroll through textual data to extract
lines in a semi-regular format. The original data is a form something like
this:

AAA AAAAA BBBB BB CCCCC DDDDD EEEEEE FFFFFFF

Note, there are zero or more spaces in the "A" entity and the "B" entity,
and the rest of the entities have no spaces. Second, there is no fixed
length for any of the entities. They can be any non-zero length. About the
only point of consistency is that the "B" entity has a finite number of
forms, about fifteen. So far my attempt has been like this:

(.*)(COM|COMMON SHARES|Domestic Common)\s{1,}(.*?)\s{1,}(.*?)\s{1,}(.*?)\s

From which I extract $1, $3, and $5.

How do I spool through the whole text file and extract every line for which
the above holds? Are there better ways of doing this without the arduous
part where I have to detail all the variants of the B entity?

Thanks.
 
Anno Siegel

TimBenz said:
I need a RegEx that I can use to scroll through textual data to extract
lines in a semi-regular format. The original data is a form something like
this:

AAA AAAAA BBBB BB CCCCC DDDDD EEEEEE FFFFFFF

Note, there are zero or more spaces in the "A" entity and the "B" entity,
and the rest of the entities have no spaces. Second, there is no fixed
length for any of the entities. They can be any non-zero length. About the
only point of consistency is that the "B" entity has a finite number of
forms, about fifteen. So far my attempt has been like this:

(.*)(COM|COMMON SHARES|Domestic Common)\s{1,}(.*?)\s{1,}(.*?)\s{1,}(.*?)\s

Which is the part that is supposed to catch the "B" entry? The one
starting "(COM..." has only three alternatives.

From which I extract $1, $3, and $5.

What about $2?

How do I spool through the whole text file and extract every line for which
the above holds?

my @extract;
while ( <FILE> ) {
    push @extract, $_ if /.../;
}

Are there better ways of doing this without the arduous
part where I have to detail all the variants of the B entity?

No. From what you say, it is only possible to delimit the "A" record
after having identified the "B" record.
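
For anyone who wants to try the loop above, here is a minimal self-contained sketch. The helper name matching_lines, the sample text and the pattern are all made up for illustration, and an in-memory filehandle stands in for the real file:

```perl
use strict;
use warnings;

# Hypothetical helper: return every line read from $fh that matches $re.
sub matching_lines {
    my ($fh, $re) = @_;
    my @extract;
    while ( my $line = <$fh> ) {
        push @extract, $line if $line =~ $re;
    }
    return @extract;
}

# Invented sample text; the real input would be the trading report.
my $text = "ACME CORP COM 111 222 333 444\n"
         . "not a data row\n"
         . "FOO INC COMMON SHARES 1 2 3 4\n";

# Open the string as a file so the sketch needs nothing on disk.
open my $fh, '<', \$text or die "cannot open in-memory file: $!";
my @rows = matching_lines( $fh, qr/\s(?:COMMON SHARES|COM)\s/ );
close $fh;

print @rows;
```

With a real file, only the open call should need to change.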

Anno
 
David Oswald

TimBenz said:
I need a RegEx that I can use to scroll through textual data to extract
lines in a semi-regular format. The original data is a form something like
this:

AAA AAAAA BBBB BB CCCCC DDDDD EEEEEE FFFFFFF

Note, there are zero or more spaces in the "A" entity and the "B" entity,
and the rest of the entities have no spaces. Second, there is no fixed
length for any of the entities. They can be any non-zero length. About the
only point of consistency is that the "B" entity has a finite number of
forms, about fifteen. So far my attempt has been like this:

(.*)(COM|COMMON SHARES|Domestic Common)\s{1,}(.*?)\s{1,}(.*?)\s{1,}(.*?)\s

From which I extract $1, $3, and $5.

The biggest problem is how you plan to delimit the A segment from the B
segment when the A segment can itself contain spaces, and yet it is a space
that separates A from B. The only way to solve that problem IS to enumerate,
through alternation, all the forms that B can take, so that you can use B
as an anchor point.

Fortunately, you don't have to do it in quite so ugly a way.

Try something like this:

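my $re_alternates = join "|", @alternates_list;    # every form B can take
while ( my $line = <DATA> ) {
    # A is matched lazily up to the first B form; after B come
    # C (captured), D (skipped) and E (captured).
    if ( my ($first, $third, $fifth) = $line =~
        m/^(.+?)\s+(?:$re_alternates)\s+(\w+)\s+\w+\s+(\w+)\s/ ) {
        # do your stuff...
    }
}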

To explain:
You said you only want to capture the first, third and fifth groupings. So
I only used capturing parentheses on those portions of the match. I used
non-capturing parens to confine the alternation. And all of the alternates
are built up into $re_alternates.

Finally, instead of using $1, $2, $3, I just used the regexp in list context
so that the scalars $first, $third, and $fifth would be populated in case of
a match.
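
A runnable sketch of the same idea, in case it helps. The three B forms come from the original post; the sample row and the parse_row name are invented. It departs from the pattern above in two ways, both assumptions on my part: the forms are sorted longest first and passed through quotemeta, and the fields after B are matched with \S+ rather than \w+ so that punctuation in a field does not break the match.

```perl
use strict;
use warnings;

# The three "B" forms from the original post; the real list would be longer.
my @alternates_list = ( 'COM', 'COMMON SHARES', 'Domestic Common' );

# Longest first, so a longer form is tried before a shorter one it starts
# with; quotemeta protects any regex metacharacters inside a form.
my $re_alternates = join '|',
    map { quotemeta } sort { length $b <=> length $a } @alternates_list;

# A (captured), B (anchor), C (captured), D (skipped), E (captured).
my $row_re = qr/^(.+?)\s+(?:$re_alternates)\s+(\S+)\s+\S+\s+(\S+)(?:\s|$)/;

# Hypothetical helper: returns (A, C, E) on a match, nothing otherwise.
sub parse_row {
    my ($line) = @_;
    if ( my ($first, $third, $fifth) = $line =~ $row_re ) {
        return ($first, $third, $fifth);
    }
    return;
}

# Invented sample row in the A B C D E F shape.
my @fields = parse_row("ACME HOLDINGS INC COMMON SHARES 12345 678 910 XYZ\n");
print join( '|', @fields ), "\n";
```

Compiling the pattern once with qr// outside the loop also avoids rebuilding the alternation for every line.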

Good luck...
 
Tore Aursand

The original data is a form something like this:
[...]

Why don't you post a bit of the _exact_ data you're trying to parse, thus
making it a lot easier for us?

Chances are that you'll get a few answers to your original post, and then
you'll go "yeah, but the data could also include...blah...blah...".
 
Tassilo v. Parseval

Also sprach Tore Aursand:
The original data is a form something like this:
[...]

Why don't you post a bit of the _exact_ data you're trying to parse, thus
making it a lot easier for us?

Chances are that you'll get a few answers to your original post, and then
you'll go "yeah, but the data could also include...blah...blah...".

This chance is even higher when he posts a sample of exact data.

Tassilo
 
Chris Mattern

Tassilo said:
Also sprach Tore Aursand:

The original data is a form something like this:
[...]

Why don't you post a bit of the _exact_ data you're trying to parse, thus
making it a lot easier for us?

Chances are that you'll get a few answers to your original post, and then
you'll go "yeah, but the data could also include...blah...blah...".

This chance is even higher when he posts a sample of exact data.

When you're parsing input data, what is necessary is a true understanding
of its syntax, not samples which will almost invariably fail to cover
certain cases. "The data looks like such-and-so" or "The data is in
a form like this" is usually a red flag that the speaker doesn't understand
his input data well enough to parse it properly.

Chris Mattern
 
TimBenz

Thanks for all the replies. Sorry for having been remiss in not posting the
exact data, but it's proprietary trading data for our money management
firm, so I didn't know what I could post. Here is a representative piece,
however, that I don't think should worry anyone:

NAME OF ISSUER TITLE OF CUSIP MARKET AMOUNT SH/PRINV
DISC OTHER VOTING AUTHORITY

21ST CENTURY INS GRP COMMON 90130N103 974 70700 SH SOLE
70700 0 0
3COM CORP COMMON 885535104 5156 873949 SH SOLE
873949 0 0
3M COMPANY COMMON 88579Y101 36846 533460 SH SOLE
527760 0 5700
3M COMPANY COMMON 88579Y101 2735 39596 SH
OTHER 39596 0 0
IBM CORP COMMON 88179Y101 735 35110 SH SOLE
35110 0 0



As you can see, the structure is fairly open, and even the tab/space
structure changes depending on the size of entry in the first column.
 
Glenn Jackman

TimBenz said:
Here is a representative piece,

NAME OF ISSUER TITLE OF CUSIP MARKET AMOUNT SH/PRINV
DISC OTHER VOTING AUTHORITY

21ST CENTURY INS GRP COMMON 90130N103 974 70700 SH SOLE
70700 0 0
3COM CORP COMMON 885535104 5156 873949 SH SOLE
873949 0 0
3M COMPANY COMMON 88579Y101 36846 533460 SH SOLE
527760 0 5700
3M COMPANY COMMON 88579Y101 2735 39596 SH OTHER
39596 0 0
IBM CORP COMMON 88179Y101 735 35110 SH SOLE
35110 0 0

Looks like fixed width fields, as opposed to delimited.
Does the "COMMON" always start at the 31st character?
If so, use substr() to extract the data.
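
A minimal sketch of what that could look like, and only if the widths really are fixed. The column widths here (30, 10 and 10 characters) and the split_fixed name are assumptions for illustration, not taken from the report:

```perl
use strict;
use warnings;

# Assumed layout: issuer in columns 1-30, title in 31-40, CUSIP in 41-50.
sub split_fixed {
    my ($line) = @_;
    my @fields = (
        substr( $line, 0,  30 ),
        substr( $line, 30, 10 ),
        substr( $line, 40, 10 ),
    );
    s/\s+$// for @fields;    # trim the padding on each field
    return @fields;
}

# Build a padded sample line so the offsets are known to be right.
my $line = sprintf '%-30s%-10s%-10s', '3COM CORP', 'COMMON', '885535104';
my ($issuer, $title, $cusip) = split_fixed($line);
print "$issuer|$title|$cusip\n";
```

Lines shorter than the assumed layout would need a length check before the substr() calls.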
 
TimBenz

Looks like fixed width fields, as opposed to delimited.
Does the "COMMON" always start at the 31st character?
If so, use substr() to extract the data.

Sadly, the field widths aren't fixed. It really depends on who filed the
trading report how wide the fields are -- they vary all over the map. So
the substr() method doesn't work. Following advice here, I have written a
regex that keys on the 10 or so variants of the second column and hinges
around that. Irritating, but that seems to be the only thing that works for
me.
 
Tore Aursand

When you're parsing input data, what is necessary is a true understanding
of its syntax, not samples which will almost invariably fail to cover
certain cases. "The data looks like such-and-so" or "The data is in
a form like this" is usually a red flag that the speaker doesn't understand
his input data well enough to parse it properly.

Isn't that why Perl was created? :)
 
Tore Aursand

Thanks for all the replies. Sorry for having been remiss in not posting the
exact data, but it's proprietary trading data for our money management
firm, so I didn't know what I could post. Here is a representative piece,
however, that I don't think should worry anyone:
[...]

The data was wrapped, so I still don't know the original format. It
seems, however, that it's quite hard to parse this data.

But! If you're sure that you know the text on the first line, and that
the following lines are formatted as that line, you could always "cheat":

1. Get the first line.
2. Get the position of each column from that line.
3. Iterate through the "remaining" lines, gathering the data
based on the format of the first line.

Not a clever solution, but it would work.
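
Roughly, the three steps above could be sketched like this. It assumes the header labels are known in advance (here only three of them, shortened for the example) and that each data row lines up under its header; the sub names and the padded sample lines are invented:

```perl
use strict;
use warnings;

# Assumed header labels, in column order.
my @labels = ( 'NAME OF ISSUER', 'TITLE', 'CUSIP' );

# Step 2: find where each label starts in the header line.
sub column_starts {
    my ($header) = @_;
    my @starts;
    for my $label (@labels) {
        my $pos = index( $header, $label );
        die "label '$label' not found in header\n" if $pos < 0;
        push @starts, $pos;
    }
    return @starts;
}

# Step 3: cut a data row at those positions.
sub slice_row {
    my ($row, @starts) = @_;
    my @fields;
    for my $i ( 0 .. $#starts ) {
        # Each field runs to the start of the next column (or end of line).
        my $end   = $i < $#starts ? $starts[ $i + 1 ] : length $row;
        my $field = substr( $row, $starts[$i], $end - $starts[$i] );
        $field =~ s/^\s+|\s+$//g;
        push @fields, $field;
    }
    return @fields;
}

# Invented, padded sample lines standing in for the real report.
my $header = sprintf '%-24s%-10s%-10s', 'NAME OF ISSUER', 'TITLE', 'CUSIP';
my $row    = sprintf '%-24s%-10s%-10s', '3M COMPANY', 'COMMON', '88579Y101';

my @starts = column_starts($header);    # step 1 is reading $header itself
my @fields = slice_row( $row, @starts );
print join( '|', @fields ), "\n";
```

This breaks as soon as a value is wider than its header column, which is the weak spot of the approach.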
 
Chris Mattern

Tore said:
Isn't that why Perl was created? :)

Heh. Yeah, I guess so. Often, the best way to get a handle on understanding
your data is to go through several iterations of failing to parse it correctly.
That's not something a newsgroup can really do for you, of course...

Chris Mattern
 
