Regex to extract row data from text

Discussion in 'Perl Misc' started by TimBenz, Oct 22, 2003.

  1. TimBenz

    TimBenz Guest

    I need a RegEx that I can use to scroll through textual data to extract
    lines in a semi-regular format. The original data is a form something like
    this:

    AAA AAAAA BBBB BB CCCCC DDDDD EEEEEE FFFFFFF

    Note, there are zero or more spaces in the "A" entity and the "B" entity,
    and the rest of the entities have no spaces. Second, there is no fixed
    length for any of the entities. They can be any non-zero length. About the
    only point of consistency is that the "B" entity has a finite number of
    forms, about fifteen. So far my attempt has been like this:

    (.*)(COM|COMMON SHARES|Domestic Common)\s{1,}(.*?)\s{1,}(.*?)\s{1,}(.*?)\s

    From which I extract $1, $3, and $5.

    How do I spool through the whole text file and extract every line for which
    the above holds? Are there better ways of doing this without the arduous
    part where I have to detail all the variants of the B entity?

    Thanks.
    TimBenz, Oct 22, 2003
    #1

  2. Anno Siegel

    Anno Siegel Guest

    TimBenz <> wrote in comp.lang.perl.misc:
    > I need a RegEx that I can use to scroll through textual data to extract
    > lines in a semi-regular format. The original data is a form something like
    > this:
    >
    > AAA AAAAA BBBB BB CCCCC DDDDD EEEEEE FFFFFFF
    >
    > Note, there are zero or more spaces in the "A" entity and the "B" entity,
    > and the rest of the entities have no spaces. Second, there is no fixed
    > length for any of the entities. They can be any non-zero length. About the
    > only point of consistency is that the "B" entity has a finite number of
    > forms, about fifteen. So far my attempt has been like this:
    >
    > (.*)(COM|COMMON SHARES|Domestic Common)\s{1,}(.*?)\s{1,}(.*?)\s{1,}(.*?)\s


    Which is the part that is supposed to catch the "B" entry? The one
    starting "(COM..." has only three alternatives.

    > From which I extract $1, $3, and $5.


    What about $2?

    > How do I spool through the whole text file and extract every line for which
    > the above holds?


    my @extract;
    while ( <FILE> ) {
        push @extract, $_ if /.../;
    }
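    A slightly fuller sketch of that loop, assuming a lexical filehandle and a
    stand-in pattern (the real pattern is elided in the post). The sample text
    and $pattern are hypothetical, and an in-memory filehandle is used only so
    the snippet runs on its own:

```perl
use strict;
use warnings;

# Hypothetical sample input; in practice this would be the report file.
my $text = <<'END';
ACME WIDGETS COM 12345 67 89
this line does not match
BETA HOLDINGS COMMON SHARES 222 33 44
END

# Stand-in pattern (an assumption, not the poster's final regex).
my $pattern = qr/\b(?:COMMON SHARES|COM)\s+\d+/;

# Open an in-memory filehandle so the example is self-contained.
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

my @extract;
while ( my $line = <$fh> ) {
    chomp $line;
    push @extract, $line if $line =~ $pattern;
}
close $fh;

print "$_\n" for @extract;
```

    With a real file, the open call would take the file name instead of the
    reference to $text; the loop itself stays the same.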

    > Are there better ways of doing this without the arduous
    > part where I have to detail all the variants of the B entity?


    No. From what you say, it is only possible to delimit the "A" record
    after having identified the "B" record.

    Anno
    Anno Siegel, Oct 22, 2003
    #2

  3. David Oswald

    David Oswald Guest

    "TimBenz" <> wrote in message

    > I need a RegEx that I can use to scroll through textual data to extract
    > lines in a semi-regular format. The original data is a form something like
    > this:
    >
    > AAA AAAAA BBBB BB CCCCC DDDDD EEEEEE FFFFFFF
    >
    > Note, there are zero or more spaces in the "A" entity and the "B" entity,
    > and the rest of the entities have no spaces. Second, there is no fixed
    > length for any of the entities. They can be any non-zero length. About the
    > only point of consistency is that the "B" entity has a finite number of
    > forms, about fifteen. So far my attempt has been like this:
    >
    > (.*)(COM|COMMON SHARES|Domestic Common)\s{1,}(.*?)\s{1,}(.*?)\s{1,}(.*?)\s
    >
    > From which I extract $1, $3, and $5.


    The biggest problem is how you plan to delimit the A segment from the B
    segment. The A segment can contain one or more characters, including
    spaces, and yet it is a space that separates A from B. The only way to
    solve that problem IS to enumerate, through alternation, all the forms
    that B can take, so that you can use B as an anchor point.

    Fortunately, you don't have to do it in quite so ugly a way.

    Try something like this:

    # Build the alternation once, outside the loop.
    my $re_alternates = join "|", @alternates_list;
    while ( my $line = <DATA> ) {
        if ( my ($first, $third, $fifth) = $line =~
            m/^(.+?)(?:$re_alternates)\s+(\w+)\s+\w+\s+(\w+)\s+$/ ) {
            # do your stuff...
        }
    }

    ....to explain...
    You said you only want to capture the first, third and fifth groupings, so
    I only used capturing parentheses on those portions of the match. I used
    non-capturing parens to confine the alternation, and all of the alternates
    are built up into $re_alternates.

    Finally, instead of using $1, $2, $3, I just used the regexp in list
    context so that the scalars $first, $third, and $fifth would be populated
    in case of a match.
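
    For anyone who wants to try the idea end to end, here is a minimal
    sketch under stated assumptions. The list of "B" forms is just the three
    from the original question, the input rows are made up, and the pattern
    is a variation on the approach described here rather than the exact regex
    from this post: it sorts the forms longest first, escapes them with
    quotemeta, and tolerates a trailing sixth field.

```perl
use strict;
use warnings;

# Hypothetical list of "B" forms; the thread mentions about fifteen.
my @alternates_list = ( 'COM', 'COMMON SHARES', 'Domestic Common' );

# Longest first, so "COMMON SHARES" is tried before its prefix "COM";
# quotemeta guards against regex metacharacters in a form.
my $re_alternates = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) } @alternates_list;

# A (may contain spaces), B (one of the known forms), then space-free
# fields; only the first, third and fifth entities are captured.
my $re = qr/^(.+?)\s+(?:$re_alternates)\s+(\S+)\s+\S+\s+(\S+)(?:\s|$)/;

# Made-up rows in the "A B C D E F" shape described in the question.
my @lines = (
    "ACME WIDGETS INC COMMON SHARES 12345 678 90 XYZ\n",
    "BETA CORP COM 555 66 7 Q\n",
    "no recognisable title here\n",
);

my @rows;
for my $line (@lines) {
    # In list context the match returns the captures, or nothing on failure.
    if ( my ( $first, $third, $fifth ) = $line =~ $re ) {
        push @rows, [ $first, $third, $fifth ];
    }
}

printf "%s | %s | %s\n", @$_ for @rows;
```

    The third row is skipped because none of the known forms appears in it,
    which is the behaviour you want for header and continuation lines.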

    Good luck...
    David Oswald, Oct 22, 2003
    #3
  4. Tore Aursand

    Tore Aursand Guest

    On Wed, 22 Oct 2003 07:25:26 +0000, TimBenz wrote:
    > The original data is a form something like this:
    > [...]


    Why don't you post a bit of the _exact_ data you're trying to parse, thus
    making it a lot easier for us?

    Chances are that you'll get a few answers to your original post, and then
    you'll go "yeah, but the data could also include...blah...blah...".


    --
    Tore Aursand <>
    Tore Aursand, Oct 22, 2003
    #4
  5. Also sprach Tore Aursand:

    > On Wed, 22 Oct 2003 07:25:26 +0000, TimBenz wrote:
    >> The original data is a form something like this:
    >> [...]

    >
    > Why don't you post a bit of the _exact_ data you're trying to parse, thus
    > making it a lot easier for us?
    >
    > Chances are that you'll get a few answers to your original post, and then
    > you'll go "yeah, but the data could also include...blah...blah...".


    This chance is even higher when he posts a sample of exact data.

    Tassilo
    --
    $_=q#",}])!JAPH!qq(tsuJ[{@"tnirp}3..0}_$;//::niam/s~=)]3[))_$-3(rellac(=_$({
    pam{rekcahbus})(rekcah{lrePbus})(lreP{rehtonabus})!JAPH!qq(rehtona{tsuJbus#;
    $_=reverse,s+(?<=sub).+q#q!'"qq.\t$&."'!#+sexisexiixesixeseg;y~\n~~dddd;eval
    Tassilo v. Parseval, Oct 22, 2003
    #5
  6. Tassilo v. Parseval wrote:
    > Also sprach Tore Aursand:
    >
    >
    >>On Wed, 22 Oct 2003 07:25:26 +0000, TimBenz wrote:
    >>
    >>>The original data is a form something like this:
    >>>[...]

    >>
    >>Why don't you post a bit of the _exact_ data you're trying to parse, thus
    >>making it a lot easier for us?
    >>
    >>Chances are that you'll get a few answers to your original post, and then
    >>you'll go "yeah, but the data could also include...blah...blah...".

    >
    >
    > This chance is even higher when he posts a sample of exact data.
    >

    When you're parsing input data, what is necessary is a true understanding
    of its syntax, not samples which will almost invariably fail to cover
    certain cases. "The data looks like such-and-so" or "The data is in
    a form like this" is usually a red flag that the speaker doesn't understand
    his input data well enough to parse it properly.

    Chris Mattern
    Chris Mattern, Oct 22, 2003
    #6
  7. TimBenz

    TimBenz Guest

    Re: Regex to extract row data from text (Copy of data included)

    Thanks for all the replies. Sorry for having been remiss in not posting the
    exact data, but it's proprietary trading data for our money management
    firm, so I didn't know what I could post. Here is a representative piece,
    however, that I don't think should worry anyone:

    NAME OF ISSUER TITLE OF CUSIP MARKET AMOUNT SH/PRINV
    DISC OTHER VOTING AUTHORITY

    21ST CENTURY INS GRP COMMON 90130N103 974 70700 SH SOLE
    70700 0 0
    3COM CORP COMMON 885535104 5156 873949 SH SOLE
    873949 0 0
    3M COMPANY COMMON 88579Y101 36846 533460 SH SOLE
    527760 0 5700
    3M COMPANY COMMON 88579Y101 2735 39596 SH
    OTHER 39596 0 0
    IBM CORP COMMON 88179Y101 735 35110 SH SOLE
    35110 0 0



    As you can see, the structure is fairly open, and even the tab/space
    structure changes depending on the size of entry in the first column.
    TimBenz, Oct 22, 2003
    #7
  8. Re: Regex to extract row data from text (Copy of data included)

    TimBenz <> wrote:
    > Here is a representative piece,
    >
    > NAME OF ISSUER TITLE OF CUSIP MARKET AMOUNT SH/PRINV
    > DISC OTHER VOTING AUTHORITY
    >
    > 21ST CENTURY INS GRP COMMON 90130N103 974 70700 SH SOLE
    > 70700 0 0
    > 3COM CORP COMMON 885535104 5156 873949 SH SOLE
    > 873949 0 0
    > 3M COMPANY COMMON 88579Y101 36846 533460 SH SOLE
    > 527760 0 5700
    > 3M COMPANY COMMON 88579Y101 2735 39596 SH OTHER
    > 39596 0 0
    > IBM CORP COMMON 88179Y101 735 35110 SH SOLE
    > 35110 0 0


    Looks like fixed width fields, as opposed to delimited.
    Does the "COMMON" always start at the 31st character?
    If so, use substr() to extract the data.


    --
    Glenn Jackman
    NCF Sysadmin
    Glenn Jackman, Oct 22, 2003
    #8
  9. TimBenz

    TimBenz Guest

    Re: Regex to extract row data from text (Copy of data included)

    Glenn Jackman <> wrote:

    >
    > Looks like fixed width fields, as opposed to delimited.
    > Does the "COMMON" always start at the 31st character?
    > If so, use substr() to extract the data.


    Sadly, the field widths aren't fixed. It really depends on who filed the
    trading report how wide the fields are -- they vary all over the map. So
    the substr() method doesn't work. Following advice here, I have written a
    regex that keys on the 10 or so variants of the second column and hinges
    around that. Irritating, but that seems to be the only thing that works for
    me.
    TimBenz, Oct 22, 2003
    #9
  10. Re: Regex to extract row data from text (Copy of data included)

    Glenn Jackman <> wrote:

    > Looks like fixed width fields, as opposed to delimited.


    > If so, use substr() to extract the data.



    unpack() is the Right Tool for fixed width fields.
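
    A minimal sketch of what that looks like, assuming the fields really were
    fixed width (the original poster says they are not). The widths in the
    template are invented for illustration and the row is built with sprintf
    so the columns are guaranteed to line up:

```perl
use strict;
use warnings;

# Hypothetical fixed-width layout: 30 columns for the issuer, 10 for the
# title, 10 for the CUSIP, then the rest of the line. These widths are an
# assumption for illustration, not taken from the real reports.
my $template = 'A30 A10 A10 A*';

# Build a sample row with sprintf so the columns line up with the template.
my $row = sprintf '%-30s%-10s%-10s%s',
    '21ST CENTURY INS GRP', 'COMMON', '90130N103', '974 70700 SH SOLE';

# The "A" format strips trailing spaces from each extracted field.
my ( $issuer, $title, $cusip, $rest ) = unpack $template, $row;

print join( ' | ', $issuer, $title, $cusip, $rest ), "\n";
```

    Compared with a chain of substr() calls, the whole layout lives in one
    template string, which is easier to adjust when a width changes.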


    --
    Tad McClellan SGML consulting
    Perl programming
    Fort Worth, Texas
    Tad McClellan, Oct 22, 2003
    #10
  11. Tore Aursand

    Tore Aursand Guest

    On Wed, 22 Oct 2003 06:25:55 -0400, Chris Mattern wrote:
    > When you're parsing input data, what is necessary is a true understanding
    > of its syntax, not samples which will almost invariably fail to cover
    > certain cases. "The data looks like such-and-so" or "The data is in
    > a form like this" is usually a red flag that the speaker doesn't understand
    > his input data well enough to parse it properly.


    Isn't that why Perl was created? :)


    --
    Tore Aursand <>
    Tore Aursand, Oct 23, 2003
    #11
  12. Tore Aursand

    Tore Aursand Guest

    Re: Regex to extract row data from text (Copy of data included)

    On Wed, 22 Oct 2003 14:50:32 +0000, TimBenz wrote:
    > Thanks for all the replies. Sorry for having been remiss in not posting the
    > exact data, but it's proprietary trading data for our money management
    > firm, so I didn't know what I could post. Here is a representative piece,
    > however, that I don't think should worry anyone:
    > [...]


    The data was wrapped, so I still don't know the original format. It
    seems, however, that it's quite hard to parse this data.

    But! If you're sure that you know the text on the first line, and that
    the following lines are formatted as that line, you could always "cheat":

    1. Get the first line.
    2. Get the position of each column from that line.
    3. Iterate through the "remaining" lines, gathering the data
    based on the format of the first line.

    Not a clever solution, but it would work.
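
    The three steps above could be sketched roughly as follows. This assumes
    the header labels are known in advance and that every row is aligned
    under the header; the labels are taken from the posted sample, but the
    column widths and the unwrapped rows are made up, since the real layout
    is unknown:

```perl
use strict;
use warnings;

# Hypothetical, unwrapped input: a header line followed by rows aligned
# under it. The widths in $fmt are invented for this example.
my @labels = ( 'NAME OF ISSUER', 'TITLE OF', 'CUSIP', 'MARKET' );
my $fmt    = '%-24s%-10s%-11s%s';
my $header = sprintf $fmt, @labels;
my @lines  = (
    sprintf( $fmt, '21ST CENTURY INS GRP', 'COMMON', '90130N103', '974' ),
    sprintf( $fmt, '3COM CORP',            'COMMON', '885535104', '5156' ),
);

# Steps 1 and 2: find where each known label starts in the header.
my @starts = map { index( $header, $_ ) } @labels;
die "label missing from header\n" if grep { $_ < 0 } @starts;

# Step 3: cut every remaining line at those offsets.
my @rows;
for my $line (@lines) {
    my @fields;
    for my $i ( 0 .. $#starts ) {
        # The last column runs to the end of the line.
        my $field = $i < $#starts
            ? substr( $line, $starts[$i], $starts[ $i + 1 ] - $starts[$i] )
            : substr( $line, $starts[$i] );
        $field =~ s/\s+$//;    # trim column padding
        push @fields, $field;
    }
    push @rows, \@fields;
}

print join( ' | ', @$_ ), "\n" for @rows;
```

    This only works when each report keeps its own header and rows aligned,
    so the offsets would have to be recomputed for every file.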


    --
    Tore Aursand <>
    Tore Aursand, Oct 23, 2003
    #12
  13. Tore Aursand wrote:
    > On Wed, 22 Oct 2003 06:25:55 -0400, Chris Mattern wrote:
    >
    >>When you're parsing input data, what is necessary is a true understanding
    >>of its syntax, not samples which will almost invariably fail to cover
    >>certain cases. "The data looks like such-and-so" or "The data is in
    >>a form like this" is usually a red flag that the speaker doesn't understand
    >>his input data well enough to parse it properly.

    >
    >
    > Isn't that why Perl was created? :)


    Heh. Yeah, I guess so. Often, the best way to get a handle on understanding
    your data is to go through several iterations of failing to parse it correctly.
    That's not something a newsgroup can really do for you, of course...

    Chris Mattern
    Chris Mattern, Oct 23, 2003
    #13
