Non-uniform split

Discussion in 'Perl Misc' started by thisismyidentity@gmail.com, Sep 7, 2006.

  1. Guest

    Hi all,
    I am writing a Perl script that should parse each line of a file (which
    unfortunately I can't modify) and split the line. The main problem is
    that the lines of the file (nearly 10000 of them) are not uniform. So
    there doesn't seem to be a pattern or a delimiter on which I can simply
    split each line in a loop over all lines :(.
    Here is an example:
    ========================
    A      B       C          D      E
    d32    ab      ae99       WB     89
    d33    cd      e787       WC     78
    d34    ef                 WD
    d35    gh      ancjd      WT     100
    d36    ij                 WP
    ..
    ..
    ========================

    My main intention is to extract the values in columns A, B, C, ... into an
    array, but since some values under a column may be missing on some
    lines, I am unable to find a single regex on which I can split all
    lines in a loop. I tried the (obvious) \s+ regex for splitting, but
    since the empty columns are filled with spaces, a given column ends up
    at a different index on different lines. I am especially interested in
    two columns that are guaranteed to be non-empty on every line
    (like A, B, D), but because of the other empty columns I can't get them
    at a fixed index of the array returned by split().

    Please give suggestions on the following:

    What regex could I use that would solve my problem?

    Is there any other way apart from split by which I could achieve this
    (assuming that there is no single regex to split on)?

    Any possible way will do (as long as I can loop, since the number of lines is huge).

    Thanks.
    Greg
    , Sep 7, 2006
    #1

  2. wrote:
    > Hi all,
    > I am writing a Perl script that should parse each line of a file (which
    > unfortunately I cant modify) and split the line. The main problem is
    > that every line (nearly 10000 lines) of the file is not uniform. So
    > there doesnt seem to be a pattern or a delimiter on which I can simply
    > split the line and could do it in a loop over all lines :(.
    > Here is an example:
    > ========================
    > A B C D E
    > d32 ab ae99 WB 89
    > d33 cd e787 WC 78
    > d34 ef WD
    > d35 gh ancjd WT 100
    > d36 ij WP
    > .
    > .
    > ========================
    >
    > My main intention is to extract the values in Column A, B,C..into an
    > array but since in some lines some values under a column may not be
    > present..I am unable to have a single regex on which i can split all
    > lines in a loop. I tried the (obvious) \s+ regex for splitting but
    > since the columns that r empty have spaces, I get different results for
    > a particular column on different lines. I am especially interested in
    > two columns for which it is guaranteed that each line will be non-empty
    > (like A,B,D) but coz of other empty columns cant get them on a
    > particular index of the array which is returned by split().
    >
    > Please give suggestions for following:
    >
    > What regex could I use which wud solve my problem?
    >
    > Is there any other way apart from split by which i cud achieve this
    > (assuming that there is no single regex to spit on) ?
    >
    > Any possible way (as far as I can loop..since no of lines is huge)


    If we look at the row where the value of column A is 'd34', how do you
    know that 'WD' is the value of column D rather than column C (assuming that
    it actually is the value of column D)? If this were tab-delimited, would
    there be an empty string for the value of column C? Or perhaps this is
    fixed-width?
    it_says_BALLS_on_your forehead, Sep 7, 2006
    #2

  3. [A complimentary Cc of this posting was sent to
    Christian Winter
    <>], who wrote in article <450073a8$0$17405$-online.net>:
    > wrote:
    > > Here is an example:
    > > ========================
    > > A B C D E
    > > d32 ab ae99 WB 89
    > > d33 cd e787 WC 78
    > > d34 ef WD
    > > d35 gh ancjd WT 100
    > > d36 ij WP


    This looks width-encoded: all fields occupy certain columns. Use
    unpack "A[number] ..." to break it into parts, then strip extra whitespace.

    If, e.g., the boundary between D and E is not column-based, but the other
    boundaries are, do the same, but extract the "D + E" pair first; THEN use
    a regexp approach to split D and E.

    Hope this helps,
    Ilya
    Ilya Zakharevich, Sep 7, 2006
    #3
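A minimal sketch of the two-stage idea Ilya describes (not his code): unpack peels off the fixed-width A, B and C fields, and a regex then splits the "D + E" remainder. The widths 7/8/11 are an assumption borrowed from the templates posted later in this thread, and parse_two_stage is a hypothetical helper name.

```perl
use strict;
use warnings;

# Assumed layout: A is 7 columns wide, B is 8, C is 11; the remainder of the
# line holds D and E, separated by whitespace rather than by position.
sub parse_two_stage {
    my ($line) = @_;
    chomp $line;

    # Stage 1: fixed-width fields. "A" strips trailing blanks on unpack.
    my ($col_a, $col_b, $col_c, $rest) = unpack 'A7 A8 A11 A*', $line;

    # Stage 2: split the "D + E" pair with a regex; E may be absent.
    my ($col_d, $col_e) = $rest =~ /^\s*(\S+)(?:\s+(\S+))?/;

    # Absent fields come back undefined; normalise them to ''.
    return map { defined $_ ? $_ : '' } $col_a, $col_b, $col_c, $col_d, $col_e;
}

# Sample lines built in the assumed layout.
my @samples = map { sprintf '%-7s%-8s%-11s%-7s%s', @$_ }
    [qw(d32 ab ae99 WB 89)],
    ['d34', 'ef', '', 'WD', ''];

print join(' | ', parse_two_stage($_)), "\n" for @samples;
```

If the real file turns out to be fully column-based, the second stage is unnecessary and a single unpack template covers all five fields.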
  4. Mumia W. Guest

    On 09/07/2006 01:56 PM, wrote:
    > Hi all,
    > I am writing a Perl script that should parse each line of a file (which
    > unfortunately I cant modify) and split the line. The main problem is
    > that every line (nearly 10000 lines) of the file is not uniform. So
    > there doesnt seem to be a pattern or a delimiter on which I can simply
    > split the line and could do it in a loop over all lines :(.
    > Here is an example:
    > ========================
    > A B C D E
    > d32 ab ae99 WB 89
    > d33 cd e787 WC 78
    > d34 ef WD
    > d35 gh ancjd WT 100
    > d36 ij WP
    > ..
    > ..
    > ========================
    >
    > My main intention is to extract the values in Column A, B,C..into an
    > array but since in some lines some values under a column may not be
    > present..I am unable to have a single regex on which i can split all
    > lines in a loop. I tried the (obvious) \s+ regex for splitting but
    > since the columns that r empty have spaces, I get different results for
    > a particular column on different lines. I am especially interested in
    > two columns for which it is guaranteed that each line will be non-empty
    > (like A,B,D) but coz of other empty columns cant get them on a
    > particular index of the array which is returned by split().
    >
    > Please give suggestions for following:
    >
    > What regex could I use which wud solve my problem?
    >
    > Is there any other way apart from split by which i cud achieve this
    > (assuming that there is no single regex to spit on) ?
    >
    > Any possible way (as far as I can loop..since no of lines is huge)
    >
    > Thanks.
    > Greg
    >


    You can either use unpack() to extract ranges of bytes from
    the string, or you can use a regex that uses character
    quantifiers {}, e.g.

    my @fields = unpack('A6 A6 A8 A4 A*', $string);

    OR

    my @fields = $string =~ m/^(.{6})(.{6})(.{8})(.{4})(.*)/;

    WARNING: UNTESTED CODE
    Mumia W., Sep 7, 2006
    #4
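The two alternatives above are not quite interchangeable, which a small sketch can show. It assumes the widths 7/8/11/7 used later in the thread (the 6/6/8/4 above were posted untested), and fields_by_unpack and fields_by_regex are hypothetical helper names. The point: a regex with fixed quantifiers needs the line to be at least as long as all the fixed fields together, while unpack simply returns shorter or empty fields.

```perl
use strict;
use warnings;

# Assumed widths; adjust to the real file.
my $template = 'A7 A8 A11 A7 A*';
my $regex    = qr/^(.{7})(.{8})(.{11})(.{7})(.*)/;

sub fields_by_unpack {
    my ($line) = @_;
    return unpack $template, $line;    # "A" strips trailing blanks
}

sub fields_by_regex {
    my ($line) = @_;
    my @fields = $line =~ $regex or return;    # empty list when the line is too short
    s/\s+$// for @fields;                      # the regex keeps the padding, so trim it
    return @fields;
}

# One fully padded line, and one that stops right after column D.
my $full  = sprintf '%-7s%-8s%-11s%-7s%s', qw(d32 ab ae99 WB 89);
my $short = 'd34    ef' . (' ' x 17) . 'WD';

for my $line ($full, $short) {
    my @by_unpack = fields_by_unpack($line);
    my @by_regex  = fields_by_regex($line);
    printf "unpack: %d fields, regex: %d fields\n",
        scalar @by_unpack, scalar @by_regex;
}
```

Whether short lines occur depends on whether the file pads every line with trailing blanks, which the thread does not say.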
  5. Ted Zlatanov Guest

    On 7 Sep 2006, wrote:

    > Here is an example:
    > ========================
    > A B C D E
    > d32 ab ae99 WB 89
    > d33 cd e787 WC 78
    > d34 ef WD
    > d35 gh ancjd WT 100
    > d36 ij WP
    > .
    > .
    > ========================


    > What regex could I use which wud solve my problem?
    >
    > Is there any other way apart from split by which i cud achieve this
    > (assuming that there is no single regex to spit on) ?
    >
    > Any possible way (as far as I can loop..since no of lines is huge)


    10K lines is not big at all. Anyhow.

    Your essential problem is that you don't have consistent data. How
    can Perl or anyone else know that lines 3 and 5 in your data are
    missing the C column, for example? I'm guessing they don't have the C
    column because "WD" and "WP" look like they belong in the D column,
    and others have guessed that also, but it doesn't mean we're right.
    If you can "anchor" WD and WP for us, promising that anything that
    begins with W and has just two uppercase letters is in the D column,
    the problem is easy to solve.

    Finally, are there tab characters in the data? There aren't any in
    your example, but it's possible they are your delimiters and didn't
    come through the Usenet post.

    Ted
    Ted Zlatanov, Sep 7, 2006
    #5
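If Ted's "anchor" can be promised, a single regex is enough even when the column alignment is lost. The sketch below rests on assumptions the original poster has not confirmed: A, B and D are always present, no field contains whitespace, and D is always "W" followed by one more uppercase letter. parse_anchored is a hypothetical helper name.

```perl
use strict;
use warnings;

# Column D acts as the anchor; C and E are optional around it.
my $row_re = qr/
    ^\s*
    (\S+) \s+            # A
    (\S+) \s+            # B
    (?: (\S+) \s+ )?     # C, optional
    (W[A-Z]) \b          # D, the anchor
    (?: \s+ (\S+) )?     # E, optional
    \s*$
/x;

sub parse_anchored {
    my ($line) = @_;
    my @fields = $line =~ $row_re or return;    # empty list for non-matching lines
    # Optional groups that did not take part in the match are undefined.
    return map { defined $_ ? $_ : '' } @fields;
}

for my $line ('A B C D E', 'd32 ab ae99 WB 89', 'd34 ef WD') {
    my @fields = parse_anchored($line);
    print @fields ? join(' | ', @fields) : "skipped: $line", "\n";
}
```

The header line is skipped as a side effect because it has no W-code. A row whose C value itself looks like a W-code next to a missing D would be misread, so this is only as reliable as the promise about column D.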
  6. Dark Guest

    >
    > Is there any other way apart from split by which i cud achieve this
    > (assuming that there is no single regex to spit on) ?
    >
    > Any possible way (as far as I can loop..since no of lines is huge)
    >
    > Thanks.
    > Greg


    If you really want to use a regex, here is something primitive that
    might get the job done (fills a hash and prints it - keeping track of
    line numbers and columns). I'd probably just use unpack.

    -I


    $data = <<HERE
    A B C D E
    d32 ab ae99 WB 89
    d33 cd e787 WC 78
    d34 ef WD
    d35 gh ancjd WT 100
    d36 ij WP

    HERE
    ;
    @lines = split("\n", $data);
    my %data;
    my $counter;
    for ($counter=0;$counter<=$#lines;$counter++) {
    $line = $lines[$counter];
    $_ = $line;
    /([0-9\sa-zA-Z]{0,7})([0-9\sa-zA-Z]{0,8})([0-9\sa-zA-Z]{0,11})([0-9\sa-zA-Z]{0,7})([0-9\sa-zA-Z]{0,7})/;
    if ($1) {
    $data{$counter}{'a'} = $1;
    $data{$counter}{'b'} = $2;
    $data{$counter}{'c'} = $3;
    $data{$counter}{'d'} = $4;
    $data{$counter}{'e'} = $5;
    }
    }

    #Print out the data in the hash
    for ($counter=0;$counter<=$#lines;$counter++) {
    my @cols;
    ($cols[0], $cols[1], $cols[2], $cols[3], $cols[4]) =
    ('a','b','c','d','e');
    for ($incount=0;$incount<=$#cols;$incount++) {
    print "Line $counter column
    $cols[$incount]=\"$data{$counter}{$cols[$incount]}\"\n";
    }

    }
    Dark, Sep 7, 2006
    #6
  7. -berlin.de Guest

    Dark <> wrote in comp.lang.perl.misc:
    > >
    > > Is there any other way apart from split by which i cud achieve this
    > > (assuming that there is no single regex to spit on) ?
    > >
    > > Any possible way (as far as I can loop..since no of lines is huge)
    > >
    > > Thanks.
    > > Greg

    >
    > If you really want to use a regex here is something primative that
    > might get the job done (fills a hash and prints it - keeping track of
    > line numbers and columns). I'd probably just use unpack.
    >
    > -I


    Hmm... Your code is not strict-safe and produces a lot of warnings
    when those are switched on. The indentation is random. When run,
    it outputs 60 lines, beginning

    Line 0 column
    a=""
    Line 0 column
    b=""
    Line 0 column
    c=""
    Line 0 column
    d=""
    Line 0 column
    e=""
    Line 1 column
    a=""
    Line 1 column
    ....

    Is that what it is supposed to do?

    > $data = <<HERE


    Semicolon missing after that statement.

    > A B C D E
    > d32 ab ae99 WB 89
    > d33 cd e787 WC 78
    > d34 ef WD
    > d35 gh ancjd WT 100
    > d36 ij WP
    >
    > HERE
    > ;


    Misplaced semicolon.

    > @lines = split("\n", $data);
    > my %data;


    The keys in %data are the values of $counter below, so essentially the
    input line numbers. That kind of data is better kept in an array. Make
    that

    my @data;

    > my $counter;
    > for ($counter=0;$counter<=$#lines;$counter++) {
    > $line = $lines[$counter];
    > $_ = $line;


    All this data-shuffling is unnecessary. Replace it with

    for ( split /\n/, $data ) {

    > /([0-9\sa-zA-Z]{0,7})([0-9\sa-zA-Z]{0,8})([0-9\sa-zA-Z]{0,11})([0-9\sa-zA-Z]{0,7})([0-9\sa-zA-Z]{0,7})/;


    This regex is too big to be placed in the code directly. Define a regex
    variable outside the loop (my $re = qr/.../;) and use $re here:

    /$re/;

    I have not checked if the regex does indeed match what it needs to,
    I'm assuming it does. However, it captures trailing blanks with each
    field. In a complete solution these should be dropped.

    > if ($1) {


    What if $1 happens to contain a false boolean value? Check the entire
    match for success, not one haphazard match variable.

    > $data{$counter}{'a'} = $1;
    > $data{$counter}{'b'} = $2;
    > $data{$counter}{'c'} = $3;
    > $data{$counter}{'d'} = $4;
    > $data{$counter}{'e'} = $5;


    Since @data is an array now, this must be written differently:

    push @data, { a => $1, b => $2, c => $3, d => $4, e => $5};

    I'd write the entire loop body like this:

    if ( my @cols = /$re/ ) {
    push @data, { map { $_ => shift @cols } qw( a b c d e) };
    } else {
    warn "invalid data";
    }

    > }
    > }


    The print loop below is also more roundabout than it has to be.

    > #Print out the data in the hash
    > for ($counter=0;$counter<=$#lines;$counter++) {
    > my @cols;
    > ($cols[0], $cols[1], $cols[2], $cols[3], $cols[4]) =
    > ('a','b','c','d','e');
    > for ($incount=0;$incount<=$#cols;$incount++) {
    > print "Line $counter column
    > $cols[$incount]=\"$data{$counter}{$cols[$incount]}\"\n";
    > }
    >
    > }


    That amounts to a re-write along these lines:

    my $data = <<HERE;
    A      B       C          D      E
    d32    ab      ae99       WB     89
    d33    cd      e787       WC     78
    d34    ef                 WD
    d35    gh      ancjd      WT     100
    d36    ij                 WP
    HERE

    my $fc = '[0-9\sa-zA-Z]'; # a field character
    my $re = qr/($fc{0,7})($fc{0,8})($fc{0,11})($fc{0,7})($fc{0,7})/;

    my @recs;
    for ( split /\n/, $data) {
        if ( my @cols = /$re/ ) {
            s/ +$// for @cols; # trim trailing blanks
            @{ $recs[ @recs]}{ 'a' .. 'e'} = @cols;
        }
    }

    for my $rec ( @recs ) {
        print join( ', ', map "$_ => $rec->{ $_}", sort keys %$rec), "\n";
    }

    Anno
    -berlin.de, Sep 8, 2006
    #7
  8. Mumia W. Guest

    On 09/07/2006 01:56 PM, wrote:
    > Hi all,
    > I am writing a Perl script that should parse each line of a file (which
    > unfortunately I cant modify) and split the line. The main problem is
    > that every line (nearly 10000 lines) of the file is not uniform. So
    > there doesnt seem to be a pattern or a delimiter on which I can simply
    > split the line and could do it in a loop over all lines :(.
    > Here is an example:
    > ========================
    > A B C D E
    > d32 ab ae99 WB 89
    > d33 cd e787 WC 78
    > d34 ef WD
    > d35 gh ancjd WT 100
    > d36 ij WP
    > ..
    > ..
    > ========================
    >
    > My main intention is to extract the values in Column A, B,C..into an
    > array but since in some lines some values under a column may not be
    > present..I am unable to have a single regex on which i can split all
    > lines in a loop. I tried the (obvious) \s+ regex for splitting but
    > since the columns that r empty have spaces, I get different results for
    > a particular column on different lines. I am especially interested in
    > two columns for which it is guaranteed that each line will be non-empty
    > (like A,B,D) but coz of other empty columns cant get them on a
    > particular index of the array which is returned by split().
    >
    > Please give suggestions for following:
    >
    > What regex could I use which wud solve my problem?
    >
    > Is there any other way apart from split by which i cud achieve this
    > (assuming that there is no single regex to spit on) ?
    >
    > Any possible way (as far as I can loop..since no of lines is huge)
    >
    > Thanks.
    > Greg
    >


    Greg, this is your lucky day, because, even though you didn't
    post any attempt of your own to solve this problem, people
    (including me) are falling over themselves to write this
    program for you. E.g.:

    #!/usr/bin/perl

    use strict;
    use warnings;

    my ($line, @line);
    $line = <DATA>;

    local $\ = "\n";
    local $" = " | ";

    while ($line = <DATA>) {
        @line = unpack('A7 A8 A11 A7 A*', $line);
        @line = map m/^\s*(.*?)\s*$/, @line;
        print "@line";
    }


    __DATA__
    A      B       C          D      E
    d32    ab      ae99       WB     89
    d33    cd      e787       WC     78
    d34    ef                 WD
    d35    gh      ancjd      WT     100
    d36    ij                 WP

    ------------end of program ---------------

    OUTPUT:
    d32 | ab | ae99 | WB | 89
    d33 | cd | e787 | WC | 78
    d34 | ef | | WD |
    d35 | gh | ancjd | WT | 100
    d36 | ij | | WP |
    ------------end of output----------------

    I saw some of the other solutions, and all I could think was,
    "Wow, what a big program for such a small problem."
    Mumia W., Sep 8, 2006
    #8
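The template 'A7 A8 A11 A7 A*' above was worked out by hand. Where the header line is aligned with the data, it could also be derived from the header, as in this sketch. It assumes each header label starts in the first column of its field and that no value spills past its field; template_from_header is a hypothetical helper name.

```perl
use strict;
use warnings;

sub template_from_header {
    my ($header) = @_;

    # Record the start offset of every header label.
    my @starts;
    push @starts, $-[0] while $header =~ /\S+/g;

    # Each field runs up to the start of the next label; the last takes the rest.
    my @widths = map { $starts[$_ + 1] - $starts[$_] } 0 .. $#starts - 1;
    return join ' ', (map { "A$_" } @widths), 'A*';
}

# Header and one row built in the layout assumed throughout the thread.
my $header   = sprintf '%-7s%-8s%-11s%-7s%s', qw(A B C D E);
my $template = template_from_header($header);
print "$template\n";

my $row = sprintf '%-7s%-8s%-11s%-7s%s', qw(d35 gh ancjd WT 100);
print join(' | ', unpack($template, $row)), "\n";
```

This breaks down if values are right-aligned under their labels or are wider than the label spacing, so it is worth checking the derived template against a few real lines.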
  9. Mumia W. Guest

    On 09/08/2006 05:17 AM, Mumia W. wrote:
    > On 09/07/2006 01:56 PM, wrote:
    >> Hi all,
    >> I am writing a Perl script that should parse each line of a file
    >> [...]

    >
    > Greg, this is your lucky day,
    > [...]


    I don't think I can get it any closer to being a one-liner
    than this:

    #!/bin/sh
    echo '
    A      B       C          D      E
    d32    ab      ae99       WB     89
    d33    cd      e787       WC     78
    d34    ef                 WD
    d35    gh      ancjd      WT     100
    d36    ij                 WP
    ' | perl -nle '
    @line = unpack("A7 A8 A11 A7 A*",$_);
    print join(" | ", @line) if length($line[0]) > 1;
    '


    OUTPUT:
    d32 | ab | ae99 | WB | 89
    d33 | cd | e787 | WC | 78
    d34 | ef | | WD |
    d35 | gh | ancjd | WT | 100
    d36 | ij | | WP |
    Mumia W., Sep 8, 2006
    #9
  10. -berlin.de wrote:

    > if ( my @cols = /$re/ ) {
    > push @data, { map { $_ => shift @cols } qw( a b c d e) };
    > }


    TMTOWTDI. When optimising for clarity I prefer a slice over using map.
    Unfortunately this means one can't avoid naming the hash without
    the code getting real ugly:

    if ( my @cols = /$re/ ) {
    push @data, \my %record;
    @record{ qw( a b c d e) } = @cols;
    }
    Brian McCauley, Sep 8, 2006
    #10
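For comparison, the map form and the slice form can be put side by side. This is a sketch only: record_with_map and record_with_slice are hypothetical names, and the field values are taken from the sample data in the thread.

```perl
use strict;
use warnings;

my @names = qw( a b c d e );

# Variant 1: anonymous hash built with map; each key consumes one value.
sub record_with_map {
    my @cols = @_;
    my $record = { map { $_ => shift @cols } @names };
    return $record;
}

# Variant 2: named hash filled through a hash slice.
sub record_with_slice {
    my @cols = @_;
    my %record;
    @record{@names} = @cols;
    return \%record;
}

my @data;
push @data, record_with_map(qw( d32 ab ae99 WB 89 ));
push @data, record_with_slice(qw( d33 cd e787 WC 78 ));

for my $rec (@data) {
    print join(', ', map { "$_ => $rec->{$_}" } sort keys %$rec), "\n";
}
```

Both variants produce the same structure, so the choice is purely about readability.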
  11. -berlin.de Guest

    Brian McCauley <> wrote in comp.lang.perl.misc:
    >
    > -berlin.de wrote:
    >
    > > if ( my @cols = /$re/ ) {
    > > push @data, { map { $_ => shift @cols } qw( a b c d e) };
    > > }

    >
    > TMTOWTDI, when optomising for clarity I prefer a slice over using map.
    > Unfortunately this which means one can't avoid naming the hash without
    > the code getting real ugly:
    >
    > if ( my @cols = /$re/ ) {
    > push @data, \my %record;
    > @record{ qw( a b c d e) } = @cols;
    > }


    Yes, building the hash anonymously is a bit obscure. I was coming from

    @{ $data[ @data] }{ qw( a b c d e)} = @cols;

    which is probably what you mean by "real ugly".

    Anno
    -berlin.de, Sep 9, 2006
    #11