Non-uniform split

thisismyidentity · Sep 7, 2006

Hi all,
I am writing a Perl script that should parse each line of a file (which
unfortunately I can't modify) and split the line. The main problem is
that the lines (nearly 10000 of them) are not uniform, so there doesn't
seem to be a pattern or a delimiter on which I can simply split every
line in a loop.

Here is an example:
========================
A      B       C          D      E
d32    ab      ae99       WB     89
d33    cd      e787       WC     78
d34    ef                 WD
d35    gh      ancjd      WT     100
d36    ij                 WP
..
..
========================

My main intention is to extract the values in columns A, B, C, ... into an
array, but since some values may be missing on some lines, I am unable to
find a single regex on which I can split all lines in a loop. I tried the
(obvious) \s+ regex for splitting, but since the empty columns contain only
spaces, a given column ends up at a different index on different lines. I am
especially interested in two columns that are guaranteed to be non-empty
on every line (like A, B, D), but because of the other empty columns I can't
get them at a fixed index of the array returned by split().

Please give suggestions for the following:

What regex could I use that would solve my problem?

Is there any other way apart from split by which I could achieve this
(assuming that there is no single regex to split on)?

Any possible way is fine (as long as I can loop, since the number of lines is huge).

Thanks.
Greg

it_says_BALLS_on_your forehead · Sep 7, 2006

thisismyidentity said:
Here is an example:
========================
A      B       C          D      E
d32    ab      ae99       WB     89
d33    cd      e787       WC     78
d34    ef                 WD
d35    gh      ancjd      WT     100
d36    ij                 WP
.
.
========================

If we look at the row where the value of column A is 'd34', how do you
know that 'WD' is the value of column D rather than column C (assuming
that it actually is the value of column D)? If this were tab-delimited,
is there an empty string for the value of column C? Or perhaps this is
fixed width?
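
If the file does turn out to be tab-delimited, an empty column shows up as two adjacent tabs, and splitting on a single tab keeps every column at a fixed index. A minimal sketch under that assumption (the sample as posted does not show any tabs); split_tabbed is a hypothetical helper name, not something from the thread:

```perl
use strict;
use warnings;

# Hypothetical helper: split one tab-delimited line into its fields.
# Assumes exactly one tab between columns, so an empty column
# becomes an empty string instead of disappearing.
sub split_tabbed {
    my ($line) = @_;
    chomp $line;
    # A limit of -1 keeps trailing empty fields (e.g. a missing column E).
    return split /\t/, $line, -1;
}

my @fields = split_tabbed("d34\tef\t\tWD\t\n");
print join('|', map { "[$_]" } @fields), "\n";
```

Splitting on /\t/ rather than /\s+/ is the point here: \s+ swallows the run of whitespace that marks the empty column, a single-tab split does not.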

Ilya Zakharevich · Sep 7, 2006

[A complimentary Cc of this posting was sent to Christian Winter]

This looks like width-encoded: all fields occupy certain columns. Use
unpack "A[number] ..." to break it into parts, then strip extra whitespace.

If, e.g., the boundary between D and E is not column-based, but other
boundaries are, do the same, but extract "D + E" pair first; THEN use
regexp approach to split D and E.
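
The two-step idea above could be sketched roughly as follows. The widths 7, 8 and 11 for columns A, B and C are assumptions made for illustration, not values taken from the real file, and parse_line is a hypothetical helper name:

```perl
use strict;
use warnings;

# Assumed layout: A occupies 7 columns, B 8, C 11; D and E are taken
# together as the remainder of the line and separated afterwards.
sub parse_line {
    my ($line) = @_;
    chomp $line;
    # The 'A' format strips trailing spaces from each extracted field.
    my ($col_a, $col_b, $col_c, $rest) = unpack 'A7 A8 A11 A*', $line;
    # Split the "D + E" remainder on whitespace; either part may be absent.
    my ($col_d, $col_e) = split ' ', $rest;
    for ($col_d, $col_e) { $_ = '' unless defined $_ }
    return ($col_a, $col_b, $col_c, $col_d, $col_e);
}

my @fields = parse_line("d34    ef                 WD");
print join(' | ', @fields), "\n";
```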

Hope this helps,
Ilya

Mumia W. · Sep 7, 2006

You can either use unpack() to extract ranges of bytes from
the string, or you can use a regex that uses character
quantifiers {}, e.g.

my @fields = unpack('A6 A6 A8 A4 A*', $string);

OR

my @fields = $string =~ m/^(.{6})(.{6})(.{8})(.{4})(.*)/;

WARNING: UNTESTED CODE
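
The widths in both snippets are hard-coded guesses. One way to avoid guessing, assuming the header row is aligned with the data and its first label starts in column 0, is to derive the unpack template from where each header label begins. This is only a sketch; template_from_header is a hypothetical helper name:

```perl
use strict;
use warnings;

# Hypothetical helper: build an unpack template from an aligned header line.
# Assumes every column starts exactly where its header label starts.
sub template_from_header {
    my ($header) = @_;
    chomp $header;
    my @starts;
    # Record the offset at which each run of non-blank characters begins.
    while ($header =~ /\S+/g) {
        push @starts, $-[0];
    }
    # The width of a column is the distance to the next column's start.
    my @widths = map { $starts[$_ + 1] - $starts[$_] } 0 .. $#starts - 1;
    # The last column takes whatever remains on the line.
    return join ' ', (map { "A$_" } @widths), 'A*';
}

my $template = template_from_header('A      B       C          D      E');
print "$template\n";
my @fields = unpack $template, 'd34    ef                 WD';
print join(' | ', @fields), "\n";
```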

Ted Zlatanov · Sep 7, 2006

Here is an example:
========================
A      B       C          D      E
d32    ab      ae99       WB     89
d33    cd      e787       WC     78
d34    ef                 WD
d35    gh      ancjd      WT     100
d36    ij                 WP
.
.
========================

What regex could I use which wud solve my problem?

Is there any other way apart from split by which i cud achieve this
(assuming that there is no single regex to spit on) ?

Any possible way (as far as I can loop..since no of lines is huge)

10K lines is not big at all. Anyhow.

Your essential problem is that you don't have consistent data. How
can Perl or anyone else know that lines 3 and 5 in your data are
missing the C column, for example? I'm guessing they don't have the C
column because "WD" and "WP" look like they belong in the D column,
and others have guessed that also, but it doesn't mean we're right.
If you can "anchor" WD and WP for us, promising that anything that
begins with W and has just two uppercase letters is in the D column,
the problem is easy to solve.
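
If that promise holds, a regex anchored on the D column might look like the sketch below. It assumes, based only on the sample, that D is always "W" plus one more uppercase letter and that A and B are always present; parse_anchored is a hypothetical helper name:

```perl
use strict;
use warnings;

# Assumption (from the sample only): column D is always "W" followed by
# one uppercase letter, and columns A and B are never empty.
my $row_re = qr/
    ^ \s* (\S+)          # A
    \s+   (\S+)          # B
    (?: \s+ (\S+) )?     # C, optional
    \s+   (W[A-Z])       # D, the anchor
    (?: \s+ (\S+) )?     # E, optional
    \s* $
/x;

# Returns five fields (missing ones as empty strings),
# or an empty list when the line does not fit the pattern.
sub parse_anchored {
    my ($line) = @_;
    my @cols = $line =~ $row_re or return;
    return map { defined $_ ? $_ : '' } @cols;
}

for my $line ('d32 ab ae99 WB 89', 'd34 ef WD') {
    print join(' | ', parse_anchored($line)), "\n";
}
```

This version does not depend on column positions at all, so it also works on whitespace-collapsed input, but it is only as reliable as the assumption about what a D value looks like.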

Finally, are there tab characters in the data? There aren't any in
your example, but it's possible they are your delimiters and didn't
come through the Usenet post.

Ted

Dark · Sep 7, 2006

Is there any other way apart from split by which i cud achieve this
(assuming that there is no single regex to spit on) ?

Any possible way (as far as I can loop..since no of lines is huge)

Thanks.
Greg

If you really want to use a regex, here is something primitive that
might get the job done (it fills a hash and prints it, keeping track of
line numbers and columns). I'd probably just use unpack.

-I

$data = <<HERE
A      B       C          D      E
d32    ab      ae99       WB     89
d33    cd      e787       WC     78
d34    ef                 WD
d35    gh      ancjd      WT     100
d36    ij                 WP

HERE
;
@lines = split("\n", $data);
my %data;
my $counter;
for ($counter=0;$counter<=$#lines;$counter++) {
$line = $lines[$counter];
$_ = $line;
/([0-9\sa-zA-Z]{0,7})([0-9\sa-zA-Z]{0,8})([0-9\sa-zA-Z]{0,11})([0-9\sa-zA-Z]{0,7})([0-9\sa-zA-Z]{0,7})/;
if ($1) {
$data{$counter}{'a'} = $1;
$data{$counter}{'b'} = $2;
$data{$counter}{'c'} = $3;
$data{$counter}{'d'} = $4;
$data{$counter}{'e'} = $5;
}
}

#Print out the data in the hash
for ($counter=0;$counter<=$#lines;$counter++) {
my @cols;
($cols[0], $cols[1], $cols[2], $cols[3], $cols[4]) =
('a','b','c','d','e');
for ($incount=0;$incount<=$#cols;$incount++) {
print "Line $counter column
$cols[$incount]=\"$data{$counter}{$cols[$incount]}\"\n";
}

}

anno4000 · Sep 8, 2006

Dark said:
If you really want to use a regex, here is something primitive that
might get the job done (it fills a hash and prints it, keeping track of
line numbers and columns). I'd probably just use unpack.

-I

Hmm... Your code is not strict-safe and produces a lot of warnings
when those are switched on. The indentation is random. When run,
it outputs 60 lines, beginning

Line 0 column
a=""
Line 0 column
b=""
Line 0 column
c=""
Line 0 column
d=""
Line 0 column
e=""
Line 1 column
a=""
Line 1 column
....

Is that what it is supposed to do?

$data = <<HERE

Semicolon missing after that statement.

A      B       C          D      E
d32    ab      ae99       WB     89
d33    cd      e787       WC     78
d34    ef                 WD
d35    gh      ancjd      WT     100
d36    ij                 WP

HERE
;

Misplaced semicolon.

@lines = split("\n", $data);
my %data;

The keys in %data are the values of $counter below, so essentially the
input line numbers. That kind of data is better kept in an array. Make
that

my @data;

my $counter;
for ($counter=0;$counter<=$#lines;$counter++) {
$line = $lines[$counter];
$_ = $line;

All this data-shuffling is unnecessary. Replace it with

for ( split /\n/, $data ) {

/([0-9\sa-zA-Z]{0,7})([0-9\sa-zA-Z]{0,8})([0-9\sa-zA-Z]{0,11})([0-9\sa-zA-Z]{0,7})([0-9\sa-zA-Z]{0,7})/;

This regex is too big to be placed in the code directly. Define a regex
variable outside the loop (my $re = qr/.../;) and use $re here:

/$re/;

I have not checked if the regex does indeed match what it needs to,
I'm assuming it does. However, it captures trailing blanks with each
field. In a complete solution these should be dropped.

if ($1) {

What if $1 happens to contain a false boolean value? Check the entire
match for success, not one haphazard match variable.
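
A small sketch of why this matters: a captured "0" is false in boolean context even though the match succeeded, while a list assignment of the captures is true whenever the match succeeds. The pattern and the name parse_pair are made up for this illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: returns the captures, or an empty list on failure.
sub parse_pair {
    my ($line) = @_;
    # A list assignment in boolean context yields the number of captures,
    # so it is true on any successful match, whatever the captured values.
    if ( my @cols = $line =~ /^(\d+)\s+(\w+)$/ ) {
        return @cols;
    }
    return;
}

# Here $1 would be "0", which "if ($1)" treats as false.
my @cols = parse_pair('0 ab');
print @cols ? "matched: @cols\n" : "no match\n";
```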

$data{$counter}{'a'} = $1;
$data{$counter}{'b'} = $2;
$data{$counter}{'c'} = $3;
$data{$counter}{'d'} = $4;
$data{$counter}{'e'} = $5;

Since @data is an array now, this must be written differently:

push @data, { a => $1, b => $2, c => $3, d => $4, e => $5};

I'd write the entire loop body like this:

if ( my @cols = /$re/ ) {
push @data, { map { $_ => shift @cols } qw( a b c d e) };
} else {
warn "invalid data";
}

}
}

The print loop below is also more roundabout than it has to be.

#Print out the data in the hash
for ($counter=0;$counter<=$#lines;$counter++) {
my @cols;
($cols[0], $cols[1], $cols[2], $cols[3], $cols[4]) =
('a','b','c','d','e');
for ($incount=0;$incount<=$#cols;$incount++) {
print "Line $counter column
$cols[$incount]=\"$data{$counter}{$cols[$incount]}\"\n";
}

}

That amounts to a re-write along these lines:

my $data = <<HERE;
A      B       C          D      E
d32    ab      ae99       WB     89
d33    cd      e787       WC     78
d34    ef                 WD
d35    gh      ancjd      WT     100
d36    ij                 WP
HERE

my $fc = '[0-9\sa-zA-Z]'; # a field character
my $re = qr/($fc{0,7})($fc{0,8})($fc{0,11})($fc{0,7})($fc{0,7})/;

my @recs;
for ( split /\n/, $data) {
if ( my @cols = /$re/ ) {
s/ +$// for @cols; # trim trailing blanks
@{ $recs[ @recs]}{ 'a' .. 'e'} = @cols;
}
}

for my $rec ( @recs ) {
print join( ', ', map "$_ => $rec->{ $_}", sort keys %$rec), "\n";
}

Anno

Mumia W. · Sep 8, 2006

Greg, this is your lucky day, because, even though you didn't
post any attempt of your own to solve this problem, people
(including me) are falling over themselves to write this
program for you. E.g.:

#!/usr/bin/perl

use strict;
use warnings;

my ($line, @line);
$line = <DATA>;

local $\ = "\n";
local $" = " | ";

while ($line = <DATA>) {
@line = unpack('A7 A8 A11 A7 A*', $line);
@line = map m/^\s*(.*?)\s*$/, @line;
print "@line";
}

__DATA__
A      B       C          D      E
d32    ab      ae99       WB     89
d33    cd      e787       WC     78
d34    ef                 WD
d35    gh      ancjd      WT     100
d36    ij                 WP

------------end of program ---------------

OUTPUT:
d32 | ab | ae99 | WB | 89
d33 | cd | e787 | WC | 78
d34 | ef | | WD |
d35 | gh | ancjd | WT | 100
d36 | ij | | WP |
------------end of output----------------

I saw some of the other solutions, and all I could think was,
"Wow, what a big program for such a small problem."

Mumia W. · Sep 8, 2006

Hi all,
I am writing a Perl script that should parse each line of a file
[...]

Greg, this is your lucky day,
[...]

I don't think I can get it any closer to being a one-liner
than this:

#!/bin/sh
echo '
A      B       C          D      E
d32    ab      ae99       WB     89
d33    cd      e787       WC     78
d34    ef                 WD
d35    gh      ancjd      WT     100
d36    ij                 WP
' | perl -nle '
@line = unpack("A7 A8 A11 A7 A*",$_);
print join(" | ", @line) if length($line[0]) > 1;
'

OUTPUT:
d32 | ab | ae99 | WB | 89
d33 | cd | e787 | WC | 78
d34 | ef | | WD |
d35 | gh | ancjd | WT | 100
d36 | ij | | WP |

Brian McCauley · Sep 8, 2006

if ( my @cols = /$re/ ) {
push @data, { map { $_ => shift @cols } qw( a b c d e) };
}

TMTOWTDI. When optimising for clarity I prefer a slice over using map.
Unfortunately this means one can't avoid naming the hash without
the code getting real ugly:

if ( my @cols = /$re/ ) {
push @data, \my %record;
@record{ qw( a b c d e) } = @cols;
}

anno4000 · Sep 9, 2006

Brian McCauley said:
TMTOWTDI. When optimising for clarity I prefer a slice over using map.
Unfortunately this means one can't avoid naming the hash without
the code getting real ugly:

if ( my @cols = /$re/ ) {
push @data, \my %record;
@record{ qw( a b c d e) } = @cols;
}

Yes, building the hash anonymously is a bit obscure. I was coming from

@{ $data[ @data] }{ qw( a b c d e)} = @cols;

which is probably what you mean by "real ugly".

Anno
