Non-uniform split

  • Thread starter thisismyidentity

thisismyidentity

Hi all,
I am writing a Perl script that should parse each line of a file (which
unfortunately I can't modify) and split the line. The main problem is
that the lines (nearly 10000 of them) are not uniform, so
there doesn't seem to be a pattern or a delimiter on which I can simply
split each line in a loop over all lines :(.
Here is an example:
========================
A      B       C          D      E
d32    ab      ae99       WB     89
d33    cd      e787       WC     78
d34    ef                 WD
d35    gh      ancjd      WT     100
d36    ij                 WP
..
..
========================

My main intention is to extract the values in columns A, B, C, ... into an
array, but since some values under a column may be missing on some lines,
I am unable to find a single regex on which I can split all
lines in a loop. I tried the (obvious) \s+ regex for splitting, but
since the empty columns are filled with spaces, a given column ends up
at a different index on different lines. I am especially interested in
two columns that are guaranteed to be non-empty on every line
(such as A, B or D), but because of the other empty columns I can't get
them at a fixed index of the array returned by split().

Please give suggestions for the following:

What regex could I use that would solve my problem?

Is there any other way apart from split by which I could achieve this
(assuming that there is no single regex to split on)?

Any approach will do, as long as I can run it in a loop (the number of lines is huge).

Thanks.
Greg
 
it_says_BALLS_on_your forehead

If we look at the row where the value of column A is 'd34', how do you
know that 'WD' is the value of column D rather than column C (assuming that
it actually is the value of column D)? If this were tab-delimited, is
there an empty string for the value of column C? Perhaps this is fixed
width?
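
One quick way to answer these questions is to make the whitespace visible. The following is only a minimal sketch: the sample lines in @lines and the classify helper are made up for illustration, and the real lines would of course come from the file.

```perl
use strict;
use warnings;

# Hypothetical sample lines; the last one shows what tabs would look like.
my @lines = (
    "d32    ab      ae99       WB     89",
    "d34    ef                 WD",
    "d36\tij\t\tWP\t",
);

# Report whether a line contains any tab character at all.
sub classify {
    my ($line) = @_;
    return $line =~ /\t/ ? 'tab-delimited' : 'space-padded';
}

for my $line (@lines) {
    # Make invisible characters visible: tabs become \t, blanks become dots.
    ( my $visible = $line ) =~ s/\t/\\t/g;
    $visible =~ tr/ /./;
    printf "%-14s %s\n", classify($line), $visible;
}
```

If the values line up at the same offsets on every line and no tabs show up, the file is most likely fixed width. If tabs do show up, split /\t/, $line, -1 keeps the empty fields in place.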
 
Ilya Zakharevich

[A complimentary Cc of this posting was sent to
Christian Winter

This looks like width-encoded: all fields occupy certain columns. Use
unpack "A[number] ..." to break it into parts, then strip extra whitespace.

If, e.g., the boundary between D and E is not column-based, but other
boundaries are, do the same, but extract "D + E" pair first; THEN use
regexp approach to split D and E.

Hope this helps,
Ilya
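
A rough sketch of this two-step idea, under assumptions that are not confirmed by the original poster: A, B and C are taken to be column-aligned with widths 7, 8 and 11, and the D/E boundary is treated as not aligned. The name parse_line is hypothetical.

```perl
use strict;
use warnings;

# Assumed layout (hypothetical): A, B and C are column-aligned with
# widths 7, 8 and 11; the boundary between D and E is not.
sub parse_line {
    my ($line) = @_;

    # Step 1: unpack the aligned fields; 'A' strips trailing blanks and
    # 'A*' takes whatever is left, i.e. the "D + E" pair.
    my ( $col_a, $col_b, $col_c, $rest ) = unpack 'A7 A8 A11 A*', $line;

    # Step 2: split the remainder with a regexp; E may be absent.
    my ( $col_d, $col_e ) = $rest =~ /^\s*(\S+)(?:\s+(\S+))?/;

    # Replace missing values with empty strings.
    return map { defined $_ ? $_ : '' }
        ( $col_a, $col_b, $col_c, $col_d, $col_e );
}

for my $line (
    'd32    ab      ae99       WB     89',
    'd34    ef                 WD',
  )
{
    print join( ' | ', parse_line($line) ), "\n";
}
```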
 
Mumia W.

You can either use unpack() to extract ranges of bytes from
the string, or you can use a regex that uses character
quantifiers {}, e.g.

my @fields = unpack('A6 A6 A8 A4 A*', $string);

OR

my @fields = $string =~ m/^(.{6})(.{6})(.{8})(.{4})(.*)/;

WARNING: UNTESTED CODE
 
Ted Zlatanov

10K lines is not big at all. Anyhow.

Your essential problem is that you don't have consistent data. How
can Perl or anyone else know that lines 3 and 5 in your data are
missing the C column, for example? I'm guessing they don't have the C
column because "WD" and "WP" look like they belong in the D column,
and others have guessed that also, but it doesn't mean we're right.
If you can "anchor" WD and WP for us, promising that anything that
begins with W and has just two uppercase letters is in the D column,
the problem is easy to solve.
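
If that promise holds, a single regex can use the D value as the anchor. The following is only a sketch under that assumption (which the original poster has not confirmed); it also assumes A and B are never empty, and $row_re and parse_row are hypothetical names.

```perl
use strict;
use warnings;

# Assumption: column D is always 'W' plus one more uppercase letter,
# and columns A and B are never empty.
my $row_re = qr/
    ^ \s* (\S+)            # A
    \s+   (\S+)            # B
    (?: \s+ (\S+) )??      # C, optional, skipped when possible
    \s+   (W[A-Z]) \b      # D, the anchor
    (?: \s+ (\S+) )?       # E, optional
    \s* $
/x;

# Returns the five fields (missing ones as ''), or an empty list
# when the line does not fit the assumed shape.
sub parse_row {
    my ($line) = @_;
    my @fields = $line =~ $row_re or return;
    return map { defined $_ ? $_ : '' } @fields;
}

for my $line ( 'd32 ab ae99 WB 89', 'd34 ef WD', 'A B C D E' ) {
    my @row = parse_row($line);
    print @row ? join( ' | ', @row ) : "skipped: $line", "\n";
}
```

The header line is skipped because its D value does not look like W plus an uppercase letter. A row whose C value itself looks like a D value stays ambiguous, which is exactly the consistency problem described above.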

Finally, are there tab characters in the data? There aren't any in
your example, but it's possible they are your delimiters and didn't
come through the Usenet post.

Ted
 
Dark

If you really want to use a regex, here is something primitive that
might get the job done (it fills a hash and prints it, keeping track of
line numbers and columns). I'd probably just use unpack.

-I


$data = <<HERE
A      B       C          D      E
d32    ab      ae99       WB     89
d33    cd      e787       WC     78
d34    ef                 WD
d35    gh      ancjd      WT     100
d36    ij                 WP

HERE
;
@lines = split("\n", $data);
my %data;
my $counter;
for ($counter=0;$counter<=$#lines;$counter++) {
$line = $lines[$counter];
$_ = $line;
/([0-9\sa-zA-Z]{0,7})([0-9\sa-zA-Z]{0,8})([0-9\sa-zA-Z]{0,11})([0-9\sa-zA-Z]{0,7})([0-9\sa-zA-Z]{0,7})/;
if ($1) {
$data{$counter}{'a'} = $1;
$data{$counter}{'b'} = $2;
$data{$counter}{'c'} = $3;
$data{$counter}{'d'} = $4;
$data{$counter}{'e'} = $5;
}
}

#Print out the data in the hash
for ($counter=0;$counter<=$#lines;$counter++) {
my @cols;
($cols[0], $cols[1], $cols[2], $cols[3], $cols[4]) =
('a','b','c','d','e');
for ($incount=0;$incount<=$#cols;$incount++) {
print "Line $counter column
$cols[$incount]=\"$data{$counter}{$cols[$incount]}\"\n";
}

}
 
anno4000

Hmm... Your code is not strict-safe and produces a lot of warnings
when those are switched on. The indentation is random. When run,
it outputs 60 lines, beginning

Line 0 column
a=""
Line 0 column
b=""
Line 0 column
c=""
Line 0 column
d=""
Line 0 column
e=""
Line 1 column
a=""
Line 1 column
....

Is that what it is supposed to do?
$data = <<HERE

Semicolon missing after that statement.
A      B       C          D      E
d32    ab      ae99       WB     89
d33    cd      e787       WC     78
d34    ef                 WD
d35    gh      ancjd      WT     100
d36    ij                 WP

HERE
;

Misplaced semicolon.
@lines = split("\n", $data);
my %data;

The keys in %data are the values of $counter below, so essentially the
input line numbers. That kind of data is better kept in an array. Make
that

my @data;
my $counter;
for ($counter=0;$counter<=$#lines;$counter++) {
$line = $lines[$counter];
$_ = $line;

All this data-shuffling is unnecessary. Replace it with

for ( split /\n/, $data ) {
/([0-9\sa-zA-Z]{0,7})([0-9\sa-zA-Z]{0,8})([0-9\sa-zA-Z]{0,11})([0-9\sa-zA-Z]{0,7})([0-9\sa-zA-Z]{0,7})/;

This regex is too big to be placed in the code directly. Define a regex
variable outside the loop (my $re = qr/.../;) and use $re here:

/$re/;

I have not checked if the regex does indeed match what it needs to,
I'm assuming it does. However, it captures trailing blanks with each
field. In a complete solution these should be dropped.
if ($1) {

What if $1 happens to contain a false boolean value? Check the entire
match for success, not one haphazard match variable.
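
To illustrate with a made-up pattern (not the one from this thread): a legitimate capture of "0" is false in boolean context, and a failed match does not reset $1, so it may still hold a value from an earlier successful match. Testing the result of the match itself avoids both traps. A minimal sketch, with $re and describe as hypothetical names:

```perl
use strict;
use warnings;

my $re = qr/^(\d+)\s+(\w+)/;    # hypothetical two-field pattern

# Test the match itself: in boolean context the list assignment yields
# the number of values returned by the match (0 when it fails).
sub describe {
    my ($line) = @_;
    if ( my ( $num, $word ) = $line =~ $re ) {
        return "matched: num=$num word=$word";
    }
    return "no match: $line";
}

# "0 zero" matches even though its first capture is a false value.
print describe($_), "\n" for '0 zero', '7 seven', 'no digits here';
```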
$data{$counter}{'a'} = $1;
$data{$counter}{'b'} = $2;
$data{$counter}{'c'} = $3;
$data{$counter}{'d'} = $4;
$data{$counter}{'e'} = $5;

Since @data is an array now, this must be written differently:

push @data, { a => $1, b => $2, c => $3, d => $4, e => $5};

I'd write the entire loop body like this:

if ( my @cols = /$re/ ) {
    push @data, { map { $_ => shift @cols } qw( a b c d e) };
} else {
    warn "invalid data";
}

The print loop below is also more roundabout than it has to be.
#Print out the data in the hash
for ($counter=0;$counter<=$#lines;$counter++) {
my @cols;
($cols[0], $cols[1], $cols[2], $cols[3], $cols[4]) =
('a','b','c','d','e');
for ($incount=0;$incount<=$#cols;$incount++) {
print "Line $counter column
$cols[$incount]=\"$data{$counter}{$cols[$incount]}\"\n";
}

}

That amounts to a re-write along these lines:

my $data = <<HERE;
A      B       C          D      E
d32    ab      ae99       WB     89
d33    cd      e787       WC     78
d34    ef                 WD
d35    gh      ancjd      WT     100
d36    ij                 WP
HERE

my $fc = '[0-9\sa-zA-Z]'; # a field character
my $re = qr/($fc{0,7})($fc{0,8})($fc{0,11})($fc{0,7})($fc{0,7})/;

my @recs;
for ( split /\n/, $data) {
    if ( my @cols = /$re/ ) {
        s/ +$// for @cols; # trim trailing blanks
        @{ $recs[ @recs]}{ 'a' .. 'e'} = @cols;
    }
}

for my $rec ( @recs ) {
    print join( ', ', map "$_ => $rec->{ $_}", sort keys %$rec), "\n";
}

Anno
 
Mumia W.

Greg, this is your lucky day, because, even though you didn't
post any attempt of your own to solve this problem, people
(including me) are falling over themselves to write this
program for you. E.g.:

#!/usr/bin/perl

use strict;
use warnings;

my ($line, @line);
$line = <DATA>;

local $\ = "\n";
local $" = " | ";

while ($line = <DATA>) {
    @line = unpack('A7 A8 A11 A7 A*', $line);
    @line = map m/^\s*(.*?)\s*$/, @line;
    print "@line";
}


__DATA__
A      B       C          D      E
d32    ab      ae99       WB     89
d33    cd      e787       WC     78
d34    ef                 WD
d35    gh      ancjd      WT     100
d36    ij                 WP

------------end of program ---------------

OUTPUT:
d32 | ab | ae99 | WB | 89
d33 | cd | e787 | WC | 78
d34 | ef | | WD |
d35 | gh | ancjd | WT | 100
d36 | ij | | WP |
------------end of output----------------

I saw some of the other solutions, and all I could think was,
"Wow, what a big program for such a small problem."
 
Mumia W.

Hi all,
I am writing a Perl script that should parse each line of a file
[...]

Greg, this is your lucky day,
[...]

I don't think I can get it any closer to being a one-liner
than this:

#!/bin/sh
echo '
A      B       C          D      E
d32    ab      ae99       WB     89
d33    cd      e787       WC     78
d34    ef                 WD
d35    gh      ancjd      WT     100
d36    ij                 WP
' | perl -nle '
    @line = unpack("A7 A8 A11 A7 A*",$_);
    print join(" | ", @line) if length($line[0]) > 1;
'


OUTPUT:
d32 | ab | ae99 | WB | 89
d33 | cd | e787 | WC | 78
d34 | ef | | WD |
d35 | gh | ancjd | WT | 100
d36 | ij | | WP |
 
Brian McCauley

if ( my @cols = /$re/ ) {
    push @data, { map { $_ => shift @cols } qw( a b c d e) };
}

TMTOWTDI. When optimising for clarity I prefer a slice over using map.
Unfortunately this means one can't avoid naming the hash without
the code getting real ugly:

if ( my @cols = /$re/ ) {
    push @data, \my %record;
    @record{ qw( a b c d e) } = @cols;
}
 
anno4000

Yes, building the hash anonymously is a bit obscure. I was coming from

@{ $data[ @data] }{ qw( a b c d e)} = @cols;

which is probably what you mean with "real ugly".
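
For comparison, here is a small sketch that runs the three idioms from this exchange side by side and shows that they build the same record. The @cols values are sample data, and the array names are made up for the illustration.

```perl
use strict;
use warnings;

my @names = qw( a b c d e );
my @cols  = ( 'd32', 'ab', 'ae99', 'WB', '89' );    # sample captures

my ( @by_map, @by_named, @by_autoviv );

# 1. map over the indexes, pairing each name with its value
push @by_map, { map { ( $names[$_] => $cols[$_] ) } 0 .. $#names };

# 2. push a reference to a fresh named hash, then fill it with a slice
{
    push @by_named, \my %record;
    @record{@names} = @cols;
}

# 3. slice an autovivified element one past the current end of the array
@{ $by_autoviv[@by_autoviv] }{@names} = @cols;

for my $rec ( $by_map[0], $by_named[0], $by_autoviv[0] ) {
    print join( ', ', map { "$_ => $rec->{$_}" } @names ), "\n";
}
```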

Anno
 
