complex text file manipulation

O

oraustin

I've got a real puzzler here (well for me at least) - it's accounting
data that's provided as a text file with fixed with fields -
the problem is one invoice is split over multiple lines.
To view the following line-wrap needs to be off and the font spacing
needs to even.
------------------------------------------------------------------------------------------------------------------------------------
REG. REG VENDOR'S NR. VENDOR INVOICE INVOICE
INVOICE NET VAT % DESCR.
DATE sEQ. NR VENDOR'S NAME VAT REG. NR. DATE NR.
AMOUNT AMOUNT AMOUNT EXEMP.
------------------------------------------------------------------------------------------------------------------------------------
30/01/04 0000125 1149 00488160011 30/12/03 23091533
320.14 266.78 53.36 20.00
MARLENI & ROADSI P.A. 0000138784
30/01/04 0000126 1683 00777280157 31/12/03 8300030488
1,550.80 812.18 81.22 10.00
NERFFE' ITALIANA SPA 0000138785
547.83 109.57 20.00

1.00 0.00 0.00 ART 7
30/01/04 0000127 2098 02411210582 31/12/03 01138
68.00 61.82 6.18 10.00
CEGGI GEGGDO-CERBFTTO DE 0000138786

The eventual goal is to load this data into a database so I'd like to
convert this data to a CSV (or equivalent).
Each new invoice begins with dd/mm/yy, the name of the vendor occurs on
the next row, I need to sum the net amount column for each
invoice (the end of which is marked by the occurence of a new line
begining with a date)

My question is qhat tool/s should I be using to crack th is one and
could anyone help me get going please.
I know it's a lot to ask. But believe me I'm going to be do this stuff
day in day out so it won't be long before I can contribute
to the newsgroup as well as request.
Thanks Oliver
 
I

it_says_BALLS_on_your forehead

Hey Oliver,

I'm still not completely clear on what you're trying to do, or the
exact format of your data, but it seems that regular expressions are
what you need.
 
P

Paul Lalli

I've got a real puzzler here (well for me at least) - it's accounting
data that's provided as a text file with fixed with fields -
the problem is one invoice is split over multiple lines.
The eventual goal is to load this data into a database so I'd like to
convert this data to a CSV (or equivalent).
Each new invoice begins with dd/mm/yy, the name of the vendor occurs on
the next row, I need to sum the net amount column for each
invoice (the end of which is marked by the occurence of a new line
begining with a date)

My question is qhat tool/s should I be using to crack th is one and
could anyone help me get going please.

Please note that what follows is the first thing that came to mind. It
"works", but I am 100% convinced it's not the best way to go about it.
And removing the "header" record is left as an excercise to the reader

For more information on the tools used in this code:
perldoc -f open
perldoc -f readline
perldoc -f perlop

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

my $record = '';
my @records;
while (<DATA>){
chomp;
#if the start of a new record
if ($_ =~ m{^\d\d/\d\d/\d\d}){
if ($record){
$record =~ tr/ \t\n/,/s;
push @records, $record;
}
$record = $_;
} else {
$record .= $_;
}
}
#once more for the final record
$record =~ tr/ /,/s;
push @records, $record;

print Dumper(\@records);

__DATA__
------------------------------------------------------------------------------------------------------------------------------------
REG. REG VENDOR'S NR. VENDOR INVOICE INVOICE
INVOICE NET VAT % DESCR.
DATE sEQ. NR VENDOR'S NAME VAT REG. NR. DATE NR.
AMOUNT AMOUNT AMOUNT EXEMP.
------------------------------------------------------------------------------------------------------------------------------------
30/01/04 0000125 1149 00488160011 30/12/03 23091533
320.14 266.78 53.36 20.00
MARLENI & ROADSI P.A. 0000138784
30/01/04 0000126 1683 00777280157 31/12/03 8300030488
1,550.80 812.18 81.22 10.00
NERFFE' ITALIANA SPA 0000138785
547.83 109.57 20.00

1.00 0.00 0.00 ART 7
30/01/04 0000127 2098 02411210582 31/12/03 01138
68.00 61.82 6.18 10.00
CEGGI GEGGDO-CERBFTTO DE 0000138786





$VAR1 = [

'------------------------------------------------------------------------------------------------------------------------------------REG.,REG,VENDOR\'S,NR.,VENDOR,INVOICE,INVOICE,INVOICE,NET,VAT,%,DESCR.DATE,sEQ.,NR,VENDOR\'S,NAME,VAT,REG.,NR.,DATE,NR.,AMOUNT,AMOUNT,AMOUNT,EXEMP.------------------------------------------------------------------------------------------------------------------------------------',

'30/01/04,0000125,1149,00488160011,30/12/03,23091533,320.14,266.78,53.36,20.00,MARLENI,&,ROADSI,P.A.,0000138784',

'30/01/04,0000126,1683,00777280157,31/12/03,8300030488,1,550.80,812.18,81.22,10.00,NERFFE\',ITALIANA,SPA,0000138785,547.83,109.57,20.00,1.00,0.00,0.00,ART,7',

'30/01/04,0000127,2098,02411210582,31/12/03,01138,68.00,61.82,6.18,10.00,CEGGI,GEGGDO-CERBFTTO,DE,0000138786'
];



Paul Lalli
 
O

oraustin

it_says_BALLS_on_your forehead said:
Hey Oliver,

I'm still not completely clear on what you're trying to do, or the
exact format of your data, but it seems that regular expressions are
what you need.

I'm using google groups and when I cut and paste my data back into
notepad the formatting is all wrong. How should I present the data the
newsgroup? attach a file?
I hope when you can see the data properly - the task should be more
clear
 
O

oraustin

Please note that what follows is the first thing that came to mind. It
"works", but I am 100% convinced it's not the best way to go about it.
And removing the "header" record is left as an excercise to the reader

For more information on the tools used in this code:
perldoc -f open
perldoc -f readline
perldoc -f perlop

#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;

my $record = '';
my @records;
while (<DATA>){
chomp;
#if the start of a new record
if ($_ =~ m{^\d\d/\d\d/\d\d}){
if ($record){
$record =~ tr/ \t\n/,/s;
push @records, $record;
}
$record = $_;
} else {
$record .= $_;
}
}
#once more for the final record
$record =~ tr/ /,/s;
push @records, $record;

print Dumper(\@records);

__DATA__
------------------------------------------------------------------------------------------------------------------------------------
REG. REG VENDOR'S NR. VENDOR INVOICE INVOICE
INVOICE NET VAT % DESCR.
DATE sEQ. NR VENDOR'S NAME VAT REG. NR. DATE NR.
AMOUNT AMOUNT AMOUNT EXEMP.
------------------------------------------------------------------------------------------------------------------------------------
30/01/04 0000125 1149 00488160011 30/12/03 23091533
320.14 266.78 53.36 20.00
MARLENI & ROADSI P.A. 0000138784
30/01/04 0000126 1683 00777280157 31/12/03 8300030488
1,550.80 812.18 81.22 10.00
NERFFE' ITALIANA SPA 0000138785
547.83 109.57 20.00

1.00 0.00 0.00 ART 7
30/01/04 0000127 2098 02411210582 31/12/03 01138
68.00 61.82 6.18 10.00
CEGGI GEGGDO-CERBFTTO DE 0000138786





$VAR1 = [

'------------------------------------------------------------------------------------------------------------------------------------REG.,REG,VENDOR\'S,NR.,VENDOR,INVOICE,INVOICE,INVOICE,NET,VAT,%,DESCR.DATE,sEQ.,NR,VENDOR\'S,NAME,VAT,REG.,NR.,DATE,NR.,AMOUNT,AMOUNT,AMOUNT,EXEMP.------------------------------------------------------------------------------------------------------------------------------------',

'30/01/04,0000125,1149,00488160011,30/12/03,23091533,320.14,266.78,53.36,20.00,MARLENI,&,ROADSI,P.A.,0000138784',

'30/01/04,0000126,1683,00777280157,31/12/03,8300030488,1,550.80,812.18,81.22,10.00,NERFFE\',ITALIANA,SPA,0000138785,547.83,109.57,20.00,1.00,0.00,0.00,ART,7',

'30/01/04,0000127,2098,02411210582,31/12/03,01138,68.00,61.82,6.18,10.00,CEGGI,GEGGDO-CERBFTTO,DE,0000138786'
];



Paul Lalli

Thanks for your input - my Perl knowledge is non-existent right now so
it's going to take a long time to figure out what your code does. I
see it doesn't sum the net amounts though. Were you able to view the
data in format that made sense?
 
P

Paul Lalli

Thanks for your input - my Perl knowledge is non-existent right now

http://learn.perl.org is a great place to start.
so
it's going to take a long time to figure out what your code does. I
see it doesn't sum the net amounts though.

Ah, I missed that requirement. Oops. Well, you have two options then.
Either scrap my way of doing it and actively parse out each needed
piece from each line (using regular expressions, most likely), or take
the output array of my code and parse that for the needed field. I'm
actually not sure which would be easier at this point.
Were you able to view the data in format that made sense?

Where "makes sense" means "could tell which figures went with which
labels", yes. What any of it actually means, I didn't really try to
determine.

Paul Lalli
 
P

Paul Lalli

I'm using google groups and when I cut and paste my data back into
notepad the formatting is all wrong.

Near the top of the message, click "Show options". Then click "Show
original". You should see your message without the line-wrapping.
How should I present the data the newsgroup? attach a file?

No no no. Please do not attach files if it can be at all avoided.

Paul Lalli
 
H

Harry

(e-mail address removed) wrote...
I'm using google groups and when I cut and paste my data back into
notepad the formatting is all wrong. How should I present the data the
newsgroup? attach a file?

Try label (part of) your data file with line number, like this.

cat -n mydata.txt
 
J

John W. Krahn

I've got a real puzzler here (well for me at least) - it's accounting
data that's provided as a text file with fixed with fields -
the problem is one invoice is split over multiple lines.
To view the following line-wrap needs to be off and the font spacing
needs to even.
------------------------------------------------------------------------------------------------------------------------------------
REG. REG VENDOR'S NR. VENDOR INVOICE INVOICE
INVOICE NET VAT % DESCR.
DATE sEQ. NR VENDOR'S NAME VAT REG. NR. DATE NR.
AMOUNT AMOUNT AMOUNT EXEMP.
------------------------------------------------------------------------------------------------------------------------------------
30/01/04 0000125 1149 00488160011 30/12/03 23091533
320.14 266.78 53.36 20.00
MARLENI & ROADSI P.A. 0000138784
30/01/04 0000126 1683 00777280157 31/12/03 8300030488
1,550.80 812.18 81.22 10.00
NERFFE' ITALIANA SPA 0000138785
547.83 109.57 20.00

1.00 0.00 0.00 ART 7
30/01/04 0000127 2098 02411210582 31/12/03 01138
68.00 61.82 6.18 10.00
CEGGI GEGGDO-CERBFTTO DE 0000138786

The eventual goal is to load this data into a database so I'd like to
convert this data to a CSV (or equivalent).
Each new invoice begins with dd/mm/yy, the name of the vendor occurs on
the next row, I need to sum the net amount column for each
invoice (the end of which is marked by the occurence of a new line
begining with a date)

My question is qhat tool/s should I be using to crack th is one and
could anyone help me get going please.
I know it's a lot to ask. But believe me I'm going to be do this stuff
day in day out so it won't be long before I can contribute
to the newsgroup as well as request.

It is hard to tell from the data but you would usually use unpack or
substr to extract fixed width data.

perldoc -f unpack
perldoc -f substr



John
 
W

William Park

In said:
I've got a real puzzler here (well for me at least) - it's accounting
data that's provided as a text file with fixed with fields -
the problem is one invoice is split over multiple lines.
To view the following line-wrap needs to be off and the font spacing
needs to even.
------------------------------------------------------------------------------------------------------------------------------------
REG. REG VENDOR'S NR. VENDOR INVOICE INVOICE
INVOICE NET VAT % DESCR.
DATE sEQ. NR VENDOR'S NAME VAT REG. NR. DATE NR.
AMOUNT AMOUNT AMOUNT EXEMP.
------------------------------------------------------------------------------------------------------------------------------------
30/01/04 0000125 1149 00488160011 30/12/03 23091533
320.14 266.78 53.36 20.00
MARLENI & ROADSI P.A. 0000138784
30/01/04 0000126 1683 00777280157 31/12/03 8300030488
1,550.80 812.18 81.22 10.00
NERFFE' ITALIANA SPA 0000138785
547.83 109.57 20.00

1.00 0.00 0.00 ART 7
30/01/04 0000127 2098 02411210582 31/12/03 01138
68.00 61.82 6.18 10.00
CEGGI GEGGDO-CERBFTTO DE 0000138786

The eventual goal is to load this data into a database so I'd like to
convert this data to a CSV (or equivalent).
Each new invoice begins with dd/mm/yy, the name of the vendor occurs on
the next row, I need to sum the net amount column for each
invoice (the end of which is marked by the occurence of a new line
begining with a date)

My question is qhat tool/s should I be using to crack th is one and
could anyone help me get going please.
I know it's a lot to ask. But believe me I'm going to be do this stuff
day in day out so it won't be long before I can contribute
to the newsgroup as well as request.
Thanks Oliver


NET AMOUNT is 8th field, so easy enough. But, complication is that
- you have empty line in the middle of 2nd record
- vendor's name has spaces
- number has comma (1,550.80)

First, I assume you want CSV that looks something like

30/01/04,..............,266.78,...,MARLENI & ROADSI P.A.,...
30/01/04,...,"1,550.80",812.18,...,NERFFE' ITALIANA SPA,...
30/01/04,..............,61.82,...,CEGGI GEGGDO-CERBFTTO DE,...

Assuming your input file is 'file.in', I would do

rm xx*
csplit -z file.in '/^[0-9][0-9]/' '{*}'
for i in xx*; do
out=()
while read -A; do
case $1 in
[A-Za-z]*) set -- "${*:1:$#-1}" "${!#}" ;;
esac
pp_append -a out "${@|.csvquote}"
done < $i
echo "${out[*]|,,}"
done > file.out

The output file 'file.out' is 3 lines, like

30/01/04,0000125,1149,00488160011,30/12/03,23091533,320.14,266.78,53.36,20.00,
MARLENI & ROADSI P.A.,0000138784
30/01/04,0000126,1683,00777280157,31/12/03,8300030488,"1,550.80",812.18,81.22,10.00,
NERFFE' ITALIANA SPA,0000138785,547.83,109.57,20.00,1.00,0.00,0.00,ART,7
30/01/04,0000127,2098,02411210582,31/12/03,01138,68.00,61.82,6.18,10.00,
CEGGI GEGGDO-CERBFTTO DE,0000138786

Then, you can extract 8th field and add them up,

while read -A -C; do
echo $8
done < file.out | awk '{sum += $1} END {print sum}'


Now, to run this on Window natively, you obviously need to compile and
install Bash and other Linux tools. <www.cygwin.com> is the only way I
know, but have no personal experience. Even if you can compile and
install Linux stuff on Windows, I would recommend setting up a Linux
box, and

- share a partition with Windows box. That way, Linux can do the
data processing, and Windows can use the result.

- put the script into /etc/inetd.conf. It works like web server.
You connect to a port in Linux box, and send your data. The
script will run, and spit out the result back to you.

- put the script into ~/.procmailrc. It works like spam filtering.
You send email to Linux box, and the script will send the result
back as email.

--
William Park <[email protected]>, Toronto, Canada
ThinFlash: Linux thin-client on USB key (flash) drive
http://home.eol.ca/~parkw/thinflash.html
BashDiff: Super Bash shell
http://freshmeat.net/projects/bashdiff/
--
William Park <[email protected]>, Toronto, Canada
ThinFlash: Linux thin-client on USB key (flash) drive
http://home.eol.ca/~parkw/thinflash.html
BashDiff: Super Bash shell
http://freshmeat.net/projects/bashdiff/
 
W

William Park

In said:
rm xx*
csplit -z file.in '/^[0-9][0-9]/' '{*}'
for i in xx*; do
out=()
while read -A; do
case $1 in
[A-Za-z]*) set -- "${*:1:$#-1}" "${!#}" ;;
esac
pp_append -a out "${@|.csvquote}"
done < $i
echo "${out[*]|,,}"
done > file.out
while read -A -C; do
echo $8
done < file.out | awk '{sum += $1} END {print sum}'

Forget to add references:

http://home.eol.ca/~parkw/index.html#stack_queue
http://home.eol.ca/~parkw/index.html#dos_line
http://home.eol.ca/~parkw/index.html#parameter_expansion

man csplit
help read pp_append '${'

--
William Park <[email protected]>, Toronto, Canada
ThinFlash: Linux thin-client on USB key (flash) drive
http://home.eol.ca/~parkw/thinflash.html
BashDiff: Super Bash shell
http://freshmeat.net/projects/bashdiff/
 
E

Ed Morton

I've got a real puzzler here (well for me at least) - it's accounting
data that's provided as a text file with fixed with fields -
the problem is one invoice is split over multiple lines.
To view the following line-wrap needs to be off and the font spacing
needs to even.
------------------------------------------------------------------------------------------------------------------------------------
REG. REG VENDOR'S NR. VENDOR INVOICE INVOICE
INVOICE NET VAT % DESCR.
DATE sEQ. NR VENDOR'S NAME VAT REG. NR. DATE NR.
AMOUNT AMOUNT AMOUNT EXEMP.
------------------------------------------------------------------------------------------------------------------------------------
30/01/04 0000125 1149 00488160011 30/12/03 23091533
320.14 266.78 53.36 20.00
MARLENI & ROADSI P.A. 0000138784
30/01/04 0000126 1683 00777280157 31/12/03 8300030488
1,550.80 812.18 81.22 10.00
NERFFE' ITALIANA SPA 0000138785
547.83 109.57 20.00

1.00 0.00 0.00 ART 7
30/01/04 0000127 2098 02411210582 31/12/03 01138
68.00 61.82 6.18 10.00
CEGGI GEGGDO-CERBFTTO DE 0000138786

The eventual goal is to load this data into a database so I'd like to
convert this data to a CSV (or equivalent).
Each new invoice begins with dd/mm/yy, the name of the vendor occurs on
the next row, I need to sum the net amount column for each
invoice (the end of which is marked by the occurence of a new line
begining with a date)

My question is qhat tool/s should I be using to crack th is one and
could anyone help me get going please.
I know it's a lot to ask. But believe me I'm going to be do this stuff
day in day out so it won't be long before I can contribute
to the newsgroup as well as request.
Thanks Oliver

Take a look at fixed width fields in the gawk man page. What you're
trying to do should be trivial with that. I can't be bothered trying to
figure out what your real data looks like or I'd post a script....
In future, if you have cumbersome data, it'd be best to post a smaller
sample that illustrates the problem.

Ed.
 
O

oraustin

Thanks for all help so far - sorry about problems with people viewing
the data - seems some can others can't. I'll look into gawk further
but I feel this kind of job is one for a fully featured language (Perl
I expect). The talk of linux boxes and multiple unix tools is a big
big big no no. Using a port of sed or gawk or perl is acceptable but
we only want to concentrate on one of these tools to do all our data
manipulation.
Seeing as time is against me I think I'll write it in Access VBA for
now and revist this afternoon.
Create a table to hold my data in Access then........
1.Read in a line at a time to find first line starting with a date.
2.Extract data using MID and populate temp variables (or an array)
3.Take next line extracting company name and second net amount if any
(adding it to first).
4.Keep looking down the lines addding net amount to running total until
find next line begining with a date.
5.INSERT data directly into table being careful to FORMAT date to
dd/mmm/yy so JET engine doesn't get confused.
6.Repeat until EOF

I am trying to wean the company off MS Access and VBA but I hate to
say it but the above sounds pretty simple and self contained and easy
:eek:( I know it won't be when the file manipulation is more complex.
PS. I need a good excuse to learn Perl :)
 
W

William Park

In said:
I expect). The talk of linux boxes and multiple unix tools is a big
big big no no. Using a port of sed or gawk or perl is acceptable but
we only want to concentrate on one of these tools to do all our data
manipulation.

If it has to be one of the 3, then I recommend 'awk'. It's easier to
learn, and it covers features of 'sed' to 'perl'.

Curious, though... for a company keenly aware of its bottom line, it
seems reluctant to improving their productivity by an order of
magnitude. There must be something else involved. :) The only reason
why I mentioned setting up a Linux box is because it involves no
compile, install, or configuration on Window machines. Totally
'turnkey' solution.

--
William Park <[email protected]>, Toronto, Canada
ThinFlash: Linux thin-client on USB key (flash) drive
http://home.eol.ca/~parkw/thinflash.html
BashDiff: Super Bash shell
http://freshmeat.net/projects/bashdiff/
 
K

kevindotcar

I've got a real puzzler here (well for me at least) - it's accounting
data that's provided as a text file with fixed with fields -
the problem is one invoice is split over multiple lines.
To view the following line-wrap needs to be off and the font spacing
needs to even.

Hi Oliver-

Don't forget that Perl is a *practical report extraction* language-
i.e, it has it's roots in this exact exercise.

I'd advise you to look at perl formats.
Since your pasted text is all kinda wicky-wacky from what it is
supposed to be, it's hard to give you exact help.

but as an example; take a line you gave;
30/01/04 0000125 1149 00488160011 30/12/03 23091533

A format for reading is would be:

format FNAME =
@<<<<<<<< @<<<<<<<< @<<<< @<<<<<<<<<<<<<<< @<<<<<<<< @<<<<<<<<<<
$adate, $areg, $vendnum, $vat, $vdate, $invoice
..


It looks like your report is multi-line input, and perl report formats
handle that, too.

Hoping this helps,

kDot
 
M

Mothra

Hi Kdot,

Hi Oliver-

Don't forget that Perl is a *practical report extraction* language-

Hmn...
from perldoc:
perldoc perl
NAME
perl - Practical Extraction and Report Language

you have it backwords
I'd advise you to look at perl formats. (snipped)

A format for reading is would be: ^^^^^^

format FNAME =
@<<<<<<<< @<<<<<<<< @<<<< @<<<<<<<<<<<<<<< @<<<<<<<< @<<<<<<<<<<
$adate, $areg, $vendnum, $vat, $vdate, $invoice
.

Formats are used for generating reports, not parsing them. The Op
wanted a way to parse the report that he was receiving.
you might want to look at perldoc perlform for more information.

Mothra
 
W

William Park

In said:
Formats are used for generating reports, not parsing them. The Op
wanted a way to parse the report that he was receiving.
you might want to look at perldoc perlform for more information.

sscanf(3) would be helpful. But, my Bash extension can only do 9
variables at the moment.

Ref:
http://home.eol.ca/~parkw/index.html#sscanf

--
William Park <[email protected]>, Toronto, Canada
ThinFlash: Linux thin-client on USB key (flash) drive
http://home.eol.ca/~parkw/thinflash.html
BashDiff: Super Bash shell
http://freshmeat.net/projects/bashdiff/
 
R

Robert Bonomi

Hi Kdot,



Hmn...
from perldoc:
perldoc perl
NAME
perl - Practical Extraction and Report Language

you have it backwords

According to it's inventor, it was the Pedantically Eclectic Rubbish Lister,
for which a more "politically accpetable' expansion was then retro-fitted.
 
D

David H. Adler

According to it's inventor, it was the Pedantically Eclectic Rubbish Lister,
for which a more "politically accpetable' expansion was then retro-fitted.

Actually, that's a retronym as well. See perlfaq1: What's the difference
between "perl" and "Perl"?

dha
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,755
Messages
2,569,536
Members
45,007
Latest member
obedient dusk

Latest Threads

Top