complex text file manipulation

oraustin · Jul 27, 2005

I've got a real puzzler here (well for me at least) - it's accounting
data that's provided as a text file with fixed with fields -
the problem is one invoice is split over multiple lines.
To view the following line-wrap needs to be off and the font spacing
needs to even.
------------------------------------------------------------------------------------------------------------------------------------
REG. REG VENDOR'S NR. VENDOR INVOICE INVOICE
INVOICE NET VAT % DESCR.
DATE sEQ. NR VENDOR'S NAME VAT REG. NR. DATE NR.
AMOUNT AMOUNT AMOUNT EXEMP.
------------------------------------------------------------------------------------------------------------------------------------
30/01/04 0000125 1149 00488160011 30/12/03 23091533
320.14 266.78 53.36 20.00
MARLENI & ROADSI P.A. 0000138784
30/01/04 0000126 1683 00777280157 31/12/03 8300030488
1,550.80 812.18 81.22 10.00
NERFFE' ITALIANA SPA 0000138785
547.83 109.57 20.00

1.00 0.00 0.00 ART 7
30/01/04 0000127 2098 02411210582 31/12/03 01138
68.00 61.82 6.18 10.00
CEGGI GEGGDO-CERBFTTO DE 0000138786

The eventual goal is to load this data into a database so I'd like to
convert this data to a CSV (or equivalent).
Each new invoice begins with dd/mm/yy, the name of the vendor occurs on
the next row, I need to sum the net amount column for each
invoice (the end of which is marked by the occurence of a new line
begining with a date)

My question is qhat tool/s should I be using to crack th is one and
could anyone help me get going please.
I know it's a lot to ask. But believe me I'm going to be do this stuff
day in day out so it won't be long before I can contribute
to the newsgroup as well as request.
Thanks Oliver

it_says_BALLS_on_your forehead · Jul 27, 2005

Hey Oliver,

I'm still not completely clear on what you're trying to do, or the
exact format of your data, but it seems that regular expressions are
what you need.

Paul Lalli · Jul 27, 2005

I've got a real puzzler here (well for me at least) - it's accounting
data that's provided as a text file with fixed with fields -
the problem is one invoice is split over multiple lines.

The eventual goal is to load this data into a database so I'd like to
convert this data to a CSV (or equivalent).
Each new invoice begins with dd/mm/yy, the name of the vendor occurs on
the next row, I need to sum the net amount column for each
invoice (the end of which is marked by the occurence of a new line
begining with a date)

My question is qhat tool/s should I be using to crack th is one and
could anyone help me get going please.

Please note that what follows is the first thing that came to mind. It
"works", but I am 100% convinced it's not the best way to go about it.
And removing the "header" record is left as an excercise to the reader

For more information on the tools used in this code:
perldoc -f open
perldoc -f readline
perldoc -f perlop

#!/usr/bin/perl
use strict;
use warnings;
use Data:

umper;

my $record = '';
my @records;
while (<DATA>){
chomp;
#if the start of a new record
if ($_ =~ m{^\d\d/\d\d/\d\d}){
if ($record){
$record =~ tr/ \t\n/,/s;
push @records, $record;
}
$record = $_;
} else {
$record .= $_;
}
}
#once more for the final record
$record =~ tr/ /,/s;
push @records, $record;

print Dumper(\@records);

__DATA__
------------------------------------------------------------------------------------------------------------------------------------
REG. REG VENDOR'S NR. VENDOR INVOICE INVOICE
INVOICE NET VAT % DESCR.
DATE sEQ. NR VENDOR'S NAME VAT REG. NR. DATE NR.
AMOUNT AMOUNT AMOUNT EXEMP.
------------------------------------------------------------------------------------------------------------------------------------
30/01/04 0000125 1149 00488160011 30/12/03 23091533
320.14 266.78 53.36 20.00
MARLENI & ROADSI P.A. 0000138784
30/01/04 0000126 1683 00777280157 31/12/03 8300030488
1,550.80 812.18 81.22 10.00
NERFFE' ITALIANA SPA 0000138785
547.83 109.57 20.00

1.00 0.00 0.00 ART 7
30/01/04 0000127 2098 02411210582 31/12/03 01138
68.00 61.82 6.18 10.00
CEGGI GEGGDO-CERBFTTO DE 0000138786

$VAR1 = [

'------------------------------------------------------------------------------------------------------------------------------------REG.,REG,VENDOR\'S,NR.,VENDOR,INVOICE,INVOICE,INVOICE,NET,VAT,%,DESCR.DATE,sEQ.,NR,VENDOR\'S,NAME,VAT,REG.,NR.,DATE,NR.,AMOUNT,AMOUNT,AMOUNT,EXEMP.------------------------------------------------------------------------------------------------------------------------------------',

'30/01/04,0000125,1149,00488160011,30/12/03,23091533,320.14,266.78,53.36,20.00,MARLENI,&,ROADSI,P.A.,0000138784',

'30/01/04,0000126,1683,00777280157,31/12/03,8300030488,1,550.80,812.18,81.22,10.00,NERFFE\',ITALIANA,SPA,0000138785,547.83,109.57,20.00,1.00,0.00,0.00,ART,7',

'30/01/04,0000127,2098,02411210582,31/12/03,01138,68.00,61.82,6.18,10.00,CEGGI,GEGGDO-CERBFTTO,DE,0000138786'
];

Paul Lalli

oraustin · Jul 27, 2005

it_says_BALLS_on_your forehead said:
Hey Oliver,

I'm still not completely clear on what you're trying to do, or the
exact format of your data, but it seems that regular expressions are
what you need.

I'm using google groups and when I cut and paste my data back into
notepad the formatting is all wrong. How should I present the data the
newsgroup? attach a file?
I hope when you can see the data properly - the task should be more
clear

oraustin · Jul 27, 2005

Please note that what follows is the first thing that came to mind. It
"works", but I am 100% convinced it's not the best way to go about it.
And removing the "header" record is left as an excercise to the reader

For more information on the tools used in this code:
perldoc -f open
perldoc -f readline
perldoc -f perlop

#!/usr/bin/perl
use strict;
use warnings;
use Data:umper;

my $record = '';
my @records;
while (<DATA>){
chomp;
#if the start of a new record
if ($_ =~ m{^\d\d/\d\d/\d\d}){
if ($record){
$record =~ tr/ \t\n/,/s;
push @records, $record;
}
$record = $_;
} else {
$record .= $_;
}
}
#once more for the final record
$record =~ tr/ /,/s;
push @records, $record;

print Dumper(\@records);

__DATA__
------------------------------------------------------------------------------------------------------------------------------------
REG. REG VENDOR'S NR. VENDOR INVOICE INVOICE
INVOICE NET VAT % DESCR.
DATE sEQ. NR VENDOR'S NAME VAT REG. NR. DATE NR.
AMOUNT AMOUNT AMOUNT EXEMP.
------------------------------------------------------------------------------------------------------------------------------------
30/01/04 0000125 1149 00488160011 30/12/03 23091533
320.14 266.78 53.36 20.00
MARLENI & ROADSI P.A. 0000138784
30/01/04 0000126 1683 00777280157 31/12/03 8300030488
1,550.80 812.18 81.22 10.00
NERFFE' ITALIANA SPA 0000138785
547.83 109.57 20.00

1.00 0.00 0.00 ART 7
30/01/04 0000127 2098 02411210582 31/12/03 01138
68.00 61.82 6.18 10.00
CEGGI GEGGDO-CERBFTTO DE 0000138786

$VAR1 = [

'------------------------------------------------------------------------------------------------------------------------------------REG.,REG,VENDOR\'S,NR.,VENDOR,INVOICE,INVOICE,INVOICE,NET,VAT,%,DESCR.DATE,sEQ.,NR,VENDOR\'S,NAME,VAT,REG.,NR.,DATE,NR.,AMOUNT,AMOUNT,AMOUNT,EXEMP.------------------------------------------------------------------------------------------------------------------------------------',

'30/01/04,0000125,1149,00488160011,30/12/03,23091533,320.14,266.78,53.36,20.00,MARLENI,&,ROADSI,P.A.,0000138784',

'30/01/04,0000126,1683,00777280157,31/12/03,8300030488,1,550.80,812.18,81.22,10.00,NERFFE\',ITALIANA,SPA,0000138785,547.83,109.57,20.00,1.00,0.00,0.00,ART,7',

'30/01/04,0000127,2098,02411210582,31/12/03,01138,68.00,61.82,6.18,10.00,CEGGI,GEGGDO-CERBFTTO,DE,0000138786'
];

Paul Lalli

Thanks for your input - my Perl knowledge is non-existent right now so
it's going to take a long time to figure out what your code does. I
see it doesn't sum the net amounts though. Were you able to view the
data in format that made sense?

Paul Lalli · Jul 27, 2005

Thanks for your input - my Perl knowledge is non-existent right now

http://learn.perl.org is a great place to start.

so
it's going to take a long time to figure out what your code does. I
see it doesn't sum the net amounts though.

Ah, I missed that requirement. Oops. Well, you have two options then.
Either scrap my way of doing it and actively parse out each needed
piece from each line (using regular expressions, most likely), or take
the output array of my code and parse that for the needed field. I'm
actually not sure which would be easier at this point.

Were you able to view the data in format that made sense?

Where "makes sense" means "could tell which figures went with which
labels", yes. What any of it actually means, I didn't really try to
determine.

Paul Lalli

Paul Lalli · Jul 27, 2005

I'm using google groups and when I cut and paste my data back into
notepad the formatting is all wrong.

Near the top of the message, click "Show options". Then click "Show
original". You should see your message without the line-wrapping.

How should I present the data the newsgroup? attach a file?

No no no. Please do not attach files if it can be at all avoided.

Paul Lalli

Harry · Jul 27, 2005

(e-mail address removed) wrote...

I'm using google groups and when I cut and paste my data back into
notepad the formatting is all wrong. How should I present the data the
newsgroup? attach a file?

Try label (part of) your data file with line number, like this.

cat -n mydata.txt

John W. Krahn · Jul 27, 2005

I've got a real puzzler here (well for me at least) - it's accounting
data that's provided as a text file with fixed with fields -
the problem is one invoice is split over multiple lines.
To view the following line-wrap needs to be off and the font spacing
needs to even.
------------------------------------------------------------------------------------------------------------------------------------
REG. REG VENDOR'S NR. VENDOR INVOICE INVOICE
INVOICE NET VAT % DESCR.
DATE sEQ. NR VENDOR'S NAME VAT REG. NR. DATE NR.
AMOUNT AMOUNT AMOUNT EXEMP.
------------------------------------------------------------------------------------------------------------------------------------
30/01/04 0000125 1149 00488160011 30/12/03 23091533
320.14 266.78 53.36 20.00
MARLENI & ROADSI P.A. 0000138784
30/01/04 0000126 1683 00777280157 31/12/03 8300030488
1,550.80 812.18 81.22 10.00
NERFFE' ITALIANA SPA 0000138785
547.83 109.57 20.00

1.00 0.00 0.00 ART 7
30/01/04 0000127 2098 02411210582 31/12/03 01138
68.00 61.82 6.18 10.00
CEGGI GEGGDO-CERBFTTO DE 0000138786

The eventual goal is to load this data into a database so I'd like to
convert this data to a CSV (or equivalent).
Each new invoice begins with dd/mm/yy, the name of the vendor occurs on
the next row, I need to sum the net amount column for each
invoice (the end of which is marked by the occurence of a new line
begining with a date)

My question is qhat tool/s should I be using to crack th is one and
could anyone help me get going please.
I know it's a lot to ask. But believe me I'm going to be do this stuff
day in day out so it won't be long before I can contribute
to the newsgroup as well as request.

It is hard to tell from the data but you would usually use unpack or
substr to extract fixed width data.

perldoc -f unpack
perldoc -f substr

John

William Park · Jul 27, 2005

In said:
I've got a real puzzler here (well for me at least) - it's accounting
data that's provided as a text file with fixed with fields -
the problem is one invoice is split over multiple lines.
To view the following line-wrap needs to be off and the font spacing
needs to even.
------------------------------------------------------------------------------------------------------------------------------------
REG. REG VENDOR'S NR. VENDOR INVOICE INVOICE
INVOICE NET VAT % DESCR.
DATE sEQ. NR VENDOR'S NAME VAT REG. NR. DATE NR.
AMOUNT AMOUNT AMOUNT EXEMP.
------------------------------------------------------------------------------------------------------------------------------------
30/01/04 0000125 1149 00488160011 30/12/03 23091533
320.14 266.78 53.36 20.00
MARLENI & ROADSI P.A. 0000138784
30/01/04 0000126 1683 00777280157 31/12/03 8300030488
1,550.80 812.18 81.22 10.00
NERFFE' ITALIANA SPA 0000138785
547.83 109.57 20.00

1.00 0.00 0.00 ART 7
30/01/04 0000127 2098 02411210582 31/12/03 01138
68.00 61.82 6.18 10.00
CEGGI GEGGDO-CERBFTTO DE 0000138786

The eventual goal is to load this data into a database so I'd like to
convert this data to a CSV (or equivalent).
Each new invoice begins with dd/mm/yy, the name of the vendor occurs on
the next row, I need to sum the net amount column for each
invoice (the end of which is marked by the occurence of a new line
begining with a date)

My question is qhat tool/s should I be using to crack th is one and
could anyone help me get going please.
I know it's a lot to ask. But believe me I'm going to be do this stuff
day in day out so it won't be long before I can contribute
to the newsgroup as well as request.
Thanks Oliver

NET AMOUNT is 8th field, so easy enough. But, complication is that
- you have empty line in the middle of 2nd record
- vendor's name has spaces
- number has comma (1,550.80)

First, I assume you want CSV that looks something like

30/01/04,..............,266.78,...,MARLENI & ROADSI P.A.,...
30/01/04,...,"1,550.80",812.18,...,NERFFE' ITALIANA SPA,...
30/01/04,..............,61.82,...,CEGGI GEGGDO-CERBFTTO DE,...

Assuming your input file is 'file.in', I would do

rm xx*
csplit -z file.in '/^[0-9][0-9]/' '{*}'
for i in xx*; do
out=()
while read -A; do
case $1 in
[A-Za-z]*) set -- "${*:1:$#-1}" "${!#}" ;;
esac
pp_append -a out "${@|.csvquote}"
done < $i
echo "${out[*]|,,}"
done > file.out

The output file 'file.out' is 3 lines, like

30/01/04,0000125,1149,00488160011,30/12/03,23091533,320.14,266.78,53.36,20.00,
MARLENI & ROADSI P.A.,0000138784
30/01/04,0000126,1683,00777280157,31/12/03,8300030488,"1,550.80",812.18,81.22,10.00,
NERFFE' ITALIANA SPA,0000138785,547.83,109.57,20.00,1.00,0.00,0.00,ART,7
30/01/04,0000127,2098,02411210582,31/12/03,01138,68.00,61.82,6.18,10.00,
CEGGI GEGGDO-CERBFTTO DE,0000138786

Then, you can extract 8th field and add them up,

while read -A -C; do
echo $8
done < file.out | awk '{sum += $1} END {print sum}'

Now, to run this on Window natively, you obviously need to compile and
install Bash and other Linux tools. <www.cygwin.com> is the only way I
know, but have no personal experience. Even if you can compile and
install Linux stuff on Windows, I would recommend setting up a Linux
box, and

- share a partition with Windows box. That way, Linux can do the
data processing, and Windows can use the result.

- put the script into /etc/inetd.conf. It works like web server.
You connect to a port in Linux box, and send your data. The
script will run, and spit out the result back to you.

- put the script into ~/.procmailrc. It works like spam filtering.
You send email to Linux box, and the script will send the result
back as email.

--
William Park <[email protected]>, Toronto, Canada
ThinFlash: Linux thin-client on USB key (flash) drive
http://home.eol.ca/~parkw/thinflash.html
BashDiff: Super Bash shell
http://freshmeat.net/projects/bashdiff/
--
William Park <[email protected]>, Toronto, Canada
ThinFlash: Linux thin-client on USB key (flash) drive
http://home.eol.ca/~parkw/thinflash.html
BashDiff: Super Bash shell
http://freshmeat.net/projects/bashdiff/

William Park · Jul 27, 2005

In said:
rm xx*
csplit -z file.in '/^[0-9][0-9]/' '{*}'
for i in xx*; do
out=()
while read -A; do
case $1 in
[A-Za-z]*) set -- "${*:1:$#-1}" "${!#}" ;;
esac
pp_append -a out "${@|.csvquote}"
done < $i
echo "${out[*]|,,}"
done > file.out

while read -A -C; do
echo $8
done < file.out | awk '{sum += $1} END {print sum}'

Forget to add references:

http://home.eol.ca/~parkw/index.html#stack_queue
http://home.eol.ca/~parkw/index.html#dos_line
http://home.eol.ca/~parkw/index.html#parameter_expansion

man csplit
help read pp_append '${'

--
William Park <[email protected]>, Toronto, Canada
ThinFlash: Linux thin-client on USB key (flash) drive
http://home.eol.ca/~parkw/thinflash.html
BashDiff: Super Bash shell
http://freshmeat.net/projects/bashdiff/

Ed Morton · Jul 28, 2005

I've got a real puzzler here (well for me at least) - it's accounting
data that's provided as a text file with fixed with fields -
the problem is one invoice is split over multiple lines.
To view the following line-wrap needs to be off and the font spacing
needs to even.
------------------------------------------------------------------------------------------------------------------------------------
REG. REG VENDOR'S NR. VENDOR INVOICE INVOICE
INVOICE NET VAT % DESCR.
DATE sEQ. NR VENDOR'S NAME VAT REG. NR. DATE NR.
AMOUNT AMOUNT AMOUNT EXEMP.
------------------------------------------------------------------------------------------------------------------------------------
30/01/04 0000125 1149 00488160011 30/12/03 23091533
320.14 266.78 53.36 20.00
MARLENI & ROADSI P.A. 0000138784
30/01/04 0000126 1683 00777280157 31/12/03 8300030488
1,550.80 812.18 81.22 10.00
NERFFE' ITALIANA SPA 0000138785
547.83 109.57 20.00

1.00 0.00 0.00 ART 7
30/01/04 0000127 2098 02411210582 31/12/03 01138
68.00 61.82 6.18 10.00
CEGGI GEGGDO-CERBFTTO DE 0000138786

The eventual goal is to load this data into a database so I'd like to
convert this data to a CSV (or equivalent).
Each new invoice begins with dd/mm/yy, the name of the vendor occurs on
the next row, I need to sum the net amount column for each
invoice (the end of which is marked by the occurence of a new line
begining with a date)

My question is qhat tool/s should I be using to crack th is one and
could anyone help me get going please.
I know it's a lot to ask. But believe me I'm going to be do this stuff
day in day out so it won't be long before I can contribute
to the newsgroup as well as request.
Thanks Oliver

Take a look at fixed width fields in the gawk man page. What you're
trying to do should be trivial with that. I can't be bothered trying to
figure out what your real data looks like or I'd post a script....
In future, if you have cumbersome data, it'd be best to post a smaller
sample that illustrates the problem.

Ed.

oraustin · Jul 28, 2005

Thanks for all help so far - sorry about problems with people viewing
the data - seems some can others can't. I'll look into gawk further
but I feel this kind of job is one for a fully featured language (Perl
I expect). The talk of linux boxes and multiple unix tools is a big
big big no no. Using a port of sed or gawk or perl is acceptable but
we only want to concentrate on one of these tools to do all our data
manipulation.
Seeing as time is against me I think I'll write it in Access VBA for
now and revist this afternoon.
Create a table to hold my data in Access then........
1.Read in a line at a time to find first line starting with a date.
2.Extract data using MID and populate temp variables (or an array)
3.Take next line extracting company name and second net amount if any
(adding it to first).
4.Keep looking down the lines addding net amount to running total until
find next line begining with a date.
5.INSERT data directly into table being careful to FORMAT date to
dd/mmm/yy so JET engine doesn't get confused.
6.Repeat until EOF

I am trying to wean the company off MS Access and VBA but I hate to
say it but the above sounds pretty simple and self contained and easy

( I know it won't be when the file manipulation is more complex.
PS. I need a good excuse to learn Perl

William Park · Jul 28, 2005

In said:
I expect). The talk of linux boxes and multiple unix tools is a big
big big no no. Using a port of sed or gawk or perl is acceptable but
we only want to concentrate on one of these tools to do all our data
manipulation.

If it has to be one of the 3, then I recommend 'awk'. It's easier to
learn, and it covers features of 'sed' to 'perl'.

Curious, though... for a company keenly aware of its bottom line, it
seems reluctant to improving their productivity by an order of
magnitude. There must be something else involved.

The only reason
why I mentioned setting up a Linux box is because it involves no
compile, install, or configuration on Window machines. Totally
'turnkey' solution.

--
William Park <[email protected]>, Toronto, Canada
ThinFlash: Linux thin-client on USB key (flash) drive
http://home.eol.ca/~parkw/thinflash.html
BashDiff: Super Bash shell
http://freshmeat.net/projects/bashdiff/

kevindotcar · Jul 28, 2005

I've got a real puzzler here (well for me at least) - it's accounting
data that's provided as a text file with fixed with fields -
the problem is one invoice is split over multiple lines.
To view the following line-wrap needs to be off and the font spacing
needs to even.

Hi Oliver-

Don't forget that Perl is a *practical report extraction* language-
i.e, it has it's roots in this exact exercise.

I'd advise you to look at perl formats.
Since your pasted text is all kinda wicky-wacky from what it is
supposed to be, it's hard to give you exact help.

but as an example; take a line you gave;
30/01/04 0000125 1149 00488160011 30/12/03 23091533

A format for reading is would be:

format FNAME =
@<<<<<<<< @<<<<<<<< @<<<< @<<<<<<<<<<<<<<< @<<<<<<<< @<<<<<<<<<<
$adate, $areg, $vendnum, $vat, $vdate, $invoice
..

It looks like your report is multi-line input, and perl report formats
handle that, too.

Hoping this helps,

kDot

Mothra · Jul 29, 2005

Hi Kdot,

Hi Oliver-

Don't forget that Perl is a *practical report extraction* language-

Hmn...
from perldoc:
perldoc perl
NAME
perl - Practical Extraction and Report Language

you have it backwords

I'd advise you to look at perl formats. (snipped)

A format for reading is would be: ^^^^^^

format FNAME =
@<<<<<<<< @<<<<<<<< @<<<< @<<<<<<<<<<<<<<< @<<<<<<<< @<<<<<<<<<<
$adate, $areg, $vendnum, $vat, $vdate, $invoice
.

Formats are used for generating reports, not parsing them. The Op
wanted a way to parse the report that he was receiving.
you might want to look at perldoc perlform for more information.

Mothra

William Park · Jul 29, 2005

In said:
Formats are used for generating reports, not parsing them. The Op
wanted a way to parse the report that he was receiving.
you might want to look at perldoc perlform for more information.

sscanf(3) would be helpful. But, my Bash extension can only do 9
variables at the moment.

Ref:
http://home.eol.ca/~parkw/index.html#sscanf

--
William Park <[email protected]>, Toronto, Canada
ThinFlash: Linux thin-client on USB key (flash) drive
http://home.eol.ca/~parkw/thinflash.html
BashDiff: Super Bash shell
http://freshmeat.net/projects/bashdiff/

Robert Bonomi · Jul 30, 2005

Hi Kdot,

Hmn...
from perldoc:
perldoc perl
NAME
perl - Practical Extraction and Report Language

you have it backwords

According to it's inventor, it was the Pedantically Eclectic Rubbish Lister,
for which a more "politically accpetable' expansion was then retro-fitted.

David H. Adler · Aug 3, 2005

According to it's inventor, it was the Pedantically Eclectic Rubbish Lister,
for which a more "politically accpetable' expansion was then retro-fitted.

Actually, that's a retronym as well. See perlfaq1: What's the difference
between "perl" and "Perl"?

dha

Complex calculation of averages	1	Feb 19, 2007
Extract same text from file.	6	Aug 16, 2004
Collect Excel Data from Website	5	Apr 30, 2022
extract text from log file using re	2	Sep 13, 2007
[TinyCC] sizeof of element of struct returned by function	18	Feb 20, 2013
How to use ufixed when it involves multiplication a number of times?(VHDL question)	0	Aug 22, 2016
Date Calutlations from a DropDownList	2	Mar 9, 2009
[union] Pointers to inherited structs are valid ?	104	Jan 1, 2013

complex text file manipulation

oraustin

it_says_BALLS_on_your forehead

Paul Lalli

oraustin

oraustin

Paul Lalli

Paul Lalli

Harry

John W. Krahn

William Park

William Park

Ed Morton

oraustin

William Park

kevindotcar

Mothra

William Park

Robert Bonomi

David H. Adler

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads