Would like some words of wisdom - convert text files to CSV

GlenM

Okay;

I am sure that someone out there has done this before - I *think* I am
on the right track.

I have a directory full of emails. What I would like to do is read
each file in, then parse them into a CSV style file.

Example:

#!/usr/bin/perl

use warnings;
use strict;

open FILE , "/home/gmillard/SentMail/YourSatSetup.txt" or die $!;
my $linenum =1;

while (<FILE>) {
print "|", $linenum++;
print"$_" ;

}

Produces the following.

|1From - Sun Feb 21 11:40:01 2010
|2X-Mozilla-Status: 0001
|3X-Mozilla-Status2: 00000000
|4X-Gmail-Received: 58fa0ec68ca9975c1d187ceadc0ad3aeb1026134
|5Received: by 10.48.212.6 with HTTP; Fri, 17 Nov 2006 12:52:26 -0800
(PST)
|6Message-ID:
<[email protected]>
|7Date: Fri, 17 Nov 2006 15:52:26 -0500
|8From: "xxxxxxxxxxxxxxxxxxxxxxxx>
|9To: (e-mail address removed)
|10Subject: Your satellite set up. . From an article that i read.
|11MIME-Version: 1.0
|12Content-Type: text/plain; charset=ISO-8859-1; format=flowed
|13Content-Transfer-Encoding: 7bit
|14Content-Disposition: inline
|15Delivered-To: xxxxxxxxxxxxxxxxxxxxxx
|16
|17Hi Andrew;
|18I read an article about you a while back about your MythTV and VOip
|19setup. Would you mind if i asked you some tech questions ? I am
very
|20intrigued.
|21Thanks
|22Glen xxxxxxxxxx
|23xxxxxxxxxxxxx

I have hundreds of emails in this directory. I would like to parse
them into a single file where each comma separated/tab separated field
is a line from the email.

So, the first line of the CSV file is
|1From - Sun Feb 21 11:40:01 2010|2X-Mozilla-Status: 0001|3X-Mozilla-
Status2: 00000000|4X-Gmail-Received:
58fa0ec68ca9975c1d187ceadc0ad3aeb1026134
<truncated>

and each subsequent line is the next email and so forth.

Any words of wisdom?

Thanks much.

Glen
 
Uri Guttman

G> I have a directory full of emails. What I would like to do is read
G> each file in, then parse them into a CSV style file.

you need to be much clearer. is each mail file to be written out as a
single csv file? will the file names stay the same?


G> use warnings;
G> use strict;

good.

G> open FILE , "/home/gmillard/SentMail/YourSatSetup.txt" or die $!;
G> my $linenum =1;

be more consistent with spacing there.

my $linenum = 1;

G> while (<FILE>) {
G> print "|", $linenum++;
G> print"$_" ;

you don't need the quotes around $_ and it even can be an error in some
cases. don't unnecessarily quote scalar vars.
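
A small sketch of one case where the quotes do change behaviour (not from the original post; the variable names are made up for illustration). Quoting a plain string such as $_ is merely redundant, but quoting a reference turns it into ordinary text that can no longer be dereferenced:

```perl
use strict;
use warnings;

# A reference to an array of lines (hypothetical sample data).
my $lines = [ 'From: a', 'To: b' ];

my $copy   = $lines;      # still a reference to the same array
my $quoted = "$lines";    # now only a string that looks like ARRAY(0x...)

print 'copy is a ',   (ref($copy)   ? 'reference' : 'plain string'), "\n";
print 'quoted is a ', (ref($quoted) ? 'reference' : 'plain string'), "\n";

# The copy can still be dereferenced; the quoted version cannot
# (under "use strict" an attempt to do so dies at run time).
print scalar(@{$copy}), " lines reachable through the copy\n";
```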

also that prints to stdout. if you want to do this per file and keep the
results you need to open an output file. and you will need an outer loop
to scan all the files. will they all be in a directory? passed in on the
command line into @ARGV? you need to ask and answer these questions.
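
As a rough sketch of the outer loop and explicit output file described above, assuming the file names arrive in @ARGV with the output file named first (that argument order, the sub name copy_lines and the variable names are assumptions for illustration only):

```perl
use strict;
use warnings;

# Copy every line of every input file into one output file.
# $out_path and @in_paths are hypothetical names for this sketch.
sub copy_lines {
    my ($out_path, @in_paths) = @_;

    open my $out, '>', $out_path or die "Cannot write $out_path: $!";

    for my $path (@in_paths) {
        open my $in, '<', $path or die "Cannot read $path: $!";
        while (my $line = <$in>) {
            print {$out} $line;    # no quotes needed around the scalar
        }
        close $in;
    }

    close $out or die "Cannot close $out_path: $!";
    return;
}

# Usage: perl script.pl combined.txt mail1.txt mail2.txt ...
if (@ARGV >= 2) {
    my ($out_path, @in_paths) = @ARGV;
    copy_lines($out_path, @in_paths);
}
```

Lexical filehandles and three-argument open are used here instead of the bareword FILE handle; that is a style choice of this sketch, not something the original script requires.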

G> Produces the following.

G> |1From - Sun Feb 21 11:40:01 2010

that isn't a csv format or anything but what you have printed.

G> I have hundreds of emails in this directory. I would like to parse
G> them into a single file where each comma separated/tab separated field
G> is a line from the email.

you aren't doing any parsing. reading line by line isn't
parsing. splitting on lines is what it would be called.

G> So, the first line of the CSV file is
G> |1From - Sun Feb 21 11:40:01 2010|2X-Mozilla-Status: 0001|3X-Mozilla-
G> Status2: 00000000|4X-Gmail-Received:
G> 58fa0ec68ca9975c1d187ceadc0ad3aeb1026134
G> <truncated>

well, think about your current output. why does it put each field (line)
on its own line? i will let you answer that first and then you can
easily fix it.

G> and each subsequent line is the next email and so forth.
G> Any words of wisdom?

that is a very strange format and will make for extremely long csv lines
(not a problem but just odd). also you are putting the line number in
front of each line. why? you can count the fields (lines). what happens
if a text line in an email starts with a number? then it will be next to
your line number making it hard to parse out the line number. also your
format starts with | so it means there is a leading empty field in the
csv. not a big problem but something to be aware of.
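
To illustrate the points about the leading empty field and about field content colliding with the format, here is a minimal sketch of building one CSV row by hand. The helper names csv_field and csv_row are invented for this example, and the quoting rule (wrap each field in double quotes and double any embedded quotes) is the common spreadsheet convention rather than anything specified in this thread:

```perl
use strict;
use warnings;

# Quote one field: wrap it in double quotes and double any
# double-quote characters inside it.
sub csv_field {
    my ($text) = @_;
    $text =~ s/"/""/g;
    return qq{"$text"};
}

# join() puts the separator only between fields, so the row has no
# leading empty field, and the field count is simply scalar(@fields).
sub csv_row {
    my (@fields) = @_;
    return join ',', map { csv_field($_) } @fields;
}

my @mail_lines = ('Subject: Your satellite set up', '17 Nov 2006, "inline"');
print csv_row(@mail_lines), "\n";
```

For real data a CPAN module such as Text::CSV is probably the safer choice, since it also deals with embedded newlines and encodings.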

uri
 
Jürgen Exner

GlenM said:
> I am sure that someone out there has done this before - I *think* I am
> on the right track.
>
> I have a directory full of emails. What I would like to do is read
> each file in, then parse them into a CSV style file.
>
> Example:
>
> #!/usr/bin/perl
>
> use warnings;
> use strict;

Good, thank you.

> open FILE , "/home/gmillard/SentMail/YourSatSetup.txt" or die $!;
> my $linenum =1;

Perl already maintains a current input line counter for you, see $. in
'perldoc perlvar'.

Obviously you need an additional outer loop to loop through all the
files. To get the file names please see 'perldoc -f opendir' and
'perldoc -f readdir' and then just foreach(...){...} over those file names.

> while (<FILE>) {
> print "|", $linenum++;
> print"$_" ;

$_ already contains a string, therefore there is no need to stringify it
again. Actually there are situations where stringifying a variable
causes unintentional effects, therefore you should not do it unless you
want those effects. Please see "perldoc -q quoting".

> Produces the following.

Missing: how does this output fail to match your expectations, i.e. what
is wrong with it?

> |1From - Sun Feb 21 11:40:01 2010
> |2X-Mozilla-Status: 0001
> |3X-Mozilla-Status2: 00000000

I am guessing (but I may be totally wrong) one issue might be that you
want all those lines merged into one line? If that is the case then
please see "perldoc -f chomp".
If there are any other issues please let us know. My crystal ball is out for
repairs.

jue
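
The pieces mentioned in this reply ($., opendir/readdir and chomp) could be combined roughly as follows. This is only a sketch: the sub names mail_files and file_to_row are invented, the directory is assumed to be given as the first command-line argument, and the line numbers are kept only to mirror the output format shown in the question.

```perl
use strict;
use warnings;
use File::Spec;

# Return the plain files in a directory, sorted by name.
sub mail_files {
    my ($dir) = @_;
    opendir my $dh, $dir or die "Cannot open directory $dir: $!";
    # -f keeps regular files only, which also drops "." and "..".
    my @paths = grep { -f $_ } map { File::Spec->catfile($dir, $_) } readdir $dh;
    closedir $dh;
    @paths = sort @paths;
    return @paths;
}

# Turn one mail file into a single "|"-separated line.
sub file_to_row {
    my ($path) = @_;
    open my $in, '<', $path or die "Cannot read $path: $!";
    my @fields;
    while (my $line = <$in>) {
        chomp $line;                              # drop the trailing newline
        push @fields, sprintf '%d%s', $., $line;  # $. is the current line number
    }
    close $in;
    return join '|', @fields;
}

# Usage: perl script.pl /path/to/maildir > all_mail.txt
if (@ARGV) {
    print file_to_row($_), "\n" for mail_files($ARGV[0]);
}
```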
 
GlenM

Thanks for stopping the bullies - appreciate it!

Okay, First question:
> you need to be much clearer. is each mail file to be written out as a
> single csv file? will the file names stay the same?

well, it doesn't really matter, but if it was one big CSV file, it
would probably be too big. So, I can break them up into groups
manually. So, for now, one big CSV file for all emails.

Second:

> also that prints to stdout. if you want to do this per file and keep the
> results you need to open an output file. and you will need an outer loop
> to scan all the files. will they all be in a directory? passed in on the
> command line into @ARGV? you need to ask and answer these questions.

I can redirect the output to a file - yes, I see that it goes to
STDOUT. Not really looking for bells and whistles.
I will scan in every file in the directory - like I said previously,
it is a sheet-load of data so, I will split up the files.

Third:

> well, think about your current output. why does it put each field (line)
> on its own line? i will let you answer that first and then you can
> easily fix it.

I would like to have each field in a different 'column' or be
separated by a "|" or a "," (hence CSV). So, that is the next hurdle.

Fourth

> that is a very strange format and will make for extremely long csv lines
> (not a problem but just odd). also you are putting the line number in
> front of each line. why? you can count the fields (lines). what happens
> if a text line in an email starts with a number? then it will be next to
> your line number making it hard to parse out the line number. also your
> format starts with | so it means there is a leading empty field in the
> csv. not a big problem but something to be aware of.

Well, I just want to get all of the emails into a spreadsheet format,
then they are easier to work with. I can massage the data manually
afterward. Just want to get it into a CSV - maybe I can get fancy once
I get the fundamentals down.

Thank you for your response.

Glen
 
GlenM

  >> you need to be much clearer. is each mail file to be written out as a
  >> single csv file? will the file names stay the same?

  G> well, it doesn't really matter, but if it was one big CSV file, it
  G> would probably be too big. So, I can break them up into groups
  G> manually. So, for now, one big CSV file for all emails.

do you realize how wide this file will be? with some mails, it could be
10's of k wide. that is insane.

  G> Second:

  >> also that prints to stdout. if you want to do this per file and keep the
  >> results you need to open an output file. and you will need an outer loop
  >> to scan all the files. will they all be in a directory? passed in on the
  >> command line into @ARGV? you need to ask and answer these questions.

  G> I can redirect the output to a file - yes, I see that it goes to
  G> STDOUT. Not really looking for bells and whistles.
  G> I will scan in every file in the directory - like I said previously,
  G> it is a sheet-load of data so, I will split up the files.

redirecting stdout is fine but if you do want to split this up, it is
better to manage your output directly in perl.

  G> Third:

  >> well, think about your current output. why does it put each field (line)
  >> on its own line? i will let you answer that first and then you can
  >> easily fix it.

  G> I would like to have each field in a different 'column' or be
  G> separated by a "|" or a "," (hence CSV). So, that is the next hurdle.

you didn't answer the question. WHY does your current code put each line
on its own line instead of all the mail lines in one long csv line?
until you fix that (and others already told you how), you can't get what
you want.

  G> Fourth

  >> that is a very strange format and will make for extremely long csv lines
  >> (not a problem but just odd). also you are putting the line number in
  >> front of each line. why? you can count the fields (lines). what happens
  >> if a text line in an email starts with a number? then it will be next to
  >> your line number making it hard to parse out the line number. also your
  >> format starts with | so it means there is a leading empty field in the
  >> csv. not a big problem but something to be aware of.

  G> Well, I just want to get all of the emails into a spreadsheet format,
  G> then they are easier to work with. I can massage the data manually
  G> afterward. Just want to get it into a CSV - maybe I can get fancy once
  G> I get the fundamentals down.

easier to work with?? how? mail is easy to work with already. there are
many modules that will parse out mail including all the headers (which
aren't trivial to parse and a spreadsheet won't help).

and learn to bottom post. these netiquette rules are all covered in this
group's guidelines which are posted regularly. find a copy and read it.

uri

Okay - sorry about the post SNAFU.

Okay - perhaps I am barking up the wrong tree.

Perhaps you would be able to provide me with some module examples that
will extract email messages. The reason being is that I need to have
these emails in a spread sheet for a presentation that I am providing
to a government official - makes it easier for them to read.

Even a module (with examples) that will extract the email fields into
a database. That is a better idea because then I can search and
extract the data that I need.

Does that make any sense?

Thanks

Glen
 
Uri Guttman

G> Okay - sorry about the post SNAFU.

you still haven't gotten it. read the guidelines and google about bottom
posting. you don't quote the entire email but the parts you reply
to. you edit out the rest. you put your reply parts after the parts they
refer to. it is a conversation that reads top to bottom. very logical
but untaught to the intarweb unwashed masses. :/

G> Perhaps you would be able to provide me with some module examples
G> that will extract email messages. The reason being is that I need
G> to have these emails in a spread sheet for a presentation that I am
G> providing to a government official - makes it easier for them to
G> read.

search cpan. there are many modules that parse emails. just look at the
docs for some and see what catches your eye. then try them and see how
you like them. this is easy and very educational. using cpan is part of
the perl culture.

G> Even a module (with examples) that will extract the email fields
G> into a database. That is a better idea because then I can search
G> and extract the data that I need.

you can parse out the fields but putting them into a db is more likely
your code as that can be very custom.

G> Does that make any sense?

sort of. you need to learn to explain the bigger picture. saying you
want emails with one line per csv field is very different than saying
you want to be able to search emails by various headers and such.
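
To make the "search by various headers" idea concrete, here is a deliberately simplified, core-Perl-only sketch that pulls a few headers and the body out of one raw message. It is a stand-in for illustration, not a replacement for a CPAN mail parser: the sub name split_mail is invented, repeated headers keep only the last value, and no MIME or encoded-word decoding is attempted. It does show why reading line by line is not enough, because a folded header such as the Received: line in the sample spans two lines.

```perl
use strict;
use warnings;

# Split a raw message into a header hash (keys lowercased) and a body.
sub split_mail {
    my ($raw) = @_;
    my ($head, $body) = split /\r?\n\r?\n/, $raw, 2;
    $body = '' unless defined $body;

    # Unfold continuation lines: a line that starts with whitespace
    # continues the header on the previous line.
    $head =~ s/\r?\n[ \t]+/ /g;

    my %header;
    for my $line (split /\r?\n/, $head) {
        # Lines without "Name:" (e.g. an mbox "From - ..." line) are skipped.
        next unless $line =~ /^([^:\s]+):\s*(.*)$/;
        $header{ lc $1 } = $2;
    }
    return (\%header, $body);
}

# Sample data adapted from the message shown at the top of the thread.
my $raw = join "\n",
    'Received: by 10.48.212.6 with HTTP; Fri, 17 Nov 2006 12:52:26 -0800',
    ' (PST)',
    'Date: Fri, 17 Nov 2006 15:52:26 -0500',
    'Subject: Your satellite set up',
    '',
    'Hi Andrew;',
    'Thanks';

my ($header, $body) = split_mail($raw);

# One tab-separated row per message: chosen headers as columns.
print join("\t", @{$header}{qw(date subject)}), "\n";
```

With the headers in a hash, choosing which columns go into a spreadsheet row or a database table becomes a matter of listing the header names.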

uri
 
Jim Gibson

> Well, I just want to get all of the emails into a spreadsheet format,
> then they are easier to work with. I can massage the data manually
> afterward. Just want to get it into a CSV - maybe I can get fancy once
> I get the fundamentals down.

Perl can write Excel documents directly using the
Spreadsheet::WriteExcel module, available from CPAN. You can even add
some formatting like bold, italics, etc.
 
Peter J. Holzer

> Perhaps you would be able to provide me with some module examples that
> will extract email messages. The reason being is that I need to have
> these emails in a spread sheet for a presentation that I am providing
> to a government official - makes it easier for them to read.

I can think of a few reasons why you might want to represent an email
message in a single line, but "easy to read" is definitely not one of
them - how would a single line, thousands to millions of characters
long, be easy to read?

hp
 
