Regex to extract email from .msg

  • Thread starter Bart Van der Donck

Bart Van der Donck

Hello,

I have been assigned a task to filter out an email address from the
body of a (.msg) source file.

The source file looks odd and displays differently in various
plaintext readers. It looks like some sort of half binary / half ascii
format (including the headers). The body of the file is more-or-less
consistent. The address to be extracted is in the following format:

"-   n a m e @ h o s t . c o m   "

All text in the source file is with such spaces between.

Spaces can be displayed like EOL, space or nothing. Binary characters
seem to be inserted randomly; sometimes I can recognize a pattern of a
repeated string. Maybe someone is familiar with this format ? The
messages were saved from MS Outlook.

I tried many variants, my best shot goes to:

if (/(-)(\s\s\s)(.+)(@)(.+)(\.)(.+)(\s\s\s)/gs) { ...

But still no success. I was thinking of an encoding issue (Unicode/
UTF?), but the source file seems too different for that.

Thanks
 

Peter J. Holzer

I have been assigned a task to filter out an email address from the
body of a (.msg) source file.

The source file looks odd and displays differently in various
plaintext readers. It looks like some sort of half binary / half ascii
format (including the headers). The body of the file is more-or-less
consistent. The address to be extracted is in the following format:

"- n a m e @ h o s t . c o m "

All text in the source file is with such spaces between.

Spaces can be displayed like EOL, space or nothing.

Then they probably aren't spaces. Most likely they are nul characters.
If you have Linux, use hd (or od) to look at the file. If you use
Windows, there's probably some freeware hex editor/viewer you can use.

But still no success. I was thinking of an encoding issue (Unicode/
UTF?), but the source file seems too different for that.

Most likely UTF-16, but there may be some additional markup.
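
The byte-level check suggested above can also be done from Perl itself. This is only a sketch: hex_preview and the sample bytes are invented for illustration, and a real check would use bytes read from the .msg file through the ':raw' layer.

```perl
use strict;
use warnings;

# Return the first $count bytes of a byte string as space-separated hex.
sub hex_preview {
    my ($bytes, $count) = @_;
    my $head = substr($bytes, 0, $count);
    return join ' ', map { sprintf '%02x', ord } split //, $head;
}

# Hypothetical sample: "- name" as it would look in UTF-16LE.
my $sample = "-\0 \0n\0a\0m\0e\0";
print hex_preview($sample, 12), "\n";
# prints: 2d 00 20 00 6e 00 61 00 6d 00 65 00
```

If every second byte is 00, the "spaces" are NULs and the text is very likely UTF-16LE.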

hp
 

Steve C

Bart said:
The address to be extracted is in the following format:

"- n a m e @ h o s t . c o m "

All text in the source file is with such spaces between.

Spaces can be displayed like EOL, space or nothing.

Are you sure they are spaces and not NULs? Windows text files
frequently use 16-bit wide character format, which looks like
0x0 in the high byte and ASCII in the low byte for English
characters.
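
To make that concrete, here is a small sketch of what the 16-bit format does to an ASCII string, and why a pattern built on \s cannot match it. The sample address is the placeholder from the question; nothing here reads the actual file.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Encode the placeholder address the way a 16-bit Windows text file stores it.
my $bytes = encode('UTF-16LE', 'name@host.com');
printf "%d characters became %d bytes\n",
    length('name@host.com'), length($bytes);

# Count the NUL bytes that sit between the ASCII characters.
my $nul_count = ($bytes =~ tr/\0//);
print "NUL bytes: $nul_count\n";

# NUL is not whitespace, so \s never matches these separators.
print "NUL matches \\s? ", ("\0" =~ /\s/ ? "yes" : "no"), "\n";
```

This would explain why the regex with \s\s\s fails on the raw bytes even though the file "looks" space-separated in a text viewer.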

http://www.microsoft.com/opentype/unicode/cs.htm
 

Wanna-Be Sys Admin

Bart said:
The address to be extracted is in the following format:

"- n a m e @ h o s t . c o m "

I tried many variants, my best shot goes to:

if (/(-)(\s\s\s)(.+)(@)(.+)(\.)(.+)(\s\s\s)/gs) { ...

Maybe try stripping hidden characters from the file first? Trying to
guess exactly how many spaces or other characters there are can be a
hassle. The problem is that one address could be formatted differently
from another; I assume they aren't all the same? If so, and if they
really are whitespace, maybe \s+ in place of \s\s\s would be better?
Also, why are you capturing \s\s\s? Is that intentional and what you
want? Anyway, you probably need to convert the file/data to strip out
the junk, so you can get the actual data you want rather than trying to
work around ignoring or grabbing that junk.
 

Jürgen Exner

Bart Van der Donck said:
The source file looks odd and displays differently in various
plaintext readers. It looks like some sort of half binary / half ascii
format (including the headers). The body of the file is more-or-less
consistent. The address to be extracted is in the following format:

"- n a m e @ h o s t . c o m "

All text in the source file is with such spaces between.

Spaces can be displayed like EOL, space or nothing. Binary characters
seem to be inserted randomly; sometimes I can recognize a pattern of a
repeated string. Maybe someone is familiar with this format ? The
messages were saved from MS Outlook.

This file is likely in UTF-16 or UCS-2. Did you look at it in a
hex/binary editor? Those spaces are probably 0x00 bytes and not really
spaces at all.

Use the proper encoding and Perl should be able to read the file just
fine.

jue
 

Bart Van der Donck

Jürgen Exner said:
This file is likely in UTF-16 or UCS-2. Did you look at it in a
hex/binary editor? Those spaces are probably 0x00 bytes and not really
spaces at all.

Use the proper encoding and Perl should be able to read the file just
fine.

Yes - it appeared to be a UTF-16 issue indeed. I tried about all
possible byte order encoding schemes... and the following finally did
the trick:

use Encode;
open(my $in, '<:raw', $mypath) || die "Couldn't open file: $!";
my $txt = do { local $/; <$in> };
close $in;
my @lines = split /\n/, decode('UTF-16LE', $txt);
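
Once the text is decoded, the original task becomes an ordinary match. The following is a sketch under assumptions: first_address is a made-up helper, the pattern is deliberately simple (not a full RFC 5322 matcher), and the sample bytes stand in for what was read from the file.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Return the first thing that looks like an email address, or undef.
# Simplified pattern: local part, '@', then dot-separated host labels.
sub first_address {
    my ($text) = @_;
    return $text =~ /([\w.+-]+\@[\w-]+(?:\.[\w-]+)+)/ ? $1 : undef;
}

# Hypothetical stand-in for the raw bytes of the message body.
my $raw  = encode('UTF-16LE', "Contact:\n- name\@host.com \n");
my $text = decode('UTF-16LE', $raw);

print first_address($text), "\n";
# prints: name@host.com
```

Matching after decoding means the pattern no longer has to guess how many filler bytes sit between characters.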

Thanks all for your help!
 

Permostat

Hello,

I have been assigned a task to filter out an email address from the
body of a (.msg) source file.

The source file looks odd and displays differently in various
plaintext readers. It looks like some sort of half binary / half ascii
format (including the headers). The body of the file is more-or-less
consistent. The address to be extracted is in the following format:

   "-   n a m e @ h o s t . c o m   "

All text in the source file is with such spaces between.

Spaces can be displayed like EOL, space or nothing. Binary characters
seem to be inserted randomly; sometimes I can recognize a pattern of a
repeated string. Maybe someone is familiar with this format ? The
messages were saved from MS Outlook.

I tried many variants, my best shot goes to:

   if (/(-)(\s\s\s)(.+)(@)(.+)(\.)(.+)(\s\s\s)/gs)  { ...

But still no success. I was thinking of an encoding issue (Unicode/
UTF?), but the source file seems too different for that.

Thanks

Do little tiny minor jobs usually make you break out in a sweat like
this??

PRONTOR
 

Peter J. Holzer

use Encode;
open(my $in, '<:raw', $mypath) || die "Couldn't open file: $!";
my $txt = do { local $/; <$in> };
close $in;
my @lines = split /\n/, decode('UTF-16LE', $txt);

Shorter:

open(my $in, '<:encoding(UTF-16LE)', $mypath) || die "Couldn't open file: $!";
my @lines = <$in>;
chomp @lines;

(untested)
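
For a file that really is clean UTF-16LE, the layer-based version can be tried on a throwaway file. This is a sketch only: the temporary file and its contents are invented for the test, and the extra ':raw' in the layer list is an addition (not in the snippet above) meant to keep CRLF translation out of the way.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempfile);

# Write a small UTF-16LE file (no BOM) to stand in for the real input.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out, ':raw';
print {$out} encode('UTF-16LE', "line one\nline two\n");
close $out or die "Couldn't close $path: $!";

# Let the I/O layer do the decoding while reading line by line.
open(my $in, '<:raw:encoding(UTF-16LE)', $path)
    or die "Couldn't open file: $!";
my @lines = <$in>;
close $in;
chomp @lines;

print scalar(@lines), " lines: @lines\n";
# prints: 2 lines: line one line two
```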

hp
 

Bart Van der Donck

Peter said:
Shorter:

    open(my $in, '<:encoding(UTF-16LE)', $mypath) || die "Couldn't open file: $!";
    my @lines = <$in>;
    chomp @lines;

For my particular situation, it appears that I need the raw method
anyhow. When I read directly with '<:encoding(UTF-16LE)', it says:

"UTF-16LE:Unicode character fffe is illegal at script.pl line 32."

(32 is the line with the 'open'-call)
 

Jürgen Exner

Bart Van der Donck said:
For my particular situation, it appears that I need the raw method
anyhow. When I read directly with '<:encoding(UTF-16LE)', it says:

"UTF-16LE:Unicode character fffe is illegal at script.pl line 32."

The only place where 0xFFFE could possibly show up is the byte order
mark (BOM) and I would be very surprised if Perl couldn't handle the
BOM.
I would suggest checking the file with a hex editor to make sure it does
not contain an additional rogue BOM somewhere in the middle of the file.
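
The same check can be scripted instead of done by eye. A sketch, with bom_offsets and the sample bytes invented for illustration: it lists every 16-bit-aligned offset holding FF FE or FE FF. Note that the bytes FE FF, read as little-endian, decode to U+FFFE, which would fit the error message quoted above.

```perl
use strict;
use warnings;

# Return the even byte offsets where a UTF-16 BOM byte pair
# (FF FE or FE FF) occurs, stepping one 16-bit code unit at a time.
sub bom_offsets {
    my ($bytes) = @_;
    my @offsets;
    for (my $i = 0; $i + 1 < length $bytes; $i += 2) {
        my $pair = substr($bytes, $i, 2);
        push @offsets, $i if $pair eq "\xFF\xFE" or $pair eq "\xFE\xFF";
    }
    return @offsets;
}

# Hypothetical sample: a leading BOM, "hi", a stray FE FF pair, then "!".
my $sample  = "\xFF\xFEh\0i\0\xFE\xFF!\0";
my @offsets = bom_offsets($sample);
print "@offsets\n";
# prints: 0 6
```

Any offset other than 0 in the output points at a pair worth inspecting in the hex editor.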

jue
 

sln

For my particular situation, it appears that I need the raw method
anyhow. When I read directly with '<:encoding(UTF-16LE)', it says:

"UTF-16LE:Unicode character fffe is illegal at script.pl line 32."

(32 is the line with the 'open'-call)

Try:
  open(my $in, '<:encoding(UTF-16)', $mypath) || die "Couldn't open file: $!";

(that is, UTF-16 rather than UTF-16LE)

An fffe BOM means UTF-16LE, and the file should have opened OK.
However, when you read the first time without seeking past the
BOM offset (2), fffe is read and is an illegal UTF-16 char.

When you open with UTF-16 instead, the layer expects a BOM and
automatically moves the file position past it for the first read.
It's called the BOM bug !!!

Of course if you don't have a BOM, using UTF-16 will die with
"no BOM". Another bug !!!
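
One way around both cases is to read the bytes raw and pick the encoding by hand. The sketch below is not the code referred to in this post; decode_utf16 is a made-up helper, and falling back to little-endian when no BOM is present is an assumption that suits files written on Windows.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Decode a byte string as UTF-16: honour a leading BOM if there is one,
# strip it, and otherwise assume little-endian (an assumption).
sub decode_utf16 {
    my ($bytes) = @_;    # a copy, so the caller's data is left untouched
    my $encoding = 'UTF-16LE';
    if    ($bytes =~ s/\A\xFF\xFE//) { $encoding = 'UTF-16LE' }
    elsif ($bytes =~ s/\A\xFE\xFF//) { $encoding = 'UTF-16BE' }
    return decode($encoding, $bytes);
}

my $with_bom    = "\xFF\xFE" . encode('UTF-16LE', 'name@host.com');
my $without_bom = encode('UTF-16LE', 'name@host.com');

print decode_utf16($with_bom), "\n";
print decode_utf16($without_bom), "\n";
# both lines print: name@host.com
```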

I posted code before that auto navigates these waters, if you
bothered to look.

-sln
 

sln

Try:
  open(my $in, '<:encoding(UTF-16)', $mypath) || die "Couldn't open file: $!";

(that is, UTF-16 rather than UTF-16LE)

An fffe BOM means UTF-16LE, and the file should have opened OK.
However, when you read the first time without seeking past the
BOM offset (2), fffe is read and is an illegal UTF-16 char.

When you open with UTF-16 instead, the layer expects a BOM and
automatically moves the file position past it for the first read.
It's called the BOM bug !!!

The bug is that seeks are broken: you have to keep track of the BOM
offset yourself (if there is a BOM), and this should be transparent
with :encoding(UTF-16).
 
