Regex to extract email from .msg

Bart Van der Donck · Jan 7, 2010

Hello,

I have been assigned a task to filter out an email address from the
body of a (.msg) source file.

The source file looks odd and displays differently in various
plaintext readers. It looks like some sort of half binary / half ascii
format (including the headers). The body of the file is more-or-less
consistent. The address to be extracted is in the following format:

"- n a m e @ h o s t . c o m "

All text in the source file is with such spaces between.

Spaces can be displayed like EOL, space or nothing. Binary characters
seem to be inserted randomly; sometimes I can recognize a pattern of a
repeated string. Maybe someone is familiar with this format ? The
messages were saved from MS Outlook.

I tried many variants, my best shot goes to:

if (/(-)(\s\s\s)(.+)(@)(.+)(\.)(.+)(\s\s\s)/gs) { ...

But still no success. I was thinking of an encoding issue (Unicode/
UTF?), but the source file seems too different for that.

Thanks

Peter J. Holzer · Jan 7, 2010

I have been assigned a task to filter out an email address from the
body of a (.msg) source file.

The source file looks odd and displays differently in various
plaintext readers. It looks like some sort of half binary / half ascii
format (including the headers). The body of the file is more-or-less
consistent. The address to be extracted is in the following format:

"- n a m e @ h o s t . c o m "

All text in the source file is with such spaces between.

Spaces can be displayed like EOL, space or nothing.

Then they probably aren't spaces. Most likely they are nul characters.
If you have Linux, use hd (or od) to look at the file. If you use
Windows, there's probably some freeware hex editor/viewer you can use.

But still no success. I was thinking of an encoding issue (Unicode/
UTF?), but the source file seems too different for that.

Most likely UTF-16, but there may be some additional markup.

hp

Steve C · Jan 7, 2010

Bart said:
Spaces can be displayed like EOL, space or nothing. Binary characters
seem to be inserted randomly; sometimes I can recognize a pattern of a
repeated string. Maybe someone is familiar with this format ? The
messages were saved from MS Outlook.

Are you sure they are spaces and not NULs? Windows text files
frequently use 16-bit wide character format, which looks like
0x0 in the high byte and ASCII in the low byte for English
characters.

http://www.microsoft.com/opentype/unicode/cs.htm
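To illustrate the point about 16-bit wide characters, here is a minimal sketch. The sample address is a stand-in, not data from the actual .msg file. It prints the bytes that UTF-16LE produces for ASCII text and checks whether \s would match the NUL bytes between the letters.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Encode a sample address the way a UTF-16LE file stores it (no BOM added).
my $bytes = encode('UTF-16LE', 'name@host.com');

# Show each byte in hex: every ASCII byte is followed by a 00 byte.
my $hex = join ' ', unpack '(H2)*', $bytes;
print "$hex\n";

# \s matches whitespace only, so it does not match a NUL byte.
my $nul_is_space = ("\0" =~ /\s/) ? 1 : 0;
print "NUL matches \\s: $nul_is_space\n";
```

This is why a pattern built on \s\s\s cannot match the raw bytes, even though some viewers render the NULs as blanks.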

Wanna-Be Sys Admin · Jan 8, 2010

Bart said:
I tried many variants, my best shot goes to:

if (/(-)(\s\s\s)(.+)(@)(.+)(\.)(.+)(\s\s\s)/gs) { ...

Maybe try stripping the hidden characters from the file first? Guessing
exactly how many spaces or other characters sit between the letters can
be a hassle, and I assume the addresses are not all formatted
identically. If the separators really are whitespace, \s+ in place of
\s\s\s would probably be better. Also, why are you capturing \s\s\s? Is
that intentional? In any case, you probably need to convert the
file/data to strip out the junk first, so that you can match the actual
data instead of working around the junk in the regex.
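The "strip the junk first" idea can be sketched as follows, assuming the junk is NUL bytes from UTF-16LE and the text is pure ASCII. The sample data is hypothetical. Deleting NULs is a crude shortcut that corrupts any non-ASCII character, so properly decoding the file is the more robust route.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical sample: raw UTF-16LE bytes as they would be read with :raw.
my $raw = encode('UTF-16LE', '- name@host.com ');

# Crude cleanup: delete every NUL byte. Only safe for ASCII-range text.
(my $clean = $raw) =~ tr/\0//d;

# \s+ tolerates any run of whitespace after the leading dash.
my ($addr) = $clean =~ /-\s+(\S+\@\S+\.\S+)/;
print "$addr\n" if defined $addr;
```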

Jürgen Exner · Jan 8, 2010

Bart Van der Donck said:
The source file looks odd and displays differently in various
plaintext readers. It looks like some sort of half binary / half ascii
format (including the headers). The body of the file is more-or-less
consistent. The address to be extracted is in the following format:

"- n a m e @ h o s t . c o m "

All text in the source file is with such spaces between.

Spaces can be displayed like EOL, space or nothing. Binary characters
seem to be inserted randomly; sometimes I can recognize a pattern of a
repeated string. Maybe someone is familiar with this format ? The
messages were saved from MS Outlook.

This file is likely in UTF-16 or UCS-2. Did you look at it in a
hex/binary editor? Those spaces are probably 0x00 bytes and not really
spaces at all.

Use the proper encoding and Perl should be able to read the file just
fine,

jue

Bart Van der Donck · Jan 8, 2010

Jürgen Exner said:
This file is likely in UTF-16 or UCS-2. Did you look at it in a
hex/binary editor? Those spaces are probably 0x00 bytes and not really
spaces at all.

Use the proper encoding and Perl should be able to read the file just
fine,

Yes - it appeared to be a UTF-16 issue indeed. I tried about all
possible byte order encoding schemes... and the following finally did
the trick:

use Encode;
open(my $in, '<:raw', $mypath) || die "Couldn't open file: $!";
my $txt = do { local $/; <$in> };
close $in;
my @lines = split /\n/, decode('UTF-16LE', $txt);

Thanks all for your help!
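With the bytes decoded, the original extraction task becomes an ordinary regex match. The following is a rough sketch, not the poster's actual code: extract_addresses is a hypothetical helper, the sample body is made up, and the pattern is deliberately loose rather than a full address validator.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Return every address that follows a dash and whitespace in the decoded text.
# The pattern is intentionally simple; it is not an RFC 5322 validator.
sub extract_addresses {
    my ($bytes) = @_;
    my $text = decode('UTF-16LE', $bytes);
    return $text =~ /-\s+([\w.+-]+\@[\w-]+(?:\.[\w-]+)+)/g;
}

# Hypothetical sample standing in for the message body.
my $sample = encode('UTF-16LE', "Contact:\r\n- name\@host.com \r\n");
my @found  = extract_addresses($sample);
print "$_\n" for @found;
```

For a real file, the byte string would come from slurping the file through the :raw layer, as in the code above.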

Permostat · Jan 8, 2010

Hello,

I have been assigned a task to filter out an email address from the
body of a (.msg) source file.

The source file looks odd and displays differently in various
plaintext readers. It looks like some sort of half binary / half ascii
format (including the headers). The body of the file is more-or-less
consistent. The address to be extracted is in the following format:

"- n a m e @ h o s t . c o m "

All text in the source file is with such spaces between.

Spaces can be displayed like EOL, space or nothing. Binary characters
seem to be inserted randomly; sometimes I can recognize a pattern of a
repeated string. Maybe someone is familiar with this format ? The
messages were saved from MS Outlook.

I tried many variants, my best shot goes to:

if (/(-)(\s\s\s)(.+)(@)(.+)(\.)(.+)(\s\s\s)/gs) { ...

But still no success. I was thinking of an encoding issue (Unicode/
UTF?), but the source file seems too different for that.

Thanks

Do little tiny minor jobs usually make you break out in a sweat like
this??

PRONTOR

Peter J. Holzer · Jan 8, 2010

use Encode;
open(my $in, '<:raw', $mypath) || die "Couldn't open file: $!";
my $txt = do { local $/; <$in> };
close $in;
my @lines = split /\n/, decode('UTF-16LE', $txt);

Shorter:

open(my $in, '<:encoding(UTF-16LE)', $mypath) || die "Couldn't open file: $!";
my @lines = <$in>;
chomp @lines;

(untested)

hp

Bart Van der Donck · Jan 9, 2010

Peter said:
Shorter:

open(my $in, '<:encoding(UTF-16LE)', $mypath) || die "Couldn't open file: $!";
my @lines = <$in>;
chomp @lines;

For my particular situation, it appears that I need the raw method
anyhow. When I read directly with '<:encoding(UTF-16LE)', it says:

"UTF-16LE:Unicode character fffe is illegal at script.pl line 32."

(32 is the line with the 'open'-call)

Jürgen Exner · Jan 9, 2010

Bart Van der Donck said:
For my particular situation, it appears that I need the raw method
anyhow. When I read directly with '<:encoding(UTF-16LE)', it says:

"UTF-16LE:Unicode character fffe is illegal at script.pl line 32."

The only place where 0xFFFE could plausibly show up is the byte order
mark (BOM), and I would be very surprised if Perl couldn't handle the
BOM.
I would suggest checking the file with a hex editor to make sure it does
not contain an additional rogue BOM somewhere in the middle of the file.

jue

sln · Jan 9, 2010

For my particular situation, it appears that I need the raw method
anyhow. When I read directly with '<:encoding(UTF-16LE)', it says:

"UTF-16LE:Unicode character fffe is illegal at script.pl line 32."

(32 is the line with the 'open'-call)

Try:
open(my $in, '<:encoding(UTF-16)', $mypath) || die "Couldn't open file: $!";

That is, UTF-16 rather than UTF-16LE.

An fffe BOM means UTF-16LE, so the file should have opened OK.
However, when you read the first time without seeking past the
BOM offset (2), fffe is read and is an illegal UTF-16 char.

When you open with UTF-16 instead, the layer expects a BOM and
automatically moves the file position past it for the first read.
It's called the BOM bug !!!

Of course if you don't have a BOM, using UTF-16 will die with
"no BOM". Another bug !!!

I posted code before that auto navigates these waters, if you
bothered to look.

-sln

sln · Jan 9, 2010

Try:
open(my $in, '<:encoding(UTF-16)', $mypath) || die "Couldn't open file: $!";

That is, UTF-16 rather than UTF-16LE.

An fffe BOM means UTF-16LE, so the file should have opened OK.
However, when you read the first time without seeking past the
BOM offset (2), fffe is read and is an illegal UTF-16 char.

When you open with UTF-16 instead, the layer expects a BOM and
automatically moves the file position past it for the first read.
It's called the BOM bug !!!

The bug is that seeks are broken: you have to keep track of the BOM
offset yourself (if there is a BOM), whereas this should be transparent
with :encoding(UTF-16).
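One way to sidestep the BOM question is to inspect the first two bytes yourself before decoding. The sketch below is not the code sln refers to. decode_utf16 is a hypothetical helper, and it assumes that input without a BOM is little-endian, which matches this thread but is not guaranteed in general.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Decode UTF-16 bytes, honouring a leading BOM if one is present.
# Assumption: BOM-less input is little-endian, as in this thread.
sub decode_utf16 {
    my ($bytes) = @_;
    my $bom  = substr($bytes, 0, 2);
    my $rest = substr($bytes, 2);
    return decode('UTF-16LE', $rest) if $bom eq "\xFF\xFE";
    return decode('UTF-16BE', $rest) if $bom eq "\xFE\xFF";
    return decode('UTF-16LE', $bytes);
}

# Hypothetical samples: the same text with and without a little-endian BOM.
my $with_bom    = "\xFF\xFE" . encode('UTF-16LE', 'name@host.com');
my $without_bom = encode('UTF-16LE', 'name@host.com');
print decode_utf16($with_bom), "\n";
print decode_utf16($without_bom), "\n";
```

Because the whole file is slurped as bytes first, no seeking on an encoded handle is needed.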
