extract all hotmail email addresses in a file and store in separate file

evanevankan2 · Jun 19, 2008

char const *const original = "\"(e-mail address removed)\"";


char buf[50];

strcpy(buf,original);


buf[strlen(original) - 1] = 0;


That does not strip both of the " characters.
char const is confusing, and the second const is unnecessary.
fix:
const char *original = "\"...\"";
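
One way to strip both quote characters, sketched under the assumption that the line sits in a writable buffer. The name strip_quotes is a hypothetical helper, not something from the posts above:

```c
#include <string.h>

/* Hypothetical helper: removes one trailing and one leading '"' from buf.
   The trailing quote is overwritten in place; the leading quote is skipped
   by returning a pointer to the character after it. */
char *strip_quotes(char *buf)
{
    size_t n = strlen(buf);

    if (n > 0 && buf[n - 1] == '"')
        buf[--n] = '\0';          /* drop the trailing quote */
    if (n > 0 && buf[0] == '"')
        return buf + 1;           /* skip the leading quote */
    return buf;
}
```

Returning a pointer into the buffer avoids shifting the characters down, at the cost that the caller must not pass the result to free().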

Take the last 12 characters, make them all lowercase, and then compare
with "@hotmail.com".


@ does not belong to C's basic character set, so, that's not possible.

What does this mean? That you can't use '@' in strings without relying
on the particular implementation?
So much for portability.
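
The suffix test quoted above (take the last 12 characters, lowercase them, compare with "@hotmail.com") could be sketched as follows. This assumes an implementation whose character sets include '@', which is the point under discussion; ends_with_hotmail is a made-up name for illustration:

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: returns 1 if s ends with "@hotmail.com",
   ignoring letter case, and 0 otherwise. */
int ends_with_hotmail(const char *s)
{
    static const char suffix[] = "@hotmail.com";
    size_t slen = strlen(s);
    size_t xlen = sizeof suffix - 1;   /* 12 characters, without the '\0' */
    size_t i;

    if (slen < xlen)
        return 0;
    s += slen - xlen;                  /* point at the last 12 characters */
    for (i = 0; i < xlen; i++) {
        /* The cast keeps the argument in unsigned char range,
           as tolower() requires. */
        if (tolower((unsigned char)s[i]) != suffix[i])
            return 0;
    }
    return 1;
}
```

Unlike a plain strstr() search, this rejects an address such as bob@hotmail.com.example.org, where the text appears but not at the end.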

santosh · Jun 19, 2008

On 19 June, 07:41, (e-mail address removed) wrote:

What does this mean? That you can't use '@' in strings without relying
on the particular implementation?

The relevant clause in the standard is 5.2.1(3). The extract that
follows is from n1256, which is not the official standard but a working
draft.

-----------
Both the basic source and basic execution character sets shall have the
following members: the 26 uppercase letters of the Latin alphabet

A B C D E F G H I J K L M
N O P Q R S T U V W X Y Z

the 26 lowercase letters of the Latin alphabet

a b c d e f g h i j k l m
n o p q r s t u v w x y z

the 10 decimal digits

0 1 2 3 4 5 6 7 8 9

the following 29 graphic characters

! " # % & ' ( ) * + , - . / :
; < = > ? [ \ ] ^ _ { | } ~

the space character, and control characters representing horizontal tab,
vertical tab, and form feed. The representation of each member of the
source and execution basic character sets shall fit in a byte. In both
the source and execution basic character sets, the value of each
character after 0 in the above list of decimal digits shall be one
greater than the value of the previous. In source files, there shall be
some way of indicating the end of each line of text; this International
Standard treats such an end-of-line indicator as if it were a single
new-line character. In the basic execution character set, there shall
be control characters representing alert, backspace, carriage return,
and new line. If any other characters are encountered in a source file
(except in an identifier, a character constant, a string literal, a
header name, a comment, or a preprocessing token that is never
converted to a token), the behavior is undefined.
-----------

so much for portability

But in practice most implementations do support the @ character, at
least those that I'm aware of, which is a tiny fraction of all the
implementations out there, so you might disregard my "most"
comment.

santosh · Jun 19, 2008

Bartc said:
news:e4359263-995d-4b1a-8865-9a97eceb7dc2@m36g2000hse.googlegroups.com...

Hi, I have a text file that contains a list of email addresses like
this:

"(e-mail address removed)"
"(e-mail address removed)"
"(e-mail address removed)"
"(e-mail address removed)"

I would like to

1. Strip out the " characters and just leave the email addresses on
each line.
2. Extract the hotmail addresses and store them in another file.
The hotmail addresses would be deleted from the original file.


You have perl solutions so you won't need this. But it was an interesting
little snippet:

/* Sort email addresses (possibly for some nefarious purpose) from
file "input" */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void error(void) {puts("File error"); exit(0);}

int main(void) {
char line[200];
char *p;
int n;

FILE *in,*hot,*nothot;

in=fopen("input","r");
if (in==0) error();

hot=fopen("hotmail","w");
if (hot==0) {fclose(in); error();};

nothot=fopen("nothotmail","w");
if (nothot==0) {fclose(in); fclose(nothot); error();};

while (1) {

fgets(line,sizeof(line),in);
if (feof(in)) break;

fgets could fail due to an I/O error too, not necessarily end-of-file.
You need to check ferror() too before proceeding to be absolutely safe,
since you don't check fgets for a NULL return. The latter strategy
involves only one check unless NULL is returned, but your strategy would
involve two checks (perhaps full function calls) after every fgets
call.

n=strlen(line);

Better to make 'n' unsigned long.

p=line;
if (line[n-1]='\n') {line[n-1]=0; --n;};
if (n) {
if (line[n-1]='""') {line[n-1]=0; --n;};
if (*p=='"') ++p;
if (strstr(p,"@hotmail.com"))
fprintf(hot,"%s\n",p);
else
fprintf(nothot,"%s\n",p);
};
};
fclose(in);
fclose(hot);
fclose(nothot);
}
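
A corrected sketch of the loop above, folding in points raised in this thread: test the return value of fgets, compare with == rather than assign with =, use '"' for the quote character, and allow a larger line buffer. The function name split_addresses and the decision to take already-open streams are assumptions made for illustration, not part of the original post:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical rewrite: copies each address read from `in` to `hot` or
   `nothot`, with the surrounding double quotes removed.
   Returns 0 on success, -1 if the loop ended because of a read error. */
int split_addresses(FILE *in, FILE *hot, FILE *nothot)
{
    char line[512];

    while (fgets(line, sizeof line, in) != NULL) {
        char *p = line;
        size_t n = strlen(line);

        if (n > 0 && line[n - 1] == '\n')   /* == compares; = would assign */
            line[--n] = '\0';
        if (n > 0 && line[n - 1] == '"')    /* trailing quote */
            line[--n] = '\0';
        if (n > 0 && *p == '"') {           /* leading quote */
            ++p;
            --n;
        }
        if (n == 0)
            continue;                       /* nothing left on this line */

        /* Same substring test as the original post. */
        fprintf(strstr(p, "@hotmail.com") ? hot : nothot, "%s\n", p);
    }
    return ferror(in) ? -1 : 0;
}
```

A caller would open the three files as in the post above, close whichever streams were opened successfully if a later fopen fails, and exit with EXIT_FAILURE rather than 0 on error. Lines longer than the buffer are still split across reads in this sketch.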

santosh · Jun 19, 2008

Bartc said:
<snip>
if (n) {
if (line[n-1]='""') {line[n-1]=0; --n;};

Also what do you mean by '""' here? Did you mean to write '"'?

<snip rest>

santosh · Jun 19, 2008

Bartc said:
<snip>
char line[200];

Also according to the relevant RFCs (2821 & 2822), just the domain part
of an email address can be up to 255 characters. To be safe you might
want to make line at least 512 bytes.

<snip rest>

evanevankan2 · Jun 19, 2008

[email protected] said:
[email protected] said:

On 19 June, 07:41, (e-mail address removed) wrote:


What does this mean? That you can't use '@' in strings without relying
on the particular implementation?


The relevant clause in the standard is 5.2.1(3). The extract that
follows is from n1256, which is not the official standard but a working
draft.

-----------
[snip]
-----------

so much for portability


But in practice most implementations do support the @ character, at
least those that I'm aware of, which is a tiny fraction of all the
implementations out there, so you might disregard my "most"
comment.

Thanks for the reply.

But what would happen if @ is not supported? Is it only in the source
code, as character constants, that it is not supported?
I guess reading text from a file which contains @ works since
it is just bytes. Do toupper(), isdigit() and similar functions
just ignore unsupported characters?

And how much can be assumed to be supported, where does one draw
the line? For example @ is probably supported while a Japanese
character is not, right?

santosh · Jun 19, 2008

[email protected] said:
[email protected] said:

On 19 June, 07:41, (e-mail address removed) wrote:


@ does not belong to C's basic character set, so, that's not
possible.


What does this mean? That you can't use '@' in strings without
relying on the particular implementation?


The relevant clause in the standard is 5.2.1(3). The extract that
follows is from n1256, which is not the official standard but a working
draft.

-----------
[snip]
-----------

so much for portability


But in practice most implementations do support the @ character, at
least those that I'm aware of, which is a tiny fraction of all the
implementations out there, so you might disregard my "most"
comment.


Thanks for the reply.

But what would happen if @ is not supported? Is it only in the source
code, as character constants, that it is not supported?

It could be either one or both.

I guess reading text from a file which contains @ works since
it is just bytes.
Do toupper(), isdigit() and similar functions
just ignore unsupported characters?

They return false.

And how much can be assumed to be supported, where does one draw
the line? For example @ is probably supported while a Japanese
character is not, right?

It depends on the implementation. Some implementations have locales for
non-latin environments. You need to read the documentation for your
implementation and set the appropriate locale after program start-up.
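
To illustrate the <ctype.h> answer above: in the default "C" locale, isdigit() returns zero for a character such as '@', and toupper() hands back any character that is not a lowercase letter unchanged. A minimal sketch, assuming the implementation supports '@' at all; the helper name upcase_and_count_digits is made up for illustration:

```c
#include <ctype.h>

/* Hypothetical helper: uppercases s in place and returns the number of
   characters that isdigit() classifies as decimal digits. */
int upcase_and_count_digits(char *s)
{
    int digits = 0;

    for (; *s != '\0'; s++) {
        /* The cast keeps the argument in unsigned char range, as the
           <ctype.h> functions require. */
        unsigned char c = (unsigned char)*s;

        if (isdigit(c))
            digits++;
        *s = (char)toupper(c);   /* non-letters such as '@' come back unchanged */
    }
    return digits;
}
```

Behavior for characters outside the basic set can change once setlocale() selects a different locale, which is why the implementation's documentation matters here.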

Martijn Lievaart · Jun 19, 2008

perl -nie 'if (/\@hotmail.com@$/) { s/"//g; print; }' text_file

Or even:

perl -nie 's/"//g; print if /\@hotmail.com@$/' text_file

M4

Bartc · Jun 19, 2008

santosh said:
fgets could fail due to an I/O error too, not necessarily end-of-file.

OK. What happens to the buffer in that case, would it be an empty string?
And would feof() ever become true?

if (line[n-1]='""') {line[n-1]=0; --n;};


Also what do you mean by '""' here? Did you mean to write '"'?

I can't actually see clearly, but yes it should be a single " inside single
quotes. Although two double quotes surprisingly still works; I would have
expected in this case to compare a char widened to whatever size '""' was,
not for the '""' to be narrowed to char.

char line[200];


Also according to the relevant RFCs (2821 & 2822), just the domain part
of an email address can be up to 255 characters. To be safe you might
want to make line at least 512 bytes.

OK I was guessing that. Although someone with a 511-char email won't be very
popular with his friends. And he wouldn't be getting any spam via this
program either...

santosh · Jun 19, 2008

Bartc said:
OK. What happens to the buffer in that case, would it be an empty
string?

When a read error occurs (i.e., fgets returns NULL and ferror is true)
then the array contents are indeterminate. In the case of end-of-file
(i.e., fgets returns NULL and feof is true) where no characters were
read, the array contents are left unchanged.

And would feof() ever become true?

Yes, when end-of-file is encountered.

<snip>
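
The distinction described above (a NULL return from fgets can mean either end-of-file or a read error) suggests a loop that tests the return value once per call and consults ferror() only after the loop ends. A rough sketch; count_reads is a hypothetical name chosen for this example:

```c
#include <stdio.h>

/* Hypothetical helper: counts successful fgets() calls on `in`, which is
   one per line as long as no line exceeds the buffer.
   Returns the count, or -1 if the loop ended because of a read error. */
long count_reads(FILE *in)
{
    char line[512];
    long count = 0;

    /* fgets() returns NULL both at end-of-file and on a read error,
       so the loop condition alone cannot tell the two apart. */
    while (fgets(line, sizeof line, in) != NULL)
        count++;

    /* After the loop, ferror() distinguishes the two cases. */
    if (ferror(in))
        return -1;
    return count;
}
```

This costs one comparison per line, with feof() or ferror() called only once at the end, rather than after every read.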

Dr.Ruud · Jun 20, 2008

Martijn Lievaart schreef:

Or even:

perl -nie 's/"//g; print if /\@hotmail.com@$/' text_file

Don't you mean this?

perl -ne 's/"//g; print if /\@hotmail\./' text_file

Martijn Lievaart · Jun 20, 2008

Martijn Lievaart schreef:

Don't you mean this?

perl -ne 's/"//g; print if /\@hotmail\./' text_file

I think I meant this:

perl -ni -e 's/"//g; print if /\@hotmail.com@$/' text_file

(-i makes a backup, -ie probably takes the 'e' as the backup suffix.)

M4

szr · Jun 20, 2008

Martijn said:
I think I meant this:

perl -ni -e 's/"//g; print if /\@hotmail.com@$/' text_file

(-i makes a backup, -ie probably takes the 'e' as the backup suffix.)

Maybe I'm missing something, but I don't understand why you have an @
near the end of your regex just before the $ ? I can't find any mention
of it in perldoc or my Perl Pocket Reference, but it's possible I'm
missing something.

Thanks.

Martijn Lievaart · Jun 20, 2008

Maybe I'm missing something, but I don't understand why you have an @
near the end of your regex just before the $ ? I can't find any mention
of it in perldoc or my Perl Pocket Reference, but it's possible I'm
missing something.

You're missing nothing. I'm blind.

M4

Henry Law · Jun 21, 2008

Tomás Ó hÉilidhe said:
char const *const original = "\"(e-mail address removed)\"";

Please learn to look at the groups to which an article is posted and
strip out those for which your reply is not relevant. In this case
comp.lang.perl.misc.

Keith Thompson · Jul 2, 2008

On 19 June, 07:41, (e-mail address removed) wrote: [...]

@ does not belong to C's basic character set, so, that's not possible.


What does this mean? That you can't use '@' in strings without
relying on the particular implementation? so much for portability

It means that it's possible to have a conforming C implementation on a
system where the '@' character is not supported. Likewise for '$' and
'`' (backtick); those happen to be the only three ASCII printable
characters that the C standard doesn't require.

As it happens, the vast majority of character sets in current use are
based on ASCII, and most non-ASCII systems with C implementations use
some variant of EBCDIC. And, as it happens, both ASCII and EBCDIC can
represent '@', '$', and '`'. So, although the C standard doesn't
*require* all systems to support these characters, you're not likely
to run across an implementation that doesn't support them.

(IMHO it wouldn't hurt for a future revision of the C standard to
require support for these three characters, even if they're only
usable in character constants, string literals, comments, and a few
other similar contexts.)

So vippstar's statement that "that's not possible" is a bit
over-stated. You can't use the '@' character in completely 100%
theoretically portable strictly conforming C code. But you can almost
certainly use it safely if you don't mind the *theoretical* loss of
portability that's unlikely to be an issue in real life.

Unless (a) there's some system out there that I've never heard of that
doesn't support the '@' character, and (b) you want to search for
hotmail.com addresses on such a system.

Walter Roberson · Jul 2, 2008

Keith Thompson said:
As it happens, the vast majority of character sets in current use are
based on ASCII, and most non-ASCII systems with C implementations use
some variant of EBCDIC. And, as it happens, both ASCII and EBCDIC can
represent '@', '$', and '`'.

'@' is apparently in the invariant code-sets (according
to wikipedia) but '$' and '`' are not, at least according to the
shading of the table (which might be wrong).
http://en.wikipedia.org/wiki/EBCDIC

I've looked through some of the IBM EBCDIC code pages but did not
happen upon any that were missing the '`' (though it was not in
the same place in all of the ones I looked at.) The EBCDIC code
page list, in case someone is sufficiently bored to look, is at
http://www-306.ibm.com/software/globalization/cp/cp_es.jsp#EBCDIC

Antoninus Twink · Jul 2, 2008

You can't use the '@' character in completely 100% theoretically
portable strictly conforming C code. But you can almost certainly use
it safely if you don't mind the *theoretical* loss of portability
that's unlikely to be an issue in real life.

Worrying about real life! Whatever next? I sense that KT is almost one
of us.

Kenny McCormack · Jul 3, 2008

Worrying about real life! Whatever next? I sense that KT is almost one
of us.

I *was* absolutely shocked to see KT mention real-world considerations.

What's going on here? Something we should know about? Death in the
family? What?

Keith Thompson · Jul 3, 2008

'@' is apparently in the invariant code-sets (according
to wikipedia) but '$' and '`' are not, at least according to the
shading of the table (which might be wrong).
http://en.wikipedia.org/wiki/EBCDIC

I've looked through some of the IBM EBCDIC code pages but did not
happen upon any that were missing the '`' (though it was not in
the same place in all of the ones I looked at.) The EBCDIC code
page list, in case someone is sufficiently bored to look, is at
http://www-306.ibm.com/software/globalization/cp/cp_es.jsp#EBCDIC

I took a cursory look at the EBCDIC table in the Wikipedia article and
saw all three characters. I didn't consider variations or invariant
code-sets, and I'm no expert on EBCDIC, so feel free to take what I
wrote with a grain of salt.

[This is partly a test of a new news server; I had problems with
aioe.org, so I'm trying motzarella.org.]
