extract all hotmail email addresses in a file and store in a separate file

evanevankan2

char const *const original = "\"(e-mail address removed)\"";
char buf[50];
strcpy(buf,original);

buf[strlen(original) - 1] = 0;

That does not strip both of the " characters.
char const is confusing, and the second const is unnecessary.
fix:
const char *original = "\"...\"";
Take the last 12 characters, make them all lowercase, and then compare
with "@hotmail.com".

@ does not belong to C's basic character set, so, that's not possible.

What does this mean? That you can't use '@' in strings without relying
on the particular implementation?
so much for portability
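
The two steps quoted above (strip both " characters, then compare the last 12 characters, lowercased, with "@hotmail.com") could be sketched roughly as follows. The function names strip_quotes and is_hotmail are made up for illustration, and the code assumes an implementation whose character set includes '@', which is the point under discussion.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: remove one trailing and one leading '"' from s,
   in place, if they are present. */
static void strip_quotes(char *s)
{
    size_t n = strlen(s);

    if (n > 0 && s[n - 1] == '"')
        s[--n] = '\0';
    if (n > 0 && s[0] == '"')
        memmove(s, s + 1, n);   /* n bytes: n-1 characters plus the terminator */
}

/* Hypothetical helper: return 1 if s ends with "@hotmail.com",
   ignoring case, and 0 otherwise. */
static int is_hotmail(const char *s)
{
    static const char suffix[] = "@hotmail.com";   /* 12 characters */
    size_t slen = sizeof suffix - 1;
    size_t n = strlen(s);
    size_t i;

    if (n < slen)
        return 0;
    s += n - slen;              /* point at the last 12 characters */
    for (i = 0; i < slen; i++) {
        /* tolower() needs a value representable as unsigned char */
        if (tolower((unsigned char)s[i]) != suffix[i])
            return 0;
    }
    return 1;
}
```

Calling strip_quotes() on a writable copy of the line and then is_hotmail() on the result gives the classification; unlike a plain strstr() search, this only matches when the domain is at the very end of the address.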
 
santosh

On 19 June, 07:41, (e-mail address removed) wrote:


What does this mean? That you can't use '@' in strings without relying
on the particular implementation?

The relevant clause in the standard is 5.2.1(3). The extract that
follows is from n1256, which is not the official standard but a working
draft.

-----------
Both the basic source and basic execution character sets shall have the
following members: the 26 uppercase letters of the Latin alphabet

A B C D E F G H I J K L M
N O P Q R S T U V W X Y Z

the 26 lowercase letters of the Latin alphabet

a b c d e f g h i j k l m
n o p q r s t u v w x y z

the 10 decimal digits

0 1 2 3 4 5 6 7 8 9

the following 29 graphic characters

! " # % & ' ( ) * + , - . / :
; < = > ? [ \ ] ^ _ { | } ~

the space character, and control characters representing horizontal tab,
vertical tab, and form feed. The representation of each member of the
source and execution basic character sets shall fit in a byte. In both
the source and execution basic character sets, the value of each
character after 0 in the above list of decimal digits shall be one
greater than the value of the previous. In source files, there shall be
some way of indicating the end of each line of text; this International
Standard treats such an end-of-line indicator as if it were a single
new-line character. In the basic execution character set, there shall
be control characters representing alert, backspace, carriage return,
and new line. If any other characters are encountered in a source file
(except in an identifier, a character constant, a string literal, a
header name, a comment, or a preprocessing token that is never
converted to a token), the behavior is undefined.
-----------
so much for portability

But in practice most implementations do support the @ character, at
least those that I'm aware of, which is a tiny fraction of all the
implementations out there, so you might disregard my "most"
comment. :)
 
santosh

Bartc said:
news:e4359263-995d-4b1a-8865-9a97eceb7dc2@m36g2000hse.googlegroups.com...
Hi, I have a text file that contains a list of email addresses like
this:

"(e-mail address removed)"
"(e-mail address removed)"
"(e-mail address removed)"
"(e-mail address removed)"

I'd like to

1. Strip out the " characters and just leave the email addresses on
each line.
2. Extract the hotmail addresses and store them in another file.
The hotmail addresses in the original file would be deleted.

You have perl solutions so you won't need this. But it was an interesting
little snippet:

/* Sort email addresses (possibly for some nefarious purpose) from
file "input" */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

void error(void) {puts("File error"); exit(0);}

int main(void) {
    char line[200];
    char *p;
    int n;

    FILE *in,*hot,*nothot;

    in=fopen("input","r");
    if (in==0) error();

    hot=fopen("hotmail","w");
    if (hot==0) {fclose(in); error();};

    nothot=fopen("nothotmail","w");
    if (nothot==0) {fclose(in); fclose(nothot); error();};

    while (1) {

        fgets(line,sizeof(line),in);
        if (feof(in)) break;

fgets could fail due to an I/O error too, not necessarily end-of-file.
You need to check ferror() too before proceeding to be absolutely safe,
since you don't check the return value of fgets for NULL. The latter
strategy involves only one check unless NULL is returned, but your
strategy would involve two checks (perhaps full function calls) after
every fgets call.

        n=strlen(line);

Better to make 'n' unsigned long.

        p=line;
        if (line[n-1]='\n') {line[n-1]=0; --n;};
        if (n) {
            if (line[n-1]='""') {line[n-1]=0; --n;};
            if (*p=='"') ++p;
            if (strstr(p,"@hotmail.com"))
                fprintf(hot,"%s\n",p);
            else
                fprintf(nothot,"%s\n",p);
        };
    };
    fclose(in);
    fclose(hot);
    fclose(nothot);
}
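
For comparison, here is a minimal sketch of how the read loop might look with the review comments applied. It is not the posted code: the function name split_addresses is hypothetical, the streams are passed in already open, fgets() is tested directly against NULL, ferror() is checked once after the loop, the length is a size_t, the two character tests use == rather than =, and the buffer size of 512 is only an assumption.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy each address from in to hot or nothot,
   without the surrounding quotes.
   Returns 0 on success, -1 if reading from in failed. */
static int split_addresses(FILE *in, FILE *hot, FILE *nothot)
{
    char line[512];

    /* fgets returns NULL both at end-of-file and on a read error */
    while (fgets(line, sizeof line, in) != NULL) {
        char *p = line;
        size_t n = strlen(line);

        if (n > 0 && line[n - 1] == '\n')   /* drop the newline */
            line[--n] = '\0';
        if (n > 0 && line[n - 1] == '"')    /* drop the closing quote */
            line[--n] = '\0';
        if (*p == '"')                      /* skip the opening quote */
            ++p;
        if (*p == '\0')
            continue;                       /* nothing left on this line */

        if (strstr(p, "@hotmail.com") != NULL)
            fprintf(hot, "%s\n", p);
        else
            fprintf(nothot, "%s\n", p);
    }

    /* the loop ended on NULL: decide here whether it was an error */
    return ferror(in) ? -1 : 0;
}
```

The caller would still open and close the three files, and on a failed fopen() should close only the streams that were actually opened.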
 
santosh

Bartc said:
if (line[n-1]='""') {line[n-1]=0; --n;};

Also what do you mean by '""' here? Did you mean to write '"'?

<snip rest>
 
santosh

Bartc said:
char line[200];

Also according to the relevant RFCs (2821 & 2822), just the domain part
of an email address can be up to 255 characters. To be safe you might
want to make line at least 512 bytes.

<snip rest>
 
evanevankan2

On 19 June, 07:41, (e-mail address removed) wrote:
What does this mean? That you can't use '@' in strings without relying
on the particular implementation?

The relevant clause in the standard is 5.2.1(3). The extract that
follows is from n1256, which is not the official standard but a working
draft.

-----------
[snip]
-----------
so much for portability

But in practice most implementations do support the @ character, at
least those that I'm aware of, which is a tiny fraction of all the
implementations out there, so you might disregard my "most"
comment. :)

Thanks for the reply.

But what would happen if @ is not supported? Is it only in the source
code as character constants it is not supported?
I guess reading text from a file which contains @ works since
it is just bytes. Do toupper(), isdigit() and similar functions
just ignore unsupported characters?

And how much can be assumed to be supported, where does one draw
the line? For example @ is probably supported while a Japanese
character is not, right?
 
santosh

On 19 June, 07:41, (e-mail address removed) wrote:

@ does not belong to C's basic character set, so, that's not
possible.
What does this mean? That you can't use '@' in strings without
relying on the particular implementation?

The relevant clause in the standard is 5.2.1(3). The extract that
follows is from n1256, which is not the official standard but a working
draft.

-----------
[snip]
-----------
so much for portability

But in practice most implementations do support the @ character, at
least those that I'm aware of, which is a tiny fraction of all the
implementations out there, so you might disregard my "most"
comment. :)

Thanks for the reply.

But what would happen if @ is not supported? Is it only in the source
code as character constants it is not supported?

It could be either one or both.
I guess reading text from a file which contains @ works since
it is just bytes.
Do toupper(), isdigit() and similar functions
just ignore unsupported characters?

The is*() functions return false; toupper() returns its argument unchanged.
And how much can be assumed to be supported, where does one draw
the line? For example @ is probably supported while a Japanese
character is not, right?

It depends on the implementation. Some implementations have locales for
non-Latin environments. You need to read the documentation for your
implementation and set the appropriate locale after program start-up.
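
As a small illustration of the answer about toupper() and isdigit(): for a character that is neither a digit nor a lowercase letter, such as '@' on an implementation that supports it, isdigit() returns zero and toupper() hands back its argument unchanged. The wrapper names below are hypothetical, and the behaviour described is that of the default "C" locale.

```c
#include <ctype.h>

/* The <ctype.h> functions take an int whose value must be representable
   as unsigned char (or be EOF), hence the casts in both wrappers. */

/* Hypothetical wrapper: 1 if c is a decimal digit, 0 otherwise. */
static int is_digit_char(char c)
{
    return isdigit((unsigned char)c) != 0;
}

/* Hypothetical wrapper: uppercase c if it is a lowercase letter,
   otherwise return it unchanged. */
static char upper_char(char c)
{
    return (char)toupper((unsigned char)c);
}
```

So converting an address to one case before comparing it leaves the '@' and '.' characters alone.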
 
Bartc

santosh said:
fgets could fail due to an I/O error too, not necessarily end-of-file.

OK. What happens to the buffer in that case, would it be an empty string?
And would feof() ever become true?
if (line[n-1]='""') {line[n-1]=0; --n;};

Also what do you mean by '""' here? Did you mean to write '"'?

I can't actually see clearly, but yes it should be a single " inside single
quotes. Although two double quotes surprisingly still works; I would have
expected in this case to compare a char widened to whatever size '""' was,
not for the '""' to be narrowed to char.
Also according to the relevant RFCs (2821 & 2822), just the domain part
of an email address can be up to 255 characters. To be safe you might
want to make line at least 512 bytes.

OK I was guessing that. Although someone with a 511-char email won't be very
popular with his friends. And he wouldn't be getting any spam via this
program either...
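
A possible explanation for why the '""' version "still works": the posted test is written with = rather than ==, so nothing is compared. The constant is assigned to line[n-1] (narrowed to char in an implementation-defined way), the result is non-zero, and the last character is then removed unconditionally. The sketch below shows the difference using a plain '"' so that it does not depend on the value of a multi-character constant; both function names are hypothetical.

```c
#include <string.h>

/* Buggy form, as in the posted test: '=' stores '"' in the last character,
   so the condition is always true and the last character is always removed,
   whatever it was. */
static size_t strip_last_buggy(char *s)
{
    size_t n = strlen(s);

    if (n > 0 && (s[n - 1] = '"')) {
        s[n - 1] = '\0';
        --n;
    }
    return n;
}

/* Intended form: '==' removes the last character only if it is a '"'. */
static size_t strip_last_fixed(char *s)
{
    size_t n = strlen(s);

    if (n > 0 && s[n - 1] == '"') {
        s[n - 1] = '\0';
        --n;
    }
    return n;
}
```

With input lines that always end in a quote the two behave the same, which would explain why the problem went unnoticed; a line without a closing quote loses its last character in the buggy form.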
 
santosh

Bartc said:
OK. What happens to the buffer in that case, would it be an empty
string?

When a read error occurs (i.e., fgets returns NULL and ferror is true)
then the array contents are indeterminate. In the case of end-of-file
(i.e., fgets returns NULL and feof is true) where no characters were
read, the array contents are left unchanged.
And would feof() ever become true?

Yes, when end-of-file is encountered.

<snip>
 
Dr.Ruud

Martijn Lievaart wrote:
Or even:

perl -nie 's/"//g; print if /\@hotmail.com@$/' text_file

Don't you mean this?

perl -ne 's/"//g; print if /\@hotmail\./' text_file
 
Martijn Lievaart

Martijn Lievaart wrote:

Don't you mean this?

perl -ne 's/"//g; print if /\@hotmail\./' text_file

I think I meant this:

perl -ni -e 's/"//g; print if /\@hotmail.com@$/' text_file

(-i makes a backup, -ie probably takes the 'e' as the backup suffix.)

M4
 
szr

Martijn said:
I think I meant this:

perl -ni -e 's/"//g; print if /\@hotmail.com@$/' text_file

(-i makes a backup, -ie probably takes the 'e' as the backup suffix.)

Maybe I'm missing something, but I don't understand why you have an @
near the end of your regex just before the $ ? I can't find any mention
of it in perldoc or my Perl Pocket Reference, but it's possible I'm
missing something.

Thanks.
 
Martijn Lievaart

Maybe I'm missing something, but I don't understand why you have an @
near the end of your regex just before the $ ? I can't find any mention
of it in perldoc or my Perl Pocket Reference, but it's possible I'm
missing something.

You're missing nothing. I'm blind.

M4
 
Henry Law

Tomás Ó hÉilidhe said:
char const *const original = "\"(e-mail address removed)\"";

Please learn to look at the groups to which an article is posted and
strip out those for which your reply is not relevant. In this case
comp.lang.perl.misc.
 
Keith Thompson

On 19 June, 07:41, (e-mail address removed) wrote: [...]
@ does not belong to C's basic character set, so, that's not possible.

What does this mean? That you can't use '@' in strings without
relying on the particular implementation? so much for portability

It means that it's possible to have a conforming C implementation on a
system where the '@' character is not supported. Likewise for '$' and
'`' (backtick); those happen to be the only three ASCII printable
characters that the C standard doesn't require.

As it happens, the vast majority of character sets in current use are
based on ASCII, and most non-ASCII systems with C implementations use
some variant of EBCDIC. And, as it happens, both ASCII and EBCDIC can
represent '@', '$', and '`'. So, although the C standard doesn't
*require* all systems to support these characters, you're not likely
to run across an implementation that doesn't support them.

(IMHO it wouldn't hurt for a future revision of the C standard to
require support for these three characters, even if they're only
usable in character constants, string literals, comments, and a few
other similar contexts.)

So vippstar's statement that "that's not possible" is a bit
over-stated. You can't use the '@' character in completely 100%
theoretically portable strictly conforming C code. But you can almost
certainly use it safely if you don't mind the *theoretical* loss of
portability that's unlikely to be an issue in real life.

Unless (a) there's some system out there that I've never heard of that
doesn't support the '@' character, and (b) you want to search for
hotmail.com addresses on such a system.
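
The claim that '@', '$' and '`' are the only three printable ASCII characters missing from the required set can be checked mechanically. The sketch below hard-codes the letters, digits, 29 graphic characters and space from 5.2.1(3) and counts which codes in the range 0x20 to 0x7E are absent. The names are hypothetical, and the count is only meaningful on an implementation whose execution character set is ASCII.

```c
#include <string.h>

/* Graphic members of the basic character set per 5.2.1(3), plus space. */
static const char basic_graphic[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789"
    "!\"#%&'()*+,-./:;<=>?[\\]^_{|}~"
    " ";

/* Hypothetical helper: 1 if c is one of those characters, 0 otherwise. */
static int in_basic_set(int c)
{
    /* strchr would also match the terminator, so exclude 0 explicitly */
    return c != '\0' && strchr(basic_graphic, c) != NULL;
}

/* Hypothetical helper: count the codes in the ASCII printable range
   0x20..0x7E that are not in the basic set. Assumes ASCII. */
static int count_missing_ascii(void)
{
    int c;
    int missing = 0;

    for (c = 0x20; c <= 0x7E; c++) {
        if (!in_basic_set(c))
            missing++;
    }
    return missing;
}
```

On an ASCII system count_missing_ascii() comes out as 3, and in_basic_set() is false for exactly '$', '@' and '`' within that range.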
 
Walter Roberson

Keith Thompson said:
As it happens, the vast majority of character sets in current use are
based on ASCII, and most non-ASCII systems with C implementations use
some variant of EBCDIC. And, as it happens, both ASCII and EBCDIC can
represent '@', '$', and '`'.

'@' is apparently in the invariant code-sets (according
to Wikipedia) but '$' and '`' are not, at least according to the
shading of the table (which might be wrong).
http://en.wikipedia.org/wiki/EBCDIC

I've looked through some of the IBM EBCDIC code pages but did not
happen upon any that were missing the '`' (though it was not in
the same place in all of the ones I looked at.) The EBCDIC code
page list, in case someone is sufficiently bored to look, is at
http://www-306.ibm.com/software/globalization/cp/cp_es.jsp#EBCDIC
 
Antoninus Twink

You can't use the '@' character in completely 100% theoretically
portable strictly conforming C code. But you can almost certainly use
it safely if you don't mind the *theoretical* loss of portability
that's unlikely to be an issue in real life.

Worrying about real life! Whatever next? I sense that KT is almost one
of us.
 
Kenny McCormack

Worrying about real life! Whatever next? I sense that KT is almost one
of us.

I *was* absolutely shocked to see KT mention real-world considerations.

What's going on here? Something we should know about? Death in the
family? What?
 
Keith Thompson

'@' is apparently in the invariant code-sets (according
to Wikipedia) but '$' and '`' are not, at least according to the
shading of the table (which might be wrong).
http://en.wikipedia.org/wiki/EBCDIC

I've looked through some of the IBM EBCDIC code pages but did not
happen upon any that were missing the '`' (though it was not in
the same place in all of the ones I looked at.) The EBCDIC code
page list, in case someone is sufficiently bored to look, is at
http://www-306.ibm.com/software/globalization/cp/cp_es.jsp#EBCDIC

I took a cursory look at the EBCDIC table in the Wikipedia article and
saw all three characters. I didn't consider variations or invariant
code-sets, and I'm no expert on EBCDIC, so feel free to take what I
wrote with a grain of salt.

[This is partly a test of a new news server; I had problems with
aioe.org, so I'm trying motzarella.org.]
 
