Please help! More string manipulation questions in C++

Hp

Hi All,
Thanks a lot for all your replies.

My requirement is as follows:
I need to read a text file, eliminate certain special characters (like !
, - = + ), convert it to lower case, and then remove certain
stopwords (like and, a, an, by, the, etc.) which are listed in another
text file.
Then I need to run it through a stemmer (a program which converts words
like "running" to "run", i.e. reduces them to their root words).
Then I need to create a term-by-document matrix, which is a
matrix where M(i,j) gives the number of times term j occurs
in document i.

My situation as of now is as below:
I have read the file contents into a string variable, removed/replaced
the special characters with a space using the replace function, and
then converted the string completely to lower case, using the transform
function.

I would really appreciate any help. Thanks in advance.

Thanks,
Hp
 
Bob Hairgrove

Is this homework??? Sure sounds like it.

If not, why do you have to use C++ at all? Perl or awk, using regular
expressions, is probably much easier for something like this.

At any rate, your question has to do with algorithms, not with the
language itself. Therefore, it is off-topic in this NG.
 
Hp

It is a project where I'm stuck at a particular point and I don't know
how to proceed. I know the algorithm; it's just the implementation that
I can't get, and hence it deserves a post in the C++ newsgroups.
Hey Bob, I would appreciate a solution to my question and can do
without unnecessary comments!
 
int2str

Hp said:
It is a project where I'm stuck at a particular point and I don't know
how to proceed. I know the algorithm; it's just the implementation that
I can't get, and hence it deserves a post in the C++ newsgroups.
Hey Bob, I would appreciate a solution to my question and can do
without unnecessary comments!

Why don't you show some code?
With none of your "project" problems have you shown any code.

Do something! Get stuck, then ask questions!

The comments you get are not unnecessary. You are on a C++ _language_
newsgroup. Figure something out. Post again when you have _specific_
problems with a language construct and not a "write my program for me"
request!

Cheers,
Andre
 
Hp

I am sorry not to have posted my code; I apologize for that.

Here is the code:
-----------------------------------------------------------------
#include <iostream>
#include <string>
#include <fstream>
#include <vector>
#include <set>
#include <algorithm>
#include <cctype>

using namespace std;
using std::string;

int main(int argc, char *argv[])
{
using std::cout;
using std::endl;


int var_len;

FILE *fp;
long len;
char *buf;
fp=fopen("01t.txt","rb");
fseek(fp,0,SEEK_END);
len=ftell(fp);
fseek(fp,0,SEEK_SET);
buf=(char *)malloc(len);
fread(buf,len,1,fp);
fclose(fp);
string file;
file=buf;

cout<< file <<endl;

vector<string> files;
vector<string> punct;//Vector of strings to remove the punctuations from each files
cout<<"This is a sample program"<<endl;
punct.push_back(",");punct.push_back(":");punct.push_back(";");
punct.push_back("'");
punct.push_back("'");punct.push_back("=");punct.push_back("-");
punct.push_back(".");punct.push_back(",");punct.push_back(",");

for (int i=0;i<punct.size();i++)
{
cout<<punct.at(i)<<endl;
}

std::replace(file.begin(),file.end(),',','');
std::replace(file.begin(),file.end(),';',' ');
std::replace(file.begin(),file.end(),':','');
std::replace(file.begin(),file.end(),'-',' ');
std::replace(file.begin(),file.end(),'=','');
std::replace(file.begin(),file.end(),'+',' ');
std::replace(file.begin(),file.end(),')','');
std::replace(file.begin(),file.end(),'(',' ');
std::replace(file.begin(),file.end(),'&','');
std::replace(file.begin(),file.end(),'!',' ');
std::replace(file.begin(),file.end(),'.','');
std::replace(file.begin(),file.end(),'/',' ');
//Removing single and double quotes
std::replace(file.begin(),file.end(),'\'','');
std::replace(file.begin(),file.end(),'\"',' ');

std::transform(file.begin(),file.end(),file.begin(),tolower);



/*if((pos=file.find(remword,0))!=string::npos)
{
file.erase(pos,remword.length());
}
cout << "After removing 'the'" <<endl;
*/

}
-----------------------------------------------------------------------------------
 
int2str

Hp said:
I am sorry not to have posted my code; I apologize for that.

Here is the code:

_Compiling_ code would be nice, too...
using namespace std;
using std::string;

This is redundant. If you include the full namespace (std), you don't
need to list the individual ones. Pick one.
int var_len;
Unused?


FILE *fp;
long len;
char *buf;
fp=fopen("01t.txt","rb");
fseek(fp,0,SEEK_END);
len=ftell(fp);
fseek(fp,0,SEEK_SET);
buf=(char *)malloc(len);
fread(buf,len,1,fp);
fclose(fp);
string file;
file=buf;

This code is pretty much unreadable. You should not mix variable
declaration with code to read in a file like that. Some error checking
would be useful as well.
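
For the file-reading part, here is a minimal sketch with error checking that uses std::ifstream instead of fopen()/malloc(). The helper name readFile is made up for illustration and is not from the posted code. Note that the posted version also assigns a malloc'ed buffer to a string without a terminating '\0', which this approach sidesteps.

```cpp
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: read a whole file into a std::string.
// Throws std::runtime_error if the file cannot be opened.
std::string readFile(const std::string& path)
{
    std::ifstream in(path.c_str(), std::ios::binary);
    if (!in)
        throw std::runtime_error("cannot open " + path);

    std::ostringstream contents;
    contents << in.rdbuf();   // copy everything the stream can deliver
    return contents.str();
}
```

The stream closes itself when it goes out of scope, so there is no fclose() or free() to forget.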

fopen() feels very "C". You could use a more C++ approach here, like
"ifstream".
vector<string> files;
Unused?

vector<string> punct;//Vector of strings to remove the punctuations from each files

Looks like you fill this vector but then decided to replace them all
manually anyway?

It may be simpler (if you don't want to use boost::regex) to put all the
unwanted characters into a simple string (not a vector) and iterate
over that.
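
A rough sketch of that idea, using std::replace_if so the text is walked only once. The names kUnwanted, isUnwanted and blankOutUnwanted are invented for this example, and the character list is just the set taken from the posted replace() calls.

```cpp
#include <algorithm>
#include <string>

// Assumed list of characters to blank out; adjust as needed.
const std::string kUnwanted = ",;:-=+()&!./'\"";

// True if c is one of the unwanted characters.
bool isUnwanted(char c)
{
    return kUnwanted.find(c) != std::string::npos;
}

// Replace every unwanted character in text with a single space.
void blankOutUnwanted(std::string& text)
{
    std::replace_if(text.begin(), text.end(), isUnwanted, ' ');
}
```

Adding or removing a character is then a one-character change to the list instead of another replace() line.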
std::replace(file.begin(),file.end(),',','');

You can't replace with a non-character...
std::transform(file.begin(),file.end(),file.begin(),tolower);

"tolower" is unfortunately ambiguous. You'll have to cast it like this:


std::transform(file.begin(),file.end(),file.begin(),(int(*)(int))std::tolower);
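
The ambiguity comes from the fact that <locale> declares a second, templated std::tolower that takes a locale argument, so the bare name may not resolve to a single function. As an alternative to the cast, and assuming a C++11 compiler, the call can be wrapped in a lambda. The function name toLowerInPlace is made up for this sketch.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Lower-case a string in place. Going through unsigned char avoids
// undefined behaviour when plain char is signed and holds a negative value.
void toLowerInPlace(std::string& text)
{
    std::transform(text.begin(), text.end(), text.begin(),
                   [](unsigned char c) {
                       return static_cast<char>(std::tolower(c));
                   });
}
```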
/*if((pos=file.find(remword,0))!=string::npos)
{
file.erase(pos,remword.length());
}
*/

You'll need a loop here. A single if won't do.
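
Something along these lines, as a sketch only (eraseAll is a made-up name). Be aware that it matches plain substrings, so it will also cut "the" out of a longer word such as "there".

```cpp
#include <string>

// Erase every occurrence of word from text (substring match).
void eraseAll(std::string& text, const std::string& word)
{
    if (word.empty())
        return;                       // nothing sensible to search for

    std::string::size_type pos = 0;
    while ((pos = text.find(word, pos)) != std::string::npos)
        text.erase(pos, word.length());   // keep pos: text has shifted left
}
```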

Cheers,
Andre
 
Hp

_Compiling_ code would be nice, too...


This is redundant. If you include the full namespace (std), you don't
need to list the individual ones. Pick one.


Unused?
I had declared it for future use.
This code is pretty much unreadable. You should not mix variable
declaration with code to read in a file like that. Some error checking
would be useful as well.

fopen() feels very "C". You could use a more C++ approach here, like
"ifstream".


Unused?
I had used this vector to hold a set of files, reading each file into a
string, which gives me a vector of strings (one per file) that I need to
read and modify.
Looks like you fill this vector but then decided to replace them all
manually anyway?

It may be simpler (if you don't want to use boost::regex) to put all the
unwanted characters into a simple string (not a vector) and iterate
over that.
Thank you, I think I will do this.
You can't replace with a non-character...
This is a typo; I have it replaced with a space, which got lost
while cutting and pasting.

"tolower" is unfortunately ambiguous. You'll have to cast it like this:


std::transform(file.begin(),file.end(),file.begin(),(int(*)(int))std::tolower);
Ironically, the code I have written works :).
You'll need a loop here. A single if won't do.
The above piece of code doesn't work. I had initialized remword = "the",
but it was removing 'the' from 'there' too, which I don't want. Also, I
want all the occurrences of it to be removed, which I can achieve
through a loop.
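
One possible way to avoid clipping 'the' out of 'there' is to compare whole whitespace-separated words instead of searching for a substring. The sketch below assumes the text has already been lower-cased and had its punctuation replaced by spaces; removeWord is an invented name. It also collapses runs of whitespace to single spaces as a side effect.

```cpp
#include <sstream>
#include <string>

// Return a copy of text with every whole-word occurrence of word removed.
std::string removeWord(const std::string& text, const std::string& word)
{
    std::istringstream in(text);
    std::string token;
    std::string result;

    while (in >> token) {             // operator>> splits on whitespace
        if (token == word)
            continue;                 // drop exact matches only
        if (!result.empty())
            result += ' ';
        result += token;
    }
    return result;
}
```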
 
int2str

Hp said:
[snipped posted code]
The above piece of code doesn't work.

Alright, even if I am running pretty high danger of doing your
homework, I'll post my version of the program which will read in a file
and remove the stopwords.

The program reads only one file in though and doesn't build the
document/term matrix for you - that's still up to you.

Please try to understand the code and discuss as necessary to help you
learn something from it.

Here ya go:

#include <iostream>
#include <ostream>
#include <fstream>
#include <sstream>
#include <algorithm>
#include <cctype>
#include <string>
#include <map>

using namespace std;

// Characters that get replaced by a space before tokenizing.
const string InvalidChars = ",.!?;:=()+-\'\"&";

// Map unwanted characters to a space and everything else to lower case.
char sanitizeChar( const char & c )
{
    for( string::const_iterator inv=InvalidChars.begin();
         inv!=InvalidChars.end(); ++inv)
    {
        if ( *inv == c )
            return ' ';
    }

    // Cast first: tolower() is undefined for negative char values.
    return static_cast<char>( tolower( static_cast<unsigned char>( c ) ) );
}

int main()
{
    ifstream ff_swords( "stopwords.txt" );
    ifstream ff_text( "test.txt" );

    // TODO: Check if files are open here....

    map<string,char> stopwords;

    string token;

    // One stopword per whitespace-separated token.
    while( ff_swords >> token )
        stopwords[ token ] = 1;

    while( ff_text >> token )
    {
        transform( token.begin(), token.end(), token.begin(),
                   sanitizeChar );

        // Sanitizing may have split the token into several words.
        istringstream ss( token );
        while( ss >> token )
        {
            if ( stopwords.find( token ) != stopwords.end() )
                continue;

            // TODO: Run token through stemmer here.

            // TODO: Add stemmed token to your custom matrix now...

            cout << token << endl; // <-- Debug
        }
    }
}
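
For the matrix TODO, here is one possible shape, purely as a sketch: the struct TermDocumentMatrix and its members are invented for this example and are not part of the program above. It keeps one row per document and assigns each new term the next free column, so count(i, term) plays the role of M(i,j) from the original question. It assumes a C++11 compiler for auto.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical term-by-document matrix: rows[i][j] is the number of
// times the term with column index j occurs in document i.
struct TermDocumentMatrix {
    std::map<std::string, std::size_t> termIndex;  // term -> column j
    std::vector<std::vector<int> > rows;           // one row per document

    // Record one occurrence of term in document doc.
    void add(std::size_t doc, const std::string& term)
    {
        std::size_t col;
        auto it = termIndex.find(term);
        if (it == termIndex.end()) {
            col = termIndex.size();        // next unused column
            termIndex[term] = col;
        } else {
            col = it->second;
        }
        if (rows.size() <= doc)
            rows.resize(doc + 1);
        if (rows[doc].size() <= col)
            rows[doc].resize(col + 1, 0);  // rows grow lazily
        ++rows[doc][col];
    }

    // M(doc, term); unseen terms or documents count as 0.
    int count(std::size_t doc, const std::string& term) const
    {
        auto it = termIndex.find(term);
        if (it == termIndex.end() || doc >= rows.size())
            return 0;
        std::size_t col = it->second;
        return col < rows[doc].size() ? rows[doc][col] : 0;
    }
};
```

In the loop above, the matrix TODO would then become a call like matrix.add(docNumber, token), where docNumber is whatever index the surrounding code gives the current file.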
 
Karl Heinz Buchegger

Hp said:
I am sorry not to have posted my code; I apologize for that.

Here is the code:
-----------------------------------------------------------------
#include <iostream>
#include <string>
#include <fstream>
#include <vector>
#include <set>
#include <algorithm>
#include <cctype>

using namespace std;
using std::string;

int main(int argc, char *argv[])
{
using std::cout;
using std::endl;

int var_len;

FILE *fp;
long len;
char *buf;
fp=fopen("01t.txt","rb");
fseek(fp,0,SEEK_END);
len=ftell(fp);
fseek(fp,0,SEEK_SET);
buf=(char *)malloc(len);
fread(buf,len,1,fp);
fclose(fp);
string file;
file=buf;

cout<< file <<endl;

Your problem gets much easier, if you don't do this:
read the entire file into one single string variable.

Why don't you break the input stream into individual words
right at the input stage?

ifstream Input( "01t.txt" );
if( !Input ) {
// bla, bla, bla, error opening file, etc
return EXIT_FAILURE;
}

string Word;
vector< string > Words;

while( Input >> Word )
Words.push_back( Word );


// now you have a vector of words. It is easy to manipulate
// each one of them, eg. discard special characters, transform
// every one of the words to lowercase, and of course, discard
// words which are listed in a second vector or map
 
