Please Help!!more string manipulation Qs...in C++

Discussion in 'C++' started by Hp, Oct 25, 2005.

  1. Hp

    Hp Guest

    Hi All,
    Thanks a lot for all your replies.

    My requirement is as follows:
    I need to read a text file, eliminate certain special characters (like !
    , - = +), convert it to lower case, and then remove certain stopwords
    (like and, a, an, by, the, etc.), which are listed in another text
    file.
    Then I need to run it through a stemmer (a program which converts words
    like "running" to "run", i.e., reduces them to their root words).
    Then I need to create a term-by-document matrix, in which M(i,j) gives
    the number of times term j occurs in document i.

    My situation as of now is as below:
    I have read the file contents into a string variable, replaced the
    special characters with a space using the replace function, and
    then converted the string completely to lower case using the transform
    function.

    I would really appreciate any help. Thanks in advance.

    Thanks,
    Hp
    Hp, Oct 25, 2005
    #1

  2. [OT] Please Help!!more string manipulation Qs...in C++

    On 24 Oct 2005 19:45:33 -0700, "Hp" <>
    wrote:

    >Hi All,
    >Thanks a lot for all your replies.
    >
    >My requirement is as follows:
    >I need to read a text file, eliminate certain special characters(like !
    >, - = + ), and then convert it to lower case and then remove certain
    >stopwords(like and, a, an, by, the etc) which is there in another txt
    >file.
    >Then, i need to run it thru a stemmer(a program which converts words
    >like running to run, ie, converts them to roots words).
    >Then i need to create a term-by-document matrix, which would be a
    >matrix, where in M(i,j) will give the number of times the term j occurs
    >in the document i.
    >
    >My situation as of now is as below:
    >I have read the file contents into a string variable, removed/replaced
    >the special characters with a space using the replace function, and
    >then converted the string completely to lower case, using the transform
    >function.
    >
    >I would really appreciate .any help, thanks i advance.
    >
    >Thanks,
    >Hp


    Is this homework??? Sure sounds like it.

    If not, why do you have to use C++ at all? Perl or awk, using regular
    expressions, is probably much easier for something like this.

    At any rate, your question has to do with algorithms, not with the
    language itself. Therefore, it is off-topic in this NG.

    --
    Bob Hairgrove
    Bob Hairgrove, Oct 25, 2005
    #2

  3. Hp

    Hp Guest

    It is a project where I am stuck at a particular point and I don't know
    how to proceed. I know the algorithm; it's just the implementation that
    I can't get, and hence it deserves a post in the C++ newsgroups.
    Hey Bob, I would appreciate a solution to my question and can do
    without unnecessary comments!
    Hp, Oct 25, 2005
    #3
  4. Hp

    Guest

    Hp wrote:
    > It is a project, where i m stuck at a particular point and i dont know
    > how to proceed. I know the algorithm, its just the implementation that
    > i cant get, and hence forth it deseves a post in the c++ newsgroups.
    > Hey bob, I would appreciate a solution to my question and can do
    > without unnecessary comments!


    Why don't you show some code?
    With none of your "project" problems have you shown any code.

    Do something! Get stuck, then ask questions!

    The comments you get are not unnecessary. You are on a C++ _language_
    newsgroup. Figure something out. Post again when you have _specific_
    problems with a language construct and not a "write my program for me"
    request!

    Cheers,
    Andre
    , Oct 25, 2005
    #4
  5. Hp

    Hp Guest

    I am sorry not to have posted my code; I apologize for that.

    Here is the code:
    -----------------------------------------------------------------
    #include <iostream>
    #include <string>
    #include <fstream>
    #include <vector>
    #include <set>
    #include <algorithm>
    #include <cctype>

    using namespace std;
    using std::string;

    int main(int argc, char *argv[])
    {
    using std::cout;
    using std::endl;


    int var_len;

    FILE *fp;
    long len;
    char *buf;
    fp=fopen("01t.txt","rb");
    fseek(fp,0,SEEK_END);
    len=ftell(fp);
    fseek(fp,0,SEEK_SET);
    buf=(char *)malloc(len);
    fread(buf,len,1,fp);
    fclose(fp);
    string file;
    file=buf;

    cout<< file <<endl;

    vector<string> files;
    vector<string> punct; // Vector of strings to remove the punctuation from each file
    cout<<"This is a sample program"<<endl;
    punct.push_back(",");punct.push_back(":");punct.push_back(";");
    punct.push_back("'");
    punct.push_back("'");punct.push_back("=");punct.push_back("-");
    punct.push_back(".");punct.push_back(",");punct.push_back(",");

    for (int i=0;i<punct.size();i++)
    {
    cout<<punct.at(i)<<endl;
    }

    std::replace(file.begin(),file.end(),',','');
    std::replace(file.begin(),file.end(),';',' ');
    std::replace(file.begin(),file.end(),':','');
    std::replace(file.begin(),file.end(),'-',' ');
    std::replace(file.begin(),file.end(),'=','');
    std::replace(file.begin(),file.end(),'+',' ');
    std::replace(file.begin(),file.end(),')','');
    std::replace(file.begin(),file.end(),'(',' ');
    std::replace(file.begin(),file.end(),'&','');
    std::replace(file.begin(),file.end(),'!',' ');
    std::replace(file.begin(),file.end(),'.','');
    std::replace(file.begin(),file.end(),'/',' ');
    //Removing single and double quotes
    std::replace(file.begin(),file.end(),'\'','');
    std::replace(file.begin(),file.end(),'\"',' ');

    std::transform(file.begin(),file.end(),file.begin(),tolower);



    /*if((pos=file.find(remword,0))!=string::npos)
    {
    file.erase(pos,remword.length());
    }
    cout << "After removing 'the'" <<endl;
    */

    }
    -----------------------------------------------------------------------------------
    Hp, Oct 25, 2005
    #5
  6. Hp

    Guest

    Hp wrote:
    > I am sorry not to have posted my code, i apologize for the that.
    >
    > Here is the code:


    _Compiling_ code would be nice, too...

    > using namespace std;
    > using std::string;


    This is redundant. If you include the full namespace (std), you don't
    need to list the individual ones. Pick one.

    > int var_len;


    Unused?

    >
    > FILE *fp;
    > long len;
    > char *buf;
    > fp=fopen("01t.txt","rb");
    > fseek(fp,0,SEEK_END);
    > len=ftell(fp);
    > fseek(fp,0,SEEK_SET);
    > buf=(char *)malloc(len);
    > fread(buf,len,1,fp);
    > fclose(fp);
    > string file;
    > file=buf;


    This code is pretty much unreadable. You should not mix variable
    declaration with code to read in a file like that. Some error checking
    would be useful as well.
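
    One way to get that error checking is sketched below, using
    std::ifstream rather than fopen(). Note that the quoted code assigns
    a malloc'ed buffer to a std::string without a terminating '\0', which
    fread() does not add; reading through a stream avoids the issue
    altogether. The helper name readFile is an assumption made for this
    illustration, not something from the posted program.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Reads the whole file at `path` into `out`.
// Returns false if the file cannot be opened (hypothetical helper name).
bool readFile(const std::string& path, std::string& out)
{
    std::ifstream in(path.c_str(), std::ios::binary);
    if (!in)
        return false;               // the error check the original lacks

    std::ostringstream buffer;
    buffer << in.rdbuf();           // copy the entire stream in one go
    out = buffer.str();
    return true;
}
```

    A caller would test the return value before touching the string,
    e.g. if (!readFile("01t.txt", file)) return EXIT_FAILURE;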

    fopen() feels very "C". You could use a more C++ approach here, like
    "ifstream".

    > vector<string> files;


    Unused?

    > vector<string> punct;//Vector of strings to remove the punctuations
    > from each files


    Looks like you fill this vector but then decided to replace them all
    manually anyway?

    It may be simpler (if you don't want to use boost::regex) to put all the
    unwanted characters into a simple string (not a vector) and iterate
    over that.
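
    A minimal sketch of that idea, assuming the unwanted characters are
    kept in one std::string and each hit is overwritten with a space.
    The function name blankOut is hypothetical.

```cpp
#include <string>

// Replaces every character of `text` that appears in `unwanted` with a space.
void blankOut(std::string& text, const std::string& unwanted)
{
    std::string::size_type pos = text.find_first_of(unwanted);
    while (pos != std::string::npos) {
        text[pos] = ' ';
        pos = text.find_first_of(unwanted, pos + 1);   // continue after the hit
    }
}
```

    This turns the fourteen std::replace calls into a single call such
    as blankOut(file, ",;:-=+()&!./'\"");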

    > std::replace(file.begin(),file.end(),',','');


    You can't replace with a non-character...

    > std::transform(file.begin(),file.end(),file.begin(),tolower);


    "tolower" is unfortunately ambiguous. You'll have to cast it like this:


    std::transform(file.begin(),file.end(),file.begin(),(int(*)(int))std::tolower);
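
    Whether the bare tolower compiles tends to depend on which headers
    get pulled in: <cctype> declares int tolower(int), while <locale>
    declares a two-argument template of the same name, so the compiler
    may be unable to deduce which one std::transform should receive. An
    alternative to the cast is sketched here, assuming a small wrapper
    function is acceptable; toLowerChar and makeLower are hypothetical
    names.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Wrapper with exactly one signature, so there is nothing to disambiguate.
// The cast to unsigned char avoids undefined behaviour for negative char values.
char toLowerChar(char c)
{
    return static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
}

// Lowercases `s` in place.
void makeLower(std::string& s)
{
    std::transform(s.begin(), s.end(), s.begin(), toLowerChar);
}
```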

    > /*if((pos=file.find(remword,0))!=string::npos)
    > {
    > file.erase(pos,remword.length());
    > }
    > */


    You'll need a loop here. A single if won't do.

    Cheers,
    Andre
    , Oct 25, 2005
    #6
  7. Hp

    Hp Guest

    wrote:
    > Hp wrote:
    > > I am sorry not to have posted my code, i apologize for the that.
    > >
    > > Here is the code:

    >
    > _Compiling_ code would be nice, too...
    >
    > > using namespace std;
    > > using std::string;

    >
    > This is redundant. If you include the full namespace (std), you don't
    > need to list the individual ones. Pick one.
    >
    > > int var_len;

    >
    > Unused?

    I had declared it for future use.

    >
    > >
    > > FILE *fp;
    > > long len;
    > > char *buf;
    > > fp=fopen("01t.txt","rb");
    > > fseek(fp,0,SEEK_END);
    > > len=ftell(fp);
    > > fseek(fp,0,SEEK_SET);
    > > buf=(char *)malloc(len);
    > > fread(buf,len,1,fp);
    > > fclose(fp);
    > > string file;
    > > file=buf;

    >
    > This code is pretty much unreadable. You should not mix variable
    > declaration with code to read in a file like that. Some error checking
    > would be useful as well.
    >
    > fopen() feels very "C". You could use a more C++ approach here, like
    > "ifstream".
    >
    > > vector<string> files;

    >
    > Unused?

    I had used this vector to read a set of files, reading each file into a
    string, which gives me a vector of strings holding the files that I need
    to read and modify.
    >
    > > vector<string> punct;//Vector of strings to remove the punctuations
    > > from each files

    >
    > Looks like you fill this vector but then decided to replace them all
    > manually anyway?
    >
    > It may be simpler (if you dont want to use boost::regex) to put all the
    > unwanted characters into a simple string (not a vector) and iterate
    > over that.

    Thank you, I think I will do this.
    >
    > > std::replace(file.begin(),file.end(),',','');

    >
    > You can't replace with a non-character...

    This is a typo; I have it replaced with a space, which got lost
    while cutting and pasting.


    >
    > > std::transform(file.begin(),file.end(),file.begin(),tolower);

    >
    > "tolower" is unfortunately amgigious. You'll have to cast it like this:
    >
    >
    > std::transform(file.begin(),file.end(),file.begin(),(int(*)(int))std::tolower);
    >

    Ironically, the code I have written works :).

    > > /*if((pos=file.find(remword,0))!=string::npos)
    > > {
    > > file.erase(pos,remword.length());
    > > }
    > > */

    >
    > You'll need a loop here. A single if won't do.

    The above piece of code doesn't work. I had initialized remword = "the",
    but it was removing "the" from "there" too, which I don't want. Also, I
    want all the occurrences of it to be removed, which I can achieve
    through a loop.
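
    A minimal sketch of such a loop, assuming a "word" is a maximal run
    of letters, so a match only counts when the characters on both sides
    are non-letters or the ends of the string. The names removeWord and
    isBoundary are hypothetical.

```cpp
#include <cctype>
#include <string>

// True if position `i` is past the end of `s` or holds a non-letter.
static bool isBoundary(const std::string& s, std::string::size_type i)
{
    return i >= s.size() || !std::isalpha(static_cast<unsigned char>(s[i]));
}

// Erases every whole-word occurrence of `word` from `text`.
void removeWord(std::string& text, const std::string& word)
{
    if (word.empty())
        return;

    std::string::size_type pos = 0;
    while ((pos = text.find(word, pos)) != std::string::npos) {
        const bool startOk = (pos == 0) || isBoundary(text, pos - 1);
        const bool endOk   = isBoundary(text, pos + word.size());
        if (startOk && endOk)
            text.erase(pos, word.size());   // keep pos: the text shifted left
        else
            ++pos;                          // part of a longer word, move on
    }
}
```

    With remword = "the", this leaves "there" and "other" untouched.
    The surrounding spaces are not collapsed, which is harmless if the
    text is later split on whitespace.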
    >
    > Cheers,
    > Andre
    Hp, Oct 26, 2005
    #7
  8. Hp

    Guest

    Hp wrote:
    > [snipped posted code]
    > The above piece of code doesnt work.


    Alright, even though I run a pretty high risk of doing your
    homework, I'll post my version of the program, which reads in a file
    and removes the stopwords.

    The program reads only one file in though and doesn't build the
    document/term matrix for you - that's still up to you.

    Please try to understand the code and discuss as necessary to help you
    learn something from it.

    Here ya go:

    #include <iostream>
    #include <ostream>
    #include <fstream>
    #include <sstream>
    #include <algorithm>
    #include <cctype>
    #include <string>
    #include <map>

    using namespace std;

    const string InvalidChars = ",.!?;:=()+-\'\"&";

    // Maps unwanted punctuation to a space and everything else to lower case.
    char sanitizeChar( const char & c )
    {
        for( string::const_iterator inv=InvalidChars.begin();
             inv!=InvalidChars.end(); ++inv)
        {
            if ( *inv == c )
                return ' ';
        }

        // Cast first: passing a negative char value to tolower is undefined.
        return static_cast<char>( tolower( static_cast<unsigned char>( c ) ) );
    }

    int main()
    {
        ifstream ff_swords( "stopwords.txt" );
        ifstream ff_text( "test.txt" );

        // TODO: Check if files are open here....

        // Used as a set: only the keys matter.
        map<string,char> stopwords;

        string token;

        while( ff_swords >> token )
            stopwords[ token ] = 1;

        while( ff_text >> token )
        {
            transform( token.begin(), token.end(), token.begin(),
                       sanitizeChar );

            // Sanitizing may have split the token ("a-b" becomes "a b"),
            // so tokenize it again.
            istringstream ss( token );
            while( ss >> token )
            {
                if ( stopwords.find( token ) != stopwords.end() )
                    continue;

                // TODO: Run token through stemmer here.

                // TODO: Add stemmed token to your custom matrix now...

                cout << token << endl; // <-- Debug
            }
        }
    }
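
    The matrix TODO above could be filled in along these lines. This is
    only a sketch under assumptions: it stores one std::map per document
    rather than a dense matrix, and the class name TermDocumentMatrix
    and its member functions are invented for illustration.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// One row per document: maps a term to how often it occurs in that document.
typedef std::map<std::string, int> TermCounts;

class TermDocumentMatrix   // hypothetical name, not from the thread
{
public:
    // Starts a new document row and returns its index i.
    std::size_t addDocument()
    {
        rows_.push_back(TermCounts());
        return rows_.size() - 1;
    }

    // Records one occurrence of `term` in document `doc`.
    void addTerm(std::size_t doc, const std::string& term)
    {
        ++rows_.at(doc)[term];
    }

    // M(i,j): number of times `term` occurs in document `doc` (0 if never seen).
    int count(std::size_t doc, const std::string& term) const
    {
        const TermCounts& row = rows_.at(doc);
        TermCounts::const_iterator it = row.find(term);
        return it == row.end() ? 0 : it->second;
    }

private:
    std::vector<TermCounts> rows_;
};
```

    In the loop above, the debug output line would then be accompanied
    by a call such as matrix.addTerm(doc, token), with doc obtained from
    addDocument() once per input file.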
    , Oct 26, 2005
    #8
  9. Hp wrote:
    >
    > I am sorry not to have posted my code, i apologize for the that.
    >
    > Here is the code:
    > -----------------------------------------------------------------
    > #include <iostream>
    > #include <string>
    > #include <fstream>
    > #include <vector>
    > #include <set>
    > #include <algorithm>
    > #include <cctype>
    >
    > using namespace std;
    > using std::string;
    >
    > int main(int argc, char *argv[])
    > {
    > using std::cout;
    > using std::endl;
    >
    > int var_len;
    >
    > FILE *fp;
    > long len;
    > char *buf;
    > fp=fopen("01t.txt","rb");
    > fseek(fp,0,SEEK_END);
    > len=ftell(fp);
    > fseek(fp,0,SEEK_SET);
    > buf=(char *)malloc(len);
    > fread(buf,len,1,fp);
    > fclose(fp);
    > string file;
    > file=buf;
    >
    > cout<< file <<endl;


    Your problem gets much easier, if you don't do this:
    read the entire file into one single string variable.

    Why don't you break the input stream into individual words
    right at the input stage?

    ifstream Input( "01t.txt" );
    if( !Input ) {
        // bla, bla, bla, error opening file, etc.
        return EXIT_FAILURE;   // EXIT_FAILURE is declared in <cstdlib>
    }

    string Word;
    vector< string > Words;

    while( Input >> Word )
        Words.push_back( Word );


    // now you have a vector of words. It is easy to manipulate
    // each one of them, eg. discard special characters, transform
    // every one of the words to lowercase, and of course, discard
    // words which are listed in a second vector or map
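
    The per-word processing described in the comment might look like the
    sketch below. It assumes the stopwords are held in a std::set and
    that punctuation is simply dropped from each word (so "a-b" becomes
    "ab", unlike the replace-with-space approach elsewhere in the
    thread). The names cleanWord and cleanAndFilter are hypothetical.

```cpp
#include <cctype>
#include <set>
#include <string>
#include <vector>

// Lowercases `word` and drops every punctuation character from it.
std::string cleanWord(const std::string& word)
{
    std::string out;
    for (std::string::size_type i = 0; i < word.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(word[i]);
        if (!std::ispunct(c))
            out += static_cast<char>(std::tolower(c));
    }
    return out;
}

// Cleans every word and keeps those that are non-empty and not stopwords.
std::vector<std::string> cleanAndFilter(const std::vector<std::string>& words,
                                        const std::set<std::string>& stopwords)
{
    std::vector<std::string> kept;
    for (std::vector<std::string>::size_type i = 0; i < words.size(); ++i) {
        const std::string w = cleanWord(words[i]);
        if (!w.empty() && stopwords.count(w) == 0)
            kept.push_back(w);
    }
    return kept;
}
```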

    --
    Karl Heinz Buchegger
    Karl Heinz Buchegger, Oct 27, 2005
    #9
