Fast way to read a text file line by line

Discussion in 'C++' started by Thomas Kowalski, May 10, 2007.

  1. Hi,
    currently I am reading a huge (about 10-100 MB) text-file line by line
    using
    fstreams and getline. I wonder whether there is a faster way to read a
    file line by line (with std::string line). Is there some way to burst
    read the whole file and later "extract" each line?

    Thanks in advance,
    Thomas Kowalski
     
    Thomas Kowalski, May 10, 2007
    #1

  2. Thomas Kowalski

    kingfox Guest

    On May 10, 7:49 pm, Thomas Kowalski <> wrote:
    > Hi,
    > currently I am reading a huge (about 10-100 MB) text-file line by line
    > using
    > fstreams and getline. I wonder whether there is a faster way to read a
    > file line by line (with std::string line). Is there some way to burst
    > read the whole file and later "extract" each line?
    >
    > Thanks in advance,
    > Thomas Kowalski


    You can open your file in binary mode and use
    ifstream::read(char_type* s, streamsize n) to read the whole file
    into a large memory buffer held in a stringstream object. Then use
    getline() on the stringstream to read the data line by line.
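    For illustration, a minimal sketch of that approach (the file name
    "input.txt" is a placeholder; this slurps the file via rdbuf()
    rather than an explicit read() call):

        #include <fstream>
        #include <sstream>
        #include <string>

        int main()
        {
            std::ifstream file("input.txt", std::ios::binary);
            std::stringstream buffer;
            buffer << file.rdbuf();          // read the whole file in one go

            std::string line;
            while (std::getline(buffer, line))
            {
                // process line
            }
        }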
     
    kingfox, May 10, 2007
    #2

  3. Thomas Kowalski

    Guest

    On May 10, 7:49 am, Thomas Kowalski <> wrote:
    > Hi,
    > currently I am reading a huge (about 10-100 MB) text-file line by line
    > using
    > fstreams and getline. I wonder whether there is a faster way to read a
    > file line by line (with std::string line). Is there some way to burst
    > read the whole file and later "extract" each line?


    If performance really is a problem there are a number of things that
    can be done. However, most are outside the realm of C++.

    Check to see if your string class deallocates when it gets a smaller
    string. If not (likely), preallocate your string to a size larger than
    the longest record you expect, and be sure to use the same string
    object for every read. You don't want repeated allocations and
    deallocations within your string.
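    As a minimal sketch of that idea (the 4096-byte reserve is an assumed
    upper bound on record length, not something from the original post):

        #include <fstream>
        #include <string>

        int main()
        {
            std::ifstream file("input.txt");
            std::string line;
            line.reserve(4096);    // assumed worst-case record length
            while (std::getline(file, line))
            {
                // process line; the capacity is reused between iterations
            }
        }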

    You can always use the read() function to get data and parse through
    it.

    On some systems (e.g. Windoze and VMS) you can map your file to memory.
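    A POSIX-only sketch of that idea (mmap is the Unix counterpart of the
    Windows/VMS facilities mentioned above; error handling kept minimal):

        #include <sys/mman.h>
        #include <sys/stat.h>
        #include <fcntl.h>
        #include <unistd.h>

        int main()
        {
            int fd = open("input.txt", O_RDONLY);
            if (fd < 0) return 1;
            struct stat st;
            if (fstat(fd, &st) < 0) { close(fd); return 1; }
            void* p = mmap(0, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
            if (p == MAP_FAILED) { close(fd); return 1; }
            const char* data = static_cast<const char*>(p);
            // data .. data + st.st_size is the file contents; scan for
            // '\n' to split it into lines without any further copying
            munmap(p, st.st_size);
            close(fd);
        }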
     
    , May 10, 2007
    #3
  4. Thomas Kowalski

    osmium Guest

    <> wrote:

    > On May 10, 7:49 am, Thomas Kowalski <> wrote:
    >> Hi,
    >> currently I am reading a huge (about 10-100 MB) text-file line by line
    >> using
    >> fstreams and getline. I wonder whether there is a faster way to read a
    >> file line by line (with std::string line). Is there some way to burst
    >> read the whole file and later "extract" each line?

    >
    > If performance really is a problem there are a number of things that
    > can be done. However, most are outside the realm of C++.
    >
    > Check to see if your string class deallocates when it gets a smaller
    > string. If not (likey) preallocate your string to a size larger than
    > what you expect the longest record to be and be sure you use the same
    > string object to read. You don't want repeated allocations and
    > deallocations within your string.
    >
    > You can always use the read() function to get data and parse through
    > it.


    read() is going to operate on arbitrary sized "chunks". Be careful to
    be sure that *someone* is handling the seam problems: you have to
    detect and handle the fact that the last fragment of bytes in a chunk
    is probably not a full and complete line.
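    One minimal sketch of handling that seam (the chunk size and file name
    are assumptions, not from the post):

        #include <fstream>
        #include <string>

        int main()
        {
            std::ifstream file("input.txt", std::ios::binary);
            char chunk[4096];
            std::string carry;   // incomplete line left over from the last chunk
            while (file.read(chunk, sizeof chunk) || file.gcount() > 0)
            {
                carry.append(chunk, static_cast<std::size_t>(file.gcount()));
                std::string::size_type pos;
                while ((pos = carry.find('\n')) != std::string::npos)
                {
                    std::string line(carry, 0, pos);
                    // process line
                    carry.erase(0, pos + 1);
                }
            }
            // anything still in carry is a final line with no trailing '\n'
        }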
     
    osmium, May 10, 2007
    #4
  5. > You can open your file with binary mode, and use
    > ifstream::read(char_type* s, streamsize n) read whole file data into a
    > large memory buffer in a stringstream object.


    Sounds good, but how exactly do I read something into a stringstream's
    buffer using an ifstream? Somehow I am drawing a blank there.

    Thanks in advance,
    Thomas Kowalski
     
    Thomas Kowalski, May 18, 2007
    #5
  6. Thomas Kowalski

    James Kanze Guest

    On May 10, 1:49 pm, Thomas Kowalski <> wrote:

    > currently I am reading a huge (about 10-100 MB) text-file line
    > by line using fstreams and getline. I wonder whether there is
    > a faster way to read a file line by line (with std::string
    > line). Is there some way to burst read the whole file and
    > later "extract" each line?


    The fastest solution is probably mmap, or its equivalent under
    Windows, but that's very system dependent. Other than that, you
    can read the file in one go using something like:

        std::ostringstream tmp ;
        tmp << file.rdbuf() ;
        std::string s = tmp.str() ;

    or:

        std::string s( (std::istreambuf_iterator< char >( file )),
                       (std::istreambuf_iterator< char >()) ) ;

    If you can get a good estimate of the size of the file beforehand
    (which again requires system dependent code), then using
    reserve on the string in the second example above could
    significantly improve performance; as might something like:

        std::string s ;
        s.resize( knownFileSize ) ;
        file.read( &s[ 0 ], s.size() ) ;

    The only system I know where it is even possible to get the
    exact size of a text file is Unix, however; under Windows, all
    of the techniques overstate the size somewhat (and under other
    systems, it might not even be possible to get a reasonable
    estimate). So you might want to do something like:
        s.resize( file.gcount() ) ;
    after the above. (It's actually a little bit more complicated.
    If there are not at least s.size() bytes in the file---and under
    Windows, this will usually be the case if the file is opened as
    text, and GetFileSizeEx was used to obtain the size---then
    file.read, above, will appear to fail. In fact, if gcount() is
    greater than 0, it will have successfully read gcount() bytes,
    and if eof() has been set, the read can be considered to have
    successfully read all of the bytes in the file.)

    (Note that this is not guaranteed under the current C++
    standard. It works in practice, however, on all existing
    implementations of the library, and will be guaranteed in the
    next version of the standard.)
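    Putting those pieces together, a hedged sketch of the whole pattern
    (the estimated size is assumed to come from stat()/GetFileSizeEx or
    similar system-dependent code):

        #include <fstream>
        #include <string>

        std::string slurp(std::ifstream& file, std::size_t estimatedSize)
        {
            std::string s ;
            s.resize( estimatedSize ) ;
            file.read( &s[ 0 ], s.size() ) ;
            // read() may report failure here even though it read the
            // whole file; gcount() still says how much actually arrived
            s.resize( static_cast<std::size_t>( file.gcount() ) ) ;
            return s ;
        }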

    Two other things that you might try:

    -- using std::vector< char > instead of std::string---with some
    implementations, it can be faster (especially if you
    construct the string using istreambuf_iterator), and

    -- reading the file as binary, rather than text, and handling
    the different end of line representations manually in your
    own code.

    Concerning the latter, be aware that on some systems, you cannot
    open a file as binary if it was created as text, and vice versa.
    Just ignoring extra '\r' in the text is often sufficient,
    however, for this solution to work adequately under both Unix
    and Windows; ignoring only the '\r' that immediately precedes a
    '\n' is even more correct, but often not worth the extra bother.
    And if the file is read as binary, both stat (under Unix) and
    GetFileSizeEx (under Windows) will return the exact number of
    bytes you can read from it.
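    A minimal sketch of that binary-mode approach (keeping only the
    '\r'-before-'\n' variant; the file name is a placeholder):

        #include <fstream>
        #include <string>

        int main()
        {
            std::ifstream file("input.txt", std::ios::binary);
            std::string line;
            while (std::getline(file, line))       // splits on '\n' only
            {
                if (!line.empty() && line[line.size() - 1] == '\r')
                    line.erase(line.size() - 1);   // drop '\r' before '\n'
                // process line
            }
        }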

    --
    James Kanze (GABI Software) email:
    Conseils en informatique orientée objet/
    Beratung in objektorientierter Datenverarbeitung
    9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34
     
    James Kanze, May 18, 2007
    #6
  7. On May 10, 1:49 pm, Thomas Kowalski <> wrote:
    > Hi,
    > currently I am reading a huge (about 10-100 MB) text-file line by line
    > using
    > fstreams and getline. I wonder whether there is a faster way to read a
    > file line by line (with std::string line). Is there some way to burst
    > read the whole file and later "extract" each line?


    Yes, just write a streambuf class that will do that for you.
    I did the same thing, not for performance reasons but to
    support large files on a 32 bit implementation of streams.

    What I did is something like this:

    // POSIX-specific: open()/read()/lseek()/fstat() and O_LARGEFILE
    #include <fcntl.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <cerrno>
    #include <cstring>
    #include <stdexcept>
    #include <streambuf>
    #include <stdint.h>

    class ImportInBuf : public std::streambuf {
    public:
        explicit ImportInBuf(const char* filename)
            : fd_(open(filename, O_RDONLY | O_LARGEFILE)), pos_(0)
        {
            if (fd_ < 0) throw std::runtime_error(strerror(errno));
            setg(buffer_, buffer_, buffer_);  // empty get area: first access calls underflow()
            struct stat st;
            fstat(fd_, &st);
            fsize_ = st.st_size;
        }
        // you don't have to do it like this if your streams are 64 bit
        void seekg(uint64_t pos)
        {
            lseek(fd_, pos, SEEK_SET);
            pos_ = pos;
            setg(buffer_, buffer_, buffer_);  // discard the buffered data
        }
        uint64_t tellg() const { return pos_; }
        uint64_t size() const { return fsize_; }
        ~ImportInBuf() { close(fd_); }
    private:
        ImportInBuf(const ImportInBuf&);             // non-copyable
        ImportInBuf& operator=(const ImportInBuf&);
        virtual int underflow()
        {
            if (gptr() < egptr())
                return traits_type::to_int_type(*gptr());
            ssize_t n = read(fd_, buffer_, sizeof buffer_);  // refill from file
            if (n > 0)
            {
                pos_ += n;
                setg(buffer_, buffer_, buffer_ + n);
                return traits_type::to_int_type(*gptr());
            }
            pos_ = fsize_;
            return traits_type::eof();
        }
        char buffer_[4096];
        int fd_;
        uint64_t fsize_, pos_;
    };

    Use it like this:
        ImportInBuf buf("sample.txt");
        std::istream is(&buf);

    or derive a class from istream and pass buf in the initializer list.
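    For completeness, a small sketch of the line-by-line loop on top of
    that stream (built on the ImportInBuf class above; nothing here
    beyond standard getline):

        #include <istream>
        #include <string>

        ImportInBuf buf("sample.txt");
        std::istream is(&buf);
        std::string line;
        while (std::getline(is, line))
        {
            // process line
        }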


    Greetings, Branimir.
     
    Branimir Maksimovic, May 18, 2007
    #7
