Best way to parse delimited data from a file.

DeveloperDave

Ok, so I'm sure this is a common task, but I have been unable to find
a good algorithm for it. Currently I just have some horrible-looking
nested 'if' blocks, so hopefully someone can help me out with a more
elegant solution.

I have a large binary file. Essentially the binary file contains 1 or
more paragraphs. Each paragraph starts with a four byte sequence:

0x4F 0x67 0x67 0x53

In my C++ application I open a fstream to the file in binary mode. I
want to iterate through the file, and pass each paragraph into a new
class which will know what to do with it.

I have been using a read() to copy a chunk of data into a buffer, then
iterate through the buffer checking if each character matches 0x4F.
If it does, I then check to see if the next char matches 0x67 and so
on. This gives me four nested if blocks. I also have to deal with
buffering the data.

Can anyone offer a better solution written in native C/C++?

Thanks
 
DeveloperDave

Something like the following?

switch(state)
{
    case Contents:
        if (ch == 0x4F)
            state = MaybeParagraph1;
        break;
    case MaybeParagraph1:
        if (ch == 0x67)
            state = MaybeParagraph2;
        else if (ch != 0x4F)    /* a repeated 0x4F may itself start a marker */
            state = Contents;
        break;
    ... ... ...
    case MaybeParagraph3:
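        if (ch == 0x53)
        {
            /* handle start of new paragraph (process previous paragraph,
               which will be empty at the start of the file) */
        }
        /* leave this state on a match and on a mismatch alike */
        state = (ch == 0x4F) ? MaybeParagraph1 : Contents;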
        break;

}
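
The fragment above can be filled out into a complete scanner. What follows is only a sketch: the class name `MarkerScanner`, its `feed()` and `starts()` members, and the choice to merely record marker offsets are assumptions made for illustration, not something from the thread. Because the match state lives in the object, it survives between calls, so bytes can be fed from chunks of any size.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: records the offset at which each
// 0x4F 0x67 0x67 0x53 marker starts. State persists between feed() calls.
class MarkerScanner {
public:
    // Feed one byte; returns true when this byte completes a marker.
    bool feed(unsigned char ch) {
        static const unsigned char marker[4] = {0x4F, 0x67, 0x67, 0x53};
        bool found = false;
        if (ch == marker[matched_]) {
            if (++matched_ == 4) {
                starts_.push_back(pos_ - 3);  // offset of the marker's first byte
                matched_ = 0;
                found = true;
            }
        } else {
            // Mismatch: 0x4F occurs only at the front of the marker, so the
            // only partial match that can survive is a fresh leading 0x4F.
            matched_ = (ch == marker[0]) ? 1 : 0;
        }
        ++pos_;
        return found;
    }

    const std::vector<std::size_t>& starts() const { return starts_; }

private:
    std::size_t matched_ = 0;   // marker bytes matched so far
    std::size_t pos_ = 0;       // offset of the next byte to be fed
    std::vector<std::size_t> starts_;
};
```

A caller would loop over each buffer it reads and call `feed()` on every byte; when `feed()` returns true, the previous paragraph is complete.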

Ah, so like a state machine (doh, I should have remembered that). I
suppose the other half of the question is: what is the best way to get
the data out of the stream? Should I copy it into a buffer and
iterate through it, or should I be using something like seekg/peek to
inspect the data on the stream, and then read off the entire
paragraph once I know the start/end positions?

Cheers
 
Stefan Ram

DeveloperDave said:
0x4F 0x67 0x67 0x53
I have been using a read() to copy a chunk of data into a buffer, then
iterate through the buffer checking if each character matches 0x4F.

What if a chunk ends with 0x4F 0x67, and the next chunk
starts with 0x67 0x53?

Otherwise, you could use a slight variant of the strstr
algorithm or use repeated calls of strstr for each
0-terminated section of a chunk.
 
Maxim Yegorushkin

For avoiding 4 if-s, see memcmp() or std::equal().
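
As a minimal sketch of that suggestion, either call can replace the four nested comparisons with one. The names `kMarker`, `is_marker_memcmp` and `is_marker_equal` are made up for this example, and both functions assume the caller has already checked that at least four bytes are readable at `p`.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>

static const unsigned char kMarker[4] = {0x4F, 0x67, 0x67, 0x53};

// True if the 4 bytes at p equal the marker (byte-wise comparison).
inline bool is_marker_memcmp(const unsigned char* p) {
    return std::memcmp(p, kMarker, sizeof kMarker) == 0;
}

// Same check written with std::equal over the marker range.
inline bool is_marker_equal(const unsigned char* p) {
    return std::equal(kMarker, kMarker + sizeof kMarker, p);
}
```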

For avoiding buffering, use memory mapping to map the whole file into
memory (sorry not native C/C++, but platform-specific!). If the file is too
large to fit in the process address space, just switch to a 64-bit system!

Brilliant answer! Quite true that a 64-bit operating system allows you
to do I/O in an awesome novel way.

For reading one can map the file into memory and return two iterators to
the range mapped.

When writing, one can map the file into memory again and write directly
into memory without having to issue a write() syscall. Resizing a
memory-mapped file takes munmap(), ftruncate(), then mmap() again (i.e.
resizing invalidates iterators, but not offsets, just like for std::vector<>).

And the file size does not matter much (unless it's larger than 2^64
bytes; in practice the limit is below 2^64, because system shared
libraries may get mapped in the middle of the 64-bit process address
space) because Unixes (and probably Windoze) do demand paging; that is,
mapped memory only consumes physical memory when it is touched. For
example, if you mmap() a 64 GB file, no physical memory gets consumed
until you start accessing that memory.

For more details pls see: http://en.wikipedia.org/wiki/Demand_paging
 
