Storing/processing binary file input help needed

Discussion in 'C Programming' started by Arnold, Jan 6, 2004.

  1. Arnold

    Arnold Guest

    I need to read a binary file and store it into a buffer in memory (system
    has large amount of RAM, 2GB+) then pass it to a function. The function
    accepts input as 32 bit unsigned longs (DWORD). I can pass a max of 512
    words to it at a time. So I would pass them in chunks of 512 words until the
    whole file has been processed. I haven't worked with binary files before so
    I'm confused with how to store the binary file into memory. What sort of
    array do I use? Does C allow char only? Can I declare a DWORD buffer since
    that's what the function is taking as input? Or do I need to know the format
    of the original data that binary file is encoding and store it in that?
    That's the part that is really confusing me.

    I believe I'll need to use fread to copy the file to that array. I plan on
    getting the size of file, then determining how many DWORD are present in it
    (for example 9000) and use that as my number-of-objects parameter in fread. So
    in this case:

    fread(buffer, 4,9000,fp); //each DWORD is 4 bytes, 900 DWORDs in my binary
    file

    Is that right?

    Once I get the file into the buffer, I can then do a loop where I pass 512
    elements of the array to a function until all 9000 elements are processed. I
    hope that's right. Any other tips on improving speed and efficiency would be
    appreciated. Thanks.
    Arnold, Jan 6, 2004
    #1

  2. [Cross-post to comp.lang.c++ removed. If you want a C answer, ask here.
    If you want a C++ answer, ask there. Don't ask in both places. C and C++
    are two very different languages. The best solution in one may not even
    be valid in the other.]

    Arnold wrote:

    > I need to read a binary file and store it into a buffer in memory (system
    > has large amount of RAM, 2GB+) then pass it to a function. The function
    > accepts input as 32 bit unsigned longs (DWORD).


    You should leave the Microsoftisms at the door when you ask a question
    here. We discuss standard, portable C in this group. We know what
    unsigned long is. We don't know or care about DWORD.

    > I can pass a max of 512
    > words to it at a time. So I would pass them in chunks of 512 words until the
    > whole file has been processed. I haven't worked with binary files before so
    > I'm confused with how to store the binary file into memory. What sort of
    > array do I use? Does C allow char only? Can I declare a DWORD buffer since
    > that's what the function is taking as input?


    You can declare your buffer basically any way you want, but the
    functions for reading will always read a sequence of chars. The problem
    with declaring the buffer as something other than char[] is that it
    results in basically reinterpreting the raw bits, and the result may be
    incorrect or even illegal (resulting in undefined behavior - possibly a
    program crash) if the format of the file doesn't match the exact layout
    that the C implementation uses for the type (unsigned long in this case).

    Basically, you are talking about allowing the C implementation to
    dictate the file format. Not only is this a bad idea, but it sounds like
    it's backward in your case - the file format is already defined.

    The correct, portable way to read a binary file is almost always to read
    it as raw bytes, then convert the raw bytes according to the format of
    the file. So if your file is made up of 4-byte unsigned values, stored
    most-significant-byte first, you could do something like this:

    #include <limits.h> /* for CHAR_BIT */

    #define FIELD_BYTES 4

    unsigned char buf[FIELD_BYTES];
    unsigned long value = 0;
    size_t i;

    fread(buf, FIELD_BYTES, 1, fp); /* check the return value in real code */
    for (i=0; i<FIELD_BYTES; ++i)
    {
        value = (value << CHAR_BIT) | buf[i];
    }

    You could also handle more than one value at a time, with a little more
    work.
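
    As a rough sketch of the "more than one value at a time" case, the
    conversion can be wrapped in a function that decodes a whole run of
    fields from a byte buffer. The name decode_be32 is made up for this
    illustration, and the code assumes the file stores each value as four
    octets, most significant byte first.

```c
#include <assert.h>
#include <stddef.h>

#define FIELD_BYTES 4   /* bytes per value in the file (assumed) */

/* Decode nvalues big-endian fields from raw[] into out[].
   raw must hold at least nvalues * FIELD_BYTES bytes. */
void decode_be32(const unsigned char *raw, size_t nvalues, unsigned long *out)
{
    size_t i, j;

    assert(raw != NULL && out != NULL);

    for (i = 0; i < nvalues; ++i) {
        unsigned long value = 0;
        for (j = 0; j < FIELD_BYTES; ++j) {
            /* each file byte is assumed to be an octet, hence the shift by 8 */
            value = (value << 8) | raw[i * FIELD_BYTES + j];
        }
        out[i] = value;
    }
}
```

    A caller could fill raw with fread(raw, FIELD_BYTES, 512, fp), which
    returns the number of complete fields read, and pass that count on as
    nvalues.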

    > Or do I need to know the format
    > of the original data that binary file is encoding and store it in that?


    Not sure what you mean by that. Of course you need to know the format of
    the file, and write the code accordingly. You can't wave a magic wand
    and make your code handle files in an unknown format.

    > That's the part that is really confusing me.
    >
    > I believe I'll need to use fread to copy the file to that array. I plan on
    > getting the size of file, then determining how many DWORD are present in it
    > (for example 9000) and use that as my number-of-objects parameter in fread. So
    > in this case:
    >
    > fread(buffer, 4,9000,fp); //each DWORD is 4 bytes, 900 DWORDs in my binary
    > file
    >
    > Is that right?


    It's a possible starting point. It's certainly not a complete, portable
    solution.

    >
    > Once I get the file into the buffer, I can then do a loop where I pass 512
    > elements of the array to a function until all 9000 elements are processed. I
    > hope that's right. Any other tips on improving speed and efficiency would be
    > appreciated. Thanks.


    My main tip for improving speed and efficiency is don't even try to.
    Write simple, correct code first. Only worry about making it faster if
    it's determined to be too slow, and then profile to determine where the
    time is being lost so you can target optimizing effort appropriately.

    In particular, if you are only able to handle 512 elements at a time, I
    wouldn't bother reading more than that from the file each iteration.
    There's probably no need to read the entire file into memory, and it
    would probably be more complicated. On the other hand, reading larger
    blocks (and thus minimizing I/O function calls) /might/ improve
    execution speed, but don't worry about that until it's time (as
    described above).
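
    A minimal sketch of that read-512-per-iteration loop might look like
    the following. The callback type chunk_fn, the function process_file,
    and the stand-in consumer sum_chunk are all hypothetical names, and
    the big-endian layout is an assumption about the file; substitute the
    real function and format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define CHUNK_WORDS 512   /* most words the consumer accepts per call */
#define WORD_BYTES  4     /* bytes per word in the file (assumed) */

typedef void (*chunk_fn)(const unsigned long *words, size_t count, void *ctx);

/* Hypothetical consumer for illustration: adds up everything it is given. */
static void sum_chunk(const unsigned long *words, size_t count, void *ctx)
{
    unsigned long *sum = ctx;
    size_t i;
    for (i = 0; i < count; ++i)
        *sum += words[i];
}

/* Reads fp to end-of-file, at most CHUNK_WORDS words per iteration, and
   hands each chunk to consume(). Returns the number of words processed. */
size_t process_file(FILE *fp, chunk_fn consume, void *ctx)
{
    unsigned char raw[CHUNK_WORDS * WORD_BYTES];
    unsigned long words[CHUNK_WORDS];
    size_t nwords, total = 0;

    assert(fp != NULL && consume != NULL);

    /* fread returns the number of complete WORD_BYTES-sized items read */
    while ((nwords = fread(raw, WORD_BYTES, CHUNK_WORDS, fp)) > 0) {
        size_t i;
        for (i = 0; i < nwords; ++i) {
            const unsigned char *p = raw + i * WORD_BYTES;
            words[i] = ((unsigned long)p[0] << 24) | ((unsigned long)p[1] << 16)
                     | ((unsigned long)p[2] << 8)  |  (unsigned long)p[3];
        }
        consume(words, nwords, ctx);   /* the last chunk may be short */
        total += nwords;
    }
    return total;
}
```

    Only one chunk is ever held in memory, and the short final chunk
    falls out of the fread return value without a special case.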

    -Kevin
    --
    My email address is valid, but changes periodically.
    To contact me please use the address from a recent posting.
    Kevin Goodsell, Jan 6, 2004
    #2

  3. On Tue, 06 Jan 2004 08:10:52 +0000, Arnold wrote:

    > Once I get the file into the buffer, I can then do a loop where I pass 512
    > elements of the array to a function until all 9000 elements are processed. I
    > hope that's right. Any other tips on improving speed and efficiency would be
    > appreciated. Thanks.


    As an alternative to the mmap solution from Gianni, the easiest way to do
    this would be to read 512 words, process them, write back the result, and
    repeat until end-of-file. No need to read the whole file into memory.

    You can write back results in place, if they should occupy the same
    storage, or to some other file. If the data has to be replaced, it is
    often best to write the output to a new file, then move the new file over
    the old file. That way you will not corrupt the original file if your
    program crashes half way through.

    HTH,
    M4
    Martijn Lievaart, Jan 6, 2004
    #3
  4. "Arnold" <> wrote in message news:<g4uKb.395$>...

    I am not a C wizard but I have some suggestions.

    > I need to read a binary file and store it into a buffer in memory (system
    > has large amount of RAM, 2GB+) then pass it to a function. The function
    > accepts input as 32 bit unsigned longs (DWORD). I can pass a max of 512
    > words to it at a time. So I would pass them in chunks of 512 words until the
    > whole file has been processed.


    By the term "words", do you mean chunks of chars delimited by ASCII
    spaces? Or is each "word" 512 bytes in size?

    > I'm confused with how to store the binary file into memory. What sort of
    > array do I use? Does C allow char only? Can I declare a DWORD buffer since
    > that's what the function is taking as input? Or do I need to know the format
    > of the original data that binary file is encoding and store it in that?
    > That's the part that is really confusing me.


    By "binary file" and "file format", are you talking about the first
    two letters in a file, as in DOS executables (for example, MZ in an
    .exe file), or the format of the data present in the file (fields and
    records with some kind of delimiter)? If it is the second, then it is
    more a matter of the file's record design.

    >
    > I believe I'll need to use fread to copy the file to that array. I plan on
    > getting the size of file, then determining how many DWORD are present in it
    > (for example 9000) and use that as my number-of-objects parameter in fread. So
    > in this case:
    >
    > fread(buffer, 4,9000,fp); //each DWORD is 4 bytes, 900 DWORDs in my binary
    > file
    >
    > Is that right?



    Is it always just 512 elements, or is the count unknown until run time?
    If it is unknown, isn't it time to use a linked list rather than an array?



    > Once I get the file into the buffer, I can then do a loop where I pass 512
    > elements of the array to a function until all 9000 elements are processed. I
    > hope that's right. Any other tips on improving speed and efficiency would be
    > appreciated. Thanks.


    Optimizing in C is not a kind of "instructions management" like in
    asm.
    sathyashrayan, Jan 6, 2004
    #4
  5. Arnold

    Arnold Guest

    "Martijn Lievaart" <> wrote in message
    news:p...
    > On Tue, 06 Jan 2004 08:10:52 +0000, Arnold wrote:
    >
    > > Once I get the file into the buffer, I can then do a loop where I pass 512
    > > elements of the array to a function until all 9000 elements are processed. I
    > > hope that's right. Any other tips on improving speed and efficiency would be
    > > appreciated. Thanks.

    >
    > As an alternative to the mmap solution from Gianni, the easiest way to do
    > this would be to read 512 words, process them, write back result, repeat
    > until end-of-file. No need to read the whole file in memory.


    I thought of that but speed is a concern so I want to keep the number of
    disk accesses at a minimum.

    >
    > You can write back results in place, if they should occupy the same
    > storage, or to some other file. If the data has to be replaced, it is
    > often best to write the output to a new file, then move the new file over
    > the old file. That way you will not corrupt the original file if your
    > program crashes half way through.
    >


    In my case, I don't have to write any data back to the original file. Thanks
    for the suggestions.
    > HTH,
    > M4
    >
    Arnold, Jan 6, 2004
    #5
  6. Arnold

    Arnold Guest

    "sathyashrayan" <> wrote in message
    news:...
    > "Arnold" <> wrote in message news:<g4uKb.395$>...
    >
    > I am not a C wizard but I have some suggestions.
    >
    > > I need to read a binary file and store it into a buffer in memory (system
    > > has large amount of RAM, 2GB+) then pass it to a function. The function
    > > accepts input as 32 bit unsigned longs (DWORD). I can pass a max of 512
    > > words to it at a time. So I would pass them in chunks of 512 words until the
    > > whole file has been processed.

    >
    > By the term "words", do you mean chunks of chars delimited by ASCII
    > spaces? Or is each "word" 512 bytes in size?


    Each word is a DWORD, so each one is 32 bits. I can pass a maximum of 512
    DWORDs at a time to the function.


    > > I'm confused with how to store the binary file into memory. What sort of
    > > array do I use? Does C allow char only? Can I declare a DWORD buffer since
    > > that's what the function is taking as input? Or do I need to know the format
    > > of the original data that binary file is encoding and store it in that?
    > > That's the part that is really confusing me.

    >
    > By "binary file" and "file format", are you talking about the first
    > two letters in a file, as in DOS executables (for example, MZ in an
    > .exe file), or the format of the data present in the file (fields and
    > records with some kind of delimiter)? If it is the second, then it is
    > more a matter of the file's record design.


    It is the second.

    >
    > >
    > > I believe I'll need to use fread to copy the file to that array. I plan on
    > > getting the size of file, then determining how many DWORD are present in it
    > > (for example 9000) and use that as my number-of-objects parameter in fread. So
    > > in this case:
    > >
    > > fread(buffer, 4,9000,fp); //each DWORD is 4 bytes, 900 DWORDs in my binary
    > > file
    > >
    > > Is that right?

    >
    >
    > Is it always just 512 elements, or is the count unknown until run time?
    > If it is unknown, isn't it time to use a linked list rather than an array?
    >


    512 is the maximum the function can handle at a time so that is fixed,
    except for the last iteration though as the file won't have a multiple of
    512 number of DWORDs.

    >
    >
    > > Once I get the file into the buffer, I can then do a loop where I pass 512
    > > elements of the array to a function until all 9000 elements are processed. I
    > > hope that's right. Any other tips on improving speed and efficiency would be
    > > appreciated. Thanks.

    >
    > Optimizing in C is not a kind of "instructions management" like in
    > asm.
    Arnold, Jan 6, 2004
    #6
  7. "Arnold" <> wrote in message
    news:g4uKb.395$...
    > I need to read a binary file and store it into a buffer in memory (system
    > has large amount of RAM, 2GB+) then pass it to a function. The function
    > accepts input as 32 bit unsigned longs (DWORD). I can pass a max of 512
    > words to it at a time. So I would pass them in chunks of 512 words until the
    > whole file has been processed. I haven't worked with binary files before so
    > I'm confused with how to store the binary file into memory. What sort of
    > array do I use? Does C allow char only? Can I declare a DWORD buffer since
    > that's what the function is taking as input? Or do I need to know the format
    > of the original data that binary file is encoding and store it in that?
    > That's the part that is really confusing me.
    >
    > I believe I'll need to use fread to copy the file to that array. I plan on
    > getting the size of file, then determining how many DWORD are present in it
    > (for example 9000) and use that as my number-of-objects parameter in fread. So
    > in this case:
    >
    > fread(buffer, 4,9000,fp); //each DWORD is 4 bytes, 900 DWORDs in my binary
    > file
    >
    > Is that right?
    >


    You don't need to read the whole file, you can read 512 bytes at a time into
    a buffer of appropriate size:

    char buffer[512];
    size_t x;
    // don't forget to check the value of x (which is the number of bytes
    // actually read)
    x=fread(buffer, 1, 512, fp);
    ...

    You can then pass a pointer to this buffer to your function, which has been
    prototyped to accept an array of DWORD and the number of elements to
    process (which will be x/4 from the fread above), e.g.

    int process_buf(DWORD *my_array, int number_of_elements);

    Then your function can iterate across this array as follows:

    int process_buf(DWORD *my_array, int no_elements)
    {
        int i;
        DWORD next_val;
        for(i=0;i<no_elements;i++){
            next_val=my_array[i]; // You might need to convert from
                                  // big-endian to little-endian here (see below)
        }
        return 0;
    }


    Of course this makes an assumption that the data in the file is stored in
    the same byte order as the processor you are running your program on (most
    likely you are using an Intel Pentium so Little-Endian is the byte order you
    are assuming). If the file uses another byte order then you can write
    (or google for) a macro that will do the conversion for you.
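
    For illustration, such a conversion could be written as a small
    function rather than a macro. The names swap32 and
    host_is_little_endian are invented for this sketch, and it assumes
    the values are 32 bits wide even if unsigned long is wider on the
    host.

```c
#include <assert.h>

/* Reverse the byte order of a 32-bit value held in an unsigned long. */
unsigned long swap32(unsigned long v)
{
    /* unsigned long may be wider than 32 bits; the value itself must fit */
    assert(v <= 0xFFFFFFFFUL);

    return ((v & 0x000000FFUL) << 24)
         | ((v & 0x0000FF00UL) << 8)
         | ((v & 0x00FF0000UL) >> 8)
         | ((v & 0xFF000000UL) >> 24);
}

/* Returns 1 on a little-endian host, 0 otherwise. */
int host_is_little_endian(void)
{
    unsigned int probe = 1;
    return *(unsigned char *)&probe == 1;
}
```

    With these, a big-endian file read on a little-endian host would
    have each element passed through swap32 before use, and be left
    alone when the byte orders already match.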

    Hope this helps
    Sean
    Sean Kenwrick, Jan 6, 2004
    #7
  8. On Tue, 06 Jan 2004 08:10:52 GMT, "Arnold" <>
    wrote:

    >I need to read a binary file and store it into a buffer in memory (system
    >has large amount of RAM, 2GB+) then pass it to a function. The function
    >accepts input as 32 bit unsigned longs (DWORD). I can pass a max of 512
    >words to it at a time. So I would pass them in chunks of 512 words until the
    >whole file has been processed. I haven't worked with binary files before so
    >I'm confused with how to store the binary file into memory. What sort of
    >array do I use? Does C allow char only? Can I declare a DWORD buffer since
    >that's what the function is taking as input? Or do I need to know the format
    >of the original data that binary file is encoding and store it in that?
    >That's the part that is really confusing me.


    The I/O function (fread as you suggest below) does not care how you
    define the buffer. However, how you use the buffer may make a
    difference. If you define the buffer as unsigned char, then you are
    guaranteed that all possible 256 values are acceptable (unsigned char
    cannot have trap values) and the buffer will be portable (at least for
    systems which have CHAR_BIT defined as 8). If you define the buffer
    as DWORD, are you sure that all 4 billion plus possible values that
    could come from a binary file are acceptable and your program will
    never execute on a machine with a different sizeof(unsigned long)?

    >
    >I believe I'll need to use fread to copy the file to that array. I plan on
    >getting the size of file, then determining how many DWORD are present in it
    >(for example 9000) and use that as my number-of-objects parameter in fread. So
    >in this case:


    There is no portable way to get the file size (unless you read the
    entire file) so you probably need to use a system specific extension
    or function for this.
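
    The "read the entire file" route can be sketched in standard C as
    below. The function name count_bytes is hypothetical, the stream is
    assumed to have been opened in binary mode, and unsigned long may be
    too narrow for very large files on some systems.

```c
#include <assert.h>
#include <stdio.h>

/* Counts the bytes in a binary-mode stream by reading it from the start,
   then rewinds it. Returns the count, or (unsigned long)-1 on a read error. */
unsigned long count_bytes(FILE *fp)
{
    unsigned char buf[4096];
    unsigned long total = 0;
    size_t n;

    assert(fp != NULL);

    rewind(fp);
    while ((n = fread(buf, 1, sizeof buf, fp)) > 0)
        total += (unsigned long)n;
    if (ferror(fp))
        return (unsigned long)-1;
    rewind(fp);   /* also clears the end-of-file indicator */
    return total;
}
```

    Dividing the result by 4 gives the number of 4-byte values, at the
    cost of reading the file twice, which is one more reason to process
    it chunk by chunk instead.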

    >
    >fread(buffer, 4,9000,fp); //each DWORD is 4 bytes, 900 DWORDs in my binary
    >file


    You meant 9000.

    >
    >Is that right?
    >
    >Once I get the file into the buffer, I can then do a loop where I pass 512
    >elements of the array to a function until all 9000 elements are processed. I
    >hope that's right. Any other tips on improving speed and efficiency would be
    >appreciated. Thanks.


    How you pass a quantity of array elements will determine the
    suitability of your design. (Actually, the method of passing the
    argument(s) should drive the design.) What is the prototype for the
    receiving function?

    The odds of the file containing an exact multiple of 512 DWORDs are
    about 1 in 500 so you may want to be able to handle the last set as a
    smaller quantity.
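
    To make the arithmetic concrete, here is a small sketch; the helper
    names calls_needed and last_chunk_words are made up for the example.
    For the 9000-element case it works out to 17 full chunks of 512 plus
    a final chunk of 296.

```c
#include <assert.h>
#include <stddef.h>

/* Number of calls needed to hand over total_words in chunks of chunk_words,
   counting a short final chunk as one call. */
size_t calls_needed(size_t total_words, size_t chunk_words)
{
    assert(chunk_words > 0);
    return total_words / chunk_words + (total_words % chunk_words != 0);
}

/* Size of the final chunk: a full chunk if the total divides evenly,
   zero if there is nothing to process at all. */
size_t last_chunk_words(size_t total_words, size_t chunk_words)
{
    size_t tail;

    assert(chunk_words > 0);

    tail = total_words % chunk_words;
    return (tail != 0 || total_words == 0) ? tail : chunk_words;
}
```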



    <<Remove the del for email>>
    Barry Schwarz, Jan 6, 2004
    #8
  9. Jack Klein

    Jack Klein Guest

    On Tue, 06 Jan 2004 03:41:14 -0500, Michael B Allen
    <> wrote in comp.lang.c:

    > On Tue, 06 Jan 2004 03:10:52 -0500, Arnold wrote:
    >
    > > I need to read a binary file and store it into a buffer in memory
    > > (system has large amount of RAM, 2GB+) then pass it to a function. The
    > > function accepts input as 32 bit unsigned longs (DWORD). I can pass a
    > > max of 512 words to it at a time. So I would pass them in chunks of 512
    > > words until the whole file has been processed. I haven't worked with
    > > binary files before so I'm confused with how to store the binary file
    > > into memory.

    >
    > The term "binary file" is a bit of a misnomer. It just means it's not
    > text. Otherwise *everything* is "binary".
    >
    > > What sort of array do I use? Does C allow char only? Can I
    > > declare a DWORD buffer since that's what the function is taking as
    > > input? Or do I need to know the format of the original data that binary
    > > file is encoding and store it in that? That's the part that is really
    > > confusing me.

    >
    > Pretend for a minute that you have a really big array in memory:
    >
    > struct mystruct {
    > int foo;
    > char bar[10];
    > float zap;
    > };
    > ...
    > struct mystruct *s = malloc(100000 * sizeof(struct mystruct));


    Are you new in comp.lang.c? Everybody here by now should know the clc
    preferred idiom:

    struct mystruct *s = malloc(100000 * sizeof *s);

    ...and the magic number is anathema, of course, so:

    #define NUM_STRUCTS 100000

    struct mystruct *s = malloc(NUM_STRUCTS * sizeof *s);

    > populate(s);
    >
    > If you write this array to a file you have a "binary file". Now you could
    > do the reverse and read in your array from the file. At least you can
    > on the same machine. If you write the file on a little-endian i386 and
    > read it in on a big-endian Sparc you're going to have endianness problems.
    >
    > Mike
    >
    > PS: This question didn't warrant cross-posting to two different news
    > groups. Please refrain from doing that. Some people will simply not
    > answer your question when they see that.


    Why not? The fread() function is part of the standard C++ library as
    well, so the post is topical there, and two is certainly not an
    excessive number of groups for a cross-post.

    --
    Jack Klein
    Home: http://JK-Technology.Com
    FAQs for
    comp.lang.c http://www.eskimo.com/~scs/C-faq/top.html
    comp.lang.c++ http://www.parashift.com/c++-faq-lite/
    alt.comp.lang.learn.c-c++
    http://www.contrib.andrew.cmu.edu/~ajo/docs/FAQ-acllc.html
    Jack Klein, Jan 7, 2004
    #9
  10. On Tue, 06 Jan 2004 18:01:29 +0000, Arnold wrote:

    >
    > "Martijn Lievaart" <> wrote in message
    > news:p...
    >> On Tue, 06 Jan 2004 08:10:52 +0000, Arnold wrote:
    >>
    >> > Once I get the file into the buffer, I can then do a loop where I pass 512
    >> > elements of the array to a function until all 9000 elements are processed. I
    >> > hope that's right. Any other tips on improving speed and efficiency would be
    >> > appreciated. Thanks.

    >>
    >> As an alternative to the mmap solution from Gianni, the easiest way to do
    >> this would be to read 512 words, process them, write back result, repeat
    >> until end-of-file. No need to read the whole file in memory.

    >
    > I thought of that but speed is a concern so I want to keep the number of
    > disk accesses at a minimum.


    Memory mapping the file is probably still the best way, but suffers from a
    size limit. To get around this, you can also read in large chunks of the
    file. Instead of 512 words, read a few hundred KB at a time and operate on
    that. Experiment with buffer sizes to see what gives the best result.
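
    One way this could look, as a sketch only: read a configurable
    number of words per fread call, then feed the consumer slices of at
    most 512. The names process_in_blocks, slice_fn and tally_slice are
    hypothetical, and the big-endian decoding is an assumption about the
    file format.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

#define SLICE_WORDS 512   /* most words the consumer accepts per call */
#define WORD_BYTES  4     /* bytes per word in the file (assumed) */

typedef void (*slice_fn)(const unsigned long *words, size_t count, void *ctx);

/* Hypothetical consumer for illustration: counts calls and words seen. */
struct tally { size_t calls, words; };
static void tally_slice(const unsigned long *words, size_t count, void *ctx)
{
    struct tally *t = ctx;
    (void)words;
    t->calls++;
    t->words += count;
}

/* Reads block_words words per fread() and feeds them to consume() in
   slices of at most SLICE_WORDS. Returns 0 on success, -1 on allocation
   failure or read error. */
int process_in_blocks(FILE *fp, size_t block_words, slice_fn consume, void *ctx)
{
    unsigned char *raw;
    unsigned long *words;
    size_t count;
    int failed;

    assert(fp != NULL && consume != NULL && block_words > 0);

    raw = malloc(block_words * WORD_BYTES);
    words = malloc(block_words * sizeof *words);
    if (raw == NULL || words == NULL) {
        free(raw);
        free(words);
        return -1;
    }

    while ((count = fread(raw, WORD_BYTES, block_words, fp)) > 0) {
        size_t i, off;
        for (i = 0; i < count; ++i) {          /* big-endian decode */
            const unsigned char *p = raw + i * WORD_BYTES;
            words[i] = ((unsigned long)p[0] << 24) | ((unsigned long)p[1] << 16)
                     | ((unsigned long)p[2] << 8)  |  (unsigned long)p[3];
        }
        for (off = 0; off < count; off += SLICE_WORDS) {
            size_t n = count - off;
            if (n > SLICE_WORDS)
                n = SLICE_WORDS;
            consume(words + off, n, ctx);
        }
    }

    failed = ferror(fp);
    free(raw);
    free(words);
    return failed ? -1 : 0;
}
```

    Because block_words is a parameter, trying different buffer sizes is
    a matter of changing one argument and timing the run.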

    I'm not sure what will be faster. Large buffers reduce the number of
    system calls slightly (good), but decrease locality of reference (bad).
    The mmap solution does not suffer from either of these disadvantages, I think.

    Note that the number of disk accesses will be the same whatever solution
    you choose. You have to read the whole file, period. I guess the main speed
    factors are the number of system calls and how effectively you use your
    memory. Also, you should try to do some useful work while waiting for the
    disk; maybe asynchronous I/O or multithreading can be of help?

    (If you look into multithreading, be sure you know which synchronisation
    mechanisms are lightweight and which are heavyweight; it makes a huge difference.)

    I would just try a simple solution. If it isn't fast enough, try others.
    Profile to see where your program spends its time. If most of the time is
    spent on calculations, all of the above will give only very marginal
    speedups. If run on a fast machine, maybe a naive implementation will be
    fast enough for your needs. Remember the old truism about optimizing:
    Don't (until you have proven you need it).

    HTH,
    M4
    Martijn Lievaart, Jan 7, 2004
    #10