Reading data by words from a file in Linux system


Kuhl

Hi, I am programming on a Linux system and need to read data from a
file. See the code below. It handles data one byte at a time, but I
usually need to handle data by word (two bytes). If I only handle
bytes, the code is very inefficient. For the sample code in this post
it is only comparing data, so it's not so serious yet, but in later
parts I will do a lot of mathematical calculation, and byte-by-byte
operation would be extremely inefficient. I don't know how to read
data from a file in words. Is there any solution? Thanks.

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>

int main(int argc, char *argv[])
{
    int fd;
    int i;
    int gds_size;
    char databuf[1024];
    struct stat filebuf;

    if (stat(argv[1], &filebuf) == -1) {
        printf("\nERROR: Fail to find file %s.\n", argv[1]);
        return 0;
    }

    fd = open(argv[1], O_RDONLY);
    if (fd == -1) {
        printf("\nERROR: Fail to open file %s.\n", argv[1]);
        return 0;
    }

    if (read(fd, databuf, 10) != 10) {
        printf("\nERROR: Fail to read file %s.\n", argv[1]);
        close(fd);
        return 0;
    }

    if (!(databuf[0]==0 && databuf[1]==6 && databuf[2]==0 && databuf[3]==2 &&
          databuf[6]==0 && databuf[7]==28 && databuf[8]==1 && databuf[9]==2)) {
        printf("\nERROR: This is not a valid GDS format.\n");
        close(fd);
        return 0;
    }

    printf("\nFurther program going on.\n");
    close(fd);
    return 0;
}
 

Guest

Kuhl said:
[...] I don't know how to read data from a
file in words. Is there any solution?

You're asking a Unix specific question so you'd be better off asking
in a unix specific group (eg. comp.unix.programmer).

On the other hand reading a byte at a time may not be your problem.
You can read larger chunks using fread(). It may be better to read
the whole file or a large chunk of it before you do your processing.
You can also map the file into memory but that is platform specific.
 

Guest

Kuhl said:
[...] I don't know how to read data from a
file in words. Is there any solution?

fread and/or fgetc will do what you want (i.e., either call
fread with a second argument of 2, or call fgetc twice). Let
the underlying library buffer the reads to get the speed
you want (i.e., don't call read unless the third argument
is BUFSIZ).

 if(stat(argv[1], &filebuf) == -1){
  printf("\nERROR: Fail to find file %s .\n", argv[1]);
  return 0;
 }
 fd = open(argv[1], O_RDONLY);
 if(fd == -1){
  printf("\nERROR: Fail to open file %s .\n", argv[1]);
  return 0;
 }

No, no, no.  A thousand times, no.  Replace both of those
printfs with:
perror( argv[ 1 ]);

why?
 

Richard Tobin

[...]
No, no, no. A thousand times, no. Replace both of those
printfs with:
perror( argv[ 1 ]);

So you get a better error message. For an even better one,
use strerror().

-- Richard
 

CBFalconer

Kuhl said:
Hi, I am doing programming in Linux system. I need to read data
from a file. See the code below. Such code handles data per byte.
But I usually need to handle data by word (two bytes). If only
handle bytes, then the code is very inefficient. ...

That is the reason for the getc macro, which is special in that it
may evaluate its operand more than once. The macro, if supplied, is
able to access the file buffer byte by byte without losing
efficiency. It uses the normal file buffer.

7.19.7.5 The getc function

Synopsis
[#1]
#include <stdio.h>
int getc(FILE *stream);

Description

[#2] The getc function is equivalent to fgetc, except that
if it is implemented as a macro, it may evaluate stream more
than once, so the argument should never be an expression
with side effects.

Returns

[#3] The getc function returns the next character from the
input stream pointed to by stream. If the stream is at end-
of-file, the end-of-file indicator for the stream is set and
getc returns EOF. If a read error occurs, the error
indicator for the stream is set and getc returns EOF.

Also see putc.
 

Keith Thompson

CBFalconer said:
That is the reason for the getc macro, which is special in that it
may evaluate its operand more than once. The macro, if supplied, is
able to access the file buffer byte by byte without losing
efficiency. It uses the normal file buffer.
[...]

The fgetc function can also access the file buffer byte by byte. The
advantage of getc over fgetc is that it can avoid the overhead of a
function call, but both can avoid performing physical I/O on each
call.
 

Bartc

Kuhl said:
[...] I don't know how to read data from a
file in words. Is there any solution?

Try one of these functions:

int readword(FILE* f) {
    return (fgetc(f)<<8) | fgetc(f);
}

int readword(FILE* f) {
    return fgetc(f) | (fgetc(f)<<8);
}

depending on which order you want the bytes. The functions assume a file
opened in binary mode with fopen(); if you are using open() instead, use
an equivalent of fgetc.

To determine if these are fast enough, just read an entire, typical file
using readword(), but do nothing else. That will tell you how much overhead
reading bytes this way will be.
 

Bartc

Gordon Burditt said:
Neither of these functions gives the bytes in a predictable order,
since there is no sequence point between the first call to fgetc()
(whichever one that is) and the second call to fgetc() (whichever
one that is).

You're right; I only tested this with two different compilers; the third one
returned the same order in both functions.
Also, neither of these deals with EOF reasonably.

Try something like:

int readword(FILE *f) {
    int c1;
    int c2;
    c1 = fgetc(f);
    c2 = fgetc(f);
    if (c1 == EOF || c2 == EOF) {
        return EOF;
    }
    return c1 | (c2 << 8);
}

This one still doesn't handle EOF reasonably on a machine with
16-bit ints, because there is no way to distinguish EOF from reading
two 0xff bytes in succession. There isn't any special value I could
use to signal EOF since all possible combinations of values
could be produced by the two characters read.

EOF checking on a per-byte basis is probably less important when reading a
file known to have an even number of bytes, and when the words are known to
be structured in a certain way and when the EOF point can be predicted.

It might suffice to do an feof() check at strategic points, together with
data-specific integrity checks, to detect corrupt files. Or at least to
signal EOF in a way which doesn't require the caller to test for
readword()==EOF every single time, which would be a nightmare.
 

Kuhl

Hi, many thanks, there are so many good answers. Byte order is an
issue in my case. I found that both the 16-bit and the 32-bit data in
the file I am handling store the higher byte as the less significant
byte, while it's the contrary in C: C defines higher bytes as more
significant bytes. So eventually I wrote a function to reverse the
byte order of each piece of data. About the speed concern, I used a
big data buffer, with pointer variables to access the data. DRAM
size is not a concern in my system. Thanks.
 

James Kuyper

Kuhl said:
[...] C defines higher bytes as more
significant bytes. [...]

C does no such thing. The byte order is up to each implementation of C
to decide, and most implementations use whatever order makes the most
sense for the target architecture. You should avoid making any
assumptions about the byte order, at least if you want your code to be
portable.
 

lawrence.jones

William Pursell said:
It is true that on some systems, errno is not set for
things like fopen, and that perror gives a less than
helpful message in that case, but the OP specifically
states that some flavor of Linux is being used, so that's
not an issue.

He's also using open, not fopen, so it isn't even relevant.
 

Kuhl

Hi. If the byte order is up to each implementation of C, then is this
order already fixed in a compiled executable file? If it is fixed in
the compiled file, then this program is still portable. Thanks.
 

James Kuyper

Kuhl said:
Hi. If the byte order is up to each implementation of C, then is this
order already fixed in a compiled executable file? ...

So to speak. It's really fixed by the interaction between the CPU and
the generated code. The executable file may contain an instruction
telling the CPU to load a word of memory (keep in mind that a "word" can
refer to different numbers of bytes on different machines) from RAM into
a register, and then perform arithmetic operations on that value in the
register. It is the CPU itself that interprets the bytes stored in RAM
when they get loaded into the registers. In principle, you could build
two different machines on which exactly the same generated machine code
would result in those bytes being interpreted in the opposite order. You
wouldn't be able to tell, just by looking at the executable, whether it
was implementing big-endian or little-endian integers; you would also
have to know which of the two machines it was being run on.

A compiler could emulate a big-endian machine even though the executable
will be running on a little-endian machine (or vice versa), by swapping
bytes before loading them into registers, and after writing them from
registers.
... If it is fixed in
the compiled file, then this program is still portable. Thanks.

I can't figure out how you reached that conclusion; but I can tell you
it is false.

A given C program, compiled for one platform, may produce an executable
file that, when run on that platform, interprets ints as 4 8-bit bytes
in big-endian order and 2's complement representation. That same
executable, when run on a different platform, may produce an error
message indicating that it's in the wrong format to BE an executable
file for that platform.

When that same C program is compiled for the second platform, it may
produce a different message with the same basic meaning if you attempt
to run the generated executable on the first platform. When you run it
on the second platform, it may interpret ints as 2 16-bit bytes in
little-endian order and 1's complement representation.

Whether or not this difference will cause a problem depends very much
upon how you wrote the code. Knowing how to write code so it will
perform essentially the same operations on either platform is relatively
easy, but non-trivial, if all of the data in the program is stored
internally. However, if it requires an input source that is in some
sense "the same" on both platforms, the way in which you must write the
code depends upon the sense in which it is "the same", and the issue
gets quite complicated.

For instance, the process of transferring the data from one system to
the other might put either one or two 8-bit bytes in each 16-bit byte.
It might or might not change the endianness of multi-byte objects. It
might or might not convert the 2's complement data to 1's complement.
You'll have to know which of these options apply, in order to write the
code so it handles the "same" input data to produce the "same" outputs.

This is why it is often recommended that data to be transferred between
platforms should be stored in text format. The data may still need to be
transformed when transported to a different platform, but the issues
created by that kind of transformation are much easier to deal with.
 

Guest

Kuhl said:
Hi. If the byte order is up to each implementation of C, then is this
order already fixed in a compiled executable file? If it is fixed in
the compiled file, then this program is still portable. Thanks.

this makes no sense.

Yes, the byte order is up to each implementation of C.
Yes, it is fixed by the compiler. Or at least the compiler
should agree with the platform conventions.
No, it is not portable.


#include <stdio.h>

int main(void)
{
    int i = 1;
    unsigned char *p = (unsigned char *)&i;
    printf("lo byte:%x ho byte:%x\n", *p & 0xff, *(p + 1) & 0xff);
    return 0;
}

this gives different results on different platforms.
If int is 16 bits (uncommon these days!) then it could
print 00 01 or 01 00 depending on endianness.

The order of bytes in a file will stay the same even if
the file is moved to another platform. If the file
is written on different platforms you may get different results.

I say reading a word at a time is probably a bad idea. If
you are *certain* that byte-at-a-time I/O is your problem,
and you have MEASURED it, then consider reading much bigger
chunks and converting them to words internally. You may need
to run different code on different platforms.

int make_word (unsigned char *buffer)
{
#ifdef LSB_FIRST
    return *buffer & (*(buffer + 1) << 8);
#else
    return (*buffer << 8) & *(buffer + 1);
#endif
}
 

Guest

On 6 Apr, 03:53, Kuhl <[email protected]> wrote:


I say reading a word at a time is probably a bad idea. If
you are *certain* that byte-at-a-time I/O is your problem,
and you have MEASURED it, then consider reading much bigger
chunks and converting them to words internally. You may need
to run different code on different platforms.

int make_word (unsigned char *buffer)
{
#ifdef LSB_FIRST
    return *buffer & (*(buffer + 1) << 8);
#else
    return (*buffer << 8) & *(buffer + 1);
#endif
}

damn...

ok, see those ands (&) up there? They should be ors (|)
 

JosephKK

James Kuyper said:
This is why it is often recommended that data to be transferred between
platforms should be stored in text format. [...]

Or you could use XDR.
 
