Character Array vs String

Malcolm McLean

Could anybody please mention difference between character array and
string in C?
There's not much difference. In C, a string is an array of characters
with a terminating 0, or null character.

There's special syntax for specifying a string, which is the
conventional quotes. "Fred" creates a constant character array, 5
characters long, with a terminal null. We call that a "string
literal". char name[5] = {'F', 'r', 'e', 'd', '\0'}; does exactly the
same thing, but the array is writeable.
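
A minimal sketch of that difference, in C. The function name `make_jred` is made up for illustration; the point is only that the explicitly initialised array holds the same five bytes as the literal but may be modified, whereas writing to the literal itself is undefined behaviour.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: builds the array from the post, changes its first
   letter, and copies the result into a caller-supplied five-byte buffer. */
void make_jred(char out[5])
{
    char name[5] = {'F', 'r', 'e', 'd', '\0'};  /* same bytes as "Fred" */

    assert(out != NULL);
    assert(sizeof "Fred" == 5);                 /* four letters plus the null */
    assert(memcmp(name, "Fred", 5) == 0);       /* identical contents */

    name[0] = 'J';          /* allowed: name is an ordinary array */
    /* "Fred"[0] = 'J';        undefined behaviour: never modify a literal */
    memcpy(out, name, sizeof name);
}
```

After the call, the caller's buffer holds the string "Jred".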

Occasionally you might need a character array without a terminating
null. This wouldn't be a string in C terms, and will probably cause a
crash if you feed it to a string function like strcpy(). Usually it's
better to put on the null even if you don't use it, because it's only
one byte, and it means you can print the string out with no trouble if
you need to do so for some reason. The other situation is where you
have several strings in the same array, with nulls between them, e.g.
char *name = "Fred\0Bloggs\0\0". A few Microsoft Windows
functions like to receive lists of strings like this. They use a
sequence of two nulls to indicate the end of the list.
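
As a rough sketch of how such a list can be walked (the function name `count_strings` is an assumption, not a Windows API): each string is skipped with strlen() plus one for its null, and an empty string, that is, a second null straight after the previous one, ends the list.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Counts the strings in a list laid out as "one\0two\0\0": each string
   ends with its own null, and an empty string marks the end of the list. */
size_t count_strings(const char *list)
{
    size_t count = 0;

    assert(list != NULL);
    while (*list != '\0') {          /* an empty string ends the list */
        count++;
        list += strlen(list) + 1;    /* step over the string and its null */
    }
    return count;
}
```

Called on "Fred\0Bloggs\0\0" this counts two strings. Ordinary string functions such as strlen() would see only "Fred".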
 
James Kuyper

Could anybody please mention difference between character array and
string in C?

A character array is an array of objects of character type, either char,
signed char, unsigned char, or wchar_t. It remains an array, regardless
of what's stored in those objects.

A string is a data structure that could, among other things, be stored
in a character array. "A string is a contiguous sequence of characters
terminated by and including the first null character." (7.1.1p1).

Example:

char array[] = "One\0Two";

Every single character in that array can be treated as the first
character of a different string; most of those strings overlap each
other. For instance, array+3 points at the string "", which is empty
except for the terminating null character. array+5 points at the string
"wo".
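
A small sketch that checks those claims with strlen(). The helper name `length_from` is made up for this example; it simply measures the string that starts at a given offset into the same array.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns the length of the string that starts at array[offset] in
   char array[] = "One\0Two"; valid offsets are 0 through 7. */
size_t length_from(size_t offset)
{
    static const char array[] = "One\0Two";   /* 8 bytes: O n e \0 T w o \0 */

    assert(offset < sizeof array);
    return strlen(array + offset);
}
```

Offset 0 gives 3 ("One"), offset 3 gives 0 (the empty string), and offset 5 gives 2 ("wo").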
 
Eric Sosman

Could anybody please mention difference between character array and
string in C?

A character array is an array whose individual elements are
characters: `char this[42]', for example. (The term "character"
is ambiguous, since C has four types that could claim the name:
`char', `signed char', `unsigned char', and `wchar_t'. In informal
use "character" usually means `char', but be aware that the term
is not quite so specific and misunderstanding may occur.) Anyhow:
A character array is an array of "character" elements, just as an
`int' array is an array of `int' elements.

A string is a particular data structure that can be stored in
a character array. It consists of some sequence of "payload"
characters plus a special "sentinel" character to mark the end of
the string. The sentinel character has the numeric value zero, and
no payload character has that value. It is possible to have no payload
characters at all (the sentinel will be in the array's first position),
and this represents the empty string. (Note: The terms "payload" and
"sentinel" are not formally defined by C; they're just my attempt to
name and explain the parts of a string.)

Example:

char message[10];
strcpy(message, "Hello");

`message' is a character array: A ten-element array where each element
is a character. After strcpy(), a string of six characters (five
payload and one sentinel) inhabits the first six positions of the
array. The remaining four characters are still part of the array,
but are not part of the string.


         <-------------------------- array -------------------------->

           [0]   [1]   [2]   [3]   [4]   [5]   [6]   [7]   [8]   [9]
         +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
message: | 'H' | 'e' | 'l' | 'l' | 'o' |  0  |  ?  |  ?  |  ?  |  ?  |
         +-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
                                         sen-
         <---------- payload ----------> ti-
                                         nel

         <------------- string ------------->
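
The same distinction can be sketched in code: sizeof reports the size of the array, while strlen() reports only the payload of the string stored in it. The function name `measure_message` is an assumption made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Fills in the size of the array and the length of the string stored in it. */
void measure_message(size_t *array_size, size_t *string_length)
{
    char message[10];

    assert(array_size != NULL && string_length != NULL);
    strcpy(message, "Hello");            /* writes 'H','e','l','l','o', 0 */

    *array_size = sizeof message;        /* 10: all elements of the array */
    *string_length = strlen(message);    /* 5: payload only, sentinel excluded */
}
```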
 
88888 Dihedral

Unicode support in C is documented in C99. In C89 the 8-bit ANSI char string was assumed, and the termination of a string is a 0 stored at the end of a string.

In some encoding C-code compiled to deal with ANSI C string might fail.
This was what I tested long time ago.

In programming languages length of a string might be stored explicitly in 2 to 4 bytes. Be careful of the length of two bytes stored that will cause trouble.

I think that the wchar that could support unicode can enhance the PC platform over the unix-lynux platform that might not support unicode in non-English users.
 
James Kuyper

To my knowledge, C does not have strings.

See 7.1.1p1, which I cited in my own response to Quentin.

C doesn't have a string type, but that's a very different question.
 
Malcolm McLean

To my knowledge, C does not have strings.
It has syntactic support for strings, as string literals. It also has
functions in the standard library that operate on strings. What it
doesn't have is a string type variable. Strings are passed about as
character pointers.
 
BartC

Sheky said:
Could anybody please mention difference between character array and
string in C?

As I understand it, a character array can contain a string, amongst other
things.

So in a ten-character array, if the first three characters are 'A', 'B' and
'C', and the fourth character is zero, then it contains the 3-character
string "ABC" at the beginning (and the other 6 characters can contain
anything).

But the more you look into this, the more complicated it gets (so that same
array also contains the string "BC", and the last 5 characters form a
5-character array, which may or may not contain a string of its own, and so
on). So don't worry about it too deeply.
 
Nick Keighley

CONTEXT!!
Leave some in so we know what you are responding to.

For unicode  support in C is documented in C99. In C89 the 8-bit ANSI char

the character code you are trying to refer to is "ASCII" not "ANSI"
string was assumed

no it wasn't. K&R (pre-standard C) may have done this (but I'm not
convinced even K&R was locked to ASCII). C89 went to a little trouble
to make it char set independent. No reason why a conforming C89
implementation could not use EBCDIC (a rather nasty IBM character
code). And such things exist.
and the termination of a string  is a 0 stored [at] the [end] of a string.

yes, the standard insists on this
In some encoding C-code compiled to deal with ANSI C string might fail.

no. All conforming implementations must use zero to terminate a
string.
Badly written C programs might assume a particular character encoding.
But it isn't hard to write programs that are character encoding
neutral.
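
A short sketch of what encoding-neutral code can look like. The C standard guarantees that the digits '0' to '9' have consecutive values, so `c - '0'` is portable; it makes no such promise for letters (EBCDIC has gaps between them), so the second function looks the letter up instead of subtracting 'A'. Both function names are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Portable: the C standard guarantees that '0'..'9' are consecutive. */
int digit_value(char c)
{
    assert(c >= '0' && c <= '9');
    return c - '0';
}

/* Portable: look the letter up instead of assuming 'A'..'Z' are consecutive,
   which holds in ASCII but not in EBCDIC. Returns 0 for 'A', 25 for 'Z',
   or -1 if c is not an upper-case letter. */
int letter_index(char c)
{
    static const char letters[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    const char *p = (c != '\0') ? strchr(letters, c) : NULL;

    return p ? (int)(p - letters) : -1;
}
```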
This was what I tested long time ago.

anything that failed was either not a C compiler or your test program
was broken.
In --some-- programming languages length of a string might be stored explicitly in 2 to 4 bytes. Be careful of the length of two bytes stored that will cause trouble.

Pascal did this typically. C++ std::string probably does it. There are
a zillion string libraries out there that do it. They will only cause
problems if you pass a non-C string to code that is expecting a
C-string.
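
To illustrate the mismatch, here is a sketch of a length-prefixed string and a conversion to a C string. The struct `counted_string` and the function `to_c_string` are hypothetical and do not mirror any particular library's layout; the essential step is adding the terminating null that C string functions rely on.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical length-prefixed string: the length is stored explicitly
   and the bytes are NOT followed by a terminating null. */
struct counted_string {
    unsigned short length;   /* explicit count, for illustration */
    char bytes[32];
};

/* Copies a counted string into dest as a C string. Returns 0 on success,
   or -1 if dest is too small to hold the bytes plus the terminating null. */
int to_c_string(const struct counted_string *src, char *dest, size_t dest_size)
{
    assert(src != NULL && dest != NULL);
    if (src->length > sizeof src->bytes || (size_t)src->length + 1 > dest_size)
        return -1;
    memcpy(dest, src->bytes, src->length);
    dest[src->length] = '\0';    /* add the null that C string functions need */
    return 0;
}
```

Passing `src->bytes` straight to strlen() or strcpy() would read past the counted bytes, which is the kind of problem described above.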
I think that the wchar that could support unicode can enhance the PC platform over the unix-lynux platform that might  not support unicode in non-English users.

I'm pretty sure all main-stream OS's (Win, Linux, MacOS) support
unicode already. Hence no enhancement necessary.
 
Quentin Carbonneaux

See 7.1.1p1, which I cited in my own response to Quentin.

I saw it.
C doesn't have a string type, but that's a very different question.

My answer was a bit misleading... But, as you guessed it I tried to state that
C does not have a string type (I did not think of a string as a data structure,
which it is).

Thanks for making it clear.
 
osmium

For unicode support in C is documented in C99. In C89 the 8-bit ANSI char
the character code you are trying to refer to is "ASCII" not "ANSI"

Furthermore, ASCII is a 7-bit code, not 8. It is usually extended in some
fashion to become eight bits in actual use as opposed to an abstraction.
ANSI is the organization that "blesses" some stuff for the USA.

ANSI - American National Standards Institute.

ASCII - American Standard Code for Information Interchange.
 
Malcolm McLean

no. All conforming implementations must use zero to terminate a
string.
Badly written C programs might assume a particular character encoding.
But it isn't hard to write programs that are character encoding
neutral.
Sometimes it's harder than it looks.

For instance IFF files have 4-letter ASCII tags which indicate what
sort of "chunk" you are reading. So the obvious thing to write is

fread(chunk, 1, 4, fp);
if(!strncmp(chunk, "DATA", 4))
/* we've got a data chunk */

That will break on a non-ascii system. The solution is to hardcode the
values. But then you can no longer read the word "DATA" and it becomes
a lot harder to see that the chunk identifier is correct.
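
A sketch of the hardcoded variant, assuming the tag has already been read into a four-byte buffer. The names `TAG_DATA` and `is_data_chunk` are made up for this example. memcmp() is used rather than strncmp() because the four tag bytes are not a string: there is no terminating null.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The ASCII codes for 'D', 'A', 'T', 'A', spelled out as numbers so the
   comparison does not depend on the compiler's execution character set. */
static const unsigned char TAG_DATA[4] = { 0x44, 0x41, 0x54, 0x41 };

/* Returns 1 if the four bytes at chunk are the ASCII tag "DATA", else 0. */
int is_data_chunk(const unsigned char *chunk)
{
    assert(chunk != NULL);
    return memcmp(chunk, TAG_DATA, sizeof TAG_DATA) == 0;
}
```

The readability cost is visible here: only the comment tells the reader that the four numbers spell "DATA".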
 
James Kuyper

On 11/09/2011 10:48 AM, Malcolm McLean wrote:
....
For instance IFF files have 4-letter ASCII tags which indicate what
sort of "chunk" you are reading. So the obvious thing to write is

fread(chunk, 1, 4, fp);
if(!strncmp(chunk, "DATA", 4))
/* we've got a data chunk */

That will break on a non-ascii system. The solution is to hardcode the
values. But then you can no longer read the word "DATA" and it becomes
a lot harder to see that the chunk identifier is correct.

You can make it a macro, whose name is more informative than the
hardcoded values. However, the better solution (though not always
feasible) is to convert those files from ASCII to the native encoding on
that platform, as part of the process of porting them to that platform.
If a C implementation uses a non-ascii encoding when targeting that
platform, then it's likely to be the case that the local text oriented
utilities (such as file editors or browsers) will do so, as well.
 
Willem

James Kuyper wrote:
) On 11/09/2011 10:48 AM, Malcolm McLean wrote:
) ...
)> For instance IFF files have 4-letter ASCII tags which indicate what
)> sort of "chunk" you are reading. So the obvious thing to write is
)>
)> fread(chunk, 1, 4, fp);
)> if(!strncmp(chunk, "DATA", 4))
)> /* we've got a data chunk */
)>
)> That will break on a non-ascii system. The solution is to hardcode the
)> values. But then you can no longer read the word "DATA" and it becomes
)> a lot harder to see that the chunk identifier is correct.
)
) You can make it a macro, whose name is more informative than the
) hardcoded values. However, the better solution (though not always
) feasible) is to convert those files from ASCII to the native encoding on
) that platform,

They are not ASCII files. They are binary files with chunks that are
identified by a 4-byte header which has meaning when read as ASCII.


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
 
James Kuyper

James Kuyper wrote:
) On 11/09/2011 10:48 AM, Malcolm McLean wrote:
) ...
)> For instance IFF files have 4-letter ASCII tags which indicate what
)> sort of "chunk" you are reading. So the obvious thing to write is
)>
)> fread(chunk, 1, 4, fp);
)> if(!strncmp(chunk, "DATA", 4))
)> /* we've got a data chunk */
)>
)> That will break on a non-ascii system. The solution is to hardcode the
)> values. But then you can no longer read the word "DATA" and it becomes
)> a lot harder to see that the chunk identifier is correct.
)
) You can make it a macro, whose name is more informative than the
) hardcoded values. However, the better solution (though not always
) feasible) is to convert those files from ASCII to the native encoding on
) that platform,

They are not ASCII files. They are binary files with chunks that are
identified by a 4-byte header which has meaning when read as ASCII.

That makes it harder; the conversion utility would have to know about
the file format. It's still not impossible, but obviously far less
convenient.
 
BartC

James Kuyper said:
That makes it harder; the conversion utility would have to know about
the file format. It's still not impossible, but obviously far less
convenient.

You just use a macro or function such as:

if(!strncmp(chunk,ASCII("DATA"),4))

That's if you're worried that your program might not work on a non-ASCII C
system. On an ASCII one, then the function or macro will do nothing.
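
One possible shape for such a helper, as a sketch only. It differs from the ASCII("DATA") form above in that it writes into a caller-supplied buffer, and it translates only upper-case letters, mapping anything else to an ASCII space; both are assumptions made to keep the example short. The name `ascii_tag` is hypothetical. Because it looks each letter up by position, it gives the same bytes whatever the native character set is.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: writes the ASCII codes for a four-character tag,
   given in the native character set, into out. Only upper-case letters
   are translated; anything else becomes an ASCII space (an assumption
   made to keep the sketch short). */
void ascii_tag(const char *native, unsigned char out[4])
{
    static const char letters[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    size_t i;

    assert(native != NULL && strlen(native) == 4);
    for (i = 0; i < 4; i++) {
        const char *p = strchr(letters, native[i]);
        out[i] = p ? (unsigned char)(0x41 + (p - letters)) : 0x20;
    }
}
```

The result can then be compared against the bytes read from the file with memcmp(chunk, tag, 4), keeping the word "DATA" readable in the source.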
 
Malcolm McLean

That makes it harder; the conversion utility would have to know about
the file format. It's still not impossible, but obviously far less
convenient.
It would be easy enough to write such a utility for IFF files, because
they have a structure whereby you have a "chunk" length, and
identifier telling you what sort of chunk it is. So you can just skip
through all the chunks, changing the identifier tags from ASCII to
EBCDIC.

But then you'd have two file formats, identical except for the tags,
and the potential for extra costs and incompatibilities would be
large. A bit like the decision to encode newline/carriage return as
just a newline. It saved a byte, but to this day text files won't
display properly on Windows as a result.
 
Willem

James Kuyper wrote:
) That makes it harder; the conversion utility would have to know about
) the file format. It's still not impossible, but obviously far less
) convenient.

And it would likely go against the spec of the file format.

 
Tobias Blass

It would be easy enough to write such a utility for IFF files, because
they have a structure whereby you have a "chunk" length, and
identifier telling you what sort of chunk it is. So you can just skip
through all the chunks, changing the identifier tags from ASCII to
EBCDIC.

Wouldn't it be easier to use a text file, so the program checks for
"DATA" and you encode your file in EBCDIC for EBCDIC systems and in ASCII
for ASCII systems... (if you can't change the file format, well I liked
the function-like macro idea elsewhere in this thread)
But then you'd have two file formats, identical except for the tags,
and the potential for extra costs and incompatibilities would be
large. A bit like the decision to encode newline/carriage return as
just a newline. It saved a byte, but to this day text files won't
display properly on Windows as a result.

Well Windows developed after UNIX, so they could have adopted the \n
encoding if they wanted to. You could as well reverse your argument and
say "but to this day text files won't display properly on *NIX as a
result" (most *NIX utilities can handle \r\n encodings, though). I also
don't think \n was used to save a byte (CMIIW). \n is more "natural" (you
want a newline, so you add a newline character), but \r\n is more natural
if you are used to typewriters. Since typewriters are quite rare these
days I think the *NIX way makes more sense, but YMMV.
 
