Is this String class properly implemented?


Juha Nieminen

There's quite a lot wrong with your program, but I'll note this:

String String::operator =(String& str) {
    return str;
}

You are breaking a contract here. operator=() should make this object
equal to the given object, but you are, in fact, doing absolutely
nothing. Thus your 'String' class is horribly broken and cannot be used
e.g. in STL containers.

Also operator=() should always return a reference to *this.

bool String::operator ==(const String* s) const {
    return this == s;
}

You are also breaking a contract here (or at the very least
convention). You are not comparing the contents of the strings, but the
pointers themselves. Two strings will compare unequal even if they
contain the exact same contents.
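
As a sketch of the conventional semantics, assuming the wchar_t* 'value' and
int 'size' members the class uses (and ignoring everything else about it):

#include <algorithm>

// Sketch only: conventional copy assignment and equality for a class
// holding a wchar_t* 'value' and an int 'size', as in the posted code.
class String {
    wchar_t* value;
    int size;
public:
    String& operator=(const String& str) {
        if (this != &str) {                        // guard against self-assignment
            wchar_t* copy = new wchar_t[str.size];
            std::copy(str.value, str.value + str.size, copy);
            delete[] value;
            value = copy;
            size = str.size;
        }
        return *this;                              // return a reference to *this
    }
    bool operator==(const String& s) const {
        return size == s.size
            && std::equal(value, value + size, s.value);  // compare contents, not pointers
    }
};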
 

James Kanze

* James Kanze:

[...]
It can improve efficiency to have it, when it's done repeatedly.
However, I think it can be much more clear to /explicitly/
obtain a mutable string or string buffer for that.
Then there's no surprise about the O(n) (for n length of a)
lurking in there; it's then clear that it may occur in
explicitly obtaining the mutable string / buffer.

I wasn't thinking in terms of efficiency. From the very start,
I had a String and a StringBuilder class---you used the latter
to "construct" strings from characters.

When I introduced the above syntax into my String class, the
users found it very clever. I'd just learned about proxies, and
I found it very clever, too. Which pleased me at the time;
today, I tend to be suspicious of code which is too clever. On
the other hand, strings are so omnipresent that you can accept a
bit more irregularity in them, if it provides notational
convenience. (My strings used half open intervals, rather than
start position and length, and interpreted a negative value as
"from the end", so that a.substr( -3 ) returned a proxy
representing the last three characters in the string. Another
slightly too clever idea, but an enormous notational
convenience.)
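
For what it's worth, the index arithmetic behind that convention is easy
enough to sketch (a hypothetical free function over std::wstring, not the
original proxy class):

#include <string>

// Hypothetical sketch: half-open [begin, end) indices, where a negative
// index counts from the end of the string.
std::wstring slice(std::wstring const& s, long begin, long end) {
    long n = static_cast<long>(s.size());
    if (begin < 0) begin += n;                 // -3 becomes n - 3
    if (end < 0) end += n;
    if (begin < 0) begin = 0;
    if (end > n) end = n;
    return begin < end ? s.substr(begin, end - begin) : std::wstring();
}

// slice(a, -3, static_cast<long>(a.size())) then denotes the last three
// characters of a.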
 

Jorgen Grahn

I am new to the C++ language; I implemented this String class just for
practising.
I want to know what to improve to make it *qualified* to be used in
real programs,

Documentation stating what a String represents, what purpose it has.
The word "string" can have many subtly different meanings.

but with that said:
or what code should be changed to make it more
efficient... yeah, I know you will recommend me to use std::string,
(-: just practising...

Curiously, you have conversions from a lot of different types, but not
from std::basic_string itself. That's what I like the least, I think
-- that your class isolates itself from the rest of the C++ world.
(From algorithms, locales and iostreams too, I think.)

Personally, I would never attempt to write a general, reusable
container class. It seems fun and easy, but it's better left to
experts IMHO.
Here's the code:
#ifndef STRING_H_
#define STRING_H_
#include "Core.h"

class String {
private:
    wchar_t* value;
    int size;

public:
    String();
    String(const wchar_t* src);
    String(const wchar_t* src, int offset, int count);
    String(String& s);
    String(const int& num);
    String(const long& l);
    String(const float& f);
    String(const double& d);
    String(const wchar_t& c);

explicit! These constructors are a disaster waiting to happen.
Been there, done that ...
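
A minimal sketch of the kind of accident a non-explicit converting
constructor invites (hypothetical Str class, not the one quoted above):

// Hypothetical example: with a plain Str(int), print(5) would compile and
// silently build a string of "length 5"; declaring it explicit makes the
// conversion visible at the call site.
#include <iostream>

struct Str {
    explicit Str(int n) { std::cout << "string of length " << n << "\n"; }
    Str(const wchar_t*) { std::cout << "string from literal\n"; }
};

void print(const Str&) {}

int main() {
    // print(5);       // no longer compiles once Str(int) is explicit
    print(Str(5));     // the int -> Str conversion now has to be spelled out
    print(L"hello");   // implicit conversion from a literal is usually wanted
    return 0;
}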

/Jorgen
 

Tony

James said:
No. About the only use it could have is as an example of how
not to do the job. It fails on implementations not using ASCII,
for example, and it fails for English words like "naïve".

7-bit ASCII is your friend. OK, not *your* friend maybe, but mine for sure!
 

James Kanze

Tony wrote:
7-bit ASCII is your friend. OK, not *your* friend maybe, but
mine for sure!

7-bit ASCII is dead, as far as I can tell. Certainly none of
the machines I use use it. My (very ancient) Sparcs use ISO
8859-1, my Linux boxes UTF-8, and Windows UTF-16LE.

The reason is simple, of course: 7-bit ASCII (or ISO 8859-1,
for that matter) doesn't suffice for any known language.
English, for example, normally distinguishes between opening and
closing quotes---an encoding which doesn't make this distinction
isn't usable for general purpose English. And of course,
regardless of the language, as soon as your program has to deal
with things like people's names, you need to deal with an
incredible number of accents.

Of course, I'm talking here about real programs, designed to be
used in production environments. If your goal is just a Sudoku
solver, then 7-bit ASCII is fine.
 

James Kanze

English also has words with accented vowels, such
as http://www.merriam-webster.com/dictionary/naivete

I know. I cited naïve myself.

In the past, all (or at least most) languages have made
compromises for typewritten text; a typewriter only has so many
keys, and each key can only produce two characters. (I don't
know how CJK languages handled this.) And each key would only
advance the carriage a fixed distance (if it advanced it at
all). So you end up with English without accents, with no
distinction between opening and closing quotes; French without
the oe ligature or accents on the capital letters; both French
and German without correct quotes; etc. Such things aren't
really acceptable today, however, since computers don't have
these restrictions. Roughly speaking, if you aren't using fixed
width fonts, you should be doing the rest of the typesetting
correctly as well, and that means (in English) naïve and déjà vu
with accents, distinct opening and closing quotes, and so on.
 

Juha Nieminen

James said:
In the past, all (or at least most) languages have made
compromises for typewritten text; a typewriter only has so many
keys, and each key can only produce two characters. (I don't
know how CJK languages handled this.) And each key would only
advance the carriage a fixed distance (if it advanced it at
all). So you end up with English without accents, with no
distinction between opening and closing quotes; French without
the oe ligature or accents on the capital letters; both French
and German without correct quotes; etc. Such things aren't
really acceptable today, however, since computers don't have
these restrictions. Roughly speaking, if you aren't using fixed
width fonts, you should be doing the rest of the typesetting
correctly as well, and that means (in English) naïve and déjà vu
with accents, distinct opening and closing quotes, and so on.

The main problem is choosing a character set and encoding, and that
choice introduces its own problems in programs.

Unicode seems to be the de-facto standard nowadays, but it's still far
from easy to write programs which handle all possible Unicode
characters without problems. Even if you used raw Unicode code points
as your internal encoding (using 4-byte-wide characters, i.e. the
UTF-32 encoding), you are still going to stumble across problems.
Something as simple-sounding as "advance 10 characters forward" is not
simple with Unicode even when each code point has been allocated a
fixed number of bytes, because some Unicode values don't represent
independent characters at all: you can have compound characters
composed of two Unicode values (which means that advancing one
character forward would mean skipping *two* values rather than one).

Problems become more complicated if you want to use a less verbose
encoding to save memory, such as UTF-8 (optimal for most western
languages) or UTF-16 (optimal e.g. for Japanese and other languages heavy
in non-ASCII characters). Advancing forward in a piece of text becomes a
challenge.
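
For instance, a sketch of nothing more than "advance by code points" in
UTF-8 (and even this still counts combining marks separately, which is
exactly the problem described above):

#include <cstddef>
#include <string>

// Sketch only: advance n code points (not bytes, and not user-perceived
// characters) in a UTF-8 string, assuming the input is well formed.
std::size_t advance_code_points(const std::string& utf8, std::size_t pos, std::size_t n) {
    while (n > 0 && pos < utf8.size()) {
        ++pos;                                              // step over the lead byte
        while (pos < utf8.size()
               && (static_cast<unsigned char>(utf8[pos]) & 0xC0) == 0x80)
            ++pos;                                          // skip 10xxxxxx continuation bytes
        --n;
    }
    return pos;
}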
 

Tony

James said:
7-bit ASCII is dead, as far as I can tell. Certainly none of
the machines I use use it.

It's an application-specific thing, not a machine-specific thing.
My (very ancient) Sparcs use ISO
8859-1, my Linux boxes UTF-8, and Windows UTF-16LE.

The reason is simple, of course: 7-bit ASCII (or ISO 8859-1,
for that matter) doesn't suffice for any known language.

Um, how about the C++ programming language!
Of course, I'm talking here about real programs, designed to be
used in production environments. If your goal is just a Sudoku
solver, then 7-bit ASCII is fine.

Of course compilers and other software development tools are just toys. The
English alphabet has 26 characters. No more, no less.
 

Jerry Coffin

[ ... ]
Um, how about the C++ programming language!

Sorry, but no. If you look at §2.10, you'll see "universal-character-
name", which allows one to generate names using characters that don't
fall within the ASCII character set (or ISO 8859, for that matter). It's
_possible_ to encode the source code of a C++ program using only the
characters in (for one example) ISO 646, but it's painful at best.
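
For example, a universal-character-name spells the code point directly,
whatever encoding the source file itself happens to use:

// "naïve" written without any non-ASCII byte in the source file:
const wchar_t* word = L"na\u00EFve";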

It's a bit hard to say much about ASCII per se -- the standard has been
obsolete for a long time. Even the organization that formed it doesn't
exist any more.
Of course compilers and other software development tools are just toys.

You do have something of a point -- if you restrict your target audience
sufficiently, you can also restrict some of what it supports (such as
different character sets).
The English alphabet has 26 characters. No more, no less.

Unfortunately statements like this weaken your point. By any reasonable
measure, the English alphabet contains at least 26 characters (upper and
lower case). Of course, even other western European languages like French
and German require characters that aren't present in the English
alphabet, and the last I heard there were also at least a _few_ people
in places like China, Korea, Japan, the Arabian Peninsula, etc. -- and
most of them use languages in which the characters aren't even similar
to those in English.
 

Jerry Coffin

[ ... ]
Unfortunately statements like this weaken your point. By any reasonable
measure, the English alphabet contains at least 26 characters (upper and
lower case).

Oops -- of course that should have been "52" rather than 26.
 

James Kanze

It's an application-specific thing, not a machine-specific
thing.

That's true to a point---an application can even use EBCDIC,
internally, on any of these machines. In practice, however,
anything that leaves the program (files, printer output, screen
output) will be interpreted by other programs, and an
application will only be usable if it conforms to what these
programs expect.

Which isn't necessarily a trivial requirement. When I spoke of
the encodings used on my machines, I was referring very precisely
to those machines, when I'm logged into them, with the
environment I set up. Neither pure ASCII nor EBCDIC are
options, but there are a lot of other possibilities. Screen
output depends on the font being used (which as far as I know
can't be determined directly by a command line program running
in an xterm), printer output depends on what is installed and
configured on the printer (or in some cases, the spooling
system), and file output depends on the program which later
reads the file---which may differ depending on the program, and
what they do with the data. (A lot of programs in the Unix
world will use $LC_CTYPE to determine the encoding---which means
that if you and I read the same file, using the same program, we
may end up with different results.)
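
As an illustration of that convention, this is roughly how a Unix
command-line tool discovers the encoding from the environment (POSIX
nl_langinfo, nothing from standard C++ beyond setlocale):

#include <clocale>
#include <cstdio>
#include <langinfo.h>    // POSIX, not part of standard C++

int main() {
    std::setlocale(LC_CTYPE, "");                        // "" = take it from the environment
    std::printf("codeset: %s\n", nl_langinfo(CODESET));  // e.g. "UTF-8" or "ISO-8859-1"
    return 0;
}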
Um, how about the C++ programming language!

C++ accepts ISO/IEC 10646 in comments, string and character
literals, and symbol names. It allows the implementation to do
more or less what it wants with the input encoding, as long as
it interprets universal character names correctly. (How a good
implementation should determine the input encoding is still an
open question, IMHO. All of the scanning tools I write use
UTF-8 internally, and I have transcoding filebuf's which convert
any of the ISO 8859-n, UTF-16 (BE or LE) or UTF-32 (BE or LE)
into UTF-8. On the other hand, all of my tools depend on the
client code telling them which encoding to use; I have some code
floating around somewhere which supports "intelligent guessing",
but it's not really integrated into the rest.)
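
The ISO 8859-1 direction is the simplest of those conversions; a rough
sketch (the UTF-16 and UTF-32 cases also need byte-order and surrogate
handling, omitted here):

#include <string>

// Sketch only: ISO 8859-1 to UTF-8. Every 8859-1 byte is the code point
// of the same value, so bytes >= 0x80 become a two-byte UTF-8 sequence.
std::string latin1_to_utf8(const std::string& in) {
    std::string out;
    for (std::string::size_type i = 0; i != in.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(in[i]);
        if (c < 0x80) {
            out += static_cast<char>(c);
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));
            out += static_cast<char>(0x80 | (c & 0x3F));
        }
    }
    return out;
}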
Of course compilers and other software development tools are
just toys. The English alphabet has 26 characters. No more, no
less.

C, C++, Java and Ada all accept the Unicode character set, in
one form or another. (Ada, and maybe Java, limit it to the
first BMP.) I would think that this is pretty much the case for
any modern programming language.
 

Richard Herring

Tony said:
Jerry Coffin wrote:
[...]
It's a bit hard to say much about ASCII per se -- the standard has
been obsolete for a long time. Even the organization that formed it
doesn't exist any more.

Oh? Is that why such care was taken with the Unicode spec to make sure that
it mapped nicely onto ASCII?

Or ISO-8859?

[...]
Fine, upper and lower case then. But no umlauts or accent marks!

How naïve. My _English_ dictionary includes déjà vu, gâteau and many
other words with diacritics.
That passage seems non-sequitur: the whole gist was "what if one has
established that English is an appropriate simplifying assumption?".

Then one still needs some diacritics. The ISO-8859 family has them;
ASCII doesn't.
 

James Kanze

But there is a huge volume of programs that can and do use
just ASCII text.

There is a huge volume of programs that can and do use no text.
However, I don't know of any program today that uses text in
ASCII; text is used to communicate with human beings, and ASCII
isn't sufficient for that.
I gave the example of development tools: parsers, etc.

Except that the examples are false. C/C++/Java and Ada require
Unicode. Practically everything on the network is UTF-8.
Basically, except for some historical tools, ASCII is dead.
Sure, the web isn't just ASCII, but that is just an
application domain. If that is the target, then I'll use
UnicodeString instead of ASCIIString. I certainly don't want
all the overhead and complexity of Unicode in ASCIIString
though. It has too many valid uses to have to be bothered with
a mountain of unnecessary stuff by being subsumed into the
"one size fits all" monstrosity.

As long as you're the only person using your code, you can do
what you want.
On that we agree 100%! That's the rationale for keeping
ASCIIString unaberrated.

I understand the rationale.
I don't get what you mean: an ASCII text file is still an
ASCII text file no matter what font the user chooses in
Notepad, e.g.

First, there is no such thing as an ASCII text file. For that
matter, under Unix, there is no such thing as a text file. A
file is a sequence of bytes. How those bytes are interpreted
depends on the application. Most Unix tools expect text, in an
encoding which depends on the environment ($LC_CTYPE, etc.).
Most Unix tools delegate display to X, passing the bytes on to
the window manager "as is". And all Unix tools delegate to the
spooling system or the printer for printing, again, passing the
bytes on "as is" (more or less---the spooling system often has
some code translation in it). None of these take into
consideration what you meant when you wrote the file.
Internally, the program is still working with ASCII strings,
assuming English is the language (PURE English that recognizes
only 26 letters, that is).

Pure English has accented characters in some words (at least
according to Merriam Webster, for American English). Pure
English distinguishes between opening and closing quotes, both
single and double. Real English distinguishes between a hyphen,
an en dash and an em dash.

But that's all irrelevant, because in the end, you're writing
bytes, and you have to establish some sort of agreement between
what you mean by them, and what the programs reading the data
mean. (*If* we could get by with only the characters in
traditional ASCII, it would be nice, because for historical
reasons, most of the other encodings encountered encode those
characters identically. Realistically, however, any program
dealing with text has to support more, or nobody will use it.)
Nor does it matter that the platform is Wintel where "behind
the scenes" the OS is all UTF-16.
(Aside Trivia: The "failure" of Sun has been attributed in
part to the unwillingness to move to x86 while "the industry"
went there. Very ancient indeed!).

Where did you get that bullshit? Sun does sell x86 processors
(using the AMD chip). And IBM and HP are quite successful with
their lines of non-x86 processors. (IMHO, where Sun went wrong
was in abandoning its traditional hardware market, and moving
into software adventures like Java.)
The application domain you reference is: Operating System.
Quite different from CSV text file parser.

I'm not referencing any application domain in particular.
Practically all of the Unix applications I know take the
encoding from the environment; those that don't use UTF-8 (the
more recent ones, anyway). All of the Windows applications I
know use UTF-16LE.

Do you think anyone would use MS Office or Open Office if they
only supported ASCII?
Your statement could be misleading even if you didn't intend
it to be. The "any known language.. blah, blah", is a
generalization that fits the real world,

Yes. That's where I live and work. In the real world. I
produce programs that other people use. (In practice, my
programs don't usually deal with text, except maybe to pass it
through, so I'm not confronted with the problem that often. But
often enough to be aware of it.)
but software programs eventually are just "zeros and ones".

Not really. Programs assign semantics to those ones and zeros.
Even at the hardware level---a float and an int may contain the
same number of bits, but the code uses different instructions
with them. Programs interpret the data.
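
A small illustration (assuming a 32-bit int and IEEE 754 float):

#include <cstdio>
#include <cstring>

int main() {
    float f = 1.0f;
    unsigned int bits;
    std::memcpy(&bits, &f, sizeof bits);                // same bytes, different interpretation
    std::printf("%f is stored as 0x%08X\n", f, bits);   // 0x3F800000 on IEEE 754 machines
    return 0;
}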

Which brings us back to my point above---you don't generally
control how other programs are going to interpret the data you
write.
The above from you is an odd perspective noting that in
another thread you were trying to shoehorn something with,
logically, magnitude and direction into a signed integral
type.

Sorry, I don't know what you're talking about.
That's a good expansion point. Let's look at the constituents...
Comments and Symbols: If you want to program in French or
7-bit kanji (The Matrix?), have at it.

I've already had to deal with C with the symbols in Kanji. That
would have been toward the end of the 1980s. And I haven't seen
a program in the last ten years which didn't use symbols and
have comments in either French or German.
I guarantee you that I'll never ever use/need 10646 comments
or symbols.

Fine. If you write a compiler, and you're the only person to
use it, you can do whatever you want. But there's no sense in
talking about it here, since it has no relevance in the real
world.
I'll be nice and call it a simplifying assumption but it's
really a "no brainer".
Literals: Not a problem for me, and can be worked around for
others (put in file or something: make it data because that's
what it is. Programming in French is hard).

No it's not. (Actually, the most difficult language to program
in is English, because so many useful words are reserved as key
words. When I moved to C++, from C, I got hit several times in
the code written in English, by things like variables named
class. Never had that problem with the French classe, nor the German
Klasse.)
Major advantage for me in programming: English is my primary
language!

It's one of my primary languages as well. Not the only one,
obviously, but one of them.
(Curb all the jokes please! ;P). Trying to extend programming
(as I know it) to other languages is not my goal. It may be
someone else's proverbial "noble" goal.
[snip... must one indicate snips?]
C, C++, Java and Ada all accept the Unicode character set,
in one form or another.
There's that operating system example again, which hardly applies
to most application development.

That has nothing to do with the operating system. Read the
language standards.
You are interfusing programming languages with the data that
they manipulate.

No. Do you know any of the languages in question? All of them
clearly require support for at least the first BMP of Unicode in
the compiler. You may not use that possibility---a lot of
people don't---but it's a fundamental part of the language.
(FWIW: I think that C++ was the first to do so.)
 

Jerry Coffin

[ ... ]
Fine, but for an environment or project that has determined that ASCII is
adequate, why in the world would they do that? (And moreso, why would anyone
ever do that?).

Who has ever determined that ASCII was adequate? ASCII was never
anything more than a stopgap -- a compromise between what was wanted,
and what you could reasonably support at a time when a machine with 32K
of RAM and (if you were really lucky) a 40 megabyte hard-drive needed to
support a few hundred simultaneous users because it cost well over a
million dollars.

ASCII has been obsolete for decades -- let it rest in peace.

Look up trigraphs and digraphs. They were invented specifically because
ISO 646 doesn't include all the characters normally used in C or C++
source code.
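
For example, the digraph spellings (digraphs are ordinary tokens; trigraphs
are handled in the preprocessor, and some compilers only substitute them
when asked to):

%:include <iostream>                 // %: is the digraph for #

int main()
<%                                   // <% and %> stand for { and }
    int a<:3:> = <% 1, 2, 3 %>;      // <: and :> stand for [ and ]
    std::cout << a<:0:> << '\n';
    return 0;
%>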

[ ... ]
Oh? Is that why such care was taken with the Unicode spec to make sure that
it mapped nicely onto ASCII? ASCII will never die. It is fundamental and
foundational and for lots of programs, complete (read: all that is
necessary).

You're right about one or two points, but not in the way you think. For
example, it's true that ASCII won't die -- but only because it's already
been dead and buried for decades. Unicode and ISO 10646 weren't written
particularly to be compatible with ASCII -- they were written to be
compatible with the common base area of ISO 8859. Claiming that's
"ASCII" does nothing more than display ignorance of both standards.

[ ... ]
There is a large set of programs that fall in that category.

I suppose that depends on how you define "large". My immediate guess
would be that it's a single-digit percentage.

[ ... ]
That passage seems non-sequitur: the whole gist was "what if one has
established that English is an appropriate simplifying assumption?".

Quite the contrary -- the point was that IF you've determined that you
can use only a subset of the English alphabet, that's fine -- but you can
almost never determine any such thing.
 

James Kanze

[ ... ]
Oh? Is that why such care was taken with the Unicode spec to
make sure that it mapped nicely onto ASCII? ASCII will never
die. It is fundamental and foundational and for lots of
programs, complete (read: all that is necessary).
You're right about one or two points, but not in the way you
think. For example, it's true that ASCII won't die -- but only
because it's already been dead and buried for decades. Unicode
and ISO 10646 weren't written particularly to be compatible
with ASCII -- they were written to be compatible with the
common base area of ISO 8859. Claiming that's "ASCII" does
nothing more than display ignorance of both standards.

And the common base area of ISO 8859 was compatible with ASCII.
Historically, this was an issue: when ISO 8859 was introduced,
we still wanted to be able to read and interpret existing files,
and even today, a file written using just the printable
characters from ASCII will encode the same in all of the ISO
8859 encodings and in UTF-8. A useful characteristic if you
want to determine the encoding from the contents of the file
(e.g. as in XML)---you limit the characters in the file to just
this small set until you've specified the encoding, and the
parsing code doesn't have to commit to the actual encoding until
after it has read the specification.
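
A rough sketch of that kind of pre-scan (simplified; the XML
specification's appendix lists more cases, including the UTF-32 forms):

#include <cstddef>

// Sketch only: classify the broad encoding family from the first bytes of
// the document, before the encoding="..." pseudo-attribute has been read.
const char* guess_encoding_family(const unsigned char* p, std::size_t n) {
    if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF) return "UTF-8 (with BOM)";
    if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF) return "UTF-16BE";
    if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE) return "UTF-16LE";
    if (n >= 4 && p[0] == 0x00 && p[1] == '<' && p[2] == 0x00 && p[3] == '?') return "UTF-16BE, no BOM";
    if (n >= 4 && p[0] == '<' && p[1] == 0x00 && p[2] == '?' && p[3] == 0x00) return "UTF-16LE, no BOM";
    return "UTF-8 / ISO 8859-n / other ASCII-compatible";  // "<?xml" encodes identically in all of these
}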
[ ... ]
There is a large set of programs that fall in that category.
I suppose that depends on how you define "large". My immediate
guess would be that it's a single-digit percentage.

Of those programs dealing with text. If you include all
programs, I suspect that most programs (e.g. the one which
controls the ignition in your car) don't use any character data
at all, so strictly speaking, they don't need more than plain
ASCII (since they don't even need that).

Of course, that's totally irrelevant to the argument about which
encoding to use for text data. (For what it's worth, I've seen
more EBCDIC in the last ten years than I've seen ASCII.)
 

Jerry Coffin

[ ... ]
And the common base area of ISO 8859 was compatible with ASCII.

That depends on exactly what you mean by "compatible with". It's not
identical to ASCII though. For one example, in ASCII character 96 is a
reverse quote, but in ISO 8859 it's a grave accent.

I suppose you can argue that those are the same thing if you want --
none of the encoding standards makes any requirement about the glyphs
used to display a particular character, so they could perfectly well be
displayed with identical glyphs. Nonetheless, the two do not share the
same intent.

[ ... ]
Of those programs dealing with text. If you include all
programs, I suspect that most programs (e.g. the one which
controls the ignition in your car) don't use any character data
at all, so strictly speaking, they don't need more than plain
ASCII (since they don't even need that).

Well, yes - given that the discussion was about text encoding, I treated
the universe as programs that work with encoded text in some way.
 

Alf P. Steinbach

* Jerry Coffin:
[ ... ]
And the common base area of ISO 8859 was compatible with ASCII.

That depends on exactly what you mean by "compatible with". It's not
identical to ASCII though. For one example, in ASCII character 96 is a
reverse quote, but in ISO 8859 it's a grave accent.

I'm sorry but as far as I know that's BS. :)

Would be nice to know where you picked up that piece of disinformation, though.

Or whether we're all ("we" = me, Wikipedia, James, etc.) wrong...

I suppose you can argue that those are the same thing if you want --
none of the encoding standards makes any requirement about the glyphs
used to display a particular character, so they could perfectly well be
displayed with identical glyphs. Nonetheless, the two do not share the
same intent.

On the contrary, AFAIK the intent of ISO 8859-1 was to contain ASCII sans the
control characters directly as a subset.


Cheers,

- Alf
 

Jerry Coffin

[email protected] said:
I'm sorry but as far as I know that's BS. :)

Have you looked at both specifications to find out? Have you even looked
at one of them?
Would be nice to know where you picked up that piece of disinformation, though.

It would be nice to know exactly what convinces you that it's
disinformation, and particularly whether you have any authoritative
source for the claim. Wikipedia certainly doesn't qualify, and as much
respect as I have to James, I don't think he does either. It would
appear to me that the only authoritative sources on the subject are the
standards themselves -- and your statement leads me to doubt that you've
consulted them in this case.
 
