baffled by 'new' operator with array type.

R

Ray Dillinger

I have the following type declarations, which I thought were quite simple;

const uint16_t SUBTREES = 64;
typedef struct stringnode *strpt; // a strpt is a pointer to a struct stringnode.
typedef strpt branchtype [SUBTREES]; // a branchtype is an array of SUBTREES strpt.


and the following function, which also seems simple. (I have added the line numbers
at the front so you can see which lines the error messages refer to).

132: // allocate and return a subtree, copied from argument within the (modular) range specified
133: branchtype *branchcopy(const branchtype * const subtrees, const int start, const int end){
134: branchtype *newbranch = new(branchtype);
135: int count;
136: for (count = start; count != end; count = (count + 1) % SUBTREES)
137: newbranch[count] = subtrees[count];
138: return(newbranch);
139: }

This function is supposed to allocate a new branchtype (via a pointer to it which is
named newbranch) and then copy a subrange of an existing branchtype (whose address
it gets via its argument subtrees) to it. The subrange copied is intended to be
'modular' in that if 'end' is less than 'start' it copies a subrange from 'start'
to the end of the array, then continues from the beginning of the array to 'end'.

When compiling this function, I get the following errors and I don't understand
what I've done wrong.

string3.cpp:133:41: error: cannot convert ‘stringnode**’ to ‘stringnode* (*)[64]’ in initialization
string3.cpp:136:38: error: invalid array assignment


I think that the first error message is baffling because it seems to imply that
new(branchtype) is returning something other than a pointer to branchtype. (ie,
in this case a pointer to an array of pointers at struct stringnodes). I can't
figure out how that could happen after reading the documentation of 'new'. I
could, but wouldn't like, to simply drop 'new' and use 'calloc' and static cast
instead.

The second error message is baffling because subtrees is a (const) pointer to (const)
branchtype and newbranch is very explicitly declared to be a pointer to branchtype,
so they are clearly the same type, and I'm trying to modify the contents of the non-
const one, not the const one. The error message is documented to mean I have
attempted to assign to an array (in violation of standard, I know), but AFAIK
there isn't anything wrong with assigning to a single element of an array. If
I understand my declarations correctly, the elements of branchtype are pointers,
not arrays. Why does it think I'm making an assignment to an array here instead
of an assignment to a pointer (which happens to be one element of an array)?

Was there something wrong with my declaration? Do these declarations not mean what
I think they mean?

Thank you for any help. I know that my style here is very C'ish and not idiomatic
C++. This code is intended to be a library that will eventually get rolled into
a 'ropes' object type (representing Unicode strings as broad, shallow trees) in C++,
but right now I'm still trying to get some fundamentals working correctly with
an absolute minimum overhead (and backporting to make a plain C library is also a
goal).

Bear
 
I

Ian Collins

Ray said:
I have the following type declarations, which I thought were quite simple;

const uint16_t SUBTREES = 64;
typedef struct stringnode *strpt; // a strpt is a pointer to a struct stringnode.
typedef strpt branchtype [SUBTREES]; // a branchtype is an array of SUBTREES strpt.


and the following function, which also seems simple. (I have added the line numbers
at the front so you can see which lines the error messages refer to).

132: // allocate and return a subtree, copied from argument within the (modular) range specified
133: branchtype *branchcopy(const branchtype * const subtrees, const int start, const int end){

General point, making the int parameters const is unnecessary.
134: branchtype *newbranch = new(branchtype);
135: int count;
136: for (count = start; count != end; count = (count + 1) % SUBTREES)
137: newbranch[count] = subtrees[count];
138: return(newbranch);
139: }

This function is supposed to allocate a new branchtype (via a pointer to it which is
named newbranch) and then copy a subrange of an existing branchtype (whose address
it gets via its argument subtrees) to it. The subrange copied is intended to be
'modular' in that if 'end' is less than 'start' it copies a subrange from 'start'
to the end of the array, then continues from the beginning of the array to 'end'.

When compiling this function, I get the following errors and I don't understand
what I've done wrong.

string3.cpp:133:41: error: cannot convert ‘stringnode**’ to ‘stringnode* (*)[64]’ in initialization

I think you are assuming using a typedef for an array makes it something
other than an array.
string3.cpp:136:38: error: invalid array assignment

Here you are trying to assign one array to another which is something
the language doesn't support.

Your should consider using std::vector rather than naked arrays and
pointers to them.
 
I

Ian Collins

Ian said:
Ray said:
I have the following type declarations, which I thought were quite simple;

const uint16_t SUBTREES = 64;
typedef struct stringnode *strpt; // a strpt is a pointer to a struct stringnode.
typedef strpt branchtype [SUBTREES]; // a branchtype is an array of SUBTREES strpt.


and the following function, which also seems simple. (I have added the line numbers
at the front so you can see which lines the error messages refer to).

132: // allocate and return a subtree, copied from argument within the (modular) range specified
133: branchtype *branchcopy(const branchtype * const subtrees, const int start, const int end){

General point, making the int parameters const is unnecessary.
134: branchtype *newbranch = new(branchtype);

I forgot to not the use of parenthesis here is unusual and somewhat
confusing. At first I thought you were attempting to use placement new!
135: int count;
136: for (count = start; count != end; count = (count + 1) % SUBTREES)
137: newbranch[count] = subtrees[count];
138: return(newbranch);

same here.
 
R

Ray Dillinger

I have the following type declarations, which I thought were quite simple;

const uint16_t SUBTREES = 64;
typedef struct stringnode *strpt; // a strpt is a pointer to a struct stringnode.
typedef strpt branchtype [SUBTREES]; // a branchtype is an array of SUBTREES strpt.
....clip ...
string3.cpp:133:41: error: cannot convert ‘stringnode**’ to ‘stringnode* (*)[64]’ in initialization
string3.cpp:136:38: error: invalid array assignment


I think that the first error message is baffling because it seems to imply that
new(branchtype) is returning something other than a pointer to branchtype.
While reading the documentation for new, you must have missed the part about
new of an array returning a pointer to the first element of the array (not
a pointer to the array).

Ah. I did in fact read right past that, assuming the entire array
would be allocated and a pointer to the first element would serve,
as in C, in place of a pointer to the array itself.

So, if I understand you correctly, the space for a branchtype is
being allocated, but the pointer returned from 'new' is typed
as strpt** rather than branchtype* and I need to static cast it to
branchtype* to get the semantics I want.

I need to use 'new' rather than 'calloc' because I need finalizers
and overloaded operators to run, and if I allocate using 'calloc'
they won't. In fact, if I didn't need finalizers, I would probably
be doing this in C. Hm. Also, promoting it from a bare array to
some minimal object class will be necessary to make finalizers
run, and it seems like that would be another way to fix the issue
with 'new'.
Due to the typedefs, the expression new(branchtype) is equivalent to
new(*stringnode[64]). A pointer to the first element of such an array is of
type **stringnode. The error message you get states that the value of type
**stringnode returned by the new expression, cannot be converted to what is on
the left hand side of the assignment operbator.

Okay, this makes some sense in terms of explaining the error I got.
It completely doesn't make sense as a language design decision, but
whatever. If I got hung up on things that don't make sense as design
decisions, there is NO programming language I'd be using.
Due to the typedefs, the type of the variable subtrees is stringnode*(*)[64],
the same as the type of newbranch given in the other message. The expression
stringnode[count]
is equivalent to
*(stringnode + count)
The type of (stringnode + count) is the same as that of stringnode (a pointer
to an array of 64 pointers to stringnode). Dereferencing that results in a
value whose type is an array of 64 pointers to stringnode. The assignment
statement is attempting to assign an array of 64 pointers to stringnode.

Aaand, that isn't what I wanted to do. Thank you. Now that I get that,
I can just rewrite it using one more * dereference operator and refer to
a pointer rather than an array of pointers.
In C++, as in C, arrays are treated differently than non-arrays. Perhaps, you
mistakenly think that arrays and non-arrays are treated the same.

I must admit, that's it exactly. I have hardly ever used a named type
that is simply a fixed-length array, and haven't been bitten by the
special semantics of the 'new' operator with respect to fixed-length
arrays prior to this, so I foolishly assumed that there was some
consistency with the way objects and unions are treated.

Thank you very much for your help. You've been very insightful.

Bear
 
R

Ray Dillinger

I forgot to not the use of parenthesis here is unusual and somewhat
confusing. At first I thought you were attempting to use placement new!

I'm aware that 'new' and 'return' work fine without parens. But I
still think of them as having arguments, so I write them the same way
as anything that has arguments. It doesn't mean anything, it's
just something I'm used to seeing and I do it in my own code because
it eliminates special cases from the syntax and therefore I have less
syntactic complication to parse when reading it.

It's very similar with 'const int' above; it's not necessary and
doesn't actually produce faster code, but I do it because I know
at the design stage that I won't need to change these things in
the function and that if I do it'll be because I'm making a mistake.
It's for my own convenience in reminding me what the design
parameters are when I go back six months from now to work on it.

I also frequently use parens in expressions where they're not really
necessary, simply because I want to make the operator precedence
visually apparent. It's for my own convenience in reading it later,
or to guard against possible momentary lapses about the precedence
levels of operators. It's not there for the compiler, it's there
because it makes it easier (for me) to read and analyze later and
be absolutely sure it's doing what I think it's doing.

I think every coder has things they do for their own clarity or as
a result of their design process, even though those things are not
necessary w/r/t the language itself. If you find them strange, that's
only because mine are different from yours.

Bear
 
S

SG

Am 06.07.2013 20:47, schrieb Ray Dillinger:
Ah. I did in fact read right past that, assuming the entire array
would be allocated and a pointer to the first element would serve,
as in C, in place of a pointer to the array itself.

So, if I understand you correctly, the space for a branchtype is
being allocated, but the pointer returned from 'new' is typed
as strpt** rather than branchtype*

Unless I overlooked something, the type of

new branchtype

should be strpt* because branchtype = strpt[64].
and I need to static cast it to
branchtype* to get the semantics I want.

A static_cast would not do (I think). And I would strongly discourage
you from using reinterpret_cast for this.

In my opinion, the better solution is to replace the raw array with
std::array and also to get rid of the name strpt. Hiding pointers just
makes the code less readable.

const unsigned max_subtrees = 64;
typedef std::array<stringnode*,max_subtrees> branchtype;
// ^
// we don't need "struct" here, this is C++, not C

Or just write a class for that:

struct branchtype {
stringnode* subtrees[max_subtrees];
};

Or

struct branchtype {
std::array<stringnode*,max_subtrees> subtrees;
};

However, I fail to see where this could be a tree structure since
branchtype != stringnode. Maybe a redesign is on order.

You could tell us what you're actually trying to solve with this. Maybe
then someone will come up with a better design.

Cheers!
SG
 
J

Jorgen Grahn

I'm aware that 'new' and 'return' work fine without parens. But I
still think of them as having arguments, so I write them the same way
as anything that has arguments. It doesn't mean anything, it's
just something I'm used to seeing and I do it in my own code because
it eliminates special cases from the syntax and therefore I have less
syntactic complication to parse when reading it.

Note though that what he's saying is that 'new(Foo)' is unusual and
confusing to him (and me, and probably many others). You want others.
not just yourself, to read the code well.

I think it's worth changing your style in this case.
It's very similar with 'const int' above; it's not necessary and
doesn't actually produce faster code, but I do it because I know
at the design stage that I won't need to change these things in
the function and that if I do it'll be because I'm making a mistake.
It's for my own convenience in reminding me what the design
parameters are when I go back six months from now to work on it.

This one (void foo(const int bar) ...) I agree with.
It's no different from using const further down in a function:

void foo(const int bar)
{
const Baz baz(bar, 42);
...
}

Just a statement to the reader: don't bother looking for a mutation of
this value further down into the function. Not necessary, but not
unusual and rather helpful, especially with (too) long functions.

/Jorgen
 
R

Ray Dillinger

Or just write a class for that:

struct branchtype {
stringnode* subtrees[max_subtrees];
};

Yep, that's what's going to happen.
However, I fail to see where this could be a tree structure since
branchtype != stringnode. Maybe a redesign is on order.

No, it's quite deliberate. So's not using std::array in this case.
You could tell us what you're actually trying to solve with this. Maybe
then someone will come up with a better design.

I'm implementing ropes.

Ropes are a string representation that's better for very large strings.

They are a tree structure that maximizes the use of shared data, so if
you insert something near the beginning of a several-hundred-megabyte
string, first you don't have to shift several hundred megabytes over
by a few spaces; instead you either insert a stringnode into your tree
or merge the insertion with the buffer controlled by an existing leaf
stringnode. Second, most linear-time string operations, when applied
to strings represented as ropes take logarithmic rather than linear
time, and string copy takes O(1) time. The only string operation that
doesn't benefit at scale is in-place mutation, which on ropes takes
logarithmic rather than constant time. And in the application we're
doing, in-place mutation is rare. Third, when you are using ropes the
original string still exists. You don't have to copy it, you just have
to retain a reference to its root node. The new string shares 99%+ of
its buffers at the leafnodes, (possibly 100%), also shares all the
stringnodes except the path from the tree root to the insertion point,
and even along that path is likely to share many of the branchtype
arrays that the stringnodes refer to.

stringnodes are copy-on-write; you don't mutate them, ever. You just
make a new one that's filled in slightly differently. Stringnodes
internal to the tree have a pointer to a branchtype plus a startpoint
and endpoint that tells them which branches of that branchtype array
are part of their own subtree. This allows branchtypes to be shared
between stringnodes; because no tree ever includes null branches, you
can overwrite a null pointer in a branchtype, make a stringnode with
a different startpoint or endpoint for the insertion, and you haven't
changed the value of any other string that is using that set of
branches because the startpoint and endpoint recorded in the other
string's stringnode which points at that set of branches are unchanged.
When you can't do what you need to do by overwriting a null pointer,
you allocate a new branchtype, copy the part of the array you're using
over to that, (which is the function I gave) and overwrite whatever
nulls you need to overwrite in the new branchtype.

The leaf nodes have buffers of wchar_t rather than subtrees of branchtype,
but otherwise they work the same way. Because strings do not contain
null characters, You can overwrite null characters in the leaf buffers
without changing the content of any other leaf node using that buffer,
because that other leaf node will retain its own start and end points.

You can remove characters from the string by splitting a leafnode into
two new leaf nodes that refer to disjoint parts of the same buffer, still
without changing the contents of any other string even if it's using
that buffer. But when you have to overwrite a non-null character, you
create a new buffer and copy just the part you want in it, then proceed
to overwrite the null characters that you need to in the new buffer.

Ropes rely on garbage collection of nodes, branchtypes, and buffers
no longer in use, which means all of these things need to be child
classes of a 'GC-object' class.

I'm not using std::array specifically because it doesn't support the
shared-use semantics (ie, it mutates in place without regard to what
it's overwriting) and isn't garbage collected. Also because I'll
have to maintain this in the future, and when maintaining code I want
error messages that refer to code that's actually in the files I'm
looking at rather than reams of template gobbledegook.

And anyway, the point of this code is to replace the use of std::string
in a project where it is no longer a good choice because the strings
that we are working with are orders of magnitude larger than the
implementers of std::string anticipated, the lack of shared-memory on
strings is now consuming gigabytes and won't scale, and the mix of
operations includes frequent copying and frequent length-changing
mutations, which are both O(N) with std::strings but O(1) and O(logN)
respectively on rope strings.

Bear
 
R

Ray Dillinger

Have you seen the SGI rope implementation?


Hmmm. That's quite good. It's not indexed as well as the project wants,
but it may be easier to instantiate and modify than develop new. It's worth
at least a few hours of kicking around to see.

Thanks for the pointer.

Bear
 
R

Ray Dillinger


Hm. It's good work for what it is, but not what I need.

Uses a binary tree, so it'll have to do about six times more
memory fetches to get to a particular position, and tree
balance becomes an issue much more easily; The broad flat
nodes I had spec'd (64 children) require more CPU and more
complex insert/delete code, but make better use of caches
and data prefetches, permit more data sharing, and make tree
balancing nearly-trivial. Still, compared to the current
runtime costs that's all acceptable.

Singly indexed, in codepoints. That's like the string
implementation I'm trying to replace, so wouldn't be a
particularly hard sell in terms of compatibility with current
code, but making codepoints visible at the interface level
would be a career-limiting move right now. There is a big
push on here for 'early normalization' and 'character-only
semantics' which is a mandate for all text functionality.
The codepoint interface I could use for binary-storage
classes like 'blob', but for text a character interface
is required.

And it can't be just a 'view' on the codepoint string,
because when someone wants to go to the 56830'th character,
the whole point of the rope implementation is that you
don't have to count them from the front of the string.
So the class itself has to support character indexes as
well as codepoint indexes.

So "character" is supposed to mean a character as
understood by copywriters, meaning what the Unicode
Standard calls a grapheme cluster (or one of a limited
class of control characters including tabs and newlines).
All operations must have "character" semantics rather
than "codepoint" semantics.

However, this would be not too hard to add. It would take
another index field for characters, and straightforward
modifications in all functions that work directly with
buffers.

Non-Normalized sequences can exist. This is another violation
of the 'character-only semantics' push. Still, if I'm going
to be elbows-deep in it for purposes of providing character
indexes and putting up character versions of all the interface
functionality and preventing splits on codepoint boundaries
that aren't also character boundaries, adding a few calls to
check and perform normalization on buffers is trivial extra
work.

This problem is an artifact of the attempt to be character-
set agnostic that the templatized library is doing; it
necessarily provides no support for semantics not common
to all supported character sets (meaning, approximately, no
support for any semantics at all) and wants to be agnostic
about what a codepoint unit means (meaning, approximately,
that it is a valid representation of a binary string
masquerading as a representation of text). We're committing
to Unicode and want to take full advantage of it, so that's
not good enough anymore.

But here's the killer, which really would be too hard to
fix, especially in combination with the above. It has
only two options for garbage collection and neither is
compatible as-is with the runtime in which this implementation
will work. We can't let it do its own refcounting because
shared-representation copying exists at a higher level than
just strings sharing representation, and can result in
copied environments that need different refcounts for the
"same" data. For the same reason, our allocation does not
work in the way their other option assumes and requires.
 
B

Bart van Ingen Schenau

On Sun, 07 Jul 2013 10:02:37 -0700, Ray Dillinger wrote:

[quotes rearranged]
So "character" is supposed to mean a character as understood by
copywriters, meaning what the Unicode Standard calls a grapheme cluster
(or one of a limited class of control characters including tabs and
newlines). All operations must have "character" semantics rather than
"codepoint" semantics.
Singly indexed, in codepoints. That's like the string implementation
I'm trying to replace, so wouldn't be a particularly hard sell in terms
of compatibility with current code, but making codepoints visible at the
interface level would be a career-limiting move right now. There is a
big push on here for 'early normalization' and 'character-only
semantics' which is a mandate for all text functionality. The codepoint
interface I could use for binary-storage classes like 'blob', but for
text a character interface is required.

And it can't be just a 'view' on the codepoint string, because when
someone wants to go to the 56830'th character, the whole point of the
rope implementation is that you don't have to count them from the front
of the string. So the class itself has to support character indexes as
well as codepoint indexes.

Before you commit yourself firmly to offering "character" indices, have
you considered that your characters (grapheme clusters) are inherently
variable length, even if you ensure the text is always normalised
according to NFC or NFKC?

For example, a w with an accent acute (wˋ U+0077 U+02CA) does not have a
pre-composed codepoint in Unicode, so it will always be represented by a
sequence of two codepoints. If you say that this particular combination
will never occur in a real text, then I would agree, but I would expect a
good tester to come up with something silly like this and I can't rule
out that real texts in some languages also contain grapheme clusters that
are composed of multiple codepoints.

Bart v Ingen Schenau
 
R

Ray Dillinger

Before you commit yourself firmly to offering "character" indices, have
you considered that your characters (grapheme clusters) are inherently
variable length, even if you ensure the text is always normalised
according to NFC or NFKC?

For example, a w with an accent acute (wˋ U+0077 U+02CA) does not have a
pre-composed codepoint in Unicode, so it will always be represented by a
sequence of two codepoints. If you say that this particular combination
will never occur in a real text, then I would agree, but I would expect a
good tester to come up with something silly like this and I can't rule
out that real texts in some languages also contain grapheme clusters that
are composed of multiple codepoints.

Yes, that's kind of the point. If we were working with a character
set that could always be represented as single codepoints, the issue
would not have come up. We would have done early normalization and
called it finished.

A copywriter, or a linguist, does not think of w with an acute
accent as being more than one character. It is only computer
programmers who think of codepoints as characters. If someone else
indexes into the string and gets just the w (or just the accent)
he will file a bug report. Likewise if he indexes into the string
beyond it and finds his character counts are wrong by one character.

Counting from the beginning of one buffer is acceptable; counting
from the beginning of the text is not. So the branch nodes all
need to know how many characters their buffers or sub-branches
store, not just how many codepoints.

Bear
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,767
Messages
2,569,572
Members
45,046
Latest member
Gavizuho

Latest Threads

Top