Aliasing in C++11

molw5.iwg

Based on my reading of the standard the compiler is free to assume a pointer to
a strongly typed enumeration aliases only other pointers to the same type and
raw character pointers. For example, I would like to define a more restrictive
byte array for interacting with binary data as follows:

enum byte : uint8_t {};
std::vector <byte> buffer;

Specifically, I believe string-like classes could benefit greatly from this
sort of implementation by defining their internal state using definitions
similar to the above:

class string
{
...
private:
enum byte : char {};
std::unique_ptr <byte[]> buffer;
};
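
A minimal sketch of the effect being asked about, assuming a compiler that applies type-based alias analysis (for example GCC with -fstrict-aliasing). The names fill_chars and fill_bytes are hypothetical. Both functions behave identically; they differ only in which loads the optimizer is permitted to hoist out of the loop.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

enum byte : std::uint8_t {};

// A store through char* may alias any object, including *count, so the
// compiler has to assume *count can change after every store.
inline void fill_chars(char* dst, const std::size_t* count) {
    assert(dst != nullptr && count != nullptr);
    for (std::size_t i = 0; i < *count; ++i)
        dst[i] = 'x';
}

// byte is a distinct enumeration type, so a store through byte* is not a
// permitted way to modify a std::size_t object; the compiler is free to
// load *count once before the loop.
inline void fill_bytes(byte* dst, const std::size_t* count) {
    assert(dst != nullptr && count != nullptr);
    for (std::size_t i = 0; i < *count; ++i)
        dst[i] = static_cast<byte>('x');
}
```

Whether a given compiler actually takes advantage of this can only be confirmed by inspecting the generated code.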

The tests I've performed with GCC support the above interpretation. Does a
strict reading of the standard support it, and/or is there some other
well-known alternative? Thanks in advance,

-molw5
 
Öö Tiib

Based on my reading of the standard the compiler is free to assume a pointer
to a strongly typed enumeration aliases only other pointers to the same type
and raw character pointers. For example, I would like to define a more
restrictive byte array for interacting with binary data as follows:

enum byte : uint8_t {};

No, you are wrong. Yours is a traditional enum with an underlying type. This
is a strongly typed enum:

enum class byte : uint8_t {};
std::vector <byte> buffer;

Even with your 'byte' it is definitely more restrictive.

Specifically, I believe string-like classes could benefit greatly from this
sort of implementation by defining their internal state using definitions
similar to the above:

class string
{
...
private:
enum byte : char {};
std::unique_ptr <byte[]> buffer;
};

Perhaps do not write yet another text-containing class. The market is full;
there are far too many of them.

The tests I've performed with GCC support the above interpretation – does a
strict reading of the standard support the above, and/or is there some other
well-known alternative? Thanks in advance,

What is the "this"? It should work. Currently most people use std::string
(that actually contains UTF-8 encoded text) for storing text. I fully
agree with you that it is a loose and unsafe thing. However, it is unlikely
that some revolution is coming. Billions of lines of code and millions of
interfaces all over the world use std::string, and the problems are
consistently elsewhere.
 
molw5.iwg

No, you are wrong. Yours is a traditional enum with an underlying type. This
is a strongly typed enum:

enum class byte : uint8_t {};

Apologies - nevertheless, in this context the distinction is irrelevant (the enumeration has no members).

What is the "this"? It should work. Currently most people use std::string
(that actually contains UTF-8 encoded text) for storing text. I fully
agree with you that it is a loose and unsafe thing. However, it is unlikely
that some revolution is coming. Billions of lines of code and millions of
interfaces all over the world use std::string, and the problems are
consistently elsewhere.

It does work. In the context of serialization, however, writes to the buffer
almost always invalidate every other piece of cached state, as the underlying
char* pointer may alias everything (including the pointer itself). The compiler
is almost never able to inline to the point where it can resolve these sorts of
aliasing problems. I was asking whether another solution was commonly used to
define a raw character array (string, buffer, vector, what have you) with
stronger aliasing properties similar to the above. Clearly this solution is
C++11 specific, and I'd imagine others have attempted to address this problem
in the past.

Clearly the above could be used to define other primitive-equivalent types with
stronger aliasing properties, string is merely the most interesting as the
external interface need not change (as char* may still alias byte*). I believe
it would be possible to write a standard conforming string library that uses
such a byte definition, freeing the compiler to maintain state across writes to
the string; I'm not, at present, planning to write one myself.
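
A rough sketch of the idea, not a conforming string implementation: the class name byte_buffer and its members are hypothetical. Storage is held as the private enumeration type, while the external interface still speaks char. The reinterpret_cast is fine because a char lvalue may access the bytes of any object.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

class byte_buffer {
private:
    enum byte : char {};                 // internal storage type
    std::size_t size_;
    std::unique_ptr<byte[]> data_;

public:
    // Allocates n zero-initialized bytes.
    explicit byte_buffer(std::size_t n) : size_(n), data_(new byte[n]()) {}

    std::size_t size() const { return size_; }

    // Internal writes go through byte&, not char&.
    void set(std::size_t i, char c) {
        assert(i < size_);
        data_[i] = static_cast<byte>(c);
    }

    // The external view is unchanged: callers still see const char*.
    const char* data() const {
        return reinterpret_cast<const char*>(data_.get());
    }
};
```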
 
Öö Tiib

Apologies - nevertheless, in this context the distinction is irrelevant
(the enumeration has no members).

It is somewhat relevant. By the language rules a value of an enum class type
does not implicitly convert to values of integral types. A traditional enum
does. The lack of named enumerators does not actually matter, since an enum
with a fixed underlying type may hold all the values of that type regardless
of whether an enumerator for a particular value exists.
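
The two points above can be checked at compile time. This is a small sketch; the names plain_byte, scoped_byte, to_plain, to_scoped and value_of are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

enum plain_byte : std::uint8_t {};         // traditional enum, fixed underlying type
enum class scoped_byte : std::uint8_t {};  // enum class

// A traditional enum converts implicitly to an integral type; an enum class does not.
static_assert(std::is_convertible<plain_byte, int>::value, "unscoped converts");
static_assert(!std::is_convertible<scoped_byte, int>::value, "scoped does not");

// With a fixed underlying type, every value of that type is a valid value of
// the enumeration, even though no enumerator names it.
inline plain_byte to_plain(std::uint8_t v) { return static_cast<plain_byte>(v); }
inline scoped_byte to_scoped(std::uint8_t v) { return static_cast<scoped_byte>(v); }

// Reading the value back needs an explicit cast only for the enum class.
inline int value_of(plain_byte b) { return b; }
inline int value_of(scoped_byte b) { return static_cast<int>(b); }
```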

It does work. In the context of serialization, however, writes to the buffer
almost always invalidate every other piece of cached state, as the underlying
char* pointer may alias everything (including the pointer itself). The compiler
is almost never able to inline to the point where it can resolve these sorts of
aliasing problems. I was asking whether another solution was commonly used to
define a raw character array (string, buffer, vector, what have you) with
stronger aliasing properties similar to the above. Clearly this solution is
C++11 specific, and I'd imagine others have attempted to address this problem
in the past.

Lots of people certainly have. It is very likely that you can find something
already implemented. I, in fact, haven't. I use std::string for text and
std::vector<char> for byte buffers. I know it is unsafe, so I am more
careful. The reason I do it is that the majority of libraries and tools
support types like that. I would have to waste performance on conversions
when using something else.

Clearly the above could be used to define other primitive-equivalent types
with stronger aliasing properties, string is merely the most interesting
as the external interface need not change (as char* may still alias byte*).
I believe it would be possible to write a standard conforming string
library that uses such a byte definition, freeing the compiler to maintain
state across writes to the string; I'm not, at present, planning to write
one myself.

I feel you are correct that it is possible. However, writing a
standard-conforming string library does not seem to have any point.
The standard currently requires std::string to be externally as loose and
unsafe as it is. So the only thing possible is to make it internally more
efficient for a particular purpose, not safer. It is unlikely to make
a major difference in efficiency either, since there are already a lot of
different implementations of std::string floating around, as well as a lot
of other text-containing and text-managing libraries and classes for any
purpose imaginable.
 
molw5.iwg

It is somewhat relevant. By the language rules a value of an enum class type
does not implicitly convert to values of integral types. A traditional enum
does. The lack of named enumerators does not actually matter, since an enum
with a fixed underlying type may hold all the values of that type regardless
of whether an enumerator for a particular value exists.

Agreed – I'm still not seeing the relevance to this topic.

Lots of people certainly have. It is very likely that you can find something
already implemented. I, in fact, haven't. I use std::string for text and
std::vector<char> for byte buffers. I know it is unsafe, so I am more
careful. The reason I do it is that the majority of libraries and tools
support types like that. I would have to waste performance on conversions
when using something else.

Like I said – still looking for additional information. Thank you for the
response.

I feel you are correct that it is possible. However, writing a
standard-conforming string library does not seem to have any point.
The standard currently requires std::string to be externally as loose and
unsafe as it is. So the only thing possible is to make it internally more
efficient for a particular purpose, not safer. It is unlikely to make
a major difference in efficiency either, since there are already a lot of
different implementations of std::string floating around, as well as a lot
of other text-containing and text-managing libraries and classes for any
purpose imaginable.

The advantage is the compiler is able to maintain state across string writes,
as I mentioned above; that alters the performance of user code. Obviously the
impact is domain specific – what isn't?
 
Öö Tiib

The advantage is the compiler is able to maintain state across string writes,
as I mentioned above; that alters the performance of user code. Obviously the
impact is domain specific – what isn't?

I am still unsure why the compiler cannot already optimize away any aliasing
checks by simply assuming that you do not somehow use the underlying buffer
of the std::string or std::vector<char> in question as storage for some other
objects possibly involved in your domain-specific solution.
 
molw5.iwg

I am still unsure why the compiler cannot already optimize away any aliasing
checks by simply assuming that you do not somehow use the underlying buffer
of the std::string or std::vector<char> in question as storage for some other
objects possibly involved in your domain-specific solution.

I honestly don't know how to respond to that. Review the strict aliasing rules?
 
Öö Tiib

std::string does not contain UTF-8 encoded text. It contains chars. If
your implementation treats those chars as UTF-8 encoded characters, then
fine - but that is NOT part of the standard, it's just something that
*nix operating systems tend to do.

I did in fact describe the most widespread practice. By the C++ standard a
char is a byte; you can keep whatever encoding you like in it, and the
standard is silent on that. The other possibility is to use std::wstring for
text if wchar_t can hold UTF-16LE. It might help on Windows or with Qt as the
GUI. That is a minority anyway, maybe 20% of the C++ code written.

You might like to consider what happens when you resize a string to
remove part of a multibyte character. There's nothing there to make it
UTF safe...

There are no alternatives. These and all other such difficulties are normal
work. That is what developers are for.
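
As an illustration of the kind of work meant here, this is a sketch of a truncation helper that avoids cutting a multibyte sequence. It assumes the string already holds valid UTF-8; the names is_continuation and truncate_utf8 are hypothetical, and nothing like this is provided by std::string itself.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// UTF-8 continuation bytes have the bit pattern 10xxxxxx.
inline bool is_continuation(unsigned char c) { return (c & 0xC0) == 0x80; }

// Shrinks s to at most n bytes. If byte n would be the middle of a multibyte
// sequence, backs up to the start of that sequence so it is dropped whole.
inline void truncate_utf8(std::string& s, std::size_t n) {
    if (n >= s.size()) return;
    while (n > 0 && is_continuation(static_cast<unsigned char>(s[n])))
        --n;
    s.resize(n);
}
```

A plain s.resize(2) on the three-byte string "a\xC3\x96" leaves a dangling lead byte, whereas the helper above backs up and keeps only "a".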

I suspect this is why fstream::open takes a char* - someone assumed that
a char* was utf-8, and for those operating systems where a filename is
unicode it's broken.

I repeat: there is no serious support for Unicode in C++. fstream was
likely designed when no one thought that file names could be anything but
ASCII. UTF-8 is the most popular encoding. The majority of the HTML or other
XML you see on the internet is in it. So it makes sense to use something
that you do not have to convert.
 
Öö Tiib

I honestly don't know how to respond to that. Review the strict aliasing rules?

It all seems to be about storage obtained with malloc(). I feel that if you
use the underlying buffer of std::string or std::vector<char> for odd purposes
then you are on your own anyway. I can't find that a standard-compliant
compiler is required to expect that std::string::iterator and double* may
point to the same thing.

So ... what you do seems more and more domain-specific.
 
molw5.iwg

It all seems to be about storage obtained with malloc(). I feel that if you
use the underlying buffer of std::string or std::vector<char> for odd purposes
then you are on your own anyway. I can't find that a standard-compliant
compiler is required to expect that std::string::iterator and double* may
point to the same thing.

So ... what you do seems more and more domain-specific.

I don't know why I'm still replying to this. std::string::iterator contains a
raw character pointer or an offset into its buffer. The compiler is forced to
assume the write itself may alias double*.
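
A small sketch of the situation being described; the function name sum_around_write is hypothetical. As far as type-based aliasing goes, the store through a char lvalue is allowed to touch the bytes of *d, so unless the compiler can prove otherwise by some other means it has to load *d a second time.

```cpp
#include <cassert>
#include <string>

inline double sum_around_write(std::string& s, const double* d) {
    assert(!s.empty() && d != nullptr);
    double before = *d;
    s[0] = 'x';         // store through char&, which may alias any object
    double after = *d;  // cannot be folded into 'before' on type grounds alone
    return before + after;
}
```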
 
Öö Tiib

I don't know why I'm still replying to this. std::string::iterator contains a
raw character pointer or an offset into its buffer. The compiler is forced to
assume the write itself may alias double*.

std::string::iterator is nowhere required to contain ordinary raw character
pointers. Its members are not specified by the standard.
 
molw5.iwg

std::string::iterator is nowhere required to contain ordinary raw character
pointers. Its members are not specified by the standard.

No kidding? So I suppose the above byte definition could be used instead?
I'm sorry Tiib, I'm done – perhaps someone with more patience will be willing
to pick this up with you.
 
Öö Tiib

No kidding? So I suppose the above byte definition could be used instead?
I'm sorry Tiib, I'm done – perhaps someone with more patience will be willing
to pick this up with you.

That 'byte' of yours used internally in std::string::iterator? If someone
implementing a C++ standard library finds it beneficial, then easily. There is
no requirement whatsoever that there are pointers inside. An implementation
may use pointers, yes. However, it may just as well contain
implementation-internal things with whatever implementation-specific
attributes. The standard only specifies interface requirements for the
standard library.
 
Nobody

I suspect this is why fstream::open takes a char* - someone assumed that a
char* was utf-8, and for those operating systems where a filename is
unicode it's broken.

I assume that it's because fopen() takes a char*.

All widely-used OSes can reference (some) files using char*, even if it's
suboptimal (e.g. on Windows, only files whose names are valid in the
current codepage can be opened that way).

Making fstream::open() take e.g. a wchar_t* or std::wstring would be even
more broken on Unix than using char* is on Windows. Unix filenames are
just NUL-terminated sequences of bytes with no defined encoding.
 
Öö Tiib

C++ on Windows is only 20% of all C++? I'm astonished. Do you have a
source for that?

Yup. Trends change. Sad I can't share the source.

Most of the commercial C++ code is written to work on several platforms
(Mac, Linux, Tablets, Consoles, Windows) and so it is not Windows
specific and companies do not care. Microsoft has achieved that with
their C++ unfriendliness and bad compilers.

The hobbyist developers use g++ or Clang way more than MSVC, and those
are better on other platforms like Linux.

Most of the Windows and only Windows stuff is currently written in
C# or other .NET things and so it is not C++.
 
Bo Persson

Andy Champ wrote 2013-02-22 16:13:
std::string does not contain UTF-8 encoded text. It contains chars. If
your implementation treats those chars as UTF-8 encoded characters, then
fine - but that is NOT part of the standard, it's just something that
*nix operating systems tend to do.

You might like to consider what happens when you resize a string to
remove part of a multibyte character. There's nothing there to make it
UTF safe...

I suspect this is why fstream::open takes a char* - someone assumed that
a char* was utf-8, and for those operating systems where a filename is
unicode it's broken.

Actually, it's not. The historical reason is that fstream::open was
designed at a time when std::string did not yet exist.

Note that in C++11 we do have an fstream::open(std::string). And without
any required UTF-8 support.
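
For reference, a short sketch of using the C++11 overload. The helper name write_line is hypothetical; how the bytes of the path are interpreted is left to the implementation and the operating system.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Writes one line of text to the file at 'path'; returns false on failure.
inline bool write_line(const std::string& path, const std::string& text) {
    std::ofstream out;
    out.open(path);  // C++11 overload taking std::string, no c_str() needed
    if (!out) return false;
    out << text << '\n';
    return static_cast<bool>(out);
}
```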


Bo Persson
 
Öö Tiib

So you don't count portable programs running on Windows as Windows
programs? With such definitions the 20% number makes more sense indeed...

Sure, they are Windows programs when built for Windows. However, most
Windows-specific code is likely removed, what remains is likely isolated
in small modules, and the text is likely kept as UTF-8, not UTF-16LE.

IMO, they have a quite decent compiler and an excellent debugger (with
some braindead quirks of course, but who doesn't have them). The compiler
is a bit lagging when adapting to standards compliance, but this doesn't
make it unusable.

For the last 15 years the trend has been to lag behind the others. At the
moment free IDEs and free compilers are in several respects better than the
MS commercial tools. WinDbg is fine; the debugger integrated into the IDE
apparently does not understand what is going on. A good engineer can work
well even with bad tools.

It is probably true that writing strictly Windows-specific stuff in a
Windows-specific language like C# is easier than in C++. That's the whole
point and not-so-secret agenda of creating the Windows-specific languages
in the first place. So actually the percent of Windows-specific programs
written in C++ should be zero; if it is 20% this probably means somebody
has chosen a wrong tool for the job.

Somebody always does something with the wrong tool; no statistics are needed;
that is human nature. :D C++ can be the best tool to solve a problem for
Windows as well. When efficiency is needed, C++ is unrivaled.
C++ also has unrivaled power for integrating different things together.
When those powers are not needed, C++ is perhaps too complicated a tool
for many.
 
Nobody

Ah - I hadn't realised that. So what does ls display if you have
backspaces or newlines in the filename? Something stupid I take it? It
does rather explain the decision.

Originally, ls just copied the byte sequence to stdout. Modern versions
(at least the GNU version) will decode the string according to the current
locale then re-encode it, with question marks (or escape sequences with
-Q) for non-printable characters or sequences which cannot be decoded
according to the current locale.

Taking the encoding into account means that multi-column output is aligned
correctly when dealing with multi-byte characters (i.e. columns are based
upon characters rather than bytes).
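
A deliberately simplified sketch of the question-mark substitution. It works byte by byte using std::isprint, which in the default "C" locale only recognizes printable ASCII, so it does not reproduce the locale-aware decoding that GNU ls performs. The name mask_nonprintable is hypothetical.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Returns a copy of 'name' with every non-printable byte replaced by '?'.
inline std::string mask_nonprintable(const std::string& name) {
    std::string out = name;
    for (char& c : out) {
        if (!std::isprint(static_cast<unsigned char>(c)))
            c = '?';
    }
    return out;
}
```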
 
James Kanze

I assume that it's because fopen() takes a char*.

And because fstream preceded std::string by a number of years.

All widely-used OSes can reference (some) files using char*, even if it's
suboptimal (e.g. on Windows, only files whose names are valid in the
current codepage can be opened that way).

Making fstream::open() take e.g. a wchar_t* or std::wstring would be even
more broken on Unix than using char* is on Windows. Unix filenames are
just NUL-terminated sequences of bytes with no defined encoding.

I suspect that more likely, any suggestion of having `fstream`
take `wchar_t` (or even `std::string`) simply came up too late.
What you can pass to fstream::open is implementation defined
anyway, so there's no problem with other OSs; the implementation
defined legal set of wchar_t filenames under Unix is empty; you
get the same sort of error that you get when you try to open
a file named ":::" under Windows.
 
