Reading an array from file?


James Kanze

[ Figuring out valid names on a server ]
Not when you're creating new files. And most of my programs
don't run under a GUI; they're servers, which run 24 hours a
day. Of course, they don't run under Windows either, so the
question is moot:). But the question remains---picking up the
name from a GUI is fine for interactive programs, but a lot of
programs aren't interactive.
Even when the main program isn't interactive, configuration
for it can be.
Ultimately you're right though -- it would be nice if you
could depend on (for example) being able to query a server
about some basic characteristics of a shared/exported file
system, so you could portably figure out what it allows. Right
now, virtually all such "knowledge" is encoded implicitly in
client code (or simply doesn't exist -- the client just passes
a string through and hopes for the best).

The problem is that there is a more or less guaranteed minimum
that will be portable, say one to six alphanumeric characters, a
dot, and a single alphabetic character. And no directory paths.
That's awfully restrictive, however, and you almost never need
to go to that extreme. Just how far you can deviate from it,
though, is not always obvious; in today's world, I tend to
assume two strings which would be valid C/C++ symbols,
separated by a '.', and a maximum of 14 characters, including
the '.'. But it's more complex than that, because some systems
ignore case, others don't, and some treat the text after the dot
specially.
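Concretely, that rule of thumb amounts to something like the
following sketch (the function names here are made up, and it
deliberately ignores the case-sensitivity and "text after the
dot" caveats just mentioned):

    #include <cctype>
    #include <string>

    namespace {
        //  Would this string be a valid C/C++ identifier?
        bool isIdentifier(std::string const& s)
        {
            if (s.empty())
                return false;
            unsigned char c0 = static_cast<unsigned char>(s[0]);
            if (!std::isalpha(c0) && s[0] != '_')
                return false;
            for (std::string::size_type i = 1; i != s.size(); ++i) {
                unsigned char c = static_cast<unsigned char>(s[i]);
                if (!std::isalnum(c) && s[i] != '_')
                    return false;
            }
            return true;
        }
    }

    //  Two strings which would be valid C/C++ symbols, separated
    //  by a single '.', at most 14 characters including the '.'.
    bool isConservativeFileName(std::string const& name)
    {
        if (name.size() > 14)
            return false;
        std::string::size_type dot = name.find('.');
        if (dot == std::string::npos
            || name.find('.', dot + 1) != std::string::npos)
            return false;
        return isIdentifier(name.substr(0, dot))
            && isIdentifier(name.substr(dot + 1));
    }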
[ ... ]
And that the file name matches, somehow. But typically, that
isn't the case---I regularly share files between systems, as
does just about everyone where I work.
I wish I could offer something positive here, but I doubt I
can. Ultimately, this depends more on the FS than the OS
though -- just for example, regardless of the OS, an ISO 9660
FS (absent something like Joliet extensions) places draconian
restrictions on file names.

Yes. Posix defined a function, pathconf, which allows obtaining
some file system specific information, but it's still very Posix
oriented---it only returns information about things which can
vary on a Posix filesystem. Whereas in practice, the problems
I encounter are because I'm using both Windows and various Unix
systems.
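For what it's worth, a pathconf query looks something like this
(Posix only; just a sketch, with made-up output messages):

    #include <unistd.h>     //  pathconf, _PC_NAME_MAX
    #include <cerrno>
    #include <iostream>

    int main()
    {
        //  Maximum file name length on the file system holding
        //  the current directory: -1 with errno unchanged means
        //  "no limit", -1 with errno set means the query failed.
        errno = 0;
        long nameMax = ::pathconf(".", _PC_NAME_MAX);
        if (nameMax == -1 && errno != 0) {
            std::cerr << "pathconf failed\n";
        } else if (nameMax == -1) {
            std::cout << "no fixed limit on name length\n";
        } else {
            std::cout << "NAME_MAX here: " << nameMax << '\n';
        }
        return 0;
    }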
[ ... ]
I agree that standard (and simple) solutions exist. Putting
a BOM at the start of a text file allows immediate
identification of the encoding format. But how many editors
that you know actually do this?
A few -- Windows Notepad knows how to create and work with
UTF-8, UTF-16LE and UTF-16BE, all including BOMs (or whatever
you call the UTF-8 signature).

I'd call it a BOM:). The idea being that if the reader knows
(or can reasonably assume) it is dealing with Unicode, writing
0xFEFF as the first character allows the application to
determine the transformation format being used by simply reading
the first couple of bytes (maximum four). (I tend to "ignore"
byte order, as such, and simply think in terms of transformation
format---although the only difference between some
transformation formats is the byte order.) Of course, if the
reader can assume Unicode, you don't need a BOM for UTF-8: the
BOM in the other transformation formats ensures that one of the
first four bytes will be 0xFF, and another 0xFE, neither of
which can be present in UTF-8.

In practice, of course, there's still a lot of non-Unicode
floating around as well, and not all Unicode files contain a
BOM, so things get more complicated. Even when limiting myself
to Unicode, I'll read the first four bytes---if there's a BOM,
fine, but even if there's not, I'll look for 0x00 bytes; if
their position and number correspond to one of the UTF-16 or
UTF-32 formats, and the first two characters have a Unicode
encoding of less than 0xFF, I'll assume that format. It's not
guaranteed, and will almost certainly fail if I get a file with
Chinese text and no BOM, but it works often enough to be
worthwhile. At least in my environment (where files with
Chinese text are very rare).
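Roughly, the check I'm describing comes down to something like
this (just a sketch; the function name is made up, and the
no-BOM branch only works when the first characters really do
have code points below 0xFF):

    #include <cstddef>
    #include <string>

    //  Guess the transformation format from the first (up to)
    //  four bytes: explicit BOM first (UTF-32 before UTF-16,
    //  since FF FE is also the start of the UTF-32LE BOM), then
    //  the 0x00-byte heuristic, then UTF-8 as the default.
    std::string
    guessEncoding(unsigned char const* p, std::size_t n)
    {
        if (n >= 4 && p[0] == 0x00 && p[1] == 0x00
                   && p[2] == 0xFE && p[3] == 0xFF)
            return "UTF-32BE";
        if (n >= 4 && p[0] == 0xFF && p[1] == 0xFE
                   && p[2] == 0x00 && p[3] == 0x00)
            return "UTF-32LE";
        if (n >= 2 && p[0] == 0xFE && p[1] == 0xFF)
            return "UTF-16BE";
        if (n >= 2 && p[0] == 0xFF && p[1] == 0xFE)
            return "UTF-16LE";
        if (n >= 3 && p[0] == 0xEF && p[1] == 0xBB && p[2] == 0xBF)
            return "UTF-8";     //  the UTF-8 signature
        //  No BOM: look at where the 0x00 bytes fall, assuming
        //  the first characters have code points below 0xFF.
        if (n >= 4 && p[0] == 0x00 && p[1] == 0x00
                   && p[2] == 0x00 && p[3] != 0x00)
            return "UTF-32BE";
        if (n >= 4 && p[0] != 0x00 && p[1] == 0x00
                   && p[2] == 0x00 && p[3] == 0x00)
            return "UTF-32LE";
        if (n >= 4 && p[0] == 0x00 && p[1] != 0x00 && p[2] == 0x00)
            return "UTF-16BE";
        if (n >= 4 && p[0] != 0x00 && p[1] == 0x00 && p[3] == 0x00)
            return "UTF-16LE";
        return "UTF-8";
    }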
The current version of Visual Studio also seems to work fine
with UTF-8 and UTF-16 (BE & LE) text files. It preserves the
BOM and endianness when saving a modified version -- but if you
want to use it to create a new file with UTF-16BE encoding (for
example) that might be a bit more difficult (I haven't tried
very hard, but I don't immediately see a "Unicode big endian"
option like the one Notepad provides).

I'm afraid I can't help you there. (Now if it were vim...)
But it sounds like Microsoft is being inconsistent.
 

Jerry Coffin

In article <(e-mail address removed)>, (e-mail address removed)
says...

[ ... ]
In practice, of course, there's still a lot of non-Unicode
floating around as well, and not all Unicode files contain a
BOM, so things get more complicated. Even when limiting myself
to Unicode, I'll read the first four bytes---if there's a BOM,
fine, but even if there's not, I'll look for 0x00 bytes; if
their position and number correspond to one of the UTF-16 or
UTF-32 formats, and the first two characters have a Unicode
encoding of less than 0xFF, I'll assume that format. It's not
guaranteed, and will almost certainly fail if I get a file with
Chinese text and no BOM, but it works often enough to be
worthwhile. At least in my environment (where files with
Chinese text are very rare).

When you start doing work for Windows, you'll probably want to look
at IsTextUnicode(). It does roughly the kind of guessing you describe
above, but IIRC, it looks at something like 8K of text instead of
four bytes.
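Very roughly, the call looks something like this (the file name
is only an example; the mask constants are the documented
IS_TEXT_UNICODE_* values, and it only looks for the UTF-16
forms, not UTF-8):

    #include <windows.h>
    #include <fstream>
    #include <iostream>
    #include <vector>

    int main()
    {
        //  Read a chunk of the file and let Windows guess whether
        //  it looks like UTF-16 text (either byte order).
        std::ifstream in("example.txt", std::ios::binary);
        std::vector<char> buf(8192);
        in.read(&buf[0], static_cast<std::streamsize>(buf.size()));
        int tests = IS_TEXT_UNICODE_UNICODE_MASK
                  | IS_TEXT_UNICODE_REVERSE_MASK;
        BOOL isUtf16 = IsTextUnicode(
            &buf[0], static_cast<int>(in.gcount()), &tests);
        std::cout << (isUtf16 ? "probably UTF-16\n"
                              : "probably not UTF-16\n");
        return 0;
    }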
I'm afraid I can't help you there. (Now if it were vim...)
But it sounds like Microsoft is being inconsistent.

Well, sort of. Then again, the programs are different enough in
general that consistency between them would be a bit like consistency
between a skateboard and a delivery truck -- they both have four
wheels, but almost everything else is quite different.
 

James Kanze

In article <(e-mail address removed)>, (e-mail address removed)
says...
[ ... ]
In practice, of course, there's still a lot of non-Unicode
floating around as well, and not all Unicode files contain a
BOM, so things get more complicated. Even when limiting
myself to Unicode, I'll read the first four bytes---if
there's a BOM, fine, but even if there's not, I'll look for
0x00 bytes; if their position and number correspond to one of
the UTF-16 or UTF-32 formats, and the first two characters
have a Unicode encoding of less than 0xFF, I'll assume that
format. It's not guaranteed, and will almost certainly fail
if I get a file with Chinese text and no BOM, but it works
often enough to be worthwhile. At least in my environment
(where files with Chinese text are very rare).
When you start doing work for Windows, you'll probably want to
look at IsTextUnicode(). It does roughly the kind of guessing
you describe above, but IIRC, it looks at something like 8K of
text instead of four bytes.

My code should still be portable, when possible. I'd also like
it to work from streamed input---even a four character buffer
introduces significant complications into my code. And the
documentation of the "results" suggests that it really only
looks for UTF-16.
Well, sort of. Then again, the programs are different enough
in general that consistency between them would be a bit like
consistency between a skateboard and a delivery truck -- they
both have four wheels, but almost everything else is quite
different.

What I meant was more general. On the one hand, Windows seems
to tend toward UTF-16; on the other, Visual Studio doesn't
allow you to create it.
 

Jerry Coffin

[ ... ]
What I meant was more general. On the one hand, Windows seems
to tend toward UTF-16; on the other, Visual Studio doesn't
allow you to create it.

Ah, yes, that's more or less reasonable. VS _does_ seem to allow you
to create it, but not necessarily easily or consistently. A bit more
looking shows, for one example, that it does have a setting to create
Unicode files if they contain characters that can't be represented in
the current code page. At least to me, that sounds like a really
lousy idea -- I have a file that's in one format, but I enter one
wrong character, and suddenly (and apparently without warning) its
format changes completely...
 

James Kanze

[ ... ]
What I meant was more general. On the one hand, Windows
seems to tend toward UTF-16; on the other, Visual Studio
doesn't allow you to create it.
Ah, yes, that's more or less reasonable. VS _does_ seem to
allow you to create it, but not necessarily easily or
consistently. A bit more looking shows, for one example, that
it does have a setting to create Unicode files if they contain
characters that can't be represented in the current code page.
At least to me, that sounds like a really lousy idea -- I have
a file that's in one format, but I enter one wrong character,
and suddenly (and apparently without warning) its format
changes completely...

Yes. Without any consideration for backwards compatibility, I'd
offer a simple choice of "native" (UTF-16, CRLF line
terminators), "internet" (UTF-8, CRLF line terminators) and
"unix" (UTF-8, LF line terminators). But compatibility means
that all of the other existing combinations, which are already
there, have to be supported somehow as well. (Until recently,
most of my sources were ISO 8859-1, CR line terminators, which
didn't seem to cause the VC++ compiler any problems; but the
only characters not in the basic character set were in comments,
so I'm not sure that means anything---I've converted to UTF-8,
and the sources compile just as well, with no change in the
compiler configuration.)
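Purely as an illustration, the choice boils down to a tiny
table (the names here are mine):

    //  The three profiles, as a tiny table: an encoding plus a
    //  line terminator, nothing more.
    struct TextProfile
    {
        char const* encoding;
        char const* lineTerminator;
    };

    TextProfile const nativeProfile   = { "UTF-16", "\r\n" };
    TextProfile const internetProfile = { "UTF-8",  "\r\n" };
    TextProfile const unixProfile     = { "UTF-8",  "\n"   };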
 

Alf P. Steinbach

* James Kanze:
[ ... ]
What I meant was more general. On the one hand, Windows
seems to tend toward UTF-16; on the other, Visual Studio
doesn't allow you to create it.
Ah, yes, that's more or less reasonable. VS _does_ seem to
allow you to create it, but not necessarily easily or
consistently. A bit more looking shows, for one example, that
it does have a setting to create Unicode files if they contain
characters that can't be represented in the current code page.
At least to me, that sounds like a really lousy idea -- I have
a file that's in one format, but I enter one wrong character,
and suddenly (and apparently without warning) its format
changes completely...

Yes. Without any consideration for backwards compatibility, I'd
offer a simple choice of "native" (UTF-16, CRLF line
terminators), "internet" (UTF-8, CRLF line terminators) and
"unix" (UTF-8, LF line terminators). But compatibility means
that all of the other existing combinations, which are already
there, have to be supported somehow as well. (Until recently,
most of my sources were ISO 8859-1, CR line terminators, which
didn't seem to cause the VC++ compiler any problems; but the
only characters not in the basic character set were in comments,
so I'm not sure that means anything---I've converted to UTF-8,
and the sources compile just as well, with no change in the
compiler configuration.)

If you're using Comeau that may be a good strategy. Otherwise you may want to
use g++ to check the validity of some construct that MSVC accepts, or just to
get a more comprehensible error message for something. And g++ does not accept
UTF-8 with a BOM (at least, not unless you rebuild the compiler) -- in fact,
the MinGW g++ compiler only accepts UTF-8 sans BOM, but happily it accepts
invalid UTF-8, so that except for wide character literals one can pretend it's
Latin-1.


Cheers,

- Alf
 
