portable ascii-hex conversion

James Brown · Nov 8, 2006

All,

I have a series of characters which I need to convert to integer values.
Each character is read in turn from a function 'nextch', and hex-digits are
identified by the isxdigit function - so I'm looking at '0' - '9', 'A' - 'F'
and 'a' - 'f'.

Here is what I've got:

int num = 0;
int ch = nextch(); /* nextch obtains the next character value */

while(isxdigit(ch))
{
    if(isdigit(ch))
        ch = ch - '0'; /* this is portable I believe */
    else
        ch = (ch & ~0x20) - 'A' + 10; /* not sure if this is ok */

    num = num * 0x10 + ch;
    ch = nextch();
}

If you look at the if-else statement inside the while() loop, you will see
how I attempt to convert 'ch' from a character-value to a numeric value in
the range 0-15 inclusive. But I have doubts about the ((ch & ~0x20) - 'A' +
10) expression:

It assumes that 'A' - 'F' are consecutive values
It assumes that 'a' - 'f' are consecutive, and are always 0x20 above their
'uppercase' counterparts.

Are these assumptions correct? I'm guessing the code is non-portable, so
does anyone have a neat(er) suggestion?

p.s. I derived this code from the lcc compiler sourcecode...

James

Walter Roberson · Nov 8, 2006

I have a series of characters which I need to convert to integer values.
Each character is read in turn from a function 'nextch', and hex-digits are
identified by the isxdigit function - so I'm looking at '0' - '9', 'A' - 'F'
and 'a' - 'f'.

Here is what I've got:

int num = 0;
int ch = nextch(); /* nextc obtains the next character value */

while(isxdigit(ch))

What if it was EOF ?

{
if(isdigit(ch))
ch = ch - '0'; /* this is portable I believe */
Yes.

else
ch = (ch & ~0x20) - 'A' + 10; /* not sure if this is ok */

The & ~0x20 is a hidden toupper() and not portable to non-ASCII.

And as you had thought, 'A' through 'F' are not guaranteed to be
consecutive or even in increasing order.

num = num * 0x10 + ch;

What if you overflow your int ?

ch = nextch();
}

Are these assumptions correct? I'm guessing the code is non-portable, so
does anyone have a neat(er) suggestion?

fetch more ch as long as isxdigit(ch) and you haven't gotten
more chars than you can handle, and store them into a buffer.
Then strtoul() specifying base 16.
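
A minimal sketch of that buffer-then-strtoul() approach, in case it helps. The names read_hex, demo_nextch and HEX_BUF_DIGITS are made up for illustration, and the character source is passed in as a function pointer standing in for the 'nextch' of the original post; the digit limit of 64 is an arbitrary assumption.

#include <ctype.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

#define HEX_BUF_DIGITS 64 /* hypothetical cap on digits we agree to buffer */

/* Demo character source (an assumption, stands in for nextch):
   hands out the bytes of a string, then EOF. */
static const char *demo_src = "";
static int demo_nextch(void)
{
    return *demo_src ? (unsigned char)*demo_src++ : EOF;
}

/* Collects hex digits from next() into a bounded buffer and converts
   them with strtoul(..., 16). Returns 0 on success, -1 if no digit was
   read, if there were too many digits, or if the value overflowed.
   *last receives the first character that was not stored as a digit. */
static int read_hex(int (*next)(void), unsigned long *out, int *last)
{
    char buf[HEX_BUF_DIGITS + 1];
    size_t n = 0;
    int ch = next();

    while (isxdigit(ch)) {          /* isxdigit(EOF) is 0, so EOF ends the loop */
        if (n == HEX_BUF_DIGITS) {
            *last = ch;
            return -1;              /* more digits than we can handle */
        }
        buf[n++] = (char)ch;
        ch = next();
    }
    *last = ch;
    if (n == 0)
        return -1;                  /* nothing to convert */
    buf[n] = '\0';

    errno = 0;
    *out = strtoul(buf, NULL, 16);
    return errno == ERANGE ? -1 : 0;
}

Because only characters accepted by isxdigit() ever reach the buffer, strtoul() never sees a sign, whitespace or a "0x" prefix here.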

If you have particular reasons for handling the characters yourself,
then create a translation table of size UCHAR_MAX + 1,
and initialize it, tr['0'+i] = i for i from 0 to 9, and
tr['A'] = 10, tr['B'] = 11, etc., tr['a'] = 10, tr['b'] = 11, etc.,
then to do the conversion, just determine isxdigit(ch) and if so
then the converted value is tr[ch]. Yes, this has the potential
to waste UCHAR_MAX - 26 slots, but it is also a portable single-step
conversion with no math (other than normal array indexing)
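
The table idea could be sketched roughly as follows. The names tr_init and tr_hex_value are invented for this example. The table has UCHAR_MAX + 1 entries so that every unsigned char value is a valid index, and the letters are assigned one by one because only '0' through '9' are guaranteed to be consecutive.

#include <ctype.h>
#include <limits.h>

/* One slot per possible unsigned char value. */
static unsigned char tr[UCHAR_MAX + 1];

/* Fill the table once before use. */
static void tr_init(void)
{
    int i;

    for (i = 0; i < 10; i++)
        tr['0' + i] = (unsigned char)i;   /* digits are consecutive by guarantee */

    /* No such guarantee for letters, so spell each one out. */
    tr['A'] = tr['a'] = 10;
    tr['B'] = tr['b'] = 11;
    tr['C'] = tr['c'] = 12;
    tr['D'] = tr['d'] = 13;
    tr['E'] = tr['e'] = 14;
    tr['F'] = tr['f'] = 15;
}

/* ch must be an unsigned char value or EOF, as for any <ctype.h> call.
   Returns 0..15 for a hex digit, -1 otherwise. */
static int tr_hex_value(int ch)
{
    return isxdigit(ch) ? tr[ch] : -1;
}

The isxdigit() test is what keeps the zero entries for non-digits from being mistaken for a genuine '0'.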

James Brown · Nov 8, 2006

Walter Roberson said:
What if it was EOF ?

I thought it would be ok? ch would be EOF, which would cause isxdigit to
return(0), and the loop would break out. Is this not what would happen?

The & ~0x20 is a hidden toupper() and not portable to non-ASCII.

And as you had thought, 'A' through 'F' are not guaranteed to be
consecutive or even in increasing order.

What if you overflow your int ?

yes, I hadn't gotten as far as checking for overflow, that's my next task.

fetch more ch as long as isxdigit(ch) and you haven't gotten
more chars than you can handle, and store them into a buffer.
Then strtoul() specifying base 16.

Definitely a nice solution, but I think it will be hard to detect overflows?
My compiler documentation for strtoul says that it returns ULONG_MAX on
overflow, but how do I distinguish this from the case when I encounter the
actual ULONG_MAX value? This is why I am hand-coding this thing, so that I
can emit appropriate warning messages when such things happen.

If you have particular reasons for handling the characters yourself,
then create a translation table of size UCHAR_MAX + 1,
and initialize it, tr['0'+i] = i for i from 0 to 9, and
tr['A'] = 10, tr['B'] = 11, etc., tr['a'] = 10, tr['b'] = 11, etc.,
then to do the conversion, just determine isxdigit(ch) and if so
then the converted value is tr[ch]. Yes, this has the potential
to waste UCHAR_MAX - 26 slots, but it is also a portable single-step
conversion with no math (other than normal array indexing)

I'll definitely consider this as a solution - I was hoping for a 1/2 liner
(calling a c-runtime func would be ideal), but it looks like a lookup table
may be the most appropriate way forward. I'm not too concerned with
performance though - I would prefer a simple loop above all else.

thanks,
James

James Brown · Nov 8, 2006

yes, I hadn't gotten as far as checking for overflow, that's my next task.

Definitely a nice solution, but I think it will be hard to detect
overflows? My compiler documentation for strtoul says that it returns
ULONG_MAX on overflow, but how do I distinguish this from the case when I
encounter the actual ULONG_MAX value? This is why I am hand-coding this
thing, so that I can emit appropriate warning messages when such things
happen.

ok, so I read the rest of the strtoul docs and it says 'errno' is set for
overflow/underflow. Looks like this is my preferred solution, thanks for the
help.

James

Peter Nilsson · Nov 8, 2006

Walter said:
What if it was EOF ?

The loop exists. What of it?

The & ~0x20 is a hidden toupper() and not portable to non-ASCII.

And as you had thought, 'A' through 'F' are not guaranteed to be
consecutive or even in increasing order.

What if you overflow your int ?

fetch more ch as long as isxdigit(ch) and you haven't gotten
more chars than you can handle, and store them into a buffer.
Then strtoul() specifying base 16.

If you have particular reasons for handling the characters yourself,
then create a translation table of size UCHAR_MAX + 1,

Assuming UCHAR_MAX is reasonably small.

and initialize it, tr['0'+i] = i for i from 0 to 9, and
tr['A'] = 10, tr['B'] = 11, etc., tr['a'] = 10, tr['b'] = 11, etc.,
then to do the conversion, just determine isxdigit(ch) and if so
then the converted value is tr[ch]. Yes, this has the potential
to waste UCHAR_MAX - 26 slots, but it is also a portable single-step
conversion with no math (other than normal array indexing)

A simple switch() will do the job too...
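
For what it's worth, a switch-based version might look like this (the function name hex_digit_value is an assumption for the example). It makes no assumption about how the character codes are laid out.

/* Returns 0..15 for a hex digit character, -1 for anything else. */
int hex_digit_value(int ch)
{
    switch (ch) {
    case '0': return 0;
    case '1': return 1;
    case '2': return 2;
    case '3': return 3;
    case '4': return 4;
    case '5': return 5;
    case '6': return 6;
    case '7': return 7;
    case '8': return 8;
    case '9': return 9;
    case 'a': case 'A': return 10;
    case 'b': case 'B': return 11;
    case 'c': case 'C': return 12;
    case 'd': case 'D': return 13;
    case 'e': case 'E': return 14;
    case 'f': case 'F': return 15;
    default:  return -1;   /* not a hex digit (includes EOF) */
    }
}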

Keith Thompson · Nov 8, 2006

James Brown said:
Definitely a nice solution, but I think it will be hard to detect overflows?
My compiler documentation for strtoul says that it returns ULONG_MAX on
overflow, but how do I distinguish this from the case when I encounter the
actual ULONG_MAX value? This is why I am hand-coding this thing, so that I
can emit appropriate warning messages when such things happen.

This is explained in the documentation for strtoul(). On overflow, it
returns ULONG_MAX and sets errno to ERANGE. (You have to set errno to
0 before calling it.)

errno = 0;
result = strtoul(blah, blah, blah);
if (result == ULONG_MAX && errno == ERANGE) {
/* overflow */
}
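
A slightly fuller sketch of the same check, with the placeholders filled in. The function name parse_hex and the status enum are assumptions made for this example; the end pointer is used to detect the case where no digits were converted at all.

#include <errno.h>
#include <limits.h>
#include <stdlib.h>

enum parse_status { PARSE_OK, PARSE_NO_DIGITS, PARSE_OVERFLOW };

/* Converts the hex number at the start of s.
   On PARSE_OK, *out holds the value and, if end is non-null,
   *end points just past the last character consumed. */
enum parse_status parse_hex(const char *s, unsigned long *out, char **end)
{
    char *e;
    unsigned long v;

    errno = 0;                        /* must be cleared before the call */
    v = strtoul(s, &e, 16);

    if (e == s)
        return PARSE_NO_DIGITS;       /* nothing was converted */
    if (v == ULONG_MAX && errno == ERANGE)
        return PARSE_OVERFLOW;        /* real overflow, not a genuine ULONG_MAX */

    *out = v;
    if (end)
        *end = e;
    return PARSE_OK;
}

Note that strtoul() itself skips leading whitespace and accepts an optional sign and, for base 16, an optional "0x" prefix, so this sketch accepts those too.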

CBFalconer · Nov 8, 2006

James said:
.... snip ...

p.s. I derived this code from the lcc compiler sourcecode...

If you mean lcc-win32, that explains the non-portability.

Keith Thompson · Nov 8, 2006

Peter Nilsson said:
The loop exists. What of it?

I think you mean the loop exits.

Andrew Poelstra · Nov 8, 2006

It assumes that 'A' - 'F' are consecutive values
It assumes that 'a' - 'f' are consecutive, and are always 0x20 above their
'uppercase' counterparts.

Are these assumptions correct? I'm guessing the code is non-portable, so
does anyone have a neat(er) suggestion?

Here's a trick from /C Unleashed/, in a chapter (I believe) Richard
Heathfield wrote:

char *hex = "0123456789ABCDEF";

Then you have a number-to-hex converter right there:
hex[n] = n_16, 0 <= n <= 15.
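
A rough sketch of how the same digit string can serve in both directions, using strchr() for the reverse lookup. The names hex_digits, to_hex_char and from_hex_char are invented here, and appending the lowercase letters to the string is an addition for this example, not something taken from the book.

#include <string.h>

/* Index 0..15 gives the uppercase digit; the lowercase letters are
   appended at indices 16..21 so that both cases can be looked up. */
static const char hex_digits[] = "0123456789ABCDEFabcdef";

/* Value 0..15 -> digit character (only the low four bits of n are used). */
char to_hex_char(unsigned n)
{
    return hex_digits[n & 0xFu];
}

/* Digit character -> 0..15, or -1 if ch is not a hex digit. */
int from_hex_char(int ch)
{
    const char *p;

    if (ch == '\0')
        return -1;                 /* strchr would match the terminator */
    p = strchr(hex_digits, ch);
    if (p == NULL)
        return -1;
    /* Indices 16..21 are 'a'..'f', which map back to 10..15. */
    return (p - hex_digits) < 16 ? (int)(p - hex_digits)
                                 : (int)(p - hex_digits) - 6;
}

The reverse lookup is a linear search over 22 characters, which is slower than a table but relies on nothing about the character set.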

Richard Heathfield · Nov 8, 2006

Andrew Poelstra said:

Here's a trick from /C Unleashed/, in a chapter (I believe) Richard
Heathfield wrote:

char *hex = "0123456789ABCDEF";

How I wish I'd written const char *. Oh well.

--
Richard Heathfield
"Usenet is a strange place" - dmr 29/7/1999
http://www.cpax.org.uk
email: normal service will be restored as soon as possible. Please do not
adjust your email clients.

Andrew Poelstra · Nov 8, 2006

Richard Heathfield said:

How I wish I'd written const char *. Oh well.

Actually, what you wrote was:

char Hex[] = "0123456789ABCDEF";

Sorry for misquoting you; I had written that, and then thought,
'Why would I want to modify that?', converted it from an array
to a pointer, and forgot to qualify it with const.

CBFalconer · Nov 8, 2006

Keith said:
.... snip ...

This is explained in the documentation for strtoul(). On overflow,
it returns ULONG_MAX and sets errno to ERANGE. (You have to set
errno to 0 before calling it.)

errno = 0;
result = strtoul(blah, blah, blah);
if (result == ULONG_MAX && errno == ERANGE) {
/* overflow */
}

IIRC this fails miserably when the user enters "-1", etc.

Keith Thompson · Nov 8, 2006

CBFalconer said:
If you mean lcc-win32, that explains the non-portability.

lcc is a different compiler, the one on which lcc-win32 was based.

Code that's part of an implementation doesn't have to be portable; if
there's some real advantage in using some non-portable solution, the
implementer is free to do so.

Richard Heathfield · Nov 8, 2006

Andrew Poelstra said:

Richard Heathfield said:

How I wish I'd written const char *. Oh well.

Actually, what you wrote was:

char Hex[] = "0123456789ABCDEF";

Sorry for misquoting you;

No sweat. BTW it could still use a const. Mea culpa.


Keith Thompson · Nov 8, 2006

CBFalconer said:
IIRC this fails miserably when the user enters "-1", etc.

That depends on what you mean by "fails miserably". It treats "-1" as
a negative integer to be converted to unsigned long. The result is
ULONG_MAX; you can distinguish that from an overflow by checking
errno. If you want to reject negative literals, you need to check for
the '-' character before calling strtoul().

I'm not convinced that that's the ideal behavior, but it's how it's
defined.
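
One possible way to do that check before the call, as a sketch only. The function name parse_unsigned_hex is an assumption, and this version chooses to reject anything that does not start with a hex digit, which rules out '-', '+', leading whitespace and the empty string in one test.

#include <ctype.h>
#include <errno.h>
#include <stdlib.h>

/* Returns 1 and stores the value if s consists of a hex number and
   nothing else; returns 0 on a sign, leading whitespace, an empty
   string, trailing junk, or overflow. */
int parse_unsigned_hex(const char *s, unsigned long *out)
{
    char *end;
    unsigned long v;

    /* strtoul() would silently accept "-1" or "  1"; refuse them here. */
    if (!isxdigit((unsigned char)s[0]))
        return 0;

    errno = 0;
    v = strtoul(s, &end, 16);
    if (errno == ERANGE || *end != '\0')
        return 0;                  /* overflow, or junk after the digits */

    *out = v;
    return 1;
}

A string such as "0x1F" still gets through, since it starts with the digit '0' and strtoul() accepts the prefix for base 16.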

CBFalconer · Nov 8, 2006

Keith said:
That depends on what you mean by "fails miserably". It treats "-1" as
a negative integer to be converted to unsigned long. The result is
ULONG_MAX; you can distinguish that from an overflow by checking
errno. If you want to reject negative literals, you need to check for
the '-' character before calling strtoul().

I'm not convinced that that's the ideal behavior, but it's how it's
defined.

Exactly, and I define that as 'fails miserably'. I want to know
whenever a user enters something outside the defined principal
range. I have input routines that reject such.

Tor Rustad · Nov 9, 2006

Andrew Poelstra wrote:

Here's a trick from /C Unleashed/, in a chapter (I believe) Richard
Heathfield wrote:

The trick predates this book...

char *hex = "0123456789ABCDEF";

Then you have a number-to-hex converter right there:
hex[n] = n_16, 0 <= n <= 15.

That gives you a binary-to-hex converter; the OP, however, asked for a
hex-to-binary converter.

What is needed is something along these lines:

unsigned char hex[UCHAR_MAX + 1] = {0}; /* one slot per unsigned char value */

hex['0'] = 0;
hex['1'] = 1;
hex['2'] = 2;
....
hex['F'] = 15;

/* todo: check input has even length */
/* todo: check for buffer overflow */

/* in[] is assumed to hold unsigned char values */
for(i=j=0; i < in_len; i+=2, j++)
{
    if ( isxdigit(in[i]) && isxdigit(in[i+1]) )
    {
        /* high nibble from in[i], low nibble from in[i+1] */
        binary[ j ] = hex[ in[i] ] << 4 | hex[ in[i+1] ];
    }
    else
    {
        assert( !"function xxx: non-hex input" );
        /* todo: handle non-hex input */
    }
}

Richard Heathfield · Nov 9, 2006

Tor Rustad said:

Andrew Poelstra wrote:

The trick predates this book...

Certainly. I didn't even think of it as a trick - just as... um... oh yeah,
programming.


Simon Biber · Nov 10, 2006

James said:
All,

I have a series of characters which I need to convert to integer values.
Each character is read in turn from a function 'nextch', and hex-digits are
identified by the isxdigit function - so I'm looking at '0' - '9', 'A' - 'F'
and 'a' - 'f'. [...]
Are these assumptions correct? I'm guessing the code is non-portable, so
does anyone have a neat(er) suggestion?

As others have said, the code is not portable.

How about this for a suggestion? It includes overflow detection.

#include <stdio.h>
#include <ctype.h>
#include <limits.h>

int nextch(void)
{
    return getchar();
}

int main(void)
{
    int ch;
    unsigned long num = 0;
    while(isxdigit(ch = nextch()) && num <= ULONG_MAX / 16)
    {
        num <<= 4;
        switch(ch)
        {
            /* every case deliberately falls through to the ones below */
            case 'f': case 'F': num++;
            case 'e': case 'E': num++;
            case 'd': case 'D': num++;
            case 'c': case 'C': num++;
            case 'b': case 'B': num++;
            case 'a': case 'A': num++;
            case '9': num++;
            case '8': num++;
            case '7': num++;
            case '6': num++;
            case '5': num++;
            case '4': num++;
            case '3': num++;
            case '2': num++;
            case '1': num++;
        }
    }
    if(isxdigit(ch))
    {
        fprintf(stderr, "Overflow\n");
    }
    else
    {
        printf("hex = %lX, dec = %lu\n", num, num);
    }
    return 0;
}

GCC turns this switch into a jump table. It subtracts 49 from ch and
jumps to the given location in the table:

int eax = ch - '1';
int e
num <<= 4;

leal -49(%edx), %eax /* eax = ch - '1' */
sall $4, %ecx /* num <<= 4 */
cmpl $53, %eax /* compare eax with 'f' - '1' */
movl %ecx, %ebx /* ebx = num */
ja L33 /* if(eax > 'f' - '1') goto default */
jmp *L27(,%eax,4) /* goto L27[eax] */
.section .rdata,"dr"
.align 4
L27:
.long L26
.long L25
.long L24
.long L23
.long L22
.long L21
.long L20
.long L19
.long L18
.long L33
.long L33
.long L33
.long L33
.long L33
.long L33
.long L33
.long L17
.long L15
.long L13
.long L11
.long L9
.long L7
.long L33
.long L33
.long L33
.long L33
.long L33
.long L33
.long L33
.long L33
.long L33
.long L33
.long L33
.long L33
.long L33
.long L33
.long L33
.long L33
.long L33
.long L33
.long L33
.long L33
.long L33
.long L33
.long L33
.long L33
.long L33
.long L33
.long L17
.long L15
.long L13
.long L11
.long L9
.long L7
.text
L7:
leal 1(%ecx), %ebx
L9:
incl %ebx
L11:
incl %ebx
L13:
incl %ebx
L15:
incl %ebx
L17:
incl %ebx
L18:
incl %ebx
L19:
incl %ebx
L20:
incl %ebx
L21:
incl %ebx
L22:
incl %ebx
L23:
incl %ebx
L24:
incl %ebx
L25:
incl %ebx

Old Wolf · Nov 12, 2006

Andrew said:
Actually, what you wrote was:

char Hex[] = "0123456789ABCDEF";

Sorry for misquoting you; I had written that, and then thought,
'Why would I want to modify that?', converted it from an array
to a pointer, and forgot to qualify it with const.

Can you explain how you got from "Why would I want to
modify that?", to converting it to a pointer ?

My preference would be:
const char Hex[] = "0123456789ABCDEF";
