Strip everything out of a string except numbers.

jcf

I need to do the following: Our application builds information from a
string that is stored in a buffer that can be any of the following
formats: ABCD0102, ABCDEF0102, AB*CD*01*02, AB*CDEF*01*02. The only
information needed is the 4 digits at the end. The new format with
asterix seperation is an "upgrade" that I need to work around but also
be able to handle the older formats as well. Before, it would just go
by the position in the string to parse the "store" and "area" (the
information I need)
..
..
..
memcpy(store, buff+strlen(buff)-4,2);
memcpy(area, buff+strlen(buff)-2,2);
..
..
..
But, now I was thinking if I could just strip everything but the four
digits that are always there, before I do get to the memcopy's and
store that in a new buffer, it will still work fine. I was trying
sscanf, but can't get it to work.

/*strip(buff, new_buff)*/

memcpy(store, new_buff+strlen(new_buff)-4,2);
memcpy(area, new_buff+strlen(new_buff)-2,2);
 
Eric Sosman

jcf said:
I need to do the following: Our application builds information from a
string that is stored in a buffer that can be any of the following
formats: ABCD0102, ABCDEF0102, AB*CD*01*02, AB*CDEF*01*02. The only
information needed is the 4 digits at the end. The new format with
asterix seperation is an "upgrade" that I need to work around but also

That's "asterisk" (unless you happen to be on good
terms with a venerable Druid) and "separation."
be able to handle the older formats as well. Before, it would just go
by the position in the string to parse the "store" and "area" (the
information I need)
.
.
.
memcpy(store, buff+strlen(buff)-4,2);
memcpy(area, buff+strlen(buff)-2,2);
.
.
.
But, now I was thinking if I could just strip everything but the four
digits that are always there, before I do get to the memcopy's and
store that in a new buffer, it will still work fine. I was trying
sscanf, but can't get it to work.

/*strip(buff, new_buff)*/

memcpy(store, new_buff+strlen(new_buff)-4,2);
memcpy(area, new_buff+strlen(new_buff)-2,2);

The isdigit() function from <ctype.h> can help you
decide whether each character is a digit or a non-digit.
The strcspn() function from <string.h> can give you an
easy way to skip over all the leading non-digits. If
you need more thorough validation (e.g., "all the leading
non-digits must be upper-case letters" or "there must be
at least four but no more than six leading non-digits"
or anything along such lines), you'll need to describe it.
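
A minimal sketch of the strcspn() idea, for illustration only. The helper name first_digit is hypothetical (it is not part of the original application), and the sketch does no validation of what precedes or follows the digits:

```c
#include <string.h>

/* Hypothetical helper: returns a pointer to the first decimal digit in s,
   or to the terminating '\0' if s contains no digit at all. */
const char *first_digit(const char *s)
{
    /* strcspn() returns the length of the leading run of characters
       that are NOT in the given set, i.e. the leading non-digits. */
    return s + strcspn(s, "0123456789");
}
```

For "ABCD0102" the result points at "0102"; for "AB*CD*01*02" it points at "01*02", so the optional asterisk between the two fields still has to be handled separately.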
 
jcf

Eric said:
That's "asterisk" (unless you happen to be on good
terms with a venerable Druid) and "separation."


The isdigit() function from <ctype.h> can help you
decide whether each character is a digit or a non-digit.
The strcspn() function from <string.h> can give you an
easy way to skip over all the leading non-digits. If
you need more thorough validation (e.g., "all the leading
non-digits must be upper-case letters" or "there must be
at least four but no more than six leading non-digits"
or anything along such lines), you'll need to describe it.

Here is the sscanf function that almost works

for the new format (AB*CD*01*02):

sscanf(old, "%[A-Z,*]%2[0-9]*%2[0-9]",junk, store, area);

for the old format (ABCD0102):

sscanf(old, "%[A-Z,*]%2[0-9]%2[0-9]",junk, store, area);

So, I guess, really my question is, is there a way to consolidate those
two sscanf expressions that will work on both strings. Something like
'throw away leading uppercase letters and asterisks, get two numbers
for "store", throw away an asterisk, which may or may not be there, get
two numbers for "area"'
 
Martijn

jcf said:
I need to do the following: Our application builds information from a
string that is stored in a buffer that can be any of the following
formats: ABCD0102, ABCDEF0102, AB*CD*01*02, AB*CDEF*01*02. The only
information needed is the 4 digits at the end. The new format with
asterix seperation is an "upgrade" that I need to work around but also
be able to handle the older formats as well.

[snipped]
But, now I was thinking if I could just strip everything but the four
digits that are always there, before I do get to the memcopy's and
store that in a new buffer, it will still work fine. I was trying
sscanf, but can't get it to work.

/*strip(buff, new_buff)*/

memcpy(store, new_buff+strlen(new_buff)-4,2);
memcpy(area, new_buff+strlen(new_buff)-2,2);

How about this:

sscanf(szIn, "%[^0-9]%d%[^0-9]%d", szTemp, &i1, szTemp, &i2);
if ( i1 >= 100 ) /* this indicates i1 captured both numbers */
{
    i2 = i1 % 100;
    i1 /= 100;
}

printf("%d - %d\n", i1, i2);

You could also use the return value of sscanf (which returns the number
of items successfully assigned). There could be a better solution for the
szTemp variable, but I am not familiar enough with (s)scanf to know if there
is an alternative for this.

But this only works for a limited set of inputs, of course.
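
A possible sketch of the return-value variant, under stated assumptions: the function name split_store_area is hypothetical, the "%*[" assignment-suppression form is used so no scratch buffer is needed, and nothing here checks how many digits each number actually had:

```c
#include <stdio.h>

/* Hypothetical helper: returns 1 and fills *store and *area on success,
   0 if no number could be read at all. */
int split_store_area(const char *in, int *store, int *area)
{
    int a = 0, b = 0;
    /* Suppressed conversions are not counted, so sscanf() returns
       2 when two separate numbers were read and 1 when only one was. */
    int n = sscanf(in, "%*[^0-9]%d%*[^0-9]%d", &a, &b);

    if (n == 2) {            /* e.g. AB*CD*01*02 */
        *store = a;
        *area = b;
        return 1;
    }
    if (n == 1) {            /* e.g. ABCD0102: one number holds both fields */
        *store = a / 100;
        *area = a % 100;
        return 1;
    }
    return 0;
}
```

Deciding on the return value rather than on the size of the first number means an old-format string such as "ABCD0002" is still split into 0 and 2.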

Good luck,
 
SM Ryan

# I need to do the following: Our application builds information from a
# string that is stored in a buffer that can be any of the following
# formats: ABCD0102, ABCDEF0102, AB*CD*01*02, AB*CDEF*01*02. The only
# information needed is the 4 digits at the end. The new format with
# asterix seperation is an "upgrade" that I need to work around but also
# be able to handle the older formats as well. Before, it would just go
# by the position in the string to parse the "store" and "area" (the
# information I need)
# .
# .
# .
# memcpy(store, buff+strlen(buff)-4,2);
# memcpy(area, buff+strlen(buff)-2,2);
# .
# .
# .
# But, now I was thinking if I could just strip everything but the four
# digits that are always there, before I do get to the memcopy's and
# store that in a new buffer, it will still work fine. I was trying
# sscanf, but can't get it to work.
#
# /*strip(buff, new_buff)*/

What about
char *ss = buff, *dd = new_buff;
for (; *ss; ss++) if (isdigit((unsigned char)*ss)) *dd++ = *ss;
*dd = 0;
int n = strlen(new_buff);
if (n > 4) { memmove(new_buff, new_buff + n - 4, 5); }
else if (n < 4) { /* string is too short: handle the error */ }
 
Chris Torek

Here is the sscanf [code, with two separate directive sequences]
that almost works

for the new format (AB*CD*01*02):
sscanf(old, "%[A-Z,*]%2[0-9]*%2[0-9]",junk, store, area);

for the old format (ABCD0102):
sscanf(old, "%[A-Z,*]%2[0-9]%2[0-9]",junk, store, area);

So, I guess, really my question is, is there a way to consolidate those
two sscanf expressions that will work on both strings.

No. The scanf engine is capable of scanning "one or more characters
from a set" using the %[ directive, but %[ is required to match at
least one character. Since the "*" between the two 2-digit areas
is an optional single "*", you need a more powerful regular-expression
matcher than that offered by the scanf family.

You can, however, use the "*" modifier to avoid storing the skipped
initial sequence:

result = sscanf(buf, "%*[,*A-Z]%2[0-9]*%2[0-9]", store, area);
if (result != 2)
    result = sscanf(buf, "%*[,*A-Z]%2[0-9]%2[0-9]", store, area);
if (result != 2)
    ... handle error case ...

Note that the A-Z part depends on the character set; it will do
the wrong thing (or something presumably incorrect at least) on a
machine that uses an EBCDIC encoding (mainly IBM mainframes). You
could avoid this by spelling out the entire alphabet, or using some
alternative parsing scheme (instead of relying on the scanf engine).
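
For illustration, here is one way the two-attempt approach might look with the alphabet and digits spelled out. This is only a sketch: the function name scan_store_area and the PREFIX/DIGITS macros are hypothetical, and the comma has been left out of the scanset on the assumption that it was never meant to be an accepted character:

```c
#include <stdio.h>

/* Spelled-out sets, so nothing depends on A-Z or 0-9 range syntax. */
#define PREFIX "%*[*ABCDEFGHIJKLMNOPQRSTUVWXYZ]"
#define DIGITS "%2[0123456789]"

/* Hypothetical wrapper: store and area must each have room for 3 chars.
   Returns 1 if either format matched, 0 otherwise. */
int scan_store_area(const char *buf, char *store, char *area)
{
    /* New format first: a literal '*' between the two fields. */
    if (sscanf(buf, PREFIX DIGITS "*" DIGITS, store, area) == 2)
        return 1;
    /* Old format: the two fields are adjacent. */
    return sscanf(buf, PREFIX DIGITS DIGITS, store, area) == 2;
}
```

The first attempt may already write to store before it fails on the missing asterisk; the second attempt simply overwrites it.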
 
jcf

Thanks for the suggestions, this seems to work and eliminates the need
for memcpy's
if (strchr(buff, '*') == NULL)
    sscanf(buff, "%[^0-9]%2[0-9]%2[0-9]", junk, store, area);
else
    sscanf(buff, "%[^0-9]%2[0-9]*%2[0-9]", junk, store, area);
if (strlen(store) != 2 || strlen(area) != 2) {
    /*error*/
}
I wish there was a way to combine the two sscanf expressions with
format tokens, but I can't seem to figure it out. I guess I'm trying to
be too clever.
 
akarl

jcf said:
I need to do the following: Our application builds information from a
string that is stored in a buffer that can be any of the following
formats: ABCD0102, ABCDEF0102, AB*CD*01*02, AB*CDEF*01*02. The only
information needed is the 4 digits at the end. The new format with
asterix seperation is an "upgrade" that I need to work around but also
be able to handle the older formats as well. Before, it would just go
by the position in the string to parse the "store" and "area" (the
information I need)
.
.
.
memcpy(store, buff+strlen(buff)-4,2);
memcpy(area, buff+strlen(buff)-2,2);
.
.
.
But, now I was thinking if I could just strip everything but the four
digits that are always there, before I do get to the memcopy's and
store that in a new buffer, it will still work fine. I was trying
sscanf, but can't get it to work.

/*strip(buff, new_buff)*/

memcpy(store, new_buff+strlen(new_buff)-4,2);
memcpy(area, new_buff+strlen(new_buff)-2,2);

As long as you don't need format validation the following will do:

#include <ctype.h>
#include <stdbool.h>

/* GetDigits(s, d) extracts the digits from s (in order)
and puts them in d, which must be large enough to hold
the digits and the NUL character.

Example: s = "AB*CDEF*01*02" gives d = "0102". */

void GetDigits(const char *s, char *digits)
{
    int i = 0, j = 0;

    while (true) {
        while ((s[i] != '\0') && !isdigit(s[i])) { i++; }
        if (s[i] == '\0') { break; }
        digits[j] = s[i];
        i++;
        j++;
    }
    digits[j] = '\0';
}


August
 
Leonardo Palozzi

jcf said:
sscanf(buff, "%[^0-9]%2[0-9]%2[0-9]",junk, store, area);
else
sscanf(buff, "%[^0-9]%2[0-9]*%2[0-9]",junk, store, area);

You can avoid the need for "junk" by using the '*' assignment-suppression
character.

i.e.

sscanf(buff, "%*[^0-9]%2[0-9]%2[0-9]", store, area);

-Leonardo
 
Eric Sosman

jcf said:
Eric said:
jcf said:
I need to do the following: Our application builds information from a
string that is stored in a buffer that can be any of the following
formats: ABCD0102, ABCDEF0102, AB*CD*01*02, AB*CDEF*01*02. The only
information needed is the 4 digits at the end. The new format with
asterix seperation is an "upgrade" that I need to work around but also
[...]

Here is the sscanf function that almost works

for the new format (AB*CD*01*02):

sscanf(old, "%[A-Z,*]%2[0-9]*%2[0-9]",junk, store, area);

for the old format (ABCD0102):

sscanf(old, "%[A-Z,*]%2[0-9]%2[0-9]",junk, store, area);

So, I guess, really my question is, is there a way to consolidate those
two sscanf expressions that will work on both strings. Something like
'throw away leading uppercase letters and asterisks, get two numbers
for "store", throw away an asterisk, which may or may not be there, get
two numbers for "area"'

I don't think so. In any case, the two existing formats
have problems of their own. For example, the first is
perfectly content with input like "AB*****,1*9" and will
not detect any discrepancy. Similarly, the second would
happily accept ",123" and report no problem. You could mess
around with sscanf() and maybe come up with something that's
better (for example, you could get rid of `junk' by using
the assignment-suppression modifier), but I think sscanf()
is the wrong screwdriver for this nail.

A lot depends on how "clean" you believe the input to be.
If it's the job of somebody upstream to ensure that the input
is properly formatted and if you're confident the job has been
done correctly, you could use strchr() to test for the presence
of an asterisk and then choose between one sscanf() or the
other (after fixing the formats, of course).

Personally, I'd be uncomfortable with so much trust; it
goes against the grain. What happens when somebody has a
bright idea for yet another "upgrade" of the format, but
forgets to change the code you're now writing? It would most
likely be better to have your program ring the alarm bells
and draw attention to the problem than to get fooled by a
"store" code that's expanded to three digits.
 
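
As an illustration of the "ring the alarm bells" approach, here is a sketch of a hand-written parser that rejects anything it does not recognise. The function name parse_strict is hypothetical, and the accepted grammar (upper-case letters and asterisks, two digits, an optional asterisk, two digits, end of string) is an assumption drawn from the four sample formats, not a specification:

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical strict parser. store and area must each have room for
   3 chars. Returns 1 on success, 0 on any deviation from the assumed
   grammar (the output buffers are then unspecified). */
int parse_strict(const char *s, char *store, char *area)
{
    /* Spelled-out alphabet, so A-Z need not be contiguous. */
    size_t prefix = strspn(s, "ABCDEFGHIJKLMNOPQRSTUVWXYZ*");
    const char *p = s + prefix;

    if (prefix == 0)
        return 0;
    if (!isdigit((unsigned char)p[0]) || !isdigit((unsigned char)p[1]))
        return 0;
    store[0] = p[0]; store[1] = p[1]; store[2] = '\0';
    p += 2;

    if (*p == '*')              /* the separator is optional */
        p++;

    if (!isdigit((unsigned char)p[0]) || !isdigit((unsigned char)p[1]))
        return 0;
    area[0] = p[0]; area[1] = p[1]; area[2] = '\0';
    p += 2;

    return *p == '\0';          /* trailing characters: unknown format */
}
```

With this sketch, a three-digit store code such as "AB*CD*123*02" is reported as an error instead of being silently truncated, and inputs like ",123" or "AB*****,1*9" are rejected as well.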
Keith Thompson

akarl said:
As long as you don't need format validation the following will do:

#include <ctype.h>
#include <stdbool.h>

/* GetDigits(s, d) extracts the digits from s (in order)
and puts them in d, which must be large enough to hold
the digits and the NUL character.

Example: s = "AB*CDEF*01*02" gives d = "0102". */

void GetDigits(const char *s, char *digits)
{
    int i = 0, j = 0;

    while (true) {
        while ((s[i] != '\0') && !isdigit(s[i])) { i++; }
        if (s[i] == '\0') { break; }
        digits[j] = s[i];
        i++;
        j++;
    }
    digits[j] = '\0';
}


Here's a simpler version:

void GetDigits(const char *s, char *digits)
{
    int i, j;
    for (i = 0, j = 0; s[i] != '\0'; i++) {
        if (isdigit((unsigned char)s[i])) {
            digits[j++] = s[i];
        }
    }
    digits[j] = '\0';
}
 
akarl

Keith said:
As long as you don't need format validation the following will do:

#include <ctype.h>
#include <stdbool.h>

/* GetDigits(s, d) extracts the digits from s (in order)
and puts them in d, which must be large enough to hold
the digits and the NUL character.

Example: s = "AB*CDEF*01*02" gives d = "0102". */

void GetDigits(const char *s, char *digits)
{
    int i = 0, j = 0;

    while (true) {
        while ((s[i] != '\0') && !isdigit(s[i])) { i++; }
        if (s[i] == '\0') { break; }
        digits[j] = s[i];
        i++;
        j++;
    }
    digits[j] = '\0';
}



Here's a simpler version:

void GetDigits(const char *s, char *digits)
{
    int i, j;
    for (i = 0, j = 0; s[i] != '\0'; i++) {
        if (isdigit((unsigned char)s[i])) {
            digits[j++] = s[i];
        }
    }
    digits[j] = '\0';
}


Yes, of course. How silly of me. I was working on a similar function
that was more complex, hence the complication of a simple problem.
Personally I would write it as:

void GetDigits(const char *s, char *digits)
{
    int i = 0, j = 0;

    while (s[i] != '\0') {
        if (isdigit(s[i])) {
            digits[j] = s[i];
            j++;
        }
        i++;
    }
    digits[j] = '\0';
}

since I don't do multiple side effects and I use the `for' loop only
when I know the number of loops in advance (as the `for' loop works in
languages outside the C family). Most C/C++/Java-only programmers will
of course have a different opinion.


August
 
akarl

Keith said:
Here's a simpler version:

void GetDigits(const char *s, char *digits)
{
    int i, j;
    for (i = 0, j = 0; s[i] != '\0'; i++) {
        if (isdigit((unsigned char)s[i])) {
            digits[j++] = s[i];
        }
    }
    digits[j] = '\0';
}


Why the cast to `unsigned char' when `isdigit' expects an `int'?


August
 
Robert Gamble

akarl said:
Keith said:
Here's a simpler version:

void GetDigits(const char *s, char *digits)
{
    int i, j;
    for (i = 0, j = 0; s[i] != '\0'; i++) {
        if (isdigit((unsigned char)s[i])) {
            digits[j++] = s[i];
        }
    }
    digits[j] = '\0';
}


Why the cast to `unsigned char' when `isdigit' expects an `int'?


The value of the argument to isdigit must be one that can be
represented as an unsigned char, or the value of the macro EOF (a
negative integer that cannot be represented as an unsigned char, which
is why the parameter type is int). Hence the cast.

Robert Gamble
 
Keith Thompson

akarl said:
Keith said:
Here's a simpler version:
void GetDigits(const char *s, char *digits)
{
    int i, j;
    for (i = 0, j = 0; s[i] != '\0'; i++) {
        if (isdigit((unsigned char)s[i])) {
            digits[j++] = s[i];
        }
    }
    digits[j] = '\0';
}


Why the cast to `unsigned char' when `isdigit' expects an `int'?


Because isdigit expects an int whose value is representable as an
unsigned char (or is equal to EOF). If plain char is signed, and one
of the characters in the input string has a negative value, it will be
promoted to a negative int, and passing a negative value other than
EOF to isdigit invokes undefined behavior. Explicitly converting to
unsigned char avoids this problem.

You're unlikely to run into this problem in practice for two reasons:
first, it's very likely that all the input characters will have
non-negative values (all the members of the basic execution character set
are guaranteed to be non-negative, even if plain char is signed), and
second, many <ctype.h> implementations work "properly" for negative values.
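
A small sketch of the cast in use, for illustration. The function name count_digits is hypothetical; the point is only that every char is converted to unsigned char before it reaches isdigit(), so a byte with a negative char value never becomes a negative int argument:

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: counts the decimal digits in s. */
size_t count_digits(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        /* Without the cast, a char such as '\xE9' may be negative where
           plain char is signed, and isdigit() on it would be undefined. */
        if (isdigit((unsigned char)*s))
            n++;
    }
    return n;
}
```

Calling count_digits("caf\xE9 42") is well defined with the cast and counts the two digits, whether plain char is signed or unsigned.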
 
CBFalconer

Keith said:
akarl said:
As long as you don't need format validation the following will do:

#include <ctype.h>
#include <stdbool.h>

/* GetDigits(s, d) extracts the digits from s (in order)
and puts them in d, which must be large enough to hold
the digits and the NUL character.

Example: s = "AB*CDEF*01*02" gives d = "0102". */

void GetDigits(const char *s, char *digits)
{
    int i = 0, j = 0;

    while (true) {
        while ((s[i] != '\0') && !isdigit(s[i])) { i++; }
        if (s[i] == '\0') { break; }
        digits[j] = s[i];
        i++;
        j++;
    }
    digits[j] = '\0';
}


Here's a simpler version:

void GetDigits(const char *s, char *digits)
{
    int i, j;
    for (i = 0, j = 0; s[i] != '\0'; i++) {
        if (isdigit((unsigned char)s[i])) {
            digits[j++] = s[i];
        }
    }
    digits[j] = '\0';
}


Not to mention that this avoids the undefined behaviour, and can
actually work. It would also be useful to return j so the caller
can easily tell the length of digits, and whether any digits were
found.
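
A sketch of that variant, for illustration. The name get_digits and the size_t return type are choices made here, not something posted in the thread; as before, the destination must be large enough for all the digits plus the terminating '\0':

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical variant: copies the digits of s into digits (in order)
   and returns how many were copied; 0 means no digit was found. */
size_t get_digits(const char *s, char *digits)
{
    size_t i, j = 0;

    for (i = 0; s[i] != '\0'; i++) {
        if (isdigit((unsigned char)s[i])) {
            digits[j++] = s[i];
        }
    }
    digits[j] = '\0';
    return j;
}
```

A caller that expects exactly four digits can then compare the result against 4 before splitting the buffer into store and area.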
 
Martijn

jcf said:
Thanks for the suggestions, this seems to work and eliminates the need
for memcpy's
if (strchr(buff, '*') == NULL)
    sscanf(buff, "%[^0-9]%2[0-9]%2[0-9]", junk, store, area);
else
    sscanf(buff, "%[^0-9]%2[0-9]*%2[0-9]", junk, store, area);
if (strlen(store) != 2 || strlen(area) != 2) {
    /*error*/
}
I wish there was a way to combine the two sscanf expressions with
format tokens, but I can't seem to figure it out. I guess I'm trying
to be too clever.

The solution I suggested actually works for both formats. Another
limitation is that both store and area cannot be 00. You can also use
Leonardo's tip to avoid the extra buffers.

Good luck,
 
