Very simple parser... not for me

pozz · Jan 27, 2014

I have a string that is generated from a numerical keypad with decimal
numbers 0-9 and the two special characters * and #, similar to a phone
keypad.

I want to parse the string and make something or something else
depending on the current state and the string itself. Generally the
string can be:
#
*#
*<g>#
<c>*<g>#
<c>#
<c>*#
where
<c> is a 4- to 6-digits code
<g> is the exact digit '1', '2' or '3'.

Depending on the current state, some combinations are valid, others not.
In some cases the code should be valid (already saved in the system),
in other cases it could be new (when a new code must be entered to
change the current one).

I think I have solved the problem trying to split the string in to three
fields: <c>, '*' and <g>. If the first char is a digit, it must be a 4-
to 6- code, otherwise it must be a '*' or a '#', and so on...

Even if my code works, I don't think this is an elegant solution. What
happens if in the feature I decide to change one combination? For
example, from "#" to "###"? Or from "<c>*#" to "*<c>#"?
I have to recode everything, because my assumption the string can be
split in three fields in that order is not true anymore.

What is the best approach to implement a simple parser like this?

Malcolm McLean · Jan 27, 2014

I have a string that is generated from a numerical keypad with decimal
numbers 0-9 and the two special characters * and #, similar to a phone
keypad.

I want to parse the string and make something or something else
depending on the current state and the string itself. Generally the
string can be:

#
*#
*<g>#
<c>*<g>#
<c>#
<c>*#

where

<c> is a 4- to 6-digits code
<g> is the exact digit '1', '2' or '3'

Depending on the current state, some combinations are valid, others not.

In some cases the code should be valid (already saved in the system),
in other cases it could be new (when a new code must be entered to
change the current one).

What is the best approach to implement a simple parser like this?

Go to my website
http://www.malcolmmclean.site11.com/www

and check the MiniBasic pages to understand parsing in general, or buy a copy
of the book. It's very cheap and money well spent.

If you can guarantee that you always have to deal with special characters,
4-6 digit cdes, or single digits, the main step is in the tokeniser.
A token is # * <c> or <g>. You have to do a bit of lookahead to distinguish <c>
from <g>, but that's easy enough to implement.
So the tokenizer returns 1 of 6 states (the four token, error, and end of
input). It also keeps track of the length of the token.
You implement two functions in the toeknize, gettoken(), which returns the
current token, and match(). Match() moves on the tokenizer to the next token,
but only if passed the correct token type. Otherwise it flags an error.

A tokenizer makes the parser very easy to write.
Get the first token. # * and <c> are legal, anything else is a parse error.
switch on the result, match the token you read, and read the next token.
Some are legal, some are not. *<g>* is legal, but *<g>$ (where $ = end of
input) is not. You can handle this easily by calling match('*'), without even
reading it.

Errors are sticky. So you pass an "error" structure to every function. If
an error is set it doesn't report more errors and, generally, tries to
return you as quickly as possible, but without funny jumps or other abnormal
flow control. A good plan is to say if(error_reported) return '$' in the
tokenizer.

Ben Bacarisse · Jan 27, 2014

pozz said:
I want to parse the string and make something or something else
depending on the current state and the string itself. Generally the
string can be:
#
*#
*<g>#
<c>*<g>#
<c>#
<c>*#
where
<c> is a 4- to 6-digits code
<g> is the exact digit '1', '2' or '3'.

Depending on the current state, some combinations are valid, others
not. In some cases the code should be valid (already saved in the
system), in other cases it could be new (when a new code must be
entered to change the current one).

It's not clear if these constraints need to be part of the parse. I'd
argue that you should exclude such considerations from the parsing. If
you don't, it can get very much more complex.

What is the best approach to implement a simple parser like this?

You might be in danger of over-engineering this. The more future
changes you try to incorporate, the less you'll ever do now! However,
the pattern you've described is a classic application of regular
expressions. There are many RE matching libraries available, and it's
not too hard to write one if you can't use any of these due to space or
other restrictions. But consider that you might have a suitable
solution already doing it ad hoc.

pozz · Jan 27, 2014

Il 27/01/2014 23:33, Ben Bacarisse ha scritto:

It's not clear if these constraints need to be part of the parse. I'd
argue that you should exclude such considerations from the parsing. If
you don't, it can get very much more complex.

Ok, exclude these constraints from the parsing process.

You might be in danger of over-engineering this. The more future
changes you try to incorporate, the less you'll ever do now! However,
the pattern you've described is a classic application of regular
expressions. There are many RE matching libraries available, and it's
not too hard to write one if you can't use any of these due to space or
other restrictions. But consider that you might have a suitable
solution already doing it ad hoc.

Simple regular expressions, yes, you're right. My valid combinations of
characters 0-9, *, # can be described as regular expressions.

I'm using a very small microcontroller, how to code a simple regular
expression matching algorithm?

BartC · Jan 27, 2014

pozz said:
Il 27/01/2014 23:33, Ben Bacarisse ha scritto:

I'm using a very small microcontroller, how to code a simple regular
expression matching algorithm?

I don't know about regular expressions, but I can parse the above using
random logic, and it's less than 100 lines. My test result in these numeric
values:

c: -1 if not present, or 0 to 999999 (or a bit more actually as I don't
check the length. Is c a numeric value or a string consisting of digits? Is
0000 allowed? If that is distinct from 00000, then you don't want c as a
number, but you need a code then that tells you when c is not present)

g: 0 if not present, or 1 to 3

star: 0 if not present, or 1

hash: always 1

(I can't post the C at the minute because it's code-generated and I haven't
time to tidy it up. It's not that great either.. But I don't think this task
will present problems with a microcontroller.)

Kaz Kylheku · Jan 27, 2014

I'm using a very small microcontroller, how to code a simple regular
expression matching algorithm?

What you can do is write the regexes out and possibly even debug them using
some scripting language (that doesn't run on the embedded system, but on your
development box).

Then whiteboard the automata corresponding to the regexes and convert into a
table-driven machine by hand.

For a simple input language, I would skip the regexes and non-deterministic
automata. Rather, I would start by diagramming a deterministic automaton right
from the beginning, from which the state transition table pops out directly.

In a deterministic finite automaton (DFA), there are no empty transitions
(transitions from one state to another without consuming an input token), and
no ambiguous transitions (transitions for the same input token, into different
target states).

If you can directly model your machine's actions as a DFA, things are easy.

Ben Bacarisse · Jan 28, 2014

BartC said:
I don't know about regular expressions, but I can parse the above using
random logic, and it's less than 100 lines.

Yes. In fact I said "You might be in danger of over-engineering this"
before offering a suggestion, but the OP specifically asked for an
alternative to the "obvious" ad-hoc parse.

<snip>

BartC · Jan 28, 2014

Ben Bacarisse said:
Yes. In fact I said "You might be in danger of over-engineering this"
before offering a suggestion, but the OP specifically asked for an
alternative to the "obvious" ad-hoc parse.

As I said, I don't know regular expressions, or whether they can do much
beyond recognising a particular pattern, such as extracting the actual data
to operate on (which I assumed can be represented by three values (C, G and
'STAR').)

A dedicated parser can take care of that easily. If the format changes, then
yes some code changes are needed, but in the recursive descent approach I
started to use (as I'm familiar with that; also not quite as random as I
said), that's straightforward, unless the format will change every five
minutes. I think also that other constraints could be applied more easily
this way.

(But we don't have all the info; maybe the details of each field in the
string
are less important than verifying or classifying the format.)

pozz · Jan 28, 2014

Il 28/01/2014 00:42, BartC ha scritto:

I don't know about regular expressions, but I can parse the above using
random logic,

I'm sorry for the trivial question, but... what is "random logic"?

and it's less than 100 lines. My test result in these numeric
values:

c: -1 if not present, or 0 to 999999 (or a bit more actually as I don't
check the length. Is c a numeric value or a string consisting of
digits? Is 0000 allowed? If that is distinct from 00000, then you don't
want c as a number, but you need a code then that tells you when c is
not present)

Yes, the code should be treated as a string because the code "0000" is
completely different than "000000".

g: 0 if not present, or 1 to 3

star: 0 if not present, or 1

hash: always 1

(I can't post the C at the minute because it's code-generated and I haven't
time to tidy it up. It's not that great either.. But I don't think this
task
will present problems with a microcontroller.)

Is it code generated? Could you explain how, please?

pozz · Jan 28, 2014

Il 28/01/2014 00:56, Kaz Kylheku ha scritto:

What you can do is write the regexes out and possibly even debug them using
some scripting language (that doesn't run on the embedded system, but on your
development box).

Then whiteboard the automata corresponding to the regexes and convert into a
table-driven machine by hand.

How do you make this?

For a simple input language, I would skip the regexes and non-deterministic
automata. Rather, I would start by diagramming a deterministic automaton right
from the beginning, from which the state transition table pops out directly.

In a deterministic finite automaton (DFA), there are no empty transitions
(transitions from one state to another without consuming an input token), and
no ambiguous transitions (transitions for the same input token, into different
target states).

If you can directly model your machine's actions as a DFA, things are easy.

I tried to represent my simple language in a deterministic finite state
machine, but it's a too much complex task for me. I arrived to have too
many states and automaton wasn't readable.

Kaz Kylheku · Jan 28, 2014

Il 28/01/2014 00:42, BartC ha scritto:

I'm sorry for the trivial question, but... what is "random logic"?

"Random" is sometimes used in place of "ad hoc".

Eric Sosman · Jan 28, 2014

Il 28/01/2014 00:56, Kaz Kylheku ha scritto:

How do you make this?

I tried to represent my simple language in a deterministic finite state
machine, but it's a too much complex task for me. I arrived to have too
many states and automaton wasn't readable.

Is there something tricky I've not noticed? It seems to me
that a recognizer for these inputs would have rather few states,
if we permit the recognition of <c> and <g> as "primitives." For
reference, here's your grammar again:

#
*#
*<g>#
<c>*<g>#
<c>#
<c>*#
where
<c> is a 4- to 6-digits code
<g> is the exact digit '1', '2' or '3'.

.... and a recognizer might go something like this:

State 0: // starting out
if input is #, recognize "#" and stop
if input is *, move ahead and enter State 1
if input is <c>, move ahead and enter State 3
else syntax error

State 1: // saw "*", what follows?
if input is #, recognize "*#" and stop
if input is <g>, move ahead and enter State 2
else syntax error

State 2: // saw "*<g>", what follows?
if input is #, recognize "*<g>#" and stop
else syntax error

State 3: // saw "<c>", what follows?
if input is *, move ahead and enter State 4
if input is #, recognize "<c>#" and stop
else syntax error

State 4: // saw "<c>*", what follows?
if input is <g>, move ahead and enter State 5
if input is #, recognize "<c>*#" and stop
else syntax error

State 5: // saw "<c>*<g>", what follows
if input is #, recognize "<c>*<g>#" and stop
else syntax error

The "stop" action might (or might not) want to check that the
input has been completely consumed; it depends on what else
you're trying to do.

Jorgen Grahn · Jan 28, 2014

You might be in danger of over-engineering this. The more future
changes you try to incorporate, the less you'll ever do now!

Yes. And it doesn't strike me as something which will change very
often! If this is something users will have to input, they need to
learn the syntax; then they will also need to relearn it if it's
changed. That in itself is a barrier to change.

/Jorgen

Jorgen Grahn · Jan 28, 2014

.

As I said, I don't know regular expressions, or whether they can do much
beyond recognising a particular pattern, such as extracting the actual data
to operate on (which I assumed can be represented by three values (C, G and
'STAR').)

Some of them can, but when you have verified that input indeed matches
a certain pattern, it's IMO often easier to pick out the pieces you
want using strchr() and strtoul() or whatever. Life is easy when
there's no error handling to be done.

/Jorgen

BartC · Jan 28, 2014

pozz said:
Il 28/01/2014 00:42, BartC ha scritto:

I'm sorry for the trivial question, but... what is "random logic"?

Not a good term, perhaps 'dedicated' logic, ie. program code, is better. Or
'ad hoc'...

Is it code generated? Could you explain how, please?

That's an unimportant detail; I was coding in a somewhat different language
to C, which is translated to C using a tool. Anyway I've shown that C, a
little tidied up but still odd in places, below. (This is rather more than
100 lines, the original was a bit less.)

The parsing is done in 'parseinput()'. Sample testing and display of results
in main(). The actual code is unimportant, it's just to show what it might
look like. Quite a bit depends on how the input is obtained, and what needs
to be done with the output.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void resetinput(char *s);
static void nextinput(void);
static int readg(void);
static int readc(char *c);
static void error(char *m);
static int parseinput(char *s,char *c,int *g,int *star);

static char *inputptr; /* the parser uses these file-scope state variables
*
static char nextchar;

static void resetinput(char *s) { /* initialise the parser */
inputptr = s;
}

static void nextinput(void) { /* obtain next input character */
if (*inputptr==0) {
nextchar = 0;
} else {
nextchar = *inputptr++;
}
}

static int readg(void) {
int g=nextchar-'0';
nextinput();
return g;
}

static int readc(char *c) {
int n=0;
do {
c[n++] = nextchar;
if ((n>6)) {
error("C too long");
return 0;
}
nextinput();
} while (!((nextchar<'0'||nextchar>'9')));
if (n<4) {
error("C too short");
return 0;
}
c[n] = 0;
return 1;
}

static void error(char *m) {
printf("Error: %s",m);
puts("");
}

static int parseinput(char *s,char *c,int *g,int *star) {
resetinput(s);
*c = 0;
*g = 0;
*star = 0;

nextinput();

switch (nextchar) {
case '#':
break;
case '*':
*star = 1;
nextinput();
if (nextchar>='1'&&nextchar<='3') {
*g = readg();
}
if (nextchar!=35) {
error("# expected");
return 0;
}
break;
case 48: case 49: case 50: case 51: case 52: case 53: case 54: case 55:
case 56: case 57: // '0' to '9'
if (!readc(c)) {
return 0;
}
if (nextchar=='*') {
*star = 1;
nextinput();
}
if (nextchar>='1'&&nextchar<='3') {
*g = readg();
}
if (nextchar!='#') {
error("# expected");
return 0;
}
break;
default:
error("Bad start symbol");
return 0;
}
return 1;
}

int main(void) {
char c[20];
int g;
int star;
char input[]="9999*3#";
// char input[]="*2#";

printf("Parsing: %s",input);
puts("");
if (parseinput(input,c,&g,&star)) {
printf("C =<%s> G=%d Star=%d",c,g,star);
puts("");
} else {
printf("Couldn't parse");
puts("");
}
}

glen herrmannsfeldt · Jan 28, 2014

(snip)

Not a good term, perhaps 'dedicated' logic, ie. program code,
is better. Or 'ad hoc'...

As far as I know, the term is used to describe the way computers
were built before microprogramming. If you look inside a computer
(maybe in the 1950's) there were wires going all different directions,
looking somewhat random.

Microprogramming replaces random logic with much simpler logic
and puts all the "randomness" into the microprogram.

-- glen

Ike Naar · Jan 29, 2014

case 48: case 49: case 50: case 51: case 52: case 53: case 54: case 55:
case 56: case 57: // '0' to '9'

Why not

case '0': case '1': case '2': case '3': case '4':
case '5': case '6': case '7': case '8': case '9':

?

BartC · Jan 29, 2014

Ike Naar said:
Why not

case '0': case '1': case '2': case '3': case '4':
case '5': case '6': case '7': case '8': case '9':

?

(I mentioned this was code-generated, the original was '0' .. '9', which is
expanded for the C version. Unfortunately the char constants are just mapped
to integers so are difficult to restore. And it was a lot of fiddly typing
to fix them all for this simple demo.

Perhaps if C allowed such a range in switch cases, then I would have had
less of an excuse, but it's been discussed here a few times and there are
always arguments against it, mainly that it encourages people to type 'A' ..
'Z', even though 99.99% of the time (or 100% of the time for 99.99% of
programs), that would work as expected.)

Paul N · Jan 29, 2014

(I mentioned this was code-generated, the original was '0' .. '9', which is

expanded for the C version. Unfortunately the char constants are just mapped

to integers so are difficult to restore. And it was a lot of fiddly typing

to fix them all for this simple demo.

Perhaps if C allowed such a range in switch cases, then I would have had

less of an excuse, but it's been discussed here a few times and there are

always arguments against it, mainly that it encourages people to type 'A' ..

'Z', even though 99.99% of the time (or 100% of the time for 99.99% of

programs), that would work as expected.)

Not that it helps in this case (where we're using a switch), but the C standard does guarantee that '0', '1', etc, up to '9', are consecutive numbers. Whereas there is no guarantee that, say, 'B' is one more than 'A'.

BartC · Jan 29, 2014

Paul N said:
Not that it helps in this case (where we're using a switch), but the C
standard
does guarantee that '0', '1', etc, up >to '9', are consecutive numbers.
Whereas
there is no guarantee that, say, 'B' is one more than 'A'.

Perhaps then a compiler can issue a warning when a range involving
alphabetic characters is used (and which can be turned off). Because a lot
of the time the compiler and/or the programmer will know when these
sequences are going to be consecutive.

And even when there is uncertainty, it wouldn't be out of the question for a
compiler to convert a range such as 'A' to 'F', to the explicit sequence
'A', 'B', 'C', 'D', 'E', 'F'.

(I'm talking about an extension where switch cases can specify a range.)

I need some help on a format issue that should be simple for someone here (but not me!)	0	Jul 6, 2023
C++ Student, need help with very simple concept	2	Mar 13, 2023
How to implement a html parser in java?	1	Dec 28, 2023
A very simple parser with scanf & C	49	Sep 23, 2011
Can anyone help me code a simple python code?	1	Mar 13, 2022
Please, help me.	1	Jun 30, 2021
Why is Python telling me variable is local not global?	3	Sep 2, 2023
Help me	2	Aug 2, 2022

Very simple parser... not for me

pozz

Malcolm McLean

Ben Bacarisse

pozz

BartC

Kaz Kylheku

Ben Bacarisse

BartC

pozz

pozz

Kaz Kylheku

Eric Sosman

Jorgen Grahn

Jorgen Grahn

BartC

glen herrmannsfeldt

Ike Naar

BartC

Paul N

BartC

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads