newbie question on sequencing

Mar Thomas

Here's my problem. I have an XML file which looks like this:
<myfile>
<parent num="1.00">
<child num="1.01"/>
<child num="1.02"/>
<child num="1.03"/>
</parent>
<parent num="1-a.00">
<child num="1-a.01"/>
<child num="1-a.02"/>
<child num="1-a.03"/>
</parent>
<parent num="A">
<child num="a"/>
<child num="b"/>
<child num="c"/>
</parent>
</myfile>

You will notice that the numbering structure changes for every element. How
can I find out

1. What the sequence is for each element
2. Whether any numbers are missing in each of the sequences

Can my XML parser help me get this info? I don't know where to start.

Thanks
 
Roedy Green

<parent num="1.00">
<child num="1.01"/>
<child num="1.02"/>
<child num="1.03"/>
</parent>
<parent num="1-a.00">
<child num="1-a.01"/>
<child num="1-a.02"/>
<child num="1-a.03"/>
</parent>
<parent num="A">
<child num="a"/>
<child num="b"/>
<child num="c"/>
</parent>

Let's break the problem in two. Problem one: extract the sequence you
want to analyse from the XML, e.g. "1.01", "1.02", "1.03" or "a", "b",
"c".

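A minimal Python sketch of that extraction step, for what it's worth. It
assumes a well-formed version of the sample (quoted attribute values and
self-closing child elements), since a parser would reject the file as
posted. The name extract_sequences is just an illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: a well-formed rendering of part of the posted sample.
SAMPLE = """<myfile>
  <parent num="1.00">
    <child num="1.01"/><child num="1.02"/><child num="1.03"/>
  </parent>
  <parent num="A">
    <child num="a"/><child num="b"/><child num="c"/>
  </parent>
</myfile>"""

def extract_sequences(xml_text):
    """Map each parent's num to the list of its children's num values."""
    root = ET.fromstring(xml_text)
    return {
        parent.get("num"): [child.get("num") for child in parent.findall("child")]
        for parent in root.findall("parent")
    }

sequences = extract_sequences(SAMPLE)
print(sequences)
```

Each list in the result is one sequence to analyse; the parent values
themselves form a further sequence if you need to check those too.
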
Now for the analysis:

1. Use a regex to see whether a sequence follows a known pattern. Apply
the regex for each of your patterns to each value in turn. See
http://mindprod.com/jgloss/regex.html

2. Now that you have identified the pattern, you can create a generator of
the expected value given the previous value. If the two don't match, you
have a break.
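
The two steps above might be sketched in Python roughly as follows. The
pattern table, its two entries and all function names are made up for
illustration; a real table would need one entry per numbering style you
expect to meet.

```python
import re

def next_dotted(value):
    # "1.01" -> "1.02": bump the digits after the last dot, keeping the width.
    head, tail = value.rsplit(".", 1)
    return f"{head}.{int(tail) + 1:0{len(tail)}d}"

def next_letter(value):
    # "a" -> "b". What follows "z" is deliberately left undefined here.
    return chr(ord(value) + 1)

# Hypothetical table: (name, regex for one value, successor function).
PATTERNS = [
    ("dotted", re.compile(r".+\.[0-9]+"), next_dotted),
    ("lower", re.compile(r"[a-z]"), next_letter),
]

def identify(values):
    """Return (name, successor) for the first pattern every value matches."""
    for name, regex, successor in PATTERNS:
        if all(regex.fullmatch(v) for v in values):
            return name, successor
    return None

def find_breaks(values):
    """Return (previous, expected, actual) for each out-of-sequence step."""
    found = identify(values)
    if found is None:
        raise ValueError("no known pattern matches all values")
    _, successor = found
    breaks = []
    for prev, actual in zip(values, values[1:]):
        expected = successor(prev)
        if expected != actual:
            breaks.append((prev, expected, actual))
    return breaks

print(find_breaks(["1.01", "1.02", "1.04"]))
```

A break is reported as the value before the gap, the value the generator
expected, and the value actually found.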
 
Brad BARCLAY

Mar said:
You will notice that the numbering structure changes for every element. How
can I find out

1. What the sequence is for each element
2. Whether any numbers are missing in each of the sequences

Can my XML parser help me get this info? I don't know where to start.

About all that an XML parser is going to be able to give you is the
data itself. The parser neither knows nor cares what the data actually
means, so long as the document is well-formed and, if you validate it,
conforms to the relevant DTD.

I assume you're trying to determine the sequencing programmatically?

There are two things you really need to accomplish here -- the first is
a regular expression that encompasses the "language" of the values, and
the second is to create a dictionary ordering for the value elements so
that you can properly increment them.

For the first, you'll want to start by defining the relevant alphabet
for the language. To do this, you'll want to inspect the elements and
identify:

- The letter elements
- The numerical elements
- The symbol elements

Before you go much further, try to determine whether or not there are
going to be _any_ rules for the numbering -- i.e. are non-alphanumerics
considered static separators that never change, or can they too be
incremented? If the former, things are a bit easier -- if a
non-alphanumeric occurs in the numbering, it will stay in the same
"position" throughout all members, making the construction of the
regular expression defining that numbering easier. If they can be an
active element of the numbering, things are somewhat more difficult.
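
As a rough Python sketch of that inspection: split each value into runs
of letters, digits and symbols, then check whether every value has the
same layout with identical symbols. It assumes ASCII letters and digits
only, and the function names are hypothetical.

```python
import re

# One named group per class of element in the alphabet.
TOKEN = re.compile(
    r"(?P<letter>[A-Za-z]+)|(?P<digit>[0-9]+)|(?P<symbol>[^A-Za-z0-9]+)"
)

def tokenize(value):
    """Split a value into (kind, text) runs of letters, digits or symbols."""
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(value)]

def shape(value):
    """Token kinds in order, keeping the text only for symbol tokens."""
    return tuple(
        (kind, text if kind == "symbol" else None)
        for kind, text in tokenize(value)
    )

def separators_are_static(values):
    """True when every value has the same layout and the same symbols."""
    return len({shape(v) for v in values}) == 1

print(tokenize("1-a.01"))
print(separators_are_static(["1-a.01", "1-a.02", "1-a.03"]))
```

If separators_are_static returns False, either the symbols take part in
the numbering or the values belong to more than one sequence.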

As well, you'll have to try to determine what is to happen when a
letter or number identifier reaches its maximum amount for the given
number of digits. For example, if you have the following numbering in
your XML file:

'1'
'2'
'3'

...we can probably safely assume that '4' is next. But what comes
after '9'? Will it be '10' (adding another digit where one didn't exist
before), 'A' (staying single-digit, but either switching to
letters _or_ assuming a hexadecimal representation), or will this not
be allowed?


The same goes for letters. What comes after 'z'? 'A'? 'aa'?
Undefined? Nothing?
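
One way to keep that decision explicit is to make the overflow rule a
parameter. The Python sketch below handles a single all-digit or
all-lowercase field and offers just two of the possible policies: grow
by one position ("9" becomes "10", "z" becomes "aa") or refuse. Both the
policies and the names are assumptions, not anything your data dictates.

```python
import string

DIGITS = string.digits
LOWER = string.ascii_lowercase

def successor(field, overflow="grow"):
    """Next value of one all-digit or all-lowercase field."""
    if not field:
        raise ValueError("empty field")
    alphabet = DIGITS if field[0] in DIGITS else LOWER
    if any(ch not in alphabet for ch in field):
        raise ValueError(f"mixed or unsupported field: {field!r}")
    chars = list(field)
    for i in range(len(chars) - 1, -1, -1):
        pos = alphabet.index(chars[i])
        if pos + 1 < len(alphabet):
            chars[i] = alphabet[pos + 1]
            return "".join(chars)
        chars[i] = alphabet[0]  # wrap this position and carry to the left
    # Every position wrapped, so the field is too narrow for the next value.
    if overflow == "error":
        raise OverflowError(f"{field!r} has no successor of the same width")
    lead = "1" if alphabet == DIGITS else "a"
    return lead + "".join(chars)

print(successor("9"), successor("z"), successor("az"))
```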

If you're working with numerical values, are you going to assume
they're decimal? If you only have the numbers 1 - 3 as input, as above,
you could be working with octal values, where there are no '8' or '9'
digits. If you know for certain that only decimal values will be
allowed, this makes such issues quite a bit easier.
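
To illustrate how little the observed digits pin down, here is a small
Python sketch. All it can compute is a lower bound on the base; the list
of candidate bases is an assumption for the example.

```python
def minimum_base(values):
    """Smallest base in which every character of every value is a digit."""
    highest = max(int(ch, 36) for value in values for ch in value)
    return max(highest + 1, 2)

def plausible_bases(values, candidates=(8, 10, 16)):
    """Candidate bases that the observed values do not rule out."""
    floor = minimum_base(values)
    return [base for base in candidates if base >= floor]

print(plausible_bases(["1", "2", "3"]))
print(plausible_bases(["7", "8", "9"]))
```

With only '1' to '3' as input, octal, decimal and hexadecimal all remain
possible; seeing an '8' or '9' rules out octal but nothing else.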

All of these factors will determine how you construct your regular
expression which, if you don't have any rules, can be difficult to
construct algorithmically. To achieve what you want, it's not enough to
create an expression that accepts the values present and the values
presumed: ".*" will accept your values (and everything else while
you're at it). What you need is an expression which accepts _exactly_
your language -- i.e. it will accept all the allowable elements of the
language, but nothing that isn't part of the language.
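
A quick Python illustration of the difference. The "exact" pattern here
is only a guess at the language behind the first sample (an integer, a
dot, then exactly two digits); a different guess would be just as
defensible.

```python
import re

LOOSE = re.compile(r".*")
# Assumed language for the first sample: integer, dot, two digits.
EXACT = re.compile(r"[0-9]+\.[0-9]{2}")

observed = ["1.01", "1.02", "1.03"]
outsiders = ["1-a.01", "a", "1.1", ""]

# Both patterns accept everything that was actually observed...
print(all(LOOSE.fullmatch(v) for v in observed))
print(all(EXACT.fullmatch(v) for v in observed))

# ...but only the loose one also accepts values from outside the language.
print([v for v in outsiders if LOOSE.fullmatch(v)])
print([v for v in outsiders if EXACT.fullmatch(v)])
```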

Once you have those in place, you can use them to ensure that the
elements are consistent with the language they appear to be part of.

The next step is to have some dictionary rules in place for
incrementing and comparison. Assuming the right-hand-digit incrementing
scheme that the common numerical systems use, this will be easy -- you
can use a straight ASCII increment for every element that is not a
(static) separator, incrementing just as you would if you were working
with decimal numbers. To verify that the elements present do indeed
form a series, simply read the first value, increment it by one, and
check whether the result equals the next value. If it does, it's in
sequence. If not, it's not (or you've made an incorrect assumption as
to the values).
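
A bare-bones Python sketch of that check, under the assumptions just
described: separators are static, the rightmost alphanumeric is
incremented in ASCII order, and a '9', 'z' or 'Z' wraps and carries into
the next alphanumeric to its left, even across a separator. The width
never grows, so the sketch raises when it runs out of positions.

```python
def increment(value):
    """Add one to the rightmost alphanumeric, carrying left past separators."""
    wraps = {"9": "0", "z": "a", "Z": "A"}
    chars = list(value)
    for i in range(len(chars) - 1, -1, -1):
        ch = chars[i]
        if not (ch.isascii() and ch.isalnum()):
            continue  # static separator: leave it untouched
        if ch in wraps:
            chars[i] = wraps[ch]  # wrap around and carry to the left
            continue
        chars[i] = chr(ord(ch) + 1)  # plain ASCII increment
        return "".join(chars)
    raise OverflowError(f"no room to increment {value!r}")

def is_series(values):
    """True when each value is the increment of the one before it."""
    return all(increment(a) == b for a, b in zip(values, values[1:]))

print(increment("1-a.03"))
print(is_series(["1.01", "1.02", "1.03"]))
print(is_series(["1.01", "1.03"]))
```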

You've asked a very difficult set of questions -- ones which have no
specific answers (and no real "optimal" answer). For any "word" in a
language, there are an infinite number of grammars that can contain that
"word", most of which will also contain invalid values, and many of
which will reject valid values in the same language. You're trying to
divine a whole language based on a few elements. The only way you can
be precise in this instance is if you assume that those values are the
_only_ acceptable values in the language, and you construct a regular
expression that accepts exactly and only those values -- which doesn't
appear to be what you want.

The long and the short of it is that, unless you have some really
explicit rules, or create an XML element (or attribute) where the
developer can define the regular expression in use for their numbering
language, any solution you come up with is going to be imprecise, and
may be error-prone with certain types of numberings.
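
The attribute idea could look something like the Python sketch below.
The numpattern attribute is purely hypothetical (nothing like it exists
in the file as posted), and the XML shown is a well-formed rendering of
the sample with one deliberately bad value.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical: each <parent> declares, in a "numpattern" attribute, the
# regex that its children's num values must match.
SAMPLE = r"""<myfile>
  <parent num="1.00" numpattern="1\.[0-9]{2}">
    <child num="1.01"/><child num="1.02"/><child num="1-03"/>
  </parent>
</myfile>"""

def nonconforming(xml_text):
    """Return child num values that fail their parent's declared pattern."""
    bad = []
    for parent in ET.fromstring(xml_text).iter("parent"):
        declared = parent.get("numpattern")
        if declared is None:
            continue  # nothing declared, nothing to check
        regex = re.compile(declared)
        bad += [
            child.get("num")
            for child in parent.findall("child")
            if not regex.fullmatch(child.get("num", ""))
        ]
    return bad

print(nonconforming(SAMPLE))
```

This only settles which values belong to the language; the increment
rule would still have to be declared or assumed separately.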

(It should also be noted here that there are a lot of languages which
regular expressions _cannot_ define. These include anything that
requires some form of "memory" between states -- something which would
need a more powerful grammar, such as a context-free one, instead of a
finite automaton).

HTH!

Brad BARCLAY
 
