some kind of detector, need advice...

Joh

Hello,

(sorry for the long post)

I think I have missed something in the code below. I would like to
design some kind of detector with Python, but I feel totally stuck
now and could use some advice to move forward :(

data = "it is an <atag> example of the kind of </atag> data it must
handle and another kind of data".split(" ")
(actually data are splitted line by line in a file, and contained
other than simple words so using ' '<space> is just to post here)
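In the real script the list is built with something like (the file
name is made up):

lines = [line.strip() for line in open("data.txt")]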

I would like to be able to write some kind of easy rule like:
detect1 = """th.* kind of data"""
or better:
detect2 = """th.* * data""" ### the second '*' acts as a wildcard,
as in re: some sort of "skip zero or more lines"
which would give me the spans where it matched, here:
[(6, 11), (15, 19)]

I have written the code below, which handles detect1, but I am still
unable to adapt it to detect2. I think I am missing some way to step
back in case of a failed match.
import re

def ignore(s):
    ### skip xml-ish tags like <atag>
    if s.startswith("<"):
        return True
    return False

class Detector:
    def __init__(self, rule, separator=" "):
        self.rule = tuple(rule.split(separator))
        self.length = len(self.rule)
        self.compiled = []
        self.filled = 0   ### number of real (non-'*') patterns
        for i in range(self.length):
            current = self.rule[i]
            if current == '*':
                ### special case, one may advance...
                self.compiled.append('*')
            else:
                self.filled += 1
                self.compiled.append(re.compile(current))
        self.compiled = tuple(self.compiled)

    ###
    def match(self, lines, ignore=None):
        spans = []
        i, current, memorized, matched = 0, 0, None, None
        while 1:
            if i == len(lines):
                break
            line = lines[i]
            i += 1
            print "%3d: %s (%s)" % (i, line, current),
            if ignore and ignore(line):
                print ' - ignored'
                continue
            regexp = self.compiled[current]
            if regexp == '*':
                pass   ### HERE I NEED SOME ADVICE...
            elif hasattr(regexp, 'search') and regexp.search(line):
                ### matched the current pattern
                print ' + matched',
                matched = True
            else:
                ### mismatch: start over from the first pattern
                current, memorized, matched = 0, None, None
            if matched:
                if memorized is None:
                    memorized = i - 1   ### remember where the match began
                if current == self.filled - 1:
                    print " + detected!",
                    spans.append((memorized, i))
                    current, memorized = 0, None
                current += 1
            print
        return spans

print Detector(detect1).match(data, ignore)   ### produces the trace below

1: it (0)
2: is (0)
3: an (0)
4: <atag> (0) - ignored
5: example (0)
6: of (0)
7: the (0) + matched
8: kind (1) + matched
9: of (2) + matched
10: </atag> (3) - ignored
11: data (3) + matched + detected!
12: it (1)
13: must (0)
14: handle (0)
15: and (0)
16: another (0) + matched
17: kind (1) + matched
18: of (2) + matched
19: data (3) + matched + detected!
[(6, 11), (15, 19)] ### these are actually list indexes; add 1 to get
line numbers
 
Lonnie Princehouse

I'm not really sure I followed that either, but here's a restatement
of the problem according to my feeble understanding-

Joh wants to break a string into tokens (in this case, just separated
by whitespace) and then perform pattern matching based on the tokens.
He then wants to find the span of matched patterns, in terms of token
numbers.

For example, given the input "this is a test string" and the pattern
"is .* test", the program will first tokenize the string:

tokens = ['this', 'is', 'a', 'test', 'string']

...and then will match ['is', 'a', 'test'] and return the indices of
the matched interval [(1, 3)]. (I don't know if that interval is
supposed to be inclusive.)

Note that .* would match zero or more tokens, not characters.
Additionally, it seems that xml-ish tags ("<atag>" in the example)
would be ignored.

Implementing this from scratch would require hacking together some
kind of LR parser. Blah. Fortunately, because tokenization is
trivial, it is possible to translate all of the "detectors" (aka
token-matching patterns) directly into regular expressions; then, all
you have to do is correlate the match object intervals (indexed by
character) into token intervals (indexed by token).

To translate a "detector" into a proper regular expression, just make
a few substitutions:

import re

def detector_to_re(detector):
    """ translate a token pattern into a regular expression """
    # could be more efficient, but this is good for readability

    # "." => "(\S+)"  (match one token)
    detector = re.sub(r'\.', r'(\S+)', detector)

    # whitespace block => "(\s+)"  (match a stretch of whitespace)
    detector = re.sub(r'\s+', r'(\s+)', detector)

    return detector
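For example (if I have the substitutions right), the pattern from the
restatement above translates to:

print(detector_to_re("is .* test"))
# -> is(\s+)(\S+)*(\s+)test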

def apply_detector(detector, data):

    # strip the xml-ish tags first, so the character offsets
    # computed below stay valid
    data = re.sub('<.*?>', '', data)

    # compile mapping of character -> token indices
    i = 0
    token_indices = {}
    for match in re.finditer(r'\S+', data):
        token_indices[match.start()] = i
        token_indices[match.end()] = i
        i += 1

    detector_re = re.compile(detector_to_re(detector))

    intervals = []

    # note: these lookups assume each match starts and ends exactly on
    # a token boundary; a match beginning mid-token raises a KeyError
    for match in detector_re.finditer(data):
        intervals.append(
            (
                token_indices[match.start()],
                token_indices[match.end()]
            )
        )

    return intervals
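A quick sanity check (keeping in mind the token-boundary caveat in the
comment above), which should print the interval covering 'a' and 'test':

print(apply_detector("a test", "this is a test string"))
# -> [(2, 3)]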


On second thought, it is probably best to throw out this whole scheme
and just use regular expressions, even if that is less convenient. That
way you don't mix the syntaxes of regexps and token expressions (like
"th.* *" would do...)
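
For the record, here is a minimal sketch of that plain-regexp route.
The pattern below is only my guess at what "th.* * data" is after, and
the reported indices count tokens after the tags are stripped, so they
won't line up exactly with Joh's trace (which keeps the tags as lines):

import re
from bisect import bisect_right

def token_spans(pattern, data):
    """ find regexp matches, report them as inclusive token intervals """
    data = re.sub(r'<.*?>', '', data)   # drop the xml-ish tags up front
    # start offset of every token, in ascending order
    starts = [m.start() for m in re.finditer(r'\S+', data)]
    spans = []
    for m in re.finditer(pattern, data):
        first = bisect_right(starts, m.start()) - 1    # token holding the match start
        last = bisect_right(starts, m.end() - 1) - 1   # token holding the match end
        spans.append((first, last))
    return spans

data = ("it is an <atag> example of the kind of </atag> data"
        " it must handle and another kind of data")
print(token_spans(r'th\S*\s+kind\s+of(?:\s+\S+)*?\s+data', data))
# -> [(5, 8), (13, 16)]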
 
