regexp search on infinite string?

Paddy

Let's say I have a generator running that generates successive
characters of a 'string'.
From what I know, if I want to do a regexp search for a pattern of
characters then I would have to 'freeze' the generator and pass the
characters so far to re.search.
It is expensive to create successive characters, but caching could be
used for past characters. Is it possible to wrap the generator in a
class, possibly inheriting from string, that would allow the regexp
searching of the string but without terminating the generator? In
other words, duck typing for the usual string object needed by
re.search?

- Paddy.
 
James Stroud

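re.search and re.compile check for str or unicode types explicitly, so
you need to turn your data into one of those before using the module.

    import re

    buffer = []
    while True:
        # pull one more character from the (expensive) generator
        buffer.append(next(mygenerator))
        # re needs a real string, so join what has been collected so far
        m = re.search(pattern, "".join(buffer))
        if m:
            process(m)
            buffer = []  # start over after a match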

James
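
A self-contained variant of the loop above, offered only as a sketch. The helper name iter_matches and the endless "abcabc..." stream are assumptions made up for illustration; they stand in for the real generator and pattern, which the thread does not show.

```python
import re
from itertools import cycle, islice


def iter_matches(chars, pattern):
    """Yield a match object as soon as the characters seen so far contain one."""
    regex = re.compile(pattern)
    buffer = []
    for ch in chars:
        buffer.append(ch)
        # Re-join and re-search after every character, as in the loop above.
        m = regex.search("".join(buffer))
        if m:
            yield m
            buffer = []  # start afresh after each match


# Hypothetical stand-in for the expensive generator: an endless stream.
stream = cycle("abc")

# Only as many characters are consumed as are needed for three matches.
first_three = [m.group() for m in islice(iter_matches(stream, r"ca"), 3)]
print(first_three)
```

Because the search runs after every new character, a pattern such as a+ is reported as soon as its shortest possible match is complete, and the repeated join makes the cost grow quadratically with the distance between matches.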
 
Paddy

Thanks James.
 
Paddy

There seems to be no way of breaking into the re library accessing
characters from the string:

>>> class S(str):
...     def __getitem__(self, *a):
...         print "getitem:",a
...         return str.__getitem__(self, *a)
...     def __get__(self, *a):
...         print "get:",a
...         return str.__get__(self, *a)
...
>>> s = S('sdasd')
[...]
(2, 4)

- Paddy.
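
For readers on Python 3, a minimal sketch of the same kind of experiment on CPython. The class name TracingStr and the pattern "as" are assumptions chosen for illustration (the pattern simply gives a (2, 4) span on 'sdasd'). Calls to __getitem__ are recorded in a list rather than printed, so it is easy to see that re.search never goes through the Python-level method.

```python
import re


class TracingStr(str):
    """A str subclass that records every Python-level __getitem__ call."""

    calls = []

    def __getitem__(self, index):
        TracingStr.calls.append(index)
        return str.__getitem__(self, index)


s = TracingStr("sdasd")

# Assumed pattern, for illustration only.
m = re.search("as", s)
span = m.span()

# Snapshot of the calls recorded while re was searching.
calls_during_search = list(TracingStr.calls)

# Ordinary indexing from Python code does go through the override.
_ = s[0]
calls_after_indexing = list(TracingStr.calls)

print(span, calls_during_search, calls_after_indexing)
```

The search finds its match although no __getitem__ call is recorded until the string is indexed from Python code.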
 
John Machin

That would no doubt be because it either copies the input [we hope
not] or more likely because it hands off the grunt work to a C module
(_sre).

Why do you want to "break into" it, anyway?
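
A related point, sketched here under Python 3 rather than the Python version used in the thread: since the matching runs in C over the object's memory, re also accepts bytes-like objects that expose a contiguous buffer, such as bytearray or mmap, provided the pattern is a bytes pattern. The chunks and the pattern below are made up for illustration.

```python
import re

# Accumulate stream data in a mutable, contiguous buffer instead of a list
# of one-character strings; no "".join() is needed before searching.
data = bytearray()
for chunk in (b"sd", b"as", b"d"):  # made-up chunks standing in for a stream
    data.extend(chunk)

# A bytes pattern can be searched directly against the bytearray.
m = re.search(rb"as", data)
span = m.span()
print(span)
```

This does not make the generator lazy from the point of view of re, but it avoids rebuilding a new string for every search.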
 
Paddy

John Machin said:
That would no doubt be because it either copies the input [we hope
not] or more likely because it hands off the grunt work to a C module
(_sre).

Yes, it seems to need a buffer/string, so it probably accesses a
contiguous area of memory from C.

John Machin said:
Why do you want to "break into" it, anyway?

A simulation generates a stream of data that could be gigabytes long,
in which I'd like to find interesting bits by doing a regexp search. I
could use megabyte-length sliding buffers, and probably will have to.

- Paddy.
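
One way the sliding-buffer idea could look, as a rough sketch and not a tested solution for the simulation in question. The function name search_stream and the parameter max_match_len are hypothetical. The sketch assumes that no match is longer than max_match_len characters, that matches are non-empty, and that the pattern uses no lookahead or lookbehind reaching outside the match.

```python
import re


def search_stream(chunks, pattern, max_match_len):
    """Yield (absolute_start, text) for matches in a stream of string chunks.

    Assumes every match is at most max_match_len characters long, so only
    a short tail of each buffer has to be carried over to the next chunk.
    """
    regex = re.compile(pattern)
    buf = ""
    offset = 0  # absolute stream position of buf[0]
    for chunk in chunks:
        buf += chunk
        keep_from = 0
        for m in regex.finditer(buf):
            if m.start() + max_match_len > len(buf):
                break  # too close to the end: later data could change it
            yield offset + m.start(), m.group()
            keep_from = m.end()
        # Keep the tail in which a match could still begin or be extended.
        keep_from = max(keep_from, len(buf) - (max_match_len - 1))
        buf = buf[keep_from:]
        offset += keep_from
    # The stream has ended, so whatever is left can be searched as is.
    for m in regex.finditer(buf):
        yield offset + m.start(), m.group()


# Made-up chunks; a real run would use chunks of a megabyte or so.
found = list(search_stream(["xxab", "cxxa", "bcab", "c"], r"abc", 3))
print(found)
```

The buffer never grows much beyond one chunk plus max_match_len characters, so memory use stays bounded however long the stream is.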
 
