Matching a string to extract substrings for which some function returns true

Amit Khemka

Hello All,

Say you have some string: "['a', 'b', 1], foobar ['d', 4, ('a', 'e')]"
Now I want to extract all substrings for which
"isinstance(eval(substr), list)" is "True".
One way is to walk through the whole sample string and check the
condition. I was wondering if there is any smarter way of doing the same,
maybe using regular expressions.

Thanks,
amit.
 
Steven D'Aprano

Amit Khemka said:
Hello All,

say you have some string: "['a', 'b', 1], foobar ['d', 4, ('a', 'e')]"
Now I want to extract all substrings for which
"isinstance(eval(substr), list)" is "True".

That's an awfully open-ended question. Is there some sort of structure to
the string? What defines a substring? What should you get if you extract
from this string?

"[[[[[]]]]]"

Is that one list or five?

Amit Khemka said:
now one way is to walk through the whole sample string and check the
condition,

Yes. Where does the string come from? Can a hostile user pass bad strings
to you and crash your code?
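
For reference, the "walk through the string" approach can be sketched as below. This is only an illustration, not something posted in the thread: it assumes that only outermost list literals count (so "[[[[[]]]]]" is one list, not five), it swaps eval for ast.literal_eval so that only literals are accepted, and the name extract_lists is made up. It is quadratic in the worst case, and very deeply nested input could still exhaust memory or the recursion limit.

```python
import ast


def extract_lists(text):
    """Return (substring, value) pairs for outermost list literals in text."""
    found = []
    i = 0
    while i < len(text):
        if text[i] != "[":
            i += 1
            continue
        # Try each closing bracket after position i, shortest candidate first.
        end = text.find("]", i + 1)
        while end != -1:
            candidate = text[i:end + 1]
            try:
                value = ast.literal_eval(candidate)
            except (ValueError, SyntaxError, TypeError):
                value = None
            if isinstance(value, list):
                found.append((candidate, value))
                break
            end = text.find("]", end + 1)
        # Skip past the match, or past this "[" if nothing parsed.
        i = end + 1 if end != -1 else i + 1
    return found


sample = "['a', 'b', 1], foobar ['d', 4, ('a', 'e')]"
for substring, value in extract_lists(sample):
    print(substring)
```

Because the scan resumes after each successful match, nested lists are reported only as part of their outermost parent.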
 
Amit Khemka

Well, actually the problem is that I have a list of tuples which I cast to a
string and then put in an HTML page as the value of a hidden variable.
And when I get the string back, I want to cast it back to a list of tuples:
ex:
input: "('foo', 1, 'foobar', (3, 0)), ('foo1', 2, 'foobar1', (3, 1)),
('foo2', 2, 'foobar2', (3, 2))"
output: [('foo', 1, 'foobar', (3, 0)), ('foo1', 2, 'foobar1', (3, 1)),
('foo2', 2, 'foobar2', (3, 2))]

I hope that explains it better...

cheers,


 
Fredrik Lundh

Amit said:
Well, actually the problem is that I have a list of tuples which I cast to a
string and then put in an HTML page as the value of a hidden variable.
And when I get the string back, I want to cast it back to a list of tuples:
ex:
input: "('foo', 1, 'foobar', (3, 0)), ('foo1', 2, 'foobar1', (3, 1)),
('foo2', 2, 'foobar2', (3, 2))"
output: [('foo', 1, 'foobar', (3, 0)), ('foo1', 2, 'foobar1', (3, 1)),
('foo2', 2, 'foobar2', (3, 2))]

I hope that explains it better...

what do you think happens if the user manipulates the field values
so they contain, say

os.system('rm -rf /')

or

"'*'*1000000*2*2*2*2*2*2*2*2*2"

or something similar?

if you cannot cache session data on the server side, I'd
recommend inventing a custom record format, and doing your
own parsing. turning your data into e.g.

"foo:1:foobar:3:0+foo1:2:foobar1:3:1+foo2:2:foobar2:3:2"

is trivial, and the resulting string can be trivially parsed by a couple
of string splits and int() calls.
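
A minimal sketch of such a record format, assuming every record has the shape (name, number, label, (a, b)) from the example and that the string fields never contain ":" or "+". The function names encode_records and decode_records are made up for illustration.

```python
def encode_records(records):
    """Flatten (name, n, label, (a, b)) tuples into 'name:n:label:a:b+...'."""
    parts = []
    for name, n, label, (a, b) in records:
        parts.append(":".join([name, str(n), label, str(a), str(b)]))
    return "+".join(parts)


def decode_records(text):
    """Rebuild the list of tuples; malformed input raises ValueError."""
    records = []
    if not text:
        return records
    for chunk in text.split("+"):
        # Unpacking fails with ValueError if a record has the wrong field count.
        name, n, label, a, b = chunk.split(":")
        records.append((name, int(n), label, (int(a), int(b))))
    return records


data = [("foo", 1, "foobar", (3, 0)),
        ("foo1", 2, "foobar1", (3, 1)),
        ("foo2", 2, "foobar2", (3, 2))]
field = encode_records(data)
assert decode_records(field) == data
```

Nothing here is ever evaluated as code, so a manipulated field can at worst produce a ValueError or unexpected (but inert) strings and integers.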

to make things a little less obvious, and make it less likely that some
character in your data causes problems for the HTML parser, you can
use base64.encodestring on the result (this won't stop a hacker, of
course, so you cannot put sensitive data in this field).

</F>
 
Amit Khemka

Fredrik, thanks for your suggestion. The HTML pages that are generated
are for internal use, though, and input is verified before
processing.

And more than just a solution in the current context, I was actually
more curious about how one can do this in Python.

cheers,
amit.

 
Mike Meyer

Amit Khemka said:
Well, actually the problem is that I have a list of tuples which I cast to a
string and then put in an HTML page as the value of a hidden variable.
And when I get the string back, I want to cast it back to a list of tuples:
ex:
input: "('foo', 1, 'foobar', (3, 0)), ('foo1', 2, 'foobar1', (3, 1)),
('foo2', 2, 'foobar2', (3, 2))"
output: [('foo', 1, 'foobar', (3, 0)), ('foo1', 2, 'foobar1', (3, 1)),
('foo2', 2, 'foobar2', (3, 2))]

I hope that explains it better...

This is a serious security risk, as you can't trust the data not to do
arbitrary things to your system when eval'ed.

I'd look into pickling the list of tuples to get the string. You'll
want to use protocol 0, and may need to encode the string in any
case. You'll also want to investigate the security implications of
using pickle.
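
A minimal sketch of the pickle route, assuming the field only ever comes back from a trusted source. The helper names to_field and from_field are made up, and base64 is used as the extra encoding step.

```python
import base64
import pickle


def to_field(data):
    """Pickle with protocol 0, then base64-encode for use in an HTML field."""
    raw = pickle.dumps(data, protocol=0)
    return base64.b64encode(raw).decode("ascii")


def from_field(field):
    """Reverse to_field.

    WARNING: pickle.loads can execute arbitrary code embedded in the
    pickle, so this is only acceptable for data nobody could tamper with.
    """
    return pickle.loads(base64.b64decode(field))


data = [("foo", 1, "foobar", (3, 0)), ("foo1", 2, "foobar1", (3, 1))]
assert from_field(to_field(data)) == data
```

Unpickling untrusted input is no safer than eval, so on its own this does not address the security concern above.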

<mike
 
Paul Rubin

Mike Meyer said:
This is a serious security risk, as you can't trust the data not to do
arbitrary things to your system when eval'ed.
I'd look into pickling the list of tuples to get the string.

The whole scheme of putting the stuff on the html page and then
getting it back from the client is ill-advised. Keep the info on the
server and just have the client send back some token (session ID
usually) saying where to find it on the server. If you absolutely
have to put this sort of data on the client, append a cryptographic
authentication code using the hmac module, and don't believe the data
unless the authentication verifies.
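
A small sketch of the hmac idea, assuming the payload is a plain string and the key lives only on the server. SECRET_KEY, sign, and verify are illustrative names, and the "|" separator is an arbitrary choice.

```python
import hashlib
import hmac

# Hypothetical server-side secret; it must never be sent to the client.
SECRET_KEY = b"replace-with-a-real-secret"


def _mac(payload):
    return hmac.new(SECRET_KEY, payload.encode("utf-8"), hashlib.sha256).hexdigest()


def sign(payload):
    """Append an HMAC-SHA256 tag to the payload."""
    return payload + "|" + _mac(payload)


def verify(signed):
    """Return the payload if the tag matches, otherwise None."""
    payload, sep, tag = signed.rpartition("|")
    if not sep:
        return None
    # compare_digest runs in constant time; bytes avoid errors on non-ASCII tags.
    if hmac.compare_digest(tag.encode("utf-8"), _mac(payload).encode("ascii")):
        return payload
    return None


signed = sign("foo:1:foobar:3:0")
assert verify(signed) == "foo:1:foobar:3:0"
assert verify(signed.replace("foo:1", "foo:2")) is None
```

The tag proves only that the data was not altered; the payload itself remains readable by the client.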
 
Fredrik Lundh

I said:
if you cannot cache session data on the server side, I'd
recommend inventing a custom record format, and doing your
own parsing. turning your data into e.g.

"foo:1:foobar:3:0+foo1:2:foobar1:3:1+foo2:2:foobar2:3:2"

is trivial, and the resulting string can be trivially parsed by a couple
of string splits and int() calls.

on the other hand, the "myeval" function I posted here

http://article.gmane.org/gmane.comp.python.general/433160

should be able to deal with your data, as well as handle data from
malevolent sources without bringing down your computer.

just add

    if token[1] == "(":
        out = []
        token = src.next()
        while token[1] != ")":
            out.append(_parse(src, token))
            token = src.next()
            if token[1] == ",":
                token = src.next()
        return tuple(out)

after the corresponding "[" part, and call it like:

data = myeval("[" + input + "]")
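
For comparison, a sketch of the same call pattern using ast.literal_eval from the standard library. This is a different function from the myeval linked above, shown only as an alternative that accepts literals (strings, numbers, tuples, lists, and so on) and rejects anything else. The name parse_tuples is made up.

```python
import ast


def parse_tuples(text):
    """Parse "(...), (...)" into a list of tuples without using eval."""
    # Function calls, names, and attribute access raise ValueError here.
    value = ast.literal_eval("[" + text + "]")
    if not all(isinstance(item, tuple) for item in value):
        raise ValueError("expected a comma-separated sequence of tuples")
    return value


text = ("('foo', 1, 'foobar', (3, 0)), ('foo1', 2, 'foobar1', (3, 1)), "
        "('foo2', 2, 'foobar2', (3, 2))")
data = parse_tuples(text)
assert data[0] == ("foo", 1, "foobar", (3, 0))
```

The ast documentation notes that very large or deeply nested input can still exhaust memory or the stack, so a length limit on the field is still worth having.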

</F>
 
Amit Khemka

thanks for your suggestions :)

cheers,

 
