Running queries on large data structure

Christoph Haas

Hi, list...

I wrote an application in Perl some time ago (I was young and needed
the money) that parses multiple large text files containing nested data
structures and lets the user run quick queries on the data.
(For the firewall admins among you: it's a parser and web-based query tool
for CheckPoint firewall rulebases. The user can search for source and
destination IPs and get the matching rules.)

The current application consists of two parts:

(1) An importing process that reads and parses the large text files and
writes the data in different PostgreSQL tables.
(2) A web (CGI) interface that allows the user to query the collected
data from the PostgreSQL database by different criteria.
(I don't like PostgreSQL much due to the lack of decent tools like
phpMyAdmin. pgAdmin III and phpPgAdmin don't give me the feeling that
I control the database; more the other way round. But PostgreSQL
has a nice 'inet' data type that allows for quick matches in tables
of IP addresses and networks; see the sketch below.)
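
For illustration, this is the kind of quick inet match I mean; the
connection details, table and column names are made up:

    import psycopg2

    conn = psycopg2.connect("dbname=fwrules")  # hypothetical database
    cur = conn.cursor()
    # inet's containment operator <<= turns "is this IP inside that
    # network?" into a single comparison on the database side.
    cur.execute(
        "SELECT rule_id FROM rules WHERE %s::inet <<= source_net",
        ("10.0.0.5",))
    print(cur.fetchall())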

However, the information in the (relational) database is stored in a
horribly artificial way. The SQL query is a 20-line monster with UNIONs
and LEFT JOINs and negations. It's lightning fast (0.5 seconds to search
a set of 500 complex rules) but neither the source code nor the database
is easy to handle anymore. And I'd like to have more flexibility in the
kinds of queries I run. So I'd like to trade some of that speed for
readability and a simpler, more object-oriented data structure.

I'm currently considering different ways to handle this and would like
to get some opinions:

(a) See what sqlalchemy can do for me to handle the object-relational
transformation and basically stay with PostgreSQL.
(b) Parse the input files into one large nested Python data structure,
then write that structure to a file using "marshal" or "repr".
That would give me very clean source code like (see the sketch after
this list):

    for rule in rules:
        for src in rule.sources:
            if searched_for_src == src:
                ...
(c) ...?
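
To make (b) concrete, here is roughly what I have in mind. The rule
layout is made up (plain dicts with strings for now; in reality the
sources would be host/network/group objects):

    import marshal

    # Made-up rule layout, simplified to plain dicts and strings.
    rules = [
        {"sources": ["10.0.0.5", "10.0.0.6"],
         "destinations": ["192.168.42.1"],
         "action": "accept"},
    ]

    # Import step: dump the parsed structure to disk once...
    with open("rules.marshal", "wb") as f:
        marshal.dump(rules, f)

    # ...query step: load it back and walk it in plain Python.
    with open("rules.marshal", "rb") as f:
        rules = marshal.load(f)

    searched_for_src = "10.0.0.5"
    for rule in rules:
        if searched_for_src in rule["sources"]:
            print(rule)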

What makes PostgreSQL less suited is that CheckPoint rule bases can
contain several kinds of complex objects (a matching sketch follows
the list):
- Hosts (easy; they are just a single IP address and can be compared
directly)
- Networks (nearly as easy; just check whether the IP is part of the
network)
- Groups (slightly harder; they can be nested and contain other groups,
hosts, or networks)
- IP ranges (10.0.0.50-10.5.25.100; not so easy to match either)
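
For what it's worth, here is roughly how I picture matching an IP
against those object types in Python. The object layout is invented,
and I'm using the stdlib "ipaddress" module just for illustration:

    import ipaddress

    def matches(obj, ip):
        # obj uses an invented layout; ip is an ipaddress.ip_address().
        kind = obj["type"]
        if kind == "host":       # single address: exact comparison
            return ip == ipaddress.ip_address(obj["addr"])
        if kind == "network":    # membership test in a network
            return ip in ipaddress.ip_network(obj["net"])
        if kind == "range":      # 10.0.0.50-10.5.25.100 style ranges
            return (ipaddress.ip_address(obj["first"]) <= ip
                    <= ipaddress.ip_address(obj["last"]))
        if kind == "group":      # groups nest, so recurse into members
            return any(matches(member, ip) for member in obj["members"])
        return False

    ip = ipaddress.ip_address("10.0.0.5")
    print(matches({"type": "network", "net": "10.0.0.0/24"}, ip))  # True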

I would even like to allow the users more complex queries with multiple
search conditions. A query would be something like "show me all matching
firewall rules where 10.0.0.5 matches the source column and 192.168.42.1
matches the destination column, OR any rule where the group 'internal
hosts' is mentioned in the destination column". That sounds like a job
for a database. But somehow a database is also not flexible enough. And
the data is small enough (probably 1 MB) that it can be read into memory
entirely.
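
To show what I mean, a rough sketch of composable conditions, building
on the matches() function and the in-memory rules list from the
sketches above (all names invented):

    import ipaddress

    # Invented helpers: a rule matches if any object in the column does.
    def src_matches(rule, ip):
        return any(matches(obj, ip) for obj in rule["sources"])

    def dst_matches(rule, ip):
        return any(matches(obj, ip) for obj in rule["destinations"])

    def dst_mentions_group(rule, name):
        return any(obj.get("name") == name for obj in rule["destinations"])

    src = ipaddress.ip_address("10.0.0.5")
    dst = ipaddress.ip_address("192.168.42.1")
    hits = [rule for rule in rules
            if (src_matches(rule, src) and dst_matches(rule, dst))
            or dst_mentions_group(rule, "internal hosts")]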

What do you think would be the right tool for the job? Thanks for sharing
your thoughts.

Christoph
 
