Running queries on large data structure

Christoph Haas

Hi, list...

I wrote an application in Perl some time ago (I was young and needed
the money) that parses multiple large text files containing nested data
structures and lets the user run quick queries on the data.
(For the firewall admins among you: it's a parser and web-based query tool
for CheckPoint firewall rulebases. The user can search for source and
destination IPs and get the matching rules.)

The current application consists of two parts:

(1) An importing process that reads and parses the large text files and
writes the data in different PostgreSQL tables.
(2) A web (CGI) interface that allows the user to query the collected
data from the PostgreSQL database by different criteria.
(I don't like PostgreSQL much due to the lack of decent tools like
phpMyAdmin. pgAdmin III and phpPgAdmin don't give me the feeling that
I control the database; more the other way round. But PostgreSQL
has a nice 'inet' data type that allows for quick matches in tables
of IP addresses and networks; see the sketch below.)
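
For illustration, this is the kind of quick inet match I mean; the
connection details, table and column names are made up:

    import psycopg2

    conn = psycopg2.connect("dbname=fwrules")  # hypothetical database
    cur = conn.cursor()
    # inet's containment operator <<= turns "is this IP inside that
    # network?" into a single comparison on the database side.
    cur.execute(
        "SELECT rule_id FROM rules WHERE %s::inet <<= source_net",
        ("10.0.0.5",))
    print(cur.fetchall())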

However, the information in the (relational) database is stored in a
horribly artificial way. The SQL query is a 20-line monster with UNIONs
and LEFT JOINs and negations. It's lightning fast (0.5 seconds to search
a set of 500 complex rules) but neither the source code nor the database
is easy to handle anymore. And I'd like to have more flexibility in the
kinds of queries I run. So I'd like to trade some of that speed for
readability and a simpler, more object-oriented data structure.

I'm currently considering different ways to handle this and would like
to get some opinions:

(a) See what sqlalchemy can do for me to handle the object-relational
transformation and basically stay with PostgreSQL.
(b) Parse the input files into one large nested Python data structure,
then write that structure to a file using "marshal" or "repr".
That would give me very clean source code like (see the sketch after
this list):

    for rule in rules:
        for src in rule.sources:
            if searched_for_src == src:
                ...
(c) ...?
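
To make (b) concrete, here is roughly what I have in mind. The rule
layout is made up (plain dicts with strings for now; in reality the
sources would be host/network/group objects):

    import marshal

    # Made-up rule layout, simplified to plain dicts and strings.
    rules = [
        {"sources": ["10.0.0.5", "10.0.0.6"],
         "destinations": ["192.168.42.1"],
         "action": "accept"},
    ]

    # Import step: dump the parsed structure to disk once...
    with open("rules.marshal", "wb") as f:
        marshal.dump(rules, f)

    # ...query step: load it back and walk it in plain Python.
    with open("rules.marshal", "rb") as f:
        rules = marshal.load(f)

    searched_for_src = "10.0.0.5"
    for rule in rules:
        if searched_for_src in rule["sources"]:
            print(rule)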

What makes PostgreSQL less suited is that CheckPoint rule bases can
contain several kinds of complex objects (a matching sketch follows
the list):
- Hosts (easy; they are just a single IP address and can be compared
directly)
- Networks (nearly as easy; just check whether the IP is part of the
network)
- Groups (slightly harder; they can be nested and contain other groups,
hosts, or networks)
- IP ranges (10.0.0.50-10.5.25.100; not so easy to match either)
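
For what it's worth, here is roughly how I picture matching an IP
against those object types in Python. The object layout is invented,
and I'm using the stdlib "ipaddress" module just for illustration:

    import ipaddress

    def matches(obj, ip):
        # obj uses an invented layout; ip is an ipaddress.ip_address().
        kind = obj["type"]
        if kind == "host":       # single address: exact comparison
            return ip == ipaddress.ip_address(obj["addr"])
        if kind == "network":    # membership test in a network
            return ip in ipaddress.ip_network(obj["net"])
        if kind == "range":      # 10.0.0.50-10.5.25.100 style ranges
            return (ipaddress.ip_address(obj["first"]) <= ip
                    <= ipaddress.ip_address(obj["last"]))
        if kind == "group":      # groups nest, so recurse into members
            return any(matches(member, ip) for member in obj["members"])
        return False

    ip = ipaddress.ip_address("10.0.0.5")
    print(matches({"type": "network", "net": "10.0.0.0/24"}, ip))  # True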

I would even like to allow the users more complex queries with multiple
search conditions. A query would be something like "show me all matching
firewall rules where 10.0.0.5 matches the source column and 192.168.42.1
matches the destination column, OR any rule where the group 'internal
hosts' is mentioned in the destination column". That sounds like a job
for a database. But somehow a database is also not flexible enough. And
the data is small enough (probably 1 MB) that it can be read into memory
entirely.
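
To show what I mean, a rough sketch of composable conditions, building
on the matches() function and the in-memory rules list from the
sketches above (all names invented):

    import ipaddress

    # Invented helpers: a rule matches if any object in the column does.
    def src_matches(rule, ip):
        return any(matches(obj, ip) for obj in rule["sources"])

    def dst_matches(rule, ip):
        return any(matches(obj, ip) for obj in rule["destinations"])

    def dst_mentions_group(rule, name):
        return any(obj.get("name") == name for obj in rule["destinations"])

    src = ipaddress.ip_address("10.0.0.5")
    dst = ipaddress.ip_address("192.168.42.1")
    hits = [rule for rule in rules
            if (src_matches(rule, src) and dst_matches(rule, dst))
            or dst_mentions_group(rule, "internal hosts")]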

What do you think would be the right tool for the job? Thanks for sharing
your thoughts.

Christoph
 
