My SAX Parser, regexp style. Cut & paste version .901

robic0

Since so much was learned from the substitution method, I thought
this might be a better approach.
This is just the starting framework. The rest will be filled in.
Turn off the debug output (set $debug and $show_pos to 0) for full speed.

If the regexp got line-wrapped in this post, un-wrap it onto one line before using.

print <<EOM;

# -----------------------
# XML (Regex) SAX Parser
# Version .901 - 1/7/06
# Copyright 2005,2006
# by robic0-At-yahoo.com
# -----------------------

EOM

use strict;
use warnings;

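# Use a lexical filehandle (DATA is Perl's special handle for text after
# __END__) and report the OS error if the open fails.
open my $fh, '<', 'config.html' or die "can't open config.html: $!";
my $gabage1 = join ('', <$fh>);
close $fh;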

my ($cnt, $content, $show_pos, $debug) = (1, '', 1, 1);

# master
#/(?:<\?(.*?)\?>)|(?:<META(.*?)>)|(?:<!DOCTYPE(.*?)>)|(?:<!\[CDATA\[(.*?)\]\]>)|(?:<!--(.*?)-->)|(?:<(\/*[\:0-9a-zA-Z]+?[\s]*\/*)>)|(?:<([\:0-9a-zA-Z]+?)[\s]+((?:[\:0-9a-zA-Z]+[\s]*=[\s]*["'][^<]*['"])+[\s]*\/*)>)|(.+?)/sg)
# 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9

while ($gabage1 =~
/(?:<\?(.*?)\?>)|(?:<META(.*?)>)|(?:<!DOCTYPE(.*?)>)|(?:<!\[CDATA\[(.*?)\]\]>)|(?:<!--(.*?)-->)|(?:<(\/*[\:0-9a-zA-Z]+?[\s]*\/*)>)|(?:<([\:0-9a-zA-Z]+?)[\s]+((?:[\:0-9a-zA-Z]+[\s]*=[\s]*["'][^<]*['"])+[\s]*\/*)>)|(.+?)/sg)
# 1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8 9 9
{
    # group 9 is a single character of content: accumulate it and move on
    if (defined $9) { $content .= $9; next; }
    print "-"x20,"\n" if ($debug);
    # a markup token was hit, so flush the content collected before it
    if (length ($content)) {
        print "9 $content\n" if ($debug);
        $content = '';
    }
    if ($show_pos) {
        my $rr = pos $gabage1;
        print "$rr ";
    }
    print "1 VERSION: $1\n" if ($debug && defined $1);
    print "2 META: $2\n" if ($debug && defined $2);
    print "3 DOCTYPE: $3\n" if ($debug && defined $3);
    print "4 CDATA: $4\n" if ($debug && defined $4);
    print "5 COMMENT: $5\n" if ($debug && defined $5);
    ## <tag> or </tag> or <tag/>
    print "6 TAG: $6\n" if ($debug && defined $6);
    ## <tag attrib/> or <tag attrib>
    print "7,8 TAG: $7 Attr: $8\n" if ($debug && defined $7);
    $cnt++;
}
# flush any content that follows the last markup token
print "9 $content\n" if ($debug && length ($content));

__END__
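
For anyone trying the framework above, here is a minimal sketch of the same alternation written with the /x flag, so that line wrapping cannot break it, and wrapped in a sub that returns the tokens instead of printing them. The sub name tokenize and the event labels (PI, TAG, CONTENT and so on) are made up for illustration; they are not part of the posted code.

```perl
use strict;
use warnings;

# The posted alternation, one branch per line. With /x, whitespace and
# comments inside the pattern are ignored, so re-wrapping is harmless.
my $token = qr{
    <\?(.*?)\?>                          # 1: processing instruction
  | <META(.*?)>                          # 2: META
  | <!DOCTYPE(.*?)>                      # 3: DOCTYPE
  | <!\[CDATA\[(.*?)\]\]>                # 4: CDATA section
  | <!--(.*?)-->                         # 5: comment
  | <(/*[\:0-9a-zA-Z]+?\s*/*)>           # 6: tag without attributes
  | <([\:0-9a-zA-Z]+?)\s+                # 7: tag name, followed by
    ((?:[\:0-9a-zA-Z]+\s*=\s*["'][^<]*['"])+\s*/*)>   # 8: attributes
  | (.+?)                                # 9: one character of content
}xs;

# Hypothetical helper: returns a list of "LABEL text" strings.
sub tokenize {
    my ($xml) = @_;
    my ($content, @events) = ('');
    while ($xml =~ /$token/g) {
        if (defined $9) { $content .= $9; next; }
        if (length $content) { push @events, "CONTENT $content"; $content = ''; }
        if    (defined $1) { push @events, "PI $1" }
        elsif (defined $2) { push @events, "META $2" }
        elsif (defined $3) { push @events, "DOCTYPE $3" }
        elsif (defined $4) { push @events, "CDATA $4" }
        elsif (defined $5) { push @events, "COMMENT $5" }
        elsif (defined $6) { push @events, "TAG $6" }
        else               { push @events, "TAG $7 ATTR $8" }
    }
    # content after the last markup token still has to be flushed
    push @events, "CONTENT $content" if length $content;
    return @events;
}

print "$_\n" for tokenize('<?xml version="1.0"?><a href="x.html">hi</a><!--c--><br/>');
```

Returning a list rather than printing makes it easier to check the regexp against small inputs before pointing it at a real file.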
 
robic0

I'm about to finish this thing. It's mostly modeled after Expat.
It's all Perl, and mine is faster, parsing about 1 meg a second.
It's also compliant with the current XML standards on w3.org.
There's so much to it, I don't think I want to post it here.
I would like to make it into a "free" module on CPAN or in ActiveState's
release version.
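
Since the full code is not posted, here is only a rough sketch of what "modeled after Expat" could look like: the parser walks the input and calls user-supplied callbacks. The sub name parse_events is hypothetical. The handler names Start, End and Char are borrowed from the Expat-based XML::Parser module, but the callback signatures here are simplified and are not that module's. It handles elements and text only.

```perl
use strict;
use warnings;

# Hypothetical mini-parser: elements and text only (no comments, CDATA, PIs).
sub parse_events {
    my ($xml, %handler) = @_;
    my $noop  = sub { };
    my $start = $handler{Start} || $noop;
    my $end   = $handler{End}   || $noop;
    my $char  = $handler{Char}  || $noop;

    while ($xml =~ m{<(/?)([\w:]+)([^<>]*?)(/?)>|([^<]+)}g) {
        my ($close, $name, $attrs, $empty, $text) = ($1, $2, $3, $4, $5);
        if (defined $text) { $char->($text); next; }
        if ($close)        { $end->($name);  next; }
        # Simplified attribute split: values may not contain quote characters.
        my %attr = $attrs =~ /([\w:]+)\s*=\s*["']([^"']*)["']/g;
        $start->($name, %attr);
        $end->($name) if $empty;    # <tag/> fires Start and End back to back
    }
    return;
}

# Usage: log every event in document order.
my @log;
parse_events(
    '<doc><p id="1">hi</p><br/></doc>',
    Start => sub {
        my ($name, %attr) = @_;
        my $line = "start $name";
        $line .= " $_=$attr{$_}" for sort keys %attr;
        push @log, $line;
    },
    End  => sub { push @log, "end $_[0]" },
    Char => sub { push @log, "char $_[0]" },
);
print "$_\n" for @log;
```

Keeping the handlers in a hash is one way to "interject" special handling later: a caller overrides only the callbacks it cares about and the rest fall back to a no-op.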

I think it's commercial level. The fact is I can "interject" special
searches and handling if I want to. It is designed using the specs
from here:

http://www.w3.org/TR/xml11/#NT-AttValue

It's version 1.1. If I'm using the wrong specs, please let me know.
It's awesome, tremendously fast.
I am also going to write a full-featured "schema checker" using this
base parser. I've never seen something so easy as schema checking.
Thinking beyond that, I will move into modification tools. Even style sheet
mods (I think; it's all too easy now). I will do it all in markup.
The code is about 600 lines now. I could plop it down here. I have
all constructs covered in the above 1.1 specs. I'm worried a little
about encoding and Unicode. By and large, I've never seen anything
so easy in my life. I fear that my code is approaching a professional
level and I may "not" want to just plop it down here.
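
For reference, the AttValue production linked above reads `'"' ([^<&"] | Reference)* '"' | "'" ([^<&'] | Reference)* "'"`. A minimal sketch of it as a Perl regexp follows, assuming an ASCII-only approximation of Name (the real production allows many more characters). The variable and sub names are made up for illustration. It is stricter than the `["'][^<]*['"]` piece in the framework posted earlier, which accepts mismatched quotes and a bare `&`.

```perl
use strict;
use warnings;

# Reference ::= EntityRef | CharRef  (Name simplified to ASCII: an assumption)
my $reference = qr/&(?:[A-Za-z_:][-A-Za-z0-9_:.]*|#[0-9]+|#x[0-9a-fA-F]+);/;

# AttValue ::= '"' ([^<&"] | Reference)* '"' | "'" ([^<&'] | Reference)* "'"
my $att_value = qr/"(?:[^<&"]|$reference)*"|'(?:[^<&']|$reference)*'/;

# Hypothetical helper: true if the whole string is one quoted attribute value.
sub is_att_value {
    my ($literal) = @_;
    return $literal =~ /\A(?:$att_value)\z/ ? 1 : 0;
}

for my $try ('"a &amp; b"', q{'say "hi"'}, '"a & b"', '"a<b"') {
    printf "%-12s %s\n", $try, is_att_value($try) ? 'ok' : 'rejected';
}
```

The first two samples are accepted; the last two are rejected because of the bare `&` and the `<`.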

I may want to contact AS or CPAN to post the module so it's not ripped
off. However, I know I could do a schema checker in a week. Since it's
all so easy now, I'm wondering if I can make any money at this or is it
all just a give-away...

Oh well, from a homeless man to a middle class man, I know it won't be
that much. However, I have developed tools that could do conversions.
Yeah, sure, I want to put my stuff in the public domain, but what I do
with the internals could make for fast custom conversions.

What do you think? Say it now; if it ends up in AS or CPAN you won't have
the option to recommend. It will arrive there, but what's the money behind
hard-core conversions, style, schema, filters, anything?
 
