regex that works on rubular.com but not in my program

Andreas Hansen

Hi,

I'm having some trouble with a regex.
It works on rubular.com but not in my program.
I've used the content of testfile.txt on rubular.com.

The regex finds an IP address, a flag and a username in a TCP packet (an
example: http://rubular.com/regexes/8389)

regex =
/(?:[P]\s)((?:[0-9A-Fa-f]{1,4}:){7}[0-9A-Fa-f]{1,4}|(?:\d{1,3}\.){3}\d{1,3}).{0,}(?:[:][\s])([P])(?:.{0,})$\s(?:^[E].{5}[@].{9}[Q])(?:.{30,42}\.)\W{0,}([\w\d]{1}(?:[\w\d-]+.){1,13}[\wA-Z0-9])/i

filename = 'testfile.txt'
file = File.open(filename).collect
j = file.length
i = 0
while i< j

a = file.to_s

b = a.scan(regex)
print b.length

i = i + 1
end
 
Robert Klemme

2009/6/25 Andreas Hansen said:
i have some trouble with a regex.
it works on rubular.com but not in my program
ive used the content in testfile.txt on rubular.com

the regex finds a ip-address, a flag and a username in a TCP-packet(an
example: http://rubular.com/regexes/8389)

regex =
/(?:[P]\s)((?:[0-9A-Fa-f]{1,4}:){7}[0-9A-Fa-f]{1,4}|(?:\d{1,3}\.){3}\d{1,3}).{0,}(?:[:][\s])([P])(?:.{0,})$\s(?:^[E].{5}[@].{9}[Q])(?:.{30,42}\.)\W{0,}([\w\d]{1}(?:[\w\d-]+.){1,13}[\wA-Z0-9])/i

Ugh! This is completely unreadable. How about using switch /x and
embedding some comments? Constructing the large regexp from a few
smaller expressions might also help. For example, you could use
/[a-f0-9]/i for hex digits.
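
A minimal sketch of that suggestion, covering only the address and flag part of the pattern. The constant names (HEX_DIGIT, IPV4, IPV6, ADDRESS_AND_FLAG) are made up for illustration, and the header line is the one quoted later in this thread:

```ruby
# Hypothetical building blocks; the names are illustrative only.
HEX_DIGIT = /[a-f0-9]/i
IPV4      = /(?:\d{1,3}\.){3}\d{1,3}/
IPV6      = /(?:#{HEX_DIGIT}{1,4}:){7}#{HEX_DIGIT}{1,4}/

# With /x, whitespace and "#" comments inside the pattern are ignored.
# An interpolated Regexp object keeps its own flags (HEX_DIGIT stays /i).
ADDRESS_AND_FLAG = /
  IP \s
  (#{IPV6}|#{IPV4})   # capture 1: source address
  .*                  # rest of the header up to the flag
  : \s ([PSF])        # capture 2: flag
/x

header = "12:23:59.378678 IP 85.225.108.54.54707 > 81.227.132.223.6112: P " \
         "590518027:590518071(44) ack 2582330461 win 64240"

m = ADDRESS_AND_FLAG.match(header)
p m[1]   # "85.225.108.54"
p m[2]   # "P"
```

Each small piece can be tried on its own in irb before it goes into the larger expression.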

A few notes from glancing over this:

[\wA-Z0-9] -> \w
\w already includes letters and digits

[\s] -> \s
[P] -> P
etc.

[\w\d]{1} -> \w
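
These equivalences can be checked mechanically. A small sketch, assuming it is enough to compare the patterns on every single ASCII character; same_matches? is a hypothetical helper, not part of any library:

```ruby
# Hypothetical helper: true when both patterns accept exactly the same
# single ASCII characters.
def same_matches?(a, b)
  (0..127).all? do |code|
    c = code.chr
    a.match?(c) == b.match?(c)
  end
end

p same_matches?(/[\wA-Z0-9]/, /\w/)   # true
p same_matches?(/[\s]/, /\s/)         # true
p same_matches?(/[P]/, /P/)           # true
p same_matches?(/[\w\d]{1}/, /\w/)    # true
```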

filename = 'testfile.txt'
file = File.open(filename).collect

Not closing the file descriptor properly...
j = file.length
i = 0
while i< j

 a = file.to_s

 b = a.scan(regex)
 print b.length


You are not printing a newline here. Are you maybe missing the print output?
 i = i + 1
end

You can greatly simplify your code to

File.foreach filename do |line|
b = line.scan rx
puts b.length
end
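
One caveat worth checking: File.foreach yields one line at a time, so a pattern that has to cross a line break (like the `$\s^E` part of the original) can never match inside the block. A sketch of the difference, using a made-up two-line sample in place of testfile.txt and a simplified, hypothetical pattern:

```ruby
require "tempfile"

# Made-up two-line sample standing in for testfile.txt.
SAMPLE = "12:23:59 IP 1.2.3.4.80 > 5.6.7.8.6112: P 1:2(44)\nE....payload....\n"

# Simplified, hypothetical pattern that has to cross the line break.
TWO_LINE_RX = /:\s([PSF])\s.*\n^E/

# Scan each line separately and add up the number of matches.
def count_per_line(path, rx)
  File.foreach(path).sum { |line| line.scan(rx).length }
end

# Read the whole file into one string and scan that.
def count_whole_file(path, rx)
  File.read(path).scan(rx).length
end

per_line, whole = Tempfile.create("packets") do |f|
  f.write(SAMPLE)
  f.flush
  [count_per_line(f.path, TWO_LINE_RX), count_whole_file(f.path, TWO_LINE_RX)]
end

puts "per line: #{per_line}, whole file: #{whole}"   # per line: 0, whole file: 1
```

If the pattern only ever needs one line, the File.foreach version is the simpler choice.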

Kind regards

robert
 
Brian Candler

Andreas said:
i have some trouble with a regex.
it works on rubular.com but not in my program

What regexp language is rubular.com using? Perl, ruby 1.8, ruby 1.9,
other?

regex =
/(?:[P]\s)((?:[0-9A-Fa-f]{1,4}:){7}[0-9A-Fa-f]{1,4}|(?:\d{1,3}\.){3}\d{1,3}).{0,}(?:[:][\s])([P])(?:.{0,})$\s(?:^[E].{5}[@].{9}[Q])(?:.{30,42}\.)\W{0,}([\w\d]{1}(?:[\w\d-]+.){1,13}[\wA-Z0-9])/i


This regexp must be machine-written, since it's absolutely horrible.
You'd never write it that way by hand. For example:

(?:[P]\s) => is just the same as => IP\s

(?:.{0,})$ => is just the same as => .*$

[\wA-Z0-9] => is just the same as => \w

I suggest you write your regexp by hand, one bit at a time. This is easy
in IRB as you can develop your regexp interactively.

irb(main):008:0> src
=> "12:23:59.378678 IP 85.225.108.54.54707 > 81.227.132.223.6112: P
590518027:590518071(44) ack 2582330461 win
64240\[email protected].......#2....<]P.........,................wakko0..........@......"
irb(main):009:0> /IP / =~ src
=> 16
irb(main):010:0> /IP ([0-9a-fA-F.]+)/ =~ src
=> 16
irb(main):011:0> $1
=> "85.225.108.54.54707"
irb(main):012:0> /IP ([0-9a-fA-F.]+)\.\d+/ =~ src
=> 16
irb(main):013:0> $1
=> "85.225.108.54"

This one isn't quite as sophisticated for IP address matching as the one
rubular gave you, but it's not necessary here. If you really want
stronger matching of IPv4 and IPv6 literals, you can do so if you wish.
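
One way to get the stricter check without a more complicated regex is to capture loosely and then validate the result with the standard library's IPAddr. A sketch under that assumption; source_address is a hypothetical helper name:

```ruby
require "ipaddr"

# Hypothetical helper: returns the source address from a tcpdump-style
# header line, or nil when nothing matches or the captured text is not a
# valid address literal.
def source_address(line)
  loose = line[/IP ([0-9a-fA-F.]+)\.\d+ >/, 1]   # loose capture, port dropped
  return nil unless loose
  IPAddr.new(loose).to_s                         # raises on invalid literals
rescue IPAddr::InvalidAddressError
  nil
end

line = "12:23:59.378678 IP 85.225.108.54.54707 > 81.227.132.223.6112: P"
p source_address(line)                    # "85.225.108.54"
p source_address("IP 999.1.1.1.80 > x")   # nil
```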
 
Andreas Hansen

It's the first time my friend and I have worked with regexes. My friend has
rewritten the regex a bit; maybe it makes more sense now:

(?#start: fetch the ip adress after IP)
IP\s((?:\d*\.){3}\d{1,3})
(?#end: fetch ip)

(?#start: flag).*:\s([PSF])(?#end:flag) (?#nothing more interesting on
this line)
.*$(?#end)

(?#start: look for index pattern)
\s^E.{5}@.{9}Q.{30,42}
(?#end: index)

(?#start: get the username which is surrounded by multiple dots, minimum
of 2 in the begining and 0+ after)
.*\.{2,}([\w](?:\w+.){1,13}\w)\.*
(?#end: username)

Each of those expressions works individually and together (in rubular),
but when I combine them in my program it prints nothing, not even nil.
So I tried them individually in the program as well, and all but the
index pattern (which prints nothing) work. If anyone could offer some
insight into why it's not working, or knows a better way to do this, we'll
be very happy :)

another thing:
some usernames are really hard to extract from the packets. an example:
G-eX.Dowden(http://rubular.com/regexes/8401)
any suggestions?


File.foreach filename do |line|
b = line.scan rx
puts b.length
end

was a nice solution, thank you:)
 
Brian Candler

Andreas said:
(?#start: flag).*:\s([PSF])(?#end:flag) (?#nothing more interesting on
this line)
.*$(?#end)

(?#start: look for index pattern)
\s^E.{5}@.{9}Q.{30,42}
(?#end: index)

You are looking for an end-of-line ($), followed by whitespace (\s),
followed by a start of line (^). This doesn't look right to me. It might
work sometimes, depending on whether your end-of-line is \n or \r\n.

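
A small sketch of the line-ending issue, using made-up strings. It assumes nothing in front of `$` can absorb the `\r`; a preceding `.*` would, since `.` matches `\r`:

```ruby
# $ then \s then ^ : only works when exactly one whitespace character
# (the \n) sits between the two lines.
strict = /P$\s^E/

unix    = "...: P\nE...."     # \n line ending
windows = "...: P\r\nE...."   # \r\n line ending

p strict =~ unix      # 5   ($ sits before \n, \s consumes it)
p strict =~ windows   # nil ($ cannot sit before \r)

# Spelling out the line break accepts both conventions.
tolerant = /P\r?\n^E/
p tolerant =~ unix      # 5
p tolerant =~ windows   # 5
```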
(?#start: get the username which is surrounded by multiple dots, minimum
of 2 in the begining and 0+ after)
.*\.{2,}([\w](?:\w+.){1,13}\w)\.*

That one makes little sense.

[\w] is the same as \w

(?:\w+.) means one or more word characters followed by any character;
this is then repeated between 1 and 13 times
\w must be followed by a word character

\.* this is superfluous, since it matches 0 or more dots,
it would therefore match regardless of what is next
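
Both points can be seen on a made-up payload string: the unescaped `.` lets the capture run across spaces, and dropping the trailing `\.*` does not change what is captured.

```ruby
with_tail    = /\.{2,}(\w(?:\w+.){1,13}\w)\.*/
without_tail = /\.{2,}(\w(?:\w+.){1,13}\w)/

# Made-up payload: a username followed by unrelated text.
src = "....wakko0 and more text...."

p with_tail.match(src)[1]      # "wakko0 and more text"
p without_tail.match(src)[1]   # "wakko0 and more text"

# A plain \w+ stops at the first non-word character instead.
p src[/\.{2,}(\w+)/, 1]        # "wakko0"
```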
Each of those expressions works individually and together(in rubular)

Don't test them in rubular. Test them in irb or in ruby.
another thing:
some usernames are really hard to extract from the packets. an example:
G-eX.Dowden(http://rubular.com/regexes/8401)
any suggestions?

You're using the wrong way to view the packets in the first place.

Using a ruby interface to libpcap would be the safest way - I think I
saw one, but I've never used it.

Otherwise, look at tcpdump -X for a proper hex packet dump.

Brian.
 
Brian Candler

Brian said:
Don't test them in rubular. Test them in irb or in ruby.

In a ruby file, you can comment out bits of them until you make it work,
e.g.

re = %r{
(?#start: fetch the ip adress after IP)
IP\s((?:\d+\.){3}\d{1,3})
(?#end: fetch ip)
}x

#(?#start: flag).*:\s([PSF])(?#end:flag)
#(?#nothing more interesting on this line)
#.*
#
#(?#start: look for index pattern)
#^E.{5}@.{9}Q.{30,}
#(?#end: index)
#
#(?#start: get the username which is surrounded by multiple dots,
#minimum of 2 in the begining and 0+ after)
#\.{2,}(\w+)
#(?#end: username)
#}x

src = "12:23:59.378678 IP 85.225.108.54.54707 > 81.227.132.223.6112: P
590518027:590518071(44) ack 2582330461 win
64240\[email protected].......#2....<]P.........,................wakko0..........@......"

p re =~ src
p $~.to_a

Then you move the }x end of the regular expression and start
uncommenting further bits until it starts to fail again, then you know
where the problem is.
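
The same idea can be automated by keeping the fragments in a list and testing ever longer prefixes. This is only a sketch: PARTS and first_failing_part are hypothetical names, and src is a made-up stand-in shaped to fit the fragments, not the real packet.

```ruby
# Regex fragments in order; every prefix of this list is a valid pattern.
PARTS = [
  'IP\s((?:\d+\.){3}\d{1,3})',   # source address
  '.*:\s([PSF])',                # flag
  '.*',                          # rest of the header line
  '^E.{5}@.{9}Q.{30,}',          # index pattern
  '\.{2,}(\w+)'                  # username
]

# Returns the index of the first fragment that makes the match fail,
# or nil when the complete pattern matches.
def first_failing_part(parts, src, options = Regexp::EXTENDED)
  (1..parts.length).each do |n|
    re = Regexp.new(parts.first(n).join("\n"), options)
    return n - 1 unless re.match?(src)
  end
  nil
end

# Made-up stand-in for a captured packet (the real payload bytes differ).
src = "12:23:59 IP 85.225.108.54.54707 > 81.227.132.223.6112: P 1:2(44) win 64240\n" \
      "E.....@.........Q" + "." * 35 + "wakko0......"

p first_failing_part(PARTS, src)                                        # 3
p first_failing_part(PARTS, src, Regexp::EXTENDED | Regexp::MULTILINE)  # nil
```

With these inputs the first run points at the index pattern, and adding the /m option makes the whole pattern match.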
 
Brian Candler

Note also that ^ and $ don't consume characters, and . doesn't match
newlines without the /m flag.

irb(main):009:0> "abc\ndef" =~ /^a.*^d/
=> nil
irb(main):010:0> "abc\ndef" =~ /^a.*^d/m
=> 0
irb(main):011:0> "abc\ndef" =~ /^a.*\r?\n^d/
=> 0

re = %r{
(?#start: fetch the ip adress after IP)
IP\s((?:\d+\.){3}\d{1,3})
(?#end: fetch ip)

(?#start: flag).*:\s([PSF])(?#end:flag)

(?#nothing more interesting on this line)
.*

(?#start: look for index pattern)
^E.{5}@.{8}Q.{30,}
(?#end: index)

(?#start: get the username which is surrounded by multiple dots, minimum
of 2 in the begining and 0+ after)
\.{2,}(\w+)
(?#end: username)
}xm

src = "12:23:59.378678 IP 85.225.108.54.54707 > 81.227.132.223.6112: P
590518027:590518071(44) ack 2582330461 win
64240\[email protected].......#2....<]P.........,................wakko0..........@......"

p re =~ src
p $~.to_a
 
brabuhr

its the first time me or my friend has worked with regex, my friend have
rewritten the regex a bit, maybe it makes more sense now ...

another thing:
some usernames are really hard to extract from the packets. an example:
G-eX.Dowden
cat /tmp/z
p(("12:23:59.378678 IP 85.225.108.54.54707 > 81.227.132.223.6112: P " +
"5 90518027:590518071(44) ack 2582330461 win 64240\[email protected]." +
"l6Q .......#2....<]P.........,................wakko0..........@." +
".....\n" +
"12:23:59.378678 IP 85.225.108.55.54707 > 81.227.132.223.6112: P " +
"5 90518027:590518071(44) ack 2582330461 win 64240\[email protected]." +
"l6Q .......#2....<]P.........,................wa-kk.o0.........." +
"@......"
).scan(
%r{
# capture the address after "IP"
IP\s((?:\d{1,3}\.){3}\d{1,3})\.

.+? # skip (non-greedy)

# capture the flag
:\s([PSF])\s\d

.+? # skip (non-greedy)
^E.{5}@.{8}Q.{30} # skip the index pattern
.+? # skip (non-greedy)

# capture the username surrounded by dots: 2+ before, 0+ after
\.{2,}(\w[-\w\.]+\w)\.?
}mx # m: "make dot match newlines"
)
)
ruby /tmp/z
[["85.225.108.54", "P", "wakko0"], ["85.225.108.55", "P", "wa-kk.o0"]]
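
To isolate why the username part now copes with names like G-eX.Dowden, here is a sketch that compares a plain `\w+` capture with the character class used above, on a made-up payload tail:

```ruby
# Made-up payload tail containing the awkward username.
payload = "................G-eX.Dowden..........@......"

word_only   = /\.{2,}(\w+)/           # stops at the first "-" or "."
with_hyphen = /\.{2,}(\w[-\w.]+\w)/   # allows "-" and "." inside the name

p payload[word_only, 1]     # "G"
p payload[with_hyphen, 1]   # "G-eX.Dowden"
```

Because dots are also the padding around the name, this still cannot tell a username that itself starts or ends with a dot from the padding.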
 
