Suggestion for string parsing

Me Me

Hi all,
I would like to know if there's a better way to parse a string and
assign values to variables.

Ex:

Client=MPEG-4,390000,700000,24000

I can do

line =~ /(\w*)=([0-9A-Za-z -.:]*),([0-9]*),([0-9]*),([0-9]*)/

and

var1 = $1
var2 = $2
var3 = $3
var4 = $4
var5 = $5

But I'm sure there's a better way, even considering that the number of
parameters can increase and I don't want to write a long regular
expression rule, that is hard to read.

Thanks a lot for any tips
 
Peña, Botp

From: Me Me
# var1 = $1
# var2 = $2
# var3 = $3
# var4 = $4
# var5 = $5

hint: array

eg,

> line
=> "Client=MPEG-4,390000,700000,24000"

> re
=> /(\w*?)=([0-9A-Za-z -.:]*?),(\d*?),(\d*?),(\d*)/

> line.match(re).captures
=> ["Client", "MPEG-4", "390000", "700000", "24000"]

also,

> x,y,z=[1,2,3]
=> [1, 2, 3]

> x
=> 1

> z
=> 3
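
The two hints combine naturally: because `captures` returns an array, one multiple assignment can name each field, and a splat can collect however many numeric fields follow. A minimal sketch, assuming the first two fields are a key and a format and everything after them is an integer. The names `key`, `fmt`, and `numbers` are illustrative only, and the hyphen in the character class is escaped here so that it is taken literally:

```ruby
line = "Client=MPEG-4,390000,700000,24000"
re   = /(\w*?)=([0-9A-Za-z \-.:]*?),(\d*?),(\d*?),(\d*)/

# captures is an Array, so it can be destructured in one assignment;
# the splat (*) gathers all remaining elements.
key, fmt, *numbers = line.match(re).captures
numbers = numbers.map { |n| Integer(n) }

p key, fmt, numbers
```

If more numeric fields are added later, only the pattern needs to grow; the assignment line stays the same.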
 
Chris Lowis

> But I'm sure there's a better way, even considering that the number of
> parameters can increase and I don't want to write a long regular
> expression rule, that is hard to read.

Are the parameters always delimited by commas? If so, you could
simplify the regular expression:

line =~/(\w*)=(.*)/

Then

$2 #=> "MPEG-4,390000,700000,24000"
$2.split(",") #=> ["MPEG-4", "390000", "700000", "24000"]

This gives you the values after the '=' sign as an array. For more
power you could pass this sub-string to a CSV parsing library such as
FasterCSV.

Chris
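
A minimal sketch of the two-step idea above, assuming the key never contains `=` and every value after the first is an integer. The method name `parse_line` is hypothetical, and the sketch swaps the regexp for `String#partition`; it uses plain `split`, so quoted or escaped commas (which a CSV library would handle) are not supported:

```ruby
# Hypothetical helper: split "key=name,n1,n2,..." into its parts.
def parse_line(line)
  key, sep, rest = line.partition("=")
  return nil if sep.empty?          # no '=' in the line

  name, *numbers = rest.split(",")
  [key, name, numbers.map(&:to_i)]
end

p parse_line("Client=MPEG-4,390000,700000,24000")
```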
 
Me Me

Thanks for answering,
I was thinking if there is some kind of C sscanf,
so that I could parse and assign to variables at the same time

so if I have

line="Client=MPEG-4,390000,700000,24000"

something like:
sscanf(line, %s=%s %s %d %d %d, val1, val2, val3, val4, val5, val6)

I don't know if there's a similar string function for this in Ruby

thanks
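
One feature that comes close to "parse and assign at the same time": when a regexp literal with named groups appears on the left-hand side of `=~`, Ruby (1.9 and later) assigns each named group to a local variable of the same name. A minimal sketch; the method name `describe_line` and the group names `key`, `codec`, `a`, `b`, `c` are chosen only for illustration:

```ruby
def describe_line(line)
  # The regexp literal must be on the left of =~ (and contain no
  # interpolation) for the named groups to become local variables.
  if /\A(?<key>\w+)=(?<codec>[^,]+),(?<a>\d+),(?<b>\d+),(?<c>\d+)\z/ =~ line
    [key, codec, a.to_i, b.to_i, c.to_i]
  end
end

p describe_line("Client=MPEG-4,390000,700000,24000")
```

The method returns `nil` when the line does not match, since the `if` body is never evaluated.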
 
Peña, Botp

From: Me Me
# I was thinking if there is some kind of C sscanf,
# so that I could parse and assign to variables at the same time
# so if I have
# line="Client=MPEG-4,390000,700000,24000"
# something like:
# sscanf(line, %s=%s %s %d %d %d, val1, val2, val3, val4, val5, val6)
# I don't know if there's a similar string function for this in Ruby

you are right on scanf.
there is one in ruby, and it's a lot simpler than you think

you'll have to require it though before using,

eg,

> require 'scanf'
=> false

> line.scanf("%6s=%6s,%d,%d,%d,%d")
=> ["Client", "MPEG-4", 390000, 700000, 24000]
 
Me Me

line.scanf("%6s=%6s,%d,%d,%d,%d")
=> ["Client", "MPEG-4", 390000, 700000, 24000]

Thanks
the problem I have now is that the size of the string is not fixed to 6
chars.
And if I try to parse like:
line.scanf("%s=%s,%d,%d,%d,%d")
It doesn't parse the string.

Is there a way to parse any string?
thanks again
 
Brian Candler

> is there a way to use the scanf to parse a string not knowing how many
> chars?

I'd still use Regexp.

line="Client=MPEG-4,390000,700000,24000"
val1,val2,val3,val4,val5 =
/^(\w*)=([^,]*),(\d*),(\d*),(\d*)/.match(line).captures

Another way:

def handle_line(v1,v2,v3,v4,v5)
  puts "I got it! #{v1} etc"
end
...
if /^(\w*)=([^,]*),(\d*),(\d*),(\d*)/ =~ line
  handle_line(*$~.captures)
end
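
One caveat with the first form: `match` returns `nil` when the line does not fit the pattern, so calling `captures` on the result raises `NoMethodError`. A small sketch of one way to guard against that, using the safe-navigation operator (Ruby 2.3 or later). The names `FIELD_PATTERN` and `fields` are made up for this example:

```ruby
FIELD_PATTERN = /^(\w*)=([^,]*),(\d*),(\d*),(\d*)/

# Returns the five captured strings, or nil if the line does not match.
def fields(line)
  FIELD_PATTERN.match(line)&.captures
end

val1, val2, val3, val4, val5 = fields("Client=MPEG-4,390000,700000,24000")
p [val1, val2, val3, val4, val5]

p fields("not a valid line")   # nil instead of an exception
```

Assigning `nil` to several variables at once simply sets them all to `nil`, so the caller can test any one of them.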
 
Me Me

Brian said:
> > is there a way to use the scanf to parse a string not knowing how many
> > chars?
>
> I'd still use Regexp.
>
> line="Client=MPEG-4,390000,700000,24000"
> val1,val2,val3,val4,val5 =
> /^(\w*)=([^,]*),(\d*),(\d*),(\d*)/.match(line).captures
>
> Another way:
>
> def handle_line(v1,v2,v3,v4,v5)
>   puts "I got it! #{v1} etc"
> end
> ...
> if /^(\w*)=([^,]*),(\d*),(\d*),(\d*)/ =~ line
>   handle_line(*$~.captures)
> end

thanks,
but I would like to avoid regexps; it seems strange to me that
there's no way to parse a string by providing its structure.
scanf would be great, but if I put %s it doesn't get the string unless I
also give the number of chars.
 
Brian Candler

> thanks,
> but I would like to avoid regexps; it seems strange to me that
> there's no way to parse a string by providing its structure.
> scanf would be great, but if I put %s it doesn't get the string unless I
> also give the number of chars.

%s is terminated by whitespace. You have no way of telling scanf that
you want to treat "=" (after the first field) and "," (after the second
field) as separators, rather than characters to be consumed by %s.

Well, as long as your data doesn't contain spaces, you could do

line="Client=MPEG-4,390000,700000,24000"
line.gsub(/[=,]/,' ').scanf("%s %s %d %d %d")
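
The same trick works without `scanf`, which is worth knowing because `scanf` was dropped from the standard library in Ruby 2.7 and now has to be installed as a separate gem. A minimal sketch using only core methods, under the same assumption that the data contains no spaces; the variable names are illustrative:

```ruby
line = "Client=MPEG-4,390000,700000,24000"

# Turn both separators into spaces, then split on whitespace.
key, codec, *rest = line.tr("=,", "  ").split
numbers = rest.map { |n| Integer(n, 10) }

p [key, codec, *numbers]
```

`Integer(n, 10)` raises `ArgumentError` on anything that is not a decimal number, which gives a little of the validation that `%d` would have provided.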
 
Lloyd Linklater

Me said:
> thanks,
> but I would like to avoid regexps; it seems strange to me that
> there's no way to parse a string by providing its structure.

Well, you can always write a BreakApart() algorithm but I must agree
with Brian that RegEx is the way to go. After all, that is what RegEx
does. I was tempted to add BreakApart() code here but I am neither sure
that it is what you really want nor that it is the best solution for the
problem at hand.

What is the *actual* problem? If it is what you said ("I would like to
know if there's a better way to parse a string and assign values to
variables;") then RegEx is a fine solution. If you reject a good
solution and seek something else, then it can only be that you are
actually seeking a solution to a different problem. So, what are you
*really* looking for?
 
Me Me

> What is the *actual* problem? If it is what you said ("I would like to
> know if there's a better way to parse a string and assign values to
> variables;") then RegEx is a fine solution. If you reject a good
> solution and seek something else, then it can only be that you are
> actually seeking a solution to a different problem. So, what are you
> *really* looking for?

I'm quite new to Ruby and I can understand that there are better ways to
do things. What I would like to avoid is writing something like this
(that works):

line =~ /(\w*)=([0-9A-Za-z -.:]*),([0-9A-Za-z -.:]*),([0-9A-Za-z -.:]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*)/

I wanted to do this in one line but in a more maintainable way; otherwise I
could always parse the string char by char.
 
James Gray

> > What is the *actual* problem? If it is what you said ("I would like to
> > know if there's a better way to parse a string and assign values to
> > variables;") then RegEx is a fine solution. If you reject a good
> > solution and seek something else, then it can only be that you are
> > actually seeking a solution to a different problem. So, what are you
> > *really* looking for?
>
> I'm quite new to Ruby and I can understand that there are better ways to
> do things. What I would like to avoid is writing something like this
> (that works):
>
> line =~ /(\w*)=([0-9A-Za-z -.:]*),([0-9A-Za-z -.:]*),([0-9A-Za-z -.:]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*)/

I believe there's a bug in your regex. I assume you don't really mean
all characters between space and period in the second character class,
especially since that includes a comma.

> I wanted to do this in one line but in a more maintainable way; otherwise I
> could always parse the string char by char.

I would probably do it in two steps. Match the bit before and after
the equal sign in one, then split() the after bit on commas:

#!/usr/bin/env ruby -wKU

if "Client=MPEG-4,390000,700000,24000" =~ /\A([^=]+)=([^=]+)\z/
  p [$1, *$2.split(",")]
end

__END__

Here's another idea using StringScanner:

#!/usr/bin/env ruby -wKU

require "strscan"

class SimpleParser
  def initialize(data)
    s = StringScanner.new(data)
    @values = [ ]

    @values << s.matched if s.scan(/\w+/)
    @values << s.matched[1..-1] if s.scan(/=[0-9A-Za-z \-.:]+/)
    @values << s.matched[1..-1] while s.scan(/,[0-9]+/)
  end

  attr_reader :values
end

p SimpleParser.new("Client=MPEG-4,390000,700000,24000").values

__END__

Hope that gives you some fresh ideas.

James Edward Gray II
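
The character-class issue mentioned above is easy to check: inside brackets, ` -.` is a range from space (0x20) to period (0x2E), and the comma (0x2C) falls inside it. A short sketch comparing the original class with one where the hyphen is escaped; the constant names are invented for the example:

```ruby
RANGE_CLASS   = /\A[0-9A-Za-z -.:]*\z/   # ' -.' is a range: space..period
ESCAPED_CLASS = /\A[0-9A-Za-z \-.:]*\z/  # literal space, hyphen, period

sample = "MPEG-4,390000"

p RANGE_CLASS.match?(sample)    # the comma is accepted by the range
p ESCAPED_CLASS.match?(sample)  # the comma is rejected
```

This is why the StringScanner version above writes the class as `[0-9A-Za-z \-.:]`.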
 
Brian Candler

> what I would like to avoid is writing something like this
> (that works):
>
> line =~ /(\w*)=([0-9A-Za-z -.:]*),([0-9A-Za-z -.:]*),([0-9A-Za-z -.:]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*)/
>
> I wanted to do this in one line but in a more maintainable way; otherwise I
> could always parse the string char by char.

If you don't actually need to match the data against a pattern, then
just use

line.split(',')

If you only want to proceed if the line is "valid", then write a
suitable regexp pattern to validate it. There are plenty of shortcuts.
For example, \d is the same as [0-9]. {n} means repeat the preceding
element exactly n times. So:

case line
when /^(\w*)=([^,]*),(\d+(,\d+){9})$/
  key1 = $1
  key2 = $2
  numbers = $3.split(/,/).collect { |n| n.to_i }
  # or, if you prefer (the commas must appear in the format string):
  # numbers = $3.scanf("%d,%d,%d,%d,%d,%d,%d,%d,%d,%d")
else
  puts "Invalid line!"
end

That matches word=string,n,n,n,n,n,n,n,n,n,n

Furthermore you can substitute patterns you use repeatedly:

WORD = "[0-9A-Za-z .:-]*"   # hyphen placed last so it is literal, not a range
...
when /^(#{WORD})=(#{WORD}),(#{WORD}),(\d+(,\d+){9})$/o

(//o means that the regexp is built only once, the substitutions aren't
done every time round)

You can also use extended syntax to make the RE more maintainable:

VALID_LINE = %r{ ^
(\w*) = # key ($1)
(#{WORD}), # format ($2)
(\d+), # size ($3)
(\d+) # sample rate ($4)
$ }x

if VALID_LINE =~ line
..
end

You can also do groupings which *don't* capture data using (?: .. )

Compact enough?
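
Putting a few of these shortcuts together: a sketch of an extended-syntax pattern that uses a non-capturing group for the repeated numbers, so that only the three interesting pieces become captures. The constant `LINE_RE` and the assumption of one or more trailing integers are illustrative, not taken from the real data format:

```ruby
LINE_RE = %r{
  \A
  (\w+) =                 # key    ($1)
  ([^,]+) ,               # format ($2)
  ( \d+ (?: , \d+ )* )    # all numbers as one string ($3)
  \z
}x

if LINE_RE =~ "Client=MPEG-4,390000,700000,24000"
  key, codec = $1, $2
  numbers = $3.split(",").map(&:to_i)
  p [key, codec, numbers]
end
```

With a capturing group in place of `(?: ... )`, the last repeated number would occupy an extra capture slot that nothing uses.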
 
Lloyd Linklater

Me said:
> > What is the *actual* problem? If it is what you said ("I would like to
> > know if there's a better way to parse a string and assign values to
> > variables;") then RegEx is a fine solution. If you reject a good
> > solution and seek something else, then it can only be that you are
> > actually seeking a solution to a different problem. So, what are you
> > *really* looking for?
>
> I'm quite new to Ruby and I can understand that there are better ways to
> do things. What I would like to avoid is writing something like this
> (that works):
>
> line =~ /(\w*)=([0-9A-Za-z -.:]*),([0-9A-Za-z -.:]*),([0-9A-Za-z -.:]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*),([0-9]*)/
>
> I wanted to do this in one line but in a more maintainable way; otherwise I
> could always parse the string char by char.

AHA! I understand, or I at least flatter myself that I do. How about
this:

require 'scanf'

s = "Client=MPEG-4,390000,700000,24000,9452349,234583475,2452345"
val = s.scanf("%6s=%s")
vals = val[1].split(",")
p vals

=> ["MPEG-4", "390000", "700000", "24000", "9452349", "234583475",
"2452345"]
 
William James

> Hi all,
> I would like to know if there's a better way to parse a string and
> assign values to variables.
>
> Ex:
>
> Client=MPEG-4,390000,700000,24000
>
> I can do
>
> line =~ /(\w*)=([0-9A-Za-z -.:]*),([0-9]*),([0-9]*),([0-9]*)/
>
> and
>
> var1 = $1
> var2 = $2
> var3 = $3
> var4 = $4
> var5 = $5
>
> But I'm sure there's a better way, even considering that the number of
> parameters can increase and I don't want to write a long regular
> expression rule, that is hard to read.

s = "Client=MPEG-4,390000,700000,24000"
==>"Client=MPEG-4,390000,700000,24000"
if s =~ /^\w+=\S+(,\d+)+$/
vars = s.split( /[=,]/ )
end
==>["Client", "MPEG-4", "390000", "700000", "24000"]
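
A sketch of how this validate-then-split idea might be applied to several lines at once, collecting the valid ones into a Hash keyed by the first field and converting the numbers to integers. The sample input and the names `VALID` and `parse_all` are hypothetical:

```ruby
VALID = /^\w+=\S+(,\d+)+$/

# Returns { key => [format, n1, n2, ...] } for every valid line;
# lines that do not match VALID are skipped.
def parse_all(text)
  text.each_line(chomp: true).each_with_object({}) do |line, result|
    next unless line =~ VALID
    key, codec, *numbers = line.split(/[=,]/)
    result[key] = [codec, *numbers.map(&:to_i)]
  end
end

p parse_all("Client=MPEG-4,390000,700000,24000\nbroken line\n")
```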
 
Peña, Botp

From: Me Me
# >> line.scanf("%6s=%6s,%d,%d,%d,%d")
# > => ["Client", "MPEG-4", 390000, 700000, 24000]
# the problem I have now is that the size of the string is not
# fixed to 6 chars.
# And if I try to parse like:
# line.scanf("%s=%s,%d,%d,%d,%d")
# It doesn't parse the string.
# Is there a way to parse any string?
# thanks again

oops, sorry, i thought it was good enough.

in that case, you'll have to use char classes,

> line.scanf("%[A-Za-z]=%[A-Z1-9-],%d,%d,%d,%d")
=> ["Client", "MPEG-4", 390000, 700000, 24000]

is that ok?
kind regards -botp
 
Lloyd Linklater

William said:
> s = "Client=MPEG-4,390000,700000,24000"
> ==>"Client=MPEG-4,390000,700000,24000"
> if s =~ /^\w+=\S+(,\d+)+$/
>   vars = s.split( /[=,]/ )
> end
> ==>["Client", "MPEG-4", "390000", "700000", "24000"]

You are right, William. That is cleaner. nice!
 
