Regexp, Strings and spaces

Florent Carli

Hello experts,

I'm looking for a regexp to extract the values from something like this:

field1="value with or without spaces" field2=valuewithoutspaces

My only concern is that I don't want to match the quote characters.
For now I came up with:
my (@res) = $line =~ m/=(?:((?<=["])[^"]+(?=["])|(?<!["])\S+(?!["])))/g;
But the lookbehinds do not work...

Is there any way to do this without using lookbehinds?

Thanks!
 
Anno Siegel

Florent Carli said:
Hello experts,

I'm looking for a regexp to extract the values from something like this:

field1="value with or without spaces" field2=valuewithoutspaces

My only concern is that I don't want to match the quote characters.
For now I came up with:
my (@res) = $line =~ m/=(?:((?<=["])[^"]+(?=["])|(?<!["])\S+(?!["])))/g;
But the lookbehinds do not work...

Is there any way to do this without using lookbehinds?

Sure: /"?([^"]*)/

Take a look at one or another of the csv modules too.

Anno
 
J. Romano

Florent Carli said:
I'm looking for a regexp to extract the values from something like this:

field1="value with or without spaces" field2=valuewithoutspaces

My only concern is that I don't want to match the quote characters.
For now I came up with:
my (@res) = $line =~ m/=(?:((?<=["])[^"]+(?=["])|(?<!["])\S+(?!["])))/g;
But the lookbehinds do not work...

This is easier done without lookbehinds:

$line =
    'field1="value with or without spaces" field2=valuewithoutspaces';

while ( $line =~ m/="([^"]*)"|=(\w*)/g )
{
    push @res, $1 if defined $1;
    push @res, $2 if defined $2;
}

Essentially, the above lines of code loop through every instance of
either
="some text"
or
=some_text
The first instance is matched by m/="[^"]*"/ and the second by
m/=\w*/ . Therefore, I put them together (joining them with the "|"
symbol and putting capturing parentheses around the text I'm
interested in) to get the regular expression m/="([^"]*)"|=(\w*)/g .

The "/g" is used to loop through every match, populating either $1
or $2 every time through the loop. Inside the loop, I push either $1
or $2 into the @res array, depending on which one is defined (that is,
which one happened to match).
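
If the field names are needed as well, the same alternation can capture them too. The following is only a sketch: the parse_fields helper and the %value_of hash are made-up names, and it assumes that field names consist of word characters and that an unquoted value runs up to the next whitespace (\S* instead of \w*, so that values containing dots or other punctuation are kept whole).

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): map each field name to its
# value, with any surrounding double quotes left out.
sub parse_fields {
    my ($line) = @_;
    my %value_of;
    # $1 = field name, $2 = quoted value, $3 = bare value
    while ( $line =~ m/(\w+)=(?:"([^"]*)"|(\S*))/g ) {
        $value_of{$1} = defined $2 ? $2 : $3;
    }
    return %value_of;
}

my %value_of = parse_fields(
    'field1="value with or without spaces" field2=valuewithoutspaces'
);
print "$_ => [$value_of{$_}]\n" for sort keys %value_of;
```

A single field can then be read as $value_of{field2}, whichever of the two forms it was written in.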

I hope this helps.

-- Jean-Luc
 
Florent Carli

J. Romano said:
$line =
    'field1="value with or without spaces" field2=valuewithoutspaces';

while ( $line =~ m/="([^"]*)"|=(\w*)/g )
{
    push @res, $1 if defined $1;
    push @res, $2 if defined $2;
}

I think my specifications were bad.
The "line" can be as long as it wants, with any number of fields.
It can be
field1="test" field2=test2 field3="test 3" field4="testagain"
and the next line could be
field1="test 4" field2="test 5" field3=test_6 field4="test n°7"

What I need is to get the value of field2 whatever form it takes:
"value with space", "valuewithoutspace", valuewithoutspace, or
even empty or "".
In all cases, the value alone (without quotes) must go into $1 and $1
only.

For now, the only regexp I have found that can do this is:
field2=["]?((?<=["])[^"]*(?=["])|(?<!["])\S*(?!["]))
But like I said, the software I use to parse is using a version of
perl that does not support lookbehinds...

I'm trying to do basically the same thing Windows does when you type:
copy "my file.doc" "d:\my documents"
or
copy myfile.doc d:\

But only with one regexp (and no second pass in perl to remove the
quotes, for instance ;) )
Any ideas?
 
Anno Siegel

Florent Carli said:
Sure: /"?([^"]*)/
This does not work since 'field=hello field2="world"' would get you
'hello field2=' into $1.

I didn't read your original specification that way.
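
For reference, the behaviour described in the quoted objection can be reproduced with a short check. This is a sketch that assumes the suggested pattern is prefixed with the field name (field=), which is the reading under which the quoted result appears; the variable names are illustrative.

```perl
use strict;
use warnings;

my $line = 'field=hello field2="world"';

# Assumed form of the suggested pattern, prefixed with the first field name.
# The optional quote matches nothing here, so [^"]* runs on to the next '"'.
my ($got) = $line =~ /field="?([^"]*)/;
print "captured: [$got]\n";    # captured: [hello field2=]
```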

The best solution is probably a module (Text::Balanced, or one of
the CSV modules). For background information, see the FAQ:

How can I split a [character]-delimited string except when inside [character]?

Anno
 
A. Sinan Unur

Florent Carli said:
For now, the only regexp I have found that can do this is:
field2=["]?((?<=["])[^"]*(?=["])|(?<!["])\S*(?!["]))
But like I said, the software I use to parse is using a version of
perl that does not support lookbehinds...

I'm trying to do basically the same thing Windows does when you type:
copy "my file.doc" "d:\my documents"
or
copy myfile.doc d:\

But only with one regexp (and no second pass in perl to remove the
quotes, for instance ;) )
Any ideas?

Is this just out of curiosity?

If there is some other purpose to this, take a look at Text::Balanced.
The few times I needed this type of functionality, that module worked
very well for me.
 
Florent Carli

The problem is that I have to enter a regex into a config file of a
piece of software which does not understand lookbehinds (probably an
old version of perl, since I get a "bad pattern <?...").
Anyway, I'm not using perl directly for this; I have to find a regex
that does it without lookbehinds, that's it.
That's why I cannot code a second pass to remove the quotes after a
/field2=("[^"]*"|\S*)/ for instance, or something that would give me
the one backreference I need after a /field2=(?:"([^"]*)"|(\S*))/.
I can't use a perl module either, of course.
In fact, I cannot code at all; the only thing I can control is one
regexp.

Thanks!
 
J. Romano

Florent Carli said:
I think my specifications were bad.

I'm sorry, but did you even try out my code? It does exactly what
you want. I even tested it.

Florent Carli said:
The "line" can be as long as it wants with so many fields.
It can be field1="test" field2=test2 field3="test 3"
field4="testagain"
and the next line could be
field1="test 4" field2="test 5" field3=test_6 field4="test n°7"

It does exactly that. I even created a short script for you to run
to show you that it works. Here, try this:

#!/usr/bin/perl -w
use strict;
my @res; # results will be stored here
# Process the input lines (from the DATA section):
while (<DATA>)
{
    while ( m/="([^"]*)"|=(\w*)/g )
    {
        push @res, $1 if defined $1;
        push @res, $2 if defined $2;
    }
}
# Print out the @res array to show the results:
for (my $i = 0; $i < @res; $i++)
{
    print "\$res[$i] = \"$res[$i]\"\n";
}
__DATA__
# These are sample input lines:
field1="test" field2=test2 field3="test 3" field4="testagain"
field1="test 4" field2="test 5" field3=test_6 field4="test n°7"
field1=""
__END__

Florent Carli said:
What I need is to get the value of field2 whatever form it takes:
"value with space", "valuewithoutspace", valuewithoutspace, or
even empty or "".
In all cases, the value alone (without quotes) must go into $1 and $1
only.

No, I think you are mistaken. The value alone (without quotes)
must go into the @res array, and not necessarily into $1. The match
will temporarily be in either $1 or $2, but regardless of which it
goes into, it WILL be placed into the @res array, which is what you
want.

Florent Carli said:
For now, the only regexp I have found that can do this is:
field2=["]?((?<=["])[^"]*(?=["])|(?<!["])\S*(?!["]))
But like I said, the software I use to parse is using a version of
perl that does not support lookbehinds...

Don't use look-behinds. They are not needed for your task. And
please test the code I gave you before saying that it doesn't do what
you want.

-- Jean-Luc
 
J. Romano

Florent Carli said:
The problem is that I have to enter a regex into a config file of a
piece of software which does not understand lookbehinds (probably an
old version of perl, since I get a "bad pattern <?...").

Oh, so that's why you had all those restrictions. Without the
knowledge of your restrictions, we couldn't really give you a complete
answer.

Florent Carli said:
Anyway, I'm not using perl directly for this; I have to find a regex
that does it without lookbehinds, that's it.

Are you sure you are using Perl for this? I've done similar things
myself (that is, putting a regular expression in a config file), but I
don't think it was Perl that was evaluating them. It could be that
Perl has nothing to do with this.

Florent Carli said:
That's why I cannot code a second pass to remove the quotes after a
/field2=("[^"]*"|\S*)/ for instance, or something that would give me
the one backreference I need after a /field2=(?:"([^"]*)"|(\S*))/.
I can't use a perl module either, of course.
In fact, I cannot code at all; the only thing I can control is one
regexp.

The main problem is that you are searching for different patterns,
depending on what your delimiter is. If you have 'value="some text"',
then you will be looking for the next '"' character to signal the end
of your pattern. But if you have 'value=some_text', then you will be
looking for whitespace to signal the end of your pattern. This flow
of logic (if-then-else) is something that regular expressions alone
weren't made to handle.
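
Where ordinary Perl code is available (which, from the description, is not the case inside that config file), the branching is easy to write out explicitly. A minimal sketch, with field_value as a hypothetical helper name:

```perl
use strict;
use warnings;

# Hypothetical helper: return the unquoted value of one named field,
# or nothing (undef in scalar context) if the field is absent.
sub field_value {
    my ($line, $name) = @_;
    if ( $line =~ m/\b\Q$name\E="([^"]*)"/ ) {
        return $1;    # quoted value: ends at the next double quote
    }
    elsif ( $line =~ m/\b\Q$name\E=(\S*)/ ) {
        return $1;    # bare value: ends at the next whitespace
    }
    return;
}

my $line = 'field1="test 4" field2="test 5" field3=test_6';
print field_value($line, 'field2'), "\n";    # test 5
print field_value($line, 'field3'), "\n";    # test_6
```

The two branches correspond directly to the two terminators: the closing quote in one case, whitespace in the other.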

I don't think your problem has a working solution because regular
expressions lack the ability to carry out the above logic. So let me
propose two work-arounds:

1. You could modify the program that reads the config files to handle
the logic you need.

or

2. You can write a simple Perl script to convert your config file so
that all the fields have quotes around the values (whether they need
them or not). In other words, your script would change all instances
of:

field1=some_text

to:

field1="some_text"

Then you could just set your regular expression to be:

m/field[0-9]+="([^"]*)"/

and then all your fields would be extracted. Problem solved.
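
The conversion script for the second work-around could look roughly like this. It is a sketch only: quote_all_values is a made-up name, and it assumes that the input can be preprocessed before the other software reads it, that field names are word characters, and that bare values contain neither whitespace nor double quotes.

```perl
use strict;
use warnings;

# Hypothetical helper: rewrite every name=value pair so that the value is
# enclosed in double quotes, whether or not it was quoted before.
sub quote_all_values {
    my ($line) = @_;
    # $1 = field name, $2 = already-quoted value, $3 = bare value
    $line =~ s/(\w+)=(?:"([^"]*)"|(\S*))/ $1 . '="' . (defined $2 ? $2 : $3) . '"' /ge;
    return $line;
}

print quote_all_values('field1="test" field2=test2 field3="test 3"'), "\n";
# field1="test" field2="test2" field3="test 3"
```

Because quoted values are consumed whole by the first alternative, an "=" inside a quoted value is not mistaken for a new field.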

Of course, I would imagine that the second work-around will be much
easier for you to implement, unless there is some other restriction
that you haven't shared with us.

Hopefully you'll find a solution that works for you.

-- Jean-Luc
 
