Help with split using multiple delimiters

geeknc

I have a file that contains 5 elements per line each separated by white
space, however the 4th element is surrounded by quotes.

Each line in a file looks like this:

ItemA ItemB 1.1.1.1.1 "xxx xx xxxxxx" ItemD

I was hoping to do something like this....

($a,$b,$c,$d,$e) = split(/split on white space or "...."/, $string);

and end up with....

$a = "ItemA";
$b = "ItemB";
$c = "1.1.1.1.1";
$d = "xxx xx xxxxxx";
$e = "ItemD";

I have tried multiple delimiters, but nothing seems to return 5
elements. Thank you, in advance, for any help you can offer.
 
it_says_BALLS_on_your forehead

don't use split--use a regex.

($a, $b, $c, $d, $e) = $string =~
/(\S+)\s+(\S+)\s+(\S+)\s+"(.+)"\s+(\S+)/;

or if using $_

($a, $b, $c, $d, $e) = /(\S+)\s+(\S+)\s+(\S+)\s+"(.+)"\s+(\S+)/;

you can wrap each element in double quotes later.

you may be able to do

@array = /(\S+)\s+(\S+)\s+(\S+)\s+"(.+)"\s+(\S+)/;

for (@array) {
$_ = qq{"$_"};
}
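
A minimal sketch of the match above, run against the sample line from the question. It assumes every line has exactly this layout, with the 4th field as the only quoted one; the names $line and @fields are just for illustration.

```perl
use strict;
use warnings;

# Sample line from the original question.
my $line = 'ItemA ItemB 1.1.1.1.1 "xxx xx xxxxxx" ItemD';

# Three whitespace-delimited fields, one double-quoted field, one more plain field.
# The quotes sit outside the 4th capture group, so they are not captured.
my @fields = $line =~ /(\S+)\s+(\S+)\s+(\S+)\s+"(.+)"\s+(\S+)/;

# A failed match returns an empty list, so check before using the fields.
die "Line does not have the expected layout: $line\n" unless @fields == 5;

print "[$_]\n" for @fields;
```

The explicit count check matters because a line that does not fit the pattern silently leaves all five variables undefined.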
 
Paul Lalli

geeknc said:
I have a file that contains 5 elements per line each separated by white
space, however the 4th element is surrounded by quotes.

Can you explain what was wrong with the solution you found in the FAQ?
You did, of course, search the FAQ before asking hundreds of other
people for help, right?

perldoc -q split
How can I split a [character] delimited string except when
inside [character]? (Comma-separated files)

In your case, the first [character] is a space, the second is a
double-quotes.
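
For readers who want a module-based route, here is a hedged sketch using quotewords from the core Text::ParseWords module, with whitespace as the delimiter. It is an illustration rather than a quote from the FAQ, and it assumes the quoted field never contains doubled or escaped quotes; $line and @fields are illustrative names.

```perl
use strict;
use warnings;
use Text::ParseWords qw(quotewords);

# Sample line from the original question.
my $line = 'ItemA ItemB 1.1.1.1.1 "xxx xx xxxxxx" ItemD';

# First argument: delimiter pattern (runs of whitespace).
# Second argument: 0 means "do not keep the quotes" in the returned fields.
my @fields = quotewords('\s+', 0, $line);

print scalar(@fields), " fields\n";
print "[$_]\n" for @fields;
```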

Paul Lalli
 
James Taylor

it_says_BALLS_on_your forehead said:
don't use split--use a regex.

($a, $b, $c, $d, $e) = $string =~
/(\S+)\s+(\S+)\s+(\S+)\s+"(.+)"\s+(\S+)/;

If you don't know in advance which fields will be quoted,
you can use this regex instead:

my ($a, $b, $c, $d, $e) = $string =~ /("[^"]*"|\S+)/g;
# but then you need to remove any quotes by saying:
s/^"([^"]*)"$/$1/ foreach $a, $b, $c, $d, $e;

If you don't mind the fields all going in one array, you
could do it all in one go like this:

my @fields;
push @fields, $+ while $string =~ /"([^"]*)"|(\S+)/g;

Of course, nothing stops you then assigning the @fields
array to individual scalar variables:

my ($a, $b, $c, $d, $e) = @fields;
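
As a quick check of the one-liner above, here is a small sketch on a line where two different fields are quoted. The input line is made up for illustration; only the regex and the use of $+ come from the post.

```perl
use strict;
use warnings;

# Hypothetical input: fields 2 and 4 are quoted this time.
my $line = 'ItemA "Item B" 1.1.1.1.1 "xxx xx xxxxxx" ItemD';

my @fields;
# Each /g iteration matches either a quoted string (contents captured in $1)
# or a run of non-whitespace (captured in $2). $+ is whichever group matched.
push @fields, $+ while $line =~ /"([^"]*)"|(\S+)/g;

print scalar(@fields), " fields\n";
print "[$_]\n" for @fields;
```

Because the quotes sit outside the first capture group, no separate quote-stripping pass is needed.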

If a single line while loop with a fairly simple regex seems too
easy or too efficient, you can always spend time reading up on
the various CPAN modules suggested by the FAQ (perldoc -q split),
work out how to set up the necessary OO object instances, how
to call the provided methods to get the result you require,
test that it does what you expect, pray that there are no
earlier versions of the module around that are buggy, pray
that no future versions will be buggy, load the whole module
at compile time and hope that this and the method call interface
don't hit performance too much, and then sit back and enjoy
the somewhat dubious pleasures of OPC (Other People's Code)
in the knowledge that at least you didn't have to do the
work yourself. (Irony intended.)

Even if you wanted to use a module, I note that the FAQ
entry "How can I split a [character] delimited string except
when inside [character]?" recommends the use of Text::CVS or
Text::CVS_XS but I don't believe CVS is what's needed here. :)
 
it_says_BALLS_on_your forehead

i don't know if that would work because of greedy matching. you may
need a ? after your asterisk, to make it stingy matching.
 
Anno Siegel

James Taylor said:
[...]

Even if you wanted to use a module, I note that the FAQ
entry "How can I split a [character] delimited string except
when inside [character]?" recommends the use of Text::CVS or
Text::CVS_XS but I don't believe CVS is what's needed here. :)

That must be a typo in the FAQ. s/CVS/CSV/g.

Anno
 
James Taylor

Simon, I'm not sure which bit of my post you were replying
to, or even if it was me you were replying to, as you did
not quote any context. I will therefore attempt to rebuild
the relevant context below with the correct attributions.
You probably need to get a better news reader if you can.

James Taylor said:
If you don't know in advance which fields will be quoted,
you can use this regex instead:

my ($a, $b, $c, $d, $e) = $string =~ /("[^"]*"|\S+)/g;
# but then you need to remove any quotes by saying:
s/^"([^"]*)"$/$1/ foreach $a, $b, $c, $d, $e;

If you don't mind the fields all going in one array, you
could do it all in one go like this:

my @fields;
push @fields, $+ while $string =~ /"([^"]*)"|(\S+)/g;

it_says_BALLS_on_your forehead said:
i don't know if that would work because of greedy matching. you may
need a ? after your asterisk, to make it stingy matching.

If we're sure that the OP's input lines contain simple
double quoted strings that do not themselves contain double
quotes (and this is what his example illustrated) then a
greedy [^"]* will swallow everything up to the next double
quote just as we require. Obviously, if the closing quote was
missing, it wouldn't capture the correct thing. (I think it
would backtrack and treat the opening quote as part of a
space delimited word instead). The OP could check there are
an even number of double quotes beforehand by saying:

die "Bad input line: $string\n" if $string =~ tr/"// % 2;
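
A small sketch of both points: the tr-based quote count, and what the field regex does when a closing quote is missing. The helper name quotes_balanced and the broken input line are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true when the line has an even number of double quotes.
sub quotes_balanced {
    my ($line) = @_;
    my $count = ($line =~ tr/"//);   # tr with an empty replacement only counts
    return $count % 2 == 0;
}

# Sample line from the question with its closing quote removed.
my $bad = 'ItemA ItemB 1.1.1.1.1 "xxx xx xxxxxx ItemD';

# Without a closing quote the first alternative cannot match, so the regex
# falls back to \S+ and the stray quote ends up inside a plain word.
my @fields;
push @fields, $+ while $bad =~ /"([^"]*)"|(\S+)/g;

print quotes_balanced($bad) ? "balanced\n" : "unbalanced\n";
print "[$_]\n" for @fields;
```

On this input the loop yields seven fields, the fourth being `"xxx` with the quote still attached, which is why checking the count first is worthwhile.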

If the input lines were similar to CSV in allowing strings
that themselves contain double quotes, doubled up like this:

ItemA ItemB 1.1.1.1.1 "He said ""Hello"" to me" ItemD

then a more complex regex would be required. If this is what the
OP wants he can ask, but I don't believe it is. What he shouldn't
do, though, is use Text::ParseWords because, contrary to popular
belief, it doesn't handle CSV style quotes.
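
For illustration only, one possible shape for such a "more complex regex". This is a sketch, not something posted in the thread, and it assumes doubling is the only way a quote can appear inside a quoted field.

```perl
use strict;
use warnings;

# Example line from above, with CSV-style doubled quotes inside field 4.
my $line = 'ItemA ItemB 1.1.1.1.1 "He said ""Hello"" to me" ItemD';

my @fields;
# Inside quotes, accept either a non-quote character or a doubled quote.
while ($line =~ /"((?:[^"]|"")*)"|(\S+)/g) {
    if (defined $1) {
        (my $quoted = $1) =~ s/""/"/g;   # collapse each "" to a single "
        push @fields, $quoted;
    }
    else {
        push @fields, $2;
    }
}

print "[$_]\n" for @fields;
```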
 
James Taylor

James said:
Even if you wanted to use a module, I note that the FAQ
entry "How can I split a [character] delimited string except
when inside [character]?" recommends the use of Text::CVS or
Text::CVS_XS but I don't believe CVS is what's needed here. :)

Anno Siegel said:
That must be a typo in the FAQ. s/CVS/CSV/g.

Who's responsible for maintaining the FAQ?
What's the correct procedure for nudging them?
 
it_says_BALLS_on_your forehead

James said:
If you don't know in advance which fields will be quoted,
you can use this regex instead:


....so based on that (you said fieldS), the greedy matching would have
caused the regex to do something that was unintended.
James Taylor also wrote:
If this is what the
OP wants he can ask, but I don't believe it is.

....referring to nested quotes. you're right, he didn't ask that. nor did
i assume he did. the example that he gave suggests that the 4th field
would always be the quoted field, so that's why i gave him the simple
regex that i did.

i was simply pointing out what i thought was an oversight in your
regex, because my interpretation was that you thought the OP may have
to deal with multiple quoted fields, and if that were the case, the
default greedy matching would eat up all but the last quote.
 
xhoster

it_says_BALLS_on_your forehead said:
...so based on that (you said fieldS), the greedy matching would have
caused the regex to do something that was unintended.

Can you illustrate this alleged problem?

Xho
 
it_says_BALLS_on_your forehead

whoops, you're right. the [^"] deals with that, so it wouldn't be a
problem. and nested quotes would be a problem regardless of whether the
non-greedy modifier (?) was used. sorry about that...


i was thinking something like:
my $str = q{"one" "two" "three" "four" "five"};
my @fields = $str =~ /(".*")/g;

....
which would put the whole string in $fields[0].
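
A short sketch comparing the three patterns discussed in this exchange on that same string; the array names are made up for illustration.

```perl
use strict;
use warnings;

my $str = q{"one" "two" "three" "four" "five"};

# Greedy .* runs to the end of the string and backs off only to the last quote.
my @greedy  = $str =~ /(".*")/g;

# Non-greedy .*? stops at the first closing quote it can reach.
my @lazy    = $str =~ /(".*?")/g;

# A negated class cannot cross a quote at all, greedy or not.
my @negated = $str =~ /("[^"]*")/g;

printf "greedy: %d, lazy: %d, negated: %d\n",
    scalar(@greedy), scalar(@lazy), scalar(@negated);
```

This prints "greedy: 1, lazy: 5, negated: 5", which is why the [^"]* version needs no ? after the asterisk.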

again, sorry about that James Taylor.
 
