Regular expressions: how to skip characters from a capture

Ikke

Hi everybody,

First of all, if this is not the right group to ask, I apologize. I could
not find a specific regex group, so I decided to turn to those who know
Perl.

I am trying to build a regex, which needs to capture data from a .csv
file in a very specific format.

The first part should be a filename - either enclosed in quotes, or not.
This is the part that is giving me problems.

I've created something like: ((filename)|quote(filename)quote) which
matches the text as I would like, but it doesn't capture the text as
expected.

If there are no quotes, the text is returned as group 1. If there are
quotes, group 2 returns the filename with the quotes. I'd like the quotes
to be removed from this group. I assumed that, if the first part of the
expression matches, the double parentheses would indicate that group 2 would
return the name as well, but this is not the case. Apparently ((regex))
equals (regex).

Does anybody know a solution for this problem?

Thank you very much,

Ikke
 
xhoster

Ikke said:
Hi everybody,

First of all, if this is not the right group to ask, I apologize. I could
not find a specific regex group, so I decided to turn to those who know
Perl.

I am trying to build a regex, which needs to capture data from a .csv
file in a very specific format.

Are you using Perl to do your regex? Or the regex feature in some other
language? The first is on topic here, but the second is not. We generally
won't know what semantics or quirks may be present in someone else's
implementation of regex.

The first part should be a filename - either enclosed in quotes, or not.
This is the part that is giving me problems.

I've created something like: ((filename)|quote(filename)quote) which
matches the text as I would like, but it doesn't capture the text as
expected.

If there are no quotes, the text is returned as group 1. If there are
quotes, group 2 returns the filename with the quotes.

I can't reproduce that.
perl -le '$_=qq{"filename"}; /((filename)|"(filename)")/ or die;
print foreach ($1,$2,$3)'
"filename"

filename

I get quoted filename in group 1, nothing in group 2, and unquoted filename
in group 3.
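
To illustrate the numbering described above: Perl numbers capture groups by the position of their opening parenthesis, so in ((filename)|"(filename)") the outer group is $1 and the two inner ones are $2 and $3. One possible way around this, assuming Perl 5.10 or later, is the branch-reset construct (?|...), which lets both alternatives capture into the same group. The sketch below is only an illustration: the sub name unquoted_name and the ^ and $ anchors are assumptions, not part of the pattern in the original post, and "filename" is still the literal placeholder used in the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the bare name for `filename` or `"filename"`,
# and undef for anything else. Needs Perl 5.10+ for (?|...).
sub unquoted_name {
    my ($text) = @_;
    # Inside (?|...) every alternative restarts the group numbering,
    # so both branches capture into $1.
    return $text =~ /^(?|(filename)|"(filename)")$/ ? $1 : undef;
}

for my $input ('filename', '"filename"', '"filename', 'filename"') {
    my $name = unquoted_name($input);
    printf "%-12s => %s\n", $input, defined $name ? $name : '(no match)';
}
```

Whether another regex engine supports branch reset has to be checked in that engine's documentation; this only shows the Perl behaviour.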

Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.
 
smallpond

Ikke said:
Hi everybody,

First of all, if this is not the right group to ask, I apologize. I could
not find a specific regex group, so I decided to turn to those who know
Perl.

I am trying to build a regex, which needs to capture data from a .csv
file in a very specific format.

The first part should be a filename - either enclosed in quotes, or not.
This is the part that is giving me problems.

I've created something like: ((filename)|quote(filename)quote) which
matches the text as I would like, but it doesn't capture the text as
expected.

If there are no quotes, the text is returned as group 1. If there are
quotes, group 2 returns the filename with the quotes. I'd like the quotes
to be removed from this group. I assumed that, if the first part of the
expression matches, the double parentheses would indicate that group 2 would
return the name as well, but this is not the case. Apparently ((regex))
equals (regex).

Does anybody know a solution for this problem?

Thank you very much,

Ikke


/"*([^",]*)/

foo,baz => foo
"foo",baz => foo
 
Ikke

xhoster said:
Are you using Perl to do your regex? Or the regex feature in some
other language? The first is on topic here, but the second is not.
We generally won't know what semantics or quirks may be present in
someone else's implementation of regex.

That's mostly the reason I apologized for barging in here - I'm using
Delphi and I'm not that familiar with regular expressions.

As you pointed out, I've already noticed that regular expressions come in
various "dialects", for various languages...

I can't reproduce that.
perl -le '$_=qq{"filename"}; /((filename)|"(filename)")/ or die;
print foreach ($1,$2,$3)'
"filename"

filename

I get quoted filename in group 1, nothing in group 2, and unquoted
filename in group 3.

Something similar to what I'm getting, I'm afraid. I'd like to have a
regular expression which always returns filename, whether the original
line states filename or "filename", but which does not match unbalanced
input such as "filename or filename".

Thanks anyway,

Ikke
 
Ikke

smallpond said:
/"*([^",]*)/

foo,baz => foo
"foo",baz => foo

This is similar to the expression I started out with - but this will also
match:
"foo,baz => foo
or
foo",baz => foo
both of which are not valid inputs for my scenario.
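
This objection can be checked directly. A small sketch (the helper name first_field is made up for illustration) that applies the suggested pattern to the two valid lines and to the two unbalanced ones:

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the pattern under discussion.
sub first_field {
    my ($line) = @_;
    # "* allows any number of leading quotes (including none); the group
    # then takes everything up to the next quote or comma.
    return $line =~ /"*([^",]*)/ ? $1 : undef;
}

for my $line ('foo,baz', '"foo",baz', '"foo,baz', 'foo",baz') {
    print "$line => ", first_field($line), "\n";
}
```

All four lines yield foo, because nothing in the pattern ties an opening quote to a closing one.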

Thanks anyway,

Ikke
 
Jürgen Exner

Ikke said:
I am trying to build a regex, which needs to capture data from a .csv
file in a very specific format.

The first part should be a filename - either enclosed in quotes, or not.
This is the part that is giving me problems.

Why aren't you using a parser for parsing CSV? There are several modules
on CPAN.

Ikke said:
I've created something like: ((filename)|quote(filename)quote) which
matches the text as I would like, but it doesn't capture the text as
expected.

So, let me paraphrase: sometimes the data elements in your CSV are
enclosed in quotes, and you want them without quotes, whether or not they
are stored with quotes. Right?

Ikke said:
Does anybody know a solution for this problem?

REs are not powerful enough to match balanced items; please see "perldoc
-q balanced" for more details.
Just use a CSV parser module; it will take care of the de-quoting
automatically.
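
As a rough illustration of the parser approach: Text::CSV is the CPAN module usually suggested for this kind of job. The sketch below uses Text::ParseWords instead, purely because it ships with Perl and so runs without installing anything. That substitution, the sub name first_field, and the assumption that fields are comma-separated with optional double quotes are all mine and not from the post.

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);

# Hypothetical helper: first field of a comma-separated line, quotes removed.
sub first_field {
    my ($line) = @_;
    # The second argument (0) asks parse_line not to keep the quote characters.
    my @fields = parse_line(',', 0, $line);
    # Treat an empty result as "could not be split" rather than guessing.
    return @fields ? $fields[0] : undef;
}

print first_field('foo,baz'), "\n";
print first_field('"foo",baz'), "\n";
```

Text::ParseWords also gives special meaning to single quotes and backslashes, so it is not a full CSV parser; how it treats unbalanced quotes is worth testing against real data before relying on it.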

jue
 
xhoster

Ikke said:
Something similar to what I'm getting, I'm afraid.

Perl numbers the captures in the order they open. Maybe Delphi does it
in the order they close.

Ikke said:
I'd like to have a regular expression which always returns filename,
whether the original line states filename or "filename", but not
"filename or filename".

In Perl you could use look ahead and look behind, like this:

/(filename|(?<=")filename(?="))/

There is only one capture there; the other parentheses are non-capturing
ones used for look-ahead and behind syntax. But I doubt Delphi has
look-ahead and look-behind, or at least not the same as Perl's.

I usually don't like cramming too much logic into one regex, so I'd be more
likely to break it up like this:

if ( /filename|"filename"/ ) {    # just validate, no capture
    /"?(filename)"?/;             # capture valid data
    # ....
} else {
    die "not valid";
}

The capture could match things with unbalanced quotes, except that is
protected from doing so by the prior regex.
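
A minimal sketch of this validate-then-capture split, under some assumptions that are not in the post: the field has already been isolated into its own string, a name is one or more characters other than a double quote or comma, and the validation is anchored with ^ and $ (without anchors, a string such as "filename would still pass the first test, because the bare alternative matches inside it). The sub name checked_name is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the unquoted name, or undef if the field
# is not a bare name or a name in balanced double quotes.
sub checked_name {
    my ($field) = @_;
    # Step 1: validate only. The anchors make unbalanced quotes fail here.
    return unless $field =~ /^(?:[^",]+|"[^",]+")$/;
    # Step 2: capture. This pattern alone would accept unbalanced quotes,
    # but step 1 has already ruled them out.
    my ($name) = $field =~ /^"?([^",]+)"?$/;
    return $name;
}

for my $field ('foo.txt', '"foo.txt"', '"foo.txt', 'foo.txt"') {
    my $name = checked_name($field);
    print "$field => ", (defined $name ? $name : 'not valid'), "\n";
}
```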

But I don't know how this fits into Delphi.

Xho

 
smallpond

Ikke said:
/"*([^",]*)/
foo,baz => foo
"foo",baz => foo

This is similar to the expression I started out with - but this will also
match:
"foo,baz => foo
or
foo",baz => foo
both of which are not valid inputs for my scenario.

Thanks anyway,

Ikke


You are trying to do two different things with one
regex. You want to extract the filename and validate the input.

You need to explain what the correct output of the regex is in
the invalid cases.
 
Dr.Ruud

Ikke said:
I'd like to have a
regular expression which always returns filename, whether the original
line states filename or "filename", but not "filename or filename".

/("?)([^" ]*)\1/
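
The idea here is a backreference: group 1 captures either a quote or nothing, and \1 then requires the same thing again after the name. As written the pattern is unanchored, so it can still find a match somewhere inside an unbalanced string. A hedged sketch of an anchored variant follows; the anchors, the lookahead for a comma or end of line, the exclusion of commas rather than spaces, and the sub name first_name are my assumptions, not part of the post.

```perl
use strict;
use warnings;

# Hypothetical helper: first field of a line with balanced quotes removed,
# or undef if the quotes do not balance.
sub first_name {
    my ($line) = @_;
    # Group 1 is a quote or the empty string; \1 must repeat it exactly.
    # The lookahead insists on a comma or end of line right after the field.
    return $line =~ /^("?)([^",]+)\1(?=,|$)/ ? $2 : undef;
}

for my $line ('foo,baz', '"foo",baz', '"foo,baz', 'foo",baz') {
    my $name = first_name($line);
    print "$line => ", (defined $name ? $name : 'not valid'), "\n";
}
```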
 
Bart Lateur

Ikke said:
I've created something like: ((filename)|quote(filename)quote) which
matches the text as I would like, but it doesn't capture the text as
expected.

If there are no quotes, the text is returned as group 1. If there are
quotes, group 2 returns the filename with the quotes. I'd like the quotes
to be removed from this group.

Much simpler is to just use the whole capture and strip the quotes from
this afterwards.

Either that, or you have to loop through the captures till you find one
that isn't undefined (or empty?), except for the outer one (maybe make
it noncapturing).

BTW Perl has the $+ special variable for this purpose. See perlvar:

    $+      The text matched by the last bracket of the last successful
            search pattern. This is useful if you don't know which one
            of a set of alternative patterns matched. For example:

                /Version: (.*)|Revision: (.*)/ && ($rev = $+);

            (Mnemonic: be positive and forward looking.) This variable
            is read-only and dynamically scoped to the current BLOCK.

It might not be so trivial in Delphi.
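
For the filename case, the $+ approach might look like the sketch below. It assumes the field has already been isolated, that a name contains no double quotes or commas, and that the outer group is made non-capturing as suggested above; the sub name name_via_last_paren is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: whichever alternative matched, $+ holds its capture.
sub name_via_last_paren {
    my ($field) = @_;
    # The outer (?:...) does not capture; only one of the two inner
    # groups is set after a successful match.
    return unless $field =~ /^(?:([^",]+)|"([^",]+)")$/;
    return $+;
}

for my $field ('foo.txt', '"foo.txt"', '"foo.txt') {
    my $name = name_via_last_paren($field);
    print "$field => ", (defined $name ? $name : 'not valid'), "\n";
}
```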
 
sln

Jürgen Exner said:
Why aren't you using a parser for parsing CSV? There are several modules
on CPAN.

So, let me paraphrase: sometimes the data elements in your CSV are
enclosed in quotes. And you want them without quotes, no matter if they
are stored with or without quotes. Right?


REs are not powerful enough to match balanced items, please see "perldoc
-q balanced" for more details.
Just use a CSV parser module, it will take care of the de-quoting
automatically.

jue

I can only see this post in my reader (Agent); my Usenet provider (Easynews)
has been flaky lately. I'm not too sure what the issue is with the OP.
I didn't see any Perl regexp there, so it might be some other flavor.

This regexp will match a lot, so it needs help from a conditional.

Good Luck!

use strict;
use warnings;

while (<DATA>)
{
    chomp;
    next if (!length());
    print "\nline = $_\n";

    # Try this if you can't do the while/global search below
    # ------------------------------------------------------
    # if (/(,?)("?)(filename)("?)/ && $2 eq $4)

    while (/(,?)("?)(filename)("?)/g)
    {
        next if ( $2 ne $4);
        print "------------------\n";
        print "\tFound: <$1>\n";
        print "\tFound: <$2>\n";
        print "\tFound: <$3>\n"; # <- where it is
        print "\tFound: <$4>\n";
    }
}

__DATA__
asdfasdfasd asdasdf
"filename"
filename
filename,filename
,"filename",
"filename,filename"
filename,filename"
"filename,filename
"filename
filename"
filename",filename,"filename"
 
