Regexp to extract filename from g++ output

Mike

hi all,

i am trying to extract the filename from a g++ compiler message
but cannot find a regexp that handles the optional column part
correctly. the message is in the format:

c:\tmp\test.cpp:1: type: messagetext
or
c:\tmp\test.cpp:1:20: type: messagetext

OR

/tmp/test.cpp:1: type: messagetext
or
/tmp/test.cpp:1:20: type: messagetext

thanks a lot!

mike
 
Collin VanDyck

Maybe try

Pattern p = Pattern.compile("^([^:]+):");

That will match all characters up to the first colon. I didn't infer
from your post that you needed to catch the column numbers, only the
file name.

When you create a matcher on this Pattern, you'd use the first group:

matcher.group(1)

to get the part that matched the filename.
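A minimal sketch of how the Pattern and matcher.group(1) fit together, run against the Unix-style sample line from the question. The class and method names are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstColonDemo {
    // The pattern from the post: everything from the start of the line up to the first colon.
    static final Pattern P = Pattern.compile("^([^:]+):");

    // Returns the text before the first colon, or null if the pattern does not match.
    static String beforeFirstColon(String line) {
        Matcher m = P.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // prints "/tmp/test.cpp"
        System.out.println(beforeFirstColon("/tmp/test.cpp:1: type: messagetext"));
    }
}
```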

Does this help at all?

Collin
 
Virgil Green

Try this... which accommodates the Windows-style path and Unix-style paths.
Only partially tested.
(([a-zA-Z]):|)(\\|/)[^:]+
 
Virgil Green

Virgil said:
Try this... which accommodates the Windows-style path and Unix-style
paths. Only partially tested.
(([a-zA-Z]):|)(\\|/)[^:]+

Forgot to add additional escaping for use in Java source:
(([a-zA-Z]):|)(\\\\|/)[^:]+
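As a sketch of what that escaping looks like in practice: inside a Java string literal every regex backslash has to be doubled, so the regex `\\` (one literal backslash) is written `\\\\` in source. The class name and the choice of find() are assumptions for illustration, not part of the original post.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapedPathPattern {
    // As the regex engine sees it: (([a-zA-Z]):|)(\\|/)[^:]+
    static final Pattern PATH = Pattern.compile("(([a-zA-Z]):|)(\\\\|/)[^:]+");

    // Returns the first substring that looks like an absolute path, or null if none is found.
    static String firstPath(String line) {
        Matcher m = PATH.matcher(line);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The sample lines from the question; "\\" in source is a single backslash at runtime.
        System.out.println(firstPath("c:\\tmp\\test.cpp:1: type: messagetext"));
        System.out.println(firstPath("/tmp/test.cpp:1:20: type: messagetext"));
    }
}
```

On these two sample lines the match stops at the colon before the line number, because `[^:]+` cannot cross a colon.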
 
Tilman Bohn

c:\tmp\test.cpp:1: type: messagetext
or
c:\tmp\test.cpp:1:20: type: messagetext
[...]
Pattern p = Pattern.compile("^([^:]+):");

That will match all characters up to the first colon.

Which, in the first two examples, will be `c'.
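To see this concretely, here is a small sketch (hypothetical class name) that runs the quoted pattern over the Windows-style sample line:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstColonOnWindowsPath {
    // Applies the quoted pattern and returns group 1, or null if there is no match.
    static String beforeFirstColon(String line) {
        Matcher m = Pattern.compile("^([^:]+):").matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The drive letter's colon is the first colon on the line, so this prints "c".
        System.out.println(beforeFirstColon("c:\\tmp\\test.cpp:1:20: type: messagetext"));
    }
}
```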

So the next attempt would include finding strings of only digits
between colons.

However, this one is tricky because c:1 is actually a valid pathname in
DOS (without any leading backslash this refers to the file or directory
named `1' in the current directory of drive c:). To be honest though, I'm
not sure if this (i.e., a cwd per drive) is just a quirk of command.com
and cmd.exe or a part of DOS/Windows's internal book-keeping. (And in any
case I'm not sure whether the output would ever include such a pathname!)

Anyway, the point is that it might be impossible to tell algorithmically
whether the first colon is part of a drive/path combination or the
separator between the filename and line, because c:1:10 might be column 10
of line 1 of file c in the cwd (unix style) _or_ line 10 of file 1 in the
cwd of drive c (DOS style). And to top it all off, on MacOS classic, where
the colon is the regular dir separator, the same output could refer to
line 10 of file 1 in the folder named `c'...

So I suspect your best bet for robustness would be to check for the value
of File.separator and use a different pattern accordingly... Anything else
I think might break sooner or later.
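A rough sketch of that idea, under stated assumptions: the two patterns below are illustrative guesses rather than anything tested against real g++ output, and the separator is passed in as a parameter so both branches can be tried on any machine.

```java
import java.io.File;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SeparatorAwarePattern {
    // Picks a filename pattern for the given separator character.
    static Pattern forSeparator(char separator) {
        if (separator == '\\') {
            // DOS/Windows reading: an optional drive prefix such as "c:" belongs to the path.
            return Pattern.compile("^((?:[a-zA-Z]:)?[^:]+):[0-9]+:");
        }
        // Unix reading: shortest prefix that is followed by ":<digits>:".
        return Pattern.compile("^(.+?):[0-9]+:");
    }

    // Returns the filename part of the line, or null if the pattern does not match.
    static String fileName(String line, char separator) {
        Matcher m = forSeparator(separator).matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // File.separatorChar is '\\' on Windows and '/' on Unix-like systems.
        System.out.println(fileName("c:1:10: error: x", File.separatorChar));
    }
}
```

With the ambiguous input `c:1:10:` the Windows branch yields `c:1` and the Unix branch yields `c`, which is exactly the ambiguity described above. Neither branch copes with colons inside Unix file names.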
 
Tilman Bohn

In message <[email protected]>,
Virgil Green wrote on Mon, 20 Dec 2004 23:30:28 GMT:

[...]
Try this... which accommodates the Windows-style path and Unix-style paths.
Only partially tested.
(([a-zA-Z]):|)(\\|/)[^:]+

1. (foo|) is normally written (foo)?. (Purely cosmetics.) 2. Is the
compiler output guaranteed to only ever contain absolute pathnames?
3. Colons are valid characters in Unix file and directory names.
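One small footnote to point 1, sketched with java.util.regex: the two spellings accept the same input, but when the optional part is absent the group holds an empty string with `(foo|)` and null with `(foo)?`. The class and helper names are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OptionalGroupDemo {
    // Returns group 1 of a full match; throws if the input does not match at all.
    static String group1(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.matches()) {
            throw new IllegalArgumentException("no match: " + input);
        }
        return m.group(1);
    }

    public static void main(String[] args) {
        String path = "/tmp/test.cpp"; // no drive prefix
        // Empty alternative: the group takes part in the match and captures "".
        System.out.println("[" + group1("([a-zA-Z]:|)[^:]+", path) + "]");
        // Optional group: the group is skipped, so group(1) is null.
        System.out.println("[" + group1("([a-zA-Z]:)?[^:]+", path) + "]");
    }
}
```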
 
Virgil Green

Virgil said:
Try this... which accommodates the Windows-style path and Unix-style
paths. Only partially tested.
(([a-zA-Z]):|)(\\|/)[^:]+

Forgot to add additional escaping for use in Java source:
(([a-zA-Z]):|)(\\\\|/)[^:]+

Oops again... no need to test for the differing slashes...
([a-zA-Z]:|)[^:]+

- Virgil
 
Virgil Green

Tilman said:
In message <[email protected]>,
Virgil Green wrote on Mon, 20 Dec 2004 23:30:28 GMT:

[...]
Try this... which accommodates the Windows-style path and Unix-style
paths. Only partially tested.
(([a-zA-Z]):|)(\\|/)[^:]+

1. (foo|) is normally written (foo)?. (Purely cosmetics.) 2. Is the
compiler output guaranteed to only ever contain absolute pathnames?
3. Colons are valid characters in Unix file and directory names.

1. Oh... thanks. I just play with regexes occasionally. Nice to know that.
2. Haven't a clue. Wasn't my original problem. But I did post a later
version when I realized I didn't need to check for the slashes.
3. That occurred to me when I was thinking about not needing to test for
slashes. Spaces are valid too, right? So, is it possible to extract the file
name accurately with a regex given that the text message could contain any
number of colons and spaces? I'm hard pressed to think of a pattern that
would match if you can't rely on a colon or a space as a terminator.

Something for me to mull over.

- Virgil
 
Tilman Bohn

In message <[email protected]>,
Virgil Green wrote on Tue, 21 Dec 2004 00:01:39 GMT:

[...]
1. Oh... thanks. I just play with regexes occasionally. Nice to know that.

You're welcome. ;-) (Not that I'm a regexp-wizard myself, mind you.)

2. Haven't a clue. Wasn't my original problem. But I did post a later
version when I realized I didn't need to check for the slashes.

Our posts must've crossed, I haven't seen that message yet, I only saw
the one with the correct Java Backslash-Escaping.

3. That occurred to me when I was thinking about not needing to test for
slashes. Spaces are valid too, right?

Yup. On Unix, Windows and MacOS.

So, is it possible to extract the file
name accurately with a regex given that the text message could contain any
number of colons and spaces? I'm hard pressed to think of a pattern that
would match if you can't rely on a colon or a space as a terminator.

I don't think it's possible. Actually at this point I don't even think
it's possible if one tests for the platform-specific separator and then
proceeds accordingly, because of the optional column output... However, if
one can assume that all file names will end in .cpp (in any case
combination, as Windows is case-insensitive) one might get close enough
for all practical purposes. This can then of course easily break, but if it's well-documented it
could be acceptable.
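A sketch of that heuristic, assuming every file name ends in `.cpp` (in any case combination) and is immediately followed by `:<line>:`. The pattern and the class name are illustrative and would need to be documented as a known limitation.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CppSuffixHeuristic {
    // Shortest prefix ending in ".cpp" that is followed by ":<digits>:".
    static final Pattern P =
            Pattern.compile("^(.+?\\.cpp):([0-9]+):", Pattern.CASE_INSENSITIVE);

    // Returns the file name, or null if the line does not fit the assumption.
    static String fileName(String line) {
        Matcher m = P.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(fileName("c:\\tmp\\test.cpp:1:20: type: messagetext"));
        System.out.println(fileName("/tmp/TEST.CPP:1: type: messagetext"));
    }
}
```

This breaks as soon as a header or a differently named source file shows up in the output, and a name such as `a.cpp:1:b.cpp` would still be cut short.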
Something for me to mull over.

Have fun! I have to go get some sleep now and will check the group for
your solution first thing in the morning! ;-)
 
Gordon Beaton

3. That occurred to me when I was thinking about not needing to test
for slashes. Spaces are valid too, right? So, is it possible to
extract the file name accurately with a regex given that the text
message could contain any number of colons and spaces? I'm hard
pressed to think of a pattern that would match if you can't rely
on a colon or a space as a terminator.

This works for me, regardless of the filename:

(.+):([0-9]+):([0-9]+): (error|warning):(.+)

It works because the first group (.+) is greedy, not possessive. So it
consumes as much as necessary but leaves enough of the input for the
remainder of the expression to match (actually I believe it will
initially consume the entire input, then backtrack until the
remaining groups match, but I am not a regexp expert).

Here's a simple test run:

using regexp "(.+):([0-9]+):([0-9]+): (error|warning):(.+)"

line 1: 'errors.c:12:13: error: baz'
groups: 5
1 (file): errors.c
2 (line): 12
3 (col): 13
4 (type): error
5 (msg): baz

line 2: 'error:s .c:12:13: error: gurka'
groups: 5
1 (file): error:s .c
2 (line): 12
3 (col): 13
4 (type): error
5 (msg): gurka
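The run above could be reproduced with a small harness along these lines. This is a sketch, not the original test program: the class name, the labels array and the output layout are assumptions.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyParseDemo {
    static final Pattern P =
            Pattern.compile("(.+):([0-9]+):([0-9]+): (error|warning):(.+)");
    static final String[] LABELS = {"file", "line", "col", "type", "msg"};

    // Returns the five captured groups, or null if the whole line does not match.
    static String[] parse(String line) {
        Matcher m = P.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] groups = new String[m.groupCount()];
        for (int i = 0; i < groups.length; i++) {
            groups[i] = m.group(i + 1);
        }
        return groups;
    }

    public static void main(String[] args) {
        String[] lines = {"errors.c:12:13: error: baz", "error:s .c:12:13: error: gurka"};
        for (String line : lines) {
            System.out.println("'" + line + "'");
            String[] groups = parse(line);
            if (groups == null) {
                System.out.println("no match");
                continue;
            }
            for (int i = 0; i < groups.length; i++) {
                // The msg group keeps its leading space: the pattern has no space after the last colon.
                System.out.println((i + 1) + " (" + LABELS[i] + "):" + groups[i]);
            }
        }
    }
}
```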

/gordon
 
Mike

thanks for your replies!

gordon, your version fails here if there is no column:
c:\tmp\test.cpp:1: error: message
does this work in your jre? i am using sun's 1.4.2_06 under win2k.

i think i do not understand the logic of regular expressions at all:
i have:
fn:1:2:
(.*)(:[0-9]+)+
i would expect that this returns a group that contains 'fn' but it returns
'fn:1'
why does it always return the 'maximum matching area'?

when i use:
(.*)(:[0-9]+:)|(:[0-9]+:[0-9]+:)
again it returns 'fn:1'
is this the expected behaviour? what extra logic is applied to the OR?

when i use
(.*)(:[0-9]+:[0-9]+:)|(:[0-9]+:)
it works, but fails for the input string 'fn:1:'

how can i express "read into group1 until the first :[0-9]+ is found" with a
regexp?

thanks,
mike
 
Gordon Beaton

gordon, your version fails here if there is no column:
c:\tmp\test.cpp:1: error: message
does this work in your jre? i am using sun's 1.4.2_06 under win2k.

Sorry, I missed the part about the column being optional.

Try this one:

(.+?):([0-9]+):(?:([0-9]+):)? (error|warning): (.+)

The first group needs a reluctant quantifier, otherwise it consumes
too much of the input when the column is optional and both line and
column numbers are provided.
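A short sketch of the difference, using the same tail in both patterns and changing only the quantifier on the first group. The class and method names are made up for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyVsReluctant {
    // Shared tail: line number, optional column, type and message.
    static final String TAIL = ":([0-9]+):(?:([0-9]+):)? (error|warning): (.+)";
    static final Pattern GREEDY = Pattern.compile("(.+)" + TAIL);
    static final Pattern RELUCTANT = Pattern.compile("(.+?)" + TAIL);

    // Returns "file|line|col" for a matching line, or null if the line does not match.
    static String parse(Pattern p, String line) {
        Matcher m = p.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return m.group(1) + "|" + m.group(2) + "|" + m.group(3);
    }

    public static void main(String[] args) {
        String line = "file.c:2:3: warning: foo";
        System.out.println(parse(GREEDY, line));    // file.c:2|3|null
        System.out.println(parse(RELUCTANT, line)); // file.c|2|3
    }
}
```

With the greedy group the line number is absorbed into the file name and the column is reported as the line.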

The optional column uses two nested groups. Of these, the inner group
is the interesting one, since it contains (just) the column number,
whereas the outer group also contains the terminating colon. The extra
?: after the outer group's opening parenthesis isn't strictly
necessary, but it's there to prevent the outer group from being
captured.
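To illustrate what the ?: changes, this sketch compares the pattern with and without it. The class and helper names are assumptions for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NonCapturingDemo {
    // Outer column group is non-capturing: five groups in total.
    static final Pattern WITH =
            Pattern.compile("(.+?):([0-9]+):(?:([0-9]+):)? (error|warning): (.+)");
    // Outer column group captures too: six groups, and the numbering shifts.
    static final Pattern WITHOUT =
            Pattern.compile("(.+?):([0-9]+):(([0-9]+):)? (error|warning): (.+)");

    // Number of capturing groups in the pattern.
    static int groups(Pattern p) {
        return p.matcher("").groupCount();
    }

    // Returns group n of a full match, or null if the line does not match.
    static String group(Pattern p, String line, int n) {
        Matcher m = p.matcher(line);
        return m.matches() ? m.group(n) : null;
    }

    public static void main(String[] args) {
        String line = "file.c:2:3: warning: foo";
        System.out.println(groups(WITH) + " groups, group 3 = " + group(WITH, line, 3));
        System.out.println(groups(WITHOUT) + " groups, group 3 = " + group(WITHOUT, line, 3));
    }
}
```

Without the ?: the column number moves to group 4, and group 3 becomes the column together with its trailing colon.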

Here's my test again:

using regexp "(.+?):([0-9]+):(?:([0-9]+):)? (error|warning): (.+)"

test 0: 'file.c:2:3: warning: foo':
1 (file): file.c
2 (line): 2
3 (col): 3
4 (type): warning
5 (msg): foo

test 1: 'file.c:2: warning: bar':
1 (file): file.c
2 (line): 2
3 (col): null
4 (type): warning
5 (msg): bar

test 2: 'file name.c:2: warning: gurka':
1 (file): file name.c
2 (line): 2
3 (col): null
4 (type): warning
5 (msg): gurka

test 3: 'file name.c:2:3: warning: baz':
1 (file): file name.c
2 (line): 2
3 (col): 3
4 (type): warning
5 (msg): baz

test 4: 'file n:ame.c:2: warning: gurka':
1 (file): file n:ame.c
2 (line): 2
3 (col): null
4 (type): warning
5 (msg): gurka

test 5: 'file n:ame.c:2:3: warning: baz':
1 (file): file n:ame.c
2 (line): 2
3 (col): 3
4 (type): warning
5 (msg): baz

test 6: 'file n:ame.c:4:5:6:2:3: warning: baz':
1 (file): file n:ame.c:4:5:6
2 (line): 2
3 (col): 3
4 (type): warning
5 (msg): baz

/gordon
 
Mike

groovy!! that works. thanks a lot!

 
