Extracting patterns after matching a regex

Martin

Hi,

I need to extract a string after matching a regular expression. For
example I have the string...

s = "FTPHOST: e4ftl01u.ecs.nasa.gov"

and once I match "FTPHOST" I would like to extract
"e4ftl01u.ecs.nasa.gov". I am not sure of the best approach to the
problem. I had been trying to match the string using something like
this:

m = re.findall(r"FTPHOST", s)

But I couldn't then work out how to return the "e4ftl01u.ecs.nasa.gov"
part. Perhaps I need to find the string and then split it? I had some
help with a similar problem, but now I don't seem to be able to
transfer that to this problem!

Thanks in advance for the help,

Martin
 
MRAB

m = re.search(r"FTPHOST: (.*)", s)
print m.group(1)
 
pdpi

What you're doing is telling Python "look for all matches of
'FTPHOST'". That doesn't really help you much, because you pretty much
expect FTPHOST to be there anyway, so finding it means squat. What you
_really_ want to tell it is "Look for things shaped like 'FTPHOST:
<ftpaddress>', and tell me what <ftpaddress> actually is". Look here:
http://docs.python.org/howto/regex.html#grouping
That'll explain how to accomplish what you're trying to do.
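
A minimal sketch of the grouping idea described above, written in Python 3 syntax (the thread itself uses Python 2). The helper name `extract_ftphost` is hypothetical and not from the thread; it returns `None` when the pattern is absent instead of failing on `.group`.

```python
import re

def extract_ftphost(text):
    """Return the text following 'FTPHOST: ' on the same line, or None."""
    # The parentheses form group 1. By default '.' does not match a newline,
    # so the capture stops at the end of the line.
    match = re.search(r"FTPHOST: (.*)", text)
    return match.group(1) if match else None

print(extract_ftphost("FTPHOST: e4ftl01u.ecs.nasa.gov"))
```

Checking for a failed match matters here because `re.search` returns `None` when nothing matches, and calling `.group(1)` on `None` raises `AttributeError`.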
 
Andreas Tawn

No need for regex.

s = "FTPHOST: e4ftl01u.ecs.nasa.gov"
if "FTPHOST" in s:
    host = s[9:]  # skip the 9 characters of "FTPHOST: "

Cheers,

Drea
 
Mark Tolonen

In regular expressions, you match the entire string you are interested in,
and parenthesize the parts that you want to parse out of that string. The
group() method is used to get the whole string with group(0), and each of
the parenthesized parts with group(n). An example:

>>> import re
>>> s = 'FTPHOST: e4ftl01u.ecs.nasa.gov'
>>> re.search(r'FTPHOST: (.*)',s).group(1)
'e4ftl01u.ecs.nasa.gov'

-Mark
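
To illustrate the group numbering Mark describes, here is a small sketch in Python 3 syntax with two parenthesized parts. The pattern `(\w+): (.*)` is an illustrative assumption, not something posted in the thread.

```python
import re

line = "FTPHOST: e4ftl01u.ecs.nasa.gov"
# Two parenthesized parts: the field name and its value.
match = re.search(r"(\w+): (.*)", line)

whole = match.group(0)   # the entire matched text
name = match.group(1)    # first parenthesized part
value = match.group(2)   # second parenthesized part
print(whole, name, value, sep=" | ")
```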
 
Mart.

> m = re.search(r"FTPHOST: (.*)", s)
> print m.group(1)

So the .* means to match everything after the regex? That doesn't help
in this case, as the string is placed amongst others. For example:

MEDIATYPE: FtpPull\r\n', 'MEDIAFORMAT: FILEFORMAT\r\n', 'FTPHOST:
e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n',
'Ftp Pull Download Links: \r\n',
 
Mart.

> In regular expressions, you match the entire string you are interested in,
> and parenthesize the parts that you want to parse out of that string. The
> group() method is used to get the whole string with group(0), and each of
> the parenthesized parts with group(n). An example:
>
> >>> s = 'FTPHOST: e4ftl01u.ecs.nasa.gov'
> >>> re.search(r'FTPHOST: (.*)',s).group(1)
> 'e4ftl01u.ecs.nasa.gov'
>
> -Mark

I see what you mean regarding the groups. Because my string is nested
in amongst others e.g.

MEDIATYPE: FtpPull\r\n', 'MEDIAFORMAT: FILEFORMAT\r\n', 'FTPHOST:
e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n',
'Ftp Pull Download Links: \r\n',

I get the information that follows as well. So is the only option to
parse the resulting string again? I am trying to construct something that is
fairly robust, so I am not sure that just printing up to the \r is the best
solution.

Thanks
 
Terry Reedy

Whether or not you need re is an issue to be determined.

Just split the string on ': ' and take the second part.
Or find the position of the space and slice the remainder.

> so the .* means to match everything after the regex? That doesn't help
> in this case

It helps in the case you presented.

> as the string is placed amongst others for example.
> MEDIATYPE: FtpPull\r\n', 'MEDIAFORMAT: FILEFORMAT\r\n', 'FTPHOST:
> e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n',
> 'Ftp Pull Download Links: \r\n',

What you show above is a tuple of strings. Scan the members looking for
s.startswith('FTPHOST:') and apply previous answer.
Or, if the above is actually meant to be one string (with quotes omitted),
split on ',' and apply the previous answer.

tjr
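
A sketch of the scan-and-split approach suggested above, in Python 3 syntax, assuming the data really is a sequence of line strings. The helper `find_field` is a hypothetical name introduced for illustration.

```python
lines = [
    "MEDIATYPE: FtpPull\r\n",
    "MEDIAFORMAT: FILEFORMAT\r\n",
    "FTPHOST: e4ftl01u.ecs.nasa.gov\r\n",
    "FTPDIR: /PullDir/0301872638CySfQB\r\n",
    "Ftp Pull Download Links: \r\n",
]

def find_field(lines, name):
    """Return the value of the first 'name: value' line, or None."""
    prefix = name + ":"
    for line in lines:
        if line.startswith(prefix):
            # Split only once so a value that itself contains ':' stays
            # intact, then strip the trailing '\r\n' and any spaces.
            return line.split(":", 1)[1].strip()
    return None

ftphost = find_field(lines, "FTPHOST")
ftpdir = find_field(lines, "FTPDIR")
print(ftphost, ftpdir)
```

Because the value is taken from the split rather than from a fixed slice, it keeps working if the host name gets longer.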
 
Mart.

Sorry, perhaps I didn't make it clear enough, so apologies. I only
presented the example s = "FTPHOST: e4ftl01u.ecs.nasa.gov" as I
thought this easily encompassed the problem. The solution presented
works fine for this, i.e. re.search(r'FTPHOST: (.*)',s).group(1). But
when I used this on the actual file I am trying to parse I realised it
is slightly more complicated, as this also pulls out other information.
For example, it prints

e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r\n',
'Ftp Pull Download Links: \r\n', 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/
0301872638CySfQB\r\n', 'Down load ZIP file of packaged order:\r\n',

etc. So I need to find a way to stop it before the \r.

Slicing the string wouldn't work in this scenario, as I can envisage a
situation where the string length increases and I would prefer not to
keep having to change the string.

Many thanks
 
Andreas Tawn

If, as Terry suggested, you do have a tuple of strings and the first element has FTPHOST, then s[0].split(":")[1].strip() will work.
 
nn

It is not clear from your post what the input is really like. But just
guessing this might work:
'MEDIATYPE: FtpPull\r\n', 'MEDIAFORMAT: FILEFORMAT\r\n','FTPHOST:
e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r
\n','Ftp Pull Download Links: \r\n'
'e4ftl01u.ecs.nasa.gov'
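
The commands nn typed did not survive in this copy of the thread. The sketch below (Python 3 syntax) is a guess that is consistent with the pattern Mart. uses later, `FTPHOST: (.*?)\\r`. It assumes the string holds the two characters backslash and 'r' rather than a real carriage return, which is what `str(f.readlines())` produces.

```python
import re

# Build the same kind of string as str(f.readlines()): the repr of a list,
# in which each carriage return appears as the two characters '\' and 'r'.
lines = [
    "MEDIATYPE: FtpPull\r\n",
    "FTPHOST: e4ftl01u.ecs.nasa.gov\r\n",
    "FTPDIR: /PullDir/0301872638CySfQB\r\n",
]
s = str(lines)

# '.*?' is non-greedy: it stops at the first literal backslash followed by 'r'.
# In a raw string, '\\r' is the regex for exactly those two characters.
lazy = re.search(r"FTPHOST: (.*?)\\r", s).group(1)

# A greedy '.*' runs on to the last backslash-r in the string instead.
greedy = re.search(r"FTPHOST: (.*)\\r", s).group(1)

print(lazy)
print(greedy)
```

The greedy version reproduces the symptom reported earlier in the thread, where the match kept going past the host name into the following fields.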
 
Mart.

> If, as Terry suggested, you do have a tuple of strings and the first element has FTPHOST, then s[0].split(":")[1].strip() will work.

It is an email which contains information before and after the main
section I am interested in, namely...

FINISHED: 09/07/2009 08:42:31

MEDIATYPE: FtpPull
MEDIAFORMAT: FILEFORMAT
FTPHOST: e4ftl01u.ecs.nasa.gov
FTPDIR: /PullDir/0301872638CySfQB
Ftp Pull Download Links:
ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB
Down load ZIP file of packaged order:
ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB.zip
FTPEXPR: 09/12/2009 08:42:31
MEDIA 1 of 1
MEDIAID:

I have been doing this to turn the email into a string

email = sys.argv[1]
f = open(email, 'r')
s = str(f.readlines())

So FTPHOST isn't the first element; it is just part of a larger
string. When I turn the email into a string it looks like...

'FINISHED: 09/07/2009 08:42:31\r\n', '\r\n', 'MEDIATYPE: FtpPull\r\n',
'MEDIAFORMAT: FILEFORMAT\r\n', 'FTPHOST: e4ftl01u.ecs.nasa.gov\r\n',
'FTPDIR: /PullDir/0301872638CySfQB\r\n', 'Ftp Pull Download Links: \r\n',
'ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB\r\n',
'Down load ZIP file of packaged order:\r\n',

So I am not sure that splitting it as you suggested works in this case.

Thanks
 
Mart.

> It is not clear from your post what the input is really like. But just
> guessing this might work:
>
> 'MEDIATYPE: FtpPull\r\n', 'MEDIAFORMAT: FILEFORMAT\r\n','FTPHOST:
> e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r
> \n','Ftp Pull Download Links: \r\n'
>
> 'e4ftl01u.ecs.nasa.gov'

Hi,

That does work. So the \ escapes the \r; does this tell it to stop
when it reaches the "\r"?

Thanks
 
pdpi

> It is not clear from your post what the input is really like. But just
> guessing this might work:
>
> 'MEDIATYPE: FtpPull\r\n', 'MEDIAFORMAT: FILEFORMAT\r\n','FTPHOST:
> e4ftl01u.ecs.nasa.gov\r\n', 'FTPDIR: /PullDir/0301872638CySfQB\r
> \n','Ftp Pull Download Links: \r\n'
>
> 'e4ftl01u.ecs.nasa.gov'

Except, I'm assuming, the OP's getting the data from a (Windows-formatted)
file, so \r\n shouldn't be escaped in the regex.
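
The code that followed this remark is not preserved here. As a sketch of the point being made (Python 3 syntax, with an inline string standing in for the file contents), a pattern for text containing real carriage-return and line-feed characters might look like this:

```python
import re

# Real carriage-return/line-feed characters, as in a Windows-formatted file
# read without newline translation.
s = (
    "MEDIAFORMAT: FILEFORMAT\r\n"
    "FTPHOST: e4ftl01u.ecs.nasa.gov\r\n"
    "FTPDIR: /PullDir/0301872638CySfQB\r\n"
)

# In a regex, '\r' (single backslash) matches an actual carriage return.
ftphost = re.search(r"FTPHOST: (.*?)\r\n", s).group(1)

# Alternative that also copes with bare '\n' line endings.
ftpdir = re.search(r"FTPDIR: ([^\r\n]*)", s).group(1)

print(ftphost, ftpdir)
```

The character class `[^\r\n]*` is an assumption about what counts as "end of value"; it simply stops at the first line-ending character of either kind.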
 
MRAB

> I have been doing this to turn the email into a string
>
> email = sys.argv[1]
> f = open(email, 'r')
> s = str(f.readlines())

To me that seems a strange thing to do. You could just read the entire
file as a string:

f = open(email, 'r')
s = f.read()
 
Mart.

Within the file is a list of files, e.g.

TOTAL FILES: 2
FILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf
FILESIZE: 11028908

FILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml
FILESIZE: 18975

and what I want to do is get the ftp address from the file and collect
these files to pull down from the web, e.g.

MOD13A2.A2007033.h17v08.005.2007101023605.hdf
MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml

Thus far I have

#!/usr/bin/env python

import sys
import re
import urllib

email = sys.argv[1]
f = open(email, 'r')
s = str(f.readlines())
m = re.findall(r"MOD....\.........\.h..v..\.005\..............\....\....", s)

ftphost = re.search(r'FTPHOST: (.*?)\\r',s).group(1)
ftpdir = re.search(r'FTPDIR: (.*?)\\r',s).group(1)
url = 'ftp://' + ftphost + ftpdir

for i in xrange(len(m)):
    print i, ':', len(m)
    file1 = m[i][:-4] # remove xml bit.
    file2 = m[i]

    urllib.urlretrieve(url, file1)
    urllib.urlretrieve(url, file2)

which works. Clearly my match for the MOD13A2* files isn't ideal, I
guess, but they will always occupy those dimensions, so it should
work. Any suggestions on how to improve this are appreciated.

Thanks.
 
Dave Angel

Mart. said:
<snip>
> I have been doing this to turn the email into a string
>
> email = sys.argv[1]
> f = open(email, 'r')
> s = str(f.readlines())
>
> so FTPHOST isn't the first element, it is just part of a larger
> string. When I turn the email into a string it looks like...
>
> 'FINISHED: 09/07/2009 08:42:31\r\n', '\r\n', 'MEDIATYPE: FtpPull\r\n',
> 'MEDIAFORMAT: FILEFORMAT\r\n', 'FTPHOST: e4ftl01u.ecs.nasa.gov\r\n',
> 'FTPDIR: /PullDir/0301872638CySfQB\r\n', 'Ftp Pull Download Links: \r\n',
> 'ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB\r\n',
> 'Down load ZIP file of packaged order:\r\n',
<snip>

The mistake I see is trying to turn a list into a string, just so you
can try to parse it back again. Just write a loop that iterates through
the list that readlines() returns.

DaveA
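
One way the loop Dave describes could look, as a sketch in Python 3 syntax. The function name `parse_fields` and the idea of collecting everything into a dict are assumptions for illustration; the list literal stands in for what `f.readlines()` would return.

```python
def parse_fields(lines):
    """Collect 'KEY: value' lines into a dict, skipping everything else."""
    fields = {}
    for line in lines:
        key, sep, value = line.partition(": ")
        # Skip lines with no ': ' separator and lines with an empty value.
        if sep and value.strip():
            fields[key.strip()] = value.strip()
    return fields

# Stand-in for the list that f.readlines() would return.
lines = [
    "FINISHED: 09/07/2009 08:42:31\r\n",
    "\r\n",
    "MEDIATYPE: FtpPull\r\n",
    "FTPHOST: e4ftl01u.ecs.nasa.gov\r\n",
    "FTPDIR: /PullDir/0301872638CySfQB\r\n",
    "Ftp Pull Download Links: \r\n",
    "ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB\r\n",
]
fields = parse_fields(lines)
url = "ftp://" + fields["FTPHOST"] + fields["FTPDIR"]
print(url)
```

`str.partition` splits at the first ': ' only, so the timestamp in the FINISHED line keeps its colons.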
 
MRAB

Suppose the file contains your example text above. Using 'readlines'
returns a list of the lines:
['TOTAL FILES: 2\n', '\t\tFILENAME:
MOD13A2.A2007033.h17v08.005.2007101023605.hdf\n', '\t\tFILESIZE:
11028908\n', '\n', '\t\tFILENAME:
MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml\n', '\t\tFILESIZE:
18975\n']

Using 'str' on that list then converts it to a string _representation_
of that list:

"['TOTAL FILES: 2\\n', '\\t\\tFILENAME:
MOD13A2.A2007033.h17v08.005.2007101023605.hdf\\n', '\\t\\tFILESIZE:
11028908\\n', '\\n', '\\t\\tFILENAME:
MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml\\n', '\\t\\tFILESIZE:
18975\\n']"

That just makes parsing a lot more difficult.

It's much easier to just read the entire file as a single string and
then parse that:

'TOTAL FILES: 2\n\t\tFILENAME:
MOD13A2.A2007033.h17v08.005.2007101023605.hdf\n\t\tFILESIZE:
11028908\n\n\t\tFILENAME:
MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml\n\t\tFILESIZE: 18975\n'

['MOD13A2.A2007033.h17v08.005.2007101023605.hdf',
'MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml']
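
The command that produced the list of filenames is missing from this copy. A sketch that gives the same result (Python 3 syntax) is shown below; the pattern `FILENAME: (\S+)` is an assumption, and the base URL is built from the FTPHOST and FTPDIR values quoted in the thread, on the further assumption that the files sit directly under that directory.

```python
import re

# Stand-in for s = f.read(); the real text would come from the email file.
s = (
    "TOTAL FILES: 2\n"
    "\t\tFILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf\n"
    "\t\tFILESIZE: 11028908\n"
    "\n"
    "\t\tFILENAME: MOD13A2.A2007033.h17v08.005.2007101023605.hdf.xml\n"
    "\t\tFILESIZE: 18975\n"
)

# '\S+' takes the run of non-whitespace after 'FILENAME: ', so it stops
# before '\n' (and before '\r' in a Windows-formatted file).
filenames = re.findall(r"FILENAME: (\S+)", s)

# Assumed layout: each file lives directly under ftp://FTPHOST/FTPDIR/.
base = "ftp://e4ftl01u.ecs.nasa.gov/PullDir/0301872638CySfQB"
urls = [base + "/" + name for name in filenames]
for u in urls:
    print(u)
```

Matching on the FILENAME label avoids hard-coding the width of each part of the MOD13A2 file name.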
 
nn

> Except, I'm assuming, the OP's getting the data from a (windows-formatted)
> file, so \r\n shouldn't be escaped in the regex:

I am just playing the guessing game like everybody else here. Since
the OP didn't use re.DOTALL and was getting more than one line for .*
I assumed that the \n was quite literally '\' and 'n'.
 
nn

> That does work. So the \ escapes the \r, does this tell it to stop
> when it reaches the "\r"?

Indeed.
 
