Regular expressions

mauriceling · Dec 26, 2011

Hi

I am trying to change "@HWI-ST115:568:B08LLABXX:1:1105:6465:151103 1:N:0:"
to "@HWI-ST115:568:B08LLABXX:1:1105:6465:151103/1".

Can anyone help me with the regular expressions needed?

Thanks in advance.

Maurice

Chris Angelico · Dec 27, 2011

Hi

I am trying to change <one string> to <another string>.

Can anyone help me with the regular expressions needed?

A regular expression defines a string based on rules. Without seeing a
lot more strings, we can't know what possibilities there are for each
part of the string. You probably know your data better than we ever
will, even eyeballing the entire set of strings; just write down, in
order, what the pieces ought to be - for instance, the first token
might be a literal @ sign, followed by three upper-case letters, then
a hyphen, then any number of alphanumerics followed by a colon, etc.
Once you have that, it's fairly straightforward to translate that into
regex syntax.
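
As a rough sketch of that translation, the example description above (a literal @, three upper-case letters, a hyphen, then alphanumerics followed by a colon) might turn into the pattern below. The names PREFIX and has_expected_prefix are made up for illustration, and the real rules depend on what the full data set looks like.

```python
import re

# Hypothetical pattern, built piece by piece from the written description:
#   @             a literal at-sign
#   [A-Z]{3}      exactly three upper-case letters
#   -             a hyphen
#   [A-Za-z0-9]*  any number of alphanumerics
#   :             a colon
PREFIX = re.compile(r'@[A-Z]{3}-[A-Za-z0-9]*:')

def has_expected_prefix(line):
    """Return True if the line starts with the described prefix."""
    return PREFIX.match(line) is not None

print(has_expected_prefix('@HWI-ST115:568:B08LLABXX:1:1105:6465:151103 1:N:0:'))
print(has_expected_prefix('HWI-ST115:568'))
```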

ChrisA

Roy Smith · Dec 27, 2011

Easy-peasy:

import re
input = "@HWI-ST115:568:B08LLABXX:1:1105:6465:151103 1:N:0:"
output = "@HWI-ST115:568:B08LLABXX:1:1105:6465:151103/1"
pattern = re.compile(
    r'@HWI-ST115:568:B08LLABXX:1:1105:6465:151103 1:N:0:')
out = pattern.sub(
    r'@HWI-ST115:568:B08LLABXX:1:1105:6465:151103/1',
    input)
assert out == output

To be honest, I wouldn't do this with a regex. I'm not quite sure what
you're trying to do, but I'm guessing it's something like "Get
everything after the first space in the string; keep just the integer
that's before the first ':' in that and turn the space into a slash".
In that case, I'd do something like:

head, tail = input.split(' ', 1)
number, _ = tail.split(':', 1)
print("%s/%s" % (head, number))
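
Wrapped in a function, the split-based approach might look like the sketch below. The name to_slash_form is hypothetical, and the code assumes every header contains at least one space followed by a colon-separated tail.

```python
def to_slash_form(header):
    """Turn '<head> <n>:<rest>' into '<head>/<n>' without a regex."""
    head, tail = header.split(' ', 1)   # split at the first space only
    number = tail.split(':', 1)[0]      # keep the text before the first ':'
    return "%s/%s" % (head, number)

print(to_slash_form("@HWI-ST115:568:B08LLABXX:1:1105:6465:151103 1:N:0:"))
```

Limiting both splits with a maxsplit of 1 keeps the unpacking safe no matter how many colons follow the number.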

mauriceling · Dec 27, 2011

I've tried

re.sub('@\S\s[1-9]:[A-N]:[0-9]', '@\S\s',
'@HWI-ST115:568:B08LLABXX:1:1105:6465:151103 1:N:0:')

but it does not seem to work.

Jason Friedman · Dec 27, 2011

The OP told me, off list, that my guess was true:

Can we say that your string:
1) Contains 7 colon-delimited fields, followed by
2) whitespace, followed by
3) 3 colon-delimited fields (A, B, C), followed by
4) a colon?
The transformation needed is that the whitespace is replaced by a
slash, the "A" characters are taken as is, and the colons and fields
following the "A" characters are eliminated?

Doubtful that my guess was 100% accurate, but nevertheless:
'@HWI-ST115:568:B08LLABXX:1:1105:6465:151103/1'
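
One way to express the four-part description above as a pattern is sketched below. It is not necessarily the code that produced this result; FIELDS_PATTERN and reformat_header are hypothetical names, and the pattern assumes that no field is empty and that fields contain neither colons nor whitespace.

```python
import re

# 1) seven colon-delimited fields, 2) whitespace,
# 3) three colon-delimited fields (A, B, C), 4) a trailing colon.
FIELDS_PATTERN = re.compile(
    r'^((?:[^:\s]+:){6}[^:\s]+)'   # group 1: the seven leading fields
    r'\s+'                         # the whitespace to be replaced
    r'([^:\s]+)'                   # group 2: field A, kept as is
    r':[^:\s]+:[^:\s]+:$'          # fields B and C plus the final colon
)

def reformat_header(line):
    """Return '<seven fields>/<A>', or the line unchanged if it does not fit."""
    match = FIELDS_PATTERN.match(line)
    if match is None:
        return line
    return match.group(1) + '/' + match.group(2)

print(reformat_header('@HWI-ST115:568:B08LLABXX:1:1105:6465:151103 1:N:0:'))
```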

mauriceling · Dec 27, 2011

Thanks a lot everyone.

Can anyone suggest a good place to learn REs?

ML

Jason Friedman · Dec 27, 2011

Start with the manual:
http://docs.python.org/py3k/library/re.html#module-re

Fredrik Tolf · Dec 27, 2011

I've tried

re.sub('@\S\s[1-9]:[A-N]:[0-9]', '@\S\s',
'@HWI-ST115:568:B08LLABXX:1:1105:6465:151103 1:N:0:')

but it does not seem to work.

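Indeed, for several reasons. First of all, backslash sequences in ordinary
string literals are subject to Python's own string escapes. "\S" and "\s"
happen to pass through unchanged, but a sequence like "\1" does not, so
you should write either "\\S" or r"\S" (the r, for raw, turns off
backslash escapes).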
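
A small sketch of the difference, using made-up variable names and a toy input: Python processes a recognized escape such as "\1" before the re module ever sees the string, whereas a raw string passes the backslash through.

```python
import re

# Python handles recognized escapes such as "\1" before re ever sees them.
plain = "\1"     # one character: the control character with code 1
raw = r"\1"      # two characters: a backslash and the digit 1
print(len(plain), len(raw))

# Doubling the backslash and using a raw string spell the same text.
same = ("\\S" == r"\S")
print(same)

# Only the raw replacement reaches re.sub as a group reference.
broken = re.sub(r'(b)', "<\1>", 'abc')
working = re.sub(r'(b)', r"<\1>", 'abc')
print(repr(broken))
print(repr(working))
```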

Second, when you use only "\S", that matches a single non-space character,
not several; you'll need to quantify them. "\S*" will match zero or more,
"\S+" will match one or more, "\S?" will match zero or one, and there are
a couple of other possibilities as well (see the manual for details). In
this case, you probably want to use "+" for most of those.
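
The quantifiers can be compared side by side with a short sketch; the helper first_match and the sample strings are only illustrations.

```python
import re

def first_match(pattern, text):
    """Return the text matched at the start of the string, or None."""
    found = re.match(pattern, text)
    return found.group() if found else None

print(repr(first_match(r'\S', 'abc def')))    # a single non-space character
print(repr(first_match(r'\S+', 'abc def')))   # one or more: the whole first word
print(repr(first_match(r'\S*', ' abc')))      # zero or more: may match nothing
print(repr(first_match(r'\S?', 'abc def')))   # zero or one
print(repr(first_match(r'\S+', ' abc')))      # one or more fails on a leading space
```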

Third, you're not marking the groups that you want to use in the
replacement. Since you want to retain the entire string before the space,
and the numeric element, you'll want to enclose them in parentheses to
mark them as groups.

Fourth, your replacement string is entirely wacky. You don't use sequences
such as "\S" and "\s" to refer back to groups in the original text, but
numbered references, which refer back to parenthesized groups in the order
they appear in the regex. In accordance with what you seemed to want, you
should probably use "@\1/\2" in your case ("\1" refers back to the first
parenthesized group, which would be the first "\S+" part, and "\2" to the
second group, which should be the "[1-9]+" part; the at-mark and slash
are inserted as they are into the result string).
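
To see what each group captures, a sketch along these lines may help. The variable names are made up; group() and expand() are methods of the match object returned by re.match.

```python
import re

header = '@HWI-ST115:568:B08LLABXX:1:1105:6465:151103 1:N:0:'

# Parentheses mark the two parts that the replacement refers back to.
match = re.match(r'@(\S+)\s([1-9]+)', header)

before_space = match.group(1)   # everything between the '@' and the space
read_number = match.group(2)    # the digits right after the space
print(before_space)
print(read_number)

# expand() fills a template the same way re.sub fills its replacement.
rebuilt = match.expand(r'@\1/\2')
print(rebuilt)
```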

Fifth, you'll probably want to match the last colon as well, so that it is
not retained in the result string.

All in all, you will probably want to use something like this to correct
that regex:

re.sub(r'@(\S+)\s([1-9]+):[A-N]+:[0-9]+:', r'@\1/\2',
'@HWI-ST115:568:B08LLABXX:1:1105:6465:151103 1:N:0:')

Also, you may be interested to know that you can use "\d" instead of
"[0-9]".
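
A sketch of the same substitution using "\d", compiled once so it can be applied to many lines. READ_HEADER and convert_header are hypothetical names. Note one difference from "[1-9]": "\d" also accepts the digit 0, so a number such as 10 matches as well.

```python
import re

# Same structure as the pattern above, with \d in place of the digit classes.
READ_HEADER = re.compile(r'@(\S+)\s(\d+):[A-N]+:\d+:')

def convert_header(line):
    """Replace the ' <n>:<flag>:<m>:' tail with '/<n>'."""
    return READ_HEADER.sub(r'@\1/\2', line)

for line in ['@HWI-ST115:568:B08LLABXX:1:1105:6465:151103 1:N:0:',
             '@HWI-ST115:568:B08LLABXX:1:1105:6465:151103 2:N:0:']:
    print(convert_header(line))
```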

rusi · Dec 28, 2011

For practical 'get-the-hands-dirty' experience look at

python-specific: http://kodos.sourceforge.net/
Online: http://gskinner.com/RegExr/
emacs-specific: re-builder and regex-tool http://bc.tech.coop/blog/071103.html
