Pattern matching problem

Bryan · Jun 12, 2004

Hi,

I have a large dna sequence (about 200000 bases) in a file, and in
another file with 23 smallish sub (about 100 bases) sequences that I
want to match in the large sequence.

The large sequence file is in fasta format, and I read it in without a
problem.

The subsequence file is a table, which I read in using the Data::Table
module:
my $subseqs= Data::Table::fromTSV("subseq.txt");

Then I loop through the $subseq table and do a pattern match for my
sequence like this:
if ($sequence =~ m/$subseq/g) {
# Match!
}

Here's where the problem starts... 18 out of 23 subsets match. But ALL
match if I do a search for the subsequence in any text editor (like vi)!
I have verified that everything is uppercased, and double checked that
all 23 subsequences are indeed correct in the main sequence. A lot of
testing and debugging shows that if I copy and paste any sequence or
subsequence then matches are okay. So Im thinking there is some hidden
characters messing things up that were missed.

But in vi, I use :set list to show all command characters. Nothing
unusual is there.

I read in the sequence file like this:
my $seq;
open (INFILE, "< $ARGV[0]") or die "Cannot open $ARGV[0] for read\n\n";
my @data = <INFILE>;
close INFILE;

foreach my $line (@data) {
# Strip off newlines
chomp $line;
# do some checks for other lines
$seq .= uc($line);
}
}

Does anyone see anything wrong with this, or my pattern match that may
explain the unexplainable?

Thanks,
Bryan

ctcgag · Jun 12, 2004

Bryan said:
Hi,

I have a large dna sequence (about 200000 bases) in a file, and in
another file with 23 smallish sub (about 100 bases) sequences that I
want to match in the large sequence.

The large sequence file is in fasta format, and I read it in without a
problem.
Maybe!

....
Here's where the problem starts... 18 out of 23 subsets match. But ALL
match if I do a search for the subsequence in any text editor (like vi)!

Do all of the 5 that fail span line boundaries on the original FASTA file?
Do any of the 18 that match span line boundaries? (Well, they would have
to if it were strictly FASTA, i.e. less than 80 char lines, but it may not
be.)

I have verified that everything is uppercased, and double checked that
all 23 subsequences are indeed correct in the main sequence. A lot of
testing and debugging shows that if I copy and paste any sequence or
subsequence then matches are okay. So Im thinking there is some hidden
characters messing things up that were missed.

But in vi, I use :set list to show all command characters. Nothing
unusual is there.

I read in the sequence file like this:
my $seq;
open (INFILE, "< $ARGV[0]") or die "Cannot open $ARGV[0] for
read\n\n"; my @data = <INFILE>;
close INFILE;

foreach my $line (@data) {
# Strip off newlines
chomp $line;
# do some checks for other lines
$seq .= uc($line);
}
}

It could be that your FASTA file has \r\n line endings, and your chomp is
only stripping the \n. vi might autodetect that you are editing a dos-type
file, and not consider the \r to be a control character, and hence not show
it. So you might want to do "$seq =~ s/\r//g" when you are done reading
it.

Or, just read in $seq exactly like you do now, then just print it out to a
file. Inspect that file, rather than the original, with vi or od to see
what you can see.

If that didn't do it, I'd take one of the things that doesn't match, and
keep chopping off a character until it does match. Once it does match,
report what was blocking the match:

while ($failing) {
$seq =~ /$failing(.)(.{0,20})/ or next;
print "Character preventing match is ", ord($1),
" several characters after that are $2\n";
exit;
} continue {
chop $failing;
};
die "Never did match!"

Of course, you have to make sure you didn't chop it down so much that it
started matching in other places--that's why I told it print 20 more char
of context.

Xho

Tad McClellan · Jun 12, 2004

Bryan said:
if ($sequence =~ m/$subseq/g) {

What do you think the m//g option is doing for you there?

paul nutteing · Jun 12, 2004

Bryan said:
Hi,

I have a large dna sequence (about 200000 bases) in a file, and in
another file with 23 smallish sub (about 100 bases) sequences that I
want to match in the large sequence.

The large sequence file is in fasta format, and I read it in without a
problem.

The subsequence file is a table, which I read in using the Data::Table
module:
my $subseqs= Data::Table::fromTSV("subseq.txt");

Then I loop through the $subseq table and do a pattern match for my
sequence like this:
if ($sequence =~ m/$subseq/g) {
# Match!
}

Here's where the problem starts... 18 out of 23 subsets match. But ALL
match if I do a search for the subsequence in any text editor (like vi)!
I have verified that everything is uppercased, and double checked that
all 23 subsequences are indeed correct in the main sequence. A lot of
testing and debugging shows that if I copy and paste any sequence or
subsequence then matches are okay. So Im thinking there is some hidden
characters messing things up that were missed.

But in vi, I use :set list to show all command characters. Nothing
unusual is there.

I read in the sequence file like this:
my $seq;
open (INFILE, "< $ARGV[0]") or die "Cannot open $ARGV[0] for read\n\n";
my @data = <INFILE>;
close INFILE;

foreach my $line (@data) {
# Strip off newlines
chomp $line;
# do some checks for other lines
$seq .= uc($line);
}
}

Does anyone see anything wrong with this, or my pattern match that may
explain the unexplainable?

Thanks,
Bryan

I have been investigating matches in DNA profile
databases for a year or so.
I have a method for checking a 4 million profile
file for matches written up on
http://www.nutteing2.freeservers.com/dnapr.htm
I am not a programmer so have used macros
etc that are readily available.

What they aren't telling you about DNA profiles
and what Special Branch don't want you to know.
http://www.nutteing2.freeservers.com/dnapr.htm
or nutteingd in a search engine

email (e-mail address removed).....uk (remove 4 of 5 dots)

paul nutteing · Jun 12, 2004

Sorry, i forgot to change the file name in the URL

http://www.nutteing2.freeservers.com/dnas5.htm

for macros etc

Paul Nutteing · Jun 12, 2004

paul nutteing said:
Sorry, i forgot to change the file name in the URL

http://www.nutteing2.freeservers.com/dnas5.htm

for macros etc

ignore my previous - I misread the OP article.
I tried using Google posting removal tool and
it came up in French version and I gave up.

BCC · Jun 13, 2004

Tad McClellan said:
What do you think the m//g option is doing for you there?

I thought it would match my $subseq pattern in the $sequence, and return
pos() if it is found. Maybe Im not using this correctly? Looks like since
$sequence is only one big line, I dont need the g, and maybe not even the m.
But will I still get pos() back?

Anyways, I tried it without the m, and without the g, it still fails 5 out
of 23 times.

B

Pattern matching	2	May 17, 2009
why the loop-break after pattern-matching	1	Jan 1, 2012
How can I fix my pattern coding error in c++	0	Mar 19, 2023
Problem with KMKfw libraries	1	May 11, 2023
Pattern Matching and Extraction	10	Jan 22, 2007
pattern matching and abstract functions	12	Mar 29, 2011
pattern matching	7	May 23, 2006
Timing problem	4	May 1, 2023

Pattern matching problem

Bryan

ctcgag

Tad McClellan

paul nutteing

paul nutteing

Paul Nutteing

BCC

Ask a Question

Similar Threads

Members online

Forum statistics

Latest Threads