Pattern matching problem

B

Bryan

Hi,

I have a large dna sequence (about 200000 bases) in a file, and in
another file with 23 smallish sub (about 100 bases) sequences that I
want to match in the large sequence.

The large sequence file is in fasta format, and I read it in without a
problem.

The subsequence file is a table, which I read in using the Data::Table
module:
my $subseqs= Data::Table::fromTSV("subseq.txt");

Then I loop through the $subseq table and do a pattern match for my
sequence like this:
if ($sequence =~ m/$subseq/g) {
# Match!
}

Here's where the problem starts... 18 out of 23 subsets match. But ALL
match if I do a search for the subsequence in any text editor (like vi)!
I have verified that everything is uppercased, and double checked that
all 23 subsequences are indeed correct in the main sequence. A lot of
testing and debugging shows that if I copy and paste any sequence or
subsequence then matches are okay. So Im thinking there is some hidden
characters messing things up that were missed.

But in vi, I use :set list to show all command characters. Nothing
unusual is there.

I read in the sequence file like this:
my $seq;
open (INFILE, "< $ARGV[0]") or die "Cannot open $ARGV[0] for read\n\n";
my @data = <INFILE>;
close INFILE;


foreach my $line (@data) {
# Strip off newlines
chomp $line;
# do some checks for other lines
$seq .= uc($line);
}
}


Does anyone see anything wrong with this, or my pattern match that may
explain the unexplainable?

Thanks,
Bryan
 
C

ctcgag

Bryan said:
Hi,

I have a large dna sequence (about 200000 bases) in a file, and in
another file with 23 smallish sub (about 100 bases) sequences that I
want to match in the large sequence.

The large sequence file is in fasta format, and I read it in without a
problem.
Maybe!

....
Here's where the problem starts... 18 out of 23 subsets match. But ALL
match if I do a search for the subsequence in any text editor (like vi)!

Do all of the 5 that fail span line boundaries on the original FASTA file?
Do any of the 18 that match span line boundaries? (Well, they would have
to if it were strictly FASTA, i.e. less than 80 char lines, but it may not
be.)

I have verified that everything is uppercased, and double checked that
all 23 subsequences are indeed correct in the main sequence. A lot of
testing and debugging shows that if I copy and paste any sequence or
subsequence then matches are okay. So Im thinking there is some hidden
characters messing things up that were missed.

But in vi, I use :set list to show all command characters. Nothing
unusual is there.

I read in the sequence file like this:
my $seq;
open (INFILE, "< $ARGV[0]") or die "Cannot open $ARGV[0] for
read\n\n"; my @data = <INFILE>;
close INFILE;

foreach my $line (@data) {
# Strip off newlines
chomp $line;
# do some checks for other lines
$seq .= uc($line);
}
}

It could be that your FASTA file has \r\n line endings, and your chomp is
only stripping the \n. vi might autodetect that you are editing a dos-type
file, and not consider the \r to be a control character, and hence not show
it. So you might want to do "$seq =~ s/\r//g" when you are done reading
it.

Or, just read in $seq exactly like you do now, then just print it out to a
file. Inspect that file, rather than the original, with vi or od to see
what you can see.

If that didn't do it, I'd take one of the things that doesn't match, and
keep chopping off a character until it does match. Once it does match,
report what was blocking the match:

while ($failing) {
$seq =~ /$failing(.)(.{0,20})/ or next;
print "Character preventing match is ", ord($1),
" several characters after that are $2\n";
exit;
} continue {
chop $failing;
};
die "Never did match!"

Of course, you have to make sure you didn't chop it down so much that it
started matching in other places--that's why I told it print 20 more char
of context.

Xho
 
P

paul nutteing

Bryan said:
Hi,

I have a large dna sequence (about 200000 bases) in a file, and in
another file with 23 smallish sub (about 100 bases) sequences that I
want to match in the large sequence.

The large sequence file is in fasta format, and I read it in without a
problem.

The subsequence file is a table, which I read in using the Data::Table
module:
my $subseqs= Data::Table::fromTSV("subseq.txt");

Then I loop through the $subseq table and do a pattern match for my
sequence like this:
if ($sequence =~ m/$subseq/g) {
# Match!
}

Here's where the problem starts... 18 out of 23 subsets match. But ALL
match if I do a search for the subsequence in any text editor (like vi)!
I have verified that everything is uppercased, and double checked that
all 23 subsequences are indeed correct in the main sequence. A lot of
testing and debugging shows that if I copy and paste any sequence or
subsequence then matches are okay. So Im thinking there is some hidden
characters messing things up that were missed.

But in vi, I use :set list to show all command characters. Nothing
unusual is there.

I read in the sequence file like this:
my $seq;
open (INFILE, "< $ARGV[0]") or die "Cannot open $ARGV[0] for read\n\n";
my @data = <INFILE>;
close INFILE;


foreach my $line (@data) {
# Strip off newlines
chomp $line;
# do some checks for other lines
$seq .= uc($line);
}
}


Does anyone see anything wrong with this, or my pattern match that may
explain the unexplainable?

Thanks,
Bryan


I have been investigating matches in DNA profile
databases for a year or so.
I have a method for checking a 4 million profile
file for matches written up on
http://www.nutteing2.freeservers.com/dnapr.htm
I am not a programmer so have used macros
etc that are readily available.

What they aren't telling you about DNA profiles
and what Special Branch don't want you to know.
http://www.nutteing2.freeservers.com/dnapr.htm
or nutteingd in a search engine

email (e-mail address removed).....uk (remove 4 of 5 dots)
 
B

BCC

Tad McClellan said:
What do you think the m//g option is doing for you there?

I thought it would match my $subseq pattern in the $sequence, and return
pos() if it is found. Maybe Im not using this correctly? Looks like since
$sequence is only one big line, I dont need the g, and maybe not even the m.
But will I still get pos() back?

Anyways, I tried it without the m, and without the g, it still fails 5 out
of 23 times.

B
 

Ask a Question

Want to reply to this thread or ask your own question?

You'll need to choose a username for the site, which only take a couple of moments. After that, you can post your question and our members will help you out.

Ask a Question

Members online

Forum statistics

Threads
473,767
Messages
2,569,572
Members
45,045
Latest member
DRCM

Latest Threads

Top