compare two voices

Jeremy Bowers

Qiangning Hong said:
I want to make an app to help students study a foreign language. I want
the following function in it:

The student reads a piece of text into the microphone. The software
records it and compares it to a wave file pre-recorded by the teacher,
and gives out a score to indicate the similarity between them.

This function will help the students pronounce properly, I think.

Do you have any idea what it takes to compare two voices in a
*meaningful* fashion? This is a serious question. I can't guarantee
there is no app to help with this, but if it does exist, it either costs a
lot of money, or will be almost impossible to use for what you want
(boiling two voice samples down to a speaker-independent single similarity
number... the mind boggles at the possible number of ways of defining that).
Quite possibly both.

If you *do* know something about the math, which, by the way, is graduate
level+, then you'd do better to go look at the open source voice
recognition systems and ask on those mailing lists.

No matter how you slice it, this is not a Python problem; it is an
intense voice recognition algorithm problem that would make a good PhD
thesis. I have no idea whether it has already been done; you will
likely get much better help from a community where people might know.
I am aware of the CMU Sphinx project, which should get you started
Googling. Good luck; it's a great idea, but if somebody somewhere
hasn't already done it, it's an extremely tough one.

(Theoretically, it's probably not a horrid problem, but my intuition
leads me to believe that turning it into a *useful product*, one that
corresponds to what humans would say is "similar", will probably be a
practical nightmare. Plus, it'll be highly language-dependent; a
similarity algorithm for Chinese probably won't work very well for
English and vice versa. All this, and you *could* just play the two
sounds back to the human and let their brain try to understand it... ;-) )

Waiting for the message pointing to the SourceForge project that
implemented this three years ago...
 
Qiangning Hong

I want to make an app to help students study a foreign language. I want
the following function in it:

The student reads a piece of text into the microphone. The software
records it and compares it to a wave file pre-recorded by the teacher,
and gives out a score to indicate the similarity between them.

This function will help the students pronounce properly, I think.

Is there an existing library (C or Python) to do this? Or can someone
guide me to a ready-to-implement algorithm?
 
Qiangning Hong

Jeremy said:
No matter how you slice it, this is not a Python problem; it is an
intense voice recognition algorithm problem that would make a good
PhD thesis.

No, my goal has nothing to do with voice recognition. Sorry that I
haven't described my question clearly. We are not teaching English, so
voice recognition isn't helpful here.

I just want to compare two sound WAVE files, not recognize what the
students or the teacher are really saying. For example, if the teacher
recorded his "standard" pronunciation of "god", then a student saying
"good" will get a higher score than a student saying "evil" -- because
"good" sounds more like "god".

Yes, this is not a Python problem, but I am a fan of Python and am
using Python to develop the other parts of the application (UI, sound
playback and recording, grammar training, etc.), so I ask here for an
available Python module, and of course for any kind suggestions
unrelated to Python itself (like yours) too.

I myself have tried using Python's standard audioop module, using the
findfactor and rms functions. I tried to use the value returned from
rms(add(a, mul(b, -findfactor(a, b)))) as the score, but the result is
not good. So I want to know if there is a human-voice-optimized
algorithm/library out there.
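
Spelled out with audioop's actual signatures (add, mul and rms take a
sample width, and findfactor expects 16-bit samples), the attempt looks
roughly like this -- a minimal sketch assuming two mono 16-bit WAVE
files:

    import audioop
    import wave

    def score(path_a, path_b, width=2):
        # Read the raw frames of a (assumed mono, 16-bit) WAVE file.
        def frames(path):
            w = wave.open(path, "rb")
            try:
                return w.readframes(w.getnframes())
            finally:
                w.close()
        a, b = frames(path_a), frames(path_b)
        # audioop needs fragments of equal length, so crop to the shorter.
        n = min(len(a), len(b)) // width * width
        a, b = a[:n], b[:n]
        # Scale b to best match a, subtract, and measure what is left.
        factor = audioop.findfactor(a, b)
        residual = audioop.add(a, audioop.mul(b, width, -factor), width)
        return audioop.rms(residual, width)  # 0 would mean a perfect match

A lower number means "more alike", but because this compares the files
sample by sample, any shift in timing or pitch swamps the score --
which is presumably why the result is not good.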
 
Andrew Dalke

Qiangning said:
No, my goal has nothing to do with voice recognition. Sorry that I
haven't described my question clearly. We are not teaching English, so
voice recognition isn't helpful here.

To repeat what Jeremy wrote: what you are asking *is* related to voice
recognition. You want to recognize that two different voices, with
different pitches, pauses, etc., said the same thing.

There is a lot of data in speech. That's why sound files are bigger
than text files. Some of it gets interpreted as emotional nuances, or
as an accent, while other parts are simply ignored.

Qiangning said:
I just want to compare two sound WAVE files, not recognize what the
students or the teacher are really saying. For example, if the teacher
recorded his "standard" pronunciation of "god", then a student saying
"good" will get a higher score than a student saying "evil" -- because
"good" sounds more like "god".

Try this: record the word twice and overlay them. They will be
different. And that's with the same speaker. Now try it with your
voice compared with another's. You can hear just how different they
are. One will be longer, another deeper, or with the "o" sound
originating in a different part of the mouth.
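
One quick way to try that experiment visually (matplotlib and scipy are
my assumption here, and the file names are hypothetical):

    import matplotlib.pyplot as plt
    from scipy.io import wavfile

    # Overlay two recordings of the same word (assumed mono WAVE files).
    for path in ("take1.wav", "take2.wav"):
        rate, samples = wavfile.read(path)
        seconds = [i / float(rate) for i in range(len(samples))]
        plt.plot(seconds, samples, alpha=0.5, label=path)
    plt.xlabel("time (s)")
    plt.legend()
    plt.show()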

At the level you are working on, the computer doesn't know which of
the data can be ignored. It doesn't know how to find the start of the
word (as when a student says "ummm, good"). It doesn't know how to
stretch the timings, nor how to adjust for pitch between, say, a man's
and a woman's voice.

My ex-girlfriend gave me a computer program for learning Swedish. It
included a program that did a simpler version of what you are asking.
It only compared phonemes, so I could practice the vowels. Even then,
its comparison seemed more like a random value than a meaningful one.

Again, as Jeremy said, you want something harder than what
speech recognition programs do. They at least are trained
to understand a given speaker, which helps improve the quality
of the recognition. You don't want that -- that's the
opposite of what you're trying to do. Speaker-independent
voice recognition is harder than speaker-dependent.

You can implement a solution along the lines you were thinking of,
but as you found, it doesn't work. A workable solution will require
good speech recognition capability and is still very much in the
research stage (as far as I know; it's not my field).

If your target language is a major one then there may be some
commercial speech recognition software you can use. You could have
your reference speaker train the software on the vocabulary list and
have your students try to get the software to recognize the correct
word.

If your word list is too short or the recognizer is not tuned well
enough, then saying something like "thud" will also be recognized as
being close enough to "good".

Why don't you just have the students hear both the teacher's voice and
the student's just-recorded voice, one right after the other? That
gives feedback. Why does the computer need to judge the correctness?

Andrew
 
Kent Johnson

Qiangning said:
I want to make an app to help students study a foreign language. I want
the following function in it:

The student reads a piece of text into the microphone. The software
records it and compares it to a wave file pre-recorded by the teacher,
and gives out a score to indicate the similarity between them.

This function will help the students pronounce properly, I think.

Is there an existing library (C or Python) to do this? Or can someone
guide me to a ready-to-implement algorithm?

I have worked on a commercial product that attempts to do this, and I
will confirm that it is very difficult to create a meaningful score.

Kent
 
François Pinard

[Qiangning Hong]
I just want to compare two sound WAVE files, not recognize what the
students or the teacher are really saying. For example, if the teacher
recorded his "standard" pronunciation of "god", then a student saying
"good" will get a higher score than a student saying "evil" -- because
"good" sounds more like "god".

If I had this problem and was alone, I would likely create one
audiogram (I mean a spectral frequency analysis over time) for each
voice sample. Normally, this is presented with frequency on the
vertical axis, time on the horizontal axis, and a grey value for the
amplitude at each frequency. There are a few tools available for doing
this, yet integrating them into another application may require some
work.

Now, because of voice pitch differences and elocution speed, the
audiograms would look somewhat alike, yet scaled differently in both
directions. The problem you now have is to recognise that an image is
"similar" to part of another, so here I would likely do some research
on various transforms (like Hough's and others of the same kind) that
might ease normalisation prior to comparison. Image classification
techniques (they do this a lot in satellite imagery) could help
recognise similar textures in audiograms, and so give clues for
matching images. A few image classification programs have been
announced here previously; I have not looked at them yet, but who
knows, they may be helpful.

Then, if the above work is done correctly and meaningfully, you can
compute correlations between normalised audiograms. The more
correlated they are, the more alike the original pronunciations likely
were.
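
As a very rough sketch of just that last correlation step (numpy and
scipy are my assumption here, and simple cropping stands in for the
normalisation above, which is the actual hard part):

    import numpy as np
    from scipy.io import wavfile
    from scipy.signal import spectrogram

    def audiogram(path):
        rate, samples = wavfile.read(path)
        if samples.ndim > 1:
            samples = samples.mean(axis=1)  # mix stereo down to mono
        _, _, spec = spectrogram(samples, fs=rate, nperseg=512)
        return np.log1p(spec)  # log amplitude, like grey values in an image

    def correlation(path_a, path_b):
        a, b = audiogram(path_a), audiogram(path_b)
        # Crop both images to a common size; a real system would warp
        # time and frequency instead, as discussed above.
        f = min(a.shape[0], b.shape[0])
        t = min(a.shape[1], b.shape[1])
        a, b = a[:f, :t].ravel(), b[:f, :t].ravel()
        return float(np.corrcoef(a, b)[0, 1])  # 1.0 means identical spectra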

Now, if I had this problem and could call friends, I would surely
phone one or two of them who work at companies offering voice
recognition devices or services. They will likely be reluctant to
share advanced algorithms, as these give them an industrial advantage
over competitors.

[Qiangning Hong]
I tried to use the value returned from rms(add(a, mul(b, -findfactor(a,
b)))) as the score, but the result is not good.

Oh, absolutely no chance that such a simple thing would ever work. :)
 
Robert Oschler

Qiangning Hong said:
I want to make an app to help students study a foreign language. I want
the following function in it:

The student reads a piece of text into the microphone. The software
records it and compares it to a wave file pre-recorded by the teacher,
and gives out a score to indicate the similarity between them.

This function will help the students pronounce properly, I think.

Is there an existing library (C or Python) to do this? Or can someone
guide me to a ready-to-implement algorithm?

How about another approach?

All modern speech recognition systems employ a phonetic alphabet. It's
how you describe to the speech recognition engine exactly how a word
sounds.

For each sentence read, you create a small recognition context that
includes the sentence itself AND subtle phonetic variations of it.

For example (using English):

You want them to say correctly: "The weather is good today".

You create a context with the following phrases, which include the
original sentence and then alternative sentences that dither (vary)
the original sentence phonetically. Sample context:

(*) The weather is good today
Da wedder is god tuday
The weether is good towday

Etc.

Then submit the context to the speech recognition engine and ask the
user to say the sentence. If the original sentence (*) comes back as
the speech recognition engine's best choice, then they said it right.
If one of the other choices comes back, then they made a mistake.

You could even "grade" their performance by tagging the variations by
closeness to the original, for example:

(*) The weather is good today (100)
Da wedder is god tuday (80)
Ta wegger es gid towday (50)

In the example above, the original sentence gets a 100, the second
choice, which is close, gets an 80, and the last option, which is
pretty bad, gets a 50. With a little effort you could automatically
create the "dithered" phonetic variations and auto-calculate their
closeness to the original too.
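
In outline, the grading might look like this (the recognize function
is a placeholder, since every engine exposes grammar-restricted
recognition through a different API):

    # The scored context for one sentence, as in the example above.
    SCORED_CONTEXT = [
        ("The weather is good today", 100),  # (*) the correct sentence
        ("Da wedder is god tuday", 80),      # close phonetic variation
        ("Ta wegger es gid towday", 50),     # far phonetic variation
    ]

    def recognize(audio, phrases):
        # Placeholder: ask the engine which phrase best matches the audio.
        raise NotImplementedError("wire this up to your speech engine")

    def grade(audio):
        best = recognize(audio, [phrase for phrase, _ in SCORED_CONTEXT])
        return dict(SCORED_CONTEXT)[best]  # 100 means they said it right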

Thanks,
Robert
http://www.robodance.com
Robosapien Dance Machine - SourceForge project
 
M.E.Farmer

Qiangning said:
I want to make an app to help students study a foreign language. I want
the following function in it:

The student reads a piece of text into the microphone. The software
records it and compares it to a wave file pre-recorded by the teacher,
and gives out a score to indicate the similarity between them.

This function will help the students pronounce properly, I think.

Is there an existing library (C or Python) to do this? Or can someone
guide me to a ready-to-implement algorithm?

As others have noted, this is a difficult problem.
This library was developed to study speech and should be worth a look:
http://www.speech.kth.se/snack/
"""
Using Snack you can create powerful multi-platform audio applications
with just a few lines of code. Snack has commands for basic sound
handling, such as playback, recording, file and socket I/O. Snack also
provides primitives for sound visualization, e.g. waveforms and
spectrograms. It was developed mainly to handle digital recordings of
speech, but is just as useful for general audio. Snack has also
successfully been applied to other one-dimensional signals.
"""
Be sure to check out the examples.
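
A minimal session looks something like this (adapted from the examples
that ship with Snack; the Tkinter spelling and ex1.wav are from that
Python 2 era distribution):

    from Tkinter import *
    import tkSnack

    root = Tk()                    # Snack attaches itself to a Tk instance
    tkSnack.initializeSnack(root)

    snd = tkSnack.Sound()
    snd.read('ex1.wav')            # load a pre-recorded WAVE file
    snd.play(blocking=1)           # synchronous playback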

These might also be worth a look:
http://www.speech.kth.se/projects/speech_projects.html
http://www.speech.kth.se/cost250/

On Windows you might also have luck with the Microsoft Speech SDK (it
is huge). Combined with Python scripting, you can go far.

hth,
M.E.Farmer
 
