compare two voices

Discussion in 'Python' started by Jeremy Bowers, May 1, 2005.

  1. On Sat, 30 Apr 2005 20:00:57 -0700, Qiangning Hong wrote:

    > I want to make an app to help students study foreign language. I want the
    > following function in it:
    >
    > The student reads a piece of text to the microphone. The software records
    > it and compares it to the wave-file pre-recorded by the teacher, and gives
    > out a score to indicate the similarity between them.
    >
    > This function will help the students pronounce properly, I think.


    Do you have any idea what it takes to compare two voices in a
    *meaningful* fashion? This is a serious question. I can't guarantee
    there is no app to help with this, but if it does exist, it either costs a
    lot of money, or will be almost impossible to use for what you want
    (boiling two voice samples down to a speaker-independent single similarity
    number... the mind boggles at the possible number of ways of defining that).
    Quite possibly both.

    If you *do* know something about the math, which, by the way, is graduate
    level+, then you'd do better to go look at the open source voice
    recognition systems and ask on those mailing lists.

    No matter how you slice it, this is not a Python problem, this is an
    intense voice recognition algorithm problem that would make a good PhD
    thesis. I have no idea if it has already been done and you will likely get
    much better help from such a community where people might know that. I am
    aware of the CMU Sphinx project, which should get you started Googling.
    Good luck; it's a great idea, but if somebody somewhere hasn't already
    done it, it's an extremely tough one.

    (Theoretically, it's probably not a horrid problem, but my intuition leads
    me to believe that turning it into a *useful product*, that corresponds to
    what humans would say is "similar", will probably be a practical
    nightmare. Plus it'll be highly language dependent; a similarity algorithm
    for Chinese probably won't work very well for English and vice versa. All
    this, and you *could* just play the two sounds back to the human and let
    their brain try to understand it... ;-) )

    Waiting for the message pointing to the Sourceforge project that
    implemented this three years ago...
    Jeremy Bowers, May 1, 2005
    #1

  2. I want to make an app to help students study foreign language. I want
    the following function in it:

    The student reads a piece of text to the microphone. The software
    records it and compares it to the wave-file pre-recorded by the
    teacher, and gives out a score to indicate the similarity between them.

    This function will help the students pronounce properly, I think.

    Is there an existing library (C or Python) to do this? Or if someone
    can guide me to a ready-to-implement algorithm?
    Qiangning Hong, May 1, 2005
    #2

  3. Jeremy Bowers wrote:
    > No matter how you slice it, this is not a Python problem, this is an
    > intense voice recognition algorithm problem that would make a good
    > PhD thesis.


    No, my goal is not related to voice recognition. Sorry that I
    haven't described my question clearly. We are not teaching English, so
    voice recognition isn't helpful here.

    I just want to compare two WAVE sound files, not recognize what the
    students or the teacher are actually saying. For example, if the
    teacher recorded his "standard" pronunciation of "god", then a student
    saying "good" will get a higher score than a student saying "evil"
    ---- because "good" sounds more like "god".

    Yes, this is not a Python problem, but I am a fan of Python and am
    using Python to develop the other parts of the application (UI, sound
    playback and recording, grammar training, etc), so I ask here for any
    available Python module, and of course, for any kind suggestions
    unrelated to Python itself (like yours) too.

    I myself have tried Python's standard audioop module, using the
    findfactor and rms functions. I tried to use the value returned from
    rms(add(a, mul(b, -findfactor(a, b)))) as the score, but the result is
    not good. So I want to know if there is a human-voice-optimized
    algorithm/library out there.
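    The residual-RMS idea above can be sketched in plain Python. This is a
    minimal reimplementation of what audioop's findfactor and rms compute,
    assuming the WAVE samples are already decoded to plain numbers
    (audioop itself works on raw sample bytes):

```python
import math

def findfactor(a, b):
    # least-squares gain F minimizing the energy of (a - F * b),
    # i.e. what audioop.findfactor computes on raw sample bytes
    return sum(x * y for x, y in zip(a, b)) / sum(y * y for y in b)

def rms(samples):
    # root-mean-square amplitude of a sample sequence
    return math.sqrt(sum(x * x for x in samples) / len(samples))

def residual_score(a, b):
    # the score from the post: rms of the residual after removing
    # the best overall volume match; lower means "more similar"
    f = findfactor(a, b)
    return rms([x - f * y for x, y in zip(a, b)])
```

    A signal compared against a louder or quieter copy of itself scores
    exactly 0, but even a slight time shift between two otherwise
    identical recordings blows the residual up, which is one reason this
    approach scores real speech so poorly.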
    Qiangning Hong, May 1, 2005
    #3
  4. Andrew Dalke Guest

    > Jeremy Bowers wrote:
    >> No matter how you slice it, this is not a Python problem, this is an
    >> intense voice recognition algorithm problem that would make a good
    >> PhD thesis.


    Qiangning Hong wrote:
    > No, my goal is not related to voice recognition. Sorry that I
    > haven't described my question clearly. We are not teaching English, so
    > voice recognition isn't helpful here.


    To repeat what Jeremy wrote - what you are asking *is* relative
    to voice recognition. You want to recognize that two different voices,
    with different pitches, pauses, etc., said the same thing.

    There is a lot of data in speech; that's why sound files are bigger
    than text files. Some of it gets interpreted as emotional nuance or
    as an accent, while other parts are simply ignored.

    > I just want to compare two WAVE sound files, not recognize what the
    > students or the teacher are actually saying. For example, if the
    > teacher recorded his "standard" pronunciation of "god", then a student
    > saying "good" will get a higher score than a student saying "evil"
    > ---- because "good" sounds more like "god".


    Try this: record the word twice and overlay them. They will be
    different. And that's with the same speaker. Now try it with your
    voice compared with another's. You can hear just how different they
    are. One will be longer, another deeper, or with the "o" sound
    originating in a different part of the mouth.

    At the level you are working on the computer doesn't know which of
    the data can be ignored. It doesn't know how to find the start
    of the word (as when a student says "ummm, good"). It doesn't know
    how to stretch the timings, nor adjust for pitch between, say,
    a man and a woman's voice.

    My ex-girlfriend gave me a computer program for learning Swedish.
    It included a program to do a simpler version of what you are
    asking. It only compared phonemes, so I could practice the vowels.
    Even then its comparison seemed more like a random value than a
    meaningful score.

    Again, as Jeremy said, you want something harder than what
    speech recognition programs do. They at least are trained
    to understand a given speaker, which helps improve the quality
    of the recognition. You don't want that -- that's the
    opposite of what you're trying to do. Speaker-independent
    voice recognition is harder than speaker-dependent.

    You can implement a solution along the lines you were thinking of,
    but as you found, it doesn't work. A workable solution will
    require good speech recognition capability and is still very
    much in the research stage (as far as I know; it's not my
    field).

    If your target language is a major one then there may be some
    commercial language recognition software you can use. You
    could have your reference speaker train the software on the
    vocabulary list and have your students try to have the software
    recognize the correct word.

    If your word list is too short or the recognizer is not tuned well
    enough, then saying something like "thud" will also be
    recognized as being close enough to "good".

    Why don't you just have the students hear both the teacher's voice
    and the student's just-recorded voice, one right after the other?
    That gives feedback. Why does the computer need to judge the
    correctness?

    Andrew
    Andrew Dalke, May 1, 2005
    #4
  5. Kent Johnson Guest

    Qiangning Hong wrote:
    > I want to make an app to help students study foreign language. I want
    > the following function in it:
    >
    > The student reads a piece of text to the microphone. The software
    > records it and compares it to the wave-file pre-recorded by the
    > teacher, and gives out a score to indicate the similarity between them.
    >
    > This function will help the students pronounce properly, I think.
    >
    > Is there an existing library (C or Python) to do this? Or if someone
    > can guide me to a ready-to-implement algorithm?
    >

    I have worked on a commercial product that attempts to do this and I will confirm that it is very
    difficult to create a meaningful score.

    Kent
    Kent Johnson, May 1, 2005
    #5
  6. [Qiangning Hong]

    > I just want to compare two WAVE sound files, not recognize what the
    > students or the teacher are actually saying. For example, if the
    > teacher recorded his "standard" pronunciation of "god", then a student
    > saying "good" will get a higher score than a student saying "evil"
    > ---- because "good" sounds more like "god".


    If I had this problem and was working alone, I would likely create one
    audiogram (I mean a spectral frequency analysis over time) for each
    voice sample. Normally, this is presented with frequency on the
    vertical axis, time on the horizontal axis, and gray value for
    frequency amplitude. There are a few tools available for doing this,
    yet integrating them into another application may require some work.

    Now, because of voice pitch differences and elocution speed, the
    audiograms would look somewhat alike, yet scaled differently in both
    directions. The problem you now have is to recognise that an image is
    "similar" to part of another, so here I would likely do some research
    on various transforms (like Hough's and others of the same kind) that
    might ease normalisation prior to comparison. Image classification
    techniques (they do this a lot in satellite imagery) could help in
    recognising similar textures in audiograms, and so give clues for
    matching images. A few image classification programs have been
    announced here previously; I have not looked at them yet, but who
    knows, they may be helpful.

    Then, if the above work is done correctly and meaningfully, you would
    want to compute correlations between normalised audiograms. The more
    correlated they are, the more similar the original pronunciations
    were.
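    A rough sketch of that pipeline, assuming NumPy, and skipping the hard
    normalisation step entirely (frames are simply truncated to a common
    count, so time-scaling differences are not handled):

```python
import numpy as np

def spectrogram(signal, frame=256, hop=128):
    # magnitude spectrogram: |FFT| of overlapping Hann-windowed frames
    # (frequency on one axis, time on the other, as described above)
    window = np.hanning(frame)
    frames = [signal[i:i + frame] * window
              for i in range(0, len(signal) - frame, hop)]
    return np.abs(np.fft.rfft(frames, axis=1))

def correlation_score(a, b):
    # normalised cross-correlation of the two spectrograms, in [-1, 1];
    # a real system would first warp the time axis and normalise pitch
    sa, sb = spectrogram(a), spectrogram(b)
    n = min(len(sa), len(sb))
    sa = sa[:n].ravel() - sa[:n].mean()
    sb = sb[:n].ravel() - sb[:n].mean()
    return float(sa @ sb / (np.linalg.norm(sa) * np.linalg.norm(sb) + 1e-12))
```

    Identical recordings score 1.0; unrelated signals score near 0. The
    missing normalisation (dynamic time warping, pitch adjustment) is
    exactly where the hard research problem lives.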

    Now, if I had this problem and could call friends, I would surely phone
    one or two of them, who work at companies offering voice recognition
    devices or services. They will be likely reluctant at sharing advanced
    algorithms, as these give them industrial advantage over competitors.

    > I try to use the value returned from rms(add(a, mul(b, -findfactor(a,
    > b)))) as the score. But the result is not good.


    Oh, absolutely no chance that such a simple thing would ever work. :)

    --
    François Pinard http://pinard.progiciels-bpi.ca
    François Pinard, May 1, 2005
    #6
  7. "Qiangning Hong" <> wrote in message
    news:...
    > I want to make an app to help students study foreign language. I want
    > the following function in it:
    >
    > The student reads a piece of text to the microphone. The software
    > records it and compares it to the wave-file pre-recorded by the
    > teacher, and gives out a score to indicate the similarity between them.
    >
    > This function will help the students pronounce properly, I think.
    >
    > Is there an existing library (C or Python) to do this? Or if someone
    > can guide me to a ready-to-implement algorithm?
    >


    How about another approach?

    All modern speech recognition systems employ a phonetic alphabet. It's how
    you describe to the speech recognition engine exactly how the word sounds.

    For each sentence read, you create a small recognition context that includes
    the sentence itself, AND subtle variations of the sentence phonetically.

    For example (using English):

    You want them to say correctly: "The weather is good today".

    You create a context with the phrases below, which include the original
    sentence and then alternative sentences that dither (vary) the original
    sentence phonetically. Sample context:

    (*) The weather is good today
    Da wedder is god tuday
    The weether is good towday

    Etc.

    Then submit the context to the speech recognition engine and ask the user to
    say the sentences. If the original sentence (*) comes back as the speech
    recognition engine's best choice, then they said it right. If one of the
    other choices comes back, then they made a mistake.

    You could even "grade" their performance by tagging the variations by
    closeness to the original, for example:

    (*) The weather is good today (100)
    Da wedder is god tuday (80)
    Ta wegger es gid towday (50)

    In the example above, the original sentence gets a 100, the second choice
    which is close gets an 80, and the last option which is pretty bad gets 50.
    With a little effort you could automatically create the "dithered" phonetic
    variations and auto-calculate the score or closeness to original too.
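    One way to auto-calculate that closeness, assuming the variants are
    written out as plain text, is ordinary edit distance. (The recognition
    trick itself still needs a real speech engine; this only scores the
    textual variants, and edit_distance/closeness are hypothetical helper
    names, not part of any SDK.)

```python
def edit_distance(a, b):
    # classic dynamic-programming Levenshtein distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def closeness(original, variant):
    # 100 for an exact match, falling toward 0 as edits pile up
    d = edit_distance(original, variant)
    return max(0, round(100 * (1 - d / max(len(original), len(variant), 1))))
```

    Running closeness against the dithered variants above yields a graded
    score list much like the hand-tagged (100)/(80)/(50) example.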

    Thanks,
    Robert
    http://www.robodance.com
    Robosapien Dance Machine - SourceForge project
    Robert Oschler, May 1, 2005
    #7
  8. M.E.Farmer Guest

    Qiangning Hong wrote:
    > I want to make an app to help students study foreign language. I want
    > the following function in it:
    >
    > The student reads a piece of text to the microphone. The software
    > records it and compares it to the wave-file pre-recorded by the
    > teacher, and gives out a score to indicate the similarity between them.
    >
    > This function will help the students pronounce properly, I think.
    >
    > Is there an existing library (C or Python) to do this? Or if someone
    > can guide me to a ready-to-implement algorithm?


    As others have noted this is a difficult problem.
    This library was developed to study speech and should be worth a look:
    http://www.speech.kth.se/snack/
    """
    Using Snack you can create powerful multi-platform audio applications
    with just a few lines of code. Snack has commands for basic sound
    handling, such as playback, recording, file and socket I/O. Snack also
    provides primitives for sound visualization, e.g. waveforms and
    spectrograms. It was developed mainly to handle digital recordings of
    speech, but is just as useful for general audio. Snack has also
    successfully been applied to other one-dimensional signals.
    """
    Be sure to check out the examples.

    Might be worth a look:
    http://www.speech.kth.se/projects/speech_projects.html
    http://www.speech.kth.se/cost250/

    You might also have luck on Windows using the Microsoft Speech SDK (it
    is huge). Combined with Python scripting you can go far.

    hth,
    M.E.Farmer
    M.E.Farmer, May 1, 2005
    #8
