Conceptual Speech Commander Pre-Release 2 is now available to the public as a download. This version handles audio input and responds with the conceptual speech recognition analysis of its content. It also includes our Conceptual Language Understanding Engine (CLUE) for performing conceptual analysis of text.

MULTI-PHONEME STREAMER AND KNOWLEDGE REPRESENTATION SPEECH RECOGNITION SYSTEM AND METHOD PATENT DOCUMENTATION

PDF file (134 pages - 643 KB)

Inventor: Philippe Roy

Assignee: Conceptual Speech LLC

Filed: June 30, 2003

ABSTRACT

A system and method related to a new approach to speech recognition that reacts to concepts conveyed through speech. In its fullest implementation, the system and method shifts the balance of power in speech recognition from straight sound recognition and statistical models to a more powerful and complete approach determining and addressing conveyed concepts. This is done by using a probabilistically unbiased multi-phoneme recognition process, followed by a phoneme stream analysis process that builds the list of candidate words derived from recognized phonemes, followed by a permutation analysis process that produces sequences of candidate words with high potential of being syntactically valid, and finally, by processing targeted syntactic sequences in a conceptual analysis process to generate the utterance's conceptual representation that can be used to produce an adequate response. The invention can be employed for a myriad of applications, such as improving accuracy or automatically generating punctuation for transcription and dictation, word or concept spotting in audio streams, concept spotting in electronic text, customer support, call routing and other command/response scenarios.

FIELD OF THE INVENTION

The present invention relates generally to speech processing. More specifically, the invention relates to processing speech that is uttered by humans and interpreted by machines, where speech content is restricted only by the concepts conveyed rather than by syntax-related constraints.

BACKGROUND OF THE INVENTION

Speech recognition is defined as the process allowing humans to interact with machines by using speech. Scientists have worked for years to develop the capability for machines to understand human speech. The applications of this capability are obvious. People can interface with machines through speech, as opposed to the cryptic command inputs that are the norm with today’s personal computers, telephony devices, embedded devices and other programmable machinery. For example, a person who wants to access information by telephone may need to listen to multiple prompts and navigate through a complex phone system, pressing keys on a keypad or matching predefined keywords, before the desired information is retrieved. This time-consuming process frustrates and sometimes even discourages the user, and it increases the cost for the information provider.

The most common approach to speech recognition relates to sound analysis of a digitized audio sample, and the matching of that sound sample to stored acoustic profiles representative of pre-defined words or utterances. Techniques for such matching include the Hidden Markov Model (HMM) and Backus-Naur Form (BNF) techniques, both well known in the art. Typically, current techniques analyze audio streams and identify a single most probable phoneme per time-slice, while introducing a probabilistic bias for the following time-slice, where again a single most probable phoneme is recognized. A successful "match" of an audio sample to an acoustic profile results in the execution of a predefined operation. Such techniques typically force users to adapt their behavior by limiting their vocabulary, forcing them to learn commands that are recognized by the system, or having them react to prompts that take significant time before the information of interest to them is communicated.
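The single-phoneme behavior described above can be illustrated with a minimal sketch. It is not an HMM implementation; it is a simplified greedy decoder, and the function name `greedy_decode`, the scores and the bias table are all hypothetical values chosen for illustration. It shows only how keeping one phoneme per time-slice, with a bias carried into the next slice, discards every alternative reading.

```python
def greedy_decode(slices, transition_bias):
    """Keep exactly one phoneme per time-slice.

    `slices` is a list of {phoneme: score} dicts (hypothetical scores).
    `transition_bias` maps (previous_phoneme, phoneme) to a multiplier,
    standing in for the probabilistic bias applied to the next slice.
    """
    decoded = []
    previous = None
    for scores in slices:
        best = max(
            scores,
            key=lambda p: scores[p] * transition_bias.get((previous, p), 1.0),
        )
        decoded.append(best)
        previous = best
    return decoded


# Illustrative data only: three time-slices with two plausible phonemes each.
example_slices = [
    {"k": 0.60, "g": 0.40},
    {"ae": 0.50, "eh": 0.45},
    {"t": 0.55, "d": 0.45},
]
example_bias = {("k", "ae"): 1.2, ("g", "eh"): 1.5}

decoded = greedy_decode(example_slices, example_bias)
print(decoded)  # ['k', 'ae', 't']
```

Once "k" wins the first slice, the alternative path "g", "eh", "d" can never be recovered, even though each of its phonemes scored almost as well.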

One of the greatest obstacles to overcome in continuous speech recognition is the ability to recognize words when uttered by persons having different accents and/or voice intonations. For example, many speech recognition applications cannot recognize spoken words that do not match the stored acoustic information due to particular pronunciation of that word by the speaker. Often users of speech recognition programs must "train" their own speech recognition system by reading sentences or other materials to permit the machine to recognize that user’s pronunciation of words. Such an approach cannot be used, however, for the casual user of a speech recognition system, since spending time to train the system would not be acceptable.

Several approaches involve the use of acoustical models of various words to identify words in digitized audio data. For example, U.S. Patent No. 5,033,087 issued to Bahl et al. and titled "Method and Apparatus for the Automatic Determination of Phonological Rules as For a Continuous Speech Recognition System", the disclosure of which is hereby incorporated by reference in a manner consistent with this disclosure, discloses the use of acoustical models of separate words in isolation in a vocabulary. The system also employs phonological rules which model the effects of coarticulation to adequately modify the pronunciations of words based on previous words uttered.

Similarly, U.S. Patent No. 5,799,276 issued to Komissarchik et al. and titled "Knowledge-Based Speech Recognition System and Methods Having Frame Length Computed Based Upon Estimated Pitch Period of Vocalic Intervals", the disclosure of which is hereby incorporated by reference in a manner consistent with this disclosure, discloses an apparatus and method for translating an input speech signal to text. The apparatus segments an input speech signal based on the detection of pitch period and generates a series of hypothetical acoustic feature vectors that characterize the signal in terms of primary acoustic events, detectable vowel sounds and other acoustic features. The apparatus and method employ a largely speaker-independent dictionary based upon the application of phonological and phonetic/acoustic rules to generate acoustic event transcriptions. Word choices are selected by comparing the generated acoustic event transcriptions to the series of hypothesized acoustic feature vectors.

Another approach is disclosed in U.S. Patent No. 5,329,608 issued to Bocchieri et al. and titled "Automatic Speech Recognizer", the disclosure of which is hereby incorporated by reference in a manner consistent with this disclosure. Bocchieri discloses an apparatus and method for generating a string of phonetic transcription strings from data entered into the system and recording that in the system. A model is constructed of sub-words characteristic of spoken data and compared to the stored phonetic transcription strings to recognize the spoken data.

Yet another approach is to select candidate words by slicing a speech section by the unit of a word by spotting and simultaneously matching by the unit of a phoneme, as disclosed in U.S. Patent No. 6,236,964 issued to Tamura et al. and titled "Speech Recognition Apparatus and Method for Matching Inputted Speech and a Word Generated From Stored Reference Phoneme Data", the disclosure of which is hereby incorporated by reference in a manner consistent with this disclosure.

As previously noted, several approaches use Hidden Markov Model techniques to identify a likely sequence of words that could have produced a given speech signal. For example, U.S. Patent No. 5,752,227 issued to Lyberg and titled "Method and Arrangement for Speech to Text Conversion", the disclosure of which is hereby incorporated by reference in a manner consistent with this disclosure, discloses identification of a string of phonemes from a given input speech by the use of Hidden Markov Model techniques. The phonemes are identified and joined together to form words and phrases/sentences, which are checked syntactically.

Typically, in prior art approaches, too much emphasis is put on straight sound recognition instead of recognizing speech as a whole, where syntax is used exclusively to build a concept and the concept itself is used in order to produce an adequate response.

SUMMARY OF THE INVENTION

The system and method of this invention provide a natural language speech recognition process allowing a machine to recognize human speech and conceptually analyze that speech, so that the machine can "understand" it and provide an adequate response. The approach of this invention does not rely on word spotting, context-free grammars or other single-phoneme based techniques to "recognize" digitized audio signals representative of the speech input, and consequently does not probabilistically bias the pattern recognition algorithm applied to compare stored phoneme profiles in each cluster with the audio data. Instead, the approach of this invention is to recognize multiple, sometimes alternative, phonemes in the digitized audio signals; build words through streaming analysis; syntactically validate sequences of words through syntactic analysis; and finally, analyze selected syntactically valid sequences of words through conceptual analysis. The invention may utilize some methods related to artificial intelligence and, more specifically, recurrent neural networks and conceptual dependency to achieve these objectives. By conceptually analyzing the speech input, the machine can "understand" and respond adequately to that input. In addition, the invention is applicable to speakers with different accents and intonations by using clusters.
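As a minimal sketch of what recognizing multiple, sometimes alternative, phonemes could look like, the following keeps every phoneme whose score exceeds a threshold in each time-slice, scoring each slice independently of its neighbors. The function name `recognize_multi_phoneme`, the threshold value and the scores are assumptions made for illustration; the disclosure does not specify them here, and the actual pattern recognition algorithm is not shown.

```python
def recognize_multi_phoneme(slices, min_probability=0.3):
    """Return, per time-slice, the set of all phonemes above the threshold.

    Each slice is scored on its own: nothing learned from one slice
    biases the scores of the next (hypothetical threshold and scores).
    """
    lattice = []
    for scores in slices:
        kept = {p for p, score in scores.items() if score > min_probability}
        lattice.append(kept)
    return lattice


# Illustrative scores only; they need not sum to 1 within a slice.
example_scores = [
    {"k": 0.60, "g": 0.40, "t": 0.05},
    {"ae": 0.50, "eh": 0.45},
    {"t": 0.55, "d": 0.45},
]

lattice = recognize_multi_phoneme(example_scores)
print([sorted(s) for s in lattice])  # [['g', 'k'], ['ae', 'eh'], ['d', 't']]
```

Every alternative survives to the later stages, which decide among them using words, syntax and concepts rather than sound alone.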

More specifically, the invention relates to a multi-phoneme streamer and knowledge representation system and method. By combining novel methods of phoneme recognition based on multi-phoneme streaming, and applying conceptual dependency principles to the most probable recognized syntactically valid sequences of candidate words obtained from the permutation of all recognized phonemes in their respective time-slice of an audio sample, the invention enables humans to communicate with machines through speech with little constraint with regard to the syntax of commands that can be recognized successfully. Although most of the content of this disclosure relates to an English implementation of the invention, this approach can be used for any language.

The invention utilizes clusters as a grouping of all phoneme speech related data in a targeted group of individuals. (Every language is based on a finite set of phonemes, such as about 45 phonemes for English. A cluster is a set of reference phonemes [e.g., 45 phonemes for English] for a particular speaker type, such as a man/woman, adult/child, region, or any combination thereof.) Preferably, the computerized system and method evaluates all probabilities without bias of all phonemes in all clusters for an audio data input through the use of a pattern recognition algorithm. A list of candidate words is then built, while keeping the starting time in the audio input for each of them, using all phonemes identified from a unique cluster in the audio data as exceeding a minimal probability set by the pattern recognition algorithm. Using a dictionary that associates pronunciations to spellings and syntactic parts of speech, a syntactic analysis process builds all syntactically valid sequences of words from all possible permutations of candidate words in the words list while respecting the pronunciation boundaries of words. Only syntactic sequences with a high potential of being correctly formed, for example sentences or other syntactic organizations, are later analyzed conceptually. These sequences preferably encapsulate the entire audio data (i.e., all recognized phonemes), although the invention is operative on any given syntactic organization according to the programming engineer’s specifications. A subset of English syntactic organization rules that are applicable to the invention are discussed in Jurafsky, Daniel and Martin, James H., Speech and language processing, Prentice Hall, New Jersey, 2000, pages 332-353, the disclosures of which are herein incorporated by reference in a manner consistent with this disclosure.
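The candidate-word and permutation steps described above can be sketched as follows. This is an illustration under strong simplifying assumptions, not the disclosed implementation: each time-slice is assumed to hold exactly one phoneme position, the phonemes are assumed to come from a single cluster, the dictionary is a toy one, and a small table of allowed part-of-speech patterns stands in for real syntactic analysis. All names (`Candidate`, `TOY_DICTIONARY`, `TOY_PATTERNS`, and the functions) are hypothetical.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    word: str
    pos: str    # part of speech
    start: int  # index of the first time-slice the word covers
    end: int    # index one past the last time-slice it covers


# Toy dictionary: pronunciation -> list of (spelling, part of speech).
TOY_DICTIONARY = {
    ("ay",): [("I", "PRON"), ("eye", "NOUN")],
    ("ay", "s"): [("ice", "NOUN")],
    ("s", "k", "r", "iy", "m"): [("scream", "VERB")],
    ("k", "r", "iy", "m"): [("cream", "NOUN")],
}

# Stand-in for syntactic analysis: allowed part-of-speech sequences.
TOY_PATTERNS = {("PRON", "VERB"), ("NOUN", "NOUN")}


def candidate_words(lattice, dictionary):
    """List every dictionary word found in the lattice, with its start time."""
    found = []
    for start in range(len(lattice)):
        for pronunciation, entries in dictionary.items():
            end = start + len(pronunciation)
            if end <= len(lattice) and all(
                phoneme in lattice[start + i]
                for i, phoneme in enumerate(pronunciation)
            ):
                found.extend(Candidate(w, pos, start, end) for w, pos in entries)
    return found


def tilings(candidates, total, position=0):
    """Yield word chains covering slices position..total with no gap or overlap."""
    if position == total:
        yield []
        return
    for cand in candidates:
        if cand.start == position:
            for rest in tilings(candidates, total, cand.end):
                yield [cand] + rest


def valid_word_sequences(lattice, dictionary, patterns):
    """Keep only the chains whose part-of-speech sequence is allowed."""
    candidates = candidate_words(lattice, dictionary)
    return [
        " ".join(c.word for c in chain)
        for chain in tilings(candidates, len(lattice))
        if tuple(c.pos for c in chain) in patterns
    ]


# Six time-slices; the fifth holds two alternative phonemes.
example_lattice = [{"ay"}, {"s"}, {"k"}, {"r"}, {"iy", "ih"}, {"m"}]
sequences = valid_word_sequences(example_lattice, TOY_DICTIONARY, TOY_PATTERNS)
print(sequences)  # ['I scream', 'ice cream']
```

Both "I scream" and "ice cream" cover the whole input and pass the toy syntactic check, while "eye scream" is rejected; choosing between the survivors is left to conceptual analysis.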

Conceptual analysis is performed through predicate calculus operations that are driven by Predicate Builder scripts associated with each word and part of speech. Conceptual analysis initially involves searching for an object of knowledge (i.e., what is being talked about: a person, a flight, etc.) in the syntactic hierarchy derived from the syntactic organization. For the English language, for example, this is done by parsing all noun phrases and detecting a resulting valid Predicate structure. Once an object of knowledge is successfully detected, the entire syntactic organization is parsed, and the Predicate structure resulting from conceptual analysis is interpreted in order to produce an adequate answer. If an answer cannot be produced from conceptual analysis of the syntactic organization’s hierarchy, other syntactic organization hierarchies that encapsulate the entire audio data, or any desired portion of it, are analyzed conceptually following the same process until at least one succeeds. The successful conceptual representation may nevertheless contain some kind of inquiry anomaly derived from the syntactic organization’s conceptual analysis; such an anomaly signals that conceptual analysis should continue, so that a conceptual representation with a preferable inquiry anomaly can eventually be built.
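The fallback behavior described above, in which successive syntactic organizations are tried until one yields a conceptual representation, can be sketched as follows. Plain Python functions stand in for Predicate Builder scripts and a dictionary stands in for a Predicate structure; both are simplifying assumptions, as are all names and the sample words. Inquiry anomalies are not modeled.

```python
def build_flight(predicate):
    # Hypothetical builder: "flight" supplies the object of knowledge.
    predicate["object"] = "flight"


def build_status(predicate):
    # Hypothetical builder: "status" supplies what is being asked.
    predicate["inquiry"] = "status"


# Builders keyed by (word, part of speech), mirroring the idea that a
# script is associated with each word and part of speech.
TOY_BUILDERS = {
    ("flight", "NOUN"): build_flight,
    ("status", "NOUN"): build_status,
}


def analyze_conceptually(sequence, builders):
    """Return a predicate dict, or None if no object of knowledge emerges."""
    predicate = {}
    for word, pos in sequence:
        builder = builders.get((word, pos))
        if builder is None:
            return None  # no script for this word: analysis fails
        builder(predicate)
    return predicate if "object" in predicate else None


def first_understood(sequences, builders):
    """Try each syntactic sequence in turn until one succeeds."""
    for sequence in sequences:
        predicate = analyze_conceptually(sequence, builders)
        if predicate is not None:
            return sequence, predicate
    return None


example_sequences = [
    [("fright", "NOUN"), ("status", "NOUN")],  # no builder for "fright"
    [("status", "NOUN")],                      # no object of knowledge
    [("flight", "NOUN"), ("status", "NOUN")],  # succeeds
]

understood = first_understood(example_sequences, TOY_BUILDERS)
print(understood)
```

Here the first two sequences fail, one for lack of a script and one for lack of an object of knowledge, so the third is the one whose predicate would be interpreted to produce a response.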

One advantage of the system and method of the invention is that it does not require the speaker to observe a predefined syntax in order for a command to be recognized successfully. Another advantage is that systems implementing this method do not require a sound input with a high sampling rate in order to be analyzed successfully; for example, telephony systems can function more efficiently with this method than with prior art approaches. This significantly improves the balance of power in speech recognition by inserting a process where the concepts conveyed have some weight in the recognition task, in contrast to prior art approaches where emphasis is put on straight sound recognition.

The system includes an audio input device, an audio input digitizer, a unit for recognizing phonemes by applying pattern recognition algorithms, a phoneme stream analyzer for building a list of probable words based on the probable phonemes by reference to a dictionary structure, a syntactic analyzer for building syntactically valid sequences of words from the list of probable words, a conceptual analyzer for building conceptual representations of syntactically valid sequences, and a post analysis process that builds conceptual representations of adequate responses to the original inquiry.
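The order in which these components hand data to one another can be sketched as a simple chain of stages. The stage names below paraphrase the components listed above; the stub stages do no real work and only record that they ran, so this shows the wiring alone and is not an implementation of any component. All identifiers are hypothetical.

```python
STAGE_ORDER = [
    "digitize_audio",
    "recognize_phonemes",
    "analyze_phoneme_stream",
    "analyze_syntax",
    "analyze_concepts",
    "post_analysis",
]


def make_stub(name):
    """Build a placeholder stage that appends its name to a trace."""
    def stage(trace):
        return trace + [name]
    return stage


def run_pipeline(stages, data):
    """Feed the output of each stage into the next, in order."""
    for stage in stages:
        data = stage(data)
    return data


trace = run_pipeline([make_stub(name) for name in STAGE_ORDER], [])
print(trace)
```

In a real system each stub would be replaced by a function that consumes the previous stage's output, for example a phoneme lattice going into the phoneme stream analyzer and a list of candidate words coming out.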

Some of the techniques are based on the concept of Conceptual Dependency (CD), as first set forth by Schank. Many references are available that explain in depth the approach of Schank, which on a very broad level is to remove syntax from a statement, leaving the concept intact. In that way, statements of differing syntax yet similar concept are equalized. Such references include Schank, Roger C. and Colby, Kenneth M., Computer models of thought and language, W.H. Freeman and Company, San Francisco, 1973, pages 187-247; Riesbeck, Christopher K. and Schank, Roger C., Inside case-based reasoning, Lawrence Erlbaum Associates Publishers, New Jersey, 1989; and Riesbeck, Christopher K. and Schank, Roger C., Inside computer understanding, Lawrence Erlbaum Associates Publishers, New Jersey, 1981. The disclosures of each of these references are incorporated by reference herein in a manner consistent with this disclosure.
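The idea that statements of differing syntax yet similar concept are equalized can be shown with a small sketch. ATRANS is the Conceptual Dependency primitive for a transfer of possession; everything else here is an assumption made for illustration: the `Transfer` frame is a reduced form with no actor slot, and two hand-written regular expressions stand in for real syntactic and conceptual analysis.

```python
import re
from typing import NamedTuple, Optional


class Transfer(NamedTuple):
    primitive: str  # Conceptual Dependency primitive, here always "ATRANS"
    obj: str        # what changes hands
    source: str     # who had it before
    recipient: str  # who has it afterwards


# Two toy surface patterns that express the same underlying concept.
TOY_SURFACE_FORMS = [
    re.compile(r"^(?P<source>\w+) gave (?P<recipient>\w+) an? (?P<obj>\w+)$"),
    re.compile(r"^(?P<recipient>\w+) received an? (?P<obj>\w+) from (?P<source>\w+)$"),
]


def to_concept(sentence: str) -> Optional[Transfer]:
    """Strip the syntax from a sentence, keeping only the transfer concept."""
    for pattern in TOY_SURFACE_FORMS:
        match = pattern.match(sentence)
        if match:
            return Transfer(
                "ATRANS", match["obj"], match["source"], match["recipient"]
            )
    return None


active = to_concept("John gave Mary a book")
passive_like = to_concept("Mary received a book from John")
print(active)
print(active == passive_like)  # True
```

Because both sentences reduce to the same frame, any response logic written against the frame works regardless of how the speaker phrased the statement.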

It is an object of the invention to:

    1. provide a method for speech recognition that builds words and syntactically valid sequences of words from the phonemes contained in a digitized audio data sample.
    2. provide a method that combines artificial intelligence and recurrent neural networks with phoneme recognition and Conceptual Dependency that allows a machine to conceptually "understand" a digitized audio data sample.
    3. provide a method of conceptual speech recognition that allows a machine to formulate an adequate response to a digitized audio data sample based on the machine’s conceptual "understanding" of the input.
    4. provide a method of conceptual speech recognition that is speaker independent.
    5. provide a method of conceptual speech recognition that recognizes not only words but concepts in a digitized audio sample.
    6. provide a method of conceptual speech recognition that recognizes concepts in a digitized audio sample substantially regardless of the speaker’s vocal intonation and/or accent.
    7. provide a system utilizing a method of conceptual speech recognition that can be accessed and used by numerous users without prior training and/or enrollment by those users in the system.
    8. provide a system and method for word spotting in an audio stream.
    9. provide a system and method for concept spotting in an audio stream or electronic text.
    10. provide a system and method for validating punctuation and syntactic relationships in dictation speech recognition.
    11. provide a system and method that can generate punctuation in existing dictation systems so punctuation marks do not have to be read into dictation, allowing the user to speak more naturally.
    12. provide a system and method that can enhance recognition accuracy of existing dictation systems.

These and other aspects of the invention will become clear to those of ordinary skill in the art based on the disclosure contained herein.

REMAINDER OF DISCLOSURE

    Figures

      Brief Description of the Drawings

    Figure 1

    Figure 2

    Figure 3

    Figure 4

    Figure 5

    Figure 6

    Figure 7

    Figure 8

    Figure 9

    Figure 10

    Figure 11

    Figure 12

    Figure 13

    Figure 14

    Figure 15

    Figure 16

    Figure 17

    Figure 18

    Figure 19

    Figure 20

    Figure 21

    Figure 22

    Figure 23

    Detailed Description

    Optimization

    Examples

    Claims

 
(©) 2003-2005 Conceptual Speech Technologies, LLC