delphi3000.com - the free delphi knowledge platform
Speech Part 2 - How to Add Simple Dictation Speech Recognition to Your Delphi Apps
SAPI Speech Recognition - Dictation
Product: Delphi 5.x (or higher)
Category: Multimedia
Last Update: 08/21/2001
Search Keys: delphi delphi3000 article borland vcl code-snippet SAPI-5.1 Speech recognition dictation grammar SpInprocRecognizer SpSharedRecognizer SpInprocRecoContext SpSharedRecoContext CreateGrammar DictationSetState GetText
Times Scored: 34
Visits: 21123
Uploader: Alec Bergamini
Company: O&A Productions
Reference: O&A Productions

Question/Problem/Abstract:
I'd like to be able to speak into my computer's microphone and have what I say translated into text that is entered into my application. How can I do this?
Answer:



New Speech Part 2 - Speech Recognition (Simple Dictation)

This article shows how you can add simple dictation speech recognition capabilities to an application. First, some technical considerations are discussed, followed by the creation of a small application that allows the user to add words to a TMemo by speaking into a microphone attached to the computer's sound card.

If you did not read Speech Part 1 (article #2581) you should go read it now. It tells you how to get the Microsoft Speech API 5.1 (SAPI) and install it on your system and in Delphi.

SAPI version 5.1 supports two distinct types of speech recognition: dictation, and command and control. It’s important to understand the differences between the two in order to make correct decisions in the design of your speech enabled applications.

Dictation Speech Recognition

Dictation refers to a type of speech recognition where the machine listens to what you say and attempts to translate it into text. This all happens inside the speech engine and you don’t need to worry about it, although a little theory may be helpful. Most modern dictation engines use a scheme where they listen to what’s said and break what they hear down into a series of word hypotheses. Each word hypothesis may actually contain a list of possible words, with each word given some probability of correctness. So, for example, if I say “The quick red fox” the computer will likely break this down into 4 separate word hypotheses. The “fox” hypothesis may contain the possibilities “fax”, “box”, “fix”, etc. These individual word hypotheses are then “put in context”. That is, each word is considered in relation to the words that came before and after. Based on the rules of context, the speech engine comes to a final “best” decision about what was spoken and returns it to the application. In dictation, context is the name of the game. For this reason, dictation engines are considered to be contextual. (My apologies to any ASR scientists reading this for this minimalist explanation.)
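
To make the idea of "word hypotheses put in context" a little more concrete, here is a toy console program. It is only an illustration and has nothing to do with how a real engine is implemented: the candidate words, the acoustic scores and the ContextScore function are all made up for this sketch. Each candidate gets a combined score (how well it matched the sound, times how well it fits after the previous word) and the highest combined score wins.

```pascal
program PickByContext;
{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  TCandidate = record
    Text: string;      // one possible word for this hypothesis
    Acoustic: Double;  // made-up score for how well the sound matched
  end;

const
  // Hypothetical candidates for the last word of "the quick red ___"
  Candidates: array[0..2] of TCandidate = (
    (Text: 'fax'; Acoustic: 0.40),
    (Text: 'fox'; Acoustic: 0.35),
    (Text: 'box'; Acoustic: 0.25));

// Made-up context score: how plausible is Candidate right after Previous?
function ContextScore(const Previous, Candidate: string): Double;
begin
  if (Previous = 'red') and (Candidate = 'fox') then
    Result := 0.60
  else if (Previous = 'red') and (Candidate = 'box') then
    Result := 0.30
  else
    Result := 0.10;
end;

var
  i, Best: Integer;
  Score, BestScore: Double;
begin
  Best := 0;
  BestScore := -1;
  for i := Low(Candidates) to High(Candidates) do
  begin
    // Combine the acoustic match with the context fit
    Score := Candidates[i].Acoustic * ContextScore('red', Candidates[i].Text);
    if Score > BestScore then
    begin
      BestScore := Score;
      Best := i;
    end;
  end;
  WriteLn('Best guess: ', Candidates[Best].Text);
end.
```

Note that "fax" has the best acoustic score on its own, but "fox" wins once the previous word is taken into account. That is the sense in which dictation is contextual.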

As you may imagine, the accuracy of dictation ties directly to the CPU's speed and the system's available memory. The more resources available, the more context can be considered in a reasonable amount of time, and the more likely the resulting recognition will be accurate. The truth is that the basic principles of speech recognition have not changed in over 20 years. What has changed is the power of the PC, and it’s the processing power of modern PCs that makes speech recognition finally usable.

Also important to accurate dictation recognition is the engine having some understanding of the individual speaker’s voice. First, speech engines are specific to a language and possibly even a region. This is why we see English engines and French engines and Chinese engines, etc. Beyond languages, though, there are differences (sometimes extreme) within a language. A 5 year old girl sounds very different to the computer than a 47 year old man. This is why most current dictation engines require voice training.

If you have SAPI 5.1 installed, go to your system’s Control Panel and click the Speech icon. On the speech recognition tab you will find a button called "Train Profile..." that brings up the voice training wizard. If you haven’t already done so, you should take the time to complete at least one session. The more sessions you complete, the more accurate you can expect the dictation recognition to be. By the way, you have access to this wizard from the SDK and you can even provide the text for your own personal training sessions. In fact, taking a fairly long document that you’ve written in your own particular style and using that to train the engine can dramatically improve your own personal dictations.

Command and Control Speech Recognition

While dictation recognition is used primarily for recording what a user says and translating it into text, Command and Control (CnC) speech recognition is used for controlling applications. In the same way that you click your mouse on the browser icon on your desktop to access the internet, you could speak “Computer, run browser”, or even better “Computer, go to www.o2a.com” to accomplish the same thing. Currently you are used to controlling your computer with your mouse and keyboard. CnC recognition adds a third input device, your voice.

CnC speech recognition is fundamentally different from dictation recognition in that it is recognition without regard to context. That is, there are no CPU cycles spent trying to determine if a word is correct by looking at the words that come before or after. For this reason CnC recognition is often also known as context free recognition.

Instead of using context, CnC recognition uses pre-defined grammars. These grammars contain rules, and each rule can then have a programmed response. So, in developing an application that uses CnC recognition the programmer defines both the grammar and the rules as well as the response to the recognition of each rule.

If grammars and rules are managed properly, CnC recognition can be much more accurate than dictation recognition. This is because the number of words that need to be recognized for CnC is only a subset of the universe of words needed for dictation. With CnC the engine only needs to worry about the words in the active grammar, not all the words in the dictionary. Fewer possibilities mean better accuracy.

It turns out that dictation recognition is much more difficult for speech engine developers than CnC recognition. But for us as the application developer it is much easier to implement simple dictation in an application than CnC because with dictation we don’t need to worry about writing a grammar. For this reason I’m starting with dictation instead of CnC. I’ll probably do CnC and grammar development in the next article.

A Simple Dictation Application

All right then, let’s build an application that takes dictation.

[ Delphi 6 users – In the process of testing this sample in Delphi 6 I ran into a known problem with event sinks generated from type library imports. See Article number 2590 for more information and some workarounds. ]

Start up Delphi (5 or 6; 4 might work too but I didn’t try it). On the SAPI5 palette (see Speech Part 1 if you don’t have one) find the TSpSharedRecoContext component and drop it on your form along with a TMemo component.

Add the ActiveX unit to your uses clause and add a private field to the form called fMyGrammar of type ISpeechRecoGrammar.

Create an OnCreate event for the form, plus OnRecognition and OnHypothesis events for the SpSharedRecoContext component. Your complete unit should look something like this:

unit Unit1;

interface

uses
  Windows, Messages, SysUtils, Variants, Classes, Graphics, Controls, Forms,
  Dialogs, OleServer, SpeechLib_TLB, StdCtrls, ActiveX, ComCtrls;

type
  TForm1 = class(TForm)
    SpSharedRecoContext1: TSpSharedRecoContext;
    Memo1: TMemo;
    procedure FormCreate(Sender: TObject);
    procedure SpSharedRecoContext1Recognition(Sender: TObject;
      StreamNumber: Integer; StreamPosition: OleVariant;
      RecognitionType: TOleEnum; var Result: OleVariant);
    procedure SpSharedRecoContext1Hypothesis(Sender: TObject;
      StreamNumber: Integer; StreamPosition: OleVariant;
      var Result: OleVariant);
  private
    { Private declarations }
    fMyGrammar : ISpeechRecoGrammar;
  public
    { Public declarations }
  end;

var
  Form1: TForm1;

implementation

{$R *.dfm}

procedure TForm1.FormCreate(Sender: TObject);
begin
  fMyGrammar := SpSharedRecoContext1.CreateGrammar(0);
  fMyGrammar.DictationSetState(SGDSActive);
end;

procedure TForm1.SpSharedRecoContext1Recognition(Sender: TObject;
  StreamNumber: Integer; StreamPosition: OleVariant;
  RecognitionType: TOleEnum; var Result: OleVariant);
begin
  Memo1.Text := Result.PhraseInfo.GetText;
end;

procedure TForm1.SpSharedRecoContext1Hypothesis(Sender: TObject;
  StreamNumber: Integer; StreamPosition: OleVariant;
  var Result: OleVariant);
begin
  Memo1.Text := Result.PhraseInfo.GetText;
end;

end.

Compile and run this. Speak something. Your words should appear in the memo field. If they do not, shut the application down and:
1. Make sure your microphone is not muted.
2. Use the Speech control panel applet to make sure that your microphone and the recognition engine are working properly.
Now try it again.

Explanation

The SAPI 5.1 automation objects support both dictation and CnC speech recognition. Of the 19 components installed the following 4 are central to speech recognition.

The SpInprocRecognizer represents a speech recognition engine that is instantiated in the same process as the application.

The SpSharedRecognizer represents an instance of a speech recognition engine that is shared by many applications.

The SpInprocRecoContext is a recognition context that uses a SpInprocRecognizer.

The SpSharedRecoContext is a recognition context that uses a SpSharedRecognizer.

Shared vs. Inprocess

An application can use either an inprocess instance of a speech engine (SpInprocRecognizer) or an instance that is shared with other applications (SpSharedRecognizer).  The inprocess recognizer claims resources for the application, so, for example, once an inprocess recognizer claims the system’s microphone, no other application can use it.

A shared recognizer runs in a separate process from the application and, as a result, it can be shared with other applications. This allows multiple applications to share system resources (like the microphone).

In our sample we are using a shared engine. In most desktop applications shared is the way to go. Using a shared recognizer allows your application to play nicely with other speech enabled applications on your system. If your application is targeted at some dedicated machine, like one running a telephone voice response application, then the inprocess approach would be appropriate. Inprocess recognition is somewhat more efficient than shared recognition.

Recognition Contexts

A recognition context is an object that manages the relationship between the recognition engine object (the recognizer) and the application. Do not confuse the use of the word “context” as used here with its usage in “context free grammar”.

A single recognizer can be used by many contexts. For example, a speech enabled application with 3 forms will likely have a single engine instance with a separate context for each form. When one form gets the focus its context becomes active and the other two forms’ contexts are disabled. In this way, only the commands relevant to the one form are recognized by the engine. Another example is seen in Microsoft Word XP, where there is one context for dictation and another context for issuing menu commands.

The recognition context is the primary means by which an application interacts with SAPI. It is the object you use to start and stop recognition and it is the object that receives the event notifications when something is recognized. Further, the recognition context controls which words (grammars and/or dictation) are recognized. By setting recognition contexts, applications limit or expand the scope of the words needed for a particular aspect of the application. This granularity for speech recognition improves the quality of recognition by removing words not needed at that moment. Conversely, the granularity also allows words to be added to the application if needed.
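
As a rough sketch of the "one context per form" idea, each form could enable its own context when it becomes active and disable it when it loses focus. This fragment belongs in the form's unit and is not a complete program. It assumes that OnActivate and OnDeactivate handlers have been created for the form, and that the imported component wrapper exposes the underlying ISpeechRecoContext through a DefaultInterface property (check your own SpeechLib_TLB.pas; if your import differs, the State property may be available directly on the component instead).

```pascal
// Sketch only: fragment of the form's unit, handler names are assumptions.
procedure TForm1.FormActivate(Sender: TObject);
begin
  // This form is in front: let its context receive recognitions
  SpSharedRecoContext1.DefaultInterface.State := SRCS_Enabled;
end;

procedure TForm1.FormDeactivate(Sender: TObject);
begin
  // Another form of the application took the focus: stop listening here
  SpSharedRecoContext1.DefaultInterface.State := SRCS_Disabled;
end;
```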

In our example above we do the simple thing (at least programmatically) and just load dictation. This means that the engine will attempt to recognize all words. The other possibility is to load one or more specific grammars. Grammars are a big subject and will be covered in a later article.

There’s a lot more on the subjects of recognition contexts and inprocess vs. shared recognizers in the SAPI 5.1 documentation but for now that’s enough to talk about the sample code.

What the sample code does

First, here is the form’s OnCreate event.

procedure TForm1.FormCreate(Sender: TObject);
begin
  fMyGrammar := SpSharedRecoContext1.CreateGrammar(0);
  fMyGrammar.DictationSetState(SGDSActive);
end;

Just two lines of code set the whole recognition process in motion. First we create a grammar object for the engine (CreateGrammar), and then we instruct this grammar to attempt to recognize all words by calling DictationSetState(SGDSActive).
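
The same DictationSetState call can also be used to switch dictation off again. Here is a minimal sketch, assuming you have added a TCheckBox named CheckBoxListen to the form (that control is hypothetical and not part of the sample above) and created its OnClick handler. SGDSInactive is the counterpart of SGDSActive in the SAPI 5.1 type library.

```pascal
// Sketch only: CheckBoxListen is a hypothetical control added for this example.
procedure TForm1.CheckBoxListenClick(Sender: TObject);
begin
  if CheckBoxListen.Checked then
    fMyGrammar.DictationSetState(SGDSActive)     // start taking dictation
  else
    fMyGrammar.DictationSetState(SGDSInactive);  // stop taking dictation
end;
```

This gives the user a simple way to stop the memo from filling up while talking to someone else in the room.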

Notice that neither on the form nor in the code do we ever instantiate a SpSharedRecognizer. This is because SAPI is smart enough to create the shared recognizer object for us automatically when the SpSharedRecoContext is created.

Next we need some way for the application to be informed by the engine when it recognizes something. This is done through the OnRecognition event.

procedure TForm1.SpSharedRecoContext1Recognition(Sender: TObject;
  StreamNumber: Integer; StreamPosition: OleVariant;
  RecognitionType: TOleEnum; var Result: OleVariant);
begin
  Memo1.Text := Result.PhraseInfo.GetText;
end;

Of the various parameters passed in the OnRecognition event, the Result parameter is the key. Although declared as an OleVariant for interprocess communications, it’s really an object with an ISpeechRecoResult interface. This interface lets you get all sorts of information about what was said and what the recognizer understood. Some of the information available through this interface includes: the words recognized, a rating of the engine’s confidence in the recognition, when the recognition happened and how long it took. You can even play back the audio for what was said. Much of the information returned is only useful for context free grammars and doesn’t apply to dictation.
In the sample we just call the GetText method to return the text of what the engine understood.
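
If you prefer compile-time checking over late binding, the variant can be cast to the ISpeechRecoResult interface. The following is a sketch under a few assumptions: that your import declares the event with a var Result: OleVariant parameter as shown above (some imports pass a typed ISpeechRecoResult instead, in which case no cast is needed), and that the Delphi 6 event sink problem mentioned earlier is not in play, since the cast can fail there. Once the interface is early bound, GetText needs all three arguments spelled out: the first element, the number of elements (-1 for all) and whether to use text replacements.

```pascal
// Sketch only: an early-bound variant of the sample's OnRecognition handler.
procedure TForm1.SpSharedRecoContext1Recognition(Sender: TObject;
  StreamNumber: Integer; StreamPosition: OleVariant;
  RecognitionType: TOleEnum; var Result: OleVariant);
var
  RecoResult: ISpeechRecoResult;
  Elements: ISpeechPhraseElements;
  i: Integer;
begin
  // Early-bind the variant so the compiler can check the calls
  RecoResult := IDispatch(Result) as ISpeechRecoResult;

  // Whole phrase: start at element 0, take all elements, use replacements
  Memo1.Lines.Add(RecoResult.PhraseInfo.GetText(0, -1, True));

  // One line per recognized word with the engine's confidence rating
  // (-1 = low, 0 = normal, 1 = high)
  Elements := RecoResult.PhraseInfo.Elements;
  for i := 0 to Elements.Count - 1 do
    Memo1.Lines.Add(Format('  %s (confidence %d)',
      [string(Elements.Item(i).DisplayText),
       Integer(Elements.Item(i).ActualConfidence)]));
end;
```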

The OnRecognition event only fires when the engine is satisfied that the user has uttered a complete phrase and that it has made its best guess about what the user said. You could run the sample application with only this event defined and it would work.

I added the OnHypothesis event so you could get a feel for how the engine, working in dictation mode, uses all the words together (in context) to create hypotheses, make corrections, and, finally, come to a decision about what was said.
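
One way to watch the difference between the two events is to send them to different controls. This sketch assumes a TLabel named LabelHypothesis has been added to the form (a hypothetical control, not part of the sample above). The label shows the engine's changing guesses while you speak, and the memo only receives the final decision.

```pascal
// Sketch only: LabelHypothesis is a hypothetical control added for this example.
procedure TForm1.SpSharedRecoContext1Hypothesis(Sender: TObject;
  StreamNumber: Integer; StreamPosition: OleVariant;
  var Result: OleVariant);
begin
  // Interim guess: may change several times before the phrase is finished
  LabelHypothesis.Caption := Result.PhraseInfo.GetText;
end;

procedure TForm1.SpSharedRecoContext1Recognition(Sender: TObject;
  StreamNumber: Integer; StreamPosition: OleVariant;
  RecognitionType: TOleEnum; var Result: OleVariant);
begin
  // Final decision: keep it and clear the interim display
  Memo1.Lines.Add(Result.PhraseInfo.GetText);
  LabelHypothesis.Caption := '';
end;
```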

That’s enough for now

Speech recognition is a very big subject. I’ve scratched the surface of dictation speech recognition but there is much more. To write a really usable dictation application the user will need ways to correct mistakes and give the speech recognition engine commands like “Bold the last 3 words”. While possible with SAPI, this level of discussion is beyond the scope of this introduction. I urge you to study the documentation that comes with the SAPI SDK.
I haven’t given much more than passing mention to CnC context free grammars. CnC recognition and grammars will be the subject of the next article.
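
As a small taste of what correction support could build on, the SAPI 5.1 automation interface ISpeechRecoResult has an Alternates method that returns other phrases the engine considered. The helper below is a hypothetical sketch, not something from the sample: ShowAlternates is a name made up here, it would need to be declared in the form class, and it expects an early-bound ISpeechRecoResult. Whether any alternates come back depends on the engine, so treat this as a starting point for reading the SDK documentation rather than a finished feature.

```pascal
// Sketch only: ShowAlternates is a hypothetical helper method.
procedure TForm1.ShowAlternates(const RecoResult: ISpeechRecoResult);
var
  Alternates: ISpeechPhraseAlternates;
  i: Integer;
begin
  // Ask for up to 5 alternates covering the whole phrase
  // (start at element 0, -1 = all elements)
  Alternates := RecoResult.Alternates(5, 0, -1);
  if Alternates = nil then
    Exit;  // the engine offered nothing else

  for i := 0 to Alternates.Count - 1 do
    Memo1.Lines.Add('Alternate: ' +
      Alternates.Item(i).PhraseInfo.GetText(0, -1, True));
end;
```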

OK not quite enough

I couldn’t leave the sample application alone. Here’s a slightly modified version that is a bit more satisfying in that it lets you keep multiple utterances.

type
  TForm1 = class(TForm)
    SpSharedRecoContext1: TSpSharedRecoContext;
    Memo1: TMemo;
    procedure FormCreate(Sender: TObject);
    procedure SpSharedRecoContext1Recognition(Sender: TObject;
      StreamNumber: Integer; StreamPosition: OleVariant;
      RecognitionType: TOleEnum; var Result: OleVariant);
    procedure SpSharedRecoContext1Hypothesis(Sender: TObject;
      StreamNumber: Integer; StreamPosition: OleVariant;
      var Result: OleVariant);
  private
    { Private declarations }
    fMyGrammar : ISpeechRecoGrammar;
    CurrentText : String;
  public
    { Public declarations }
  end;

var
  Form1: TForm1;

implementation

{$R *.dfm}

procedure TForm1.FormCreate(Sender: TObject);
begin
  fMyGrammar := SpSharedRecoContext1.CreateGrammar(0);
  fMyGrammar.DictationSetState(SGDSActive);
end;

procedure TForm1.SpSharedRecoContext1Recognition(Sender: TObject;
  StreamNumber: Integer; StreamPosition: OleVariant;
  RecognitionType: TOleEnum; var Result: OleVariant);
begin
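  // Keep the finished phrase, plus a trailing space so that
  // consecutive utterances do not run together
  CurrentText := CurrentText + Result.PhraseInfo.GetText + ' ';
  Memo1.Text := CurrentText;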
end;

procedure TForm1.SpSharedRecoContext1Hypothesis(Sender: TObject;
  StreamNumber: Integer; StreamPosition: OleVariant;
  var Result: OleVariant);
begin
  Memo1.Text := CurrentText + Result.PhraseInfo.GetText;
end;

end. //really









Comments to this article
Run time Error
    goo kachee (Sep 28 2006 10:09AM)

In D6,when the SpSharedContext's OnReconition event is response ,"Invalid variant type conversion" error occurs.
there is the code:

procedure TfrmContinuousDictation.SpSharedRecoContextRecognition(
  Sender: TObject; StreamNumber: Integer; StreamPosition: OleVariant;
  RecognitionType: TOleEnum; var Result: OleVariant);
var
  SRResult: ISpeechRecoResult;
begin
  SRResult := IDispatch(Result) as ISpeechRecoResult;
  memText.SelText := SRResult.PhraseInfo.GetText(0, -1, True) + #32
end;


Damn it, i can't slove it.

Somebody help me?

speech recognition by pattern
    Zsolt Balanyi (Feb 10 2005 11:15PM)

Hello!

I am interested in a method where I can train the computer to recognize commands, and to return the commman ID for the spoken command. I am not interested in English, but in other languages.

Error
    niiom (Feb 2 2005 11:45PM)

memo1.Text := Result.PhraseInfo.GetText;

error:
[Error] Unit1.pas(44): Not enough actual parameters
What's the problem I don't get it

RE: Error
Koen (Feb 7 2005 2:29PM)

got the same problem, using delphi 7.

RE: RE: Error
Iskander Yarmuahemetov (Feb 7 2005 8:30PM)

actually I soved this problem ...GetText(0,-1,true) will solve this
0 means the place when it starts recognising
-1 that it will recognize all words (1 means one word, 2 means two words, etc)
true - I didn't solved it out yet :)

Grammar from Resources
    Greg Bullock (Oct 10 2003 7:54PM)

Thanks so much for the very helpful article Alec.

Following your examples, I've added a context-free grammar to my app for C&C.  So far, I can load the grammar from a file, but I can't get it to load from resources.  I use the lines

MyGrammar := SpSharedRecoContext.CreateGrammar(0);
MyGrammar.CmdLoadFromFile(ExtractFilePath(Application.ExeName)+'MyGrammar.cfg',SLOStatic);
MyGrammar.CmdSetRuleState('',SGDSActive);

and everything works fine.  But when I change the CmdLoadFromFile line to

MyGrammar.CmdLoadFromResource(MainInstance, OleVariant('MYCONTROLGRAMMAR'), OleVariant(10), $409, SLOStatic);

I get an run-time error message
"The specified resource name cannot be found in the image file"

Opening the .exe file in a resource editor, I can confirm that named resource is indeed in the file, but for some reason CmdLoadFromResource can't find it.

Any ideas?

Regards.
Greg

error message
    quaker (May 24 2003 10:35AM)

class  Evarianttypeclasterror with message"could not convert variant of type(dispatch) into type (integer)"

Why???


not running
Alain (Jul 27 2004 8:16AM)

i copy the code and i make the component with the active x and i run it but the project he run it but in the memo appear a word without speaking
i didn't why
please send me what should i do
thanks

appear in the memo words without speaking
Alain (Jul 27 2004 9:02AM)

i copy the code and make the component and in the form <>
and i run it he run it correctly but in the memo appear a word without speaking so this where he comes like " if / the / this ......." so what is the mistake the computer not type the word when i speak.
please send me why ?? what is the mistake ??
thanks you very much.

speech interface - rulezzzz
    Dima Galkin (Apr 28 2003 5:00AM)

it is very useful article.. i wanr to say a very big thanx to the auther!!!
sorry 4 my english

not running correctly
Alain (Jul 27 2004 9:27AM)

help me the project while i copy in delphi and run it in the memo appear a words without speaking
thanks

error message
    Jefri Setiawan (Sep 8 2002 12:01PM)

I'm still got error message about type library : TOlenum, please help me, i really interested in speech recognition, it's for my thesis :)

RE: error message
Alec Bergamini (Sep 9 2002 4:13PM)

If you are talking about the type lib problem when importing with Delphi 6 then you need to either do the import with Delphi 5 or use the most recent update of Delphi 6 which I am told fixes the import problem. If it doesn't then find a copy of Delphi 5. I know that works.

The code for botton Click event is missing?
    Abdulaziz Jasser (Mar 1 2002 12:01PM)

I try your example code at the bottom of the article and found that the some of the code is missing.

RE: The code for botton Click event is missing?
Alec Bergamini (Mar 1 2002 5:08PM)

Could you be a bit more specific. I don't see anything that is missing.

RE: RE: The code for botton Click event is missing?
Abdulaziz Jasser (Mar 1 2002 6:05PM)

I your example at the bottom of the article there should be a code for a botton click.  In your example you can find:
...................................................................
procedure ButtonSpeakClick(Sender: TObject);
...................................................................

But the code for this event is not there!  Beside I get an error when I run the example.

RE: RE: RE: The code for botton Click event is missing?
Alec Bergamini (Mar 1 2002 6:20PM)

I never noticed that. Actually that button was deleted from the form and is not used for anything. You should just ignore  the
procedure ButtonSpeakClick(Sender: TObject);
and not include it.

Great articles!
    Jason Pierce (Sep 10 2001 12:00AM)

Alec,

Great articles. Keep up the good work!

-Jason

Speech in Windows XP
    Bill Niles (Aug 21 2001 9:07PM)

I'm really enjoying these articles. I had no idea how easy implementing speech using Delphi could be. Hopefully the one on command and control grammars will not be to far off.

I've heard that XP will come with SAPI as part of the operating system. Does anyone know if this is true? If it is true, are these articles relevant to using the speech functionality in XP?

RE: Speech in Windows XP
a coder (Aug 21 2001 10:00PM)

hi
yes the speech engine is a part of winxp, and all this articles works fine on it :)

RE: RE: Speech in Windows XP
Alec Bergamini (Aug 22 2001 12:35AM)

Hello a coder,
When you installed set it up on XP did you download the SDK or were the automation objects automatically in the list of type libraries available to import?

RE: RE: RE: Speech in Windows XP
Eber Irigoyen (Aug 22 2001 3:11AM)

if it comes with XP then it should be available, shouldn't it?
is just like all you have installed on win98 is available to import
so... it should be available

It doesnt work!
aiikkkkkkkkkk asdas (Oct 18 2003 9:39PM)

Please help me someone!
I cant use the source.
When I try to talk a error message popups and says: "Invalid variant type conversation"!

Does anyone know what the problem is?

You can E-MAIL ME AT Aid_k_2000@hotmail.com AND NOT nice_assbaby1@hotmail.com.

RE: It doesnt work!
Greg Bullock (Oct 19 2003 12:40AM)

Try setting breakpoints in the message response methods, then step through the code to see which line is complaining.

RE: RE: It doesnt work!
aiikkkkkkkkkk asdas (Oct 19 2003 8:55PM)

The error appears when I use

fMyGrammar := SpSharedRecoContext1.CreateGrammar(0);
fMyGrammar.DictationSetState(SGDsActive);

in the OnCreate procedure.

You know what the problem is?

I am using delphi 6 personal and I downloaded the sapi 3 days ago

RE: It doesnt work!
Greg Bullock (Oct 19 2003 9:11PM)

Which line is it, the CreateGrammar line or the DictationSetState line?

If it's the second line, what happens when you replace
DictationSetState(SGDSActive)
with
DictationSetState(1)
?  According to the SAPI5 documentation for DictationSetState, this should be equivalent.

Greg



RE: RE: It doesnt work!
aiikkkkkkkkkk asdas (Oct 20 2003 11:06PM)

It doesnt matter anymore, I found a way to fix it on the net!
Thanks anyways!

But now I wonder if anybody knows how to create a own Add/remove word dialog? anybody knows?

thnx in advance!

RE: RE: RE: It doesnt work!
aaaaik (Oct 21 2003 3:02PM)

or can you answer this:

how can I make my own dictionary so that it for example can understand the swedish language???

Anybody knows?

RE: RE: RE: RE: It doesnt work!
mo Ja (Aug 2 2005 2:43AM)

2 aaaaik:
Use transcription(only works for C&C)
don't add actual words, add transcriptions of 'em...for example: 'real' you can recognize it by adding to xml file 'reel'.....hope you got the idea...sorry for my lacky english

RE: RE: RE: It doesnt work!
David (Dec 27 2004 7:22PM)

How have you corrected the "invalid variant type conversion" error?

RE: RE: RE: RE: It doesnt work!
David (Dec 27 2004 11:53PM)

OK I solved this problem... you must not import the speechlib_tlb.pas from the Microsoft SAPI, it sucks a lot.

Try this package instead:

http://www.blong.com/Conferences/DCon2002/Speech/SAPI51/SAPI51.zip

more info on
http://www.blong.com/Conferences/DCon2002/Speech/SAPI51/SAPI51.htm














 
 
   














 







     
  Copyright © 2000 - 2007 delphi3000.com - All rights reserved. Terms of use. || Privacy
delphi3000.com is a service by bluestep.com IT-Services GmbH (Vienna)