Re: Voice Recognition Profiles

Re: Voice Recognition Profiles

Baggia Paolo
 
Dear ..,

I'd like to give you some more information on the background
of your proposals.

There are at least two broad classes of ASR:
- telephony ASR
- dictation ASR

The former does not require any kind of training, because it is
designed to be used by all possible speakers of a given language,
so the ASR is using a general acoustic model trained on a large
population of speakers.

Conversely, the latter is for personal use, so training is used to
improve performance for a given speaker. Even in this field, which
started from very long training sessions (reading predefined
sentences), current dictation ASR systems use general acoustic
models as a baseline, so the training needed is reduced.

For telephony ASR there are approaches that adapt the acoustic
models online to improve performance for the actual speaker. This is
done during the course of the speech interaction, without the need
for an explicit training phase.

A second aspect is that it is very premature to speak of a
Voice Recognition Profile today. The technologies are all different,
so it is almost impossible to have a standard profile, but your
idea is good in principle.

This is my personal opinion,
Paolo Baggia, Loquendo.

====================================================================
Voice Recognition Profiles

From: B.K. DeLong
Date: Fri, 28 Oct 2005 08:26:32 -0400
Message-Id: <[hidden email]>
To: [hidden email]


I'm not sure if this is the right place to discuss this - I looked
through the archives of this list and several TRs from the Voice
activity and didn't really find anything to answer my question.

Have any efforts been made to make a standard for voice recognition
training profiles? Is "training" even necessary any more for voice
recognition systems?

So when I load up a voice recognition program, I am told to read
several lines or paragraphs of text so it can match the text content
with my voice. For every program I try, I have to retrain it all over
again. In theory, if I move from my computer to my car and try to
activate my GPS system by voice, it needs to be trained. If I go to
an ATM or drive-thru where one can automatically order by voice, I
need to spend several minutes correcting the system until I'm
connected with a human operator.

Why not create a standard profile for voice recognition that all
voice-recognition applications can use? That way, when I come to a
new system I need to "train", I just type in my SSN or some other UID
which tells the system to pull my VRP (Voice Recognition Profile),
out of a centralized directory service, allowing me to immediately
use the system.

In theory, each time I access a new service, whatever actions I take
and corrections I make in the process, would be noted in the file for
the next time I access a service - a live, constantly-growing,
learning profile.

Does such a standard or technology effort exist?

--
B.K. DeLong
[hidden email]
+1.617.797.8471 (Note new number)

http://www.brain-stream.com Play.
http://www.bostonredcross.org Volunteer.
http://www.the-leaky-cauldron.org Potter.
http://www.hackerfoundation.org Future.
http://www.wkdelong.org Son.


PGP Fingerprint:
38D4 D4D4 5819 8667 DFD5 A62D AF61 15FF 297D 67FE

FOAF:
http://foaf.brain-stream.org

Gruppo Telecom Italia - Direzione e coordinamento di Telecom Italia S.p.A.

================================================
CONFIDENTIALITY NOTICE
This message and its attachments are addressed solely to the persons
above and may contain confidential information. If you have received
the message in error, be informed that any use of the content hereof
is prohibited. Please return it immediately to the sender and delete
the message. Should you have any questions, please send an e_mail to
[hidden email]. Thank you
www.loquendo.com
================================================


RE: Voice Recognition Profiles

Shires, Glen

A simple way to address the different technologies used for profiles is to store the voice samples as plain audio. For example, if we standardize on a common training text (e.g. a few paragraphs in the public domain) and ask users to make a high-quality recording of it in a standardized way, this audio could be used as training input for virtually any speech recognition system. As an example, the recording could be standardized at a 24 kHz sampling rate and 16 bits/sample, stored in a specific lossless format and recorded through a specified near-field microphone. Speech recognition systems could then process this audio to match the input characteristics of their own system, for example by mimicking properties of their microphone and environment and re-sampling to a different sampling rate.

Thus, the same audio samples could be used for training virtually any speech recognition system. For example, they could be recorded on a PC using a standardized application, then uploaded to a central web-site and downloaded by other devices that you use.
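
To make the idea concrete, here is a minimal sketch in Python of what "store it as plain audio, let each recognizer convert it" could look like. It uses only the standard library. The function names are made up for illustration, WAV is just one possible lossless container, and the linear-interpolation resampler is deliberately naive (a real system would apply a proper anti-aliasing filter before downsampling).

```python
import io
import math
import struct
import wave

STANDARD_RATE = 24000  # Hz, the example rate suggested above
SAMPLE_WIDTH = 2       # bytes per sample, i.e. 16 bits


def write_standard_wav(samples, fileobj):
    """Store mono 16-bit PCM samples at the standard rate in a WAV container."""
    with wave.open(fileobj, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(SAMPLE_WIDTH)
        w.setframerate(STANDARD_RATE)
        w.writeframes(struct.pack("<%dh" % len(samples), *samples))


def read_wav(fileobj):
    """Return (sample_rate, samples) for a mono 16-bit WAV file."""
    with wave.open(fileobj, "rb") as w:
        n = w.getnframes()
        rate = w.getframerate()
        frames = w.readframes(n)
    return rate, list(struct.unpack("<%dh" % n, frames))


def resample_linear(samples, src_rate, dst_rate):
    """Naive resampler using linear interpolation (illustration only)."""
    n_out = len(samples) * dst_rate // src_rate
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate
        j = int(pos)
        frac = pos - j
        nxt = samples[min(j + 1, len(samples) - 1)]
        out.append(int(round(samples[j] * (1 - frac) + nxt * frac)))
    return out


# Example: one second of a 440 Hz tone stands in for the training recording.
tone = [int(10000 * math.sin(2 * math.pi * 440 * t / STANDARD_RATE))
        for t in range(STANDARD_RATE)]
buf = io.BytesIO()
write_standard_wav(tone, buf)
buf.seek(0)
rate, restored = read_wav(buf)
# A telephony recognizer might want 8 kHz input instead of 24 kHz.
telephony = resample_linear(restored, rate, 8000)
```

The point of the sketch is only that the stored recording stays vendor-neutral, and each recognizer does its own conversion on the way in.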


This is my personal opinion,
Glen Shires

________________________________________
From: [hidden email] [mailto:[hidden email]] On Behalf Of Baggia Paolo
Sent: Friday, November 11, 2005 2:53 AM
To: B.K. DeLong
Cc: Baggia Paolo; [hidden email]
Subject: Re: Voice Recognition Profiles

 


Re: Voice Recognition Profiles

Al Gilman
In reply to this post by Baggia Paolo

What Paolo says is what I hear from others:

Speech recognition performance is
- hotly competitive, because it is
- marginally acceptable

This also means that training against a 'vanilla' corpus of training
texts would probably be, to a competitive-sensitive degree, more
tedious and less effective than training on a corpus attuned to the
technology that you are training.

For basic controls, your GPS can use 'telephony ASR' without
training. That is to say to turn it on, change views, and do simple
things like "pan North." So long as the domain of discourse is
compact enough, it is selecting among a small set of valid catches
and it doesn't need to be trained.
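
A toy illustration of why a compact domain helps: when only a handful of commands are valid, even a noisy hypothesis can be snapped to the nearest valid one. The Python sketch below works on text rather than audio, and the command list, the function name and the cutoff value are all made up for the example. It uses `difflib.get_close_matches` from the standard library as a stand-in for the recognizer's own scoring.

```python
import difflib

# Hypothetical command set for a simple in-car device.
COMMANDS = [
    "power on",
    "power off",
    "change view",
    "pan north",
    "pan south",
    "zoom in",
    "zoom out",
]


def snap_to_command(hypothesis, cutoff=0.6):
    """Return the closest valid command, or None if nothing is close enough."""
    matches = difflib.get_close_matches(
        hypothesis.lower(), COMMANDS, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


# A slightly garbled hypothesis still lands on a valid command,
# while something far outside the domain is rejected.
garbled = snap_to_command("pan norf")
rejected = snap_to_command("xyzzy")
```

With an open-ended domain such as free-form destinations, there is no small list to snap to, which is where speaker-specific training starts to matter.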

General route planning from voice catches could get hairy.  I don't
know how near or far off that is.  We'll have to watch what comes
on the market.

How can I say: it's not that hard to make untrained ASR competitive
with the level of fussiness in the 'destination input' function of
contemporary free Web map services. These often take two or three
tries to reduce my input to a form they can recognize. Not natural
conversation, but voice competitive with other input modes. Other
than perhaps an inked 'X' or lasso on the graphic map.

To enter by free speech a destination that you want to go to, your
in-the-car GPS might access a network-hosted GIS reference
service behind it. But it would need its own resident map because
when you're lost, of course that's when your network connection
fades out.

Another network-collaboration scenario we have discussed,
inspired by the needs of speakers with atypical speech, is that
the [internet-connected] Voice Browser could, when required,
outsource the speech recognition to a Web Service hosting
a speech recognition technology that you have trained.  The
MRCP technology is a candidate to handle the outsourcing
connection.

http://www.aculab.com/support/v6_api/mrcp/specs.html

Al





Re: Voice Recognition Profiles

Sheth Raxit
In reply to this post by Baggia Paolo

B.K.Delong, Baggia Paolo and Other Members,


B.K.DeLong Wrote
================

> Have any efforts been made to make a standard for voice recognition
> training profiles? Is "training" even necessary any more for voice
> recognition systems?
>
> So when I load up a voice recognition program, I am told to read
> several lines or paragraphs of text so it can match the text content
> with my voice. For every program I try, I have to retrain it all over
> again. In theory, if I move from my computer to my car and try to
> activate my GPS system by voice, it needs to be trained. If I go to
> an ATM or drive-thru where one can automatically order by voice, I
> need to spend several minutes correcting the system until I'm
> connected with a human operator.


Raxit
=====

I think explicit "training" is boring for users, but it may improve performance. If I am not wrong, your idea is also about CONTINUOUS FEEDBACK, IMPROVEMENT AND LEARNING by the ASR of how a specific user speaks...

B.K.DeLong Wrote
================

> Why not create a standard profile for voice recognition that all
> voice-recognition applications can use? That way, when I come to a
> new system I need to "train", I just type in my SSN or some other UID
> which tells the system to pull my VRP (Voice Recognition Profile),
> out of a centralized directory service, allowing me to immediately
> use the system.


Raxit
=====

Many services require user identification and many do not (and services that do not require identification obviously do not have user profiles). But say, for example, that for some specific word recognition fails beyond some threshold value: then some feedback should go to the ASR so that the ASR can LEARN and next time be ABLE TO RECOGNIZE the word correctly. (I think, though I am not sure, that some ASRs are capable of doing a similar thing, but it is not user-specific.)
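
As a rough sketch of the threshold idea, assuming "threshold" means a count of failed attempts per word (it could equally mean a confidence score): the small Python class below just counts failures and signals when a word should be handed to the ASR for adaptation. The class and method names are hypothetical, and how the ASR actually adapts is vendor-specific and not shown.

```python
from collections import Counter


class FailureTracker:
    """Counts recognition failures per word and flags words that need adaptation."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = Counter()

    def report(self, word, recognized_ok):
        """Record one attempt; return True once failures reach the threshold."""
        if recognized_ok:
            return False
        self.failures[word] += 1
        return self.failures[word] >= self.threshold


# Example: with a threshold of 2, the second failure on the same word
# triggers the feedback signal.
tracker = FailureTracker(threshold=2)
first = tracker.report("Colaba", recognized_ok=False)
second = tracker.report("Colaba", recognized_ok=False)
```

Keeping one tracker per user would make the feedback user-specific; a single shared tracker would correspond to the non-user-specific case.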


And for systems that require user identification, it is good to have (as you suggest) Voice User Profiles that are vendor-independent (so that if the application changes the ASR from Vendor1 to Vendor2, the Voice Profiles of Vendor1 can also be used by Vendor2).

(Here I think that even if an ASR is using Voice User Profiles, profiles that are not in some STANDARD/VENDOR INDEPENDENT format are of NO (or little, or limited) use.)

And in that context, creating a vendor-independent Voice User Profile might be the KEY ISSUE.


B.K.DeLong :
==========
>Does such a standard or technology effort exist?

Raxit:
======
I think the technology exists for continuous feedback (I am not sure, but I am searching on it...) (from some ASR vendors), but there is no standard for how to use a Voice User Profile vendor-independently...


Here the key issue is a VENDOR INDEPENDENT standard, so that if one user is using 10 different systems built on 5 different ASR vendors, there would be only a single profile of the user shared by all, using some standard format/protocol, if the systems are user-dependent...

Or, if an application wants to change its ASR from Vendor1 to Vendor2, the user profiles created by the old vendor can be used by the new one. (No vendor lock-in, in short...)


Waiting for reply...


Thanking You,
Regards
Raxit Sheth


--
Raxit Sheth
Systems Software Engineer
[hidden email]

***********************
Please note our new Address.
***********************
Phonologies (India) Private Limited
17/18 Metro House, Colaba Causeway,
Mumbai 400001. INDIA.
Ph:+91-22-22029732 / 36   Fax:+91-22-22029728

[hidden email]
http://www.phonologies.com

****The information in this email is confidential and may be legally
privileged. It is intended solely for the addressee. Access to this email by
anyone else is unauthorized. If you are not the intended recipient, any
disclosure, copying, distribution or any action taken or omitted to be taken
in reliance on it, is prohibited and may be unlawful***




Re: Voice Recognition Profiles

Baggia Paolo
In reply to this post by Baggia Paolo
 
Raxit, B.K.Delong, and Other Members,

> Raxit:
> ======
> I think Technology is exists for continuous feedback ( I am not sure, but i am searching on it...) (by some
> ASR Vendor ) but Not any Standard for how to use Voice User Profile Vendor independently...
>
> Here the key-issue is VENDOR INDEPENDENT Standard so that if one user is using 10 different system using 5
> different vendors of ASR, There would be only Single Profile of the User shared by all, using some standard
> Format/Protocol if the system is User-Dependets...
>
> or if Application wants to change the ASR of Vendor1 to of Vendor2 , the User profiles created by
> old vendor can be used by new.(No Vendor Lock-in in short...)

Yes, that is the interesting point, but I do not think the technology is mature
today for allowing that standard or almost-standard.

Even for Speaker Verification/Identification there is not a standard for voice prints,
but I'm not 100% sure there is not a standardization effort on this simpler task.

Paolo.



Re: Voice Recognition Profiles

Sheth Raxit
In reply to this post by Baggia Paolo

Paolo,B.K.Delong and Other Members,


Baggia Paolo Wrote:
====================
> Yes, that is the interesting point, but I do not think the
> technology is mature today for allowing that standard or
> almost-standard.
>
> Even for Speaker Verification/Identification there is not
> a standard for voice prints, but I'm not 100% sure there is
> not a standardization effort on this simpler task.


Raxit :
========
Yes, the same applies to Speaker Verification as well (and speaker verification may be the STRONG example).



A common format/protocol may be one of the solutions.


Another solution may be some standard API for migrations,

so that vendors can have their own formats but with a set of migration APIs to convert to a standard format.

Migrations between ASRs are not too frequent, so an API may be a good option.


No two vendors of Speaker Verification have similar voice prints (stored in a file, database, etc.), but migration can be done if some standard set of interfaces (API) is available, without compromising RUN-TIME performance and without FORCING vendors to follow a standard format...

(Here I am assuming that migration would not happen frequently.)
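
A minimal sketch in Python of what such a migration interface could look like. Everything here is hypothetical: the interface name, the two toy "vendor" formats and the shape of the interchange record are invented for illustration. Whether real voice prints can be converted between vendors without loss is exactly the open question in this thread; the sketch only shows the shape of the API.

```python
import json
from abc import ABC, abstractmethod


class ProfileMigration(ABC):
    """Each vendor keeps its native format but can convert to and from a
    common interchange record: {"user_id": str, "features": [float, ...]}."""

    @abstractmethod
    def export_standard(self, native: bytes) -> dict:
        ...

    @abstractmethod
    def import_standard(self, standard: dict) -> bytes:
        ...


class Vendor1(ProfileMigration):
    """Toy vendor whose native format is JSON with its own key names."""

    def export_standard(self, native):
        raw = json.loads(native.decode("utf-8"))
        return {"user_id": raw["spk"], "features": list(raw["feat"])}

    def import_standard(self, standard):
        raw = {"spk": standard["user_id"], "feat": standard["features"]}
        return json.dumps(raw).encode("utf-8")


class Vendor2(ProfileMigration):
    """Toy vendor whose native format is a 'user|f1,f2,...' text line."""

    def export_standard(self, native):
        user_id, _, feats = native.decode("utf-8").partition("|")
        return {"user_id": user_id,
                "features": [float(x) for x in feats.split(",") if x]}

    def import_standard(self, standard):
        feats = ",".join(repr(f) for f in standard["features"])
        return ("%s|%s" % (standard["user_id"], feats)).encode("utf-8")


def migrate(native, source, target):
    """One-off, offline conversion; the run-time path of each vendor is untouched."""
    return target.import_standard(source.export_standard(native))


# Example: move a profile from Vendor1 to Vendor2 and back again.
v1_profile = json.dumps({"spk": "u42", "feat": [0.1, 0.2]}).encode("utf-8")
v2_profile = migrate(v1_profile, Vendor1(), Vendor2())
round_trip = migrate(v2_profile, Vendor2(), Vendor1())
```

Because conversion happens only at migration time, each vendor can keep whatever internal representation is fastest for recognition.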


Technology may be one of the constraints.
(Can we remove the effect of this constraint?)

(Different vendors of Speaker Verification may be using different combinations of
'voice print' (or some voice properties).)


But I think technology should not be the constraint.

Is there any other constraint as well?
(Like runtime performance, security, or any other...)


Waiting for reply,

Thanking you,

Regards
Raxit Sheth



