Conversational UI: Kai-Fu Lee's Presentation Slides


Conversational Computers: Always 10 Years Away?
Kai-Fu Lee
Corporate Vice President
Microsoft Corporation

Why Conversational Interface?
Speech: "invented" for interaction
  "Speech & language are a biological adaptation to communicate information. One of nature's engineering marvels." (Steven Pinker)
  "Vision evolved from the need to survive; speech evolved from the need to communicate." (Michael Dertouzos)
Benefits of "Conversational Interface"
  "To me, speech recognition will be a transforming capability when you can speak to your computer and it will understand what you're saying in context." (Gordon Moore)
  "Speech and natural language understanding are the key technologies that will have the most impact in the next 15 years." (Bill Gates)
Future UI visions assume conversational UI
  Apple's "Knowledge Navigator"
  Microsoft's "Information at Your Fingertips"
Science fiction movies assume conversational UI

But "Always" 10 Years Away
  1950: Jerome Wiesner predicted that machine translation might be possible by 1960.
  1957: Herbert Simon predicted that by 1967 machines would match human performance in many areas.
  1969: A US expert panel predicted "voice I/O will be in common use by 1978".
  1993: I predicted that by 2003 every PC would ship with speech recognition.
  1998: Gartner Group predicted that PC UI would assume voice input by 2003.

Decomposing the Prediction
  [Diagram: Natural Language Understanding, Speech Recognition, Text to Speech]

Talk Outline
  Speech recognition
  Text to speech
  Natural language understanding
  Why have we been a constant 10 years away?
  My 3-year & 10-year predictions

Fundamental Equation of Speech Recognition
  X is the acoustic waveform.
  W is the word string.
  A speech recognizer finds W such that
    W = argmax_W p(W | X) = argmax_W p(X | W) p(W)
  p(X | W) is the acoustic model.
  p(W) is the language model.

Statistical Modeling
Improving the acoustic model p(X | W)
Statistical approach:
  1. Build a detailed statistical model for each word.
     Detail could be based on phonetics, speaker, dialect, gender, or data-driven details, etc.
  2. Collect a lot more samples for each word.
     There is no data like more data.
  3. Go to step one.
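The fundamental equation of speech recognition stated in the talk, W = argmax p(X | W) p(W), can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not the recognizer the talk describes: the candidate word strings and their log-probabilities are made-up numbers, and `decode` is a hypothetical helper name.

```python
import math

def decode(acoustic_logp, language_logp):
    """Return the word string W that maximizes log p(X|W) + log p(W).

    acoustic_logp: dict mapping candidate word string -> log p(X | W)
    language_logp: dict mapping candidate word string -> log p(W)
    Candidates missing from the language model get probability zero.
    """
    return max(
        acoustic_logp,
        key=lambda w: acoustic_logp[w] + language_logp.get(w, -math.inf),
    )

# Assumed toy scores: the acoustic model alone slightly prefers the wrong string,
# and the language model prior corrects it.
acoustic_logp = {"recognize speech": -12.0, "wreck a nice beach": -11.5}
language_logp = {"recognize speech": -4.0, "wreck a nice beach": -9.0}

best = decode(acoustic_logp, language_logp)
print(best)
```

Working in log space turns the product p(X | W) p(W) into a sum, which avoids numerical underflow when many small probabilities are multiplied.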

Improving the language model p(W)
Statistical approach: trigrams.
  There is no data like more data.
  This helps recognition, not understanding.

Does Moore's Law Help Speech?
Moore's law is necessary but not sufficient.
  Just faster chips means recognition errors appear faster.
Super-Moore's law for speech:
  Faster processors/memory/disk +
  Getting more real data & feedback loop +
  Improved statistical models
Result:
  Moore's law doubles performance in 18 months.
  Super-Moore's law halves errors in 60 months.

Speech Recognition: Approaching Human Error Rate
  [Chart. Milestones: Microsoft licensed CMU Sphinx-II; Whisper in MSR; Speech in Office XP; Speech in Tablet/Office 11; Speech in Longhorn. Reference level: Human Error Rate]
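To put the two rates quoted in the talk side by side (performance doubling in 18 months, errors halving in 60 months), the sketch below converts each into a per-year factor. It assumes both trends are clean exponentials, which the slides do not claim precisely; `annual_factor` is an illustrative name.

```python
def annual_factor(factor, months):
    """Per-year multiplier for a quantity that changes by `factor` every `months` months."""
    return factor ** (12.0 / months)

# Moore's law: performance doubles every 18 months.
moore_per_year = annual_factor(2.0, 18)

# "Super-Moore's law": recognition errors halve every 60 months.
error_per_year = annual_factor(0.5, 60)

# Hardware gain accumulated during one 60-month error-halving period.
perf_gain_5y = 2.0 ** (60 / 18)

# Fraction of today's errors remaining after 10 years at the quoted rate.
error_after_10y = 0.5 ** (120 / 60)

print(f"performance grows about {moore_per_year:.2f}x per year")
print(f"errors shrink to about {error_per_year:.2f}x per year")
print(f"performance gain over 5 years: about {perf_gain_5y:.1f}x")
print(f"errors remaining after 10 years: {error_after_10y:.2f}")
```

Under these assumptions, roughly a tenfold hardware improvement accompanies each halving of the error rate, which is one way to read "necessary but not sufficient".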

Fundamental Approach for TTS
Concatenative synthesis: concatenation of pre-recorded speech units
Front-end
  Natural language processing (word breaking, POS)
  Determine emphasis to drive speed, pitch, loudness.
Back-end
  Collect a lot of data
  Carefully segment & store in a database
  Select the best units from the database
    Find statistical metrics that match "naturalness", e.g., smoothness rather than specific duration targets
    Use these metrics to select units

Text to Speech: Approaching Human Naturalness
  [Chart. Labels: Naturalness; Human Naturalness]
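The back-end step "select the best units from the database" can be illustrated with a small dynamic-programming search. This is a toy under loudly stated assumptions, not the system described in the talk: each unit is reduced to a single pitch value, the target cost is the distance from a desired pitch, and the join cost penalizes pitch jumps between neighboring units as a stand-in for "smoothness". The function and parameter names are hypothetical.

```python
def select_units(candidates, targets, join_weight=1.0):
    """Pick one unit per position, minimizing target cost plus join cost.

    candidates: list with one entry per position; each entry is a list of
                candidate unit pitch values (assumed representation)
    targets:    desired pitch for each position
    join_weight: how strongly pitch jumps between adjacent units are penalized
    """
    # best[i][j] = (cost of cheapest path ending in candidate j at position i,
    #               index of the predecessor candidate on that path)
    best = [[(abs(u - targets[0]), None) for u in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for u in candidates[i]:
            target_cost = abs(u - targets[i])
            # Cheapest predecessor, counting the pitch jump at the join.
            prev_cost, prev_j = min(
                (best[i - 1][k][0] + join_weight * abs(u - p), k)
                for k, p in enumerate(candidates[i - 1])
            )
            row.append((prev_cost + target_cost, prev_j))
        best.append(row)

    # Trace back from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return path[::-1]

# With no join penalty the unit closest to the target wins;
# with a join penalty the smoother sequence is preferred.
print(select_units([[100], [140, 104]], [100, 130], join_weight=0.0))
print(select_units([[100], [140, 104]], [100, 130], join_weight=1.0))
```

The `join_weight` parameter is the knob that trades hitting each target exactly against smooth transitions, which mirrors the slide's point about preferring smoothness to specific targets.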

ASR & TTS: Optimization & Engineering
  By leveraging Moore's law
    Exponential improvements from faster CPU + bigger database + better algorithm
  Approaching human abilities: not AI, but optimization, or "speech engineering"
  Still falls short of humans on:
    Learning, adaptation.
    Robustness to environment.
  But many applications just from ASR & TTS:
    ASR: Dictation, speech search, speaker verification, language learning
    TTS: Telephony info access, voice fonts, voice conversion

Natural Language Understanding Combines:
  Syntax (rules of human language)
    Nouns, verbs, etc. and how they combine
    "Book about a trip to Chicago" vs. "Book a trip to Chicago"
    Normalize linguistic variations.
  Semantics
    Meaning of the words
    "Book" means reserve a ticket; requires from-city, to-city, etc.
  Context (additional hints)
    Domain knowledge: No train from Hawaii to Chicago
    Statistics: "book" as a noun vs. "book" as a verb, e.g., "Book Chicago"
    Personal preferences: Where you live, your calendar, how you pay
    Model of time, urgency, presence
  Dialog (resolving ambiguity & determining intent)
    "Buy a book or book travel?"
    "What date would you like to travel?"

Applying Statistics to Understanding
Engineering approach:
  Focus on one domain, engineer all the knowledge.
  Collect data & create feedback loop to improve.
Applying Bayes' rule to understanding
  W is the word string.
  M is the meaning.
  A speech recognizer finds M such that
    M = argmax_M p(M | W) = argmax_M p(W | M) p(M)
  p(W | M) models all the ways to express a "meaning".
  p(M) is the semantic model.
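The Bayes-rule view of understanding, choosing the meaning M that maximizes p(W | M) p(M), can be made concrete with a toy classifier built around the talk's "book" example. Everything specific here is an assumption for illustration: the two meaning labels, the bag-of-words model for p(W | M), and the hand-picked probabilities do not come from the talk.

```python
import math

# p(M): assumed prior over two hypothetical meanings (the "semantic model").
PRIOR = {"reserve_travel": 0.7, "buy_book": 0.3}

# p(word | M): assumed unigram probabilities standing in for p(W | M).
WORD_PROB = {
    "reserve_travel": {
        "book": 0.3, "a": 0.2, "trip": 0.2, "to": 0.195, "chicago": 0.1, "about": 0.005,
    },
    "buy_book": {
        "book": 0.3, "about": 0.3, "a": 0.15, "trip": 0.1, "to": 0.1, "chicago": 0.05,
    },
}
UNSEEN = 1e-4  # floor probability for words a meaning has never been seen with

def understand(sentence):
    """Return the meaning M maximizing log p(M) + sum of log p(word | M)."""
    words = sentence.lower().split()

    def score(meaning):
        likelihood = sum(math.log(WORD_PROB[meaning].get(w, UNSEEN)) for w in words)
        return math.log(PRIOR[meaning]) + likelihood

    return max(PRIOR, key=score)

print(understand("Book a trip to Chicago"))
print(understand("Book about a trip to Chicago"))
```

A bag-of-words likelihood ignores word order, so it captures only the statistical part of the picture; the syntax, context, and dialog layers listed in the slides are not modeled here.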

What Is "Unsolved" by Statistics?
  Fusion of many sources of knowledge
  Domain-free understanding
    Instant context switching
  General knowledge
    History, sports, etc.
  Common sense reasoning
    "Least common of all senses"
  Ambiguity
    "Mr. Wright should write to Mrs. Wright right away"
  Emotion, humor, etc.
  Many of the challenges are "AI-complete"

Milestones in Speech Technology Research
  Timeline: 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2002
  Small vocabulary, acoustic phonetics-based
    Isolated words
    Filter-bank analysis; time normalization; dynamic programming
  Medium vocabulary, template-based
    Isolated words; connected digits; continuous speech
    Pattern recognition; LPC analysis; clustering algorithms
  Large vocabulary, statistical-based
    Connected words; continuous speech
    Hidden Markov models; stochastic language modeling
  Large vocabulary; syntax, semantics
    Continuous speech; speech understanding
    Stochastic language understanding; finite-state machines; statistical learning
  Very large vocabulary; semantics, multimodal dialog, TTS
    Spoken dialog; multiple modalities
    Concatenative synthesis; machine learning; mixed-initiative dialog
  Fueled by Moore's Law + Data + Research

Why Constant 10 Years Away?
  Immature technology
    Improving but only recently becoming useful
  Over-sold expectations
    Science fiction movies
    Effective (but not real product) demos
  Under-estimated risks
    User habits are hard to change
    Cost of developing speech applications is high
  Things are different now!
    Technology is ready
    And we have learned our lessons.

What Have We Learned?
  Don't make predictions... based on extrapolating from one data point!
  There is no data like more data.
    Real data & feedback, Moore's Law.
  Change the world, one domain at a time.
    Breakthrough from data + rigor is just fine.
  Start with users' comfort zone.
  Start with the greatest customer need & business opportunity.

3-Year Speech Prediction: Most Realistic Near-Term Speech Application
  [Chart. Labels: Market Opportunity; Technology Readiness; Customer Need; Poor Alternative]
  Applications shown:
    Meeting / Voicemail Transcription
    Mobile Devices / Cars
    Telephony / Call Center
    Accessibility
    Desktop Dictation
    Windows Commands & Applications / API

10-Year Speech Predictions
  [Timeline chart. Tracks: Telephony; Devices; Desktop; Voice data. Years: 2005, 2008, 2010, 2013]
  Entries, in extraction order:
    Dictation & new applications
    All phones have speech; mainstream app
    Accessibility & Asian dictation
    Mobility & automotive applications
    Structured search; delegation
    Call center mainstream app (unified msg)
    VOIP converges data & voice
    Central part of mobile UI; mobile dictation
    Key part of desktop UI; planning; federation
    Question answering
    Task-specific translation
    Home appliances
    Voicemail & meeting search
    Personal annotations & recording search
    Mining from audio data (e.g., call center)
    Voicemail & meeting transcription

Conclusion
  Speech technologies will follow Moore's Law
    Faster CPU + more data + better algorithms.
    Near-human quality possible in 7-10 years
  Natural language understanding is hard
    Domain-free reasoning & common sense hardest
    Truly human-level understanding likely elusive
  Smart, conversational systems will emerge
    2-3 years: telephony, multimodal, accessibility.
    7-10 years: intelligent assistance, meeting search/transcription, speech everywhere.

© 2001 Microsoft Corporation. All rights reserved.
