

Create Photo-Realistic Talking Face

Changbo Hu*

* This work was done while visiting Microsoft Research China, with Baining Guo and Bo Zhang.

Outline
- Introduction to the talking face
- Motivations
- System overview
- Techniques
- Conclusions

Introduction

What is a talking face?
- Face (lip) animation, driven by voice
- Applications

The process of building a talking face
- Face model
- Motion capture
- Mapping between audio and video
- Rendering: photo-realistic?

Literature
- Waters, 93: DECface, 2D wire-frame model
- Terzopoulos, 95: skin and muscle model
- Bregler, 97: Video Rewrite, sample image based
- T. S. Huang, 98: mesh model from range data
- Poggio, 98: MikeTalk, viseme morphing
- Guenter, 99: Making Faces, 3D from multiple cameras
- Zhengyou Zhang, 00: 3D face modeling from video through the epipolar constraint
- Cosatto, 00: planar quads model

Some face models
[Figure: example face models]

Motivations
Aim: a graphics interface for a conversational agent

- Photo-realistic
- Driven by Chinese
- Smooth connection between sentences
- Extended from "Video Rewrite"

System overview: pipeline of the system (1)
[Figure: analysis pipeline. Blocks: Video with sound; Images; Sound; Pose tracking; Phoneme segmentation; Annotation; Lip motion tracking; Train database]

System overview: pipeline of the system (2)
[Figure: synthesis pipeline. Blocks: New text; Wav sound; TTS system; Triphone sequence; Segmentation; Synthesized triphone sequence; Train database; Lip motion sequence; Rewrite to faces; Background sequence]

Techniques
Analysis:
- Audio process
- Image process
Synthesis:
- Lip image
- Background image
- Stitch together

Audio part: Sound Segmentation
- Given: the wav file and the script
- An HMM is used to train the segmentation system
- The wav file is segmented into a phoneme sequence

Example of the segmentation result:

label     start  end
SILOPEN   0      23
SILOPEN   24     42
s         43     61
if4       62     74
j         75     80
ia1       81     97
sh        98     109
ang1      110    121
y         122    130
e4        131    133
y         134    145
in2       146    154
h         155    164
ang2      165    194

Annotation with Phoneme
- Phonemes are used to annotate the video frames
- Each phoneme in a sentence corresponds to a short stretch of the video sequence
[Figure: a training sentence shown as aligned audio frames, video frames, and phoneme sequence; Phoneme 1 and Phoneme 2 each map to their own run of frames ("Frames for Phoneme 1", "Frames for Phoneme 2")]
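The annotation step can be sketched in a few lines of Python. This is only an illustration, not the system's actual code: it assumes the segmenter writes whitespace-separated "label start end" rows, and that start and end are inclusive frame indices (the slides do not state the unit). The names Segment, parse_segments, and annotate_frames are hypothetical.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    label: str   # phoneme label as written by the segmenter, e.g. "ia1"
    start: int   # first index covered by the phoneme (assumed inclusive)
    end: int     # last index covered by the phoneme (assumed inclusive)


def parse_segments(text: str) -> List[Segment]:
    """Parse whitespace-separated 'label start end' rows."""
    segments = []
    for line in text.strip().splitlines():
        label, start, end = line.split()
        segments.append(Segment(label, int(start), int(end)))
    return segments


def annotate_frames(segments: List[Segment]) -> List[str]:
    """Return one phoneme label per index, so frame i is annotated by labels[i]."""
    labels: List[str] = []
    for seg in segments:
        # The example segmentation is contiguous; reject gaps and overlaps.
        if seg.start != len(labels):
            raise ValueError(f"gap or overlap before {seg.label!r}")
        labels.extend([seg.label] * (seg.end - seg.start + 1))
    return labels


# First rows of the segmentation example from the slide.
EXAMPLE = """
SILOPEN 0 23
SILOPEN 24 42
s 43 61
if4 62 74
j 75 80
ia1 81 97
"""

segments = parse_segments(EXAMPLE)
frame_labels = annotate_frames(segments)
print(len(frame_labels), frame_labels[43], frame_labels[97])
```

With a per-frame label list like this, collecting "the frames for phoneme k" is a matter of slicing the video by each segment's start and end.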

Phoneme Distance Analysis
- Phoneme and triphone basics
- Chinese phonemes vs. English phonemes
- Definition of the distance metric
- Results

Phoneme Basics
- Phonemes are the basic elements of speech. All possible speech can be represented by combinations of phonemes.
  Examples: CH, JH, S, EH, EY, OY, AE, SIL
- A triphone is three consecutive phonemes. It represents not only pronunciation characteristics but also context information.
  Examples: T-IY-P, IY-P-AA, P-AA-T

Chinese Phoneme vs. English
- Chinese phonemes fall into two basic groups: initials and finals.
  Initials: B, P, M, F, ...
  Finals: a3, o1, e2, eng3, iang4, ue5, ...
- Each Chinese final has 5 tones: 1, 2, 3, 4, 5.
  Different tones: a1, a2, a3, a4, a5.
- Chinese finals are actually not basic elements of speech.
  For example: iang1, iao1, uang1, iong1
- The Chinese phoneme set is much larger than the English one.

Phoneme Distance Analysis
- Define the distance between any two phonemes.
- Since we synthesize only video, not sound, tone is ignored.
- Lip-shape motion is the core element of the distance metric.
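Two of the ideas above, ignoring tone and grouping phonemes into triphones, are easy to show in code. The sketch below is illustrative only: it assumes tone is written as a trailing digit 1 to 5 (as in a1 ... a5), and the helper names strip_tone and triphones are hypothetical.

```python
import re
from typing import List, Tuple

Triphone = Tuple[str, str, str]


def strip_tone(phoneme: str) -> str:
    """Drop a trailing tone digit 1-5: 'ang1' -> 'ang'; 'sh' is unchanged."""
    return re.sub(r"[1-5]$", "", phoneme)


def triphones(phonemes: List[str]) -> List[Triphone]:
    """Every run of three consecutive phonemes, in order."""
    return [
        (phonemes[i], phonemes[i + 1], phonemes[i + 2])
        for i in range(len(phonemes) - 2)
    ]


# English example from the slide: T IY P AA T gives T-IY-P, IY-P-AA, P-AA-T.
english = triphones(["T", "IY", "P", "AA", "T"])
print(["-".join(t) for t in english])

# Chinese labels with tone digits; tone is removed before comparing lip shapes.
toneless = [strip_tone(p) for p in ["sh", "ang1", "y", "e4"]]
print(toneless)
```

Stripping tone first also shrinks the set of distinct finals that the distance analysis has to cover.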

Phoneme Distance Analysis
[Figure: the lip-motion video clips recorded for Phoneme 1 and for Phoneme 2 (Video 1, Video 2, ...), shown before and after time alignment, and the resulting average video of each phoneme]
1. Time-align the videos of each phoneme to a uniform length.
2. Average the aligned videos to get one average video per phoneme.
3. Compare the two aligned average videos. Doing this for every pair generates the distance matrix of the whole phoneme set.

Image part: Pose Tracking
- Assume a plane model for the face
- A standard minimization method finds the transform matrix (affine transform) (Black, 95)
- A mask is used to restrict tracking to the part of the face that is of interest
[Figure: template picture and mask image]

Pose Tracking
- Motion prediction using parameters with physical meaning

Pose Tracking
Some tracking results:
[Figure: pose tracking results]

Lip Motion Tracking
- Uses Eigen Points (Covell, 91)
- Feature points include the jaw, lips, and teeth
- The training database is specified manually
- Automatic tracking runs through all pose-tracked images

Lip Motion Tracking
[Figure: train database (hand-labeled) and automatic tracking results]

Synthesizing New Sentences
- New text is converted to wav by the TTS system
- The wav is segmented into a phoneme sequence
- DP is used to find an optimal video sequence from the training database
- The triphone videos are time-aligned and stitched together
- The lip sequence is transformed and pasted onto the background faces

Lip Sequence Synthesis
[Figure: optimal phoneme sequences. Labels: Triphone 1 to Triphone 9; Triphone A, B, C; New phoneme sequences]

Dynamic Programming
[Figure: graph from Begin to End through candidate nodes Triphone 1 to Triphone 5]

Edge Cost Definition
Two parts:
1. Phoneme distance: the distances of the 3 phonemes added together
2. Lip shape distance for the overlapping portion of the triphone videos
The two parts are combined as a weighted sum.

Background Video Generation
- The background is a video sequence of the virtual character speaking something else
- Similarity measurement of the background
- Select a "standard frame": the frame with the maximal number of frames similar to it
- Filter out the frames with jerkiness

Stitch the Time-Aligned Result to Background Faces
- Write back with a mask
- Transform the synthesized lip to the background face
[Figure: mask image for the write-back operation; original background frame; write-back result of the same frame]

More video results

Conclusion and Future Work
- Pose tracking and lip motion tracking
- Size of the train database
- Talking face with expression
- Real-time generation?
- Fast modeling for different persons
- Animation

Thank you
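The slides name dynamic programming and a two-part, weighted edge cost but give no further detail, so the following is a minimal sketch under stated assumptions rather than the system's actual search. It assumes that every database clip is a candidate at every position, that the phoneme distance is charged per chosen clip, and that the lip-shape distance is charged on the transition between consecutive clips. The names select_clips, triphone_distance, phoneme_dist, lip_dist, w_phoneme, and w_lip are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Triphone = Tuple[str, str, str]


def triphone_distance(a: Triphone, b: Triphone,
                      phoneme_dist: Callable[[str, str], float]) -> float:
    """Part 1 of the cost: the three phoneme distances added together."""
    return sum(phoneme_dist(x, y) for x, y in zip(a, b))


def select_clips(targets: Sequence[Triphone],
                 database: Sequence[Triphone],
                 phoneme_dist: Callable[[str, str], float],
                 lip_dist: Callable[[int, int], float],
                 w_phoneme: float = 1.0,
                 w_lip: float = 1.0) -> List[int]:
    """Viterbi-style DP: pick one database clip index per target triphone."""
    n = len(database)

    def node(t: int, j: int) -> float:
        return w_phoneme * triphone_distance(targets[t], database[j], phoneme_dist)

    cost = [node(0, j) for j in range(n)]      # best cost of a path ending in clip j
    back: List[List[int]] = []                 # back-pointers per step
    for t in range(1, len(targets)):
        new_cost, pointers = [], []
        for j in range(n):
            # Part 2 of the cost: lip-shape distance on the overlap of clips i and j.
            best_i = min(range(n), key=lambda i: cost[i] + w_lip * lip_dist(i, j))
            new_cost.append(cost[best_i] + w_lip * lip_dist(best_i, j) + node(t, j))
            pointers.append(best_i)
        cost = new_cost
        back.append(pointers)

    # Trace the cheapest path backwards from the best final clip.
    j = min(range(n), key=cost.__getitem__)
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    return path[::-1]


# Toy data: clips 0, 1, 2 were recorded consecutively, clip 3 is a distractor.
database = [("T", "IY", "P"), ("IY", "P", "AA"), ("P", "AA", "T"), ("IY", "B", "AA")]
targets = [("T", "IY", "P"), ("IY", "P", "AA"), ("P", "AA", "T")]
path = select_clips(
    targets,
    database,
    phoneme_dist=lambda x, y: 0.0 if x == y else 1.0,
    lip_dist=lambda i, j: 0.0 if j == i + 1 else 0.5,
)
print(path)
```

In a real system the candidate set per position would likely be pruned to clips whose triphone is close to the target, since the full search costs on the order of (number of targets) x (database size squared).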

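The phoneme distance procedure from the analysis part (time-align, average, compare) can be sketched as follows. Several choices here are assumptions that the slides do not specify: each video frame is reduced to a small lip-shape feature vector, time alignment is done by linear interpolation, and two average videos are compared by the mean Euclidean distance between corresponding frames. All function names are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Frame = List[float]   # assumed lip-shape feature vector for one video frame
Clip = List[Frame]    # one recorded instance of a phoneme


def resample(clip: Clip, length: int) -> Clip:
    """Time-align a clip to `length` frames by linear interpolation."""
    if length == 1 or len(clip) == 1:
        return [list(clip[0]) for _ in range(length)]
    out = []
    for i in range(length):
        pos = i * (len(clip) - 1) / (length - 1)   # position in the source clip
        lo = int(math.floor(pos))
        hi = min(lo + 1, len(clip) - 1)
        w = pos - lo
        out.append([(1 - w) * a + w * b for a, b in zip(clip[lo], clip[hi])])
    return out


def average_clip(clips: List[Clip], length: int) -> Clip:
    """Average all time-aligned clips of one phoneme, frame by frame."""
    aligned = [resample(c, length) for c in clips]
    n = len(aligned)
    return [
        [sum(vals) / n for vals in zip(*(clip[t] for clip in aligned))]
        for t in range(length)
    ]


def clip_distance(a: Clip, b: Clip) -> float:
    """Mean Euclidean distance between corresponding frames of two clips."""
    return sum(math.dist(fa, fb) for fa, fb in zip(a, b)) / len(a)


def distance_matrix(samples: Dict[str, List[Clip]],
                    length: int = 5) -> Dict[Tuple[str, str], float]:
    """Distance between the average videos of every pair of phonemes."""
    avg = {p: average_clip(clips, length) for p, clips in samples.items()}
    return {(p, q): clip_distance(avg[p], avg[q]) for p in avg for q in avg}


# Toy data with a one-dimensional feature (for instance, mouth opening).
samples = {
    "a": [[[0.0], [1.0]], [[0.0], [0.5], [1.0]]],   # two clips of different length
    "m": [[[0.0], [0.0], [0.0]]],                   # lips stay closed
}
D = distance_matrix(samples)
print(D[("a", "m")], D[("a", "a")])
```

A matrix like D is what a per-phoneme lookup such as the phoneme distance term of the edge cost would read from.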