Audio-Driven Talking Face Generation with Multi-Scale Visual Enhancement

Keywords: audio-driven; face generation; visual enhancement; visual quality
CLC number: TP391 Document code: A
DOI: 10.7652/xjtuxb202506017 Article ID: 0253-987X(2025)06-0167-10
Audio-Driven Talking Face Generation with Multi-Scale Visual Enhancement
YANG Xiangyan1, LIANG Huihui2, CHEN Xi, LI Fan2
(1. School of Computer Science and Technology, Xinjiang University, Urumqi 830046, China; 2. Faculty of Electronic and Information Engineering, Xi'an Jiaotong University, Xi'an 710049, China)
Abstract: To address the limited video clarity and realism of existing audio-driven talking face generation methods, this paper proposes VisClearTalk, an end-to-end talking face generation method that incorporates multi-scale visual enhancement through a face decoder equipped with a visual enhancement module. First, the face encoder processes a random reference frame and a prior frame whose lower half of the face is occluded to extract facial features. Simultaneously, the audio encoder extracts features from the audio to guide facial content generation. Subsequently, the face decoder integrates these features and performs an initial reconstruction of the facial images through convolutional modules. Finally, the visual enhancement module employs multi-scale convolution and residual fusion to further enhance the details and edge information of the lower face region, improving the visual quality of the generated talking face videos. The VisClearTalk model was validated experimentally on public lip-reading datasets; both quantitative and qualitative results show that introducing the visual enhancement module improves the fineness and realism of the facial visual content, enabling the generation of clear and natural talking face videos. In terms of performance metrics, the peak signal-to-noise ratio reached 34.349 dB, the structural similarity reached 0.933, and the learned perceptual image patch similarity was reduced to 0.040. The VisClearTalk model offers a viable solution for current talking face video generation needs.
Keywords: audio-driven; talking face generation; visual enhancement; visual quality
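The abstract describes a visual enhancement module built from multi-scale convolution and residual fusion. The following is a minimal pure-Python sketch of that general idea, not the VisClearTalk implementation: the function names `box_blur` and `enhance`, the kernel sizes (3, 5, 7), the weight `alpha`, and the use of fixed averaging kernels in place of learned convolution weights are all assumptions made for illustration.

```python
def box_blur(img, k):
    """Convolve a 2D image (list of rows) with a k x k averaging kernel.

    Borders are handled by clamping coordinates (edge replication).
    The fixed averaging kernel is a stand-in for learned weights (assumption).
    """
    h, w = len(img), len(img[0])
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0.0
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    total += img[yy][xx]
            out[y][x] = total / (k * k)
    return out


def enhance(img, kernel_sizes=(3, 5, 7), alpha=0.5):
    """Hypothetical multi-scale enhancement with a residual connection.

    Each branch filters the input at one scale; the branches are averaged,
    and the detail signal (input minus fused branches) is added back onto
    the input, so flat regions pass through and edges are accentuated.
    """
    h, w = len(img), len(img[0])
    branches = [box_blur(img, k) for k in kernel_sizes]
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            fused = sum(b[y][x] for b in branches) / len(branches)
            detail = img[y][x] - fused
            row.append(img[y][x] + alpha * detail)  # residual fusion
        out.append(row)
    return out


# A vertical step edge: dark on the left, bright on the right.
step = [[0.0, 0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(6)]
sharpened = enhance(step)
```

In this sketch the residual path leaves uniform regions untouched, while pixels on either side of the step are pushed further apart, which is the kind of edge emphasis the abstract attributes to the module. A trained model would learn the branch kernels and the fusion weights rather than fixing them.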
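Audio-driven talking face generation is one of the important research topics in the audio-visual field [], as it organically integrates visual and auditory information and enhances human understanding and perception of that information.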