Automated Audio Captioning Method Based on Multi-modal Representation Learning

Tan Liwen1, Zhou Yi1, Liu Yin1, Cao Yin2+
(1. School of Communication & Information Engineering, Chongqing University of Posts & Telecommunications, Chongqing 400065, China; 2. Dept. of Intelligent Science, Xi'an Jiaotong-Liverpool University, Suzhou Jiangsu 215000, China)
Abstract: Modality discrepancies have perpetually posed significant challenges for automated audio captioning (AAC), as they do across all multi-modal research domains. Helping models comprehend text information plays a pivotal role in establishing a seamless connection between the text and audio modalities. Recent studies have concentrated on narrowing the disparity between these two modalities via contrastive learning. However, bridging the gap merely by employing a simple contrastive loss function is challenging. To reduce the influence of modal differences and to improve the model's utilization of both modalities' features, this paper proposes SimTLNet, an audio captioning method based on multi-modal representation learning: it introduces a novel representation module, TRANSLATOR, constructs a twin representation structure, and jointly optimizes the model weights through contrastive learning and momentum updates, which enables the model to concurrently learn the common high-dimensional semantic information shared between the audio and text modalities. The proposed method achieves METEOR, CIDEr, and SPIDEr-FL scores of 0.251, 0.782, and 0.480 on the AudioCaps dataset, and 0.187, 0.475, and 0.303 on the Clotho V2 dataset, respectively, which is comparable with state-of-the-art methods and effectively bridges the difference between the two modalities.
Key words: audio captioning; representation learning; contrastive learning; modality discrepancies; twin network
0 Introduction
Automated audio captioning (AAC) is a multi-modal generation task that combines the audio and text modalities to generate a descriptive caption for an audio clip [1].
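As a rough illustration of the mechanism summarized in the abstract, namely a twin representation structure whose two branches are tied by a contrastive loss while one branch is updated by momentum rather than back-propagation, the following minimal PyTorch sketch shows one common way such a setup is wired. All names here (TwinRepresentation, contrastive_loss, the encoder argument) are hypothetical placeholders for illustration, not the authors' SimTLNet or TRANSLATOR implementation.

```python
# Illustrative sketch only: a twin-branch setup with a momentum-updated
# (EMA) branch plus a symmetric InfoNCE contrastive loss. Module and
# function names are hypothetical, not the paper's actual code.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwinRepresentation(nn.Module):
    def __init__(self, encoder: nn.Module, momentum: float = 0.999):
        super().__init__()
        self.online = encoder                  # updated by gradient descent
        self.target = copy.deepcopy(encoder)   # updated only by momentum (EMA)
        for p in self.target.parameters():
            p.requires_grad = False
        self.m = momentum

    @torch.no_grad()
    def momentum_update(self):
        # EMA update: target <- m * target + (1 - m) * online
        for po, pt in zip(self.online.parameters(), self.target.parameters()):
            pt.data.mul_(self.m).add_(po.data, alpha=1.0 - self.m)

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    # Symmetric InfoNCE over a batch: matching (audio, text) pairs are
    # positives; all other pairings in the batch serve as negatives.
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature
    labels = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels))
```

Under this sketch, a training step would encode one modality with the online branch and the other with the target branch, apply contrastive_loss, back-propagate through the online branch only, and then call momentum_update() so the target branch slowly tracks it.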