Automated Audio Captioning Method Based on Multi-modal Representation Learning

Tan Liwen1, Zhou Yi1, Liu Yin1, Cao Yin2+
(1. School of Communication & Information Engineering, Chongqing University of Posts & Telecommunications, Chongqing 400065, China; 2. Dept. of Intelligent Science, Xi'an Jiaotong-Liverpool University, Suzhou Jiangsu 215000, China)
Abstract: Modality discrepancies have perpetually posed significant challenges for automated audio captioning (AAC), as they do across all multi-modal research domains. Helping models comprehend text information plays a pivotal role in establishing a seamless connection between the text and audio modalities. Recent studies have concentrated on narrowing the disparity between these two modalities via contrastive learning. However, bridging the gap merely by employing a simple contrastive loss function is challenging. To reduce the influence of modal differences and to improve the model's utilization of both modalities' features, this paper proposes SimTLNet, an audio captioning method based on multi-modal representation learning: it introduces a novel representation module, TRANSLATOR, constructs a twin representation structure, and jointly optimizes the model weights through contrastive learning and momentum updates, which enables the model to concurrently learn the common high-dimensional semantic information shared between the audio and text modalities. The proposed method achieves METEOR, CIDEr, and SPIDEr-FL scores of 0.251, 0.782, and 0.480 on the AudioCaps dataset, and 0.187, 0.475, and 0.303 on the Clotho V2 dataset, respectively, which is comparable with state-of-the-art methods and effectively bridges the difference between the two modalities.
Key words: audio captioning; representation learning; contrastive learning; modality discrepancies; twin network
0 Introduction
Automated audio captioning (AAC) is a multi-modal generation task that combines the audio and text modalities to generate a descriptive caption for an audio clip [1].
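As a rough illustration of the mechanism summarized in the abstract, namely a twin representation structure whose two branches are tied by a contrastive loss while one branch is updated by momentum rather than back-propagation, the following minimal PyTorch sketch shows one common way such a setup is wired. All names here (TwinRepresentation, contrastive_loss, the encoder argument) are hypothetical placeholders for illustration, not the authors' SimTLNet or TRANSLATOR implementation.

```python
# Illustrative sketch only: a twin-branch setup with a momentum-updated
# (EMA) branch plus a symmetric InfoNCE contrastive loss. Module and
# function names are hypothetical, not the paper's actual code.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwinRepresentation(nn.Module):
    def __init__(self, encoder: nn.Module, momentum: float = 0.999):
        super().__init__()
        self.online = encoder                  # updated by gradient descent
        self.target = copy.deepcopy(encoder)   # updated only by momentum (EMA)
        for p in self.target.parameters():
            p.requires_grad = False
        self.m = momentum

    @torch.no_grad()
    def momentum_update(self):
        # EMA update: target <- m * target + (1 - m) * online
        for po, pt in zip(self.online.parameters(), self.target.parameters()):
            pt.data.mul_(self.m).add_(po.data, alpha=1.0 - self.m)

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    # Symmetric InfoNCE over a batch: matching (audio, text) pairs are
    # positives; all other pairings in the batch serve as negatives.
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature
    labels = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels))
```

Under this sketch, a training step would encode one modality with the online branch and the other with the target branch, apply contrastive_loss, back-propagate through the online branch only, and then call momentum_update() so the target branch slowly tracks it.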