Audio-visual segmentation network with multi-dimensional cross-attention fusion

doi：10.19734/j.issn.1001-3695.2024.08.0369

Audio-visual segmentation network with multi-dimensional cross-attention fusion

Li Fanfan, Zhang Yuanyuan, Zhang Yonglong, Zhu Junwu† (School of Information Engineering, Yangzhou University, Yangzhou Jiangsu 225100, China)

Abstract: Audio-visual segmentation (AVS) aims to locate and accurately segment the sounding objects in images based on both visual and auditory information. While most existing research focuses primarily on exploring methods for audio-visual information fusion, there is insufficient in-depth exploration of fine-grained audio-visual analysis, particularly in aligning continuous audio features with spatial pixel-level information. Therefore, this paper proposes an audio-visual segmentation attention fusion (AVSAF) method based on contrastive learning. Firstly, the method uses a multi-head cross-attention mechanism and a memory token to construct an audio-visual token fusion module that reduces the loss of multi-modal information. Secondly, it introduces contrastive learning to minimize the discrepancy between audio and visual features, enhancing their alignment. A dual-layer decoder is then employed to accurately predict and segment the target's position. Finally, a large number of experiments were carried out on the S4 and MS3 sub-datasets of the AVSBench-Object dataset. The J value increased by 3.04 and 4.71 percentage points respectively, and the F value increased by 2.4 and 3.5 percentage points respectively, which demonstrates the effectiveness of the proposed method in audio-visual segmentation tasks.

Key words: audio-visual segmentation; multi-modal; contrastive learning; attention mechanism
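The abstract states that contrastive learning is used to reduce the discrepancy between audio and visual features, but the preview does not give the loss itself. As a rough illustration of the general idea only, the sketch below uses a generic symmetric InfoNCE-style loss over paired feature vectors. The function names, the temperature value, and the choice of InfoNCE are all assumptions made for this example and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def info_nce(anchors, candidates, temperature=0.1):
    """Mean cross-entropy where anchors[i] is expected to match candidates[i].

    Hypothetical helper: every other candidate in the batch acts as a negative.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [dot(anchor, c) / temperature for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_denom - logits[i]
    return total / len(anchors)


def audio_visual_contrastive_loss(audio_feats, visual_feats, temperature=0.1):
    """Symmetric loss: audio-to-visual plus visual-to-audio, averaged.

    Assumption: one pooled feature vector per clip for each modality.
    """
    audio = [l2_normalize(v) for v in audio_feats]
    visual = [l2_normalize(v) for v in visual_feats]
    return 0.5 * (info_nce(audio, visual, temperature)
                  + info_nce(visual, audio, temperature))
```

With toy inputs, correctly paired features such as `[[1.0, 0.0], [0.0, 1.0]]` and `[[0.9, 0.1], [0.1, 0.9]]` give a loss near zero, while swapping the order of the visual vectors gives a much larger loss. This is the behaviour that pushes matching audio and visual features toward each other during training.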

0 Introduction

Human perception is multi-dimensional, encompassing sight, hearing, touch, taste, and smell.

Application Research of Computers, 2025, Issue 6
