特黄三级爱爱视频|国产1区2区强奸|舌L子伦熟妇aV|日韩美腿激情一区|6月丁香综合久久|一级毛片免费试看|在线黄色电影免费|国产主播自拍一区|99精品热爱视频|亚洲黄色先锋一区

反向聚焦細(xì)粒度多模態(tài)語義對(duì)齊的視頻字幕模型

  • 打印
  • 收藏
收藏成功


打開文本圖片集

中圖分類號(hào):TP391 文獻(xiàn)標(biāo)志碼:A 文章編號(hào):1001-3695(2025)07-009-1986-08

doi:10.19734/j. issn.1001-3695.2024.11.0492

Abstract:Existingvideocaptioningoftenintroducemultimodal informationtoassistmodelsinextractingcriticalandfinegrained details fromcomplex anddynamic visual content.However,these methods tendtooverlook thesemantic gapscaused by representationaldiferencesamong modalities.Tobridgethesegaps,facilitateefectivecross-modalalignmentandeficientfusion,andenancetheextractionoffine-grainedsmanticinformatio,thispperproposedareverse-focusfingranedultio dal semanticalignmentforvideocaptioning(RM4Cap).Thismodelcombinedanimage-textpaircorpusand facilitatedsemanticalignmentbetweenvideoandimage,indirectlyaligningvideorepresentationswithtextintheimage-textpairs.Anditdesignedareverse attention focusing algorithm to suppress redundant scene informationwhile highlighting inconspicuous objects and their interactions.Experimentsconductedonthe MSVDand MSRVTTdatasetsshow thatthe model significantlyoutperforms existing methods in metricssuch as CIDErand BLEU-4.It efectivelyresolves thealignmentchallenges andredundancy issues in multimodal fusion,further demonstrating its ability to narrow the cross-modal semantic gap.

Key words:video captioning;multimodal; reverse attention;semantic alignment; semantic gap

0 引言

視頻字幕是一個(gè)連接視覺和語言并將視覺內(nèi)容以自然語言描述的跨模態(tài)任務(wù)。(剩余21688字)

目錄
monitor