Paper-Cut Image Classification Based on CLIP Text Feature Enhancement

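Keywords: vision-language large model; paper-cut classification; few-shot classification; modality fusion; prompt learning
CLC number: TP391  Document code: A  Article ID: 1001-3695(2025)07-010-1994-09  doi: 10.19734/j.issn.1001-3695.2024.11.0485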
Abstract: To address the challenges of large modality gaps between text and image features and insufficient class prototype representation in paper-cut image classification, this paper proposed a CLIP-based text feature enhancement method (CLIP visual-text enhancer, C-VTE). The method extracted text features through manual prompt templates, designed a visual-text enhancement module, and employed cross-attention and proportional residual connections to fuse image and text features, thereby reducing modality discrepancy and enhancing the expressive ability of category features. Experiments on a paper-cut dataset and four public datasets including Caltech101 validated its effectiveness. For base-class classification on the paper-cut dataset, C-VTE achieved 72.51% average accuracy, outperforming existing methods by 3.14 percentage points. In few-shot classification tasks on public datasets, it attained 84.78% average accuracy with a 2.45 percentage-point improvement. Ablation experiments demonstrate that both the modality fusion module and the proportional residual components contribute significantly to performance improvement. The method offers novel insights for efficient adaptation of vision-language models in downstream classification tasks, and is particularly suited for few-shot learning and base-class dominated scenarios.
Key words: visual language large model; paper-cut classification; few-shot classification; multimodal fusion; prompt learning
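
The fusion step summarized in the abstract (cross-attention followed by a proportional residual connection) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the single-head attention, the random projection matrices, and the mixing ratio `alpha` are all assumptions made for illustration, and real CLIP features are replaced by random vectors.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def l2_normalize(x, axis=-1, eps=1e-12):
    # Scale each row to unit length, as is usual for CLIP-style features.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def enhance_text_features(text_feats, image_feats, w_q, w_k, w_v, alpha=0.2):
    """Hypothetical visual-text enhancement step (assumed form).

    Text features act as queries and image features as keys/values.
    The attended result is mixed back into the original text features
    with ratio alpha (the assumed "proportional residual").
    """
    q = text_feats @ w_q                            # (C, d)
    k = image_feats @ w_k                           # (N, d)
    v = image_feats @ w_v                           # (N, d)
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)   # (C, N)
    fused = attn @ v                                # (C, d)
    return l2_normalize(alpha * fused + (1.0 - alpha) * text_feats)

def classify(image_feats, class_feats, temperature=100.0):
    # Cosine-similarity logits between images and class features.
    logits = temperature * l2_normalize(image_feats) @ class_feats.T
    return softmax(logits, axis=-1)

# Random stand-ins for CLIP text and image features (assumption).
rng = np.random.default_rng(0)
num_classes, num_images, dim = 5, 8, 16
text_feats = l2_normalize(rng.normal(size=(num_classes, dim)))
image_feats = l2_normalize(rng.normal(size=(num_images, dim)))
w_q, w_k, w_v = (rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(3))

enhanced = enhance_text_features(text_feats, image_feats, w_q, w_k, w_v, alpha=0.2)
probs = classify(image_feats, enhanced)
print(enhanced.shape, probs.shape)
```

With `alpha=0` the sketch falls back to the unmodified text features, which is one way to read the role of the residual ratio: it controls how much visual information is injected into each class feature.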
0 Introduction
In the field of intangible cultural heritage, paper-cuts exist mainly in the form of images, and they are complex in category and large in number.