Burning the Midnight Oil: 80+ High-Scoring Multimodal Machine Learning Papers with Source Code, Including the Latest 2023 Work!
Multimodal Machine Learning (MMML) is a machine learning approach aimed at complex tasks, such as multimodal sentiment analysis and cross-lingual image search, that require jointly considering data from multiple modalities and extracting useful information from them.
Thanks to the steadily improving performance of large language, vision, video, and audio models, multimodal machine learning has been gaining momentum. It helps AI understand its surroundings more comprehensively and deeply, improves model generalization and robustness, and promotes exchange and integration across disciplines.
Along the way, multimodal machine learning research faces challenges on many fronts. For students hoping to publish, understanding these challenges and the existing solutions is essential: it lets you innovate on top of prior work and find your own idea quickly.
To help you get your own paper out, I have once again burned the midnight oil to compile 81 multimodal machine learning papers, organized around six core technical challenges: representation, alignment, reasoning, generation, transfer, and quantification. For reasons of space, each category gets only a brief introduction here.
Scan the QR code to add Xiaoxiang and reply "多模態(tài)ML"
to receive all 81 papers and their source code for free

Representation (12 papers)
1.Multiplicative Interactions and Where to Find Them
Summary: The paper examines the role of multiplicative interactions in neural network design, presenting them as a unifying framework that describes many architectural patterns, including gating, attention layers, hypernetworks, and dynamic convolutions. The authors argue that multiplicative interaction layers enrich the function class a network can represent and provide a strong inductive bias when fusing multiple information streams or performing conditional computation. Through applications to large-scale reinforcement learning and sequence modeling tasks, they demonstrate the potential and effectiveness of multiplicative interactions: they can improve performance and suggest new ways of designing network architectures. A minimal sketch of such a layer is given after this list.

2.Tensor fusion network for multimodal sentiment analysis
3.On the Benefits of Early Fusion in Multimodal Representation Learning
4.Extending long short-term memory for multi-view structured learning
5.Devise: A deep visual-semantic embedding model
6.Learning transferable visual models from natural language supervision
7.Order-embeddings of images and language
8.Learning Concept Taxonomies from Multi-modal Data
9.Does my multimodal model learn cross-modal interactions? It’s harder to tell than you might think!
10.Learning factorized multimodal representations
11.Multimodal clustering networks for self-supervised learning from unlabeled videos
12.Deep multimodal subspace clustering networks
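
As promised above, here is a minimal sketch of a multiplicative interaction layer (my own illustration in PyTorch, not the authors' released code): one input stream generates the weights and bias applied to the other stream, so gating and hypernetwork-style conditioning fall out as special cases. All dimensions are toy values.

# Minimal multiplicative interaction (MI) layer: stream z conditions the
# transform applied to stream x.
import torch
import torch.nn as nn

class MultiplicativeInteraction(nn.Module):
    def __init__(self, x_dim, z_dim, out_dim):
        super().__init__()
        self.weight_gen = nn.Linear(z_dim, x_dim * out_dim)  # produces the bilinear term z-dependent weights
        self.bias_gen = nn.Linear(z_dim, out_dim)             # covers the linear-in-z term
        self.x_dim, self.out_dim = x_dim, out_dim

    def forward(self, x, z):
        # (batch, out_dim, x_dim) weight tensor generated from z
        W = self.weight_gen(z).view(-1, self.out_dim, self.x_dim)
        b = self.bias_gen(z)
        # Multiplicative term: x is transformed by z-dependent weights, then shifted.
        return torch.bmm(W, x.unsqueeze(-1)).squeeze(-1) + b

# Toy usage: fuse a 16-d "visual" feature with an 8-d "language" feature.
x, z = torch.randn(4, 16), torch.randn(4, 8)
print(MultiplicativeInteraction(16, 8, 32)(x, z).shape)  # torch.Size([4, 32])
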
Alignment (10 papers)
1.Visual Referring Expression Recognition: What Do Systems Actually Learn?
Summary: The paper presents an empirical analysis of state-of-the-art referring expression recognition systems and finds that they may ignore linguistic structure, relying instead on shallow correlations introduced by data selection and annotation. For example, a system trained and tested on the input image alone, without the referring expression, reaches 71.2% precision within its top-2 predictions, and a system that predicts only the object category from its input reaches 84.2% top-2 precision. These results show that, to make real progress on language-grounded tasks, it is essential to analyze carefully what models are actually learning and how the data are constructed. A toy sketch of this kind of input-ablation evaluation is given after this list.

2.Unsupervised multimodal representation learning across medical images and reports
3.Clip-event: Connecting text and images with event structures
4.Learning by aligning videos in time
5.Multimodal adversarial network for cross-modal retrieval
6.Videobert: A joint model for video and language representation learning
7.Visualbert: A simple and performant baseline for vision and language
8.Decoupling the role of data, attention, and losses in multimodal transformers
9.Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks
10.MTAG: Modal-Temporal Attention Graph for Unaligned Human Multimodal Language Sequences
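
To illustrate the diagnostic described under paper 1 (this is my own stand-in, not the paper's code): evaluate an "expression-blind" baseline that scores candidate regions without ever reading the referring expression, and report precision within the top-2 predictions. The data and scoring function below are hypothetical placeholders.

# Expression-blind baseline evaluation: how well can a model do while ignoring
# the referring expression entirely?
import numpy as np

def precision_at_k(scores, gold_index, k=2):
    # gold region counts as correct if it is ranked within the top-k scores
    topk = np.argsort(scores)[::-1][:k]
    return float(gold_index in topk)

rng = np.random.default_rng(0)
hits = []
for _ in range(1000):                      # hypothetical evaluation examples
    n_candidates = rng.integers(3, 10)     # candidate regions in the image
    gold = rng.integers(0, n_candidates)   # index of the referred region
    # A blind baseline would score regions from image features only; random
    # scores stand in for that model here.
    scores = rng.random(n_candidates)
    hits.append(precision_at_k(scores, gold, k=2))

print("top-2 precision of the expression-blind baseline:", np.mean(hits))
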
Reasoning (18 papers)
1.Neural module networks
Summary: The paper describes a procedure for constructing and training neural module networks, which compose jointly trained neural "modules" into deep networks for question answering. The approach decomposes a question into its linguistic substructure and uses that structure to dynamically instantiate a module network (with reusable components for recognizing dogs, classifying colors, and so on); the resulting composite network is trained jointly. Evaluated on two challenging visual question answering datasets, the method achieves the best results on both the VQA natural-image dataset and a new dataset of complex questions about abstract shapes. A toy sketch of composing such modules is given after this list.

2.Dynamic memory networks for visual and textual question answering
3.A Survey of Reinforcement Learning Informed by Natural Language
4.Mfas: Multimodal fusion architecture search
5.Multi-view intact space learning
6.Neuro-Symbolic Visual Reasoning: Disentangling Visual from Reasoning
7.Probabilistic neural symbolic models for interpretable visual question answering
8.Learning by abstraction: The neural state machine
9.Socratic models: Composing zero-shot multimodal reasoning with language
10.Vqa-lol: Visual question answering under the lens of logic
11.Multimodal logical inference system for visual-textual entailment
12.Towards causal vqa: Revealing and reducing spurious correlations by invariant and covariant semantic editing
13.Counterfactual vqa: A cause-effect look at language bias
14.Exploring visual relationship for image captioning
15.KAT: A Knowledge Augmented Transformer for Vision-and-Language
16.Building a large-scale multimodal knowledge base system for answering visual queries
17.Visualcomet: Reasoning about the dynamic context of a still image
18.From Recognition to Cognition: Visual Commonsense Reasoning
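
The following toy sketch (my own illustration, not the authors' implementation) shows the composition idea behind paper 1: small reusable modules, here a hypothetical Find and Describe, are instantiated and wired together according to a layout derived from the question. The concept index and feature sizes are made up.

# Toy neural-module-network composition: Describe(Find[dog](image))
import torch
import torch.nn as nn

class Find(nn.Module):          # attends to regions matching a concept, e.g. "dog"
    def __init__(self, img_dim, n_concepts):
        super().__init__()
        self.concept = nn.Embedding(n_concepts, img_dim)
    def forward(self, img_feats, concept_id):
        # img_feats: (regions, img_dim) -> attention weights over regions
        return torch.softmax(img_feats @ self.concept(concept_id), dim=0)

class Describe(nn.Module):      # predicts an answer from the attended region features
    def __init__(self, img_dim, n_answers):
        super().__init__()
        self.head = nn.Linear(img_dim, n_answers)
    def forward(self, img_feats, attention):
        return self.head(attention @ img_feats)

# Hypothetical layout for "what color is the dog?"
img_feats = torch.randn(36, 128)                 # 36 region features
find, describe = Find(128, 100), Describe(128, 10)
att = find(img_feats, torch.tensor(7))           # 7 = made-up index of concept "dog"
answer_logits = describe(img_feats, att)
print(answer_logits.shape)                       # torch.Size([10])
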

Generation (12 papers)
1.Multimodal summarization of complex sentences
Summary: The paper proposes automatically illustrating complex sentences as multimodal summaries that combine pictures, structure, and simplified, compressed text. Beyond the picture itself, a multimodal summary provides extra cues about what happened, who did it, to whom, and how, which may help readers with reading difficulties or those who only want to skim. The authors present ROC-MMS, a system for automatically creating multimodal summaries (MMS) of complex sentences by generating pictures, textual summaries, and structure. They find that pictures alone are not enough to help people understand most sentences, especially readers unfamiliar with the domain.

2.Extractive Text-Image Summarization Using Multi-Modal RNN
3.Multi-modal Summarization for Asynchronous Collection of Text, Image, Audio and Video
4.Multimodal abstractive summarization for How2 videos
5.Deep fragment embeddings for bidirectional image sentence mapping
6.Phrase-based image captioning
7.Style transfer for co-speech gesture animation: A multi-speaker conditional-mixture approach
8.You said that?: Synthesising talking faces from audio
9.Zero-shot text-to-image generation
10.Stochastic video generation with a learned prior
11.Parallel wavenet: Fast high-fidelity speech synthesis
12.Arbitrary talking face generation via attentional audio-visual coherence learning
Transfer (13 papers)
1.Integrating Multimodal Information in Large Pretrained Transformers
Summary: The paper proposes the Multimodal Adaptation Gate (MAG), an attachment that lets BERT and XLNet accept multimodal nonverbal data during fine-tuning. MAG works by generating a shift of BERT's and XLNet's internal representations, conditioned on the visual and acoustic modalities. Experiments show that fine-tuning MAG-BERT and MAG-XLNet significantly improves sentiment analysis performance over previous baselines and over language-only fine-tuning of BERT and XLNet. On the CMU-MOSI dataset, MAG-XLNet achieves human-level multimodal sentiment analysis performance for the first time. A simplified sketch of this gating is given after this list.

2.Multimodal few-shot learning with frozen language models
3.HighMMT: Towards Modality and Task Generalization for High-Modality Representation Learning
4.FLAVA: A Foundational Language And Vision Alignment Model
5.Pretrained transformers as universal computation engines
6.Scaling up visual and visual language representation learning with noisy text supervision
7.Foundations of multimodal co-learning
8.Found in translation: Learning robust joint representations by cyclic translations between modalities
9.Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision
10.Combining labeled and unlabeled data with co-training
11.Cross-modal data programming enables rapid medical machine learning
12.An information theoretic framework for multi-view learning
13.Comprehensive Semi-Supervised Multi-Modal Learning
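
Below is a simplified sketch of a MAG-style shift (my reading of the idea in paper 1, not the released implementation): gated projections of the acoustic and visual features form a displacement vector that is added to the word-level hidden state, with the shift bounded relative to the norm of the original representation. The feature sizes and the beta constant are arbitrary choices for the example.

# Simplified Multimodal Adaptation Gate (MAG)-style shift of a text representation.
import torch
import torch.nn as nn

class MAG(nn.Module):
    def __init__(self, text_dim, acoustic_dim, visual_dim, beta=0.5):
        super().__init__()
        self.gate_a = nn.Linear(text_dim + acoustic_dim, text_dim)
        self.gate_v = nn.Linear(text_dim + visual_dim, text_dim)
        self.proj_a = nn.Linear(acoustic_dim, text_dim)
        self.proj_v = nn.Linear(visual_dim, text_dim)
        self.norm = nn.LayerNorm(text_dim)
        self.beta = beta

    def forward(self, h, a, v):
        # Gates decide how much of each nonverbal modality to let through.
        g_a = torch.relu(self.gate_a(torch.cat([h, a], dim=-1)))
        g_v = torch.relu(self.gate_v(torch.cat([h, v], dim=-1)))
        disp = g_a * self.proj_a(a) + g_v * self.proj_v(v)   # displacement vector
        # Bound the shift relative to the norm of the original representation.
        scale = torch.clamp(self.beta * h.norm(dim=-1, keepdim=True)
                            / (disp.norm(dim=-1, keepdim=True) + 1e-6), max=1.0)
        return self.norm(h + scale * disp)

# Toy usage: shift a 768-d token representation with acoustic/visual features.
h, a, v = torch.randn(2, 768), torch.randn(2, 74), torch.randn(2, 47)
print(MAG(768, 74, 47)(h, a, v).shape)   # torch.Size([2, 768])
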
Quantification (16 papers)
1.Perceptual Score: What Data Modalities Does Your Model Perceive?
Summary: The paper introduces a new metric, the perceptual score, for assessing how strongly a model relies on different subsets of its input features, i.e., on individual modalities. Using the perceptual score, the authors find a surprisingly consistent trend across four popular datasets: recent, more accurate state-of-the-art visual question answering and multimodal visual dialog models tend to perceive the visual data less than their predecessors. This trend is worrying, because it means answers are increasingly inferred from textual cues alone. The perceptual score also helps analyze model biases by decomposing the score into contributions from data subsets. The authors hope to open a discussion about the perceptual capabilities of multimodal models and encourage the community working on multimodal classifiers to quantify them with the proposed score. A sketch of this permutation-based measurement is given after this list.

2.Multimodal explanations: Justifying decisions and pointing to the evidence
3.Women also snowboard: Overcoming bias in captioning models
4.FairCVtest Demo: Understanding Bias in Multimodal Learning with a Testbed in Fair Automatic Recruitment
5.Smil: Multimodal learning with severely missing modality
6.VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers
7.Behind the scene: Revealing the secrets of pre-trained vision-and-language models
8.Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality
9.Does my multimodal model learn cross-modal interactions? It’s harder to tell than you might think!
10.MultiViz: Towards Visualizing and Understanding Multimodal Models
11.M2Lens: Visualizing and explaining multimodal models for sentiment analysis
12.HighMMT: Towards Modality and Task Generalization for High-Modality Representation Learning
13.One model to learn them all
14.What Makes Training Multi-Modal Classification Networks Hard?
15.Characterizing and overcoming the greedy nature of learning in multi-modal deep neural networks
16.MultiBench: Multiscale Benchmarks for Multimodal Representation Learning
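
A rough sketch of the permutation idea behind paper 1 (my own paraphrase, not the authors' code): a model's reliance on a modality is estimated as the drop in accuracy when that modality is shuffled across the evaluation set. The model and data below are toy stand-ins; a text-only model correctly receives an image score near zero.

# Permutation-based estimate of how much a classifier relies on one modality.
import numpy as np

def perceptual_score(predict, text, image, labels, n_permutations=5, seed=0):
    # predict(text, image) -> predicted labels; score = acc(original) - acc(image permuted)
    rng = np.random.default_rng(seed)
    acc_orig = np.mean(predict(text, image) == labels)
    permuted_accs = []
    for _ in range(n_permutations):
        perm = rng.permutation(len(image))
        permuted_accs.append(np.mean(predict(text, image[perm]) == labels))
    return acc_orig - np.mean(permuted_accs)

# Toy model that ignores the image entirely: its image perceptual score is ~0.
labels = np.random.randint(0, 2, size=200)
text = labels.copy()                       # text feature fully predicts the label
image = np.random.randn(200, 8)            # image features carry no signal here
text_only_model = lambda t, i: t           # predicts from text alone
print(perceptual_score(text_only_model, text, image, labels))
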
