超全CV基礎(chǔ)模型匯總!13大類算法,85個(gè)變種

2023-08-08 17:55 · Author: 深度之眼官方賬號 (Deep Eyes official account)

視覺(jué)領(lǐng)域的同學(xué)應(yīng)該有所體會(huì),獲取大量標(biāo)注數(shù)據(jù)是一件成本非常高的事。為了應(yīng)對(duì)這個(gè)問(wèn)題,研究者們通過(guò)借助無(wú)標(biāo)注數(shù)據(jù)、圖文數(shù)據(jù)或者多模態(tài)數(shù)據(jù)等,采用對(duì)比學(xué)習(xí)、掩碼重建等學(xué)習(xí)方式預(yù)訓(xùn)練得到視覺(jué)基礎(chǔ)模型,用于適應(yīng)各種下游任務(wù),比如物體檢測(cè)、語(yǔ)義分割等。在過(guò)去一年中,由于LLM、多模態(tài)等領(lǐng)域的快速發(fā)展,更多新興的計(jì)算機(jī)視覺(jué)基礎(chǔ)模型被提出。

By now, the number of published computer vision foundation models is considerable, and they are highly valuable research material for anyone in the field. To help you keep up with and master the latest progress in this area, and publish top-conference papers of your own, today I'd like to share a survey paper with you.

該文作者對(duì)計(jì)算機(jī)視覺(jué)領(lǐng)域的基礎(chǔ)模型進(jìn)行了詳細(xì)的梳理,涵蓋了13大類算法模型,以及每一類模型的變種共85個(gè),從最早的LeNet、ResNet到最新的SAM、GPT4等都有。

Survey link: https://arxiv.org/pdf/2307.13721.pdf

Beyond the survey itself, I have also compiled 120 must-read representative papers on CV algorithms and models from 2021-2023; code for many of them has been open-sourced.

Although existing methods already perform well, it is clear that computer vision foundation models still have enormous room to improve. I hope this collection helps you grasp the field's overall trajectory, trace how each model has evolved, and find better solutions of your own.

Scan the QR code to add 小享 and reply "CV模型"

to get the full collection of papers + code for free!

Paper list:

Surveys (12)

  • Foundational Models Defining a New Era in Vision: A Survey and Outlook 2023

  • A Survey of Large Language Models 2023

  • Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond 2023

  • Multimodal Learning with Transformers: A Survey 2023

  • Self-Supervised Multimodal Learning: A Survey

  • Vision-and-Language Pretrained Models: A Survey 2022

  • A Survey of Vision-Language Pre-Trained Models 2022

  • Vision-Language Models for Vision Tasks: A Survey 2022

  • A Comprehensive Survey on Segment Anything Model for Vision and Beyond 2023

  • Vision-language pre-training: Basics, recent advances, and future trends 2022

  • Towards Open Vocabulary Learning: A Survey 2023

  • Transformer-Based Visual Segmentation: A Survey 2023

Papers

2021 (11)

  • Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision 2021-02-11

  • Learning Transferable Visual Models From Natural Language Supervision 2021-02-26

  • WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training 2021-03-11

  • Open-vocabulary Object Detection via Vision and Language Knowledge Distillation 2021-04-28

  • CLIP2Video: Mastering Video-Text Retrieval via Image CLIP 2021-06-21

  • AudioCLIP: Extending CLIP to Image, Text and Audio 2021-06-24

  • Multimodal Few-Shot Learning with Frozen Language Models 2021-06-25

  • SimVLM: Simple Visual Language Model Pretraining with Weak Supervision 2021-08-24

  • LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs 2021-11-03

  • FILIP: Fine-grained Interactive Language-Image Pre-Training 2021-11-09

  • Florence: A New Foundation Model for Computer Vision 2021-11-22

2022 (14)

  • Extract Free Dense Labels from CLIP 2021-12-02

  • FLAVA: A Foundational Language And Vision Alignment Model 2021-12-08

  • Image Segmentation Using Text and Image Prompts 2021-12-18

  • Scaling Open-Vocabulary Image Segmentation with Image-Level Labels 2021-12-22

  • GroupViT: Semantic Segmentation Emerges from Text Supervision 2022-02-22

  • CoCa: Contrastive Captioners are Image-Text Foundation Models 2022-05-04

  • Simple Open-Vocabulary Object Detection with Vision Transformers 2022-05-12

  • GIT: A Generative Image-to-text Transformer for Vision and Language 2022-05-27

  • Language Models are General-Purpose Interfaces 2022-06-13

  • Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone 2022-06-15

  • A Unified Sequence Interface for Vision Tasks 2022-06-15

  • BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning 2022-06-17

  • MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge 2022-06-17

  • LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action 2022-07-10

2023 (82)

  • Masked Vision and Language Modeling for Multi-modal Representation Learning 2022-08-03

  • PaLI: A Jointly-Scaled Multilingual Language-Image Model 2022-09-14

  • VIMA: General Robot Manipulation with Multimodal Prompts 2022-10-06

  • Images Speak in Images: A Generalist Painter for In-Context Visual Learning 2022-12-05

  • InternVideo: General Video Foundation Models via Generative and Discriminative Learning 2022-12-07

  • Reproducible scaling laws for contrastive language-image learning 2022-12-14

  • Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks 2023-01-12

  • BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models 2023-01-30

  • Grounding Language Models to Images for Multimodal Inputs and Outputs 2023-01-31

  • Language Is Not All You Need: Aligning Perception with Language Models 2023-02-27

  • Prismer: A Vision-Language Model with An Ensemble of Experts 2023-03-04

  • PaLM-E: An Embodied Multimodal Language Model 2023-03-06

  • Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models 2023-03-08

  • Task and Motion Planning with Large Language Models for Object Rearrangement 2023-03-10

  • GPT-4 Technical Report 2023-03-15

  • EVA-02: A Visual Representation for Neon Genesis 2023-03-20

  • MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action 2023-03-20

  • Detecting Everything in the Open World: Towards Universal Object Detection 2023-03-21

  • Errors are Useful Prompts: Instruction Guided Task Programming with Verifier-Assisted Iterative Prompting 2023-03-24

  • EVA-CLIP: Improved Training Techniques for CLIP at Scale 2023-03-27

  • Unmasked Teacher: Towards Training-Efficient Video Foundation Models 2023-03-28

  • ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding with GPT and Prototype Guidance 2023-03-29

  • HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in Hugging Face 2023-03-30

  • ERRA: An Embodied Representation and Reasoning Architecture for Long-horizon Language-conditioned Manipulation Tasks 2023-04-05

  • Segment Anything 2023-04-05

  • SegGPT: Segmenting Everything In Context 2023-04-06

  • ChatGPT Empowered Long-Step Robot Control in Various Environments: A Case Application 2023-04-08

  • Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions 2023-04-09

  • OpenAGI: When LLM Meets Domain Experts 2023-04-10

  • Graph-ToolFormer: To Empower LLMs with Graph Reasoning Ability via Prompt Augmented by ChatGPT 2023-04-10

  • Advancing Medical Imaging with Language Models: A Journey from N-grams to ChatGPT 2023-04-11

  • SAMM (Segment Any Medical Model): A 3D Slicer Integration to SAM 2023-04-12

  • Segment Everything Everywhere All at Once 2023-04-13

  • Visual Instruction Tuning 2023-04-17

  • Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models 2023-04-19

  • MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models 2023-04-20

  • Can GPT-4 Perform Neural Architecture Search? 2023-04-21

  • Evaluating ChatGPT's Information Extraction Capabilities: An Assessment of Performance, Explainability, Calibration, and Faithfulness 2023-04-23

  • Track Anything: Segment Anything Meets Videos 2023-04-24

  • Segment Anything in Medical Images 2023-04-24

  • Medical SAM Adapter: Adapting Segment Anything Model for Medical Image Segmentation 2023-04-25

  • Learnable Ophthalmology SAM 2023-04-26

  • LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model 2023-04-28

  • Transfer Visual Prompt Generator across LLMs 2023-05-02

  • Caption Anything: Interactive Image Description with Diverse Multimodal Controls 2023-05-04

  • ImageBind: One Embedding Space To Bind Them All 2023-05-09

  • InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning 2023-05-11

  • Segment and Track Anything 2023-05-11

  • An Inverse Scaling Law for CLIP Training 2023-05-11

  • VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks 2023-05-18

  • Cream: Visually-Situated Natural Language Understanding with Contrastive Reading Model and Frozen Large Language Models 2023-05-24

  • Voyager: An Open-Ended Embodied Agent with Large Language Models 2023-05-25

  • DeSAM: Decoupling Segment Anything Model for Generalizable Medical Image Segmentation 2023-06-01

  • Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models 2023-06-08

  • Valley: Video Assistant with Large Language model Enhanced abilitY 2023-06-12

  • mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality 2023-04-27

  • Image Captioners Are Scalable Vision Learners Too 2023-06-13

  • XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models 2023-06-13

  • ViP: A Differentially Private Foundation Model for Computer Vision 2023-06-15

  • COSA: Concatenated Sample Pretrained Vision-Language Foundation Model 2023-06-15

  • LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models 2023-06-15

  • Segment Any Point Cloud Sequences by Distilling Vision Foundation Models 2023-06-15

  • RemoteCLIP: A Vision Language Foundation Model for Remote Sensing 2023-06-19

  • LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching 2023-06-20

  • Fast Segment Anything 2023-06-21

  • TaCA: Upgrading Your Visual Foundation Model with Task-agnostic Compatible Adapter 2023-06-22

  • 3DSAM-adapter: Holistic Adaptation of SAM from 2D to 3D for Promptable Medical Image Segmentation 2023-06-23

  • How to Efficiently Adapt Large Segmentation Model (SAM) to Medical Images 2023-06-23

  • Faster Segment Anything: Towards Lightweight SAM for Mobile Applications 2023-06-25

  • MedLSAM: Localize and Segment Anything Model for 3D Medical Images 2023-06-26

  • Kosmos-2: Grounding Multimodal Large Language Models to the World 2023-06-26

  • ViNT: A Foundation Model for Visual Navigation 2023-06-26

  • CLIPA-v2: Scaling CLIP Training with 81.1% Zero-shot ImageNet Accuracy within a $10,000 Budget; An Extra $4,000 Unlocks 81.8% Accuracy 2023-06-27

  • Stone Needle: A General Multimodal Large-scale Model Framework towards Healthcare 2023-06-28

  • RSPrompter: Learning to Prompt for Remote Sensing Instance Segmentation based on Visual Foundation Model 2023-06-28

  • Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language 2023-06-28

  • Foundation Model for Endoscopy Video Analysis via Large-scale Self-supervised Pre-train 2023-06-29

  • MIS-FM: 3D Medical Image Segmentation using Foundation Models Pretrained on a Large-Scale Unannotated Dataset 2023-06-29

  • RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation 2023-07-03

  • SAM-DA: UAV Tracks Anything at Night with SAM-Powered Domain Adaptation 2023-07-03

  • Segment Anything Meets Point Tracking 2023-07-03

  • BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs 2023-07-17
