🤩 An AWESOME Curated List of Papers, Workshops, Datasets, and Challenges from CVPR 2024

harpreetsahota204/awesome-cvpr-2024

Awesome CVPR 2024 Papers, Workshops, Challenges, and Tutorials!

The 2024 Conference on Computer Vision and Pattern Recognition (CVPR) received 11,532 valid paper submissions, and only 2,719 were accepted, for an overall acceptance rate of about 23.6%.
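If you want to double-check the acceptance rate, here is a quick sketch using only the two counts quoted above (the variable names are my own):

```python
# Counts quoted in the paragraph above.
submissions = 11_532
accepted = 2_719

# Overall acceptance rate as a fraction of valid submissions.
acceptance_rate = accepted / submissions

print(f"{acceptance_rate:.1%}")
```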

Below is a list of the papers, posters, challenges, workshops, and datasets I'm most excited about.

I'll be there with my crew from Voxel51 at Booth 1519, right next to the Meta and Amazon Science booths!

If you found the repo useful, come by and say "Hi" and I'll hook you up with some swag!

🏆 Challenges

Title Authors Code / arXiv Page Summary
Agriculture-Vision Prize Challenge The Agriculture-Vision Prize Challenge 2024 encourages the development of algorithms that recognize agricultural patterns in aerial images, with the goal of promoting sustainable agriculture practices. Semi-supervised learning techniques will be used to merge two datasets and assess model performance. Prizes are $2,500 for 1st place, $1,500 for 2nd place, and $1,000 for 3rd place.
Building3D Challenge arXiv This challenge utilizes the Building3D dataset, an urban-scale publicly available dataset with over 160,000 buildings from 16 cities in Estonia. Participants must develop algorithms that take point clouds as input and generate wireframe models.
Structured Semantic 3D Reconstruction (S23DR) Challenge Transform posed images or SfM outputs into wireframes for extracting semantically meaningful measurements. HoHo dataset provides images, point clouds, and wireframes with semantically tagged edges. $25,000 prize pool.
Pixel-level Video Understanding in the Wild The PVUW challenge includes four tracks: Video Semantic Segmentation (VSS), Video Panoptic Segmentation (VPS), Complex Video Object Segmentation, and Motion Expression guided Video Segmentation. The two new tracks, based on the MOSE and MeViS datasets, aim to foster the development of more comprehensive and robust pixel-level understanding of video scenes in complex environments and realistic scenarios.
SyntaGen Competition The SyntaGen Competition challenges participants to create high-quality synthetic datasets using Stable Diffusion and the 20 class names from PASCAL VOC 2012 for semantic segmentation. The datasets will be evaluated by training a DeepLabv3 model and assessing its performance on a private test set, with submissions ranked based on the mIoU metric. The top 2 teams will receive cash prizes and the opportunity to present their work at the workshop.
SMART-101 CVPR 2024 Challenge The SMART-101 challenge, hosted on EvalAI, evaluates how well multimodal models solve the visuo-linguistic puzzles of the SMART-101 (Simple Multimodal Algorithmic Reasoning Task) dataset, which were designed for children aged 6 to 8. Solving them calls for basic algorithmic reasoning skills such as counting, arithmetic, logic, and spatial reasoning.
Snapshot Spectral Imaging Face Anti-spoofing Challenge New spectroscopy sensors can improve facial recognition systems' ability to identify realistic flexible masks made of silicone or latex. Snapshot Spectral Imaging (SSI) technology obtains compressed sensing spectral images in a single exposure, making it useful for incorporating spectroscopic information. Using a snapshot spectral camera, we created HySpeFAS, the first snapshot spectral face anti-spoofing dataset, with 6,760 hyperspectral images, each containing 30 spectral channels. This competition aims to encourage research on new spectroscopic sensor face anti-spoofing algorithms suitable for SSI images.
ChaLearn Face Anti-spoofing Workshop Spoofing clues resulting from physical presentation attacks are caused by color distortion, screen moire patterns, and production traces. Forgery clues resulting from digital editing attacks are changes in pixel values. The fifth competition aims to explore common characteristics of these attack clues and promote unified detection algorithms. We have a Unified physical-digital Attack dataset, called UniAttackData, with 1,800 participants, 2 physical and 12 digital attacks, and 29,706 videos.
DataCV Challenge GitHub The DataCV Challenge focuses on searching for training sets suited to various object detection targets. The datasets for the challenge consist of a data source pool, combining multiple existing detection datasets, and a newly introduced target dataset with diverse detection environments recorded across 100 countries. Test set A is publicly available on GitHub, while test set B is reserved for determining challenge awards. An evaluation server is provided for calculating test accuracy. To protect individual privacy, human faces and vehicle license plates were blurred, and copyright was validated before the datasets were distributed.
Grocery Vision The GroceryVision Dataset is part of the RetailVision Workshop Challenge at CVPR 2024. It has two tracks that use real-world retail data collected in typical grocery store environments. Track 1 focuses on video Temporal Action Localization (TAL) and Spatial Temporal Action Localization (STAL). Participants are provided with 73,683 image-annotation pairs for training, and their performance is evaluated based on frame-mAP for TAL and tube-mAP for STAL. Track 2 is the Multi-modal Product Retrieval (MPR) challenge. Participants must design methods to accurately retrieve product identity by measuring similarity between images and descriptions.
SoccerNet-GSR'24 Challenge GitHub SoccerNet Game State Reconstruction (GSR) is a novel computer vision task involving the tracking and identification of sports players from a single moving camera to construct a video game-like minimap, without any specific hardware worn by the players. A new benchmark for Game State Reconstruction is introduced for this challenge, including a new dataset with 200 annotated soccer clips, a new evaluation metric, and a public baseline to serve as a starting point for the participants. Methods will be ranked according to their performance on the introduced metric on a held-out challenge set.
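Since the SyntaGen Competition ranks submissions by mIoU, here is a minimal sketch of how mean intersection-over-union can be computed from flat lists of predicted and ground-truth class labels. The function name, the toy labels, and the choice to skip classes that appear in neither list are my own assumptions; the official evaluation code may handle these details differently.

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        # Pixels where both agree on class c.
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        # Pixels where either one says class c.
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both (an assumption)
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six "pixels", three classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
print(score)
```

Per-class IoU here is 1/3, 2/3, and 1/2, so the mean works out to 0.5.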

👁️ Vision Transformers

Title Authors Code / arXiv Page Summary
Point Transformer V3: Simpler, Faster, Stronger Xiaoyang Wu, Li Jiang, Peng-Shuai Wang GitHub arXiv The Point Transformer V3 (PTv3) is a 3D point cloud transformer architecture that prioritizes simplicity and efficiency to enable scalability, overcoming the traditional trade-off between accuracy and speed in point cloud processing. It uses point cloud serialization, serialized attention, enhanced conditional positional encoding (xCPE), and simplified designs to improve efficiency. PTv3 achieves state-of-the-art performance across over 20 downstream tasks spanning indoor and outdoor scenarios, while offering superior speed and memory efficiency compared to previous point transformers.
RepViT: Revisiting Mobile CNN From ViT Perspective Ao Wang, Hui Chen, Zijia Lin GitHub arXiv RepViT is a new lightweight convolutional neural network series for mobile devices. It combines efficient architectural choices from Vision Transformers with a standard CNN, MobileNetV3-L. Key steps include separating token mixer and channel mixer, reducing expansion ratio, using early convolutions as stem, employing a deeper downsampling layer, replacing the classifier with a simpler one, and using only 3x3 convolutions. RepViT outperforms existing CNNs and ViTs on vision tasks while maintaining favorable latency on mobile devices.
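To give a feel for the point cloud serialization idea behind PTv3, here is a rough sketch that orders 3D points along a Z-order (Morton) space-filling curve. This is not the authors' implementation: the function names, the `grid_size` and `bits` parameters, and the toy points are hypothetical, and a real pipeline would work on tensors and may use other curves.

```python
def morton_code(x, y, z, bits=10):
    """Interleave the low `bits` bits of three non-negative grid coordinates."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code


def serialize(points, grid_size=0.5, bits=10):
    """Return point indices sorted along a Z-order curve over a voxel grid."""
    mins = [min(p[d] for p in points) for d in range(3)]

    def key(i):
        # Quantize each coordinate to an integer grid cell.
        cell = [int((points[i][d] - mins[d]) / grid_size) for d in range(3)]
        return morton_code(*cell, bits=bits)

    return sorted(range(len(points)), key=key)


points = [(1.0, 1.0, 1.0), (0.0, 0.0, 0.0), (0.6, 0.0, 0.0), (0.0, 0.6, 0.0)]
order = serialize(points)
print(order)
```

Once points are in a 1D order like this, nearby entries in the sequence tend to be spatial neighbors, which is what makes attention over fixed-size chunks of the sequence practical.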

👁️💬 Vision-Language

Title Authors Code / arXiv Page Summary
Vlogger: Make Your Dream A Vlog Shaobin Zhuang, Kunchang Li, Xinyuan Chen GitHub arXiv Vlogger is an AI system that generates minute-level video blogs from user descriptions. It uses a Large Language Model (LLM) to break down the task into four stages: Script, Actor, ShowMaker, and Voicer. The ShowMaker uses a Spatial-Temporal Enhanced Block (STEB) to enhance spatial-temporal coherence. Vlogger can generate 5+ minute vlogs, surpassing previous long video generation methods.
A Closer Look at the Few-Shot Adaptation of Large Vision-Language Models Julio Silva-Rodríguez, Sina Hajimiri, Ismail Ben Ayed GitHub arXiv CLIP is a powerful vision-language model for visual recognition. However, fine-tuning it for small downstream tasks with limited labeled samples is challenging. Efficient transfer learning (ETL) methods adapt VLMs with few parameters, but require careful per-task hyperparameter tuning using large validation sets. To overcome this, the authors propose CLAP, a principled approach that adapts linear probing for few-shot learning. CLAP consistently outperforms ETL methods, providing an efficient and robust approach for few-shot adaptation of large vision-language models in realistic settings where hyperparameter tuning with large validation sets is not feasible.
Alpha-CLIP: A CLIP Model Focusing on Wherever You Want Zeyi Sun, Ye Fang, Tong Wu GitHub arXiv Alpha-CLIP is an improved version of the CLIP model that focuses on specific regions of interest in images through an auxiliary alpha channel. It can enhance CLIP in different image-related tasks, including 2D and 3D image generation, captioning, and detection. Alpha-CLIP preserves CLIP's visual recognition ability and boosts zero-shot classification accuracy by 4.1% when using foreground masks.
CLOVA: A Closed-Loop Visual Assistant with Tool Usage and Update Zhi Gao, Yuntao Du, Xintong Zhang GitHub arXiv CLOVA is a system that leverages large language models (LLMs) to generate programs that can accomplish various visual tasks using off-the-shelf visual tools. To overcome the limitation of fixed tools, CLOVA has a closed-loop framework that includes an inference phase, reflection phase, and learning phase. It also uses a multimodal global-local reflection scheme and three flexible methods to collect real-time training data. CLOVA's learning capability enables it to adapt to new environments, resulting in a 5-20% better performance on VQA, multiple-image reasoning, knowledge tagging, and image editing tasks.
Convolutional Prompting meets Language Models for Continual Learning Anurag Roy, Riddhiman Moulick, Vinay K. Verma arXiv The paper introduces ConvPrompt, a novel approach for continual learning in vision transformers. ConvPrompt leverages convolutional prompts and large language models to maintain layer-wise shared embeddings and improve knowledge sharing across tasks. The method improves state-of-the-art by around 3% with significantly fewer parameters. In summary, ConvPrompt is an efficient and effective prompt-based continual learning approach that adapts the model capacity based on task similarity.
Improved Visual Grounding through Self-Consistent Explanations Ruozhen He, Paola Cascante-Bonilla, Ziyan Yang GitHub arXiv This paper presents a strategy called SelfEQ. The aim of SelfEQ is to improve the ability of vision-and-language models to locate specific objects in an image. The proposed strategy involves adding paraphrases generated by a large language model to existing text-image datasets. The model is then fine-tuned to ensure that a phrase and its paraphrase map to the same region in the image. This promotes self-consistency in visual explanations, expands the model's vocabulary, and enhances the quality of object locations highlighted by gradient-based visual explanation methods like GradCAM.
Learning CNN on ViT: A Hybrid Model to Explicitly Class-specific Boundaries for Domain Adaptation Ba Hung Ngo, Nhat-Tuong Do-Tran, Tuan-Ngoc Nguyen GitHub arXiv The paper introduces a new approach called Explicitly Class-specific Boundaries (ECB) for domain adaptation, which combines the strengths of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) by training CNN on ViT. ECB uses ViT to determine class-specific decision boundaries and CNN to group target features based on those boundaries. This improves the quality of pseudo labels and reduces knowledge disparities. The paper also provides visualizations to demonstrate the effectiveness of the proposed ECB method.
Link-Context Learning for Multimodal LLMs Yan Tai, Weichen Fan, Zhao Zhang GitHub arXiv The paper presents Link-Context Learning (LCL), a new approach that enables Multimodal Large Language Models (MLLMs) to learn new concepts from limited examples in a single conversation. The proposed training strategy fine-tunes MLLMs using contrast learning and balanced sampling from LCL and original tasks. The ISEKAI dataset is introduced to evaluate MLLMs' performance on LCL tasks. Experiments show that LCL-MLLM outperforms vanilla MLLMs on the ISEKAI dataset. The paper presents LCL as a promising paradigm for expanding MLLMs' abilities and paving the way for more human-like learning in multimodal models.
Modeling Multimodal Social Interactions: New Challenges and Baselines with Densely Aligned Representations Sangmin Lee, Bolin Lai, Fiona Ryan arXiv The paper "Modeling Multimodal Social Interactions" introduces three new tasks to model multi-party social interactions. The authors propose a novel multimodal baseline that leverages densely aligned language-visual representations to address these challenges. Experiments demonstrate the effectiveness of the proposed approach in modeling social interactions.
mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration Qinghao Ye, Haiyang Xu, Jiabo Ye GitHub arXiv mPLUG-Owl2 is a multi-modal language model that improves text and multi-modal task performance. It uses a modularized network design with a language decoder as a universal interface for managing different modalities. It incorporates shared functional modules and a modality-adaptive module. It uses a two-stage training paradigm consisting of vision-language pre-training and joint vision-language instruction tuning. Experiments show it achieves SOTA results on multiple vision-language and pure-text benchmarks. It introduces novel architecture designs and training methods to enable modality collaboration, leading to strong performance in text-only and multi-modal tasks.
OneLLM: One Framework to Align All Modalities with Language Jiaming Han, Kaixiong Gong, Yiyuan Zhang GitHub arXiv OneLLM aligns 8 modalities to language using a unified framework. It uses a frozen CLIP-ViT and a universal projection module (UPM) that mixes image projection experts. OneLLM progressively aligns modalities to the LLM, starting with image-text alignment and expanding to video, audio, point cloud, depth/normal map, IMU, and fMRI data. The authors curated a large multimodal instruction dataset to fine-tune OneLLM's multimodal understanding and reasoning capabilities. OneLLM performs excellently on 25 diverse multimodal benchmarks, including captioning, question answering, and reasoning tasks. OneLLM pioneers a unified and scalable MLLM framework that can align a wide range of modalities with language and achieve strong multimodal understanding through instruction finetuning.
OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation Qidong Huang, Xiaoyi Dong, Pan Zhang GitHub arXiv OPERA is a novel solution to alleviate hallucination in MLLMs. It introduces a penalty term and rollback strategy during beam-search decoding, targeting the root cause of self-attention patterns. It doesn't require additional data or training and has been proven effective in experiments.
Describing Differences in Image Sets with Natural Language Lisa Dunlap, Yuhui Zhang, Xiaohan Wang GitHub arXiv
Let’s Think Outside the Box: Exploring Leap-of-Thought in Large Language Models with Creative Humor Generation Shanshan Zhong, Zhongzhan Huang, Shanghua Gao GitHub arXiv
Osprey: Pixel Understanding with Visual Instruction Tuning GitHub arXiv Osprey is an approach that improves multimodal large language models' accuracy in understanding visual information. It uses fine-grained mask regions and a convolutional CLIP backbone to extract precise visual mask features from high-resolution inputs efficiently. The authors curated the Osprey-724K dataset with 724K samples to facilitate mask-based instruction tuning. Osprey outperforms previous state-of-the-art methods in various region understanding tasks, showcasing its new capability for pixel-level instruction tuning.
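Several of the entries above build on CLIP-style zero-shot classification. As a rough illustration, here is a minimal sketch of the scoring step: cosine similarity between an image embedding and one text embedding per class prompt, followed by a softmax. The three-dimensional embeddings, the prompts, and the `temperature` value are made up for illustration; real models produce much larger embeddings from trained encoders.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def zero_shot_probs(image_emb, text_embs, temperature=100.0):
    """Softmax over scaled cosine similarities between image and text embeddings."""
    img = normalize(image_emb)
    logits = [
        temperature * sum(a * b for a, b in zip(img, normalize(t)))
        for t in text_embs
    ]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical embeddings; real ones come from the image and text encoders.
image_emb = [0.9, 0.1, 0.0]
text_embs = {
    "a photo of a cat": [1.0, 0.0, 0.0],
    "a photo of a dog": [0.0, 1.0, 0.0],
}
probs = zero_shot_probs(image_emb, list(text_embs.values()))
best = max(zip(text_embs, probs), key=lambda kv: kv[1])[0]
print(best)
```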

👩🏾‍🏫 Tutorial

Title Authors Code / arXiv Page Summary
Diffusion-based Video Generative Models The tutorial will provide an in-depth exploration of diffusion-based video generative models, a cutting-edge field that is transforming video creation. It aims to help students, researchers, practitioners, video creators and enthusiasts gain the necessary knowledge to enter and contribute to this domain. The tutorial will cover three main topics: (1) Fundamentals: diffusion models, video foundation models, pre-training; (2) Applications: fine-tuning, editing, controls, personalization, motion customization; (3) Evaluation & Safety: benchmarks, metrics, attacks, watermarks, copyright protection.
From Multimodal LLM to Human-level AI: Modality, Instruction, Reasoning and Beyond The MLLM Tutorial at CVPR 2024 reviews cutting-edge research in multimodal large language models (MLLMs). These models integrate various modalities to enable AI systems to understand, reason, and plan. The tutorial focuses on MLLM architecture design, instructional learning, and multimodal reasoning. The organizers have compiled an extensive reading list for LLMs, MLLMs, instruction tuning, and reasoning. The tutorial aims to summarize technical advancements, challenges, and future research directions in the evolving field of MLLMs.
Generalist Agent AI The tutorial on Generalist Agent AI (GAA) provides a comprehensive overview of GAA systems that generate effective actions through multimodal sensory input. Led by experts from academia and industry, the tutorial covers areas such as embodied-multimodality, robotics, gaming, and healthcare. The schedule includes lectures, Q&A sessions, and panel discussions on knowledge agents, agent robotics, and agent foundation models.
Robustness at Inference: Towards Explainability, Uncertainty, and Intervenability The tutorial focuses on creating neural networks that are robust, explainable, and have uncertainty quantification. It also covers intervenability by humans. Three key concepts covered are explainability, uncertainty, and intervenability. The tutorial covers various applications, such as image recognition, detecting anomalies, and image quality assessment.
Unlearning in Computer Vision: Foundations and Applications Machine Unlearning (MU) is a new field in computer vision that focuses on removing specific data points, classes, or concepts from pre-trained models. The tutorial aims to provide a comprehensive understanding of MU techniques, algorithmic foundations and applications in computer vision. It also emphasizes the importance of MU from an industry perspective and discusses metrics to verify the unlearning process.
Recent Advances in Vision Foundation Models The tutorial covers general-purpose vision systems, or vision foundation models, for various downstream tasks at different levels of granularity. It explores the synergy of tasks and the versatility of transformers for building models for multimodal understanding and generation.

📊 Datasets/Benchmarks

Title Authors Code / arXiv Page Summary
Benchmarking and Evaluating Large Video Generation Models Yaofang Liu, Xiaodong Cun, Xuebo Liu GitHub arXiv The paper proposes a comprehensive evaluation framework for the rapidly growing family of large video generation models. Existing academic metrics are inadequate for evaluating these models, which are trained on massive datasets. The proposed evaluation pipeline comprises prompt curation, objective evaluation, subjective studies, and opinion alignment. The models are evaluated based on 17 objective metrics covering visual quality, content quality, motion quality, and text-caption alignment. Additionally, it provides a comparison table of various video generation models across different metrics and capabilities.
ImageNet-D: Benchmarking Neural Network Robustness on Diffusion Synthetic Object Chenshuang Zhang, Fei Pan, Junmo Kim GitHub arXiv ImageNet-D is a new benchmark for evaluating neural network robustness in visual perception tasks. It generates synthetic images with diverse backgrounds, textures, and materials, making it more challenging than other synthetic datasets. Key features include diversified image generation, high visual fidelity, and significant accuracy reduction of various vision models. The benchmark is created by combining object categories and refining through human verification. ImageNet-D is effective in evaluating neural network robustness, as accuracy on it improves with accuracy on ImageNet.
LaMPilot: An Open Benchmark Dataset for Autonomous Driving with Language Model Programs Yunsheng Ma, Can Cui, Xu Cao arXiv The LaMPilot dataset consists of 4,900 human-annotated traffic scenes, each with an instruction (I), an initial state (b), and a set of goal state criteria (G). The dataset is classified by maneuver and scenario types and is divided into training, validation, and testing sets.
MAPLM: A Real-World Large-Scale Vision-Language Dataset for Map and Traffic Scene Understanding Xu Cao, Tong Zhou, Yunsheng Ma GitHub The dataset contains 3D point cloud Bird's Eye View and high-resolution panoramic images of various traffic scenarios. It also includes detailed annotations at the feature, lane, and road levels. The dataset is designed for a Q&A task, where models will be evaluated based on their ability to answer questions about the traffic scenes such as the number of lanes, presence of intersections, and data quality.
Polos: Multimodal Metric Learning from Human Feedback for Image Captioning Yuiga Wada, Kanta Kaneda, Daichi Saito, Komei Sugiura GitHub arXiv The Polaris dataset, used to train the model, contains 131,020 human judgments from 550 evaluators on the appropriateness of image captions. The dataset is much larger than existing ones and is capable of training image captioning metrics. The captions in Polaris are more diverse, collected from humans and generated by 10 modern image captioning models. This demonstrates the effectiveness and robustness of Polos compared to previous metrics.
VBench: Comprehensive Benchmark Suite for Video Generative Models Ziqi Huang, Yinan He, Jiashuo Yu GitHub arXiv VBench is a tool that evaluates video generation models across 16 quality dimensions. These dimensions fall under Video Quality and Video-Condition Consistency. VBench provides valuable insights by evaluating models across multiple dimensions and content categories, and by comparing video vs image generation. The tool's authors plan to expand VBench to more models and video generation tasks. Check out the leaderboard on HF here: https://huggingface.co/spaces/Vchitect/VBench_Leaderboard
SoccerNet Game State Reconstruction Vladimir Somers, Victor Joos, Anthony Cioppa, Silvio Giancola, Seyed Abolfazl Ghasemzadeh, Floriane Magera, Baptiste Standaert, Amir Mohammad Mansourian, Xin Zhou, Shohreh Kasaei, Bernard Ghanem, Alexandre Alahi, Marc Van Droogenbroeck, Christophe De Vleeschouwer GitHub arXiv SoccerNet Game State Reconstruction (GSR) is a novel computer vision task involving the tracking and identification of sports players from a single moving camera to construct a video game-like minimap, without any specific hardware worn by the players. SoccerNet-GSR, the released dataset, includes 200 clips with 9.37M pitch localization annotations and 2.36M athlete positions on the pitch with their role, team & jersey number. Furthermore, a new performance metric 'GS-HOTA' is introduced to evaluate GSR methods.

📦 3D Vision

Title Authors Code / arXiv Page Summary
ASH: Animatable Gaussian Splats for Efficient and Photoreal Human Rendering Haokai Pang, Heming Zhu GitHub arXiv ASH generates real-time photorealistic renderings of animatable human avatars using Gaussian splats attached to a deformable mesh template. The skeletal motion is encoded using pose-dependent normal maps, and the dynamic Gaussian parameters are learned using 2D convolutional architectures. This approach surpasses existing real-time human avatar rendering methods and represents a significant step towards producing real-time, high-fidelity, controllable human avatars.
Doodle Your 3D: From Abstract Freehand Sketches to Precise 3D Shapes Hmrishav Bandyopadhyay, Subhadeep Koley, Ayan Das arXiv This paper introduces a new method for generating precise 3D shapes from abstract freehand sketches, without the need for paired sketch-3D data. The approach uses a part-level modeling and alignment framework, which enables sketch modeling and in-position editing. By operating in a low-dimensional implicit latent space and using diffusion models, the approach significantly reduces computational demands and processing time. Overall, the method offers a novel solution for enabling accurate 3D generation from abstract sketches.
GaussianAvatars: Photorealistic Head Avatars with Rigged 3D Gaussians Shenhan Qian, Tobias Kirschstein, Liam Schoneveld GitHub arXiv GaussianAvatars is a new technique for creating customizable photorealistic head avatars using a dynamic 3D representation based on 3D Gaussian splats. This approach allows for precise animation control while maintaining photorealistic rendering. The technique has shown impressive animation capabilities in challenging scenarios, such as reenactments from a driving video, where it outperforms existing techniques by a significant margin.
Open-Vocabulary 3D Scene Graphs from Point Clouds with Queryable Objects and Open-Set Relationships Sebastian Koch, Narunas Vaskevicius Open3DSG predicts open-vocabulary 3D scene graphs from point clouds, combining vision-language and large language models. Key ideas include constructing a 3D graph with a GNN, aligning features with CLIP, and using an LLM. It allows querying arbitrary objects and relationships at inference time, and enables open-vocabulary prediction not limited to fixed labels.

🧨 Diffusion

Title Authors Code / arXiv Page Summary
Diffusion 3D Features (Diff3F): Decorating Untextured Shapes with Distilled Semantic Features Niladri Shekhar Dutt, Sanjeev Muralikrishnan, Niloy J. Mitra GitHub arXiv Diff3F is a feature descriptor for untextured 3D shapes. It computes 3D semantic features using pre-trained 2D diffusion models, rendering depth and normal maps from multiple views, and lifting the 2D diffusion features back to the 3D surface. This produces semantic descriptors on the 3D shape without requiring additional training data or part segmentation.
One-step Diffusion with Distribution Matching Distillation Tianwei Yin arXiv Distribution Matching Distillation (DMD) distills a multi-step diffusion model into a one-step generator without compromising image quality. DMD matches the distribution of the original diffusion model by minimizing a KL divergence using two score functions: one for the actual data distribution and one for the generated distribution. A regression loss matches the large-scale structure of the multi-step diffusion outputs.
DemoFusion: Democratising High-Resolution Image Generation With No $$$ Ruoyi Du, Dongliang Chang, Timothy M. Hospedales, Yi-Zhe Song, Zhanyu Ma GitHub arXiv DemoFusion is an extension that enables the generation of high-res images through an accessible and efficient inference procedure. It uses global-local denoising paths and introduces three techniques for coherent high-res generation: progressive upscaling, skip residual, and dilated sampling. DemoFusion unlocks the potential in existing open-source text-to-image models without additional training or prohibitive costs, democratizing high-res image synthesis.
DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing Yujun Shi, Chuhui Xue, Jiachun Pan GitHub arXiv DragDiffusion is a novel method for interactive point-based image editing that enhances the applicability and versatility of the DragGAN framework by extending it to diffusion models. It optimizes the latent of a single diffusion step and introduces techniques to preserve the identity of the original image. The authors present DragBench, the first benchmark dataset for evaluating interactive point-based image editing methods. Experiments demonstrate the effectiveness of DragDiffusion compared to DragGAN, and an ablation study explores key factors.
FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models Shivangi Aneja, Justus Thies, Angela Dai GitHub arXiv FaceTalk is a novel method to generate 3D motion sequences of talking human heads from audio signals. It employs neural parametric head models with speech signals and a new latent diffusion model. The approach denoises Gaussian noise sequences iteratively and extracts mesh sequences using marching cubes from the frozen NPHM model. FaceTalk outperforms existing methods by 75% in perceptual user study evaluations and produces visually natural motion with diverse facial expressions and styles.
RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models Ozgur Kara, Bariscan Kurtkaya, Hidir Yesiltepe GitHub arXiv RAVE is a fast and innovative method for zero-shot video editing that uses pre-trained text-to-image diffusion models. It preserves the original motion and structure of the input video while producing high-quality, temporally consistent edited videos. RAVE edits videos 25% faster than existing methods by efficiently leveraging spatio-temporal interactions between frames. It outperforms existing methods across diverse editing scenarios and requires no extra training or manual inputs. However, there are some limitations such as flickering issues for extreme shape edits in very long videos and fine detail flickering. Try the demo here: https://huggingface.co/spaces/ozgurkara/RAVE
Relightful Harmonization: Lighting-aware Portrait Background Replacement Mengwei Ren, Wei Xiong, Jae Shin Yoon arXiv The paper presents Relightful Harmonization, a technique for harmonizing portrait lighting with a new background image. The method encodes lighting information from the target background image and aligns it with features from panoramic environment maps. Relightful Harmonization outperforms existing benchmarks in visual fidelity and lighting coherence. The technique only requires an arbitrary background image during inference and expands the training data using a novel data simulation pipeline. This approach enables realistic, lighting-aware portrait background replacement using just a single target background image, without requiring HDR environment maps.
SceneTex: High-Quality Texture Synthesis for Indoor Scenes via Diffusion Priors Dave Zhenyu Chen, Haoxuan Li, Hsin-Ying Lee GitHub arXiv SceneTex generates high-quality indoor scene textures using depth-to-image diffusion priors. Key features include optimization in RGB space, multiresolution texture field, and cross-attention decoder for global style consistency. Experiments show it outperforms prior methods, but limitations include occasional artifacts and inability to handle complex geometry.
ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models Lukas Höllein, Aljaž Božič, Norman Müller GitHub arXiv ViewDiff generates 3D-consistent images from different viewpoints of the same object or scene. The approach involves enhancing the U-Net architecture of pretrained text-to-image models with new layers, training on real-world multi-view datasets using a denoising process, and producing multi-view consistent images of the same object in a single forward denoising pass. The results show high visual quality with improved 3D consistency compared to existing methods.
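DemoFusion lists dilated sampling as one of its techniques. Here is a toy sketch of the general idea, assuming it amounts to taking strided subsamples of a large latent grid so that each subsample sees the whole image at low resolution. The function name and the 4x4 integer "latent" are hypothetical, and the paper's actual procedure includes details not shown here.

```python
def dilated_samples(latent, scale):
    """Split a (scale*h) x (scale*w) grid into scale**2 strided h x w views."""
    views = []
    for dy in range(scale):
        for dx in range(scale):
            # Take every `scale`-th row and column, starting at offset (dy, dx).
            views.append([row[dx::scale] for row in latent[dy::scale]])
    return views


# Toy 4x4 "latent" holding the values 0..15 in row-major order.
latent = [[r * 4 + c for c in range(4)] for r in range(4)]
views = dilated_samples(latent, scale=2)
for view in views:
    print(view)
```

Each of the four 2x2 views covers the full spatial extent of the grid, in contrast to a local crop that covers only one corner.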

🧩 Segmentation

Title Authors Code / arXiv Page Summary
Amodal Ground Truth and Completion in the Wild Guanqi Zhan, Chuanxia Zheng, Weidi Xie GitHub arXiv The paper addresses amodal image segmentation, which predicts masks for entire objects, including occluded parts. Previous methods relied on manual annotation, but the authors use 3D data to construct the MP3D-Amodal dataset with authentic amodal ground truth masks. Two architecture variants are explored: a two-stage OccAmodal model and a one-stage SDAmodal model. Their method achieves state-of-the-art performance on amodal segmentation datasets, including COCOA and the new MP3D-Amodal dataset.

🛠️ Workshop

Title Authors Code / arXiv Page Summary
Dataset Distillation Full-day workshop on June 17. The workshop will explore the potential of Dataset Distillation (DD) in computer vision applications like face recognition, object detection, image segmentation, and video understanding. DD has the potential to reduce training costs, make AI eco-friendly, and enable research groups with limited resources to engage in state-of-the-art research. The workshop will also cover related topics such as active learning, few-shot learning, generative models, and learning from synthetic data.
Urban Scene Modeling: Where Vision Meets Photogrammetry and Graphics Full-day workshop on June 17. The USM3D workshop unites researchers in computer vision, graphics, and photogrammetry to collaborate on urban scene modeling challenges. The event includes talks, presentations, a challenge, and a poster session.
What is Next in Multimodal Foundation Models? Takes place the morning of June 18th. The workshop focuses on the emerging field of multimodal foundation models, which are trained on multiple modalities simultaneously and applied in text-to-image/video/3D generation, zero-shot classification, and cross-modal retrieval. It brings together leaders to discuss different aspects of these models, including their design, efficiency, ethics, and open availability.
Gaze Estimation and Prediction in the Wild Morning of June 18th. The workshop will cover gaze-based interaction techniques, eye tracking technologies, applications of gaze interaction in various domains, and methodological considerations in gaze-based research. The main objective is to enhance the field of gaze interaction by providing a platform for researchers and practitioners to present their work, exchange ideas, and explore future directions.
Large Scale Holistic Video Understanding GitHub The main objective of the workshop is to establish a video benchmark integrating joint recognition of all the semantic concepts, as a single class label per task is often not sufficient to describe the holistic content of a video. The planned panel discussion with the world's leading experts on this problem will be a fruitful source of ideas for all participants. The community is invited to help extend the HVU dataset, which will spur research in video understanding as a comprehensive, multi-faceted problem.
Representation Learning with Very Limited Images Afternoon of June 18th. This workshop focuses on developing visual and multi-modal models with limited data resources. It aims to bring together diverse communities that work on approaches such as self-supervised learning with a single image or synthetic pre-training with generated images. The workshop's organizers include researchers from various institutions.
Responsible Data Full-day workshop on June 18. The Workshop on Responsible Data will discuss building responsible and inclusive datasets for computer vision. Topics include context-driven dataset development, best practices for data collectors and annotators, responsible datasets for AI models, measuring dataset responsibility, transparency, data privacy and accountability, and engaging the open-source community.
Synthetic Data for Computer Vision The workshop focuses on the use of synthetic data for training and evaluating computer vision models. It covers topics such as the effectiveness, efficiency, scalability, benchmarking, evaluation, risks, and ethical considerations of synthetic data in computer vision and related fields.
Computer Vision for Mixed Reality VR has the potential to revolutionize our interactions. Passthrough headsets such as Apple Vision Pro and Quest-3 allow for deeply immersive mixed reality experiences. The workshop focuses on capturing real environments with cameras and using AI to augment them with virtual objects. Its call for papers invites research on novel methods for mixed reality. Topics include real-time view synthesis, scene understanding, 3D capture, and more.
Multimodal Algorithmic Reasoning Morning of June 17. The Multimodal Algorithmic Reasoning (MAR) 2024 workshop at CVPR 2024 aims to bring together researchers working on neural algorithmic learning, multimodal reasoning, and cognitive models of intelligence. The workshop will focus on the emerging topic of multimodal algorithmic reasoning, where agents automatically deduce new algorithms for solving real-world tasks, and will also encourage the vision community to build neural networks with human-like intelligence abilities.
Vision Datasets Understanding The 3rd CVPR Workshop on Vision Datasets Understanding (VDU) is a gathering of experts in the field of analyzing vision datasets. The workshop will cover a variety of topics, including attributes and properties of vision datasets, dataset-level analysis, representations of and similarities between vision datasets, improving vision dataset quality, and evaluating model accuracy under various test environments.