Top 6 Computer Vision Models Released in 2024

Ever since the likes of Stable Diffusion, DALL·E and Midjourney made their way to the internet, text-to-image and related generative vision models have witnessed immense expansion. As we step into 2024, this growth shows no signs of slowing down.

In this ever-evolving landscape, let’s explore some of the most compelling computer vision models introduced so far this year.

Parrot: Pareto-optimal Multi-Reward Reinforcement Learning Framework for Text-to-Image Generation

Developed by researchers from Google Research, Google DeepMind, OpenAI, Rutgers University, and Korea University, Parrot is a new reinforcement learning (RL) framework that improves text-to-image (T2I) generation by optimising multiple quality rewards at once.

Parrot addresses the challenges of over-optimisation and manual reward-weight selection by employing batch-wise Pareto-optimal selection. It jointly trains the T2I model and a prompt expansion network, enhancing the generation of quality-aware text prompts. To prevent forgetting the original user prompt, Parrot also introduces original prompt-centered guidance at inference time. Experimental results and a user study show that Parrot outperforms baseline methods across several quality criteria, including aesthetics, human preference, image sentiment, and text-image alignment.
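To make the idea of batch-wise Pareto-optimal selection concrete, here is a minimal sketch: given a batch of generated images scored by several reward models, only the non-dominated samples are kept for the policy update. The function and the toy scores below are illustrative assumptions, not the paper’s implementation.

```python
import numpy as np

def pareto_front(rewards: np.ndarray) -> np.ndarray:
    """Return a boolean mask marking the non-dominated rows.

    rewards: (batch_size, num_rewards) array, higher is better.
    A sample is dominated if some other sample is >= on every reward
    and strictly > on at least one.
    """
    n = rewards.shape[0]
    keep = np.ones(n, dtype=bool)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if np.all(rewards[j] >= rewards[i]) and np.any(rewards[j] > rewards[i]):
                keep[i] = False
                break
    return keep

# Toy batch: 6 images scored on aesthetics, human preference,
# image sentiment and text-image alignment (values are made up).
scores = np.array([
    [0.8, 0.6, 0.7, 0.9],
    [0.5, 0.5, 0.5, 0.5],   # dominated by the first row
    [0.9, 0.4, 0.6, 0.8],
    [0.7, 0.9, 0.8, 0.6],
    [0.6, 0.8, 0.9, 0.7],
    [0.4, 0.3, 0.2, 0.1],   # dominated by every other row
])
mask = pareto_front(scores)
print(mask)  # only non-dominated samples would contribute to the RL update
```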

AIM

Apple has introduced AIM, a set of autoregressive image models inspired by large language models (LLMs) and exhibiting similar scalability to their textual counterparts. The key findings are that the performance of the visual features scales with both model capacity and data quantity, and that the value of the objective function correlates with downstream task performance. Pre-training a seven-billion-parameter AIM on two billion images yielded 84.0% accuracy on ImageNet-1k, even with the trunk (the model’s core) kept frozen.

AIM, pre-trained in much the same way as LLMs, offers a scalable method that needs no image-specific strategies. It shows promise as a new direction for large-scale vision model training, with desirable properties and a strong correlation between pre-training and downstream performance. The models show no signs of saturation, suggesting further gains from larger models trained on longer schedules.
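As a rough illustration of the autoregressive objective described above, the sketch below treats an image as a raster-ordered sequence of patches and trains a causal Transformer to regress each patch from the preceding ones. The architecture, dimensions and loss are simplified assumptions for illustration, not Apple’s released AIM code.

```python
import torch
import torch.nn as nn

class TinyAutoregressiveImageModel(nn.Module):
    """Illustrative causal Transformer over image patches (not Apple's AIM code)."""

    def __init__(self, patch_dim=768, num_patches=196, depth=4, heads=8):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(1, num_patches, patch_dim))
        layer = nn.TransformerEncoderLayer(
            d_model=patch_dim, nhead=heads, batch_first=True, norm_first=True
        )
        self.trunk = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(patch_dim, patch_dim)  # regresses the next patch's pixels

    def forward(self, patches):                      # patches: (B, N, patch_dim)
        n = patches.size(1)
        causal = nn.Transformer.generate_square_subsequent_mask(n).to(patches.device)
        h = self.trunk(patches + self.pos[:, :n], mask=causal)
        return self.head(h)

# Training objective: predict each patch from all previous ones.
model = TinyAutoregressiveImageModel()
x = torch.randn(2, 196, 768)                           # 2 images, 14x14 patches, flattened
pred = model(x)
loss = nn.functional.mse_loss(pred[:, :-1], x[:, 1:])  # output at patch t targets patch t+1
loss.backward()
```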

InstantID: Zero-shot Identity-Preserving Generation in Seconds

While methods like Textual Inversion, DreamBooth, and LoRA have been important advances in personalised image synthesis, their practical application is often hindered by storage demands, lengthy fine-tuning, and reliance on multiple reference images. ID embedding-based methods, on the other hand, face challenges such as extensive fine-tuning, poor compatibility with pre-trained models, or compromised face fidelity.

InstantID, a diffusion-based model, was built to solve this problem. The plug-and-play module efficiently handles image personalisation in various styles from just one facial image while preserving high fidelity. It introduces IdentityNet, which combines strong semantic and weak spatial conditions to steer image generation.

It integrates with popular text-to-image models like SD1.5 and SDXL as a versatile plugin. The method excels at zero-shot identity-preserving generation, making it valuable in real-world scenarios. While it demonstrates robustness, compatibility, and efficiency, challenges persist, including coupled facial attributes and potential biases.
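A toy sketch of the two conditioning signals mentioned above is shown below: a face encoder producing a semantic identity embedding (which a diffusion model could consume through cross-attention), and an IdentityNet-like branch producing coarse spatial cues. The modules, names and shapes are invented stand-ins for illustration, not InstantID’s actual code.

```python
import torch
import torch.nn as nn

class ToyFaceEncoder(nn.Module):
    """Strong semantic condition: one identity embedding from a single face crop."""
    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 112 * 112, dim))
    def forward(self, face):                  # face: (B, 3, 112, 112)
        return self.net(face)                 # (B, 512) identity embedding

class ToyIdentityNet(nn.Module):
    """Weak spatial condition: a coarse landmark-like feature map."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 4, kernel_size=3, padding=1)
    def forward(self, face):
        return self.conv(face)                # spatial hints for a ControlNet-style branch

face = torch.randn(1, 3, 112, 112)
id_embed = ToyFaceEncoder()(face)             # would feed the diffusion model's cross-attention
spatial = ToyIdentityNet()(face)              # would feed the IdentityNet side branch
print(id_embed.shape, spatial.shape)
```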

Distilling Vision-Language Models on Millions of Videos

Aiming to replicate the success of abundant image-text data for video-language models, Google and the University of Texas teamed up to fine-tune a video-language model on synthesised instructional data, starting from a strong image-language baseline. The resulting model is then used to automatically label millions of videos, generating high-quality captions.

The approach is claimed to excel at generating detailed descriptions for new videos, providing higher-quality textual supervision than existing methods. Experiments show that a video-language dual-encoder model, contrastively trained on the auto-generated captions, outperforms the strongest baseline leveraging vision-language models by 3.8%.
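For readers unfamiliar with how a dual-encoder model is trained contrastively on such caption pairs, here is a minimal CLIP-style sketch: matching (video, auto-caption) embeddings are pulled together and mismatched pairs pushed apart with a symmetric InfoNCE loss. The dimensions and temperature below are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of (video, auto-caption) pairs.

    video_emb, text_emb: (batch, dim) outputs of the two encoders.
    Matching pairs sit on the diagonal of the similarity matrix.
    """
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.t() / temperature    # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage with random embeddings standing in for encoder outputs.
v = torch.randn(8, 256)   # video encoder outputs
t = torch.randn(8, 256)   # text encoder outputs for the generated captions
print(contrastive_loss(v, t).item())
```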

Motionshop

Alibaba has introduced Motionshop, a framework for replacing video characters with 3D avatars. It comprises two main components: a video processing pipeline for background extraction, and a pose estimation and rendering pipeline for avatar generation. Parallelisation and a high-performance ray-tracing renderer (TIDE) expedite the process, allowing it to finish in minutes.

Additionally, it uses pose estimation, animation retargeting, and light estimation to integrate the 3D models consistently. The rendering phase relies on TIDE for efficient, photorealistic video production, and the final video is generated by compositing the rendered image with the original footage.

LEGO

Chinese tech company ByteDance and Fudan University have introduced LEGO, a unified end-to-end multi-modal grounding model that captures fine-grained local information, demonstrating precise identification and localisation in images and videos.

The model is trained on a diverse multi-modal, multi-granularity dataset, resulting in improved performance on tasks demanding detailed understanding. To address data scarcity, the team also created a comprehensive multi-modal grounding dataset. The model, code, and dataset are open-sourced to foster advancements in the field.
