NVIDIA released ‘LLaMA-Mesh’ earlier this week, a novel method for enabling large language models (LLMs) to generate 3D meshes from text prompts.
This approach integrates 3D mesh generation with language understanding, allowing the model to represent meshes as plain text without modifying its vocabulary or tokeniser.
Ahsen Khaliq, ML at Hugging Face, took to LinkedIn to announce the update.
LLaMA-Mesh builds on LLaMA, a language model, by fine-tuning it on a curated dataset of 3D dialogue. NVIDIA and Tsinghua University researchers designed the method to preserve the model’s language capabilities while extending its functionality to generate and understand 3D content.
How Does LLaMA-Mesh Work?
The method leverages existing spatial knowledge embedded in LLMs, derived from textual sources like 3D tutorials. It tokenises 3D mesh data, including vertex coordinates and face definitions, into text, allowing seamless processing by language models.
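To make this concrete, here is a minimal sketch of how a mesh can be flattened into plain text and parsed back. The paper represents meshes in OBJ-style text, with `v` lines for vertices and `f` lines for faces, and quantises coordinates to integers so they fit the model's existing vocabulary; the helper functions below are illustrative, not the authors' actual code.

```python
def mesh_to_text(vertices, faces):
    """Serialise a mesh as OBJ-style plain text.

    vertices: list of (x, y, z) integer tuples (LLaMA-Mesh quantises
    coordinates so the mesh fits the model's text vocabulary);
    faces: list of 1-based vertex index tuples.
    """
    lines = ["v {} {} {}".format(*v) for v in vertices]
    lines += ["f {} {} {}".format(*f) for f in faces]
    return "\n".join(lines)

def text_to_mesh(text):
    """Parse OBJ-style text back into vertex and face lists."""
    vertices, faces = [], []
    for line in text.splitlines():
        parts = line.split()
        if parts and parts[0] == "v":
            vertices.append(tuple(int(p) for p in parts[1:4]))
        elif parts and parts[0] == "f":
            faces.append(tuple(int(p) for p in parts[1:4]))
    return vertices, faces

# A single triangle, serialised the way an LLM would read or emit it:
tri = mesh_to_text([(0, 0, 0), (64, 0, 0), (0, 64, 0)], [(1, 2, 3)])
```

Because the result is ordinary text, the model needs no new tokens or architectural changes to read or generate it.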
To train the model, the researchers developed a supervised fine-tuning dataset. This dataset enables the LLM to perform tasks such as generating 3D meshes from text prompts, producing interleaved text and 3D outputs, and interpreting 3D mesh structures.
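An interleaved text-and-mesh training record might look like the sketch below. The field names and schema here are assumptions for illustration; the paper does not publish its exact record format.

```python
import json

# Hypothetical supervised fine-tuning record: a chat turn pairing a
# text instruction with an OBJ-text mesh in the reply. The keys
# ("conversations", "role", "content") are illustrative, not the
# paper's actual schema.
sample = {
    "conversations": [
        {"role": "user",
         "content": "Create a 3D model of a simple ramp."},
        {"role": "assistant",
         "content": "Here is a ramp:\nv 0 0 0\nv 64 0 0\nv 64 32 0\nf 1 2 3"},
    ]
}
record = json.dumps(sample)
```

The key point is that the 3D geometry sits inline in the assistant's text reply, so standard instruction-tuning pipelines can train on it unchanged.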
The study shows that LLaMA-Mesh achieves 3D mesh generation quality comparable to specialised models trained exclusively on 3D data.
Training and Results

According to the official research paper, the model was trained on 32 A100 GPUs for 21k iterations over three days, using the AdamW optimiser with a small learning rate and warm-up steps.
A batch size of 128 and cosine scheduling ensure smooth training. The loss shows quick adaptation to the new task with no spikes or issues, which highlights the model’s stability and ability to learn efficiently.
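The warm-up plus cosine schedule described above can be sketched in a few lines. The base learning rate and warm-up length below are illustrative placeholders, not values from the paper; only the 21k total iterations are reported.

```python
import math

def lr_at(step, total_steps, base_lr=1e-5, warmup_steps=300):
    """Linear warm-up followed by cosine decay to zero.

    base_lr and warmup_steps are illustrative assumptions; the paper
    reports only a 'small learning rate' with warm-up and cosine
    scheduling over 21k iterations.
    """
    if step < warmup_steps:
        # Ramp linearly from 0 to base_lr during warm-up.
        return base_lr * step / warmup_steps
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))
```

Warm-up avoids large early updates while the model adapts to the new mesh-text task, and the cosine tail shrinks updates smoothly toward the end of training, which is consistent with the stable, spike-free loss the researchers describe.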
The researchers claim that the method, through training, learns to create detailed, high-quality 3D meshes with artist-like topology. It can also generate diverse, creative outputs from the same text prompt, which suits tasks that need multiple design options.
Even after fine-tuning for 3D mesh generation, the model keeps its strong language skills, understanding complex instructions, asking smart questions, and giving detailed answers. Tests show it performs as well as other models in reasoning and problem-solving while also excelling at creating 3D designs.
What Happens to 3D Modelling?
AI is transforming animation and 3D modelling, impacting artists, studios, developers, and end-users alike.
By automating repetitive tasks, AI frees artists to focus on creativity, while studios benefit from faster, cost-effective production. This raises a question: will human animators and modellers soon be out of work?
Earlier in October, NVIDIA announced EdgeRunner, which could generate highly detailed 3D meshes with up to 4,000 faces at a spatial resolution of 512.
This was derived from both images and point clouds—resulting in sequences that are twice as long and four times higher in resolution compared to previous methods.
Referring to LLaMA-Mesh, an entrepreneur on LinkedIn says, “3D modelling is about to enter a new phase. The speed of execution will make it possible to create projects at lower costs. It’s not bad, although it will require less modeller…”
The researchers noted, “This work represents a significant step toward integrating multi-modal content generation within a cohesive language model.”
LLaMA-Mesh opens new possibilities for conversational 3D generation and understanding, highlighting the potential for unifying 3D and text modalities in language models.
The post NVIDIA Launches LLaMA-Mesh, a Unified 3D Mesh Generation Method Using LLMs appeared first on Analytics India Magazine.