Synthetic Data Generation in Simulation is Keeping ML for Science Exciting

If AI could generate endless streams of training data, data scarcity would no longer be a problem. A shortage of usable training data is holding back many discoveries in science, where only a limited amount is available.

This is where AI, with the help of simulation, is taking on a crucial role. Data generation through simulation is rapidly becoming a cornerstone of ML, especially in science. The approach not only holds promise but is also reigniting enthusiasm among researchers and technologists.

As Yann LeCun (@ylecun) posted on August 6, 2024, “Data generation through simulation is one reason why the whole idea of ML for science is so exciting.” (https://t.co/iC3lkquKOf)

Simulations allow researchers to generate vast amounts of synthetic data, which can be critical when real-world data is scarce, expensive, or challenging to obtain. For instance, in fields like aerodynamics or robotics, simulations enable the exploration of scenarios that would be impossible to test physically.
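To make the idea concrete, here is a minimal sketch of a simulator used as a data source. It is purely illustrative and not taken from any of the researchers quoted: the projectile model, the drag coefficient, the parameter ranges, and the function names `simulate_range` and `generate_dataset` are all assumptions chosen for the example.

```python
import math
import random


def simulate_range(speed, angle_deg, drag=0.01, dt=0.005, g=9.81):
    """Step a point-mass projectile with quadratic drag forward in time
    and return the horizontal distance travelled when it lands (metres)."""
    angle = math.radians(angle_deg)
    x, y = 0.0, 0.0
    vx, vy = speed * math.cos(angle), speed * math.sin(angle)
    while True:
        v = math.hypot(vx, vy)
        # Drag opposes the velocity and grows with the square of speed.
        vx += -drag * v * vx * dt
        vy += (-g - drag * v * vy) * dt
        x += vx * dt
        y += vy * dt
        if y <= 0.0:
            return x


def generate_dataset(n_samples, seed=0):
    """Sample random launch conditions and label each with the simulated range."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_samples):
        speed = rng.uniform(5.0, 50.0)   # m/s, hypothetical range
        angle = rng.uniform(10.0, 80.0)  # degrees, hypothetical range
        rows.append((speed, angle, simulate_range(speed, angle)))
    return rows


# Each row is (speed, angle, range): inputs plus a simulated label.
dataset = generate_dataset(100)
```

Every row costs only compute, so the dataset can be made as large as a model needs. The catch is that the labels are only as faithful as the simulator that produced them.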

Richard Socher, the CEO of You.com, highlighted that while there are challenges, such as the combinatorial explosion in complex systems, simulations offer a pathway to manage and explore these complexities.

Synthetic Data is All You Need?

This echoes what Anthropic chief Dario Amodei said about producing quality data synthetically: that it sounds feasible to create an infinite data generation engine that can help build better AI systems.

“If you do it right, with just a little bit of additional information, I think it may be possible to get an infinite data generation engine,” said Amodei, while discussing the challenges and potential of using synthetic data to train AI models.

“We are working on several methods for developing synthetic data. These are ideas where you can take real data present in the model and have the model interact with it in some way to produce additional or different data,” explained Amodei.

Taking the example of AlphaGo, Amodei said that those little rules of Go, the little additional piece of information, are enough to take the model from “no ability at all to smarter than the best human at Go”. He noted that the model there just trains against itself with nothing other than the rules of Go to adjudicate.
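The self-play idea can be sketched on a far smaller game. The example below is not AlphaGo's method, which relies on neural networks and tree search; it swaps in plain tabular value learning on a toy version of Nim (take one to three stones, whoever takes the last stone wins). The game, the hyperparameters, and the names `train_self_play` and `best_move` are assumptions made for illustration. As in the description above, the only outside information is the rules.

```python
import random

MAX_TAKE = 3  # rule: remove 1 to 3 stones; taking the last stone wins


def legal_moves(stones):
    return range(1, min(MAX_TAKE, stones) + 1)


def best_move(q, stones):
    """Greedy move for the player to act, according to the value table."""
    return max(legal_moves(stones), key=lambda a: q.get((stones, a), 0.0))


def train_self_play(start_stones=12, episodes=20000, alpha=0.5,
                    epsilon=0.3, seed=0):
    """Learn move values by playing both sides with one shared table."""
    rng = random.Random(seed)
    q = {}
    for _ in range(episodes):
        stones = rng.randint(1, start_stones)  # random start for coverage
        while stones > 0:
            if rng.random() < epsilon:
                move = rng.choice(list(legal_moves(stones)))  # explore
            else:
                move = best_move(q, stones)                   # exploit
            remaining = stones - move
            if remaining == 0:
                target = 1.0  # the mover took the last stone and wins
            else:
                # The opponent moves next, so their best value is our loss.
                target = -max(q.get((remaining, a), 0.0)
                              for a in legal_moves(remaining))
            old = q.get((stones, move), 0.0)
            q[(stones, move)] = old + alpha * (target - old)
            stones = remaining
    return q


q_table = train_self_play()
```

No example games are supplied, yet the table ends up favouring moves that leave the opponent a multiple of four stones, which is the known winning strategy for this variant. All of the training data comes from the model's own play.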

Similarly, OpenAI is a big proponent of synthetic data. Former team members Ilya Sutskever and Andrej Karpathy have been a significant force in leveraging synthetic data to build AI models.

The developments at OpenAI are testimony to the rapid growth of generative AI across the ecosystem, but not everyone agrees that the company will be able to achieve AGI with its current methodology of model training. Microsoft is also researching in this direction; its research, “Textbooks Are All You Need”, is a testament to the power of synthetic data.

Google’s AlphaFold, which is spearheading protein fold prediction and creation for drug discovery, can also benefit immensely from synthetic data. At the same time, relying on such data in a field as sensitive as science can be risky.

Synthetic Data is Too Synthetic

However, the potential of simulations extends beyond mere data generation. Giuseppe Carleo, another expert in the field, emphasised that the most exciting aspect is not just fitting an ML model to data generated by an existing simulator.

Instead, true innovation lies in training ML models to become advanced simulators themselves: models that can simulate systems beyond the capabilities of traditional methods while remaining consistent with the laws of physics.

This is becoming possible with synthetic data generated by agentic AI models, which are becoming more common. Models that can test, train, and fine-tune themselves using data they created are an exciting prospect for the future of AI research.

Moreover, the discussion around simulations also touches on broader applications. Sina Shahandeh, a researcher in the field of biotechnology, for example, suggested that the ultimate simulation could model entire economies using an agent-based approach, a concept that is slowly becoming feasible.
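For readers unfamiliar with the agent-based approach, the toy sketch below shows its basic shape: many simple agents follow a local rule, and the aggregate behaviour is read off the population. It is nowhere near a model of a real economy and is not Shahandeh's proposal. The one-unit random exchange rule and the name `simulate_exchange` are assumptions made for illustration.

```python
import random


def simulate_exchange(n_agents=100, steps=10000, start_wealth=100, seed=0):
    """Toy agent-based economy: at each step one randomly chosen agent
    hands a single unit of wealth to another randomly chosen agent."""
    rng = random.Random(seed)
    wealth = [start_wealth] * n_agents
    for _ in range(steps):
        giver, receiver = rng.sample(range(n_agents), 2)
        if wealth[giver] > 0:  # agents cannot go into debt
            wealth[giver] -= 1
            wealth[receiver] += 1
    return wealth


# Final wealth of every agent; the spread across agents is the output of interest.
wealth = simulate_exchange()
```

Even with this trivial rule, wealth stops being equal across agents while the total stays fixed. A realistic version would need far richer agents and rules, which is why the idea is only slowly becoming feasible.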

Despite the excitement, the field is not without its sceptics. Stephan Hoyer, a researcher with a cautious outlook on AGI, pointed out that simulating complex biological systems to the extent that training data becomes unnecessary would require groundbreaking advancements.

He believes this task is far more challenging than achieving AGI. Similarly, Jim Fan, senior AI scientist at NVIDIA, said that while synthetic data is expected to have a noteworthy role, blind scaling alone will not suffice to reach AGI.

When it comes to science, using synthetic data can be tricky. But generating it in simulation shows promise, as it can be tried and tested without being deployed in real-world applications. Besides, the possibility of an infinite supply is what keeps ML exciting for researchers.