3D Internet Building Block Championed by Nvidia and Apple Faces Challenges

August 3, 2023 by Agam Shah


Apple and Nvidia want to dominate the 3D universe, and the world of scientific simulation, with a building block championed by both companies.

The closed-source proponents are going the open standards route to drive the development and adoption of a file format called USD (Universal Scene Description), which is described as the HTML of the metaverse and 3D Internet.

The Alliance for OpenUSD, or AoUSD, wants to make USD the linchpin for creating and rendering 3D worlds, AI-powered graphics, and animated avatars. Those universes could be on the Internet, in virtual reality worlds, in movies, or in graphics-rich scientific simulations.

The other AoUSD founding members include Pixar, Adobe, and Autodesk. None of the other major chip makers, software vendors, or cloud providers are listed as initial supporters. The World Wide Web Consortium and the International Organization for Standardization do not yet consider USD a standard, but AoUSD’s goal is to get it there.

The goal is to "take the open-source project that Pixar created and make a specification that will enable it to become an international standard that can be used by anyone worldwide, like the standards we use today such as JPEG or H.264, HTML or other standards," said Steve May, the chairperson at AoUSD, and chief technology officer at Pixar, during a media briefing.

The file format allows companies to share and reuse 3D assets in virtual worlds or graphics applications. Animators can simply take 3D objects from existing repositories and add them to their projects. The 3D objects could be models, animations, backgrounds, materials, and other assets.

USD implementations typically use rendering engines that pull out procedural descriptions to patch together scenes from shared assets. Nvidia has collaborated with Pixar on many projects in its metaverse product called Omniverse, which relies on the company’s GPUs to render animations, virtual worlds, and simulations.

The USD file format has been used for decades by Pixar to create animated movies. The company developed USD to reuse animations instead of recreating every pixel from scratch. Now USD plays a starring role in Pixar’s moviemaking efforts.

Creating complex animated scenes involves many workflows, 3D content, software tools, and technology, May said.

"And historically, those tools all used different data and file formats. Pixar wanted to enable a more powerful creative expression for the artist by streamlining workflows to allow the same data re-entry … by all the content creation tools," May said.

USD is a unifier for graphics in supercomputing and entertainment, May said, adding, “I see this as kind of an exciting time as it grows that we can see benefits crossing over between the different areas.”

Nvidia's metaverse strategy hinges on the success of the USD file format, whose composition operators, covering properties such as position, orientation, colors, and layers, allow for real-time sharing and collaboration in the metaverse.

Nvidia has projected Omniverse as a tool for engineers to collaborate in real-time on the creation of equipment like aircraft, cars, and machines. Nvidia’s Earth-2 simulation of climate patterns, which ingests visual data from multiple sources, is based on the USD file format and the Omniverse back-end.

The graphics chip maker is using artificial intelligence and the metaverse to sell more GPUs and software. But the software development platform behind Omniverse, called CUDA, is proprietary. CUDA locks customers into Nvidia's hardware and software for AI and the metaverse, and USD will allow the company to build services on top.

Apple is like Nvidia: customers are locked into its devices, software, and services. But the company’s interest in the standard could stem from its recent introduction of the $3,499 Vision Pro, a headset computer on which users can watch movies, videoconference, and interact with virtual worlds.

Apple called the Vision Pro its "first spatial computer," saying it creates a new spatial computing category. Apple is trying to attract developers to write applications for Vision Pro, and the USD file format could be at the center of it all.

USD "enables realistic augmented reality experiences essential to things like spatial computing," May said.

But USD has many usability issues. There are no practical browser-side implementations yet, and it is heavily reliant on server-side processing. Users of the technology typically read USD by creating their own Python and C++ services, which then send the needed information back to the client.
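
As a rough illustration of what such a service looks like, here is a minimal server-side sketch using the OpenUSD (pxr) Python bindings; the scene.usd path and the summarize_stage helper are illustrative, not part of any shipped product.

# Open a USD stage, walk its prims, and return a lightweight JSON summary
# that a browser client could consume. Assumes the OpenUSD Python bindings
# ("pxr") are installed; the file name is a placeholder.
import json
from pxr import Usd

def summarize_stage(path):
    stage = Usd.Stage.Open(path)
    prims = []
    for prim in stage.Traverse():
        prims.append({
            "path": str(prim.GetPath()),
            "type": str(prim.GetTypeName()),
        })
    return json.dumps(prims)

if __name__ == "__main__":
    print(summarize_stage("scene.usd"))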

Autodesk has taken some steps in creating libraries that will make USD practical in browser-based applications, an Nvidia spokesperson said in an email.

Autodesk has created a prototype for USD JavaScript bindings to enable more client-side processing and has been posting relevant proposals to the GitHub page of OpenUSD. OpenUSD is an open-source repository where the code for USD lives, and AoUSD will specify and codify what is represented in that code base.

Other acceleration of client-side USD processing will come through native WebGPU support in Hydra, the rendering framework for USD. WebGPU is the successor to WebGL and is designed for client-side acceleration, offloading work, including AI computing, to local GPUs and reducing server load. Google recently announced that WebGPU was fully integrated within Chrome.

But for Apple and Nvidia, standardizing USD to be the HTML of the 3D Internet will be challenging. For one, it has to contend with a standard that is dubbed the JPEG of 3D.

Companies are leaning toward a long-established USD alternative called glTF, a 3D file format backed by W3C and ISO. The glTF and USD file formats are being discussed by the Metaverse Standards Forum, which was formed last year but does not include Apple or Pixar as members.

"glTF is viewed as a simpler and lighter weight way to represent 3D data. And USD is viewed as the way to make much more complex sorts of scenes and have more people interact with them at the same time," May said.

The glTF format, which is backed by Khronos, has a well-established workflow with lightweight delivery and browser-based acceleration.

"One of the interesting challenges, and if we embrace this challenge, is can we make USD as lightweight and as optimal for simpler things as glTF? In many ways, it would be ideal if we had kind of one solution for both things. That is going to be an active area of debate in the community," May said.

May contends that USD will be the file format for scientific computing, which requires complicated graphics simulations. That is something that glTF cannot handle in its current form as a web-friendly data interchange format.

“OpenUSD will become the fundamental building block on which all 3D content will be created,” May said, adding, “Industrial applications or scientific visualization applications have overlap with what we do in entertainment.”

The power of USD is the ability to aggregate and modify large numbers of assets and then combine them into a complete picture, May said.

The Linux Foundation’s Joint Development Foundation will manage the AoUSD efforts. AoUSD could potentially partner with other associations such as the Academy Software Foundation (ASWF).

“JDF is really structured to help kind of incubate these early techniques and technologies on the pathway to becoming a standard,” May said.

The Nvidia spokeswoman said ASWF has a working group that will partner with AoUSD in the future to bring USD to the Web.

The stakeholders in both groups will share findings and use cases to inform community priorities.


Why Do You Need Worldcoin When You Have Aadhaar?

Now that we have witnessed the euphoria surrounding Worldcoin, with people queuing to get their irises scanned and crowds going berserk, leading to a shutdown of all three initial orb-operation locations in Bengaluru last Friday, let’s explore the big picture for Worldcoin in India.

Drawing parallels to the government-organised large-scale biometric scan that began 12 years ago, which also claimed to create a unique digital identity and provide benefits to citizens via welfare schemes, it looks like Worldcoin may not be too different from Aadhaar — but it is!

India Leading The Way

Former World Bank chief economist Paul Romer referred to Aadhaar as the most “sophisticated system” he had ever seen, and considered it “good for the world if this became widely adopted”. With organisations planning to adopt similar mechanisms, it is not surprising that Worldcoin also took inspiration from Aadhaar. In the 2021 blog post that launched Worldcoin, the company mentioned that biometric approaches to proof of personhood are accessible and allow accurate verification of uniqueness, calling out parallels to Aadhaar in India.

Though they admitted to challenges related to privacy and fraud detection, the company believes the latest advancements can solve them.

not many westerners are aware of the Aadhaar system running in India — an existence proof of the onboarding of billions of humans onto a biometric identity platform https://t.co/GnQzXUWiUP

— Worldcoin (@worldcoin) July 11, 2022

Another way to approach proof of personhood is the social-graph method, which a number of companies have already adopted, but scalability challenges impede its adoption. For a project like Worldcoin, biometrics was the viable option. OpenAI CEO Sam Altman, on his last visit to India, appreciated India’s efforts in building and adopting technology, with special mention of the Aadhaar and UPI systems that emerged here.

With an ambitious goal of 1 billion signups, Worldcoin announced their grand project two years ago. As of now, they have achieved over 2 million sign ups and counting. Interestingly, within five-and-a-half years, Aadhaar clocked 1 billion registrations.

India Unperturbed by Worldcoin

Though the goals for both projects are vaguely similar, the modus operandi are different. Creation of digital identity is one of the goals for Worldcoin, crypto tokens being another. On successful completion of the biometric scan, Worldcoin offers 25 ‘World tokens’ (WLD). Each WLD is a little over $2. While the broader goal of the company is to distribute wealth generated by AI back to society, the timeline to implement this goal is uncertain and will probably fall in the distant future.

Furthermore, since Worldcoin is a private organisation, no form of enforcement can be implemented unless a government affiliation or law comes into the picture, which means the distribution of universal basic income will not come to fruition. Aadhaar, however, has no such obstacles. Though not mandatory, individuals need to possess an Aadhaar card to avail of welfare schemes. As per the Economic Survey 2023, 318 Central schemes and over 720 state DBT (Direct Benefit Transfer) schemes come under the Aadhaar Act and are facilitated via Aadhaar cards.

In this grand scheme of things, where Aadhaar is functioning like a well-oiled machine, Worldcoin will not hold a candle to it.

Rising Mutiny

While Worldcoin is making its way across the globe, setting up orb-scanning locations in 20 countries (35 cities), not all has been rosy for the company. Investigations and watchful scrutiny of Worldcoin operations have risen. Yesterday, Kenya became the first country to ban Worldcoin operations owing to security and financial concerns. The suspension also applies to other entities that operate in a similar fashion and “engage the people of Kenya”. Worldcoin will not operate there until the authorities ascertain there is no risk to the public, as they look into how people’s data would be handled and what would be done with it. However, the orb locations across Kenya witnessed throngs of people.

Within a week of its launch, UK and European regulators were quick to jump onto the scrutiny train. Countries including France and Germany have started investigating its operations over concerns about the collection of sensitive biometric data. It’s not surprising that the EU, known for having some of the most stringent data regulatory policies in the world, would scrutinise the project, but it’s strange that they are looking into it only after it has launched.

With countries waking up to rigorous scrutiny, it looks like India might follow suit with regulatory scrutiny too. Besides, with Aadhaar already in place and pretty much meeting the goals of the ambitious Worldcoin, why have it here in India?


Breaking Boundaries: US Chipmakers Outwit Restrictions to Serve China with Chips

Since last year, the US government has been taking various measures to isolate China in the development of AI technology. The first came in April 2022, when the US urged the Dutch and Japanese governments to stop selling lithography machinery to China. In the following months, it expanded the ban, restricting chip making companies from exporting high-performing GPUs to China.

The restrictions put chip making companies under huge pressure as they lost one of the biggest chip markets in the world, which imports about 53.7% of the world’s supply of chips worth around $240 billion.

Despite the restrictions imposed by the U.S., American chipmakers are finding ways to export AI chips to China. Following in the footsteps of Nvidia and Intel, AMD is also exploring opportunities to sell its chips in the Chinese market.

Lisa Su, CEO of AMD, said on an earnings call late Tuesday that China is an “important” market and that the semiconductor giant wants to be fully compliant with U.S. export controls.

“Our plan is to, of course, be fully compliant with the U.S. export controls, but we do believe there’s an opportunity to develop a product for our customer set in China that is looking for AI solutions, and we’ll continue to work in that direction,” she added.

Going the Nvidia Way?

These chip companies are trying to work around the system by hewing precisely to the specific thresholds laid out by the government.

For instance, the banning rules explicitly forbid the sale of advanced chips to Chinese customers if they possess both high performance (minimum 300 trillion operations per second, or 300 teraops) and fast interconnect speed (typically, at least 600 gigabytes per second).

An example of such a chip is Nvidia’s A100, which exceeds 600 teraops in performance and matches the 600 GB/s interconnect threshold. During the earnings call, AMD CEO Lisa Su said AMD is set to ramp up production of its flagship MI300 artificial-intelligence chips in the fourth quarter. The accelerator chips, which are in short supply, are designed to compete against the advanced H100 chips already sold by Nvidia.

In response to the export regulations, Nvidia introduced a modified version of its A100 chip, known as the A800, to ensure legal export to China in November last year. In March 2023, the company announced another China-export version, the H800, a tweaked version of the H100 chip that Chinese tech giants like Alibaba Group Holding Ltd, Baidu Inc, and Tencent Holdings Ltd are already utilizing in their cloud computing units.

In July this year, Intel entered the scene by introducing the Gaudi2 HL-225B, a chip explicitly tailored for the Chinese market. To adhere to US trade restrictions, Intel reduced the interconnect bandwidth by approximately 17 percent.

Currently, AMD’s MI300s exceed the performance limits set by the export controls enforced in October last year, which prohibit the sale of specific advanced chips to China. Unlike Nvidia and Intel, AMD has not yet designed customized chips for the lucrative Chinese market. It remains to be seen whether AMD will follow suit and design chips specifically for China.

Why Chip Makers are Obsessed with China

Last month, CEOs of Intel Corp., Qualcomm Inc., and Nvidia Corp. lobbied against the extension of restrictions on selling certain chips and semiconductor manufacturing equipment to China. The Biden administration is expected to introduce these restrictions in the coming weeks.

As per a Bloomberg report, China stands as the world’s largest commercial market for commodity semiconductors, accounting for approximately one-fifth of Nvidia’s global revenue. The remarkable 220% surge in Nvidia’s stock price this year is attributed to the soaring demand for high-end chips used in artificial intelligence systems and the optimistic outlook for sustained access to the lucrative Chinese market.

Moreover, around 60% of Qualcomm’s revenue originates from providing components to China, which serves as the manufacturing hub for most of the world’s consumer electronics.

China, including Hong Kong, contributed $5.2 billion to AMD’s revenue in 2022, roughly 30-40% of the company’s total and a substantial portion of its business. However, recent geopolitical tensions and political conflicts between the US and China could potentially impact these figures in the future.

U.S. officials are concerned about AI’s potential impact on national security, as AI-powered weapons could benefit adversaries and AI tools could be misused to create dangerous substances or harmful computer code.

Chipmakers, on the other hand, worry about their bottom line, because China is a huge market for them. The restrictions by the U.S. government come at a crucial time for GPU makers, as the industry is experiencing a GPU crisis. Amid the global GPU shortage, all eyes are on Nvidia, the leading supplier of GPUs in the market. Nvidia founder and CEO Jensen Huang recently warned that China will cultivate its own chip companies in response to tensions with the U.S., and that existing chip players will have to work hard to stay competitive.


Breaking the Data Barrier: How Zero-Shot, One-Shot, and Few-Shot Learning are Transforming Machine Learning

Introduction

In today’s fast-changing world, technology is improving every day, and Machine Learning and Artificial Intelligence have revolutionized a variety of industries with the power of process automation and improved efficiency. However, humans still have a distinct advantage over traditional machine learning algorithms, because these algorithms require thousands of samples to learn the underlying correlations and identify an object.

Imagine the frustration of unlocking your smartphone with fingerprints or facial recognition if you had to perform 100 scans before the algorithm worked. Such a feature would never have been put on the market.

However, since 2005, machine learning experts have developed new algorithms that could completely change the game. The improvements made over almost two decades have produced algorithms that can learn from the smallest number of samples (zero, one, or a few).

In this article, we explore the concepts behind those algorithms and provide a comprehensive understanding of how these learning techniques function, while also shedding light on some challenges faced when implementing them.

How does Zero-Shot Learning work?

Zero-shot learning is the concept of training a model to classify objects it has never seen before. The core idea is to exploit the existing knowledge of another model to obtain meaningful representations of new classes.


It uses semantic embeddings or attribute-based learning to leverage prior knowledge in a meaningful way that can provide a high-level understanding of relationships between known and unknown classes. Both can be used together or independently.

Semantic embeddings are vector representations of words, phrases, or documents that capture the underlying meaning and relationships between them in a continuous vector space. These embeddings are typically generated using unsupervised learning algorithms, such as Word2Vec, GloVe, or BERT. The goal is to create a compact representation of the linguistic information in which similar meanings are encoded as similar vectors. In this way, semantic embeddings allow for efficient and accurate comparison and manipulation of textual data and make it possible to generalize to unseen classes by projecting instances into a continuous, shared semantic space.
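
As a rough sketch of the embedding route (not any particular production system), the snippet below assumes embeddings already exist, for example from one of the encoders mentioned above, and assigns an instance to the unseen class whose semantic embedding is most similar.

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def zero_shot_classify(instance_vec, class_vecs):
    # Pick the unseen class whose embedding is closest to the instance.
    scores = {name: cosine(instance_vec, vec) for name, vec in class_vecs.items()}
    return max(scores, key=scores.get)

# Toy 4-dimensional embeddings, invented for illustration; in practice they
# would come from a pretrained encoder such as Word2Vec, GloVe, or BERT.
class_vecs = {
    "zebra": np.array([0.9, 0.8, 0.1, 0.0]),
    "whale": np.array([0.1, 0.0, 0.9, 0.8]),
}
instance = np.array([0.85, 0.7, 0.2, 0.1])
print(zero_shot_classify(instance, class_vecs))  # -> "zebra"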

Attribute-Based Learning enables the classification of objects from unseen classes without access to any labeled examples of those classes. It decomposes objects into their meaningful and noticeable properties, which serve as an intermediate representation, allowing the model to establish a correspondence between seen and unseen classes. This process typically involves attribute extraction, attribute prediction, and label inference.

  1. Attribute extraction involves deriving meaningful and discriminative attributes for each object class to bridge the gap between low-level features and high-level concepts.
  2. Attribute prediction involves learning a correspondence between low-level features of instances and high-level attributes, using ML techniques to recognize patterns and relationships between features to generalize to novel classes.
  3. Label inference involves predicting a new instance’s class label using its predicted attributes and the relationships between attributes and unseen class labels, without relying on labeled examples.
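
To make these three steps concrete, here is a toy sketch; the attribute names, class signatures, and the predicted attribute vector are invented for illustration, and a real system would learn the attribute predictor from labeled examples of the seen classes.

import numpy as np

# Step 1: attribute signatures chosen for the unseen classes
# (attributes: has_stripes, has_fins, lives_in_water).
class_attributes = {
    "zebra": np.array([1, 0, 0]),
    "shark": np.array([0, 1, 1]),
}

# Step 2: attribute probabilities a predictor (trained on seen classes)
# might output for a new instance.
predicted_attributes = np.array([0.9, 0.1, 0.2])

# Step 3: label inference - choose the class whose signature is closest.
def infer_label(pred, table):
    return min(table, key=lambda c: np.linalg.norm(pred - table[c]))

print(infer_label(predicted_attributes, class_attributes))  # -> "zebra"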

Despite the promising potential of zero-shot learning, several challenges remain, such as:

  • Domain Adaptation: The distribution of instances in the target domain may differ significantly from that in the source domain, leading to a discrepancy between the semantic embeddings learned for seen and unseen classes. This domain shift can harm performance, as the model may not establish a meaningful correspondence between instances and attributes across domains. To overcome this challenge, various domain adaptation techniques have been proposed, such as adversarial learning, feature disentangling, and self-supervised learning, which aim to align the distributions of instances and attributes in the source and target domains.

How does One-shot Learning work?

In the process of developing a traditional neural network, for example to identify cars, the model needs thousands of samples, captured from different angles and with different contrasts, in order to effectively differentiate them. One-shot learning takes a different approach. Instead of identifying the car in question, the method determines whether image A is equivalent to image B. This is obtained by generalizing the information the model has gained from experience with previous tasks. One-shot learning is mainly used in computer vision.


Techniques used to achieve this include Memory Augmented Neural Networks (MANNs) and Siamese Networks. By leveraging these techniques independently, one-shot learning models can quickly adapt to new tasks and perform well even with very limited data, making them suitable for real-world scenarios where obtaining labeled data can be expensive or time-consuming.

Memory Augmented Neural Networks (MANNs) are a class of advanced neural networks designed to learn from very few examples, similar to how humans can learn from just one instance of a new object. MANNs achieve this by having an extra memory component that can store and access information over time.

Imagine a MANN as a smart robot with a notebook. The robot can use its notebook to remember things it has seen before and use that information to understand new things it encounters. This helps the robot to learn much faster than a regular AI model.

Siamese Networks, on the other hand, are designed to compare data samples by employing two or more identical subnetworks with shared weights. These networks learn a feature representation that captures the essential differences and similarities between data samples.

Imagine Siamese Networks as a pair of twin detectives who always work together. They share the same knowledge and skills, and their job is to compare two items and decide if they’re the same or different. These detectives look at the important features of each item and then compare their findings to decide.

The training of a Siamese network involves two stages: the verification stage and the generalization stage.

  • During the verification, the network determines whether the two input images or data points belong to the same class or not. The network processes both inputs separately using twin subnetworks.
  • During the generalization, the model generalizes its understanding of the input data by effectively learning the feature representation that can discriminate between different classes.

Once the two stages have been carried out, the model is capable of determining whether image A corresponds to image B.
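
A compact PyTorch sketch of this idea follows; the encoder architecture, input shape, margin, and random stand-in data are illustrative choices, not a reference implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseNet(nn.Module):
    # Both inputs pass through the *same* encoder, i.e. shared weights.
    def __init__(self, embedding_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
            nn.Flatten(), nn.Linear(64, embedding_dim),
        )

    def forward(self, x1, x2):
        return self.encoder(x1), self.encoder(x2)

def contrastive_loss(z1, z2, same_class, margin=1.0):
    # Pull same-class pairs together, push different-class pairs apart.
    dist = F.pairwise_distance(z1, z2)
    return torch.mean(same_class * dist.pow(2) +
                      (1 - same_class) * F.relu(margin - dist).pow(2))

# One verification step on random stand-in "images" (8 pairs of 1x28x28).
net = SiameseNet()
x1, x2 = torch.randn(8, 1, 28, 28), torch.randn(8, 1, 28, 28)
same = torch.randint(0, 2, (8,)).float()  # 1 = same class, 0 = different
z1, z2 = net(x1, x2)
loss = contrastive_loss(z1, z2, same)
loss.backward()
print(float(loss))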

One-shot learning is very promising because the model does not need to be retrained to detect new classes. However, it faces challenges, such as high memory requirements and an immense need for computational power, since twice as many operations are needed for learning.

How does Few-Shot Learning work?

The last learning method to be presented is Few-Shot Learning, a subfield of meta-learning, aiming to develop algorithms capable of learning from a few labeled examples.


In this context, Prototypical Networks and Model-Agnostic Meta-Learning (MAML) are two prominent alternative techniques that have demonstrated success in few-shot learning scenarios.

Prototypical Networks

Prototypical Networks are a class of neural networks designed for few-shot classification tasks. The core idea is to learn a prototype, or a representative example, for each class in the feature space. The prototypes serve as a basis for classification by comparing the distance between a new input and the learned prototypes.

Three main steps are involved:

  1. Embedding: The network computes an embedding for each input using a neural network encoder, such as a Convolutional Neural Network (CNN) or a Recurrent Neural Network (RNN). The embeddings are high-dimensional representations that capture the salient features of the input data.
  2. Prototype computation: For each class, the network computes the prototype by taking the mean of the embeddings of the support set, which is a small subset of labeled examples for each class. The prototype represents the “center” of the class in the feature space.
  3. Classification: Given a new input, the network calculates its embedding and computes the distance (e.g. Euclidean distance) between the input’s embedding and the prototypes. The input is then assigned to the class with the nearest prototype.

The learning process involves minimizing a loss function that encourages the prototypes to be closer to the embeddings of their respective class and farther away from the embeddings of other classes.
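
A minimal episode of this procedure can be sketched as follows, with random tensors standing in for the encoder's embeddings; the class count, shot count, and dimensions are illustrative.

import torch

def classify_episode(support_emb, support_labels, query_emb, n_classes):
    # Prototype = mean embedding of each class's support examples.
    prototypes = torch.stack([
        support_emb[support_labels == c].mean(dim=0) for c in range(n_classes)
    ])
    # Classification = nearest prototype by Euclidean distance.
    dists = torch.cdist(query_emb, prototypes)
    return dists.argmin(dim=1)

# Toy 3-way, 5-shot episode with 16-dimensional stand-in embeddings.
d, n_classes, shots = 16, 3, 5
support_emb = torch.randn(n_classes * shots, d)
support_labels = torch.arange(n_classes).repeat_interleave(shots)
query_emb = torch.randn(4, d)
print(classify_episode(support_emb, support_labels, query_emb, n_classes))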

Model-Agnostic Meta-Learning (MAML)

MAML is a meta-learning algorithm that aims to find an optimal initialization for the model’s parameters, such that it can rapidly adapt to new tasks with a few gradient steps. MAML is model-agnostic, meaning it can be applied to any model that is trained with gradient descent.

MAML involves the following steps:

  1. Task sampling: During meta-training, tasks are sampled from a distribution of tasks, where each task is a few-shot learning problem with a few labeled examples.
  2. Task-specific learning: For each task, the model’s parameters are fine-tuned using the task’s training data (support set) with a few gradient steps. This results in task-specific models with updated parameters.
  3. Meta-learning: The meta-objective is to minimize the sum of the task-specific losses on the validation data (query set) for all tasks. The model’s initial parameters are updated via gradient descent to achieve this objective.
  4. Meta-testing: After meta-training, the model can be quickly fine-tuned on new tasks with a few gradient steps, leveraging the learned initialization.
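
A deliberately tiny MAML sketch on invented linear-regression tasks is shown below; the task sampler, single inner gradient step, and learning rates are illustrative simplifications of the full algorithm.

import torch

# Each task is y = a*x + b with its own randomly drawn (a, b).
def sample_task(n_support=5, n_query=5):
    a, b = torch.randn(1), torch.randn(1)
    xs, xq = torch.randn(n_support, 1), torch.randn(n_query, 1)
    return xs, a * xs + b, xq, a * xq + b

def model(x, w, bias):
    return x @ w + bias

def mse(pred, y):
    return ((pred - y) ** 2).mean()

w = torch.zeros(1, 1, requires_grad=True)
bias = torch.zeros(1, requires_grad=True)
meta_opt = torch.optim.SGD([w, bias], lr=1e-2)
inner_lr = 0.05

for step in range(200):                     # meta-training loop
    meta_opt.zero_grad()
    meta_loss = 0.0
    for _ in range(4):                      # step 1: sample a batch of tasks
        xs, ys, xq, yq = sample_task()
        # step 2: task-specific learning - one gradient step on the support set
        support_loss = mse(model(xs, w, bias), ys)
        gw, gb = torch.autograd.grad(support_loss, (w, bias), create_graph=True)
        w_adapted, b_adapted = w - inner_lr * gw, bias - inner_lr * gb
        # step 3: meta-objective - adapted parameters evaluated on the query set
        meta_loss = meta_loss + mse(model(xq, w_adapted, b_adapted), yq)
    meta_loss.backward()                    # gradients flow through the inner update
    meta_opt.step()                         # step 4 (meta-testing) adapts the learned init the same way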

MAML requires significant computational resources, as it involves multiple nested gradient updates which raise challenges. One such challenge is Task Diversity. In many few-shot learning scenarios, the model must adapt to a wide range of tasks or classes, each with only a few examples. This diversity can make it challenging to develop a single model or approach that can effectively handle different tasks or classes without extensive fine-tuning or adaptation.

Conclusion

The incredible world of machine learning has gifted us with groundbreaking techniques like Zero-Shot, One-Shot, and Few-Shot Learning. These approaches allow AI models to learn and recognize objects or patterns with only a handful of examples, much like the way humans do. This opens up a world of possibilities across various industries, such as healthcare, retail, and manufacturing, where access to vast amounts of labeled data isn’t always a luxury.
Christophe Atten leads a dynamic team of data scientists in finance and has also been a Medium AI writer since 2022, focused on transforming raw data into insightful solutions.

Original. Reposted with permission.


It’s Time to Master ChatGPT During Our Back-to-School Sale


Businesses aren’t usually impacted by the school calendar unless they’re running marketing campaigns to capitalize on the season. However, professional development should be a priority for all business owners who want to get the most out of their employees. As such, our Back-to-School sale is a great time for business leaders to invest in learning new skills and encourage their employees to do the same.

And if you’re spending time learning something new, why not learn how to master the world’s most exciting new technology, ChatGPT? The sale is running from July 28 to August 13, and during that time you can get The Complete ChatGPT Artificial Intelligence OpenAI Training Bundle for a special discounted price of just $19.97.

This four-course bundle is designed to take students from absolute ChatGPT novice to certified expert in just four hours. You’ll kick off with an introductory course from ChatGPT educator Mike Wheeler (4.5/5-star instructor rating) that will teach you the basics of ChatGPT.

From there, you’ll expand your practical skill set with a course from Alex Genadinik (4.4/5-star rating) that will help you write sales copy, use ChatGPT to ideate blog content and create a copywriting workflow in ChatGPT.

Finally, you’ll get two courses from John Elder (4.5/5-star rating) that will teach you how to create your own ChatGPT-like bots using Python, Django and Tkinter.

By the end of the courses, you’ll not only be able to use ChatGPT like a pro, but you’ll also be able to build your own chatbot to emulate its abilities. From July 28 through August 13, you can get The Complete ChatGPT Artificial Intelligence OpenAI Training Bundle for more than half off its $52 value, at just $19.97.

Prices and availability are subject to change.


Samsung, Hyundai back AI startup Tenstorrent: Everyone wants competition to Nvidia, says CEO Keller


Chip giant Nvidia is the most powerful force in artificial intelligence today, more powerful than Microsoft or Google or OpenAI. Its GPU chips are the dominant form of computing in the industry for programs such as ChatGPT. A raft of startups have failed to stem that dominance despite years of trying.

And yet, the world still hungers for competition, and Nvidia may be vulnerable because, some believe, the economics of Nvidia's dominance cannot be sustained.

"Nvidia has monopoly [profit margins]," said Jim Keller, CEO of AI chip startup Tenstorrent, in an exclusive interview with ZDNET. "If you want to go build a high-performance solution with AI inside of it, Nvidia will command most of the margin in the product."

"The problem with the winner-take-all strategy is it generates an economic environment where people really want an alternative."


Keller is a rockstar of the computer-chip world, known for a long string of chip successes, from turning around Advanced Micro Devices's processor business to creating the basis of Apple's custom processor business to building Tesla's Autopilot chip platform. He believes the industry's frustration with Nvidia's control, and the emerging technology of RISC-V, have opened a path for alternatives.

Some powerful parties believe he has a shot.

Wednesday, Keller announced Tenstorrent has received a hundred million dollars in funding from Hyundai Motor Group, the third-biggest car maker in the world, and Samsung Catalyst Fund, a venture arm of the electronics giant Samsung Electronics, along with participation from Fidelity Ventures, Eclipse Ventures, Epiq Capital, Maverick Capital, and more.

The new money is on top of previous funding of almost a quarter billion dollars, giving seven-year-old Tenstorrent an ample war chest to bring to market several chips for AI.

Hyundai, like many companies, is interested in alternatives to Nvidia for things such as putting AI in cars, said Keller. The company also has a robotics arm, having bought MIT robotics spinoff Boston Dynamics back in 2020.

"They are a technology leadership company," said Keller of Hyundai, "they are making money and investing it in technology because they see a path to building next-generation products with AI."


More to the point, Hyundai "want to build their own products and to hit the cost points and performance points," said Keller. "You can't give 60% gross margin to Nvidia for standard product, to be honest, and they're looking for options."

Hyundai's Executive Vice President and Head of the Global Strategy Office, Heung-soo Kim, said in prepared remarks, "Tenstorrent's high growth potential and high-performance AI semiconductors will help the Group secure competitive technologies for future mobilities."

"With this investment, the Group expects to develop optimized but differentiated semiconductor technology that will aid future mobilities and strengthen internal capabilities in AI technology development."

Keller and Hyundai Motor Co. Executive Vice President and Head of the Global Strategy Office, Heung-soo Kim.

Samsung's investment makes particular sense for one of the world's largest contract makers of semiconductors. The company has produced many of the chips for which Keller is famous, including the Tesla Autopilot. Samsung knows that from tiny acorns come mighty oaks, and today's startup could be tomorrow's big customer for chip-making.

The head of Samsung's Semiconductor Innovation Center, Marco Chisari, said in prepared remarks, "Tenstorrent's industry-leading technology, executive leadership, and aggressive roadmap motivated us to co-lead this funding round," adding, "We are excited by the opportunity to work with Tenstorrent to accelerate AI and compute innovations."

Keller with the head of Samsung's Semiconductor Innovation Center, Marco Chisari.

Keller lauded both companies in prepared remarks announcing the funding, stating, "The trust in Tenstorrent shown by Hyundai Motor Group and Samsung Catalyst Fund leading our round is truly humbling."

For Keller, who has many times over built the world's fastest chips, the argument is principally economic, but also heavily technological.

"I don't believe this is the end game for AI at all," he said, meaning, "GPUs running CUDA and PyTorch."

"When the aliens land, I don't think they'll be asking us, Did we invent CUDA?" quips Keller, referring to Nvidia's software platform for running those neural networks.

"While GPUs have success today, they're not the obvious best answer, they're more like the good-enough answer that was available," said Keller of the dominance of Nvidia chips such as the H100 "Hopper" GPU, which is Nvidia's leading product for running neural networks.

The more-advanced generative AI models, especially those coming from the open-source software community, will lead to a profound change in the field's distinction between training and inference. "I think the AI engine of the future … will have a fairly diverse set of capabilities that won't look like inference versus training," but more like a fusion of the two.


He concedes that Nvidia built an incredible lead in AI by co-founder and CEO Jensen Huang's wise decision to focus the company's efforts on software very early.

"The AI software challenge is harder than anybody thought, to be honest," observed Keller. "Most of the AI startups were started by hardware guys." Nvidia, he noted, "had a longer investment in that software stack, partly because they invested in HPC," high-performance computing, which is for complex scientific workloads, "when nobody wanted to," said Keller. That required special programming frameworks to be developed for the GPU. "They invested and they got some stuff to work."

But, said Keller, the world is changing. The rise of open-source alternatives to CUDA, of AI frameworks such as TensorFlow and PyTorch, and of open models created by companies such as Stability.ai and MosaicML and hosted by hubs such as Hugging Face, is promising, he said. "The intriguing thing is the amount of open source collaboration that's happening on the software front, which we need to match on the hardware front."

To match that open-source effort in hardware, Keller is betting on RISC-V, the open-source chip instruction set that was developed over a decade ago at the University of California at Berkeley by a peer of Keller's, the renowned chip pioneer David Patterson, and his colleagues.


For Keller, who is an astute problem solver on a grand scale, there is something crucial that is coming together on the economic front, and the technological front. It seems akin to other moments in technology when Keller made a profound impact, often against the prevailing wisdom in the industry.

"I like to explore the space and understand it and then do something," he said.

At the legendary Digital Equipment Corp., in the 1980s and 1990s, he built the world's fastest chip at the time. One of Keller's former startups, P.A. Semi Inc., was bought by Apple in 2008 and became the basis for the "A-series" silicon that now powers all Apple devices, an unlikely break from Intel. Tesla was "just a small engineering company" turning out no more than a quarter-million cars when Keller led a team to develop the hardware for Tesla's Autopilot, now in every car.

He resuscitated AMD's moribund chip development at a time when "Everybody told me AMD was going to go bankrupt," Keller recalled. His efforts laid the foundation that not only brought the company back from the brink, but turned it into a chip powerhouse.


In Tenstorrent, founded in 2016, Keller saw something intriguing. The company has focused on the chip opportunity created by the explosion in size of deep learning AI models such as OpenAI's GPT, programs that demand greater and greater performance.

Keller, who had been an angel investor in Tenstorrent when he was still at Tesla, knew founder Ljubisa Bajic, who had worked for Keller at AMD. "I had a chance to look at a whole bunch of proposals for AI engines, and I thought what he was doing was quite interesting," recalled Keller.

He was interested enough to take the top spot at Tenstorrent in January of 2021. "It [Tenstorrent] was on some level a research project, and I felt that we were starting to figure out what the research project was," he explained.


What is clear, said Keller, is that the arrival of AI is merging with the arrival of RISC-V and the economic pressure of Nvidia's dominance.

"Computation will be dominated by AI" moving forward, said Keller. The generative neural networks that Tenstorrent was built to address, as they increase in scale, are demanding more and more silicon horsepower, and so they are coming to dominate all chip design.

As in his past efforts, Keller didn't simply adopt the playbook as he found it at Tenstorrent. He has made the surprising decision to not only continue with the dedicated AI chips but to also build a general-purpose CPU that can handle the management of the AI chips.

"We decided to build a RISC-V processor to be a general purpose computing companion to the AI processor because general-purpose computing and AI are gonna work together, and they need to be tightly embedded."

To do so, "I hired some of the best designers from AMD, Apple and Nvidia," said Keller, who said he's into "the team adventure."

"We have a great CPU team; I'm really, really excited about it."

At one time, ARM, the privately held unit of SoftBank Group that is preparing for an initial public offering, was a potential savior for a chip industry caught between Intel and Nvidia dominance. That has changed, said Keller. "I talked to ARM quite a bit, and ARM had two big problems," observed Keller.

"One is they're way too expensive now," he said. What had been a workable economic situation for companies licensing ARM's technology has turned into a matter of constantly raising raising prices, demanding a higher and higher percentage of the products customers built with ARM's technology.

The other problem, he said, is that ARM wouldn't make changes to the fundamental instructions to accommodate the new forms of data handling that AI requires. "AI is changing fast," Keller observed. He turned to Silicon Valley startup SiFive, which has made a business of licensing CPU designs based on RISC-V. "They [ARM] didn't wanna make the modifications I needed on my next chip; SiFive said, 'sure'."

With RISC-V, both Tenstorrent, and its customers, can have control, Keller emphasized, unlike dealing with a monopolist. "Another design mission was, How do you build great technology where people have the rights to license?" he said.

As a result of the openness of RISC-V, "Slowly RISC-V is gonna replace everything," said Keller, meaning ARM, Nvidia's own instruction set, and the legacy x86 code on which the Intel empire is built.

In addition to the new RISC-V-based CPU, the Tenstorrent AI accelerator is undergoing a massive change to embed RISC-V capability inside of it. "Our AI engine has a large matrix multiplier, a tensor processor, a vector processor, but it also has five little RISC-V processors that basically issue the AI instruction stream," explained Keller.

The company is just starting to sell its first two generations of chips, and is rapidly moving to have first silicon of a third generation and working on the fourth.

The Tenstorrent business will have several avenues to make money. The general-purpose CPU being designed is a "high-end processor" that "has licensing value," and then there's the AI accelerator chip, which is both a part the company can sell and "also a RISC-V AI engine that other people could license."


There should be plenty of takers, given that the industry is chafing at the Nvidia tax, as one might call the high prices of the H100 and other parts.

"I've talked to power-supply companies, microcontroller companies, autonomous driving startups, data center edge server makers," he said. "If you make your own chip and you want an AI engine in there, you can't put a $2,000 GPU in it," he observed. "I talked to two robotics companies who basically had the same line: I can't put a $10,000 GPU in a $10,000 robot."


Of course, Nvidia seems these days like a company with no competition. It routinely dominates benchmark tests of chip performance such as MLPerf. Its data center business, which contains the AI chips, towers above the AI sales of Intel and Advanced Micro Devices. The unit may double in revenue terms this year, to $31 billion, almost two thirds of all of Intel's annual revenue.

A raft of startups packed with incredibly talented engineers, such as Cerebras Systems, Graphcore and SambaNova Systems, have failed to make a dent in Nvidia, despite the fact that everyone knows everyone wants an alternative to Nvidia.

None of that fazes Keller, who has fought and won many battles in a lifetime of chip design. For one thing, those companies haven't leveraged RISC-V, which is a game-changer in his view. "If we had come up with the open-source RISC-V AI engine five years ago" that Tenstorrent is now building, said Keller, "then 50 startups could have been innovating on that rather than solving the same problem 15 different ways, 50 different times," as Cerebras and others have been doing.

On a simpler level, people always assume the status quo will stay in place, and that's never the case.

"The war for computation has been over many times," mused Keller. "Mainframes won it, and then mini-computers won it, and then workstations won it, and then PCs won it, and then mobile won it — the war has been won, so let's start the next battle!"

On an even simpler level, "I think computers are an adventure," he said. "I like to design computers; I'm into the adventure."

India Restricts Import of Laptops, Tablets, and PCs for Make in India 

India has implemented a restriction on the import of laptops, tablets, other personal computers, and servers, effective immediately, as announced by the Ministry of Commerce and Industry. The amendment introduces a licensing requirement for imports, which analysts believe is a step to strengthen local manufacturing initiatives.

According to a government notification, the import of laptops, tablets, all-in-one personal computers, ultra-small form factor computers, and servers categorized under HSN 8471 will be restricted. However, imports will be permitted with a valid license for restricted imports. The restriction does not apply to passengers carrying these devices in their baggage.

According to the Reuters report, the government notification gave no reason for the move, but Prime Minister Narendra Modi’s government has been promoting local manufacturing and discouraging imports under his “Make in India” plan.

The Indian government, under the leadership of Prime Minister Narendra Modi, introduced a $2 billion scheme in May to incentivize and encourage local businesses engaged in hardware manufacturing, including laptops, personal computers (PCs), servers, and related edge computing equipment.

In an effort to attract significant investments in IT hardware manufacturing, the Indian government has extended the deadline for companies to apply for a $2 billion incentive scheme. This move is crucial to India’s aspirations of establishing itself as a major player in the global electronics supply chain. The country aims to achieve annual production worth $300 billion by the year 2026.

Laptops, tablets and personal computers account for about 1.5% of India’s total annual imports, with nearly half of those from China, according to government data.

Many of Apple’s iPads and Dell’s laptops are imported into the country, rather than being manufactured locally.


EPAM Launches DIAL, a Unified Generative AI Orchestration Platform

EPAM Systems has announced the launch of its AI-powered DIAL (Deterministic Integrator of Applications and LLMs) Orchestration Platform, which merges the power of Large Language Models (LLMs) with deterministic code — offering a secure, scalable, and customisable AI workbench to streamline and enhance AI-driven business solutions.

Developed by EPAM’s Reliable AI Lab (RAIL), DIAL helps enterprises speed their experimentation and innovation efforts across an extensive range of LLMs, AI-native Applications and Custom Add-ons as well as provides a practical approach for engineering business solutions with reliable AI capabilities.

The DIAL Platform offers a unified user interface, empowering businesses to leverage a spectrum of public and proprietary LLMs, Add-ons, APIs, Datastores and Business Applications. This integration promotes the development of novel enterprise assets that co-exist seamlessly with an organization’s existing workflows.

In keeping with EPAM’s long-standing commitment to Open Source, portions of DIAL will be released under an Apache 2.0 licensing scheme as part of its launch. This initiative encourages responsible use, community innovation and the adoption of responsible AI enterprise standards within the industry.

Moreover, Applications and Add-ons can be implemented through diverse approaches, encompassing LangChain, LlamaIndex, Semantic Kernel or custom code, all within an integrated, secure and scalable framework.

The DIAL Platform aggregates multi-cloud asset libraries, including components, routing and rate-limiting software, monitoring tools, load-balancing solutions and deployment scripts. This extensive, curated toolkit supports a wide range of business use cases and integration scenarios and offers approaches to significantly optimize the consumption of external LLMs.


7 Steps to Mastering Data Cleaning and Preprocessing Techniques


Mastering Data Cleaning and Preprocessing Techniques is fundamental for solving a lot of data science projects. A simple demonstration of how important they are can be found in the meme comparing the expectations of a student studying data science with the reality of the data scientist’s job.

We tend to idealise the job before having concrete experience, but the reality is always different from what we expect. When working on a real-world problem, there is no documentation of the data, and the dataset is very dirty. First, you have to dig deep into the problem and understand what clues you are missing and what information you can extract.

After understanding the problem, you need to prepare the dataset for your machine learning model, since the data in its initial condition is never enough. In this article, I am going to show seven steps that can help you with preprocessing and cleaning your dataset.

Step 1: Exploratory Data Analysis

The first step in a data science project is exploratory analysis, which helps in understanding the problem and making decisions in the next steps. It tends to be skipped, but that is the worst error, because you’ll lose a lot of time later trying to find out why the model gives errors or doesn’t perform as expected.

Based on my experience as a data scientist, I would divide the exploratory analysis into three parts:

  1. Check the structure of the dataset, the statistics, the missing values, the duplicates, the unique values of the categorical variables
  2. Understand the meaning and the distribution of the variables
  3. Study the relationships between variables

To analyse how the dataset is organised, the following Pandas methods can help you:

df.head()
df.info()
df.isnull().sum()
df.duplicated().sum()
df.describe([x*0.1 for x in range(10)])

for c in list(df):
    print(df[c].value_counts())

When trying to understand the variables, it’s useful to split the analysis into two further parts: numerical features and categorical features. First, we can focus on the numerical features, which can be visualised through histograms and boxplots. Then it’s the turn of the categorical variables. If it’s a binary problem, it’s better to start by checking whether the classes are balanced. After that, our attention can be focused on the remaining categorical variables using bar plots. Finally, we can check the correlation between each pair of numerical variables. Other useful data visualisations are scatter plots and boxplots to observe the relations between a numerical and a categorical variable.
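
As a rough illustration (the column names are placeholders for your own dataset), these plots could be produced along the following lines with matplotlib and seaborn:

import matplotlib.pyplot as plt
import seaborn as sns

# Numerical features: distributions and outliers.
df['age'].hist(bins=30)
plt.show()
sns.boxplot(x=df['age'])
plt.show()

# Categorical features: class balance and category frequencies.
df['target'].value_counts().plot(kind='bar')
plt.show()

# Correlation between each pair of numerical variables.
sns.heatmap(df.select_dtypes('number').corr(), annot=True)
plt.show()

# Relation between a numerical and a categorical variable.
sns.boxplot(x='type_building', y='price', data=df)
plt.show()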

Step 2: Deal with Missings

In the first step, we already investigated whether there are missing values in each variable. If there are, we need to decide how to handle the issue. The easiest way would be to remove the variables or the rows that contain NaN values, but we would prefer to avoid that because we risk losing useful information that could help our machine learning model solve the problem.

If we are dealing with a numerical variable, there are several approaches to fill it. The most popular method consists in filling the missing values with the mean/median of that feature:

df['age'] = df['age'].fillna(df['age'].mean())    # fill with the mean
df['age'] = df['age'].fillna(df['age'].median())  # or fill with the median

Another way is to substitute the blanks with group by imputations:

df['price'].fillna(df.groupby('type_building')['price'].transform('mean'), inplace=True)

It can be a better option in case there is a strong relationship between a numerical feature and a categorical feature.

In the same way, we can fill the missing values of a categorical variable with the mode of that variable:

df['type_building'] = df['type_building'].fillna(df['type_building'].mode()[0])

Step 3: Deal with Duplicates and Outliers

If there are duplicates within the dataset, it’s better to delete the duplicated rows:

df = df.drop_duplicates()

While deciding how to handle duplicates is simple, dealing with outliers can be challenging. You need to ask yourself “Drop or not Drop Outliers?”.

Outliers should be deleted if you are sure that they provide only noisy information. For example, if the dataset contains two people aged 200 while the age range is between 0 and 90, it’s better to remove those two data points.

df = df[df.Age<=90]

Unfortunately, most of the time removing outliers can lead to losing important information. The most efficient way is to apply the logarithm transformation to the numerical feature.
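
For example, a skewed feature such as price (an illustrative column name) can be compressed with a log transform; log1p is used here because it also handles zero values safely.

import numpy as np

df['log_price'] = np.log1p(df['price'])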

Another technique that I discovered during my last experience is the clipping method. In this technique, you choose an upper and a lower bound, which can be, for example, the 10th and the 90th percentiles. The values of the feature below the lower bound are replaced with the lower bound value, while the values above the upper bound are replaced with the upper bound value.

for c in columns_with_outliers:
    transform = 'clipped_' + c
    lower_limit = df[c].quantile(0.10)
    upper_limit = df[c].quantile(0.90)
    df[transform] = df[c].clip(lower_limit, upper_limit, axis=0)

Step 4: Encode Categorical Features

The next phase is to convert the categorical features into numerical features. Indeed, most machine learning models can work only with numbers, not strings.

Before going further, you should distinguish between two types of categorical variables: non-ordinal variables and ordinal variables.

Examples of non-ordinal variables are gender, marital status and type of job. A variable is non-ordinal if its values don’t follow a natural order, unlike ordinal features. Examples of ordinal variables are education, with values “childhood”, “primary”, “secondary” and “tertiary”, and income, with levels “low”, “medium” and “high”.

When we are dealing with non-ordinal variables, One-Hot Encoding is the most popular technique for converting them into numerical features.

In this method, we create a new binary variable for each level of the categorical feature. Each binary variable is 1 when the observation belongs to that level, and 0 otherwise.

from sklearn.preprocessing import OneHotEncoder

data_to_encode = df[cols_to_encode]
encoder = OneHotEncoder(dtype='int')
encoded_data = encoder.fit_transform(data_to_encode)
dummy_variables = encoder.get_feature_names_out(cols_to_encode)
encoded_df = pd.DataFrame(encoded_data.toarray(), columns=dummy_variables)

final_df = pd.concat([df.drop(cols_to_encode, axis=1), encoded_df], axis=1)

When the variable is ordinal, the most common technique is Ordinal Encoding, which consists of converting the unique values of the categorical variable into integers that follow an order. For example, the levels “low”, “medium” and “high” of income will be encoded as 0, 1 and 2 respectively.

from sklearn.preprocessing import OrdinalEncoder

data_to_encode = df[cols_to_encode]
encoder = OrdinalEncoder(dtype='int')
encoded_data = encoder.fit_transform(data_to_encode)
encoded_df = pd.DataFrame(encoded_data, columns=cols_to_encode)

final_df = pd.concat([df.drop(cols_to_encode, axis=1), encoded_df], axis=1)

There are other possible encoding techniques; you can take a look here in case you are interested in alternatives.

Step 5: Split the Dataset into Training, Validation and Test Sets

It’s time to divide the dataset into three fixed subsets: the most common choice is to use 60% for training, 20% for validation and 20% for testing. As the quantity of data grows, the percentage for training increases and the percentage for validation and testing decreases.

It’s important to have three subsets because the training set is used to train the model, while the validation and the test sets can be useful to understand how the model is performing on new data.

To split the dataset, we can use the train_test_split of scikit-learn:

from sklearn.model_selection import train_test_split

X = final_df.drop(['y'], axis=1)
y = final_df['y']

train_idx, test_idx, y_train, _ = train_test_split(X.index, y, test_size=0.2, random_state=123)
train_idx, val_idx, _, _ = train_test_split(train_idx, y_train, test_size=0.2, random_state=123)

df_train = final_df[final_df.index.isin(train_idx)]
df_val = final_df[final_df.index.isin(val_idx)]
df_test = final_df[final_df.index.isin(test_idx)]

In case we are dealing with a classification problem and the classes are not balanced, it’s better to set the stratify argument to make sure there is the same proportion of classes in the training, validation and test sets.

train_idx, test_idx, y_train, _ = train_test_split(X.index, y, test_size=0.2, stratify=y, random_state=123)
train_idx, val_idx, _, _ = train_test_split(train_idx, y_train, test_size=0.2, stratify=y_train, random_state=123)

This stratified split also helps ensure that the target variable has the same proportion in the three subsets and gives a more accurate picture of the model’s performance.

Step 6: Feature Scaling

There are machine learning models, like Linear Regression, Logistic Regression, KNN, Support Vector Machines and Neural Networks, that require feature scaling. Feature scaling only puts the variables on the same range, without changing their distribution.

The three most popular feature scaling techniques are Normalization, Standardization and Robust Scaling.

Normalization, also called min-max scaling, consists of mapping the values of a variable into a range between 0 and 1. This is done by subtracting the minimum of the feature from each value and then dividing by the difference between the maximum and the minimum of that feature.

from sklearn.preprocessing import MinMaxScaler

sc = MinMaxScaler()
df_train[numeric_features] = sc.fit_transform(df_train[numeric_features])
df_test[numeric_features] = sc.transform(df_test[numeric_features])
df_val[numeric_features] = sc.transform(df_val[numeric_features])

Another common approach is Standardization, which rescales the values of a column so that they have mean 0 and variance 1, the properties of a standard normal distribution.

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
df_train[numeric_features] = sc.fit_transform(df_train[numeric_features])
df_test[numeric_features] = sc.transform(df_test[numeric_features])
df_val[numeric_features] = sc.transform(df_val[numeric_features])

If the feature contains outliers that cannot be removed, a preferable method is Robust Scaling, which rescales the values of a feature based on robust statistics: the median, the first quartile and the third quartile. The rescaled value is obtained by subtracting the median from the original value and then dividing by the interquartile range, which is the difference between the 75th and 25th percentiles of the feature.

from sklearn.preprocessing import RobustScaler

sc = RobustScaler()
df_train[numeric_features] = sc.fit_transform(df_train[numeric_features])
df_test[numeric_features] = sc.transform(df_test[numeric_features])
df_val[numeric_features] = sc.transform(df_val[numeric_features])

In general, it’s preferable to calculate the statistics on the training set and then use them to rescale the values in the training, validation and test sets. This is because we suppose that we only have the training data and, later, we want to test our model on new data, which should have a distribution similar to the training set.

Step 7: Deal with Imbalanced Data

This step is included only when we are working on a classification problem and have found that the classes are imbalanced.

In case there is only a slight difference between the classes, for example class 1 contains 40% of the observations and class 2 contains the remaining 60%, we don’t need to apply oversampling or undersampling techniques to alter the number of samples in one of the classes. We can simply avoid relying on accuracy, since it’s a good measure only when the dataset is balanced, and focus instead on evaluation measures like precision, recall and F1-score.
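As a minimal sketch, assuming the model’s predictions y_pred and the true test labels y_test (variable names are illustrative), these measures can be computed with scikit-learn:

from sklearn.metrics import precision_score, recall_score, f1_score, classification_report

print(precision_score(y_test, y_pred))
print(recall_score(y_test, y_pred))
print(f1_score(y_test, y_pred))

# Or all of them at once, reported per class
print(classification_report(y_test, y_pred))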

But it can happen that the positive class has a very low proportion of data points (for example 20%) compared to the negative class (80%). The machine learning model may not perform well on the class with fewer observations, and may end up failing to solve the task.

To overcome this issue, there are two possibilities: undersampling the majority class and oversampling the minority class. Undersampling reduces the number of samples by randomly removing some data points from the majority class, while oversampling increases the number of observations in the minority class by randomly duplicating data points from the less frequent class. The imblearn library allows you to balance the dataset with a few lines of code:

# undersampling
from imblearn.under_sampling import RandomUnderSampler
undersample = RandomUnderSampler(sampling_strategy='majority')
X_train, y_train = undersample.fit_resample(df_train.drop(['y'], axis=1), df_train['y'])

# oversampling
from imblearn.over_sampling import RandomOverSampler
oversample = RandomOverSampler(sampling_strategy='minority')
X_train, y_train = oversample.fit_resample(df_train.drop(['y'], axis=1), df_train['y'])

However, removing or duplicating some of the observations is sometimes ineffective in improving the performance of the model. It would be better to create new artificial data points in the minority class. A technique proposed to solve this issue is SMOTE, which is known for generating synthetic records in the less represented class. Like KNN, the idea is to identify the k nearest neighbours of each observation belonging to the minority class, based on a particular distance, typically the Euclidean distance. Then a new point is generated at a random location between the observation and one of its k nearest neighbours. This process keeps creating new points until the dataset is balanced.

from imblearn.over_sampling import SMOTE

resampler = SMOTE(random_state=123)
X_train, y_train = resampler.fit_resample(df_train.drop(['y'], axis=1), df_train['y'])

I should highlight that these approaches should be applied only to resample the training set. We want our machine learning model to learn in a robust way, and then we can apply it to make predictions on new data.

Final Thoughts

I hope you have found this comprehensive tutorial useful. It can be hard to start your first data science project without being aware of all these techniques. You can find all my code here.

There are surely other methods I didn’t cover in this article, but I preferred to focus on the most popular and well-known ones. Do you have other suggestions? Drop them in the comments.

Useful resources:

  • A Practical Guide for Exploratory Data Analysis
  • Which models require normalized data?
  • Random Oversampling and Undersampling for Imbalanced Classification

Eugenia Anello is currently a research fellow at the Department of Information Engineering of the University of Padova, Italy. Her research project is focused on Continual Learning combined with Anomaly Detection.


Google AI Helps Doctors Decide When to Trust AI Diagnoses

ML in healthcare

AI has changed the face of many sectors, but in healthcare its adoption is moving at a snail’s pace. The sector faces several challenges, such as data privacy, security, lack of interoperability and lack of regulation, which have restricted AI adoption.

AI models are prone to making mistakes and, as we know, to err is human. Google Research has asked the question: what would the error rates be if you combined the expertise of predictive AI models with that of clinicians?

In July this year, Google DeepMind joined hands with Google Research and introduced Complementarity-driven Deferral to Clinical Workflow (CoDoC), a system that maximises accuracy by combining human expertise with predictive AI. The system essentially decides whether the AI model is likely to be more accurate than a clinician’s diagnostic workflow for a given case. It does this using the confidence score of the predictive model as one of its inputs.

Comprehensive tests of CoDoC on multiple real-world datasets have shown that combining human expertise and predictive AI through CoDoC provides greater accuracy. The researchers saw a 25% reduction in false positives for mammography datasets and, more importantly, didn’t miss any true positives.

The published paper is a significant advancement in collaboration between AI and clinicians. It promises improved accuracy in determining diseases with binary outcomes. The datasets focused on breast cancer screening using X-ray mammography and triage for tuberculosis (TB) using chest X-rays.

Sidestepping AI hurdles in medicine

Knowing when to say ‘I don’t know’ is essential when working with artificial intelligence tools in a medical setting. The paper addresses the crucial challenge of when to acknowledge uncertainty and then to pass on the responsibility to the clinician. “If you use CoDoC together with the AI tool, and the outputs of a real radiologist, and then CoDoC helps decide which opinion to use, the resulting accuracy is better than either the person or the AI tool alone,” says Alan Karthikesalingam at Google Health UK, who worked on the research.

The CoDoC model also does not require patients’ medical images to make the decision, which takes care of patient privacy. It requires only three inputs for each case in the training dataset. The first is the confidence score output by the hospital’s own existing predictive AI (0 means certainty that no disease is present, 1 means certainty that the disease is present). The second is the output from a non-AI expert clinical workflow, and the third is historical ‘ground truth’ data.

The system could be compatible with any proprietary AI model and would not need access to the model’s inner workings or the data it was trained on. To apply the CoDoC paradigm to existing predictive AI systems, researchers would follow the methodology described in the paper, which involves training a CoDoC-style model using the outputs of their own existing predictive AI system.

How the predictive model works

CoDoC learns by comparing the predictive AI model’s accuracy with the doctor’s interpretation, and then checks how that accuracy varies with the confidence score generated by the predictive AI model.

After being trained, CoDoC is placed in a hypothetical future clinical workflow, working alongside both the predictive AI and the human doctor. When a new patient image is evaluated by the predictive AI model, its associated confidence score is fed into CoDoC.
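As a purely illustrative sketch of the deferral idea described above (not the actual CoDoC implementation; the thresholds, function names and values are invented for illustration), a confidence-based deferral rule could look like this:

# Toy illustration of confidence-based deferral (not CoDoC's real code).
# Assumed inputs: a confidence score in [0, 1] from an existing predictive AI model
# and thresholds tuned on historical cases where the clinician was more reliable.

def defer_to_clinician(confidence, lower=0.3, upper=0.7):
    """Return True if the case should be passed to the clinician."""
    # When the model is uncertain (confidence in the middle band),
    # the clinician's opinion is used instead of the AI output.
    return lower <= confidence <= upper

def final_decision(confidence, clinician_opinion, threshold=0.5):
    """Combine the AI prediction and the clinician's opinion into one decision."""
    if defer_to_clinician(confidence):
        return clinician_opinion            # rely on the human expert
    return int(confidence >= threshold)     # otherwise accept the AI prediction

# Example usage with invented values
print(final_decision(confidence=0.92, clinician_opinion=0))  # AI is confident -> 1
print(final_decision(confidence=0.55, clinician_opinion=1))  # uncertain -> defer -> 1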

By having an AI as an effective tool to confirm the diagnosis, doctors can be confident of their diagnoses even in edge cases, something that was not available before.

“The advantage of CoDoC is that it’s interoperable with a variety of proprietary AI systems,” says Krishnamurthy Dvijotham at Google DeepMind.

Inviting developers to test, validate, build

To help researchers build on their work, be transparent, and ensure safer AI models for the real world, they’ve open-sourced CoDoC’s code on GitHub.

This work is theoretical so far, but according to the researchers it shows the AI system’s potential to adapt and improve performance in interpreting medical imaging across varied demographic populations, clinical settings, medical imaging equipment and disease types.

Helen Salisbury from the University of Oxford said, “It is a welcome development, but mammograms and tuberculosis checks involve fewer variables than most diagnostic decisions, so expanding the use of AI to other applications will be challenging.” She further said, “For systems where you have no chance to influence, post-hoc, what comes out the black box, it seems like a good idea to add on machine learning.”
