Nvidia Launches AI Factories in DGX Cloud Amid the GPU Squeeze

July 28, 2023, by Agam Shah

(JLStock/Shutterstock)

Nvidia is now renting out its homegrown AI supercomputers with its newest GPUs in the cloud for those keen to access its hardware and software packages.

The DGX Cloud service will include its high-performance AI hardware, including the H100 and A100 GPUs, which are currently in short supply. Users will be able to rent the systems through Nvidia’s own cloud infrastructure or Oracle’s cloud service.

“DGX Cloud is available worldwide except where subject to U.S. export controls,” said Tony Paikeday, senior director, DGX Platforms at Nvidia.

The service will run on Nvidia’s own cloud infrastructure, which includes DGX systems located in the U.S. and U.K., and will also be available through Oracle Cloud Infrastructure.

Nvidia announced wide availability of its DGX Cloud service after first unveiling it at its GTC conference in March. The Tuesday announcement followed a string of AI-in-cloud announcements last week.

Rival Cerebras Systems is installing AI systems in cloud services run by Middle Eastern cloud provider G42, which will deliver 36 exaflops of performance. Tesla announced it was starting production of its Dojo supercomputer, which will run on its homegrown D1 chips, and deliver 100 exaflops of performance by the end of next year. The benchmarks vary depending on the data type.

Tesla CEO Elon Musk last week talked about shortages of Nvidia GPUs for its existing AI hardware, saying that Tesla was waiting for supplies. Users can lock down access to Nvidia’s hardware and software on DGX Cloud, but at a hefty premium.

An Nvidia DGX superpod.

The DGX Cloud rentals include access to Nvidia’s cloud computers, each with H100 or A100 GPUs and 640GB of GPU memory, on which companies can run AI applications. Nvidia’s goal is to run its AI infrastructure like a factory — feed in data as raw material, and the output is usable information that companies can put to work. Customers do not have to worry about the software and hardware in the middle.

“DGX Cloud serves a critical need: dedicated compute for multi-node training of large complex generative AI models like large language models,” Paikeday said. “Enterprises will also get a deep bench of technical expertise to deploy and operate the environment supporting such workloads,” he added.

Pricing for DGX Cloud starts at $36,999 per instance per month.

That is about double the price of Microsoft Azure’s ND96asr instance with eight Nvidia A100 GPUs, 96 CPU cores, and 900GB of RAM, which costs $19,854 per month. Nvidia’s base price includes AI Enterprise software, which provides access to large language models and tools to develop AI applications.

The rentals include a software interface called the Base Command Platform, which companies use to manage and monitor DGX Cloud training workloads. The Oracle Cloud has clusters of up to 512 Nvidia GPUs with a 200 gigabits-per-second RDMA network, and includes support for multiple file systems such as Lustre.

All major cloud providers have their own deployments of Nvidia’s H100 and A100 GPUs, which are different from DGX Cloud. 

Nvidia H100 die.

Google earlier this year announced the A3 supercomputer with 26,000 Nvidia H100 Hopper GPUs, a setup that resembles Nvidia’s DGX SuperPOD, which spans 127 DGX nodes, each equipped with eight H100 GPUs. Amazon’s AWS EC2 UltraClusters with P5 instances will be based on the H100.

“We expect DGX Cloud to attract new generative AI customers and workloads to our partners’ clouds,” Paikeday said.

With lockdown also comes lock-in: Nvidia is trying to get customers to use its proprietary AI hardware and software technologies based on its CUDA programming model. That could prove costly for companies in the long run, as they would pay for both software licenses and GPU time. Nvidia said investments in AI will benefit companies in the form of long-term operational savings.

The AI community is pushing open-source models and railing against proprietary models and tools, but Nvidia has a stranglehold on the AI hardware market. Nvidia is one of the few companies that can provide hardware and software stacks and services that make practical implementations of machine learning possible.

The interest in Nvidia’s AI hardware comes amid a rush to tap into the promise of generative AI. OpenAI’s ChatGPT demonstrated the capabilities of AI in the form of a chatbot, but new models are now emerging for vertical markets that include health care, insurance, and finance.

This article first appeared on HPCwire.


Google DeepMind’s new RT-2 system enables robots to perform novel tasks

Abstract robot AI being tested

As artificial intelligence advances, we look to a future with more robots and automations than ever before. They already surround us — the robot vacuum that can expertly navigate your home, a robot pet companion to entertain your furry friends, and robot lawnmowers to take over weekend chores. We appear to be inching towards living out The Jetsons in real life. But as smart as they appear, these robots have their limitations.

Google DeepMind unveiled RT-2, the first vision-language-action (VLA) model for robot control, which effectively takes the robotics game several levels up. The system was trained on text data and images from the internet, much like the large language models behind AI chatbots like ChatGPT and Bing are trained.


Our robots at home can perform the simple tasks they are programmed for. Vacuum the floors, for example, and if the left-side sensor detects a wall, try to go around it. But traditional robotic control systems aren't programmed to handle new situations and unexpected changes — often, they can't perform more than one task at a time.

RT-2 is designed to adapt to new situations over time, learn from multiple data sources like the web and robotics data to understand both language and visual input, and perform tasks it has never encountered nor been trained to perform.

"A visual-language model (VLM) pre-trained on web-scale data is learning from RT-1 robotics data to become RT-2, a visual-language-action (VLA) model that can control a robot," from Google DeepMind.

A traditional robot can be trained to pick up a ball yet stumble when picking up a cube. RT-2's more flexible approach means a robot trained on picking up a ball can figure out how to adjust its extremities to pick up a cube or another toy it has never seen before.

Instead of the time-consuming, real-world training on billions of data points that traditional robots require, where they have to physically recognize an object and learn how to pick it up, RT-2 is trained on a large amount of data and can transfer that knowledge into action, performing tasks it's never experienced before.


"RT-2's ability to transfer information to actions shows promise for robots to more rapidly adapt to novel situations and environments," said Vincent Vanhoucke, Google DeepMind's head of robotics. "In testing RT-2 models in more than 6,000 robotic trials, the team found that RT-2 functioned as well as our previous model, RT-1, on tasks in its training data, or 'seen' tasks. And it almost doubled its performance on novel, unseen scenarios to 62% from RT-1's 32%."

Some of the examples of RT-2 at work that were published by Google DeepMind.

The DeepMind team adapted two existing models, Pathways Language and Image Model (PaLI-X) and Pathways Language Model Embodied (PaLM-E), to train RT-2. PaLI-X, trained on massive amounts of images and visual information with corresponding descriptions and labels from the web, helps the model process visual data. With PaLI-X, RT-2 can recognize different objects, understand its surrounding scenes for context, and relate visual data to semantic descriptions.

PaLM-E helps RT-2 interpret language, so it can easily understand instructions and relate them to what is around it and what it's currently doing.


As the DeepMind team adapted these two models to work as the backbone for RT-2, it created the new VLA model, enabling a robot to understand language and visual data and subsequently generate the appropriate actions it needs.

RT-2 is not a robot in itself — it's a model that can control robots more efficiently than ever before. An RT-2-enabled robot can perform tasks ranging in degrees of complexity using visual and language data, like organizing files alphabetically by reading the labels on the documents and sorting them, then putting them away in the correct places.

It could also handle complex tasks. For instance, if you said, "I need to mail this package, but I'm out of stamps," RT-2 could identify what needs to be done first, like finding a Post Office or merchant that sells stamps nearby, take the package, and handle the logistics from there.


"Not only does RT-2 show how advances in AI are cascading rapidly into robotics, it shows enormous promise for more general-purpose robots," Vanhoucke added.

Let's hope that 'promise' leans more towards living out The Jetsons' plot than The Terminator's.


Hammerspace Raises $56M to Reimagine Data Orchestration

July 28, 2023, by Jaime Hampton

(thodonal88/Shutterstock)

In the gaming world, a hammerspace is an instantly accessible storage location that allows characters to seemingly grab objects out of thin air. If you have ever pulled a refrigerator out of your pocket while playing Nintendo’s “Animal Crossing,” you’ve seen a hammerspace in action.

When it comes to the vast troves of data enterprises now contend with, a real-life hammerspace could really come in handy. That is the impetus behind the company of the same name: Hammerspace, an enterprise data orchestration firm that just raised $56.7 million in an institutional investment round.

Maintaining secure access and control over an organization’s entire data ecosystem is a monumental task. Data is often siloed between different vendor storage solutions in multiple locations and cloud environments, and while many storage systems are highly scalable, they come with major latency tradeoffs.

Hammerspace CEO and Co-founder David Flynn previously told Datanami’s Alex Woodie that on a foundational level, the relationship between data and storage infrastructure is quite broken.

During the company’s quarterly update call this week, Flynn told of how Hammerspace was founded with the goal of decoupling data from infrastructure, allowing for a single view of data even when data is physically distributed.

Hammerspace’s software brings an enterprise’s existing data stores together into one file system. Rather than copying data from silo to silo, Hammerspace’s vendor-neutral orchestration system bridges any on-prem or cloud-based storage type, the company says, creating a cross-platform global data environment. Data services and file operations are automated as background tasks. Users have uninterrupted file access via standard file protocols regardless of data placement actions, infrastructure changes, or storage type and location, including multi-cloud use cases.

Flynn’s background is in high performance computing where speed is key. He was CEO and co-founder of Fusion-io, a startup known for its enterprise flash storage products for HPC that was acquired by SanDisk in 2014 for $1.1 billion. At one point, Flynn’s company helped the San Diego Supercomputer Center cut its MySQL database query times from 30 minutes to 3 minutes.

(Source: Hammerspace)

Flynn’s HPC roots led to the development of Hammerspace’s unique architecture that separates the control plane from the data path and introduces an abstraction layer that enhances performance scalability, a technique commonly used in supercomputing but not yet seen in enterprise data storage.

To achieve this, Hammerspace adapted Parallel NFS (pNFS), part of the NFS distributed file system protocol and used mainly in academic settings. To overcome the NFS limitation where all data must flow through a single NFS server, pNFS separates metadata from file data, allowing clients to access data directly from multiple storage devices in parallel. Hammerspace saw an opportunity to use pNFS for data orchestration, building on this architecture to create its platform.

Demand for Hammerspace is growing. Flynn said on the update call that the company has experienced a 300% year-over-year growth. One of its customers is Jellyfish Pictures, the remote visual effects and animation studio behind the latest “Star Wars” movies and shows, as well as season four of “Stranger Things.”

The company found its visual effects workloads rapidly growing due to COVID shutting down most live-action shoots. Jellyfish leveraged Hammerspace and a partnership with Microsoft Azure to orchestrate its content to its globally distributed workforce. The company deployed on-prem Anvil metadata servers with data services nodes containing high-speed NVMe SSDs connected to Hammerspace instances in multiple Azure regions. “The Anvils work by replicating metadata between points in a bi-directional replication configuration. All sites are active, with all artists able to perform high-performance read/write on the same shared dataset,” Hammerspace explained in a case study.

Jeff Bezos’s rocket manufacturing company, Blue Origin, is also a Hammerspace customer. The company uses Hammerspace across all its on-prem and cloud storage infrastructure, spanning five different facilities around the country plus its cloud presence. Blue Origin uses it for data sharing and work across industrial design, manufacturing, live test feedback, and even marketing, all within a single file system.

Hammerspace CEO and co-founder David Flynn.

It seems investors are also taking notice. Hammerspace’s latest $56.7M institutional investment round was led by Prosperity7 Ventures, the venture arm of Saudi Aramco. On the update call, Flynn was joined by Jonathan Tower, managing director of Prosperity7.

Tower shared that current economic volatility can be viewed as a recalibrating of VC investment: “We really had a very strong period of investment, lots of new funds were created, lots of new emerging managers came out of that,” he said. “And I think we are going through a cycle where there’s a windowing of that now, where a lot of capital was deployed in many, many companies, one might make the argument that perhaps too much capital was deployed, and perhaps too many new companies were created.”

He went on to say that this economic environment is encouraging investors to focus on truly innovative products whose reach and scale have a worldwide impact.

“From our perspective, it’s a great time to be investing. It’s a great time to be building great companies that are solving big, global problems,” Tower said.

Tower says the shift to a remote and dispersed workforce brought on by COVID has created a critical need for seamless data access unrestricted by data silos or latency issues.

“We’re not going to go back to the world of pre-2019 in that sense. And so, these systems need to be built for the future of how data is going to be accessed and required, as opposed to the concept of the cloud of 20 years ago,” Tower said, later adding: “At the end of the day, nobody was going after the problem like Hammerspace was going after the problem. The solution that Hammerspace is going after and the way they’re building this is very differentiated.”

This article first appeared on sister site Datanami.


OpenAI, Microsoft, Google, Anthropic Launch Frontier Model Forum to Promote Safe AI

Artificial intelligence and modern computer technology image concept.
Image: putilov_denis/Adobe Stock

OpenAI, Google, Microsoft and Anthropic have announced the formation of the Frontier Model Forum. With this initiative, the group aims to promote the development of safe and responsible artificial intelligence models by identifying best practices and broadly sharing information in areas such as cybersecurity.

Jump to:

  • What is the Frontier Model Forum’s goal?
  • What are the Frontier Model Forum’s main objectives?
  • What are the criteria for membership in the Frontier Model Forum?
  • Cooperation and criticism of AI practices and regulation
  • Other AI safety initiatives

What is the Frontier Model Forum’s goal?

The goal of the Frontier Model Forum is to have member companies contribute technical and operational advice to develop a public library of solutions to support industry best practices and standards. The impetus for the forum was the need to establish “appropriate guardrails … to mitigate risk” as the use of AI increases, the member companies said in a statement.

Additionally, the forum says it will “establish trusted, secure mechanisms for sharing information among companies, governments, and relevant stakeholders regarding AI safety and risks.” The forum will follow best practices in responsible disclosure in areas such as cybersecurity.


What are the Frontier Model Forum’s main objectives?

The forum has crafted four core objectives:

1. Advancing AI safety research to promote responsible development of frontier models, minimize risks and enable independent, standardized evaluations of capabilities and safety.

2. Identifying best practices for the responsible development and deployment of frontier models, helping the public understand the nature, capabilities, limitations and impact of the technology.

3. Collaborating with policymakers, academics, civil society and companies to share knowledge about trust and safety risks.

4. Supporting efforts to develop applications that can help meet society’s greatest challenges, such as climate change mitigation and adaptation, early cancer detection and prevention, and combating cyberthreats.


What are the criteria for membership in the Frontier Model Forum?

To become a member of the forum, organizations must meet a set of criteria:

  • They develop and deploy predefined frontier models.
  • They demonstrate a strong commitment to frontier model safety.
  • They demonstrate a willingness to advance the forum’s work by supporting and participating in initiatives.

The founding members noted in statements in the announcement that AI has the power to change society, so it behooves them to ensure it does so responsibly through oversight and governance.

“It is vital that AI companies — especially those working on the most powerful models — align on common ground and advance thoughtful and adaptable safety practices to ensure powerful AI tools have the broadest benefit possible,” said Anna Makanju, vice president of global affairs at OpenAI. Advancing AI safety is “urgent work,” she said, and the forum is “well-positioned” to take quick actions.

“Companies creating AI technology have a responsibility to ensure that it is safe, secure and remains under human control,” said Brad Smith, vice chair and president of Microsoft. “This initiative is a vital step to bring the tech sector together in advancing AI responsibly and tackling the challenges so that it benefits all of humanity.”


Frontier Model Forum’s advisory board

An advisory board will be set up to oversee strategies and priorities, with members coming from diverse backgrounds. The founding companies will also establish a charter, governance and funding with a working group and executive board to spearhead these efforts.

The board will collaborate with “civil society and governments” on the design of the forum and discuss ways of working together.

Cooperation and criticism of AI practices and regulation

The Frontier Model Forum announcement comes less than a week after OpenAI, Google, Microsoft, Anthropic, Meta, Amazon and Inflection agreed to the White House’s list of eight AI safety assurances. These recent actions are especially interesting in light of recent measures taken by some of these companies regarding AI practices and regulations.

For instance, in June, Time magazine reported that OpenAI lobbied the E.U. to water down AI regulation. Further, the formation of the forum comes months after Microsoft laid off its ethics and society team as part of a larger round of layoffs, calling into question its commitment to responsible AI practices.

“The elimination of the team raises concerns about whether Microsoft is committed to integrating its AI principles with product design as the organization looks to scale these AI tools and make them available to its customers across its suite of products and services,” wrote Rich Hein in a March 2023 CMSWire article.

Other AI safety initiatives

This is not the only initiative geared toward promoting the development of responsible and safe AI models. In June, PepsiCo announced it would begin collaborating with the Stanford Institute for Human-Centered Artificial Intelligence to “ensure that AI is implemented responsibly and positively impacts the individual user as well as the broader community.”

The MIT Schwarzman College of Computing has established the AI Policy Forum, which is a global effort to formulate “concrete guidance for governments and companies to address the emerging challenges” of AI such as privacy, fairness, bias, transparency and accountability.

Carnegie Mellon University’s Safe AI Lab was formed to “develop reliable, explainable, verifiable, and good-for-all artificial intelligent learning methods for consequential applications.”


AMD Will Build its Largest Ever R&D Centre in Bengaluru

US-based chipmaker Advanced Micro Devices (AMD) recently revealed that it will build its largest design centre in Bengaluru, Karnataka.

The centre is expected to come up by the end of this year, and could potentially employ nearly 3000 engineers in the next five years.

“It will certainly play an important role in building a world class semiconductor design and innovation ecosystem,” Rajeev Chandrasekhar, Minister of State for Electronics and IT, Skill Development and Entrepreneurship, said.

“It will also provide tremendous opportunities for our large pool of highly skilled semiconductor engineers and researchers and will catalyse PM Narendra Modi’s vision of India becoming a global talent hub,” he added.

The new centre, which covers an area of 500,000 square feet, will increase AMD’s office footprint in India to 10 locations.

Further, AMD has also announced investments worth USD 400 million in the country over the next five years.

Currently, the chipmaker employs more than 6,500 people in the country. “Our India teams will continue to play a pivotal role in delivering the high-performance and adaptive solutions that support AMD customers worldwide,” Mark Papermaster, Chief Technology Officer at AMD, said.

Last month, AMD announced a new AI chip, the MI300X, which aims to tap into the AI accelerator market that NVIDIA currently dominates. The chip is purpose-built for AI tasks and comes with up to 192GB of memory, well suited to large models such as LLMs.


The post AMD Will Build its Largest Ever R&D Centre in Bengaluru appeared first on Analytics India Magazine.

Oracle MySQL’s Unyielding Success

Oracle, a major player in the cloud and relational database management system domain, is betting on products that simplify data querying and make it both efficient and cost-effective for enterprises.

Oracle recently announced general availability of MySQL HeatWave Lakehouse which enables customers to query data in object storage as fast as querying data inside the database.

In an exclusive interview with AIM, Oracle India’s technology head Saravanan P said that the most common problem enterprise customers face is creating consolidated reports when there are multiple copies of databases. “We brought the MySQL HeatWave to address these multiple data source challenges, specifically, here is a database which can run both OLTP and OLAP on a single database.”

Saravanan further added that HeatWave Lakehouse enables customers to process and query up to hundreds of terabytes of data stored in object storage in formats such as CSV and Parquet, as well as Amazon Aurora and Amazon Redshift backups. “Majority of customers of MySQL are digital native companies which heavily rely on clouds to manage their data,” he added.

The aim of MySQL HeatWave Lakehouse is to enable customers to easily obtain valuable real-time insights by combining data from object storage with database data. MySQL HeatWave Lakehouse delivers significantly improved query performance and faster data loading, all at a lower cost.

Currently, MySQL HeatWave Lakehouse can load up to 400 terabytes of data from object storage, and it does this 8 times faster than Redshift and 2.7 times faster than Snowflake.

As demonstrated by a 500 TB TPC-H* benchmark, its query performance is 9 times faster than Amazon Redshift, 17 times faster than Databricks, 17 times faster than Snowflake, and 36 times faster than Google BigQuery.

One reason: MySQL HeatWave uses MySQL Autopilot, a feature that applies machine learning to automate and improve query execution. It learns from past queries and optimises the execution plan for future ones. This capability is exclusive to MySQL HeatWave.

In addition, files in the object store are queried directly by HeatWave without copying the data into the MySQL database. As a result, MySQL HeatWave Lakehouse sets new standards for scalability and performance of query processing, speed of loading data, cluster provisioning time, and automation to query data in object storage.

MySQL HeatWave Lakehouse provides customers the ability to query data in the object store in a variety of file formats and optionally combine it with data in the MySQL database.

This explains why Oracle MySQL is often the first name that comes to mind. Currently, MySQL holds a relative market share of 44.04% in the realm of database management tools. Within this market, it claims 31.39% in the U.S., 8.19% in India, and 6.75% in the U.K. As open-source software, MySQL leads among relational databases in the market.

MySQL vs the World

Today, MySQL is not alone. Multiple players, including SingleStoreDB (SQL) and database companies such as Redis (real-time), MongoDB (NoSQL), and Neo4j (graph), offer more or less similar services to customers, and the competition is only getting bigger.

“It’s like horses for courses,” said Saravanan, who believes that each kind of database has its own use case. He said that Oracle’s moat is managing multiple data sources: customers should not have to keep data on one cloud and in one location, but should have the option to store data across different sources. “If you asked me today, customers are more about user databases distributed,” he added, saying that MySQL fits perfectly into this equation.

Multi-Cloud Approach

At the same time, MySQL faces heat from hyperscaler offerings such as Google Cloud SQL, AWS SQL Server databases, and Microsoft Azure SQL Database, which provide similar platforms and services for their own cloud customers.

But Oracle seems to be playing the long game. Saravanan said that Oracle believes in a multi-cloud approach. “We don’t want to be homogeneous, we wanted customers to use this feature functionality, even if they’re on Azure or AWS,” he added.

In other words, even if you are running your workload on AWS or Azure, you can take advantage of Oracle MySQL to query your data. “They get similar performance by using the MySQL HeatWave so that they can make use of it without migrating the data out of the cloud, and changing the application code,” he added.

Enterprise Data Security

Speaking on data security for enterprises, Saravanan told AIM that security is in Oracle’s DNA and takes center stage at the company, and that OCI (Oracle Cloud Infrastructure) is a highly secure cloud. “Any data that moves within OCI is encrypted,” explained Saravanan.

Further, he said Oracle brings the core security of OCI to MySQL as well. He emphasised that MySQL provides role-based data access, which means that even an admin cannot access the data. Citing RBI and MeitY, he said OCI complies with regulatory requirements in the country and strictly follows the law of the land across geographies.

En route: Generative AI

Last month, Oracle partnered with Cohere to develop powerful generative AI services for organizations globally. The duo plans to automate end-to-end business processes, improve decision-making, and enhance customer experiences.

What’s next? In the coming months, Oracle plans to announce more products and tools to advance its generative AI strategy for the enterprise, alongside a few announcements around its expansion into vector databases and more.

The post Oracle MySQL’s Unyielding Success appeared first on Analytics India Magazine.

Introduction to Data Science: A Beginner’s Guide


You haven’t been living under a rock for the last two decades, so you may think you know, more or less, what data science is. You’re probably hoping to get a brief overview of what it entails, to learn what you need to start learning data science and get a job.

Here are the highlights of what this article will give you:

  • The main point of data science: data comes in, and insights come out. The job of a data scientist is to manage that data-to-insights pipeline at every stage.
  • What tools, technologies, and skills you’ll need to get a job in data science.
  • The general landscape of data science as a career.

If that sounds like what you’re looking for, let’s dive in.

What is Data Science?

As I said earlier, data science is best summarized as a data-to-insights pipeline. As a data scientist, no matter what company you’re in, you’ll be doing tasks like:

  • Extracting data
  • Cleaning or massaging it
  • Analyzing the data
  • Identifying patterns or trends
  • Building prediction and statistical models on top of the data
  • Visualizing and communicating the data

In short, you’re solving problems, making predictions, optimizing processes, and guiding strategic decision-making.

Because very few companies have a firm grasp on exactly what a data scientist does, you’ll likely have other responsibilities too. Some employers expect data scientists to add infosec or cybersecurity responsibilities to their role. Others may expect data scientists to have expertise in cloud computing, database management, data engineering, or software development. Be ready to wear many hats.

This job is important not because Harvard Business Review called it the sexiest job of the 21st century, but because data is increasing in volume and very few people know how to turn data into insights. As a data scientist, you see the forest for the trees.

Volume of data/information created, captured, copied, and consumed worldwide from 2010 to 2020, with forecasts from 2021 to 2025

Source: https://www.statista.com/statistics/871513/worldwide-data-created/

Key Concepts in Data Science

Now you’ve got the big picture. Let’s take a look at some of the key concepts in data science. If you can envision that data-to-insights pipeline, I’ll identify where each key concept comes into play.

Data manipulation

At the very start of that pipeline, you’ve got a slurry of data, of mixed quality. There’s a famous (and incorrect) statistic that data scientists spend 80% of their time cleaning data. While it’s probably not as high as that, building funnels and massaging data is a big part of the job.

Imagine you’re a data scientist for an e-commerce company. There, data manipulation might involve cleaning and transforming customer transaction data, merging and reconciling data from different sources such as website analytics and customer relationship management (CRM) systems, and handling missing or inconsistent data.

You might need to standardize formats, remove duplicates or NaNs, and deal with outliers or erroneous entries. This process ensures that the data is accurate, consistent, and ready for analysis.
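To make that concrete, here is a minimal pandas sketch of the kind of cleanup described above. The file name and columns (order_id, order_date, amount) are hypothetical stand-ins for whatever the company's transaction export actually contains.

```python
import pandas as pd

# Hypothetical transaction export; file and column names are illustrative only.
orders = pd.read_csv("transactions.csv")

# Standardize formats: parse dates and turn currency strings into floats.
orders["order_date"] = pd.to_datetime(orders["order_date"], errors="coerce")
orders["amount"] = (
    orders["amount"].astype(str).str.replace(r"[$,]", "", regex=True).astype(float)
)

# Remove duplicates and rows missing key fields (NaNs).
orders = orders.drop_duplicates(subset="order_id")
orders = orders.dropna(subset=["order_id", "order_date", "amount"])

# Drop obviously erroneous entries, e.g. negative or absurdly large amounts.
orders = orders[(orders["amount"] > 0) & (orders["amount"] < 100_000)]
```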

Data exploration and visualization

Once the data has been wrangled into submission, you can start looking at it. You might think that data scientists start throwing statistical models at the data immediately, but the truth is there are too many models to choose from blindly. First, you need to get to grips with the kind of data you’ve got. Then you can look for significant insights and predictions.

For example, if you’re a data scientist at GitHub, data exploration would involve analyzing user activity and engagement on the platform. You could look at metrics like the number of commits, pull requests, and issues, as well as user interactions and collaborations. By exploring this data, you gain an understanding of how users engage with the platform, identify popular repositories, and uncover trends in software development practices.

And because most humans parse the significance of pictures better than that of tables, data visualization is also included in data exploration. For example, as a GitHub data scientist, you might use line charts to show the number of commits over time. Bar charts could be used to compare the popularity of different programming languages used on the platform. Network graphs could illustrate collaborations between users or repositories.
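As a rough sketch of what that exploration could look like in code (the commits.csv file and its date and language columns are invented for illustration):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical activity export: one row per commit, with a timestamp and language tag.
commits = pd.read_csv("commits.csv", parse_dates=["date"])

# Line chart: number of commits per week over time.
commits.set_index("date").resample("W").size().plot(title="Commits per week")
plt.show()

# Bar chart: relative popularity of programming languages.
commits["language"].value_counts().head(10).plot(kind="bar", title="Top languages")
plt.show()
```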


Statistical analysis

At this point in the data-to-insights pipeline of data science, you’ve got the first two-thirds covered. The data is in, you’re poking and prodding at it. Now it’s time to pull out insights. Finally, you’re ready to apply some statistical analyses to your numbers.

Pretend you’re a data scientist at a company like Hello Fresh. You might run statistical analyses like linear regression to understand the factors that influence customer churn, clustering algorithms to segment customers based on their preferences or behavior, or hypothesis testing to determine the effectiveness of marketing campaigns. These statistical analyses help uncover relationships, patterns, and significant findings within the data.
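For instance, a basic hypothesis test of a campaign's effect might look like the sketch below; the conversion arrays are made up here, and in practice they would come from your experiment data.

```python
import numpy as np
from scipy import stats

# Hypothetical A/B test outcomes: 1 = customer converted, 0 = did not.
control = np.array([0, 1, 0, 0, 1, 0, 1, 0, 0, 0])
campaign = np.array([1, 1, 0, 1, 1, 0, 1, 1, 0, 1])

# Two-sample t-test: is the campaign group's conversion rate significantly higher?
t_stat, p_value = stats.ttest_ind(campaign, control)
print(f"uplift: {campaign.mean() - control.mean():.2f}, p-value: {p_value:.3f}")
```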

Machine learning

The cool thing about data scientists is that they predict the future. Visualize the data-to-insights pipeline. You’ve got insights into how things are in the past and now. But your boss might want to ask: well, what happens if we add a new product to our offering? What if we close on Mondays? What if we convert half our fleet to electric vehicles?

As a data scientist, you look into your crystal ball and create intelligent predictions using machine learning. For example, say you’re a data scientist at a logistics company like FedEx. You could use historical shipping data, weather data, and other relevant variables to develop predictive models. These models can forecast shipping volumes, estimate delivery times, optimize route planning, or predict potential delays.

Using machine learning algorithms such as regression, time series analysis, or neural networks, you could predict the impact of adding a new distribution center on delivery times, simulate the effects of different operational changes on shipping costs, or forecast customer demand for specific shipping services.
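Here is a minimal sketch of that kind of predictive model, assuming a hypothetical table of past shipments with distance, weight, and weather features (none of this reflects FedEx's actual data):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hypothetical historical shipping data; feature names are illustrative.
shipments = pd.read_csv("shipments.csv")
features = shipments[["distance_km", "package_weight_kg", "rain_mm", "is_holiday"]]
target = shipments["delivery_hours"]

X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2)

# Train a regression model to forecast delivery times for new shipments.
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print("R^2 on held-out shipments:", model.score(X_test, y_test))
```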

Communication and business intelligence

The most important concept in data science isn’t machine learning or data cleaning. It’s communication. You present those insights to decision-makers at your company who don’t know a neural network from a gradient-boosting algorithm. That’s why communication and business acumen are both key concepts in data science.

Imagine you’re a data scientist at a company like Meta. You’ve just discovered a significant correlation between user engagement metrics and customer retention rates, but you need to share it with a VP of marketing who isn’t familiar with the concept of “statistical significance.” You also need to be familiar with customer lifetime value (CLV) to be able to explain the relevance and importance of your finding.

Essential Skills for Data Scientists

We’ve covered the key concepts in data science. Now let’s take a look at the essential skills you’ll be expected to have as a data scientist. I’ve covered some more granular skills to be a data scientist here if you’re interested in learning more.

Programming languages, data querying, and data viz

It’s hard to rank skills on their importance – data scientists need a mix of skills, all as important as each other. That being said, if there’s one skill you absolutely cannot do without, it’s gotta be coding.

Coding breaks down into a few facets – you need programming languages, typically R or Python (or both). You also need query languages for data retrieval and manipulation, such as SQL (Structured Query Language) for relational databases. Finally, you will probably need to know other languages or programs like Tableau for data visualization, though it’s worth mentioning that a lot of data viz is done with Python or R nowadays.
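As a small illustration of how those pieces fit together, the sketch below runs a SQL query against a (hypothetical) SQLite database and hands the result to pandas; the table and column names are invented.

```python
import sqlite3
import pandas as pd

# Query a hypothetical relational database with SQL...
conn = sqlite3.connect("shop.db")
query = """
    SELECT customer_id, COUNT(*) AS n_orders, SUM(amount) AS total_spent
    FROM orders
    GROUP BY customer_id
"""

# ...and pull the result straight into a DataFrame for analysis or plotting.
customers = pd.read_sql_query(query, conn)
conn.close()
print(customers.head())
```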

Math

Remember the statistics I mentioned earlier? As a data scientist, you need to know how to do math. Data viz only goes so far before you need some actual statistical significance. Critical math skills include:

  • Probability and Statistics: Probability distributions, hypothesis testing, statistical inference, regression analysis, and analysis of variance (ANOVA). These skills let you make sound statistical judgments and draw meaningful conclusions from data.
  • Linear Algebra: Operations on vectors and matrices, solving systems of linear equations, matrix factorization, eigenvalues and eigenvectors, and matrix transformations (see the short NumPy sketch after this list).
  • Calculus: You’ll need to be familiar with concepts like derivatives, gradients, and optimization to train models, optimize, and fine-tune models.
  • Discrete Mathematics: Topics like combinatorics, graph theory, and algorithms. You’ll use these to do network analysis, recommendation systems, and algorithm design. It’s most important for developing algorithms that handle large-scale data.
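Here is a quick NumPy illustration of why that math shows up in practice: an ordinary least-squares fit is just linear algebra, and the eigenvalues of a covariance matrix sit at the heart of PCA. The data is randomly generated purely for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake data: y depends linearly on two features, plus a little noise.
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=100)

# Least-squares regression via linear algebra (what model fitting does under the hood).
design = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
print("intercept and coefficients:", coef)

# Eigenvalues of the feature covariance matrix, the core computation behind PCA.
print("covariance eigenvalues:", np.linalg.eigvalsh(np.cov(X, rowvar=False)))
```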

Model management

Let’s talk about models. As a data scientist, you need to know how to build, deploy, and maintain models. This includes ensuring the models integrate seamlessly with the existing infrastructure, addressing scalability and efficiency concerns, and continuously evaluating their performance in real-world scenarios.

In terms of technology, that means you’ll need to be familiar with:

  • Machine Learning Libraries: These include scikit-learn in Python, TensorFlow, PyTorch, or Keras for deep learning, and XGBoost or LightGBM for gradient boosting.
  • Model Development Frameworks: Frameworks like Jupyter Notebook or JupyterLab for interactive and collaborative model development.
  • Cloud Platforms: Think Amazon Web Services (AWS), Microsoft Azure, or Google Cloud Platform (GCP) to deploy and scale machine learning models.
  • Automated Machine Learning (AutoML): Google AutoML, H2O.ai, or DataRobot automate the process of building machine learning models without extensive manual coding.
  • Model Deployment and Serving: Docker and Kubernetes are commonly used for packaging and deploying models as containers. These let models be deployed and scaled across different environments. Additionally, tools like Flask or Django in Python let you create web APIs to serve models and integrate them into production systems (a minimal serving sketch follows this list).
  • Model Monitoring and Evaluation: Prometheus, Grafana, or ELK (Elasticsearch, Logstash, Kibana) stack for log aggregation and analysis. These tools help track model metrics, detect anomalies, and ensure that models continue to perform well over time.
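To give a feel for the serving side, here is a minimal, illustrative Flask sketch that loads a previously saved scikit-learn model and exposes it as a web API. The model file name and the JSON payload shape are assumptions made for the example, not a production recipe.

```python
# serve_model.py -- minimal model-serving sketch (illustrative, not production-ready).
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical model saved earlier with joblib.dump(model, "churn_model.joblib").
model = joblib.load("churn_model.joblib")

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body like {"features": [[12, 3.5, 0]]}; the shape is an assumption.
    features = request.get_json()["features"]
    return jsonify({"prediction": model.predict(features).tolist()})

if __name__ == "__main__":
    app.run(port=5000)
```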

Communication

So far we’ve covered the “hard” skills. Now let’s think about what soft skills you’ll need. As I mentioned in the “concepts” portion, a big skill you need is communication. Here are a few examples of the kind of communication you’ll need to do as a data scientist:

  • Data Storytelling: You need to turn complex technical concepts into clear, concise, and compelling narratives that resonate with your audience, including the significance of your analysis and its implications for decision-making.
  • Visualization: Yes, data viz gets its subsection in the communication skill. Alongside the technical chops to create a chart, you should also know when, what kind, and how to talk about your data visualizations.
  • Collaboration and Teamwork: No data scientist works in a vacuum. You’ll collaborate with data engineers, business analysts, and domain experts. Practice your active listening and constructive feedback skills.
  • Client Management: This isn’t true for all data scientists, but sometimes you’ll work directly with clients or external stakeholders. You need to develop strong client management skills, including understanding their requirements, managing expectations, and providing regular updates on project progress.
  • Continuous Learning and Adaptability: Last but not least, you need to be ready to learn new things on the reg. Stay up to date with the latest advancements in the field and be open to acquiring new skills and knowledge as needed.

Business acumen

This boils down to knowing why a number matters in the context of your business. For example, you might find that there’s a highly significant relationship between people buying eggs on Sundays and the weather. But why does it matter to your business?

In this case, you might analyze further and discover that the increased egg purchases on Sundays are correlated with sunny weather, indicating that customers are more likely to engage in outdoor activities or host brunches during favorable weather conditions. This insight could be utilized by a grocery store or a restaurant to plan their inventory and promotional activities accordingly.

By connecting the dots between data patterns and business outcomes, you can provide strategic guidance and actionable recommendations. In the example, this could involve optimizing marketing campaigns for egg-related products during sunny weekends or exploring partnerships with local brunch spots.

Data Science Workflow

What does a data scientist do? To get an idea, let’s take a look at the typical steps involved in a data science project: problem formulation, data collection, data cleaning, exploratory data analysis, model building, evaluation, and communication.

I’ll illustrate each step with an example: for the rest of this section, pretend you work as a data scientist for an e-commerce company, and the company's marketing team wants to improve customer retention.

1. Problem Formulation

This means you get to grips with the business objective, clarify the problem statement, and define the key metrics for measuring customer retention.

You’ll aim to identify factors that contribute to customer churn and develop strategies to reduce churn rates.

To measure customer retention, you define key metrics including customer churn rate, customer lifetime value (CLV), repeat purchase rate, or customer satisfaction scores. By defining these metrics, you establish a quantifiable way to track and evaluate the effectiveness of your strategies in improving customer retention.
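For example, a churn rate and repeat purchase rate can be computed directly from transaction data. A rough sketch follows; the 90-day inactivity definition of churn and the column names are assumptions you would replace with the business's own definitions.

```python
import pandas as pd

orders = pd.read_csv("transactions.csv", parse_dates=["order_date"])

# Define churn as no purchase in the last 90 days (a business-specific assumption).
cutoff = orders["order_date"].max() - pd.Timedelta(days=90)
last_purchase = orders.groupby("customer_id")["order_date"].max()
churn_rate = (last_purchase < cutoff).mean()

# Repeat purchase rate: share of customers with more than one order.
repeat_rate = (orders.groupby("customer_id").size() > 1).mean()

print(f"churn rate: {churn_rate:.1%}, repeat purchase rate: {repeat_rate:.1%}")
```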

2. Data Collection

Gather relevant data sources, such as customer purchase history, demographic information, website interactions, and customer feedback. This data could be obtained from databases, APIs, or third-party sources.

3. Data Cleaning

The collected data will almost certainly contain missing values, outliers, or inconsistencies. In the data cleaning stage, you preprocess and clean the data by handling missing values, removing duplicates, addressing outliers, and ensuring data integrity.

4. Exploratory Data Analysis (EDA)

Next, gain insights into the data and understand its characteristics by visualizing the data, examining statistical summaries, identifying correlations, and uncovering patterns or anomalies. For example, you may discover that customers who make frequent purchases tend to have higher retention rates.

5. Model Building

Develop predictive models to analyze the relationship between different variables and customer retention. For instance, you might build a machine learning model like logistic regression or random forest, to predict the likelihood of customer churn based on various factors like purchase frequency, customer demographics, or website engagement metrics.

6. Evaluation

Evaluate your model’s performance using metrics like accuracy, precision, recall, or area under the ROC curve. You validate the models using techniques like cross-validation or train-test splits to ensure their reliability.
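Steps 5 and 6 together might look like the following sketch, which trains a logistic regression churn model and checks it with a hold-out split and cross-validation. The customer_features.csv file and its columns are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

# Hypothetical customer-level table assembled in the earlier steps.
customers = pd.read_csv("customer_features.csv")
X = customers[["purchase_frequency", "days_since_last_visit", "avg_session_minutes"]]
y = customers["churned"]  # 1 if the customer churned, 0 otherwise

model = LogisticRegression(max_iter=1000)

# Hold-out evaluation: train on one split, score on unseen customers.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model.fit(X_train, y_train)
print("hold-out ROC AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

# Cross-validation gives a more stable estimate of how well the model generalizes.
print("5-fold ROC AUC:", cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean())
```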

7. Communication

You’ve got some findings – now share them with the class. In keeping with our example, you’ll need to be able to intelligently talk about your customer churn results in the context of both the business you work for and the wider business landscape. Make people care, and explain why this particular finding matters, and what they should do about it.

For example, after analyzing customer churn, you might find a significant correlation between customer satisfaction scores and churn rates.

When you share this with the marketing team or senior executives, you’ll need to effectively communicate the implications and actionable insights. You would explain that by focusing on enhancing customer satisfaction through improved customer support, personalized experiences, or targeted promotions, the company can mitigate churn, retain more customers, and ultimately drive higher revenue.

Moreover, you would contextualize this finding within the wider business landscape. Compare the churn rates of your company with competitors.

So that’s how you go from data lakes to real business input. Ultimately, remember that data science is iterative and cyclical. You’ll repeat individual steps of this process as well as the entire process as you strive to find interesting insights, answer business questions, and solve problems for your employer.

Data Science Applications

Data science is a vast field. You can find data scientists working in almost every vertical, at any size company. It’s a critical role.

Here are a few real-world examples to showcase the impact of data science in solving complex problems:

  • Healthcare: Data scientists analyze large volumes of medical data to improve patient outcomes and healthcare delivery. They develop predictive models to identify high-risk patients, optimize treatment plans, and detect patterns in disease outbreaks.
  • Finance: Think risk assessment, fraud detection, algorithmic trading, and portfolio management. Data scientists develop models that help make informed investment decisions and manage financial risks.
  • Transportation and Logistics: Data scientists optimize route planning, reduce fuel consumption, improve supply chain efficiency, and predict maintenance needs.
  • Retail and E-commerce: Data scientists analyze customer data, purchase history, browsing patterns, and demographic information to develop models that drive customer engagement, increase sales, and improve customer satisfaction.

Getting Started in Data Science

Ok, that’s a lot of information. By now you should have a clear grasp of what data science is, how it all works, what tools and technologies you should be familiar with, and what a data scientist does.

Let’s now look at where to study and practice data science. This could be a separate article, so I’ll link to lists of resources where you can get started.

  1. The best free data science courses
  2. The best learning resources for data science (books, courses, and tutorials)
  3. The best Python data science projects for beginners
  4. The best computer science books
  5. Data science visualization best practices
  6. Where to get data to do your data science projects
  7. Best platforms to practice key data science skills
  8. Best data science communities to join

Overall, I recommend you do this:

  1. Make a checklist of skills you need, using this blog post and data scientist job descriptions.
  2. Start free to get the basics, then look for good, paid platforms to learn more.
  3. Build a portfolio of projects and libraries.
  4. Practice on platforms like Kaggle and StrataScratch.
  5. Get certified – some platforms like LinkedIn offer certifications to prove you’ve got the skills.
  6. Start applying.
  7. Network – join communities, Slack groups, and LinkedIn groups, and attend events.

Ultimately, you can expect the process to take some time. But it will be worth it in the end.

Job Opportunities and Career Path

Despite the FAANG layoffs, according to US News and World Report in 2022, information security analysts, software developers, data scientists, and statisticians ranked among the top 10 jobs.

(Image source: https://bootcamp.cvn.columbia.edu/blog/data-scientist-career-path/)

The job market is still hot. Companies still want and need data scientists. Now, if you’re having a hard time getting a job as a data scientist, remember you don’t have to start from scratch. I recommend you start more junior and angle into the role over time. You could always start as a data analyst, data engineer, or machine learning engineer.

Conclusion

It’s hard to write an intro to data science for the simple fact that it’s a huge field, it’s growing, and more technologies and tools get added every day. If you take away just a few things from this post, it’s this:

  • Data science takes a multidisciplinary approach. You’ll need skills from across multiple fields of knowledge including statistics, machine learning, programming, and domain expertise. And the learning never stops.
  • Data science is iterative. It’s very process based, but you can expect to repeat, optimize, and update your processes as you continue. The successful and happy data scientist embraces experimentation.
  • Soft skills are where it’s at. You can’t just be a Python whiz; you need to convey findings and insights to non-technical stakeholders with stories, numbers, and pictures.

Hopefully, this has given you a place to start. Data science is a rewarding and challenging career path. If you learn the skills and apply yourself, you’ll be able to join this field in no time.
Nate Rosidi is a data scientist and works in product strategy. He's also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Connect with him on Twitter: StrataScratch or LinkedIn.


Capgemini Launches GenAI Portfolio, Announces Further INR 18,000 Crore AI Investment

French IT consulting group Capgemini is today launching a generative AI portfolio of services, spanning from strategy definition through to the practical development and implementation of generative AI at scale. The Group said it would invest 2 billion euros (about INR 18,000 crore) in AI over three years.

“Generative AI is already becoming a key pillar of digital transformation for businesses, and we see a breadth of opportunities to unlock substantial business value for our clients, which go way beyond important productivity gains,” commented Franck Greverie, Chief Portfolio Officer, Global Business Lines leader and Group Executive Board Member at Capgemini.

In the release blog, the Paris-based company stated it has delivered many generative AI projects over the last few years, particularly in Life Sciences, Consumer Products & Retail, and Financial Services. The latest portfolio includes:

  1. Generative AI strategy for CXOs to prioritize relevant use cases for their business and lay the right foundations in terms of people, process and technology.
  2. Generative AI for Customer Experience with four generative AI assistants for hyper-personalized customer experience.
  3. Generative AI for Software Engineering to improve the whole software life cycle for increased security.
  4. Custom generative AI for Enterprise, a platform that fine-tunes pre-trained open large foundation models (LFMs) with enterprise proprietary data to the needs of each customer.

The Capgemini Group has already announced genAI partnerships with Google Cloud and Microsoft. The company will also train a large part of its workforce on generative AI by integrating AI training into all of its training curricula. The latest announcements came the same day it posted higher half-year revenue, driven by its cloud, data and AI activities. Furthermore, the Group has announced an investment of 2 billion euros (about INR 18,000 crore) in AI over the next three years. It is also working with Heathrow Airport to improve customer experience through generative AI solutions.


The post Capgemini Launches GenAI Portfolio, Announces Further INR 18,000 Crore AI Investment appeared first on Analytics India Magazine.

Did We Really Get a Room-Temperature Superconductor?

This week has been rife with news on room-temperature superconductors. A group of scientists from South Korea’s Quantum Energy Research Centre, working with a collaborator at the College of William & Mary in Virginia, has created a new material that demonstrates some superconductive properties. While the researchers’ paper is currently in pre-print, concerns have already arisen over the nature of their findings. Are we finally in the era of room-temperature superconductors, or is this another scientific non-discovery?

Superconductors explained

Materials with superconducting properties have been one of the holy grails of scientific research over the past century. First discovered by Dutch physicist Heike Kamerlingh Onnes in 1911, superconductivity is a phenomenon where the electrical resistance of a material vanishes. The material also expels magnetic flux, a behavior termed the Meissner effect. Superconductivity cannot be explained by classical physics, with scientists instead looking to quantum mechanics to delve into this phenomenon.

While scientists have been able to reproduce these properties, they usually require extreme physical conditions, such as extremely low temperatures or very high pressures. The research community did succeed in creating so-called ‘high-temperature’ superconductors, which exhibit superconductivity at temperatures above roughly 30 Kelvin. The recent claim, however, has made waves because the material is said to superconduct at ambient temperature.

Termed LK-99, the new material emerged from a team comprising Sukbae Lee, Ji-Hoon Kim, and Young-Wan Kwon of the Quantum Energy Research Centre in Seoul, along with Hyun-Tak Kim of the College of William & Mary, who claimed to have created a superconductor that works at temperatures up to 127°C. This was done by creating a modified lead-apatite structure that, the researchers say, develops an internal strain, resulting in a smooth channel where electrons can flow freely.

According to the paper, the material exhibits some of the common characteristics of superconductors, such as zero resistivity, a critical current, and a critical magnetic field. Even as questions have been raised over the veracity of the data presented in the paper, there is also a deeper issue at hand. Have the researchers found a superconductor, or have they created a diamagnet with superconducting-like properties?

The reality behind the miracle

One of the paper's authors stated that it was published without his permission and while it still had "many defects". Scientists have also raised concerns over the absence of heat-capacity data in the paper. Susannah Speller, a professor at the University of Oxford, stated, "So it is too early to say that we have been presented with compelling evidence for superconductivity in these samples."

In the video provided by the researchers, the pellet does not levitate fully, suggesting an incomplete Meissner effect. This has led many to argue that the sample is nothing but a diamagnet, a material that is weakly repelled by a magnetic field. Even OpenAI CEO Sam Altman chimed in, stating, "I desperately want to believe but I think we are getting overexcited about a diamagnet."

Experts in the field have also weighed in, expressing scepticism over the findings. Andrew Cote, an engineer who works on stellarators, scrutinised some of the data in the paper, especially the way the graphs were plotted, and drew attention to the fact that a full set of measurements has not been released.

On the other hand, because the synthesis appears relatively easy to attempt, enthusiasts are already trying to replicate it themselves. Andrew McCalip, who works at Varda Space, has begun an effort to reproduce the results, documenting the process step by step on Twitter. While the situation is still developing, the general sentiment remains positive.

A room-temperature superconductor could open the door to a new era of scientific innovation. Not only could it cut electricity transmission losses, it could also enable advances in quantum and classical computing, nuclear fusion, and energy storage. As the experiments are replicated across the academic community, the truth about LK-99 will be revealed.

The post Did We Really Get a Room-Temperature Superconductor? appeared first on Analytics India Magazine.

Introduction to Statistical Learning, Python Edition: Free Book


For years, Introduction to Statistical Learning with Applications in R, better known as ISLR, has been cherished by machine learning beginners and practitioners alike as one of the best machine learning textbooks.

Now that the Python edition of the book, Introduction to Statistical Learning with Applications in Python—or ISL with Python—is here, the community is all the more excited!

ISL with Python is Here. Great! But Why?

Glad you asked. 😀

If you’ve been in the machine learning space for a while, chances are you’ve already heard, read, or used the R version of the book before. And you know what you liked best about it. But here’s my story.

The summer before I started grad school, I decided to teach myself machine learning, and I was lucky to stumble across ISLR early in my journey. The authors do a great job of breaking down complex machine learning algorithms in an easy-to-follow manner, along with the required mathematical foundations, without overwhelming the learner. This is an aspect of the book I really enjoyed.

The code examples and labs in ISLR, however, are in R. Sadly enough, I did not know R back then, but was comfortable programming in Python. So I had two options.


I could teach myself R. Or I could use other resources—tutorials and documentation—to build models in Python. Like most other Pythonistas, I chose the second option (yeah, the more familiar route, I know).

While R is great for statistical analysis, Python is a good first language if you’re just starting out on your data journey.

But this isn’t a problem anymore! Because this new Python edition lets you code along and build machine learning models in Python. No more worries about having to pick up a new programming language to follow along.

Story time’s up! Let’s take a closer look at the contents of the book.

Contents of ISL with Python

In terms of content, the Python edition is quite similar to the R edition; as you'd expect, though, the code and labs have been adapted for Python. The book also includes a crash-course section on Python programming to cover the basics.

The book covers substantial breadth. From the foundations of statistical learning and supervised and unsupervised learning algorithms to deep learning and more, it is organized into the following chapters:

  • Statistical Learning
  • Linear Regression
  • Classification
  • Resampling Methods
  • Linear Model Selection and Regularization
  • Moving Beyond Linearity
  • Tree-Based Methods
  • Support Vector Machines
  • Deep Learning (covers vanilla neural networks to ConvNets and recurrent neural networks)
  • Survival Analysis and Censored Data
  • Unsupervised Learning
  • Multiple Testing (a deep dive into hypothesis testing)

The ISLP Python Package

The book uses datasets sourced from publicly available repositories such as the UCI Machine Learning repository and other similar resources. Some examples include datasets on bike sharing, credit card default, fund management, and crime rates.

Learning to collect data from various sources, whether by web scraping or by importing it from existing repositories, is an important part of any data science project.

However, for a learner who is unfamiliar with the data collection step, having to gather the data themselves can introduce friction when they want to use the book for both the theory and the hands-on sections.

To facilitate a smooth learning experience, the book comes with an accompanying ISLP package:

  • The ISLP package is available for all major platforms: Linux, Windows, and macOS.
  • You can install it using pip (pip install islp), preferably in a virtual environment on your machine.

The ISLP package has comprehensive documentation and comes with data-loading utilities. When you work with a particular dataset, the docs page gives you ready-to-access information on its features, the number of records, and starter code to load the data into a pandas DataFrame.
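
As a rough illustration, here is a minimal sketch of loading one of the book's datasets into a pandas DataFrame. The load_data helper and the 'Wage' dataset name are assumptions based on the ISLP documentation, so verify them against the version you install.

    # Minimal sketch: load a dataset shipped with the ISLP package.
    # Assumes the load_data helper and the 'Wage' dataset name from the ISLP docs.
    from ISLP import load_data

    Wage = load_data('Wage')       # returns a pandas DataFrame
    print(Wage.shape)              # number of records and columns
    print(Wage.columns.tolist())   # feature names described on the docs page
    print(Wage.head())             # first few rows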

It also has helper functions for creating higher-order features, such as polynomial and spline features (see the sketch after the figure below).

Generating polynomial features (image from the ISLP docs).
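
As a hedged sketch of what this looks like in practice, the snippet below builds a degree-4 polynomial in the age column using the package's model-specification helpers. The ModelSpec and poly names are taken from the ISLP docs and labs and should be treated as assumptions to check against your installed version.

    # Hedged sketch: polynomial features via ISLP's model-specification helpers.
    # ModelSpec and poly follow the ISLP docs; verify against your installed version.
    from ISLP import load_data
    from ISLP.models import ModelSpec as MS, poly

    Wage = load_data('Wage')

    # Build a design matrix containing a degree-4 polynomial in 'age'.
    design = MS([poly('age', degree=4)])
    X = design.fit_transform(Wage)
    print(X.shape)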

For a more complete learning experience, you can also read the data in from its original sources and perform the feature engineering yourself, without using the ISLP package.

When you're building models, you can try a scikit-learn-only implementation, and PyTorch or Keras for the deep learning sections.
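
For example, here is a minimal scikit-learn-only sketch that fits a linear regression on a couple of numeric predictors from the Wage data. The column names (age, year, wage) are assumptions based on the ISLR version of that dataset.

    # Minimal scikit-learn-only sketch: fit and score a linear regression
    # on the Wage data. Column names follow the ISLR version of the dataset.
    from ISLP import load_data
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import train_test_split

    Wage = load_data('Wage')
    X = Wage[['age', 'year']]   # two numeric predictors
    y = Wage['wage']

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = LinearRegression().fit(X_train, y_train)
    print(model.score(X_test, y_test))   # R^2 on held-out data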

So Who’s This Book For Again?

Data Science and Machine Learning Beginners: If you are a beginner who prefers a self-taught route to learn machine learning, this book is a great learning resource.

ML Practitioners: As a machine learning practitioner, you'll have experience building machine learning models, but revisiting the basics, such as hypothesis testing and the core algorithms, can still be helpful.

Educators: The theory and the labs together make this book a great companion for a first course in machine learning. Most universities and data science bootcamps these days teach machine learning. So if you are an educator who is teaching or looking to teach a machine learning course, this is a great course textbook to consider.

Wrapping Up

And that's a wrap. Introduction to Statistical Learning with Python has been one of the most exciting releases of this summer.

You can head over to statlearning.com and start reading the Python edition. While the digital copy is free to read, the paperback sold out on Amazon on the very first day. We're excited to see you make the most of the book, so start reading it today. Happy learning!
Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she's working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more.
