MLOps Best Practices You Should Know

Photo by Arian Darvishi on Unsplash

MLOps, or Machine Learning Operations, is a collection of techniques and tools for deploying machine learning models in a production environment. In recent years, DevOps has proven integral to software companies by minimising the time between releases and closing the gap between development and operations.

Building on that success, developers brought DevOps principles to the machine learning field, creating MLOps. By combining CI/CD principles with machine learning models, data teams can integrate and deliver models to production in a timely manner. MLOps also introduces the new principles of Continuous Training (CT) and Continuous Monitoring (CM), making the production environment even more suitable for machine learning models.

With so much progress in MLOps, we should follow a few best practices to achieve an effective workflow. What are they? Let’s get into it.

MLOps Best Practices

Before continuing, this article assumes the reader already has basic knowledge of MLOps, machine learning, and programming. With that in mind, let’s continue with the best practices.

1. Establishing a Clear Project Structure

MLOps is easier to adopt with a clear structure for your company's use cases. There is no single MLOps pipeline or toolset that fits every situation, so we need a clear structure for our project. A well-organized project structure makes navigating, maintaining, and scaling our future projects more manageable.

A clear project structure means we understand the project end-to-end, from the business problem through production and monitoring. Some tips to improve our project structure include:

— Organize code and data by environment and function, and keep naming conventions consistent to avoid mishaps,

— Use version control such as Git or DVC to track the changes,

— Keep documentation in a consistent style,

— Communicate with your teams about whatever you do and change.

Establishing a clear project structure can be a hassle, but it will certainly help our project in the long run.

2. Know Your Tools Stack

MLOps is not only about the concept, but it’s also about the tools. There are many tools to choose from for every activity in your MLOps. However, the choices depend on your project and company requirements.

For example, if your company's compliance rules require that data analysis be done with in-house tools, you must follow them. That is why knowing the tool stack you want to use in your MLOps pipeline is essential when developing the capability.

To help you understand the necessary tools for your project, here is the MLOps Stack Template by Valohai that you can refer to.

Image by Valohai

Also, try to limit the tools to around three to five. The more tools you use, the more complicated it will become.

3. Track Your Expenses

The aim of using MLOps in our pipeline is to minimise technical debt. It’s a great aim, as we don’t want technical debt to complicate our project. However, don’t let monetary expenses balloon just because we want to minimise our technical debt.

Many MLOps tools are subscription-based or pay-per-use. Rather than having you develop everything from scratch or assemble open-source pieces, many paid tools offer a service that lets users integrate MLOps with a better experience.

However, it is easy to pay for services we end up using only sparsely, which happened to me as well when I first adopted MLOps. Remember to track our expenses well, as we don’t want the value provided by MLOps to be diminished by its cost.

If you use cloud services such as AWS, a cost calculator and billing alarms can remind you of your expenses. If not, try to track them using various tools. Even a simple Excel sheet already works.
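Even a small script can stand in for that spreadsheet. Below is a minimal sketch (the tool names and cost figures are invented for illustration, not recommendations) that totals monthly tool costs and warns when they exceed a budget:

```python
# Minimal sketch of tracking monthly MLOps tool expenses.
# Tool names, costs, and the budget are illustrative only.

MONTHLY_BUDGET = 500.0  # USD

expenses = {
    "experiment_tracking": 99.0,
    "model_registry": 150.0,
    "serving_compute": 320.0,
}

total = sum(expenses.values())
print(f"Total monthly spend: ${total:.2f}")

if total > MONTHLY_BUDGET:
    over = total - MONTHLY_BUDGET
    print(f"Warning: over budget by ${over:.2f}")
```

Extending this with a date column and a history file already gives you a rough trend of where the money goes.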

4. Having a Standard for Everything

Not only do we need a clear project structure, we also need a standard for every part of our MLOps pipeline. Minimising technical debt means we want everything to work correctly, and often faults occur because the team lacks a standard.

Imagine that the naming of tools, variables, scripts, data, etc., were random, with no coherence between one teammate and another. The process would take even longer as developers try to understand what is happening, and that incurs technical debt.

Standards apply not only to naming conventions but to everything related to the MLOps pipeline: the data analysis process, the environments used, the pipeline structure, the deployment process, and more. Put all the standards in place, and MLOps will work well.
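Standards are most useful when they are checked automatically. As a small illustration, here is a sketch of a naming-convention check; the convention itself (`<env>_<function>_<name>.py`) is just a made-up example:

```python
import re

# Example convention (illustrative): <env>_<function>_<name>.py
# e.g. prod_training_churn_model.py
NAME_PATTERN = re.compile(r"^(dev|staging|prod)_[a-z]+_[a-z0-9_]+\.py$")

def follows_convention(filename: str) -> bool:
    """Return True if the file name matches the agreed convention."""
    return NAME_PATTERN.match(filename) is not None

print(follows_convention("prod_training_churn_model.py"))  # True
print(follows_convention("MyScript.py"))                   # False
```

A check like this can run in a pre-commit hook or CI step, so deviations are caught before they incur technical debt.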

5. Assess Your MLOps Maturity Periodically

How far along our MLOps readiness is, is a question we need to ask often. We want the full benefit of MLOps, which is only available once the maturity level is there. Sadly, it’s not something you can achieve in a day or even a month.

It takes time, which is why you shouldn't wait for a perfect pipeline before starting to implement MLOps. Instead, start with what we can process first and keep assessing the readiness of our MLOps.

As a reference, I love to use the MLOps maturity pyramid by Microsoft Azure to assess readiness. There are five levels, and each level provides value to our ecosystem.

Image by Microsoft Azure

Conclusion

MLOps, or Machine Learning Operations, has become essential to the company life cycle. That is why there are some best practices you can follow:

  1. Establishing a Clear Project Structure
  2. Know Your Tools Stack
  3. Track Your Expenses
  4. Having a Standard for Everything
  5. Assess Your MLOps Maturity Periodically

I hope it helps.
Cornellius Yudha Wijaya is a data science assistant manager and data writer. While working full-time at Allianz Indonesia, he loves to share Python and Data tips via social media and writing media.


Overview of the AI Index Report: Measuring Trends in Artificial Intelligence

Image by Author

The year 2023 has been a blast with the continuous release of AI applications. We’ve had ChatGPT, Google Bard, Baby AGI, and more. A lot of us are eager to know what the future holds and the potential trends.

What is the AI Index?

The AI Index is an independent report by the Stanford Institute for Human-Centered Artificial Intelligence (HAI), put together by experts from academia and a range of industries. The purpose of the annual report is to track, collate, and visualize data related to artificial intelligence. It is used to steer companies and decision-makers in a better direction to advance AI responsibly and ethically.

To get a good understanding of the advancements of AI, the HAI committee includes members from the Center for Security and Emerging Technology at Georgetown University, LinkedIn, McKinsey, and more.

Stanford University releases an AI index report every year, however, the 2023 report consists of more self-collected data and original analysis than ever before. In particular, the report dives deeper into foundation models and speaks on their different aspects and impacts. We have seen the use of foundation models in Visual ChatGPT.

This year, the AI index has also increased its tracking of global AI legislation going from 25 countries in 2022 to 127 in 2023.

AI Index Report 2023

The 2023 AI Index Report consists of eight chapters:

1. Research and Development

Research and development is the backbone of AI, as it has driven the field's continuous growth over the decades. This chapter goes through a variety of resources, such as AI publications, journal articles, and repositories, to delve into the specific trends in AI.

It also explores machine learning systems such as large language models by looking into data. Large language models are very popular as we have seen with ChatGPT.

Currently, the United States and China dominate the research and development of AI; however, the field is being adopted by more and more countries around the world.

2. Technical Performance

This chapter covers how well AI was adopted and used, as well as its effectiveness, over the year 2022. It goes further into the advancements of AI applications such as computer vision, language, speech, reinforcement learning, and hardware.

This chapter also analyses the environmental impact AI has had, for example how it has helped specific sectors grow at a faster rate, alongside a timeline-style overview of the most significant recent AI developments.

3. Technical AI Ethics

Ethics is a big concern around AI applications, with governments coming together to create legislation, frameworks, and standards. As such a major topic of interest, ethics acts as a barrier to the creation and implementation of generative AI systems, such as OpenAI's DALL-E.

Generative AI systems are very popular, and this has caused more and more companies to join the race to deploy and release generative AI models more than ever. However, the ethical issues surrounding the AI system are becoming more apparent to the general public.

4. The Economy

We are seeing more companies, governments, and organizations implement AI applications than ever. While some adopt AI to increase productivity, others see AI adoption replacing workers. This chapter goes through the effects of AI applications on the economy, good and bad, using data from LinkedIn, Deloitte, and more.

5. Education

As AI continues to grow, there is a huge demand for tech professionals. Naturally, following supply and demand, we are seeing more and more people trying to get their foot into the tech world. More people are going back to university to study computer science or pursue PhDs, and taking advantage of the vast amount of online courses and training available.

6. Policy and Governance

Some may say that the growth of AI has been due to the fact that it's been an open playground. However, due to the constant growth in AI, governments and organizations have been prompted to strategize AI governance. In order to implement AI applications into everyday lives, these organizations have been directed to look into the societal and ethical concerns surrounding AI.

7. Diversity

Although AI systems are being created and deployed widely, this chapter looks into who is actually building and using AI. It reports that North American AI researchers and practitioners remain predominantly white and male, which can reinforce existing societal inequalities and bias.

8. Public Opinion

It wouldn’t be an accurate report if we didn’t hear from the public. This chapter examines public opinion surrounding AI from global, demographic, and ethical perspectives. AI is at a point where it's embedded into our everyday lives, and monitoring public attitudes toward AI is important for identifying ethical concerns, trends, and overall societal impacts.

AI Index Report Key Takeaways

These are the key takeaways from the 2023 AI Index Report:

  • The tech industry is winning the race against academia regarding producing machine learning models.
  • AI is helping the environment, however, it is also having serious environmental impacts.
  • Benchmark improvements in AI continue to be marginal.
  • Some AI applications perform so well in the science sector that they are being deemed the ‘Best New Scientists’.
  • The misuse of AI has increased 26x since 2012, and the number of incidents continues to rise.
  • The demand for AI professionals and skills has increased in every sector of the United States.
  • Private investment into AI has shown a 26.7% decrease since 2021.
  • AI legislation, regulations, frameworks, and standards are on the rise to keep pace with the supply of AI applications.
  • Chinese citizens have the most positive attitude toward AI products.

Wrapping it up

We’ve had to take in a lot of news in the past 2 months with the release of different AI applications. We expect to see more large language models, multimodal models, and autonomous AI applications. It will be interesting to see the 2024 report. What do you think will be in next year's report? Let us know in the comments.
Nisha Arya is a Data Scientist, Freelance Technical Writer and Community Manager at KDnuggets. She is particularly interested in providing Data Science career advice or tutorials and theory based knowledge around Data Science. She also wishes to explore the different ways Artificial Intelligence is/can benefit the longevity of human life. A keen learner, seeking to broaden her tech knowledge and writing skills, whilst helping guide others.


Using ChatGPT to Learn SQL

Image by Editor | Microsoft Designer

ChatGPT can do many cool things. One of them is writing code. You only need to give the right instruction and ChatGPT will do the job for you.

If you want to learn SQL, ChatGPT is a great resource to get started. It can help you create SQL queries using natural language, solve any coding questions you might have or even help you understand pre-defined queries that you do not understand.

In this article, I will outline how you can use ChatGPT to learn SQL and become proficient in this valuable skill.

Let’s figure it out together!👇🏻

First things first, so… what exactly is ChatGPT?

ChatGPT is a large language model trained by OpenAI. It is capable of generating human-like text based on the input it receives and can be used to answer questions and engage in conversations with people.

So basically, we can take advantage of its knowledge — and its capacity to tell us anything in a very simple and human way — to understand SQL and learn from it.

Step 1: Set up ChatGPT

To get started with ChatGPT, you’ll need to sign up for an account here.

Signing Up for ChatGPT display.

You’ll have to give your email address and phone number to start using ChatGPT.

Step 2: Learn how to interact with ChatGPT

Once you have ChatGPT enabled, you should see the following display:

Screenshot of the ChatGPT chat display.

In the lower input box, we can write anything to start interacting with ChatGPT. As I will be disturbing her — or him — for a while, I’ll start apologizing in advance 😉

Self-made gif. ChatGPT answering my first message.
Step 3: Ask anything you can imagine, and it will answer you.

So now, we can start asking ChatGPT anything. I will focus on learning SQL. There are 5 main things we can ask to improve our SQL skills by using ChatGPT.

1. Explain to us any concept or commands in SQL

When diving into a new programming language, we usually have a lot of doubts and confusion about every command. ChatGPT can help us understand any command through easy explanations and give us as many examples of its usage as we request.

Let’s start with the most basic thing in SQL — the SELECT command.

Explain me shortly what is the SELECT command and an example of how to use it

ChatGPT gives a short answer explaining the SELECT command, together with the most basic example: SELECT * FROM

Screenshot of the ChatGPT chat.

However, I can ask for more detail about the SELECT command.

Explain me what is the SELECT command and some examples showing how to use it

And as you can observe in the following image, ChatGPT explains the basics of the command and gives us some examples of usage.

Screenshot of the ChatGPT chat. Giving more examples of the SELECT command.

You can ask about any command you can imagine, for instance a more challenging one such as HAVING.

Give me some examples of the HAVING command in SQL

And ChatGPT answers with several examples like the following ones:

  1. To find the names and salaries of all employees who make more than the average salary, you might use a query like this:

SELECT name, salary
FROM employees
GROUP BY name
HAVING salary > (SELECT AVG(salary) FROM employees);

2. To find the names of all customers who have placed more than 10 orders in the past year, you might use a query like this.

SELECT name
FROM customers
WHERE customer_id IN (
    SELECT customer_id
    FROM orders
    WHERE date >= DATEADD(year, -1, GETDATE())
    GROUP BY customer_id
    HAVING COUNT(*) > 10
);

Of course, we can keep asking for more explanations and more examples. Try any other command you can think of, and it will answer right away.
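One caveat: ChatGPT's answers are not always valid SQL (for instance, filtering a non-aggregated column in HAVING is rejected by many databases), so it pays to actually run what it gives you. Here is a minimal sketch using Python's built-in sqlite3 module to try a HAVING query on toy data of my own invention:

```python
import sqlite3

# Build a tiny in-memory database to test HAVING against.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER)")
# Customer 1 places 3 orders, customer 2 places 1 order.
conn.executemany("INSERT INTO orders VALUES (?)", [(1,), (1,), (1,), (2,)])

# Find customers with more than 2 orders.
rows = conn.execute(
    "SELECT customer_id, COUNT(*) FROM orders "
    "GROUP BY customer_id HAVING COUNT(*) > 2"
).fetchall()
print(rows)  # [(1, 3)]
```

Running the generated queries on a toy table like this is a quick way to catch the cases where ChatGPT's SQL only looks plausible.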

2. You can ask how to do something in SQL and ChatGPT will let you know what command (or commands) to use.

I can ask how to do a specific action, and ChatGPT will let me know what command I need to use.

I want to merge two tables, what command should I use in SQL?

And ChatGPT tells me to use a join command, as you can observe in the following picture.

Screenshot of the ChatGPT chat. Explaining how to merge two tables.

However, I know I just want to join two tables where the rows have coinciding values in some specific columns. In this case, I can ask again to find out which command I should use.

I want to join two tables and just get the data that have coinciding values in some given columns.

Hence, ChatGPT lets me know that only the INNER JOIN allows me to do that, as you can observe in the following image:

Screenshot of the ChatGPT chat. Explaining how to merge two tables while keeping only the rows with coinciding values.

And it gives me the corresponding query:

SELECT *
FROM table1
INNER JOIN table2
    ON table1.id = table2.id
    AND table1.name = table2.name;
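To see this behaviour concretely, you can run an equivalent join against an in-memory SQLite database from Python; the table contents below are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE table2 (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO table1 VALUES (?, ?)", [(1, "ana"), (2, "bob")])
conn.executemany("INSERT INTO table2 VALUES (?, ?)", [(1, "ana"), (3, "eve")])

# Only rows with coinciding id AND name in both tables survive the join.
rows = conn.execute(
    "SELECT * FROM table1 "
    "INNER JOIN table2 ON table1.id = table2.id AND table1.name = table2.name"
).fetchall()
print(rows)  # [(1, 'ana', 1, 'ana')]
```

Note that bob and eve drop out: an INNER JOIN keeps only the rows that match in both tables, which is exactly the behaviour ChatGPT described.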

3. You can ask ChatGPT to create a query using natural language

Now let’s imagine I know what result I need, but I have no idea how to formulate the query. I can simply explain what I want to ChatGPT, and it will give me a structure to follow. Hence, I can learn how to structure queries by following ChatGPT's examples.

Explain to me how to create a SQL query that computes the most expensive cities in Europe having a table with the prices of different items in each city.

ChatGPT answers me right away, as you can observe in the following image.


ChatGPT gives me an example of a query and explains what this query does.
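Queries of this shape are easy to reproduce locally. The sketch below builds a tiny prices table in SQLite (the city names and prices are made up) and averages item prices per city, most expensive first:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE prices (city TEXT, item TEXT, price REAL)")
conn.executemany(
    "INSERT INTO prices VALUES (?, ?, ?)",
    [
        ("Zurich", "coffee", 5.0), ("Zurich", "bread", 4.0),
        ("Lisbon", "coffee", 1.5), ("Lisbon", "bread", 1.0),
    ],
)

# Rank cities by average item price, most expensive first.
rows = conn.execute(
    "SELECT city, AVG(price) FROM prices "
    "GROUP BY city ORDER BY AVG(price) DESC"
).fetchall()
print(rows)  # [('Zurich', 4.5), ('Lisbon', 1.25)]
```

Comparing a run like this against ChatGPT's suggested query is a good way to confirm you understood the structure it proposed.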

4. You can ask ChatGPT to explain how a query works.

Now let’s imagine you inherit work from a workmate who is sick, but you do not understand their queries; some people code in a messy way, or you may simply feel lazy and not want to spend a lot of time deciphering other people's queries.

That’s normal — and you can use ChatGPT to avoid this task. We can easily ask ChatGPT to explain a given query.

Let’s imagine we want to understand what the following query does:

What does the following query do: [Insert Query here]

ChatGPT just answers right away:

Screenshot of the ChatGPT chat. It explains what a given query does.

As you can observe in the previous image, ChatGPT explains what this query does step by step.

First, it explains all contained subqueries and what they do. Then it explains the final query and how it uses the previous subqueries to merge all the data. We can even ask for more detailed explanations in a given subquery.

Can you further explain what the second subquery of the previous query does?

Screenshot of the ChatGPT chat. It further explains what the second subquery of the given query does.

As you can observe in the previous image, ChatGPT explains in a detailed way what the second subquery performs.

You can challenge ChatGPT with any query you can imagine!

5. You can ask ChatGPT to challenge you with exercises.

For me, the best part of ChatGPT is asking for exercises and answers to practice and test your skills. It can even tell you whether you are doing well or not.

Can you give me some exercises to practice SQL

Screenshot of ChatGPT giving me some exercises to practice SQL.

Now ChatGPT tells me some problems to perform. In this case, I can try to solve the first one and ask ChatGPT if my solution is correct.

Is the following query a correct answer to the first exercise above? [Insert Query]

ChatGPT will answer right away, telling you whether it is correct and why.

Screenshot of ChatGPT answering if the query I have coded is correct or not.

I can ask for the correct answer to each of the previous examples:

Can you give me the correct answer to the previous exercises?

And as you can observe in the following image, ChatGPT will give me all the correct queries to perform.


⚠️ Notice that the answer ChatGPT provides and the one I submitted to be checked are completely different.

Conclusion

SQL is a valuable skill to have in today’s data-driven world. By using ChatGPT to learn the basics and practice your skills, you can become proficient in SQL. With continued learning and practice, you can keep expanding your skills and take a leap forward in your professional data life using this tool.

Let me know if ChatGPT surprises you with some other good features. I will read you in the comments! 😀

Data always has a better idea — trust it.

Josep Ferrer is an analytics engineer from Barcelona. He graduated in physics engineering and is currently working in the Data Science field applied to human mobility. He is a part-time content creator focused on data science and technology.

Original. Reposted with permission.


Top Posts April 17-23: AutoGPT: Everything You Need To Know

Most Popular Posts Last Week

  1. AutoGPT: Everything You Need To Know by Nisha Arya
  2. Baby AGI: The Birth of a Fully Autonomous AI by Nisha Arya
  3. Mastering Generative AI and Prompt Engineering: A Free eBook by Matthew Mayo
  4. Data Analytics: The Four Approaches to Analyzing Data and How To Use Them Effectively by Nate Rosidi
  5. A Step-by-Step Guide to Web Scraping with Python and Beautiful Soup by Aryan Garg

Most Popular Posts Past 30 Days

  1. AutoGPT: Everything You Need To Know by Nisha Arya
  2. Automate the Boring Stuff with GPT-4 and Python by Natassha Selvaraj
  3. 8 Open-Source Alternative to ChatGPT and Bard by Abid Ali Awan
  4. Top 19 Skills You Need to Know in 2023 to Be a Data Scientist by Nate Rosidi
  5. LangChain 101: Build Your Own GPT-Powered Applications by Bala Priya C
  6. 4 Ways to Generate Passive Income Using ChatGPT by Youssef Rafaat
  7. 5 Free Tools For Detecting ChatGPT, GPT3, and GPT2 by Abid Ali Awan
  8. 10 Websites to Get Amazing Data for Data Science Projects by Nate Rosidi
  9. 4 Ways to Rename Pandas Columns by Abid Ali Awan
  10. OpenChatKit: Open-Source ChatGPT Alternative by Abid Ali Awan


Automate Your Codebase with Promptr and GPT

Image by Author

Introduction

As the field of Artificial Intelligence is growing and evolving, we have seen the rise of powerful tools like GPT, ChatGPT, Bard, etc. Programmers are using these tools to streamline their workflows and optimize their codebase. It has enabled them to focus more on building the program's core logic and less on the more mundane and repetitive tasks. However, programmers are experiencing the issue of copy-pasting their code into these models, getting the recommendations, and then updating their codebase. This procedure becomes tiresome for the people who do it frequently.

Fortunately, there is now a solution to this problem. Let me introduce you to Promptr, an open-source command-line tool that allows programmers to automate their codebase without leaving their editor. Sounds cool, right? If you are interested in knowing more about how this tool works, what it offers, and how to set it up, please sit back and relax while I explain it to you.

What is Promptr?

Promptr is a CLI tool that makes applying GPT code recommendations to your codebase a lot easier. You can refactor your code, implement classes to pass tests, experiment with LLMs, perform debugging and troubleshooting, and more, all with a single command. As per its official documentation:

“This is most effective with GPT4 because of its larger context window, but GPT3 is still useful for smaller scopes.” (Source — GitHub)

This tool accepts several space-separated parameters that specify the mode, template, prompt, and other settings for generating the output.

General Syntax:

promptr -m <mode> [options] <file1> <file2> <file3> ...

For example, some of the available options are:

  • -m, --mode <mode>: specifies the mode to use (GPT-3 or GPT-4). The default mode is GPT-3.
  • -d, --dry-run: an optional flag; only the prompt is sent to the model, and the changes are not reflected in your file system.
  • -i, --interactive: enables interactive mode, allowing the user to pass various inputs.
  • -p, --prompt <prompt>: non-interactive mode; it can be a string or a URL/path containing the prompt.

Similarly, you can use other options mentioned in the official GitHub repository, depending on your use case. Now, you might be wondering how it all happens under the hood. So, let's explore that.

How does Promptr Work?
Image by Author

The first thing to do is clean your working area and commit any changes. Then, write a prompt with clear instructions, as if you were explaining the task to an inexperienced co-worker. After that, specify the context that you will send along with your prompt to GPT. Note that the prompt is your instruction to GPT, while the context refers to the files GPT must know about to perform the codebase operations. For instance:

promptr -p "Cleanup the code in this file" index.js

Here index.js is the context, while "Cleanup the code in this file" is your prompt to GPT. Promptr sends it to GPT and waits for the response, which may take some time. The response generated by GPT is first parsed by Promptr, after which the suggested changes are applied to your file system. And that’s it! A simple yet very useful tool.

Setting up Promptr for Automating your Codebase

Here are the steps to set up Promptr on your local computer:

Requirements

  • Node.js v18 or later
  • OpenAI API key

Installation

Open a terminal or command line window. Install Promptr globally by running either of the below commands, depending on the package manager you are using:

npm:

npm install -g @ifnotnowwhen/promptr

yarn:

yarn global add @ifnotnowwhen/promptr

You can also install Promptr by copying the binary for the current release to your path, but that is only supported for macOS users as of now.

Once the installation is complete, you can verify it by executing the following command:

promptr --version

Setting the OpenAI API Key

You will need an OpenAI API key to use Promptr. If you don’t have one, you can sign up for a free account and get up to $18 in free credits.

Once you have your secret key, set the environment variable OPENAI_API_KEY.

For Mac or Linux:

export OPENAI_API_KEY=<your secret key>

For Windows:

Click “Edit the system environment variables” to add a new variable OPENAI_API_KEY and set its value to the secret key you received from your OpenAI account.

Conclusion

Although Promptr lets you operate on your code much as you maintain your text files, this technology is still in its early stages and has some drawbacks. For example, there is a potential for data loss if GPT recommends deleting files, so it is advised to commit your important work before using it. Similarly, some people have expressed concern about the per-token cost of using the OpenAI API. Nonetheless, I wonder how far away we are from software that can self-repair. If you want to experiment with it, here is the link to the official GitHub repository: Promptr.
Kanwal Mehreen is an aspiring software developer with a keen interest in data science and applications of AI in medicine. Kanwal was selected as the Google Generation Scholar 2022 for the APAC region. Kanwal loves to share technical knowledge by writing articles on trending topics, and is passionate about improving the representation of women in the tech industry.


Working with Confidence Intervals

Image by Editor

In data science and statistics, confidence intervals are very useful for quantifying uncertainty in a dataset. For normally distributed data, the 68% confidence interval covers values within one standard deviation of the mean, and the 95% confidence interval covers values within two standard deviations of the mean. The confidence interval can also be estimated with the interquartile range, which spans the data values between the 25th percentile and the 75th percentile, with the 50th percentile representing the median value.
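These estimates can be computed directly. As a quick illustration, here is a minimal NumPy sketch on synthetic normally distributed data (the mean, spread, and sample size are arbitrary choices for the example):

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=65, scale=3.5, size=10_000)  # synthetic "heights"

mu, sigma = data.mean(), data.std()

# ~68% of values fall within one standard deviation of the mean,
# ~95% within two standard deviations.
ci_68 = (mu - sigma, mu + sigma)
ci_95 = (mu - 2 * sigma, mu + 2 * sigma)

# Interquartile range: 25th to 75th percentile, median at the 50th.
q25, q50, q75 = np.percentile(data, [25, 50, 75])

print("68% interval:", ci_68)
print("95% interval:", ci_95)
print("IQR:", (q25, q75), "median:", q50)
```

On data this close to normal, roughly 68% and 95% of the samples do land inside the respective intervals, which is the rule of thumb the article relies on below.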

In this article, we illustrate how the confidence interval can be calculated using the heights dataset. The heights dataset contains male and female height data.

Visualization of Probability Distribution of Heights

First, we generate the probability distribution of the male and female heights.

# import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# obtain dataset
df = pd.read_csv('https://raw.githubusercontent.com/bot13956/Bayes_theorem/master/heights.csv')

# plot probability distribution of heights
sns.kdeplot(df[df.sex=='Female']['height'], label='Female')
sns.kdeplot(df[df.sex=='Male']['height'], label='Male')
plt.xlabel('height (inch)')
plt.title('probability distribution of Male and Female heights')
plt.legend()
plt.show()

Probability distribution of male and female heights | Image by Author.

From the figure above, we observe that males are on average taller than females.

Calculation of Confidence Intervals

The code below illustrates how the 95% confidence intervals for the male and female heights can be calculated.

# calculate confidence intervals for male heights
mu_male = np.mean(df[df.sex=='Male']['height'])
mu_male
>>> 69.31475494143555

std_male = np.std(df[df.sex=='Male']['height'])
std_male
>>> 3.608799452913512

conf_int_male = [mu_male - 2*std_male, mu_male + 2*std_male]
conf_int_male
>>> [62.09715603560853, 76.53235384726257]

# calculate confidence intervals for female heights
mu_female = np.mean(df[df.sex=='Female']['height'])
mu_female
>>> 64.93942425064515

std_female = np.std(df[df.sex=='Female']['height'])
std_female
>>> 3.752747269853828

conf_int_female = [mu_female - 2*std_female, mu_female + 2*std_female]
conf_int_female
>>> [57.43392971093749, 72.4449187903528]

Confidence Interval Using Boxplot

Another method to estimate the confidence interval is to use the interquartile range. A boxplot can be used to visualize the interquartile range as illustrated below.

# generate boxplot
data = list([df[df.sex=='Male']['height'],
             df[df.sex=='Female']['height']])

fig, ax = plt.subplots()
ax.boxplot(data)
ax.set_ylabel('height (inch)')
xticklabels = ['Male', 'Female']
ax.set_xticklabels(xticklabels)
ax.yaxis.grid(True)
plt.show()

Box plot showing the interquartile range. | Image by Author.

The box shows the interquartile range, the whiskers indicate the minimum and maximum values of the data excluding outliers, and the round circles indicate the outliers themselves. The orange line is the median value. From the figure, the interquartile range for male heights is [67 inches, 72 inches], and the interquartile range for female heights is [63 inches, 67 inches]. The median male height is 68 inches, while the median female height is 65 inches.
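The quartiles read off the boxplot can also be computed directly rather than estimated by eye. The sketch below uses np.percentile on synthetic data (the exact values for the real heights dataset will differ):

```python
import numpy as np

rng = np.random.default_rng(0)
male_heights = rng.normal(loc=69.3, scale=3.6, size=5000)  # synthetic stand-in

# 25th, 50th (median), and 75th percentiles
q25, q50, q75 = np.percentile(male_heights, [25, 50, 75])
print(f"interquartile range: [{q25:.1f}, {q75:.1f}], median: {q50:.1f}")
```

For normally distributed data, the quartiles sit roughly 0.67 standard deviations either side of the median, which is consistent with the boxplot above.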

Summary

In summary, confidence intervals are very useful for quantifying uncertainty in a dataset. The 95% confidence interval represents data values that are distributed within two standard deviations of the mean. The confidence interval can also be estimated as the interquartile range, which represents data values between the 25th percentile and the 75th percentile, with the 50th percentile representing the median value.
Benjamin O. Tayo is a Physicist, Data Science Educator, and Writer, as well as the Owner of DataScienceHub. Previously, Benjamin was teaching Engineering and Physics at U. of Central Oklahoma, Grand Canyon U., and Pittsburgh State U.


Are You Competing with Your Customers?

Competing with customers – the statement, although oxymoronic, describes the state of the tech ecosystem today. It is hard to visualise a linear end-to-end tech value chain; there is always intermingling, and jumps occur in between. Be it in semiconductors or the cloud, the logic of maximising profits and diversifying product portfolios for multiple revenue streams has made companies enter the markets where their customers are.

The case of Arm

As per recent reports, chip designer Arm is making its own prototype semiconductor for mobile devices, laptops, and other electronics. This follows another report that Arm sought to increase prices and overhaul its business model by charging royalties to device-makers rather than to some of its chipmaker customers.

Arm has traditionally been a company that sells IP to others and charges a licensing fee for it. But, in an annual report published last week, the company highlighted that a principal risk to its business was the “significant concentration” in its customer base – for example, Arm’s top 20 customers accounted for 86 percent of revenues last year.

On the outside, it looks unlikely that Arm would go ahead and produce its own chips, since it would not want to upset its primary revenue contributors. However, given its desire to diversify its revenue streams, Arm may yet venture into a new market. “If Arm comes into this market, it has to be in high performance computing (HPC) or data centres because that market doesn’t have a lot of Arm players. On the other hand, the smartphone market is a very mature one for Arm to enter,” said Sravan Kundojjala, Principal Industry Analyst at TechInsights.

However, producing data centre chips at 5 nm or below, scaling up volumes, and developing software support will require significant capital investment. And going by its quarterly results, Arm’s revenue seems to be barely keeping up with the R&D spending required. Perhaps that explains the price increase. “But to what extent will it keep increasing the price? A sudden business model change may not resonate with smartphone OEMs who already feel burdened by the Qualcomm royalty,” said Kundojjala.

“There is also the custom silicon market that Arm can get into,” said Kundojjala. Companies like Broadcom and Marvell have very significant custom silicon businesses. If Arm goes into custom silicon, it can also work with hyperscalers like Amazon, Google, Tencent, and Microsoft.

Coopetition is the game

But this model of ‘compete and cooperate’ (in other words, coopetition) with your customers is not new, especially in the chip world. For example, Samsung buys best-in-class chips from Qualcomm, and in return, Qualcomm gives some foundry business to Samsung. So, in the case of Samsung-Qualcomm, the arrangement was necessary and in the interest of both parties. Even Nvidia cooperates with Intel on server CPUs, because its GPUs have to work very closely with Intel x86 systems, while at the same time competing with Intel in AI and other segments.

On the other hand, companies like TSMC and ASML don’t encroach on any of their customers, primarily because the barriers to entry are too high. For example, TSMC cannot make EUV tools, and likewise, ASML cannot pursue the foundry business. “In a way, sometimes you compete, sometimes you cooperate,” said Kundojjala.

Operating system makers like Microsoft licence Windows to other PC companies while also producing their own PC products. The practice is, in short, very common in the tech industry.

In recent times, this phenomenon has only been amplified. According to a report by The Information, Microsoft has been developing its own AI chips since 2019 in order to reduce its dependence on Nvidia and avoid a costly reliance on its products. With this, it joins other major cloud computing providers like GCP and AWS in developing in-house chips for AI training and inference – all while these platforms also offer their competitor Nvidia’s AI chips on the cloud.

Cloud is no exception

A survey of 535 IT professionals conducted by Spiceworks in 2018 found that only 47 percent of IT professionals were loyal to their cloud service providers (CSPs). The number is likely even lower now, perhaps because it’s easier for IT pros to switch from one cloud provider to another if a better deal comes along.

In the past five to seven years, cloud companies were limited in their server and database offerings, with only a small number of templates available. But customers seek flexibility. A multi-cloud strategy gives customers access to new server templates, allowing them to choose as per their needs and often making the setup much cheaper.

The post Are You Competing with Your Customers? appeared first on Analytics India Magazine.

Pune-based Persistent Systems Crosses USD 1 Billion in Annual Revenue

Pune-based mid-tier IT company Persistent Systems Ltd has reported a consolidated net profit of Rs. 251.50 crore for its fourth quarter ended March 2023, an increase of 5.7% sequentially from Rs. 237.90 crore. The company also posted consolidated revenue from operations of Rs. 2,254.5 crore for the quarter, up 4% quarter-on-quarter from Rs. 2,169.40 crore.

The company’s net profit increased by 35% year-on-year to Rs. 237.90 crore in the previous quarter.

Sandeep Kalra, Chief Executive Officer and Executive Director at Persistent, acknowledged that fiscal year 2023 was a “momentous” one for the company, as it reached several key milestones, such as attaining USD 1 billion in annual revenue and being included in three key indices of the National Stock Exchange of India, including the Nifty IT index.

“We have been nimble, proactive, and disciplined, allowing us to build a healthy booking pipeline and maintain competitive advantage. We’re truly grateful to our clients, partners, investors, and team members for their unwavering trust. As we move to the next phase of growth, we will continue to strengthen our partner ecosystem, maintain operational rigor, and deepen our capabilities to scale our Digital Engineering expertise and drive business value for our clients,” he stated.

The board has recommended a final dividend of Rs. 12 per equity share and a special dividend of Rs. 10 per equity share of face value Rs. 10 each for FY 2022-23, according to a company filing. The special dividend, which will be paid together with the final dividend, was recommended to mark the company crossing USD 1 billion in annual revenue.

The company also said in the filing that the order booking for the quarter under review was at USD 421.60 million in annual contract value terms.

The post Pune-based Persistent Systems Crosses USD 1 Billion in Annual Revenue appeared first on Analytics India Magazine.

Data Science Hiring Process at Pepperfry

Since its debut in 2011, Mumbai-based Pepperfry has been a game-changer in the way Indians shop for furniture and stylish home decor. Pepperfry was one of the early adopters of cutting-edge technology when it launched its online marketplace. Apart from classy furniture, Pepperfry is known for its wide range of furnishings, lighting, and other home utility products. Started by former eBay executives Ambareesh Murty and Ashish Shah, Pepperfry caters to every taste and requirement for every type of home.

AIM got in touch with Devvrat Arya, vice president of technology at Pepperfry, to understand how the furniture giant is implementing data science and who is cut out for such roles.

“We are at the cusp of a new era where AI will become ubiquitous, and machine learning and deep learning will power everything. To succeed in this field, one must have a strong foundation in linear algebra and statistical analysis, as every algorithm is based on these fundamental concepts,” said Arya.

Pepperfry’s AI & Analytics Play

Pepperfry’s data science team is fairly small and focuses on addressing problems related to the company’s customers and business. The organisation strongly believes in the super-lean methodology and begins by defining problem statements before seeking individuals to build a team around the project. This approach also governs the team’s structure and hiring process.

Pepperfry implements AI/ML models to improve the customer experience and one specific area that they focus on is anomaly detection. The data science team is analysing historical order-placed patterns from the past year and examining all possible combinations. By using various ML algorithms, the team can identify order anomalies that deviate from the predicted order-placed pattern logic. Once an anomaly is detected, the ML engine automatically notifies the relevant parties of the possible root cause of the issue. The data science team and developers collaborate to address the problem and prevent any technology leaks that could negatively impact business outcomes.
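Pepperfry has not published its implementation, but the general idea of flagging order counts that deviate sharply from a historical pattern can be sketched with a simple z-score rule. Everything below (the function name, the data, and the threshold) is hypothetical, purely to illustrate the concept:

```python
from statistics import mean, stdev

def detect_order_anomalies(hourly_orders, threshold=3.0):
    """Flag hours whose order count deviates from the historical
    mean by more than `threshold` standard deviations."""
    mu = mean(hourly_orders)
    sigma = stdev(hourly_orders)
    if sigma == 0:
        return []
    return [
        (hour, count)
        for hour, count in enumerate(hourly_orders)
        if abs(count - mu) / sigma > threshold
    ]

# 24 hourly order counts; hour 13 shows a sudden, suspicious drop
orders = [40, 42, 38, 41, 39, 43, 40, 44, 41, 40, 42, 39,
          41, 2, 40, 43, 41, 39, 42, 40, 44, 41, 38, 40]
print(detect_order_anomalies(orders))  # → [(13, 2)]
```

A production system would use richer ML models and seasonality-aware baselines, but the notify-on-deviation loop described above works the same way.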

Although in the beta stage, Pepperfry has also developed a visual search engine that allows customers to upload an image of furniture they’re interested in and receive recommendations for similar products. The algorithm compares the uploaded image with database images to find the top similar products. This feature simplifies the process of searching for furniture and decor items and provides insight into customer preferences for a more personalised shopping experience.
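The details of Pepperfry's visual search engine are not public, but a common building block of such systems is comparing image embedding vectors with cosine similarity. Below is a toy sketch with made-up three-dimensional "embeddings"; real ones would come from a deep network:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_similar(query_vec, catalogue, k=2):
    """Return the k catalogue items whose embeddings are closest to the query."""
    scored = sorted(
        catalogue.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in scored[:k]]

# toy 3-dimensional "embeddings" for three hypothetical products
catalogue = {
    "oak_table": [0.9, 0.1, 0.0],
    "teak_chair": [0.1, 0.9, 0.1],
    "pine_table": [0.8, 0.2, 0.1],
}
print(top_similar([0.85, 0.15, 0.05], catalogue))  # → ['oak_table', 'pine_table']
```

The uploaded photo's embedding is compared against every catalogue embedding, and the nearest neighbours are surfaced as recommendations.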

Read more: Data Science Hiring Process at Pepperfry

Interview Process

Pepperfry’s interview process for hiring data scientists involves several stages. First, the candidates are given an initial assessment consisting of complex and randomised data science and Python-based questions to assess their skills and knowledge. The first interview round evaluates the candidate’s analytical skills and suitability for the role based on personality traits. In the second technical interview round, the focus is on testing the candidate’s knowledge and understanding of multiple machine learning and deep learning algorithms. Lastly, there is an HR round for negotiating the salary and the issuance of an offer letter.

Skill Set Required

Pepperfry emphasises the importance of a candidate’s personality fitting the role they are applying for. The company’s initial evaluation of potential candidates focuses on several personality traits, including analytical skills, agility, communication abilities, honesty, curiosity, and strong work ethics.

The data science team focuses on visual-based customer experiences and has prioritised the use of deep learning libraries and frameworks. TensorFlow, PyTorch, and scikit-learn are the primary tools used for data analysis and deep learning projects. Additionally, Fast R-CNN and YOLO are used extensively for object detection and segmentation, and Swin Transformers are employed for efficient image-based classification. The company intends to delve deeper into GPT to gradually increase organic traffic.

A strong background in linear algebra, probability theory, statistical analysis, and probability distributions will be viewed as an asset.

Arya, who has interviewed over 500 data science engineers, has noticed that the most common mistake candidates make is attempting to apply a simplistic statistical and analytical approach to the ML process. This, he says, is inadequate: the disparity between statistical and ML analysis is significant, and a uniform strategy is unlikely to be effective in solving the problems at hand.

Work Culture

“Pepperfry, as an organisation, values a culture that is diverse and inclusive and promotes positivity,” said Arya. About 35% of its employees are women and 65% men. Despite being in existence for over eleven years, the company still maintains a start-up mentality through continuous innovation while remaining focused on its business goals. It employs a flat organisational structure, which eliminates communication barriers between colleagues.

“We have abolished the cabin culture this year, and now all employees sit together regardless of their role for faster and easier ideation, discussions, and conflict resolution within the organisation,” he added.

Pepperfry offers various perks to its employees, such as a hybrid working model, the flexibility to choose working hours, and ensuring a healthy work-life balance. The company consistently invests in its employees to enhance their skills and expertise in the latest technology and tools and provides access to top-notch learning resources and certification programs.

“If you have the knack for solving complex and innovative problems with the right mix of startup energy, Pepperfry is the place for you,” concluded Arya.

Click here to apply.

Read more: Data Science Hiring Process at Livspace

The post Data Science Hiring Process at Pepperfry appeared first on Analytics India Magazine.

Why LinkedIn’s Feed Algorithm Needs a Revamp

Spam can make or mar our social media experience. A study by Nexgate found that spam on social networking sites was growing at a faster rate than comments. Against this backdrop, LinkedIn last week released a blog titled ‘Viral spam content detection at LinkedIn’, in which the platform claimed that it uses proactive and reactive defences to combat spam and malicious content.

The platform’s proactive defences consist of two types of classifiers that analyse specific spam categories and content types, using deep neural networks trained with TensorFlow. LinkedIn uses these techniques to detect early signals of potential spam content and takes appropriate actions such as filtering or conducting a manual human review. On the other hand, reactive defences employ a combination of predictive machine learning and heuristics, analysing member behaviour, content features, and interaction patterns to predict the probability of spam appearing as viral content.
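LinkedIn's classifiers are proprietary deep models, but the heuristic side of a reactive defence can be illustrated with a toy scorer. Every signal, field name, and weight below is invented for illustration and has nothing to do with LinkedIn's actual system:

```python
def spam_score(post):
    """Toy heuristic spam score in [0, 1]. All signals and weights
    here are made up; a real system would learn them from data."""
    signals = {
        "many_links": post["link_count"] > 3,
        "repeat_posting": post["posts_last_hour"] > 10,
        "flagged_phrases": any(p in post["text"].lower()
                               for p in ("buy now", "click here")),
        "new_account": post["account_age_days"] < 2,
    }
    weights = {"many_links": 0.3, "repeat_posting": 0.3,
               "flagged_phrases": 0.25, "new_account": 0.15}
    # sum the weights of the signals that fired
    return sum(weights[name] for name, fired in signals.items() if fired)

post = {"link_count": 5, "posts_last_hour": 12,
        "text": "Click here for a free course!", "account_age_days": 400}
print(spam_score(post))  # ≈ 0.85: likely filtered or queued for human review
```

In practice such hand-written heuristics mainly act as fast pre-filters in front of the trained classifiers described above.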

Your spam, not my spam

For everyday users, spam content is as good as ‘unwanted’ content. However, Analytics India Magazine has learnt that LinkedIn defines spam as regular content that does not comply with LinkedIn’s policy. The platform’s spam detection algorithm does not flag the ‘unwanted’ content that pops up on the feed, and any content that is original cannot be considered spam on the site.

However, LinkedIn is filled with content that users did not want in the first place. The reason is that while the platform transitioned from a pure networking platform to a content consumption one, it did not adapt its feed accordingly. When a user ‘connects’ with someone on the platform, they automatically follow that person, which makes it difficult to curate the feed afterwards.

And unless a user really wants to remain private on the platform and is strict about only connecting with people they actually know, there tend to be many people in the network whom the user has never met.

Hence, every time a connection likes, comments on, or shares content, it ends up on the user’s feed, and at times it is spam. But the LinkedIn algorithm considers this ‘original’ content.

Problem of attribution

This brings us to a somewhat personal issue that Analytics India Magazine faced. LinkedIn claims that the platform is not designed for virality, but posts that generate significant engagement in the form of likes, reactions, comments, and reshares in a short period of time can occasionally be considered viral. This is along the lines of what happened with one of our articles.

A few weeks ago, an article by us went viral across the country, garnering thousands of reactions and catching the attention of virtually all media outlets. However, we encountered a complication as the article was shared on other social media platforms without being properly attributed to us. On LinkedIn, for example, hundreds of users shared the article, and although some of those posts generated three to four times the engagement of the original post, our authorship was never acknowledged.

However, our source at LinkedIn responded to this dilemma, saying, “Linkedin cannot completely remove the copied content from the site because these will not be detected as spam.” Attributing the ownership of an article to a particular user is more difficult than it seems.

Now, with the rise of LLMs, it becomes even harder to identify the original source of content. (By comparison, YouTube has for years had an automatic mechanism preventing users from uploading copyrighted videos.)

What LinkedIn can Learn from Reddit

Owing to its downvote feature, Reddit has an excellent feed when it comes to written content. Users have a degree of control over the type of content they want to read, similar to Quora, another platform with a rather good recommendation algorithm for its feed.

This ingredient of community suggestion is exactly what LinkedIn is missing. Reddit’s secret is that it lets you downvote content and reactions so that users can identify those that are at opposite ends of the community’s interest, those that are seen as great contributions, and of course, those that appear to be more controversial. It often makes the participation more interesting than the content itself.
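One well-known way to rank content by up- and down-votes, reportedly used in Reddit's comment sorting, is the lower bound of the Wilson score interval. A minimal sketch:

```python
import math

def wilson_lower_bound(upvotes, downvotes, z=1.96):
    """Lower bound of the Wilson score interval for the true upvote
    fraction. Small samples are pulled toward 0, so an item with
    1 up / 0 down does not outrank one with 90 up / 10 down."""
    n = upvotes + downvotes
    if n == 0:
        return 0.0
    p = upvotes / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / denom

print(wilson_lower_bound(1, 0))    # perfect score but a tiny sample
print(wilson_lower_bound(90, 10))  # ranks higher despite some downvotes
```

The lower bound deliberately penalises small samples, which is exactly what makes it a robust ranking key for community-voted feeds.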

Even Twitter and Facebook have been struggling with their recommendation algorithm for a couple of months now. Users have reported seeing vulgar content on their feeds. However, these platforms were quick to react and correct this, something that LinkedIn is yet to achieve.

The post Why LinkedIn’s Feed Algorithm Needs a Revamp appeared first on Analytics India Magazine.