Meta-Qualcomm Partnership will Bring Llama 2 to the Masses

The conversation around the use cases of LLMs has been growing rapidly, supercharged by the release of Llama 2, Meta’s new open source model. Even though Meta’s Llama has been released to the public, there is still a huge barrier to entry when it comes to running it on local hardware. To remedy this issue and truly open up the power of Llama 2 to everyone, Meta has partnered with Qualcomm to let the chipmaker optimise the model to run on-device, powered by its chips’ AI capabilities.

Industry experts are predicting that open LLMs could create a new generation of AI-powered content generation, smart assistants, productivity applications, and more. Adding the capability to natively run LLMs on-device is sure to create a strong new ecosystem of AI-powered applications, akin to the app store explosion that happened with iPhones.

This move will not only democratise access to the model, but also unlock a host of possibilities for on-device AI processing. It also comes at a time when the consumer hardware and software industries are waking up to the possibility of AI capabilities at the edge. First spearheaded by Apple’s inclusion of a neural engine in the M1 chip, the addition of a new type of processor to personal computers will finally give developers the tools to create truly democratic AI.

What the Qualcomm-Meta partnership entails

For context, Qualcomm is currently creating a new set of AI-enabled chips under the Snapdragon platform. Using what it calls the Hexagon processor, the chipmaker equips its chips with various AI capabilities. Using an approach called micro tile inferencing, Qualcomm is able to integrate tensor cores and dedicated processing for SegNet, scalar, and vector workloads into an AI processor, which is then integrated into a Snapdragon mobile chip.

As part of its partnership with Meta, Qualcomm will make Llama 2 implementations available on-device, harnessing the capabilities of the new AI-enabled Snapdragon chips. Since the model will be running on-device, developers can not only cut down on cloud computing costs for their applications, but also bring a higher degree of privacy to users, as no data is in transit to servers off-device.

Running the models on the device also brings the additional benefit of being able to use generative AI without a connection to the Internet. Moreover, the models can be personalised to the user’s preferences, as they ‘live’ on the device. Llama 2 will also fit neatly into the Qualcomm AI Stack, a set of developer tools made to further optimise running AI models on-device.

“We applaud Meta’s approach to open and responsible AI and are committed to driving innovation and reducing barriers-to-entry for developers of any size by bringing generative AI on-device,” said Durga Malladi, Qualcomm’s senior vice president and general manager of technology, planning and edge solutions businesses.

Qualcomm has also worked closely with Meta in the past, mainly to make chips for its Oculus Quest VR headsets. The company has also tied up with Microsoft to help scale on-device AI workloads. As part of a partnership with Qualcomm and other chipmakers like Intel, AMD, and NVIDIA, Microsoft introduced the new Hybrid AI Loop toolkit to support AI development at the edge. Taking a zoomed-out look at the edge AI hardware and software ecosystem, it is clear that the industry is moving towards AI at the edge, and Llama 2 might have a bigger role to play than anyone thinks.

Setting the open source world on fire

It seems that Meta has learnt a lot from the leak of the first LLaMA model. While the first iteration of this LLM was only available to researchers and academic institutions, the model and its weights were leaked on the Internet through 4chan. This resulted in an explosion of open source LLM innovation using LLaMA as the base model.

Within just a month of its launch, the open source community had already bettered LLaMA in every way possible. Researchers at Stanford University created a version of LLaMA that could be trained at a cost of around $600, which then led to the development of many other faster and lighter versions. Most, if not all, of these versions could be run on-device, giving the world access to their very own LLMs.

One developer ported the model to C++, which resulted in a version that could be run on a phone. The project, dubbed llama.cpp, was fueled by the open source community, whose contributions included quantising the model’s weights. This innovation allowed it to run on a Google Pixel 5, albeit generating only 1 token per second.

As part of the latest partnership with Meta, Qualcomm could receive information about the inner workings of the model. This would enable the chipmaker to bake in certain optimisations, allowing Llama 2 to run better than other models. Considering the 2024 release window, it is also likely that Qualcomm will explore other partnerships to coincide with the launch of its Snapdragon 8 Gen 3 chip.

The open source community is also sure to contribute its fair share to the (almost) completely open Llama 2. When combined with huge industry momentum for on-device AI, this move is the first of many to support a vibrant on-device AI ecosystem.

Forget PIP, Conda, and requirements.txt! Use Poetry Instead And Thank Me Later

Library A requires Python 3.6. Library B relies on Library A but needs Python 3.9, and Library C depends on Library B but requires the specific version of Library A that is compatible with Python 3.6.

Welcome to dependency hell!

Since native Python is rubbish without external packages for data science, data scientists can often find themselves trapped in catch-22 dependency situations like the one above.

Tools like PIP, Conda, or the laughable requirements.txt files can’t solve this problem. Actually, dependency nightmares exist largely because of them. So, to end their suffering, the Python open-source community developed the charming tool known as Poetry.

Poetry is an all-in-one project and dependency management framework with over 25k stars on GitHub. This article will introduce Poetry and list the problems it solves for data scientists.

Let’s get started.

Installation

While Poetry can be installed as a library with PIP, it is recommended to install it system-wide so you can call poetry on the CLI anywhere you like. Here is the command that runs the installation script for Unix-like systems, including Windows WSL2:

curl -sSL https://install.python-poetry.org | python3 -

If, for some weird reason, you use Windows PowerShell, here is the suitable command:

(Invoke-WebRequest -Uri https://install.python-poetry.org -UseBasicParsing).Content | py -

To check if Poetry is installed correctly, you can run:

$ poetry --version
Poetry (version 1.5.1)

Poetry also supports tab completion for a variety of shells like Bash, Fish, Zsh, etc. Learn more about it here.
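
For example, on Bash you can append Poetry’s completion script to your completion file with the command below (a sketch based on the Poetry docs; the target path may differ on your setup):

$ poetry completions bash >> ~/.bash_completion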

1. Consistent structure for all projects

Since Poetry is an all-in-one tool, you can use it from the start to the very end of your project.

When starting a fresh project, you can run poetry new project_name. It will create a default directory structure that is almost ready to build and publish to PyPI as a Python package:

$ poetry new binary_classification
Created package binary_classification in binary_classification

$ ls binary_classification
README.md  binary_classification  pyproject.toml  tests

$ tree binary_classification/
binary_classification
├── pyproject.toml
├── README.md
├── binary_classification
│   └── __init__.py
└── tests
    └── __init__.py

But we, data scientists, rarely create Python packages, so it is recommended to create the project directory yourself and call poetry init inside it:

$ mkdir binary_classification
$ poetry init

The CLI will ask you a series of questions for setup, but you can leave most of them blank, as they can be updated later.

The init command will produce the most critical file of Poetry — pyproject.toml. The file contains some project metadata, but most importantly, it lists the dependencies:

$ cat pyproject.toml
[tool.poetry]
name = "binary-classification"
version = "0.1.0"
description = "A binary classification project with scikit-learn."
authors = ["Bex Tuychiev "]
readme = "README.md"
packages = [{include = "binary_classification"}]

[tool.poetry.dependencies]
python = "^3.9"

[build-system]
requires = ["poetry-core"]
build-backend = "poetry.core.masonry.api"

Right now, the only dependency under tool.poetry.dependencies is Python 3.9 (we will learn what ^ is later). Let's populate it with more libraries.

If you want to learn what all the fields in pyproject.toml file do, jump over here.

2. Dependency specification

To install dependencies for your project, you will no longer have to use PIP or Conda, at least not directly. Instead, you will start using poetry add library_name commands.

Here is an example:

$ poetry add scikit-learn@latest

Adding the @latest suffix installs the most recent version of Sklearn from PyPI. It is also possible to add multiple dependencies without any constraints:

$ poetry add requests pandas numpy plotly seaborn

The beauty of add is that if the specified packages don't have any version constraints, it will find the versions of all the packages that resolve, i.e., don't throw any errors when installed together. It will also check them against the dependencies already specified in pyproject.toml.

$ cat pyproject.toml
[tool.poetry]
...

[tool.poetry.dependencies]
python = "^3.9"
numpy = "^1.25.0"
scikit-learn = "^1.2.2"
requests = "^2.31.0"
pandas = "^2.0.2"
plotly = "^5.15.0"
seaborn = "^0.12.2"
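
To see how the resolved packages depend on one another, you can inspect the dependency tree with poetry show --tree. Here is an illustrative, truncated sketch of what it might print for this project:

$ poetry show --tree
seaborn 0.12.2 Statistical data visualization
├── matplotlib >=3.1,<3.6.1 || >3.6.1
├── numpy >=1.17,<1.24.0 || >1.24.0
└── pandas >=0.25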

Let’s try downgrading numpy to v1.24 and see what happens:

$ poetry add numpy==1.24

...
Because seaborn (0.12.2) depends on numpy (>=1.17,<1.24.0 || >1.24.0) ...
version solving failed.

Poetry won’t let it happen because the downgraded version would conflict with Seaborn. If this were PIP or Conda, they would gladly install NumPy 1.24 and grin back at us as the nightmare started.

In addition to standard installations, Poetry provides a versatile syntax for defining version constraints. This syntax allows you to specify exact versions, set boundaries for version ranges (greater than, less than, or in between), and pin down major, minor, or patch versions. The following tables, taken from the Poetry documentation (MIT License), serve as examples.

Caret requirements:

Requirement   Versions allowed
^1.2.3        >=1.2.3 <2.0.0
^1.2          >=1.2.0 <2.0.0
^1            >=1.0.0 <2.0.0
^0.2.3        >=0.2.3 <0.3.0
^0.0.3        >=0.0.3 <0.0.4
^0.0          >=0.0.0 <0.1.0
^0            >=0.0.0 <1.0.0

Tilde requirements:

Requirement   Versions allowed
~1.2.3        >=1.2.3 <1.3.0
~1.2          >=1.2.0 <1.3.0
~1            >=1.0.0 <2.0.0

Wildcard requirements:

Requirement   Versions allowed
*             >=0.0.0
1.*           >=1.0.0 <2.0.0
1.2.*         >=1.2.0 <1.3.0
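
These constraints can also be passed straight to poetry add on the command line. A few illustrative invocations (package names and versions arbitrary):

$ poetry add pandas@^2.0          # caret: >=2.0.0 <3.0.0
$ poetry add scikit-learn@~1.2.2  # tilde: >=1.2.2 <1.3.0
$ poetry add "numpy>=1.24,<1.26"  # explicit range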

For even more advanced constraint specifications, visit this page of the Poetry docs.

3. Environment management

One of the core features of Poetry is isolating the project environment from the global namespace in the most efficient way possible.

When you run the poetry add library command, here is what happens:

  1. If you initialized Poetry inside an existing project with a virtual environment already activated, the library will be installed into that environment (it can be any environment manager like Conda, venv, etc.).
  2. If you created a blank project with poetry new or initialized Poetry with init when no virtual environment is activated, Poetry will create a new virtual environment for you.

When case 2 happens, the resulting environment will be under the /home/user/.cache/pypoetry/virtualenvs/ folder. The Python executable will be in there somewhere as well.
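
To find the exact location of the current project's environment (and its interpreter), you can run the following; the path shown is illustrative:

$ poetry env info --path
/home/user/.cache/pypoetry/virtualenvs/binary_classification-O3eWbxRl-py3.9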

To see which Poetry-created env is active, you can run poetry env list:

$ poetry env list

test-O3eWbxRl-py3.6
binary_classification-O3eWbxRl-py3.9 (Activated)

To switch between Poetry-created environments, you can run the poetry env use command:

$ poetry env use other_env

You can learn more about environment management from here.

4. Fully reproducible projects

When you run the add command, Poetry will generate a poetry.lock file. Rather than specifying version constraints, like 1.2.*, it will lock the exact versions of libraries you are using, like 1.2.11. All subsequent runs of poetry add or poetry update will modify the lock file to reflect the changes.

Using such lock files ensures that people who are using your project can fully reproduce the environment on their machines.
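
For example, a collaborator who clones your project (repository URL hypothetical) needs a single command to recreate the environment, because poetry install reads poetry.lock and installs the exact pinned versions:

$ git clone https://github.com/user/binary_classification.git
$ cd binary_classification
$ poetry install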

People have long used alternatives like requirements.txt, but its format is very loose and error-prone. A typical human-written requirements.txt is rarely thorough, as developers don't usually bother listing the exact library versions they are using; they just state version ranges or, worse, simply write the library name.

Then, when others try to reproduce the environment with pip install -r requirements.txt, PIP itself tries to resolve the version constraints, and that's how you quietly end up in dependency hell.

When using Poetry and lock files, none of that happens. So, if you are initializing Poetry in a project with requirements.txt already present, you can add the dependencies inside with:

$ poetry add `cat requirements.txt`

and delete the requirements.txt.

But, please note that some services like Streamlit or Heroku still require old requirements.txt files for deployment. When using those, you can export your poetry.lock file to a text format with:

$ poetry export --output requirements.txt

The workflow to follow

I want to leave the article with a step-by-step workflow to integrate Poetry into any data project.

Step 0: Install Poetry for your system.

Step 1: Create a new project with mkdir and call poetry init inside to initialize Poetry. If you want to convert your project into a Python package later, create the project with poetry new project_name.

Step 2: Install and add dependencies with poetry add lib_name. It is also possible to manually edit pyproject.toml and add the dependencies under the [tool.poetry.dependencies] section. In this case, you have to run poetry install to resolve the version constraints and install the libraries.
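
For instance, a hand-edited dependency section might look like the sketch below (versions illustrative), after which poetry install resolves and installs everything:

[tool.poetry.dependencies]
python = "^3.9"
pandas = "^2.0"
scikit-learn = "^1.2"

$ poetry install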

After this step, Poetry creates a virtual environment for the project and generates a poetry.lock file.

Step 3: Initialize Git and other tools such as DVC and start tracking the appropriate files. Put pyproject.toml and poetry.lock files under Git.

Step 4: Develop your code and models. To run Python scripts, you must use poetry run python script.py so that Poetry's virtual environment is used.

Step 5: Test your code and make any necessary adjustments. Iterate on your data analysis or machine learning algorithms, experiment with different techniques, and refine your code as needed.

Optional steps:

  1. To update already-installed dependencies, use the poetry update library command. update only works within the constraints inside pyproject.toml, so check the caveats here.
  2. If you are starting from a project with requirements.txt, use poetry add `cat requirements.txt` to automatically add and install the dependencies.
  3. If you want to export your poetry.lock file, you can use poetry export --output requirements.txt.
  4. If you chose a package structure for your project (poetry new), you can build the package with poetry build, and it will be ready to push to PyPI.
  5. Switch between virtual environments with poetry env use other_env.

With these steps, you will ensure that you are never in dependency hell again.

Thank you for reading!

Bex Tuychiev is a Top 10 AI writer on Medium and a Kaggle Master with over 15k followers. He loves writing detailed guides, tutorials, and notebooks on complex data science and machine learning topics with a bit of a sarcastic style.

Original. Reposted with permission.

Quantum Computing Will Make GenAI More Advanced

While generative AI is capturing everyone’s imagination, quantum computing is seen as a technology that could have a major impact in the coming years. However, there is a point where the two technologies converge.

Generative AI is an exciting technology; however, it doesn’t come cheap. It cost OpenAI millions of dollars to make the technology accessible to all. The costs encompass both the expenses involved in training Large Language Models (LLMs) like the GPT models (which power ChatGPT) and the ongoing costs of running these models to respond to user queries. Training and running generative AI models require substantial computational resources, making them compute-intensive, financially burdensome, and environmentally costly.

“Future quantum computers and quantum-inspired techniques promise to address these challenges and make Generative AI more accessible, efficient, and advanced. The increased processing power of Quantum Computers can enable faster computations than the classical computer by harnessing the principles of quantum mechanics. This may allow for faster processing of large-scale generative models, enabling more complex and realistic outputs within shorter timeframes,” Aan S. Chauhan, chief technology officer, LTIMindtree, told AIM.

Moreover, efficient data processing through quantum computing can help process larger datasets and more efficiently uncover the patterns and anomalies within the data. “Quantum computers can also solve optimisation problems in generative AI models more efficiently than classical computers, resulting in improved performance and generation capabilities. Additionally, it can add value through improved sampling techniques, generating solutions beyond image and text, etc.”

However, Chauhan does stress that we’re still in the early days of quantum computing and the full potential of how much it can advance generative AI will depend on advancements in hardware, algorithms, and further research in the field.

Quantum computing research at LTIMindtree

LTIMindtree is presently engaged in significant research in quantum computing. “Our research effort primarily aims to perform applied research to unearth the potential benefits Quantum technology could yield in the short, medium, and long term for all the industries we cater to.”

As the technology matures from the current Noisy Intermediate-Scale Quantum (NISQ) phase towards the era of Fault-tolerant Quantum Computing, the IT services company’s research endeavours are centred on extracting incremental value from this evolving technology. “To accomplish this, we have deployed a multi-skilled team of physicists and software engineers actively developing Quantum Computing use cases catering to sectors like banking, financial services, insurance, manufacturing, logistics, etc. Our initial focus for these use cases is Quantum Optimisation, Quantum Machine Learning, and Quantum Simulation.”

Moreover, the advent of Quantum Computing is paving the way for novel forms of cyber-attacks, including the concept of ‘harvest now, decrypt later’, according to Chauhan. This implies that cyber attackers could store encrypted data today and decrypt it later using a sufficiently powerful Quantum computer. “Recognising this trend, we’re actively developing capabilities in solutions for Post-Quantum Cryptography and Quantum Key Distribution (QKD) to help enterprises mitigate these risks.”

How can Quantum Computing benefit Indian IT?

The Indian IT sector has been crucial in spearheading global digital transformation over the past few decades. According to Chauhan, the sector has acquired extensive domain knowledge across various industry verticals by leading this transformation. “This unique combination of experience in building complex IT systems and deep domain understanding positions the Indian IT sector favourably to implement research findings of the emerging Quantum technology into practical applications.”

The opportunity is enormous for IT service providers to introduce innovative new products and offerings around quantum technologies. One such avenue is leading the development of quantum-classical hybrid workflows, where integrating quantum computing with conventional computing allows for the extraction of incremental value. “For example, conventional High-Performance Computing (HPC) systems can benefit from quantum-inspired algorithms for optimisation tasks, amplifying their computational capabilities.”

Another avenue is post-quantum cryptography, where Indian IT companies can create encryption/security offerings to meet the rising demand for advanced data protection. Beyond these examples, there could be many novel ideas around quantum machine learning, simulation, etc.

Moreover, the IT industry has the potential to make a noteworthy contribution to developing a highly influential quantum-ready workforce through partnerships with research institutes and universities, reskilling initiatives, and targeted training programmes.

“By collaborating with leading research institutes, organisations can ensure their position at the forefront of harnessing this evolving cutting-edge technology to address real-world business challenges as the technology matures effectively.”

India’s quantum computing ambitions

Earlier this year, the Indian government greenlit the National Quantum Mission (NQM) with a budget of INR 6003.65 crore spanning from 2023-24 to 2030-31. The mission’s objective is to foster scientific and industrial R&D, nurture growth in Quantum technology, and create a dynamic and innovative Quantum technology ecosystem.

Chauhan said quantum technologies offer immense potential in addressing complex problems spanning societal, industrial, and national security interests. Achieving self-reliance in this domain is of utmost priority, and the National Quantum Mission is a step in this direction.

“By facilitating the development of intermediate-scale quantum computers, establishing secure quantum communications networks, and advancing research in quantum materials and devices, the mission paves the way for creating ground-breaking applications across sectors, including communication, health, financial services, energy, security, etc.

“Additionally, the planned four thematic hubs, Quantum Computing, Quantum Communication, Quantum Sensing & Metrology, and Quantum Materials & Devices, will be platforms for collaborative and consolidated research efforts between leading research institutes, academia, and industries. These efforts will drive faster innovation toward addressing critical challenges and developing ground-breaking quantum technologies.”

WNS Announces Revenue of $326.5 mn in Q1 24, up 10.5% From Q1 22

WNS Limited, a leading provider of global Business Process Management (BPM) solutions, today announced revenue of USD 326.5 million in the first quarter of FY 2024, a 10.5% increase from USD 295.3 million reported during the same period last year.

Profit in the fiscal first quarter was USD 30.1 million, as compared to USD 33.1 million in Q1 of last year and USD 36.4 million in the previous quarter.

Year-over-year, profit decreased as a result of wage increases, increased return-to-office costs, higher share-based compensation expense, and increased costs associated with our acquisitions including amortization of intangibles, interest expense, and other acquisition-related expenses, according to the company.

“In the fiscal first quarter, WNS continued to deliver healthy financial results and position our business for long-term success. Despite the challenging macro environment, WNS grew constant currency revenue less repair payments* by more than 17% and maintained our industry-leading adjusted operating margins*.

“Our updated guidance and visibility demonstrate the healthy and resilient nature of our business, and we believe WNS remains well-positioned to meet the evolving needs of our clients. This includes ongoing technology and automation advancements such as AI and generative AI. The company remains focused on investing in domain, technology, and talent, driving strong operational and financial execution, and delivering long-term sustainable value for all of our stakeholders,” said Keshav Murugesh, chief executive at WNS.

A Beginner’s Guide to Data Engineering

With the influx of huge amounts of data from a multitude of sources, data engineering has become essential to the data ecosystem. And organizations are looking to build and expand their teams of data engineers.

Some data roles such as that of an analyst do not necessarily require prior experience in the field so long as you have strong SQL and programming skills. To break into data engineering, however, previous experience in data analytics or software engineering is generally helpful.

So if you’re looking to pursue a career in data engineering, this guide is for you to:

  • learn more about data engineering and the role of a data engineer, and
  • gain familiarity with the essential data engineering concepts.

What Is Data Engineering?

Before we discuss what data engineering is all about, it's helpful to review the need for data engineering. If you have been in the data space for a while, you'll be skilled at querying relational databases with SQL and NoSQL databases with SQL-like languages.

But how did the data get there—ready for further analysis and reporting? Enter data engineering.

We know that data comes from various sources and in several forms: from legacy databases to user conversations and IoT devices. The raw data has to be pulled into a data repository. To expand: data from the various sources should be extracted and processed before being made available in ready-to-use form in data repositories.

Data engineering encompasses the set of all processes that collect and integrate raw data from various sources—into a unified and accessible data repository—that can be used for analytics and other applications.

What Does a Data Engineer Do?

Understanding what data engineering is should have helped you guess what data engineers do on a day-to-day basis. The responsibilities of a data engineer include, but are not limited to, the following:

  • Extracting and integrating data from a variety of sources—data collection.
  • Preparing the data for analysis: processing the data by applying suitable transformations for analysis and other downstream tasks. This includes cleaning, validating, and transforming data.
  • Designing, building, and maintaining data pipelines that encompass the flow of data from source to destination.
  • Designing, building, and maintaining the infrastructure for data collection, processing, and storage—infrastructure management.

Data Engineering Concepts

Now that we understand the importance of data engineering and the role of data engineers in an organization, it's time to review some fundamental concepts.

Data Sources and Types

As mentioned, we have incoming data from sources across the spectrum: from relational databases and web scraping to news feeds and user chats. The data coming from these sources can be classified into one of three broad categories:

  • Structured data
  • Semi-structured data
  • Unstructured data

Here’s an overview:

  • Structured data: has a well-defined schema. Examples: data in relational databases, spreadsheets, and the like.
  • Semi-structured data: has some structure but no rigid schema; typically carries metadata tags that provide additional information. Examples: JSON and XML data, emails, zip files, and more.
  • Unstructured data: lacks a well-defined schema. Examples: images, videos and other multimedia files, website data.

Data Repositories: Data Warehouses, Data Lakes, and Data Marts

The raw data collected from various sources should be staged in a suitable repository. You should already be familiar with databases—both relational and non-relational. But there are other data repositories, too.

Before we go over them, it'll help to learn about two data processing systems, namely, OLTP and OLAP systems:

  • OLTP or Online Transactional Processing systems are used to store day-to-day operational data for applications such as inventory management. OLTP systems include relational databases that store data that can be used for analysis and deriving business insights.
  • OLAP or Online Analytical Processing systems are used to store large volumes of historical data for carrying out complex analytics. In addition to databases, OLAP systems also include data warehouses and data lakes (more on this shortly).

The choice of data repository is often determined by the source and type of data. Let’s go over the common data repositories:

  • Data warehouses: A data warehouse is a single, comprehensive storehouse of incoming data.
  • Data lakes: Data lakes let you store all data types—including semi-structured and unstructured data—in their raw format without processing them. Data lakes are often the destination for ELT processes (which we’ll discuss shortly).
  • Data marts: You can think of a data mart as a smaller subsection of a data warehouse, tailored to a specific business use case.
  • Data lakehouses: Recently, data lakehouses have also become popular, as they offer the flexibility of data lakes along with the structure and organization of data warehouses.

Data Pipelines: ETL and ELT Processes

Data pipelines encompass the journey of data—from source to the destination systems—through ETL and ELT processes.

The ETL—Extract, Transform, and Load—process includes the following steps:

  • Extract data from various sources
  • Transform the data—clean, validate, and standardize data
  • Load the data into a data repository or a destination application

ETL processes often have a data warehouse as the destination.

ELT—Extract, Load, and Transform—is a variation of ETL in which the steps are reordered: extract, load, and then transform.

Meaning the raw data collected from the source is loaded into the data repository before any transformations are applied. This allows us to apply transformations specific to a particular application. ELT processes have data lakes as their destination.
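
To make the ETL steps concrete, here is a minimal sketch in Python using pandas and SQLite. It is illustrative only: the file name, column names, and cleaning rules are assumptions, not a prescribed pipeline.

import sqlite3

import pandas as pd


def extract(path: str) -> pd.DataFrame:
    # Extract: read raw data from a CSV source.
    return pd.read_csv(path)


def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Transform: clean, validate, and standardize the raw data.
    df = df.drop_duplicates()
    df = df.dropna(subset=["customer_id"])  # validate: every row needs a key
    df["signup_date"] = pd.to_datetime(df["signup_date"])  # standardize types
    return df


def load(df: pd.DataFrame, db_path: str, table: str) -> None:
    # Load: write the cleaned data into the destination store.
    with sqlite3.connect(db_path) as conn:
        df.to_sql(table, conn, if_exists="replace", index=False)


if __name__ == "__main__":
    load(transform(extract("customers.csv")), "warehouse.db", "customers")

An ELT pipeline would reorder the same pieces: load the raw extract into the repository first, then run the transformations inside it.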

Tools Data Engineers Should Know

The list of tools data engineers should know can be overwhelming.

But don’t worry, you do not need to be an expert in all of them to land a job as a data engineer. Before we go ahead with listing the various tools data engineers should know, it’s important to note that data engineering requires a broad set of foundational skills, including the following:

  • Programming languages: Intermediate to advanced proficiency in a programming language, preferably one of Python, Scala, or Java
  • Databases and SQL: A good understanding of database design and the ability to work with both relational databases such as MySQL and PostgreSQL and non-relational databases such as MongoDB
  • Command-line fundamentals: Familiarity with shell scripting and data processing at the command line
  • Knowledge of operating systems and networking
  • Data warehousing fundamentals
  • Fundamentals of distributed systems

Even as you are learning the fundamental skills, be sure to build projects that demonstrate your proficiency. There’s nothing as effective as learning, applying what you’ve learned in a project, and learning more as you work on it!

In addition, data engineering also requires strong software engineering skills, including version control, logging, and application monitoring. You should also know how to use containerization tools like Docker and container orchestration tools like Kubernetes.

Though the actual tools you use may vary depending on your organization, it's helpful to learn:

  • dbt (data build tool) for analytics engineering
  • Apache Spark for big data analysis and distributed data processing
  • Airflow for data pipeline orchestration
  • Fundamentals of cloud computing and working with at least one cloud provider such as AWS or Microsoft Azure.

To learn more about engineering tools including tools for data warehousing and stream processing, read: 10 Modern Data Engineering Tools.

Wrapping Up

Hope you found this introduction to data engineering informative! If designing, building, and maintaining data systems at scale excites you, definitely give data engineering a go.

The Data Engineering Zoomcamp is a great place to start if you are looking for a project-based curriculum to learn data engineering. You can also read through the list of commonly asked data engineer interview questions to get an idea of what you need to know.

Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she's working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more.

Microsoft Now Makes Google Sweat 

Microsoft is literally everywhere. This “Copilot”-obsessed tech giant wants anything and everything to do with generative AI – open source or closed source, small or large, it doesn’t matter – and will go to any extent to bring that capability to life.

While OpenAI’s closed-source approach might have raised concerns among developers, Meta is viewed as the good guy, and Microsoft stands as a neutral guardian, focused solely on investing in technological progress. The ongoing debates between open source and closed source don’t seem to bother Microsoft, as it has ensured its position is well-balanced and prepared for any scenario.

Candidly speaking, Microsoft is not committed to any one particular approach when it comes to generative AI. The company is even backing budding startups that might rival OpenAI’s ChatGPT in the future. Microsoft recently invested in Inflection AI, a startup backed by several Silicon Valley heavyweights, which raised $1.3 billion.

Microsoft is also investing in small LLMs. Recently, Microsoft released another open source LLM, Orca, a highly advanced model with 13 billion parameters designed to imitate the reasoning capabilities of large foundation models (LFMs). It utilises GPT-4 to learn from various signals, including explanation traces, step-by-step thought processes, and complex instructions. With its recent partnership with Meta, its backing of open source research projects like Orca, and its extended partnership with OpenAI, Microsoft is unifying the generative AI ecosystem.

Currently, Microsoft has 252+ open source models on Hugging Face. In comparison, Google has 591 open source model contributions. With the latest partnership, we can expect more open source models from the duo (Microsoft and Meta).

Making Google Sweat

While Microsoft made Google dance with its OpenAI partnership, it is surely making it sweat with the Meta partnership. The fact that Google missed out on Llama 2’s release, which is now available on AWS and Hugging Face, seems intentional. In the competitive cloud market, Microsoft is aiming to provide specialised LLMs for its enterprise clients. Azure OpenAI customers using Microsoft Cloud will have access to top-notch specialised enterprise LLMs that they can seamlessly integrate into their businesses.

When Microsoft partnered with OpenAI, it deployed OpenAI technology through an API and the Azure OpenAI Service, enabling enterprises and developers to build on top of GPT, DALL·E, and Codex. The two also worked together to build OpenAI’s technology into apps like GitHub Copilot and Microsoft Designer.

Microsoft’s decision to partner with Meta is a well-thought-out business strategy. By collaborating with different entities, it avoids depending solely on one supplier. Microsoft is fully aware that the AI field is constantly evolving, and it’s only a matter of time until someone develops an even better LLM than GPT-4. To position itself for future advancements, the company is actively exploring open source opportunities and acting as a bridge between GPT-4 and Llama 2.

Is Apple Next?

Over the years, Microsoft has faced challenges in keeping up with Google, especially in the mobile operating system and internet browsing domains. While Google Chrome and Google Search gained significant popularity over Internet Explorer and Bing, Microsoft attempted to boost Bing’s performance by integrating OpenAI’s chatbot into it. Every move Microsoft makes is driven by the desire to enhance its products and compete more effectively with Google. It further integrated ChatGPT into Bing and launched Microsoft Copilot, bringing generative AI abilities to Word, Excel, PowerPoint, Outlook, and Teams.

In the past as well, Microsoft tried launching Windows smartphones but failed miserably against Android phones. Now, however, it is training small LLMs. Meta, along with Microsoft, has also partnered with Qualcomm, eyeing an entire ecosystem to bring Llama 2 implementations to phones and PCs starting next year. Smaller models like Orca can take full advantage of this. It is worth noting that ChatGPT hasn’t yet been released on the Android Play Store.

In its blog post, Microsoft specifically mentioned that Llama 2 is optimised to run locally on Windows. An open source Llama 2 will help Microsoft pick the cream of developments and implement them. As of now, we can only speculate about the future of Llama 2 and its potential integration with Windows devices.

Microsoft releasing models like Orca and others on edge devices could also be a threat to Apple. But it looks like Apple is ahead: it is currently testing its generative AI chatbot, ‘Apple GPT’.

Singapore releases draft guidelines on personal data use in AI training

Singapore has released draft guidelines on how personal data should be managed when used to train artificial intelligence (AI) models and systems.

The document outlines how the country's Personal Data Protection Act (PDPA) will apply when businesses use personal information to develop and train their AI systems, according to the Personal Data Protection Commission (PDPC), which administers the Act. The guidelines also include best practices in establishing transparency on how AI systems use personal data to make decisions, forecasts, and recommendations.

The guidelines, however, are not legally binding and do not supplement or alter any existing laws. They look at issues and situations, such as how companies may benefit from existing exceptions within the PDPA in the development of machine learning models or systems.

The guidelines also address how organizations can meet requirements involving consent, accountability, and notification when collecting personal data for machine learning AI systems that facilitate predictions, decisions, and recommendations.

The document also cites when it's appropriate for companies to turn to two exceptions, for research and business improvement, without having to seek consent for the use of personal data to train AI models.

Business improvement exceptions might apply when companies develop a product, or have an existing product, that they are looking to improve. This exception might also be relevant when the AI system is used to power decision-making processes that improve operational efficiency or that offer personalized products and services.

For instance, the business improvement exception can be applied for internal human resource recommendations systems that are used to provide a first cut of potential candidates for a role. It might also be applied in the use of AI or machine learning models and systems to provide new features that improve the competitiveness of products and services.

Organizations, though, will have to ensure the business improvement purpose "cannot reasonably" be attained without using personal data in an individually identifiable way.

Under the research exception, organizations are permitted to use personal data to conduct research and development that might not have an immediate application in existing products and services or business operations. This can include joint commercial research work with other companies to develop new AI systems.

Organizations should ensure the research cannot be reasonably accomplished without the use of personal data in an identifiable form. There should also be clear public benefits in using the personal data for research, and the results of the research cannot be used to make decisions that affect the individual. In addition, published results of the research should not identify the individual.

The guidelines also recommend organizations that use personal data for AI systems should conduct a data protection impact assessment, which looks at the effectiveness of risk mitigation and remediation measures applied to the data.

With regards to data protection, organizations should include appropriate technical processes and legal controls when developing, training, and monitoring AI systems that use personal data.

"In the context of developing AI systems, organizations should practise data minimization as good practice," the guidelines state.

"Using only personal data containing attributes required to train and improve the AI system or machine learning model will also reduce unnecessary data protection and cyber risks to the AI system."

The PDPC is seeking public feedback on the draft guidelines, which should be submitted by August 31.

Partnership to test privacy safeguard tools

Singapore has also announced a partnership with Google that enables local businesses to test the use of "privacy enhancing technologies", or PETs, as the government calls them.

Touting these as further tools to help organizations build their datasets, Minister of Communications and Information Josephine Teo said: "PETs allow businesses to extract value from consumer datasets, while ensuring personal data is protected. Through facilitating data sharing, they can also help businesses develop useful data insights and AI systems."

The use of PETs, for example, allows banks to collect data and build AI models for more effective fraud detection, while protecting their customers' identity and financial data, Teo said.

To drive the adoption of PETs, the Infocomm Media Development Authority (IMDA) last year introduced a PET sandbox to offer businesses access to grants and resources to develop such solutions.

The collaboration with Google will allow Singapore organizations to test their Google privacy sandbox applications within the IMDA sandbox. This system provides a secure environment in which companies can use or share data without revealing sensitive information, the PDPC said.

It added that the IMDA and Google sandbox is available to businesses based in Singapore and is designed for adtech, publishers, and developers, among others.

According to Teo, the partnership marks Google's first such collaboration with a regulator in Asia-Pacific to facilitate the testing and adoption of PETs.

Through the initiative, organizations could access a "safe space" to pilot projects using PETs on a platform on which they already operate, she said.

"With the deprecation of third-party cookies, businesses can no longer rely on these to track consumers' behavior through the browser and will need PETs as an alternative," she said. "Consumers will experience being served more relevant content without fearing their personal data is compromised."

AI Will Do a Lot More Good Than Harm

The emergence of generative AI has sparked intense discussions and debates about its potential threat to humanity. Influential figures like Eliezer Yudkowsky have raised concerns about AI going rogue. Not so long ago, Geoffrey Hinton, one of the godfathers of AI, left Google to address the growing dangers in the field. On the contrary, however, Anil Kaul, chief AI officer at Infogain and chief executive of Absolutdata (an Infogain company), is an optimist. He believes AI will do more good than harm.

“I have been in the AI space since 1993-94. At that time, not many people were working on AI. I believe AI is absolutely going to change the world and generative AI has actually been a big shot in that direction,” he told AIM.

However, Kaul emphasises the need for caution. Drawing comparisons between nuclear energy and generative AI, he warns that in the wrong hands, this technology could become a threat. Nonetheless, he believes that AI is designed to augment human capabilities. “Besides, I don’t want to underestimate the intelligence of humans, because a lot of times we tend to think singularity is happening and AI systems will be smarter and more intelligent and I think they will be in certain ways. But humans will be able to control and manage it.”

Generative AI

Kaul heads the AI division at Infogain, a human-centred digital platform and software engineering company, and according to him, Infogain had been leveraging generative AI capabilities well before they became popular following the launch of ChatGPT.

“We were relatively easily able to bring in generative AI into Navik because the rest of the infrastructure was already in place,” Kaul said. Navik AI is a suite of AI products designed for enterprises to help automate their customer value management, improve their marketing campaigns, and offer hyper-personalised experiences to their customers.

“We’ve used Microsoft Azure environment to bring OpenAI models in an enterprise acceptable manner. What we did was replace the NLP model that we had built with OpenAI’s GPT models. So today, there is a generative AI module built into Navik,” Kaul said. However, even though it is leveraging GPT models for now, Infogain is also exploring the potential of other closed and open source models available on the market.

There is interest in genAI but also apprehension

What generative AI has done, compared to conventional AI, is catch everyone’s attention, Kaul said. Infogain’s customers are eager to leverage generative AI capabilities, but they are also a bit apprehensive at the same time.

“Despite considerable interest, there is a prevailing sense of caution and a desire to understand the technology better before committing to any particular approach from an enterprise client perspective,” Kaul continued, “Hence, our recommendation to clients is to proceed systematically and thoughtfully. Avoid rushing into decisions due to the unknowns, but don’t stay out of the AI landscape. Embrace AI and generative AI to stay competitive in a market with numerous startups vying to replace those who don’t adapt.”

Building proprietary generative AI capabilities

One of the major causes of apprehension among enterprises for leveraging generative AI is security risks. Kaul believes ChatGPT did a lot of good things, like bringing generative AI to the limelight. “But it also did one bad thing, which is it scared the enterprise folks because it was actually never built for the enterprise,” Kaul said.

Hence, one way to minimise the security risks is to leverage generative AI capabilities through the hyperscalers, whether it’s Microsoft, AWS, or Google. “By leveraging generative AI through their system, many enterprise security issues are already addressed, providing us with a relatively secure environment.”

Another way for enterprises to tackle security issues is to build their own generative AI models, Kaul said. This eliminates the risks associated with LLM APIs. Moreover, in the coming years, almost everybody is going to be using the same generative AI models, according to Kaul.

“If three years from now, everybody is going to be using the same generative AI models, how are we going to be better from the others? Our advantage lies in combining generative AI with our superior code, empowering our developers to produce higher-quality code than others. This competitive edge necessitates creating unique intellectual property internally, possibly by refining open-source models and building our own solutions.”

Navik AI

Through Navik, Infogain is helping enterprises leverage AI and help them make better business decisions. What Navik does is create a recommendation on what actions to take. Kaul says Navik’s design allows not just analysts but the business team themselves to utilise its capabilities effectively.

Navik AI helps businesses make data-driven decisions, and can be utilised by sales, marketing, technology, and operations leaders. It blends AI, data, and analytics to serve as an intelligence layer for forward-thinking businesses.

“With Navik, we have helped the sales team of a US-based beverage company to decide which particular drink or beverage should be put in a particular restaurant.” Similarly, for one of its insurance clients, AI is employed for accident assessments. Customers simply send a picture of the damage, and AI determines whether it’s significant or minor. “For substantial damages, a physical inspection occurs, while smaller claims are assessed and settled based on the AI’s analysis of the image, estimating repair costs and more.

“Furthermore, for another client, Navik AI performs volume forecasts for 500,000 restaurants across 200 brands weekly. This extensive forecasting process is supported by a deep learning model and a neural network that generates precise forecasts,” Kaul concluded.

Why Apple will Build the Best Chatbot

Last year, after the release of ChatGPT, when every big tech company was frantically trying to adopt or build LLM-based chatbots, Apple decided to ban the use of ChatGPT internally, citing privacy concerns. It even went so far as to halt plans to build an LLM-based chatbot.

However, according to recent reports, the tech giant had a change of heart and has developed an internal chatbot—nicknamed “Apple GPT” by its employees. Though the tech giant has not yet decided how to release it to the public, the company is planning to make a significant AI-related announcement next year.

To ensure that the chatbot is better than the others in the market, Apple created its own framework called Ajax to build its LLM-based chatbot, similar to OpenAI’s ChatGPT and Google’s Bard. This framework runs on Google Cloud and was built using Google JAX, the search giant’s machine learning framework.

Apple will build a league of its own

Apple’s ecosystem and its dedicated consumers are probably the biggest advantage the company has over its competitors. Apple has an integrated ecosystem, which gives it a significant opportunity to leverage its M1 and M2 capabilities and develop private and personalised LLMs. It also has a massive developer ecosystem, for which the company recently released a Transformer architecture implementation optimised for Apple Silicon.

Apart from technology, Apple is also known for aesthetics and design thinking principles, which might help the company beat ChatGPT’s highly praised simple UI design.

Whatever the mission is, it is clear that integrating LLM technology on individual devices is a difficult task. The task becomes even more challenging when the company focuses on offering LLM technology while maintaining the privacy and security of users. Tim Cook has emphasised that the company wants to incorporate AI into its offerings thoughtfully and responsibly.

This is probably the reason why the chatbot is not yet able to produce the right outputs. According to anonymous Apple employees, the company has directed that output from the new chatbot still cannot be used to create features for end customers. It seems like Apple does not trust its own technology at the moment.

Apple has started caring about AI

The recent partnership announcement by Meta and Microsoft to release Llama 2 is possibly based on integrating language models for edge use cases — a wake-up call for Apple. To enable this even further, Microsoft has also partnered with Qualcomm, possibly to design chips for Android devices. Given the smaller size of the Llama 2 model, Microsoft might be able to achieve this, something that arguably wouldn’t have been possible with OpenAI’s huge models like GPT-4.

Although the report said that Apple’s chatbot doesn’t offer any additional distinguishing features, it is safe to say that Apple distinguishes itself from competitors in the market. At WWDC 2023, the team had already announced a lot of improvements using machine learning, including on-device Transformer-based auto-fill in keyboards on iOS.

According to a report from March, Apple had held an internal event focusing on AI and LLMs. The participants included the Siri team, which is reportedly testing “language-generating concepts”. Moreover, 9to5Mac reported that Apple has introduced a framework for ‘Siri Natural Language Generation’ in tvOS 16.4. Recent reports also indicate that Apple has been actively seeking talent in generative AI, posting job openings for experts in the field with a strong understanding of large language models and generative AI.

What can Apple do next?

Following all this, the best use case for the integration of this GPT-like technology for Apple would be Siri. Even after so many announcements, Siri hasn’t really upgraded itself since its launch. Developers have been trying to integrate ChatGPT capabilities within Siri. Now, Apple has the chance to do that natively.

Currently, OpenAI has its ChatGPT app only on iOS. If Apple develops its own chatbot that runs natively on its own ecosystem, OpenAI might need to start worrying about its next moves, as Apple can simply drop it from the App Store. Maybe the ChatGPT app was also a wake-up call for Apple to get into LLMs.

Apple was shying away from LLMs, similar to how Meta once was. Now that the Mark Zuckerberg-led company is regarded as one of the top players in the open source market, nothing less can be expected from Apple, even though it is arriving last. Apple has always been very careful with user data and has maintained that it doesn’t want to take any unnecessary risks in this arena.

How RapidAI Can Transform the Face of Neurological Diagnosis in India

“As per statistics, there is only 1 neurosurgeon per 10 lakh people in India,” said Apul Nahata, India head and VP of engineering at RapidAI, in an exclusive interview with AIM at a tech event in Bengaluru.

Prior to joining RapidAI India, Nahata was the CTO of GenNext Ventures, Reliance Industries, and founded multiple startups, including Kalpnik Technologies, TringMe, and others. At RapidAI, he leads expansion, partnerships, and operations, and is scaling its engineering team in the country.

Founded in 2011, RapidAI was started by Greg Albers, co-founder of the Stanford Stroke Centre, and Roland Bammer, co-director of the Stanford 3D Lab. RapidAI works on AI-enhanced neurovascular and vascular clinical decision support and patient workflow, and received FDA approval for its ICH (intracranial haemorrhage) model, making it the first in the world to achieve a specificity of 100%. Integrated in over 2,000 hospitals across 100 countries, with a major footprint in the US, RapidAI is steadily growing. From covering over 5 million scans in April last year, it has crossed 10 million scans today.

Besides its large presence in the US, RapidAI is betting big on India. It has partnered with a few hospitals in India, including in Bengaluru and Mumbai. “In a country such as India, application of AI will bring a radical shift: you are not going to leapfrog but pole vault.”

Talking about an ideal model for future medical technology, Nahata emphasised AI models that can help in taking early decisions to save lives. “If you are able to build AI as part of a workflow upstream, wherein you can diagnose a disease much earlier, it will help doctors take early decisions and change the course of treating patients. This will help bring down the costs of treatment, and for a country such as India, it’s going to be a massive value add that AI can bring. Saving lives, improving lives and doing it at a scalable level is ideal.”

Recently, Google DeepMind launched CoDoC (Complementarity-driven Deferral-to-Clinical Workflow) an AI system that understands when to utilise predictive AI tools and when to seek input from a clinician.

More AI, Less Trauma

One of the most crucial treatments, necessary not only to save lives but also to preserve the quality of life, is for stroke patients. National University Hospital (NUH) in Singapore has employed an AI tool that has been helping stroke patients receive appropriate treatment within an hour of arriving. Since implementing this AI technology in February, which processes brain CT scans in less than a minute, NUH has been able to administer prompt treatment to over 400 stroke patients. RapidAI has made this possible.

“AI can make a massive difference when applied in healthcare and this is one place where you are not only saving lives but ensuring the quality of life post trauma is much better than what it would have been otherwise,” said Nahata.

Being an imaging-heavy company, RapidAI works with data of ‘clinical and technical depth’ and processes it quickly, ‘time being of the essence here.’ Two parameters determine a model’s efficiency: sensitivity, which captures how accurately an algorithm can identify whether a case is positive, and specificity, which reflects whether a case is misdiagnosed. “RapidAI algorithms have both these parameters upwards of 90-95%,” said Nahata.

Scaling Data Training

Having the right datasets for training models in the medical vertical is crucial. “It’s not the volume of data you have, it’s the diversity of data. In order to get a holistic view, it becomes essential to have a diverse range. If you had only one or two datasets for training, the common problem of biases can set in. For instance, Rapid’s algorithm allows one to look at a case of aneurysm in 3D which otherwise looked in 2D will not determine how big and critical it is.”

“One good thing in the medical field is that the data is not going to significantly change between geographies because anatomy is the same barring a few nuances. As opposed to fintech where analysing the spending pattern of consumers in the US vs India will be different, here that problem doesn’t arise,” said Nahata.

Talking about human intervention in data training, Nahata spoke about its limitations. “Several years ago, AI or decision-making systems were more of an ‘if-then’ sort of mechanism. We study a pattern, infer the disease and build rules around it, but when a certain number of rules is exceeded, it becomes difficult for a human to sit through it and make a decision. This is where machines excel, and even with conflicting rules, RapidAI has been able to achieve a sensitivity of 100%.”

Faster AI Adoption

Resistance to adopting new technology is evident in every vertical, and healthcare is no exception; it is lagging significantly.

“At one time even telecom and banking was a walled garden. From a doctor or hospital’s point of view, there is always resistance because technology keeps changing rapidly and the question of returns on the investments made will surface. This is when you need to position yourself by showing the stakeholders the value you can deliver to them which you can capitalise on. Tech is essential as human capital can only scale so much. Even if the cost of tech is high, the amount of time saved and consistency delivered can be the selling point.”
