RAPIDS cuDF Cheat Sheet

RAPIDS cuDF

RAPIDS cuDF is an open-source Python library for GPU-accelerated DataFrames. cuDF provides a Pandas-like API that data engineers, analysts, and data scientists can use to perform data manipulation and analysis on large datasets and time series data, harnessing the power of NVIDIA GPUs for faster processing and analysis.

Getting started with cuDF is straightforward, especially if you have experience with Python and libraries like Pandas. While cuDF and Pandas offer similar APIs for data manipulation, there are specific classes of problems where cuDF can provide significant performance improvements over Pandas, including large-scale datasets, data preprocessing and engineering, real-time analytics, and, of course, parallel processing. The bigger the dataset, the greater the performance benefits.
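To give a sense of the API, here is a minimal sketch of a typical cuDF workflow. The file and column names are hypothetical, and it assumes cuDF is installed (for example, via conda from the rapidsai channel) on a machine with a supported NVIDIA GPU.

import cudf

# Read a CSV straight into GPU memory -- mirrors pandas.read_csv
gdf = cudf.read_csv("transactions.csv")  # hypothetical file name

# Familiar pandas-style operations, executed on the GPU
top = (
    gdf.groupby("customer_id")["amount"]  # hypothetical column names
       .sum()
       .sort_values(ascending=False)
       .head(10)
)

# Convert back to pandas when a CPU-only library needs the result
print(top.to_pandas())

Because the API mirrors Pandas, most existing Pandas code needs only minimal changes to run on the GPU.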

For more on using cuDF for data science, check out our latest cheat sheet.

RAPIDS cuDF Cheat Sheet

This cheat sheet covers the following aspects of RAPIDS cuDF:

  • Installation
  • Reading data
  • Writing data
  • Selecting data
  • Handling missing data
  • Applying functions
  • Processing data
  • and more

Check out the RAPIDS cuDF Cheat Sheet now, and check back soon for more.

More On This Topic

  • RAPIDS cuDF to Speed up Your Next Data Science Workflow
  • RAPIDS cuDF for Accelerated Data Science on Google Colab
  • The ChatGPT Cheat Sheet
  • Streamlit for Machine Learning Cheat Sheet
  • GitHub CLI for Data Science Cheat Sheet
  • Data Cleaning with Python Cheat Sheet

RIP, webcams: Google’s Project Starline is a remote worker’s dream machine

Project Starline

Now, Project Starline has been reduced to the size of a flat-screen TV and stand, more similar to a videoconferencing system.

At Google I/O yesterday, announcements of innovative ways to use artificial intelligence, along with new products, took over the spotlight. But the company's Project Starline, a videoconferencing system that makes the person in your virtual meetings seem like they're right in front of you, has noteworthy improvements of its own to announce.

Also: Logitech's Project Ghost draws on old camera tricks for better video meetings today

Remote work is more mainstream than ever before, and companies like Logitech and Google are looking to reinvent what it looks like. Project Starline uses AI to create a photorealistic model of another person, making it look as if you were talking to someone through a window rather than a computer screen.

Also: Every major AI feature announced at Google I/O 2023

Google has shrunk the Project Starline technology, from the booth used during initial prototypes, to a system roughly the size of a TV with a stand. It achieved this through significant AI advancements that combine the feeds of several standard cameras and sensors to create realistic 3D images from a screen.

Reduced from the size of a large restaurant booth to the size of a videoconferencing system.

The tech giant has already deployed the latest prototype of Project Starline through an early access program, giving companies like T-Mobile, Salesforce, and WeWork a chance to try it. Some of the users' experiences were shared yesterday in a video released by Google.

"The first meeting I had on Starline, I said, 'wow, you've got blue eyes'. And this is the person I've been meeting with for a year," said David Levinson from Salesforce in the video. "Just to see a person in 3D, it was really outstanding."

Also: 4 ways to secure your remote work setup

These videoconferencing techniques aim to make someone who may be miles or oceans away feel like they're right in front of you.

"Project Starline has the potential to help create authentic and immersive connections that foster deeper relationships with both our employees and customers, enhance trust and transparency, and drive productivity and efficiency," Andy White, senior vice president of Business Technology at Salesforce explained.

Also: 8 habits of highly secure remote workers

In earlier versions, Project Starline required infrared light emitters and special cameras to achieve its lifelike images. Assembled, the system occupied the space of a large restaurant booth, which posed difficulties for offices that wanted to adopt the technology but lacked the space for it.


New Google search tool will distinguish real images from AI-generated phonies

Google tool revealing fake AI-generated image

Picture this: You've come across a great image in a Google search that you'd like to use. But something about the image seems off, making your Spidey senses tingle while you wonder whether it's real or fake. Now Google has cooked up a way to determine the authenticity of an image via an upcoming tool called About this Image.

Scheduled to launch first in the U.S. in the coming months, About this Image will be an option available through the three-dot menu that appears for any image in Google's search results, Cory Dunton, product manager for Google Search, said in a blog post.

Also: How to join Google Search Labs waitlist to access its new AI search engine early

Selecting this option will tell you when the image and associated images were first indexed by Google, where it may have first appeared, and where else it has been seen online, such as on social media or fact-checking websites. The goal is to give you the details needed to determine whether the image is genuine or a fake created by AI.

Beyond the option appearing in Google's regular search results, it will pop up in other places. You'll be able to access it when searching an image or screenshot in Google Lens and when swiping up on search results in the Google app. Later in the year, you'll also be able to use it by right-clicking or long-pressing on an image in the Chrome browser on your desktop or mobile device.

Also: How to use Google Bard now

Fake content has been a pervasive problem on the internet, causing people to question whether what they view online is true. The growth of AI threatens to make that problem even stickier. As the tools become more advanced, they become ever more capable of generating images and other content that seem real but are actually fake.

To combat this growing threat, Google and other major players need to find ways to help us determine the validity of what we see online.

Also: I asked ChatGPT, Bing, and Bard what worries them. Google's AI went Terminator on me

As one example given by Dunton in the blog post, About this Image was asked to identify a photo depicting a staged moon landing. The tool cited several news articles confirming that the image was AI-generated and therefore fake.

Google also plans to make this capability work beyond its own search tools and apps. Dunton said that the company will make sure that every one of its own AI-generated images will contain a markup in the original file to give it context if you see it on a different site.

Also: Every hardware product Google just announced at I/O 2023 (and yes, there's a foldable)

Third-party creators and publishers, such as Midjourney and Shutterstock, will also be able to add similar markups to their own AI-created images. When those images then show up in a Google search, a label will identify them as AI-generated.


AMP Robotics attracts investment from Microsoft’s Climate Innovation Fund


AMP Robotics, a Denver, Colorado-based startup creating robotic systems that can automatically sort recyclable material, today announced that it extended its Series C round to $99 million, thanks to an investment from Microsoft’s Climate Innovation Fund. That’s up from $91 million when the round closed in November.

The extended Series C, which saw participation from investors including Congruent Ventures and Wellington Management (which co-led), Blue Earth Capital, Sidewalk Infrastructure Partners, Tao Capital Partners, XN, Sequoia Capital, GV, Range Ventures and Valor Equity Partners, brings AMP’s total raised to around $178 million.

“The capital is helping us scale our operations, including deploying technology solutions to retrofit existing recycling infrastructure and expanding new infrastructure based on our application of AI-powered automation,” founder and CEO Matanya Horowitz told TechCrunch via email.

Horowitz founded AMP in 2014 after earning his Ph.D. from Caltech. While pursuing his doctorate, he says he saw how powerful computer vision was becoming, and began exploring different areas where the technology could be most useful — including recycling.

“After visiting a recycling facility and seeing not only how demanding conditions were, but how challenging of a working environment it could be, I recognized this industry was a compelling opportunity for robotics,” Horowitz said. “The convergence of machine learning and robotics offered compelling opportunities to automate what had historically been tasks that were labor intensive, high cost, inconsistent and limiting.”


A sorter machine from AMP Robotics.

It’s also lucrative. The recycling industry contributes nearly $117 billion to the U.S. economy, according to the Institute of Scrap Recycling Industries, and the industry processes 130 million metric tons of valuable commodities annually.

Horowitz asserts that landfilled plastics represent significant losses to the U.S. economy — an average of around $7.2 billion in 2019, per the Department of Energy (DoE). Of the estimated 44 million metric tons of plastic waste managed domestically in 2019, approximately 86% was landfilled, 9% was combusted and 5% was recycled, according to the DoE.

The recovery of U.S. plastic packaging and food-service plastic alone could represent a pool of earnings of $2 billion to $4 billion per year, Horowitz estimates.

AMP’s primary customers are recycling facilities, which use the startup’s flagship product, a robotic sorting system called AMP Cortex, to sort, pick and reclaim plastics, cardboard, paper, cans, cartons and other containers and packaging types. AMP claims that Cortex can perform 80 to 120 picks per minute while accurately identifying what it’s sorting and where it belongs.

Recently, AMP, which employs a team of around 200 people, unveiled a more compact solution dubbed AMP Cortex-C alongside an integrated, standalone facility offering for waste management companies. Horowitz says that the company’s robotic fleet of around 275 is now deployed in over 100 centers including several owned by Waste Connections, its largest customer, and that AMP’s AI platform has identified over 75 billion objects to date.

“Our broad product suite directly deals with the core challenges of operating recycling facilities, and we have some other amazing technology soon to come,” Horowitz added. “We have a number of larger opportunities in front of us, from opportunities in Europe and across the world, to large, fleet-wide deployments of robots and deployment of fully automated sorting facilities. The capital helps us build the technologies and team to support these opportunities.”

Along those lines, AMP plans to grow its secondary sortation business in the U.S. across its three production facilities in the Denver, Atlanta and Cleveland metro areas. In addition to providing robotics infrastructure and software to customers, AMP resells recyclable commodities like bespoke chemical and polymer blends to end-market buyers.

Temenos Sets Record Benchmark for Banking Cloud Platform on MongoDB & Azure 

Banking software solutions company Temenos has released new high-water benchmark performance results for its Temenos Banking Cloud platform. These results were achieved on the Microsoft Azure and MongoDB Atlas infrastructure and simulated a client with 50 million retail customers, 100 million accounts, and a Banking-as-a-Service (BaaS) offering for 10 brands and 50 million embedded finance customers on a single cloud instance.

During the benchmark test, the Temenos Banking Cloud platform successfully processed 200 million embedded finance loans and 100 million retail accounts at a record-breaking 150,000 transactions per second. This demonstrated the platform’s robustness and scalability to support banks’ business models for growth through BaaS or by distributing their products themselves. The benchmark included not only core transaction processing but also payments, financial crime mitigation (FCM), data hub, and digital channels.

The performance benchmark with Microsoft and MongoDB highlighted the scalability of Temenos’ cloud-native and composable platform, helping banks to massively scale and cater for the high volumes of transactions in the BaaS world where multiple brands are hosted on a single platform. The benchmark also emphasised the advances Temenos has made to provide a greener infrastructure, helping banks scale efficiently and achieve their sustainability goals. Temenos cloud architecture elastically scales, enabling banks to process higher volumes of transactions with a 49% like-for-like improvement for the tested workloads compared to the previous release.

Real-world deployments have shown the scalable performance of the Temenos platform, with a global payments provider launching its Buy-Now-Pay-Later service on the Temenos Banking Cloud, reaching 25 million BNPL consumers, which equates to 150 million loans. This was the fastest start to any product launch in the company’s history.

Temenos offers a single platform suitable for any type and size of bank, including retail, corporate, or wealth management. The platform is cloud-native and readily available on all major public clouds for banks to run themselves or as a SaaS solution via Temenos Banking Cloud.

Boris Bialek, Managing Director of Industry Solutions at MongoDB, stated that Temenos’ cloud-native, microservices-based architecture on the MongoDB Atlas database provides customers with flexibility, improved security, performance, and scalability. He also highlighted the importance of the document model for a composable banking architecture and the ability of the Temenos banking platform and MongoDB’s developer data platform to support the needs of even the largest global banks, as demonstrated by the high throughput of the benchmark.

MongoDB Simplifies Complex Data for Developers

MongoDB, founded 15 years ago, has become a popular choice for developers who want to avoid the tedious work of managing data in relational databases. Last year, in an exclusive interaction with AIM, MongoDB’s Vice President of India and APAC, Sachin Chawla, and Manager of Solutions Architect Corp APAC, Himanshumali, discussed the latest developments and the roadmap ahead. The company added several features to its developer data platform, including queryable encryption, relational migrator, and partnerships with AWS, GCP, and Azure. MongoDB also revamped its MongoDB University to train developers. The company has over 300 million downloads globally, and 2,300 of its 37,000 global customers are in India, where the customer base has grown 50% annually. A 2022 MongoDB survey found that many tech professionals struggle with data architecture, presenting an opportunity for the company to simplify its offerings.

Read more: The good, the bad and the ugly – the story of MongoDB

The post Temenos Sets Record Benchmark for Banking Cloud Platform on MongoDB & Azure appeared first on Analytics India Magazine.

What are Large Language Models and How Do They Work?

Image by Editor

What are Large Language Models?

Large language models are a type of artificial intelligence (AI) model designed to understand, generate, and manipulate natural language. These models are trained on vast amounts of text data to learn the patterns, grammar, and semantics of human language. They leverage deep learning techniques, such as neural networks, to process and analyze the textual information.

The primary purpose of large language models is to perform various natural language processing (NLP) tasks, such as text classification, sentiment analysis, machine translation, summarization, question-answering, and content generation. Some well-known large language models include OpenAI’s GPT (Generative Pre-trained Transformer) series, with GPT-4 being among the most famous, and Google’s BERT (Bidirectional Encoder Representations from Transformers), both built on the Transformer architecture.

How Large Language Models Work

Large language models work by using deep learning techniques to analyze and learn from vast amounts of text data, enabling them to understand, generate, and manipulate human language for various natural language processing tasks.

A. Pre-training, Fine-Tuning and Prompt-Based Learning

Pre-training on massive text corpora: Large language models (LLMs) are pre-trained on enormous text datasets, which often encompass a significant portion of the internet. By learning from diverse sources, LLMs capture the structure, patterns, and relationships within language, enabling them to understand context and generate coherent text. This pre-training phase helps LLMs build a robust knowledge base that serves as a foundation for various natural language processing tasks.

Fine-tuning on task-specific labeled data: After pre-training, LLMs are fine-tuned using smaller, labeled datasets specific to particular tasks and domains, such as sentiment analysis, machine translation, or question answering. This fine-tuning process allows the models to adapt their general language understanding to the nuances of the target tasks, resulting in improved performance and accuracy.
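As an illustration, here is a hedged sketch of what task-specific fine-tuning can look like with the Hugging Face transformers and datasets libraries; the model checkpoint, dataset, and hyperparameters are illustrative assumptions, not a prescription.

from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Labeled sentiment data (illustrative choice of task and dataset)
dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    # Truncate/pad reviews so every example has the same length
    return tokenizer(batch["text"], truncation=True, padding="max_length")

tokenized = dataset.map(tokenize, batched=True)

# Start from the pre-trained checkpoint; only the classification head is new
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

args = TrainingArguments(output_dir="finetuned", num_train_epochs=1,
                         per_device_train_batch_size=8)

trainer = Trainer(
    model=model,
    args=args,
    # Small subset to keep the sketch cheap to run
    train_dataset=tokenized["train"].shuffle(seed=42).select(range(2000)),
)
trainer.train()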

Prompt-based learning differs from traditional LLM training approaches, such as those used for GPT-3 and BERT, which demand pre-training on unlabeled data followed by task-specific fine-tuning with labeled data. Prompt-based models, on the other hand, can adapt to various tasks by integrating domain knowledge through the use of prompts.

The success of the output generated by a prompt-based model is heavily reliant on the prompt’s quality. An expertly formulated prompt can steer the model towards generating precise and pertinent outputs. Conversely, an inadequately designed prompt may yield illogical or unrelated outputs. The craft of devising efficient prompts is referred to as prompt engineering.
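To make the point concrete, here is a small sketch using the Hugging Face text-generation pipeline with GPT-2 (a small, freely downloadable model chosen purely for illustration); the prompts themselves are hypothetical.

from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

vague_prompt = "Write something about data."
engineered_prompt = (
    "You are a data engineering tutor. In two sentences, explain why "
    "columnar storage speeds up analytical queries."
)

# The specific, role-scoped prompt constrains the model toward a
# relevant, well-formed answer; the vague one leaves it free to wander.
for prompt in (vague_prompt, engineered_prompt):
    print(generator(prompt, max_new_tokens=40)[0]["generated_text"])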

B. Transformer architecture

Self-attention mechanism: The transformer architecture, which underpins many LLMs, introduces a self-attention mechanism that revolutionized the way language models process and generate text. Self-attention enables the models to weigh the importance of different words in a given context, allowing them to selectively focus on relevant information when generating text or making predictions. This mechanism is computationally efficient and provides a flexible way to model complex language patterns and long-range dependencies.
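The core computation is compact enough to sketch in a few lines of NumPy: scaled dot-product attention, where each token's query is compared against every key to produce the attention weights. This is a toy illustration, not an optimized implementation.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Similarity of every query to every key, scaled for numerical stability
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax turns each row of scores into weights that sum to 1
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Each output is a weighted average of the value vectors
    return weights @ V

# Toy self-attention: 4 tokens with 8-dimensional embeddings
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V = x
print(out.shape)  # (4, 8)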

Positional encoding and embeddings: In the transformer architecture, input text is first converted into embeddings, which are continuous vector representations that capture the semantic meaning of words. Positional encoding is then added to these embeddings to provide information about the relative positions of words in a sentence. This combination of embeddings and positional encoding allows the transformer to process and generate text in a context-aware manner, enabling it to understand and produce coherent language.
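The sinusoidal scheme from the original Transformer paper is one common choice for positional encoding; a minimal sketch:

import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    positions = np.arange(seq_len)[:, None]   # (seq_len, 1)
    dims = np.arange(d_model)[None, :]        # (1, d_model)
    # Each pair of dimensions gets its own wavelength
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])     # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])     # odd dimensions: cosine
    return pe

# Added element-wise to token embeddings of the same shape
pe = sinusoidal_positional_encoding(seq_len=10, d_model=16)
print(pe.shape)  # (10, 16)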

C. Tokenization methods and techniques

Tokenization is the process of converting raw text into a sequence of smaller units, called tokens, which can be words, subwords, or characters. Tokenization is an essential step in the pipeline of LLMs, as it allows the models to process and analyze text in a structured format. There are several tokenization methods and techniques used in LLMs:

Word-based tokenization: This method splits text into individual words, treating each word as a separate token. While simple and intuitive, word-based tokenization can struggle with out-of-vocabulary words and may not efficiently handle languages with complex morphology.

Subword-based tokenization: Subword-based methods, such as Byte Pair Encoding (BPE) and WordPiece, split text into smaller units that can be combined to form whole words. This approach enables LLMs to handle out-of-vocabulary words and better capture the structure of different languages. BPE, for instance, merges the most frequently occurring character pairs to create subword units, while WordPiece employs a data-driven approach to segment words into subword tokens.

Character-based tokenization: This method treats individual characters as tokens. Although it can handle any input text, character-based tokenization often requires larger models and more computational resources, as it needs to process longer sequences of tokens.
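A quick sketch contrasting the three approaches; the subword example assumes the transformers library and network access to download a pretrained WordPiece vocabulary, and the exact splits it produces are vocabulary-dependent.

text = "Tokenization converts raw text into tokens"

# Word-based: one token per whitespace-separated word
word_tokens = text.split()

# Character-based: one token per character (produces long sequences)
char_tokens = list(text)

# Subword-based: a pretrained WordPiece tokenizer splits rarer words
# into frequent pieces (assumes the `transformers` library)
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tok.tokenize("tokenization"))  # e.g. ['token', '##ization']

print(word_tokens)
print(char_tokens[:10])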

Applications of Large Language Models

A. Text generation and completion

LLMs can generate coherent and fluent text that closely mimics human language, making them ideal for applications like creative writing, chatbots, and virtual assistants. They can also complete sentences or paragraphs based on a given prompt, demonstrating impressive language understanding and context-awareness.

B. Sentiment analysis

LLMs have shown exceptional performance in sentiment analysis tasks, where they classify text according to its sentiment, such as positive, negative, or neutral. This ability is widely used in areas such as customer feedback analysis, social media monitoring, and market research.

C. Machine translation

LLMs can also be used to perform machine translation, allowing users to translate text between different languages. Systems like Google Translate and DeepL have demonstrated impressive accuracy and fluency, making them invaluable tools for communication across language barriers.

D. Question answering

LLMs can answer questions by processing natural language input and providing relevant answers based on their knowledge base. This capability has been used in various applications, from customer support to education and research assistance.

E. Text summarization

LLMs can generate concise summaries of long documents or articles, making it easier for users to grasp the main points quickly. Text summarization has numerous applications, including news aggregation, content curation, and research assistance.

Conclusion

Large language models represent a significant advancement in natural language processing and have transformed the way we interact with language-based technology. Their ability to pre-train on massive amounts of data and fine-tune on task-specific datasets has resulted in improved accuracy and performance on a range of language tasks. From text generation and completion to sentiment analysis, machine translation, question answering, and text summarization, LLMs have demonstrated remarkable capabilities and have been applied in numerous domains.

However, these models are not without challenges and limitations. Computational resources, bias and fairness, model interpretability, and controlling generated content are some of the areas that require further research and attention. Nevertheless, the potential impact of LLMs on NLP research and applications is immense, and their continued development will likely shape the future of AI and language-based technology.

If you want to build your own large language models, sign up at Saturn Cloud to get started with free cloud computing and resources.
Saturn Cloud is a data science and machine learning platform flexible enough for any team supporting Python, R, and more. Scale, collaborate, and utilize built-in management capabilities to aid you when you run your code. Spin up a notebook with 4TB of RAM, add a GPU, connect to a distributed cluster of workers, and more. Saturn also automates DevOps and ML infrastructure engineering, so your team can focus on analytics.

Original. Reposted with permission.

More On This Topic

  • Learn About Large Language Models
  • Top Open Source Large Language Models
  • Top Free Courses on Large Language Models
  • Introducing TPU v4: Google’s Cutting Edge Supercomputer for Large Language…
  • Introducing Healthcare-Specific Large Language Models from John Snow Labs
  • The Ultimate Open-Source Large Language Model Ecosystem

Is Google Following the Apple Path?

At the latest in-person Google I/O event, the tech giant announced a plethora of news at lightning speed, including the launch of major Pixel products, signaling that the Pichai-led company is finally building its own ecosystem after nearly a decade of attempts.

Yesterday, a flurry of software and service features, coupled with state-of-the-art hardware, was showcased. This highly anticipated annual event unveiled a trio of Pixel products, spearheaded by the Pixel 7a as the successor to the Pixel 6a. Furthermore, Google ventured into uncharted territory by introducing a dockable tablet and a foldable device.

With an expanding Pixel portfolio, the Google ecosystem has reached newfound heights of capability. Despite the limitations of the Android platform, Google seems determined to leverage its strengths and deliver an experience that resonates with users on par with Apple’s exacting standards.

The New Entrants

Since their debut in 2016, Pixel phones have evolved immensely. As a smartphone original equipment manufacturer (OEM), Google had a rough start. But with a focus on integrating hardware and software seamlessly, the latest developments suggest that the tech titan is now committed to building its own robust ecosystem.

While previous versions of Pixel phones may not have been universally praised, the recently unveiled Pixel 7a departs from the status quo. It boasts significant enhancements both internally and externally, with upgraded camera hardware, Google’s proprietary processor, and a boxy design. Notably, the tech giant seems determined to retain the signature camera bar for the brand.

The Pixel Tablet disrupts the landscape of smart screens by breaking in two. Composed of an Android slate and a magnetic dock equipped with its own built-in speakers, the design is clever. Amazon had previously explored a similar concept with its Fire tablets, but Google’s showcase is more aesthetically refined. Capitalizing on this, Google bundled the two components together at $499 (INR 40,970). Positioned as more than just a tablet, the device doubles as a smart home controller/hub, a teleconferencing solution, and a video streaming machine. While it may not replace the television experience entirely, the Tablet is well suited to YouTube content and more.

Furthermore, the search giant also unveiled the Pixel Fold, its first foldable phone, which will launch this summer at a whopping $1,799 (INR 1,47,500). Currently, South Korea-based Samsung rules the foldable phone domain, with a 62% market share during the first half of 2022, as per Counterpoint Research. Notably, Samsung is also the most sought-after brand for foldable phones. While the Pixel Fold seems promising, its high price point could be an obstacle for the average consumer.

Got To Get Serious

Hopefully, the Pixel series will remain more consistent from here on, which will help the tech giant make things more interconnected and seamless.

However, releasing many new products in a rush could lead to problems. We know the Pixel ecosystem won’t be nearly as exclusive and restrictive as the Apple ecosystem; quite the opposite. But a gradual and rigorous attempt from Google’s side is visible.

For the Mountain View-based company, this is a prime opportunity to finally put to bed the claims that hardware is not the company’s cup of tea. Anshel Sag, a senior analyst at Moor Insights & Strategy, emphasized that it will be good for the company if Google doesn’t give up; however, “Google can only subsidize its hardware business for so long and will have to eventually chase profitability.”

While Google’s Pixel line promises an ecosystem, the company’s reputation for taking U-turns remains unmatched and is a challenge for potential investors. The inaugural Pixel device in 2016 left a lasting impression, and Google is gradually resolving its shortcomings. While Google’s recent moves can establish the firm as a serious player in the hardware market, it will take time to overcome the uncertainties.

The post Is Google Following the Apple Path? appeared first on Analytics India Magazine.

Data Masking: The Core of Ensuring GDPR and other Regulatory Compliance Strategies

Image by Bing Image Creator

Privacy is not a product up for sale but a valuable asset that preserves the integrity of every individual. That’s just one of the many triggers that led to the formulation of the GDPR and several other global regulations. With the increasing importance placed on data privacy, data masking has become necessary for organizations of all sizes to maintain the security and confidentiality of personal information.

Data masking has a mission: to protect Personally Identifiable Information (PII) and restrict access to it wherever possible. It anonymizes and safeguards personal and sensitive information, which is why it applies to bank accounts, credit cards, phone numbers, and health and social security details. With masking in place, no PII is visible during a data breach. You can also set additional security access rules within your organization.

What is Data Masking?

Data masking, as we know, is a technique used to protect sensitive data by replacing it with fictitious but realistic data. It protects personal data in compliance with the General Data Protection Regulation (GDPR) by ensuring that data breaches do not reveal sensitive information about individuals.

Since data masking is an integral component of the data protection strategy, it applies to various data types such as files, backups, and databases. It works closely with encryption, access controls, monitoring, and others to ensure end-to-end compliance with GDPR and other regulations.

Why do we need this for GDPR and other Regulations?

Despite masking’s proven ability to eliminate the exposure of sensitive data, many enterprises do not follow the guidelines and stand at risk of a breach. A well-known case involves the clothing retailer H&M, which incurred a fine of 35 million euros for violating GDPR norms after it was found that management had access to sensitive data such as individuals’ religious beliefs and personal issues. That is exactly what the GDPR tries to prevent, and that is why data masking is essential.

However, heavily regulated industries such as BFSI and healthcare are already implementing data masking to comply with privacy regulations such as the Payment Card Industry Data Security Standard (PCI DSS) and the Health Insurance Portability and Accountability Act (HIPAA).

The implementation of Europe's GDPR in 2018 has sparked a global trend of privacy laws, with jurisdictions such as California, Brazil, and Southeast Asia introducing laws such as the CCPA and CPRA, LGPD, and PDPA, respectively, to protect personal data.

Data masking can provide several benefits for regulatory compliance, including:

  • Protecting sensitive data: Data masking can protect sensitive data, such as personal information, by replacing it with fictitious but realistic data. This can prevent unauthorized access or accidental exposure of sensitive data.
  • Compliance with regulations: Data masking can be used to anonymize personal data, which can help organizations comply with regulations such as the General Data Protection Regulation (GDPR) and other data privacy laws.
  • Auditing and compliance: Data masking can provide an auditable trail of who has accessed sensitive data, which can help organizations demonstrate compliance with regulatory requirements.
  • Data Governance: Data masking can be used as a data governance tool; organizations can ensure that sensitive data is only used for the intended purposes and by authorized personnel.

Key Data Masking Practices for GDPR

Data Minimization

Data minimization in data masking means masking only the minimum amount of data necessary to protect sensitive information while still allowing the data to be used for its intended purpose. This can help organizations balance the need to protect sensitive data with the need to use the data for business purposes.

For example, an organization may only need to reveal the last four digits of a credit card number, masking the rest, to protect sensitive information while still allowing the data to be used for financial transactions. Similarly, with personal data, masking only specific fields like name and address while keeping fields like gender and date of birth intact can be sufficient for certain use cases.
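A minimal sketch of this field-level approach in Python, assuming records arrive as plain dictionaries; the field names and values are hypothetical.

def mask_card_number(card_number: str) -> str:
    # Reveal only the last four digits; mask everything else
    return "*" * (len(card_number) - 4) + card_number[-4:]

record = {"name": "Jane Doe", "card": "4111111111111111", "dob": "1990-01-01"}

# Mask only the sensitive field; the rest stays usable as-is
masked = {**record, "card": mask_card_number(record["card"])}
print(masked["card"])  # ************1111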

Pseudonymization

Pseudonymization uses pseudonyms to replace users' identifying information and thus protect their privacy. This helps ensure compliance with regulations such as the General Data Protection Regulation (GDPR) by ensuring that data breaches do not reveal sensitive information about individuals.

This data masking technique replaces personal identifiers such as name, address, and social security number with a unique pseudonym while keeping other non-sensitive attributes such as gender and date of birth intact. The pseudonyms can be generated using cryptographic techniques, such as hashing or encryption, to ensure that the original personal data cannot be reconstructed.

It also aligns with the regulation's requirements for security and safe data processing for scientific, historical, and statistical purposes (analytics). It's a valuable tool in ensuring compliance with the GDPR's data protection by design principle.
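One common way to generate such pseudonyms is a keyed hash (HMAC), so the same identifier always maps to the same pseudonym without being reversible by anyone who lacks the key. A minimal sketch, with the key and field names as placeholder assumptions:

import hashlib
import hmac

SECRET_KEY = b"replace-with-a-vault-managed-secret"  # placeholder

def pseudonymize(value: str) -> str:
    # Deterministic, non-reversible pseudonym for a personal identifier
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

user = {"name": "Jane Doe", "gender": "F", "dob": "1990-01-01"}
user["name"] = pseudonymize(user["name"])  # identifier replaced

# Non-identifying attributes (gender, date of birth) stay intact
print(user)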

Data masking can also optimize your DevOps function by providing realistic yet secure fictitious data for testing. This is particularly beneficial for organizations that rely on internal or third-party developers, as it ensures security and minimizes delays in the DevOps process. Data masking lets you test against your customers' data while maintaining their privacy.

Data Masking with Data Products for GDPR and other Regulations

Treating data as products and using them to implement masking techniques has a lot of benefits. In 2022, many data fabric and data product platforms gained popularity for their innovative approaches. For example, K2view performs data masking at the business entity level, ensuring consistency and completeness while preserving referential integrity.

To ensure maximum security, each business entity's data is managed within its Micro-Database, protected by its 256-bit encryption key. Additionally, the personally identifiable information (PII) within the Micro-Database is masked in real-time, following predefined business rules, providing an added layer of protection.

End Note

Implementing data masking techniques can help organizations avoid hefty fines and damage to their reputation. However, it's important to note that data masking alone is insufficient to achieve GDPR compliance and should be used in conjunction with other security measures.

Yash Mehta is an internationally recognized IoT, M2M and Big Data technology expert. He has written a number of widely acknowledged articles on Data Science, IoT, Business Innovation and Cognitive intelligence. He is the founder of a data insight platform called Expersight. His articles have been featured in the most authoritative publications and awarded as one of the most innovative and influential works in the connected technology industry by the IBM and Cisco IoT departments.

More On This Topic

  • Scaling Your Data Compliance Strategy with Legal Automation
  • Detecting Data Drift for Ensuring Production ML Model Quality Using Eurybia
  • Deep Learning For Compliance Checks: What's New?
  • Strategies of Docker Images Optimization
  • 5 strategies for enterprise machine learning for 2021
  • 20 Core Data Science Concepts for Beginners

Google-backed Startups Are Using OpenAI’s GPT, Should it be Worried?


“And in the end, OpenAI doesn’t matter. They are making the same mistakes as we are in their posture relative to open source, and their ability to maintain an edge is necessarily in question,” reads Google’s internal leaked document titled, “We Have No Moat, and Neither Does OpenAI.”

The leaked document also notes that ever since developers got their hands on Meta’s leaked LLM, LLaMA, the open-source community has been flooded with LLM-based generative AI models. It looks like Meta’s “mistake” has actually brought it back into the race, but only through open source.

Interestingly, Google claims that OpenAI has no moat when it comes to LLMs. But amid the hype around ChatGPT, we are increasingly seeing startups that either use GPT in their name or build technologies using the APIs provided by OpenAI.

Amid all this, Google has backed a lot of AI startups to fight OpenAI and Microsoft. It has been trying to push generative AI into its cloud infrastructure, and has thus partnered with Salesforce, Box Inc, Jasper AI, and Canva. These companies have been using Google’s LLM offering through Vertex AI on Google Cloud.

OpenAI Has A Moat

Interestingly, despite the partnership with Google, last month Salesforce also announced Einstein GPT, integrating OpenAI’s technology and offering generative AI capabilities for its customers.

In the same announcement, Salesforce also unveiled a $250 million generative AI fund, which will first invest in four companies, including Cohere and Anthropic, both of which are backed by Google; the companies will use GPT technology on Salesforce’s platform. It seems the startups backed by Google are spending its funds on OpenAI’s technology to run their businesses.

Another Google-backed startup, India’s Slang Labs, builds voice assistant and voice search products. Interestingly, the startup announced the launch of CONVA 2.0, a multilingual AI co-pilot powered by GPT technology and aimed at e-commerce shoppers. The company also boasts that its GPT-based technology offers 46% better performance than Google’s current Voice Assistant.

In an interview with AIM, Slang Labs’ CEO and co-founder Kumar Rangarajan said that the company decided to integrate GPT into its products to offer services to its customers regardless of who developed it. “Making LLMs from the ground up is a very expensive and computation-heavy process. It does not make sense to do it, therefore we adopted the GPT technology into our services,” Rangarajan explained.

Similarly, Anthropic, an AI startup backed by Google, partnered with Notion, whose conversational AI features are effectively a wrapper around OpenAI’s GPT-3.5. Businesses just want the best services for their customers, and it seems as though OpenAI is providing them with that.

Open Source: Google Killer?

It should be concerning for Google that some of the startups it backs are also leveraging its rivals’ technologies. It looks like GPT is OpenAI’s moat. And where OpenAI is lagging behind on open-source policies, privacy issues, and other concerns, Meta’s LLaMA is filling the gap with its ability to let individuals build local models.

“The modern internet runs on open source for a reason,” reads the document. “And we should not expect to be able to catch up [to open source].”

Google likened the recent open-source-based AI development to the Internet. It acknowledged that in the Internet technology revolution, the open-source community has a bigger role to play and cannot be replicated by any company. “Individuals are not constrained by licenses to the same degree as corporations,” read the leaked document from Google.

The only thing big tech has left is computational capability, and even that is less of an advantage since the introduction of LLaMA-based models, which make it possible to build ChatGPT-like systems on a single computer. Probably, instead of fearing OpenAI and GPT, Google is more worried about open source.

“But the uncomfortable truth is, we aren’t positioned to win this arms race and neither is OpenAI. While we’ve been squabbling, a third faction has been quietly eating our lunch — open-source.” But is it true that OpenAI and Microsoft are on the same level as Google, without a moat? It might not be true, Google.

Can Google Catch Up?

A lot of this development has happened, probably, because most startups have already started building on top of GPT; before, there was no comparable offering. “Every developer I know is building on top of GPT,” said Robert Scoble in a tweet discussing what Google, along with Apple, should do to get ahead in the game.

For example, Khan Academy, one of the first companies to adopt GPT-4 into its offerings, built its entire system on GPT-4. Would there be any reason for it to switch to Google’s or Apple’s offering in the future, unless they are a hundred steps ahead of OpenAI and Microsoft?

Surprisingly, Google I/O 2023 was a hit. Sundar Pichai and team introduced several advancements in their AI systems, including generative AI and LLMs. It also announced the launch of the PaLM-2 language model along with its API. It also hinted at the multimodal Gemini project that the company has been working on with DeepMind.

On the flip side, people who have already been building on top of GPT products will not want to shift to what Google offers in the near future, such as PaLM-2, unless it is worth it. The same goes for Apple if it gets into the LLM field. We will hopefully hear more about this at Apple’s conference in June.

For now, Meta’s LLaMA has won a lot of developers to its side, but beyond that, the fact remains that as soon as someone hears “generative AI” or “chatbot”, they think GPT or OpenAI. Google needs to step up its game, and the Google I/O conference was definitely one step in the right direction. After all, OpenAI’s GPT is built on the Transformer architecture that Google developed.

The post Google-backed Startups Are Using OpenAI’s GPT, Should it be Worried? appeared first on Analytics India Magazine.

Clustering with scikit-learn: A Tutorial on Unsupervised Learning

Image by Author

Clustering is a popular unsupervised machine learning technique, meaning it is used for datasets where the target variable or outcome variable is not provided.

In unsupervised learning, algorithms are tasked with uncovering the patterns and relationships within data without any pre-existing knowledge or guidance.

What does clustering do? It groups similar data points, enabling us to discover hidden patterns and relationships within our data.

In this article, we will explore the different clustering algorithms available and their respective use cases, along with important evaluation metrics to assess the quality of clustering results.

We will also demonstrate how to develop multiple clustering algorithms at once using the popular Python library scikit-learn.

Finally, we will highlight some of the most famous real-life applications that used clustering, discussing the algorithms used and the evaluation metrics employed.

But first, let’s get familiar with the clustering algorithms.

What are the Clustering Algorithms?

Below you will find an overview of the clustering algorithms and short definitions.

Image by Author

According to the official scikit-learn documentation, there are 11 different clustering algorithms: K-Means, Affinity Propagation, Mean Shift, Spectral Clustering, Hierarchical Clustering, Agglomerative Clustering, DBSCAN, OPTICS, Gaussian Mixture, BIRCH, and Bisecting K-Means.

Here you can find the official documentation.

This section will explore the five most famous and important clustering algorithms: K-Means, Mean-Shift, DBSCAN, Gaussian Mixture, and Hierarchical Clustering.

Before diving deeper, let’s look at the following graph. It shows how these five algorithms work on six differently structured datasets.

Clustering Algorithms — Image by Author

In the scikit-learn documentation, you will find similar graphs, which inspired the image above. I limited it to the five most famous clustering algorithms and added each dataset's structure alongside the algorithm name, e.g., K-Means — Noisy Moons or K-Means — Varied.

There are six different datasets shown, all generated using scikit-learn (a short generation sketch follows the list):

  • Noisy Circles: This dataset consists of a large circle containing a smaller circle that is not perfectly centered. The data also has random Gaussian noise added to it.
  • Noisy Moons: This dataset consists of two interleaving half-moon shapes that are not linearly separable. The data also has random Gaussian noise added to it.
  • Blobs: This dataset consists of randomly generated blobs that are relatively uniform in size and shape. The dataset contains three blobs.
  • No Structure: This dataset consists of randomly generated data points with no inherent structure or clustering pattern.
  • Anisotropically Distributed: This dataset consists of randomly generated data points that are anisotropically distributed. The data points are generated with a specific transformation matrix to elongate them along certain axes.
  • Varied: This dataset consists of randomly generated blobs with varied variances. The dataset contains three blobs, each with a different standard deviation.
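Here is that sketch, using scikit-learn's dataset utilities; the parameter values are illustrative and may differ from those used to produce the plots above.

import numpy as np
from sklearn.datasets import make_blobs, make_circles, make_moons

n = 500
noisy_circles, _ = make_circles(n_samples=n, factor=0.5, noise=0.05)
noisy_moons, _ = make_moons(n_samples=n, noise=0.05)
blobs, _ = make_blobs(n_samples=n, centers=3, random_state=42)
no_structure = np.random.rand(n, 2)  # no inherent clusters

# Anisotropic: stretch blobs with a linear transformation
X, _ = make_blobs(n_samples=n, centers=3, random_state=170)
aniso = X @ np.array([[0.6, -0.6], [-0.4, 0.8]])

# Varied: blobs with different standard deviations
varied, _ = make_blobs(n_samples=n, centers=3,
                       cluster_std=[1.0, 2.5, 0.5], random_state=170)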

Seeing the plots and how each algorithm behaves on them helps us compare how well the algorithms perform on each dataset. This may also help in your own data project if your data has a structure like the ones in these graphs.

Now let’s dig deeper into these five algorithms, starting with the K-Means algorithm.

K-Means

K-Means 2D | Image by Author
K-Means 3D | Image by Author

K-Means is a popular clustering algorithm that partitions a dataset into K distinct clusters, where K is a hyperparameter specified by the user.

The algorithm works by assigning each data point to the nearest cluster centroid and then recalculating the centroid based on the average of all the data points in that cluster.

This process continues until the centroids no longer move or a specified maximum number of iterations is reached.
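A minimal scikit-learn sketch on synthetic blobs (an illustrative toy setup, not tied to any particular dataset from this article):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

print(kmeans.cluster_centers_)  # one centroid per cluster
print(kmeans.labels_[:10])      # cluster assignment for each point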

It has also been used in various real-life applications, such as customer segmentation in e-commerce, disease clustering in healthcare, and image compression in computer vision.

To see the real-life applications of K-Means or other algorithms, continue reading. We will get to it in the later sections.

DBSCAN

DBSCAN | Image by Author

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm that identifies clusters as high-density regions separated by low-density regions.

The algorithm groups together points that are close to each other based on a density threshold and a minimum number of points.

DBSCAN is often used in outlier detection, spatial clustering, and image segmentation, where the goal is to identify distinct clusters in the data while also handling noisy or outlier data points.
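A short sketch on the two-moons data, where density-based clustering shines; eps and min_samples are illustrative choices.

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
db = DBSCAN(eps=0.3, min_samples=5).fit(X)

print(set(db.labels_))  # cluster ids; -1 marks noise/outlier points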

Hierarchical Clustering

Hierarchical Clustering | Image by Author

Hierarchical Clustering is a clustering algorithm that builds a tree-like structure of clusters by merging or splitting clusters.

Depending on the approach taken, the algorithm can be agglomerative (bottom-up) or divisive (top-down, as in the graph above).

Hierarchical Clustering has been used in various real-life applications such as social network analysis, image segmentation, and ecological research, where the goal is to identify meaningful relationships between clusters and subclusters.
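A minimal agglomerative (bottom-up) sketch with scikit-learn; Ward linkage is one common choice.

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=42)
agg = AgglomerativeClustering(n_clusters=3, linkage="ward").fit(X)

print(agg.labels_[:10])  # points merged bottom-up into three clusters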

Mean-Shift Clustering

Mean-Shift Clustering | Image by Author

Mean-shift clustering is a non-parametric algorithm that doesn't require prior assumptions about the shape or number of clusters.

The algorithm works by iteratively shifting each data point towards the local mean (the x in the graph above) until convergence, where the local mean is estimated with a kernel density function.

The mean-shift algorithm identifies clusters as high-density regions in the feature space.

Mean-shift clustering has been used in real-life applications such as image segmentation, object tracking in video surveillance, and anomaly detection in network intrusion detection, where the goal is to identify distinct regions or patterns in the data.
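A minimal sketch; note that the number of clusters is inferred from the data rather than specified up front (the bandwidth estimate below is an illustrative default).

from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
bandwidth = estimate_bandwidth(X, quantile=0.2)  # kernel width
ms = MeanShift(bandwidth=bandwidth).fit(X)

print(len(ms.cluster_centers_))  # number of clusters found, not preset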

Gaussian Mixture Clustering

Gaussian Mixture Model in the moon-shaped data set | Image by Author

K-Means Clustering Model in the moon-shaped data set | Image by Author

Gaussian Mixture is a probabilistic clustering algorithm that models the distribution of data points using a mixture of Gaussian distributions. The algorithm fits a set of Gaussian distributions to the data, where each Gaussian corresponds to a separate cluster.
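A minimal sketch; unlike K-Means, a Gaussian mixture also yields soft (probabilistic) cluster memberships.

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
gmm = GaussianMixture(n_components=3, covariance_type="full").fit(X)

print(gmm.predict(X)[:10])       # hard cluster assignments
print(gmm.predict_proba(X)[:2])  # soft memberships per component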

Gaussian Mixture has been used in various real-life applications such as speech recognition, gene expression analysis, and face recognition, where the goal is to model the underlying distribution of the data and identify clusters based on the fitted Gaussian distributions.

As we can see from the graph above, the Gaussian mixture has a better capability of capturing trends in elliptical data points as above and drawing elliptical clusters.

Overall, each clustering algorithm has its unique strengths and weaknesses. The choice of algorithm depends on the problem at hand and the dataset's characteristics.

Understanding the nuances of each algorithm and its use cases is crucial for achieving accurate and meaningful results.

Clustering Evaluation Metrics

After applying the algorithm, you need to evaluate its performance to see whether there is room for improvement or change the algorithm if the performance of your algorithm does not meet the criteria. To do that, you should use evaluation metrics.

Here’s an overview of the most popular ones.

Image by Author

These are not all of them, of course. You can find the complete list in the scikit-learn documentation, which covers the following evaluation metrics: Rand Index, Mutual Information based scores, Silhouette Coefficient, Fowlkes-Mallows scores, Homogeneity, Completeness, V-measure, Calinski-Harabasz Index, Davies-Bouldin Index, Contingency Matrix, and Pair Confusion Matrix.

Here you can see the official documentation.

We will stick with the popular ones and start with the Rand Index.

Rand Index

The Rand Index evaluates the similarity between the true cluster labels and the predicted cluster labels.

The index ranges from 0 to 1, with 1 indicating a perfect match between the true and predicted labels.

The Rand Index is often used in image segmentation, text clustering, and document clustering, where the true labels of the data are known.
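One detail worth seeing in code: the Rand Index is invariant to how clusters are numbered, so a predicted labeling that is a pure relabeling of the truth still scores perfectly. A tiny sketch with scikit-learn:

from sklearn.metrics import rand_score

true_labels = [0, 0, 1, 1, 2, 2]
pred_labels = [2, 2, 0, 0, 1, 1]  # same grouping, different numbering

print(rand_score(true_labels, pred_labels))  # 1.0 -- a perfect match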

Silhouette Coefficient

The Silhouette Coefficient measures the quality of clustering based on how well-separated the clusters are and how similar the data points within each cluster are.

The coefficient ranges from -1 to 1, with 1 indicating a well-separated and compact cluster and -1 indicating an incorrect clustering.

The Silhouette Coefficient is often used in market segmentation, customer profiling, and product recommendation, where the goal is to identify meaningful clusters based on customer behavior and preferences.

Fowlkes-Mallows scores

The Fowlkes-Mallows index is named after the two researchers, Edward B. Fowlkes and Colin L. Mallows, who proposed the metric in 1983.

The index measures the similarity between the true labels and the labels predicted by a clustering algorithm.

The score ranges from 0 to 1, with 1 indicating a perfect match between the true and predicted labels.

The Fowlkes-Mallows score is often used in image segmentation, text clustering, and document clustering, where the true labels of the data are known.

Davies-Bouldin Index

The Davies-Bouldin Index is named after two researchers, David L. Davies and Donald W. Bouldin, who proposed the metric in 1979.

The index ranges from 0 to infinity, with lower values indicating better clustering quality.

It is handy for identifying the optimal number of clusters in the data and for detecting cases where the clusters overlap or are too similar to each other. However, the index assumes that the clusters are spherical and have similar densities, which may not always hold in real-world datasets.

The Davies-Bouldin Index is often used in market segmentation, customer profiling, and product recommendation, where the goal is to identify meaningful clusters based on customer behavior and preferences.

Calinski-Harabasz Index

The Calinski-Harabasz Index is named after T. Calinski and J. Harabasz, who proposed the metric in 1974.

The Calinski-Harabasz Index measures the quality of clustering based on how well-separated the clusters are and how similar the data points within each cluster are to each other.

The index ranges from 0 to infinity, with higher values indicating better clustering.

The Calinski-Harabasz Index is often used in market segmentation, customer profiling, and product recommendation, where the goal is to identify meaningful clusters based on customer behavior and preferences.

Comparing Evaluation Metrics

Image by Author

For the Rand Index, Silhouette Coefficient, and Fowlkes-Mallows scores, higher values indicate better clustering performance.

The best score is 1.

For Davies-Bouldin Index, lower values indicate better clustering performance.

The best score is 0.

For the Calinski-Harabasz Index, higher values indicate better performance.

In theory, the best score for the Calinski-Harabasz (CH) Index would be infinity (∞), as it would indicate an extremely high between-cluster dispersion compared to the within-cluster dispersion. However, achieving an infinite CH Index value is not realistic in practice.

There is no fixed upper bound for the best score, as it depends on the specific data and clustering.

Don’t forget: no algorithm or script is perfect. If you achieve the best possible scores on any of these evaluation metrics, it’s quite likely your model is overfitting.

How to Develop Multiple Clustering Algorithms at Once With scikit-learn?

The purpose here is to apply multiple clustering algorithms to the Iris dataset and calculate their performance using different evaluation metrics.

We will use the Iris dataset, which you can find here.

The iris dataset is a famous multi-class classification dataset that contains 150 samples of iris flowers, each having four features (length and width of sepals and petals).

Image from Machine Learning in R for beginners

There are three classes in the dataset representing three types of iris flowers.

The dataset is commonly used for machine learning and pattern recognition tasks, particularly for testing and comparing different classification algorithms. It was introduced by British statistician and biologist Ronald Fisher in 1936.

Here we will write code that imports the necessary libraries, loads the dataset, implements five clustering algorithms (DBSCAN, K-Means, Hierarchical Clustering, Gaussian Mixture Model, and Mean Shift), and evaluates their performance using five metrics.

To do that, we will put the evaluation metrics and algorithms in dictionaries and apply them with two nested for loops.

But we have one exception here: the rand_score and fowlkes_mallows_score functions compare clustering results with the true labels, so we will add an if-else block to handle them separately.

Then we will collect these results in a data frame for further analysis.

Here is the code.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.cluster import (
    DBSCAN,
    KMeans,
    AgglomerativeClustering,
    MeanShift,
)
from sklearn.mixture import GaussianMixture
from sklearn.metrics import (
    silhouette_score,
    calinski_harabasz_score,
    davies_bouldin_score,
    rand_score,
    fowlkes_mallows_score,
)

# Load the Iris dataset
iris = load_iris()
X = iris.data
y = iris.target

# Implement clustering algorithms
dbscan = DBSCAN(eps=0.5, min_samples=5)
kmeans = KMeans(n_clusters=3, random_state=42)
agglo = AgglomerativeClustering(n_clusters=3)
gmm = GaussianMixture(n_components=3, covariance_type="full")
ms = MeanShift()

# Fit each algorithm and keep its predicted cluster labels
labels = {
    "DBSCAN": dbscan.fit_predict(X),
    "K-Means": kmeans.fit_predict(X),
    "Hierarchical": agglo.fit_predict(X),
    "Gaussian Mixture": gmm.fit_predict(X),
    "Mean Shift": ms.fit_predict(X),
}

metrics = {
    "Silhouette Score": silhouette_score,
    "Calinski Harabasz Score": calinski_harabasz_score,
    "Davies Bouldin Score": davies_bouldin_score,
    "Rand Score": rand_score,
    "Fowlkes-Mallows Score": fowlkes_mallows_score,
}

# Evaluate every algorithm with every metric, one row per pair
records = []
for name, label in labels.items():
    for metric_name, metric_func in metrics.items():
        # Rand and Fowlkes-Mallows compare predictions to the true labels;
        # the other metrics need only the data and the predicted labels
        if metric_name in ["Rand Score", "Fowlkes-Mallows Score"]:
            score = metric_func(y, label)
        else:
            score = metric_func(X, label)
        records.append(
            {"Algorithm": name, "Metric": metric_name, "Score": score}
        )

pred_df = pd.DataFrame(records)

# Display the DataFrame
pred_df.head(10)

Here is the output.

[Output: prediction DataFrame | Image by Author]

Now, let’s create a visualization to see the results more clearly. The aim is to chart the clustering algorithms’ evaluation metrics side by side.

The following code pivots the data to have algorithms as columns and metrics as rows and then generates bar charts for each metric. This allows for easy comparison of the clustering algorithms' performance across different evaluation measures.

Here is the code.

import pandas as pd
import matplotlib.pyplot as plt

# Pivot the data to have algorithms as columns and metrics as rows
pivoted_df = pred_df.pivot(index="Metric", columns="Algorithm", values="Score")

# Define the three internal metrics to plot
metrics = [
    "Silhouette Score",
    "Calinski Harabasz Score",
    "Davies Bouldin Score",
]

# Define a colormap to use for each algorithm
cmap = plt.get_cmap("Set3")

# Plot a bar chart for each metric
fig, axs = plt.subplots(nrows=1, ncols=3, figsize=(15, 5))

# Add a main title to the figure
fig.suptitle("Comparing Evaluation Metrics", fontsize=16, fontweight="bold")

for i, metric in enumerate(metrics):
    ax = pivoted_df.loc[metric].plot(kind="bar", ax=axs[i], rot=45)
    ax.set_xticklabels(ax.get_xticklabels(), ha="right")
    ax.set_ylabel(metric)
    ax.set_title(metric, fontstyle="italic")

    # Color each algorithm's bar using the colormap
    for j, alg in enumerate(pivoted_df.columns):
        ax.patches[j].set_color(cmap(j))

plt.show()

Here is the output.

[Output: bar charts comparing the evaluation metrics | Image by Author]

In summary, the Mean Shift algorithm performs the best according to the Silhouette Score and Davies Bouldin Score.

The K-Means algorithm performs the best according to the Calinski Harabasz Score, and GMM performs the best according to the Rand Score and Fowlkes-Mallows Score.

There is no clear winner among the clustering algorithms, as each one performs well on different metrics.

The choice of the best algorithm depends on the specific requirements and the importance assigned to each evaluation metric in your clustering problem.

Real-Life Applications of Clustering

Now, let’s look at real-life examples of both our algorithms and evaluation metrics to grasp the logic even further.

Here’s an overview of the examples we’ll discuss in detail.

[Image: overview of the examples, algorithms, and evaluation metrics | Image by Author]

Supermarket Chain Personalization


Real-life example: A supermarket chain wants to create personalized marketing campaigns for its customers. They use K-Means clustering to segment customers based on their purchasing habits, demographics, and store visit frequency. These segments help the company tailor its marketing messages to engage better and serve its customers.

Algorithm: K-Means clustering

K-Means is chosen because it is a simple, efficient, and widely used clustering algorithm that works well with large datasets. It can quickly identify patterns and create distinct customer segments based on the input features.

Evaluation metrics: Silhouette Score

The Silhouette Score is used to evaluate the quality of customer segmentation by measuring how well each data point fits within its assigned cluster compared to other clusters. This helps ensure the clusters are compact and well-separated, which is essential for creating effective personalized marketing campaigns.
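
To make this concrete, here is a minimal sketch of such a segmentation pipeline. The customer features and the synthetic data below are purely illustrative, not a real retail dataset:

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical customer features: spend, basket size, visit frequency
rng = np.random.default_rng(42)
customers = pd.DataFrame({
    "annual_spend": rng.gamma(shape=2.0, scale=1500.0, size=500),
    "avg_basket_size": rng.normal(loc=25, scale=8, size=500).clip(min=1),
    "visits_per_month": rng.poisson(lam=6, size=500),
})

# Scale features so no single unit dominates the distance calculation
X = StandardScaler().fit_transform(customers)

# Segment customers into 4 groups and check cluster quality
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
segments = kmeans.fit_predict(X)
print("Silhouette Score:", silhouette_score(X, segments))

In practice, the number of segments would be tuned by comparing the Silhouette Score across several values of n_clusters.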

Fraudulent Transaction Detection


Real-life example: A credit card company wants to detect fraudulent transactions. They use DBSCAN to cluster transactions based on factors like transaction amount, time, and location. Unusual transactions that don't fit into any cluster are flagged as potential frauds for further investigation.

Algorithm: DBSCAN

DBSCAN is chosen because it is a density-based clustering algorithm that can identify clusters of varying shapes and sizes, as well as detect noise points that do not belong to any cluster. This makes it suitable for detecting unusual patterns or outliers, such as potentially fraudulent transactions.

Evaluation metrics: Silhouette Score

The Silhouette Score is chosen as the evaluation metric here because it helps assess how effectively DBSCAN separates normal transactions from the potential outliers that represent fraud.

A higher Silhouette Score indicates that the clusters of regular transactions are well separated from each other and the noise points (outliers). This separation makes it easier to identify and flag suspicious transactions that deviate significantly from normal patterns.
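
Here is a minimal sketch of the idea, using synthetic transaction features (amount and hour of day are illustrative choices, not a real fraud dataset):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Mostly normal activity: moderate amounts during the day...
normal = np.column_stack([
    rng.normal(50, 15, size=300),   # typical transaction amounts
    rng.normal(14, 3, size=300),    # daytime hours
])
# ...plus a few odd transactions: large amounts at unusual hours
odd = np.column_stack([
    rng.normal(900, 100, size=5),
    rng.normal(3, 1, size=5),
])
X = StandardScaler().fit_transform(np.vstack([normal, odd]))

# DBSCAN labels points that fall in no dense region as -1 (noise)
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
flagged = np.where(labels == -1)[0]
print("Transactions flagged for review:", flagged)

Any transaction labeled -1 would then be routed to a human analyst rather than blocked automatically.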

Cancer Genomics Relationships


Real-life example: Researchers studying cancer genomics want to understand the relationships between different types of cancer cells. They use Hierarchical Clustering to group cells based on their gene expression patterns. The resulting clusters help them identify commonalities and differences between cancer types and develop targeted therapies.

Algorithm: Agglomerative Hierarchical Clustering

Agglomerative Hierarchical Clustering is chosen because it creates a tree-like structure (dendrogram) that allows researchers to visualize and interpret the relationships between cancer cells at multiple levels of granularity. This approach can reveal nested subgroups of cells and helps researchers understand the hierarchical organization of cancer types based on their gene expression patterns.

Evaluation metrics: Calinski-Harabasz Index

The Calinski-Harabasz Index is chosen in this case because it measures the ratio of between-cluster dispersion to within-cluster dispersion. For cancer genomics, it helps researchers evaluate the clustering quality in terms of how distinct and well-separated the groups of cancer cells are based on their gene expression patterns.
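
A small sketch of this workflow on a made-up expression matrix (the sample and gene counts below are purely illustrative):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import calinski_harabasz_score

# Hypothetical data: 60 cell samples x 50 gene expression levels
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(0, 1, size=(20, 50)),   # cell type A
    rng.normal(3, 1, size=(20, 50)),   # cell type B
    rng.normal(6, 1, size=(20, 50)),   # cell type C
])

# Flat clustering for evaluation
labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)
print("Calinski-Harabasz Index:", calinski_harabasz_score(X, labels))

# Dendrogram to inspect the nested structure at every level
plt.figure(figsize=(10, 4))
dendrogram(linkage(X, method="ward"))
plt.title("Hierarchical clustering of cell samples")
plt.show()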

Autonomous Car


Real-life example: A self-driving car company wants to improve the car's ability to identify objects in its surroundings. They use Mean-Shift Clustering to segment images captured by the car's cameras into different regions based on color and texture, which helps the car recognize and track objects like pedestrians and other vehicles.

Algorithm: Mean Shift Clustering

Mean Shift clustering is chosen because it is a non-parametric, density-based algorithm that can automatically adapt to the underlying structure and scale of the data.

This makes it particularly suitable for image segmentation tasks, where the number of clusters or regions may not be known in advance, and the shapes of the regions can be complex and irregular.

Evaluation metrics: Fowlkes-Mallows Score (FMS)

The Fowlkes-Mallows Score is chosen in this case because it measures the similarity between two clusterings, typically comparing the algorithm's output to a ground-truth clustering.

In the context of self-driving cars, the FMS can be used to assess how well the Mean Shift clustering algorithm segments the images compared to human-labeled segmentations.
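
As a rough sketch of that evaluation step, here is Mean Shift applied to a toy stand-in for pixel color data, compared against hypothetical ground-truth labels with the FMS:

import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.metrics import fowlkes_mallows_score

# Toy stand-in for camera pixels: two RGB color regions plus labels
rng = np.random.default_rng(7)
region_a = rng.normal([30, 30, 30], 5, size=(200, 3))     # dark pixels
region_b = rng.normal([200, 180, 160], 5, size=(200, 3))  # light pixels
X = np.vstack([region_a, region_b])
true_labels = np.array([0] * 200 + [1] * 200)

# Mean Shift infers the number of segments from the data's density
bandwidth = estimate_bandwidth(X, quantile=0.2)
pred_labels = MeanShift(bandwidth=bandwidth).fit_predict(X)

# Compare the segmentation against the "human-labeled" ground truth
print("Fowlkes-Mallows Score:", fowlkes_mallows_score(true_labels, pred_labels))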

News Recommendation


Real-life example: An online news platform wants to group articles into topics to improve content recommendations for its users. They use Gaussian Mixture Models to cluster articles based on features extracted from their texts, such as word frequency and term co-occurrence. By identifying distinct topics, the platform can recommend articles more relevant to a user's interests.

Algorithm: Gaussian Mixture Model (GMM) clustering

Gaussian Mixture Models are chosen because they are a probabilistic, generative approach that can model complex, overlapping clusters. This is particularly useful for text data, where articles may belong to multiple topics or have shared features. GMM can capture these nuances and provide a soft clustering, assigning each article a probability of belonging to each topic.

Evaluation metrics: Silhouette Coefficient

The Silhouette Coefficient is chosen because it measures the compactness and separation of the clusters, helping assess the quality of the topic assignments.

A higher Silhouette Coefficient indicates that the articles within a topic are more similar to each other and distinct from other topics, which is important for accurate content recommendations.
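
Here is a minimal sketch of GMM-based soft clustering on synthetic stand-ins for extracted article features (in practice these would come from something like reduced TF-IDF vectors):

import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score

# Stand-in for extracted text features: 300 articles x 10 dimensions
rng = np.random.default_rng(3)
X = np.vstack([
    rng.normal(0, 0.5, size=(100, 10)),   # topic 1 articles
    rng.normal(2, 0.5, size=(100, 10)),   # topic 2 articles
    rng.normal(4, 0.5, size=(100, 10)),   # topic 3 articles
])

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
hard_topics = gmm.fit_predict(X)

# Soft clustering: per-article probability of belonging to each topic
probs = gmm.predict_proba(X)
print("First article's topic probabilities:", probs[0].round(3))
print("Silhouette Coefficient:", silhouette_score(X, hard_topics))

The probabilities from predict_proba are what make GMM suitable here: an article that straddles two topics can be recommended to readers of both.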

If you want to know more about unsupervised algorithms, you can find more information in “Unsupervised Learning Algorithms”. Also, check out “Supervised vs Unsupervised Learning”, which covers the two approaches you should know in the world of machine learning.

Conclusion

In conclusion, clustering is an essential unsupervised learning technique used to find similarities or patterns in data without prior knowledge of class labels.

We discussed different clustering algorithms, including K-Means, Mean Shift, DBSCAN, Gaussian Mixture, and Hierarchical Clustering, along with their use cases and real-life applications.

Additionally, we explored various evaluation metrics, including the Silhouette Coefficient, Calinski-Harabasz Index, and Davies-Bouldin Index, which help us assess the quality of clustering results.

We also learned how to develop multiple clustering algorithms simultaneously using scikit-learn and evaluated them using the metrics we had already discovered.

Finally, we discussed some popular applications that utilized clustering algorithms to solve real-world problems.

If you still have questions, here is an article explaining clustering and its algorithms.

Clustering has a wide range of applications, from customer segmentation in marketing to image recognition in computer vision, and it is an essential tool for discovering hidden patterns and insights in data.

Nate Rosidi is a data scientist and works in product strategy. He's also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Connect with him on Twitter: StrataScratch or LinkedIn.

More On This Topic

  • Scikit-learn for Machine Learning Cheatsheet
  • Getting Started with Scikit-learn for Classification in Machine Learning
  • The Ultimate Scikit-Learn Machine Learning Cheatsheet
  • The Best Machine Learning Frameworks & Extensions for Scikit-learn
  • Using Scikit-learn's Imputer
  • 10 Things You Didn’t Know About Scikit-Learn