CDC Data Replication: Techniques, Tradeoffs, Insights


Many organizations across industries operate production databases in which most of the data does not change very frequently; that is, daily changes and updates only account for a relatively small portion of the overall amount of data stored in them. It is these organizations that can benefit most from change data capture (CDC) data replication.

In this article, I will define CDC data replication, briefly discuss the most common use cases, and then talk about common techniques and the tradeoffs of each. Towards the end, I will give some general implementation insights that I’ve learned as the CEO and founder of data integration company Dataddo.

What Is Change Data Capture (CDC) Data Replication?

CDC data replication is a method of copying data in real or near real time between two databases whereby only newly added or modified data is copied.

It is an alternative to snapshot replication, which involves moving an entire snapshot of one database to another again and again. Snapshot replication may be suitable for organizations that need to preserve individual snapshots of their data over time, but it’s very processing-intensive and leaves a big financial footprint. For organizations that don’t need to do this, CDC can save a lot of paid processing time.

Changes to data can be captured and delivered to their new destination in real time or in small batches (e.g. every hour).

This image illustrates log-based CDC, where the red row is newly added data.

It’s worth mentioning that CDC is not a new process. However, until recently, only large organizations had the engineering resources to implement it. What is new is the growing selection of managed tools that enable it for a fraction of the cost, hence its newfound popularity.

Most Common CDC Use Cases

There’s not enough space in this article to cover all the use cases of CDC data replication, but here are three of the most common.

Data Warehousing for Business Intelligence and Analytics

Any organization that runs a proprietary, data-collecting system is likely to have a production database that stores key info from this system.

Since production databases are designed for write operations, they don’t do much to put data into profitable use. Many organizations will therefore want to copy the data into a data warehouse, where they can run complex read operations for analytics and business intelligence.

If your analytics team needs data in near real time, CDC is a good way to give it to them, because it will quickly deliver the changes to the analytics warehouse as they are made.

Database Migration

CDC is also useful when you are migrating from one database technology to another, and you need to keep everything available in case of downtime. A classic example would be migration from an on-premise database to a cloud database.

Disaster Recovery

Similar to the migration case, CDC is an efficient and potentially cost-effective way to ensure all your data is available in multiple physical locations all the time, in case of downtime in one.

Common CDC Techniques and The Tradeoffs of Each

There are three main CDC techniques, each with its own set of advantages and disadvantages.

CDC implementation involves tradeoffs between flexibility, fidelity, latency, maintenance, and security.

Query-Based CDC

Query-based CDC is quite straightforward. All you do with this technique is write a simple select query against a specific table, followed by some condition, like “only select the data that was updated or added yesterday.” Assuming you already have the schema for a secondary table configured, these queries then produce a new, two-dimensional table containing only the changed data, which can be inserted into the new location.
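To make this concrete, here is a minimal sketch of a query-based CDC job in Python with SQLite. The customers table and its updated_at column are hypothetical (the technique assumes such a change-tracking column exists), and a real setup would point at two separate databases and run on a schedule:

import sqlite3
from datetime import datetime, timedelta, timezone

# Hypothetical connections; in practice these would point at two different
# databases, e.g. a production database and an analytics warehouse.
source = sqlite3.connect("source.db")
target = sqlite3.connect("target.db")

# Query-based CDC: select only rows added or modified since the last run.
since = (datetime.now(timezone.utc) - timedelta(days=1)).isoformat()
rows = source.execute(
    "SELECT id, name, email, updated_at FROM customers WHERE updated_at >= ?",
    (since,),
).fetchall()

# Insert the changed rows into the secondary table (schema assumed to already exist).
target.executemany(
    "INSERT INTO customers (id, name, email, updated_at) VALUES (?, ?, ?, ?)",
    rows,
)
target.commit()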

Advantages

  • Highly flexible. Allows you to define which changes to capture and how to capture them. This makes it easier to customize the replication process in a very granular way.
  • Reduces overhead. Only captures changes that meet specific criteria, so it’s much cheaper than CDC that captures all changes to a database.
  • Easier to troubleshoot. Individual queries can easily be examined and corrected in case of any issues.

Disadvantages

  • Complex maintenance. Each individual query has to be maintained. If you have a couple hundred tables in your database, for example, you would probably need this many queries as well, and maintaining all of them would be a nightmare. This is the main disadvantage.
  • Higher latency. Relies on polling for changes, which can introduce delays in the replication process. This means that you cannot achieve real-time replications using select queries, and that you would need to schedule some kind of batch processing. This may not be much of a problem if you need to analyze something using a long time series, like customer behaviour.

Log-Based CDC

Most database technologies we use today support clustering, meaning you can run them in multiple replicas to achieve high availability. Such technologies must have some kind of binary log, which captures all changes to the database. In log-based CDC, changes are read from the log rather than the database itself, then replicated to the target system.

Advantages

  • Low latency. Data changes can be replicated very quickly to downstream systems.
  • High fidelity. The logs capture all changes to the database, including data definition language (DDL) changes and data manipulation language (DML) changes. This makes it possible to track deleted rows (which is impossible with query-based CDC).

Disadvantages

  • Higher security risk. Requires direct access to the database transaction log. This can raise security concerns, as it will require extensive access levels.
  • Limited flexibility. Captures all changes to the database, which limits the flexibility to define changes and customize the replication process. In case of high customization requirements, the logs will have to be heavily post-processed.

In general, log-based CDC is difficult to implement. See the “insights” section below for more information.
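For illustration only, here is a minimal Python sketch of one common way to consume a log in PostgreSQL: logical decoding over the write-ahead log, using the built-in test_decoding output plugin and the psycopg2 driver. The connection string is hypothetical, the server must be configured with wal_level = logical, and production setups typically use a plugin such as pgoutput or a dedicated tool like Debezium instead:

import psycopg2

# Hypothetical connection; requires wal_level = logical on the server.
conn = psycopg2.connect("dbname=prod user=replicator host=localhost")
conn.autocommit = True
cur = conn.cursor()

# Create a logical replication slot once; PostgreSQL retains WAL from this point.
cur.execute(
    "SELECT * FROM pg_create_logical_replication_slot('cdc_slot', 'test_decoding')"
)

# Poll the slot for changes. Each row is an INSERT/UPDATE/DELETE decoded from
# the log, which a downstream job would parse and apply to the target system.
cur.execute(
    "SELECT lsn, xid, data FROM pg_logical_slot_get_changes('cdc_slot', NULL, NULL)"
)
for lsn, xid, data in cur.fetchall():
    print(lsn, data)  # e.g. "table public.customers: UPDATE: id[integer]:42 ..."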

Trigger-Based CDC

Trigger-based CDC is kind of a blend between the first two techniques. It involves defining triggers for capturing certain changes in a table, which are then inserted into and tracked in a new table. It is from this new table that the changes are replicated to the target system.
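As a minimal, self-contained sketch of the idea (SQLite is used here purely so the example runs anywhere; in a real deployment the triggers would be defined in the production database, and the syntax varies by engine), each trigger writes the change into a tracking table that a replication job later reads and forwards:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT);

-- Change-tracking table populated by the triggers below.
CREATE TABLE customers_changes (
    change_id INTEGER PRIMARY KEY AUTOINCREMENT,
    op TEXT, id INTEGER, email TEXT,
    changed_at TEXT DEFAULT (datetime('now'))
);

-- One trigger per operation of interest; this per-table setup is the
-- maintenance burden discussed below.
CREATE TRIGGER customers_ins AFTER INSERT ON customers BEGIN
    INSERT INTO customers_changes (op, id, email) VALUES ('I', NEW.id, NEW.email);
END;
CREATE TRIGGER customers_del AFTER DELETE ON customers BEGIN
    INSERT INTO customers_changes (op, id, email) VALUES ('D', OLD.id, OLD.email);
END;
""")

conn.execute("INSERT INTO customers VALUES (1, 'a@example.com')")
conn.execute("DELETE FROM customers WHERE id = 1")

# The replication job reads and forwards rows from customers_changes.
print(conn.execute("SELECT op, id, email FROM customers_changes").fetchall())
# [('I', 1, 'a@example.com'), ('D', 1, 'a@example.com')]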

Advantages

  • Flexibility. Allows you to define which changes to capture and how to capture them (like in query-based CDC), including deleted rows (like in log-based CDC).
  • Low latency. Each time a trigger fires, it counts as an event, and events can be processed in real time or near real time.

Disadvantages

  • Extremely complex maintenance. Just like queries in query-based CDC, all triggers need to be maintained individually. So, if you have a database with 200 tables and need to capture changes for all of them, your overall maintenance cost will be very high.

Implementation Insights

As the CEO of a data integration company, I’ve had a lot of experience implementing CDC on scales large and small. Here are a few things I’ve learned along the way.

Different Implementations for Different Logs

Log-based CDC is particularly complex. This is because all logs—e.g., BinLog for MySQL, WAL for Postgres, Redo Log for Oracle, Oplog for MongoDB—although conceptually the same, are implemented differently. You will therefore need to dive deep into the low-level parameters of your chosen database to get things working.

Writing Data Changes to the Target Destination

You will need to determine how exactly to insert, update, and delete data in your target destination.

In general, inserting is easy, but volume plays a big role in dictating approach. Whether you use batch insert, data streaming, or decide to load changes using a file, you will always face technology tradeoffs.

To ensure proper updating and avoid unnecessary duplicates, you will need to define a virtual key on top of your tables that tells your system what should be inserted and what should be updated.

To ensure proper deleting, you will need to have some failsafe mechanism to make sure that bad implementation won’t cause deletion of all the data in the target table.
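Below is a hedged sketch of both points in Python with SQLite (the upsert syntax is similar in PostgreSQL and other engines). The change batch is applied with an upsert keyed on the virtual key, here simply the id column, and deletes are refused if the batch would remove an implausibly large share of the target table. The 50% threshold is an arbitrary illustration, not a recommendation:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT)")

changes = [(1, "a@example.com"), (2, "b@example.com"), (1, "a.new@example.com")]

# Upsert keyed on the virtual key: new keys are inserted, existing keys are
# updated in place, and no duplicates are created.
conn.executemany(
    "INSERT INTO customers (id, email) VALUES (?, ?) "
    "ON CONFLICT(id) DO UPDATE SET email = excluded.email",
    changes,
)

# Failsafe for deletes: refuse a batch that would wipe out most of the target.
delete_ids = [2]
total = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
if total and len(delete_ids) / total > 0.5:
    raise RuntimeError("Refusing delete batch: would remove more than half the target rows")
conn.executemany("DELETE FROM customers WHERE id = ?", [(i,) for i in delete_ids])

print(conn.execute("SELECT * FROM customers").fetchall())  # [(1, 'a.new@example.com')]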

Maintaining Long-Running Jobs

If you are transferring only a few rows, things will be quite easy, but if this is the case, then you probably don’t need CDC. So, in general, we can expect CDC jobs to take several minutes or even hours, and this will require reliable mechanisms for monitoring and maintenance.

Error Handling

This could be the topic of a separate article altogether. But, in short, I can say that each technology has a different way of raising exceptions and presenting errors. So, you should define a strategy for what to do if a connection fails. Should you retry it? Should you encapsulate everything in transactions?


Implementing CDC data replication in-house is quite complicated and very case-specific. This is why it hasn’t traditionally been a popular replication solution, and also why it’s hard to give general advice about how to implement it. In recent years, however, managed tools like Dataddo, Informatica, SAP Replication Server, and others have significantly lowered the barrier to entry.

Not for All, but Great for Some

As I mentioned at the beginning of this article, CDC has the potential to save a lot of financial resources for companies:

  • Whose main database consists largely of data that doesn’t change frequently (i.e., daily changes account for only a relatively small portion of the data in it)
  • Whose analytics teams need data in near real time
  • That don’t need to retain full snapshots of their main database over time

Nevertheless, there are no perfect technological solutions, only tradeoffs. And the same applies to CDC data replication. Those who choose to implement CDC will have to unequally prioritize flexibility, fidelity, latency, maintenance, and security.
Petr Nemeth is the founder and CEO of Dataddo—a fully managed, no-code data integration platform that connects cloud-based services, dashboarding applications, data warehouses, and data lakes. The platform offers ETL, ELT, reverse ETL, and database replication functionality (including CDC), as well as an extensive portfolio of 200+ connectors, enabling business professionals with any level of technical expertise to send data from virtually any source to any destination. Before founding Dataddo, Petr worked as a developer, analyst, and system architect for telco, IT, and media companies on large-scale projects involving the internet of things, big data, and business intelligence.


The Relevance of Aadhaar in the Era of AI

Previously, we raised a question: “Why Do You Need Worldcoin When You Have Aadhaar?” However, it turns out that despite Aadhaar’s presence, Worldcoin might still hold significance, or vice-versa.

When Aadhaar was introduced, its aim was to provide ‘proof of residence’ with a 12-digit unique identity number. However, now as we move ahead with AI picking up pace, it might become difficult for Aadhaar to serve its purpose.

Proving your personhood might become more difficult in the near future than you might have imagined. In the physical world, it is easy to identify humans based on their physical characteristics. However, advancements in AI will make it difficult to distinguish between AI and humans in the virtual world, highlighting a need for authentic human recognition and verification.

For instance, social media platforms like Instagram and X (formerly Twitter) offer easy account creation processes. Musk recently highlighted Twitter’s bot-related challenges. In order to combat this, he, alongside his team, introduced the verification system with X Premium (previously Twitter Blue), aimed at curbing bots, and implemented new APIs to mitigate their impact. However, these efforts have mostly proven unsuccessful. In a similar vein, Meta has recently been making strides in discerning between AI-generated and human-created content.

With companies like Meta and X facing challenges from the rapid growth of AI, there is a high chance that the Indian government might be left behind. In a recent blog post, Worldcoin expressed its willingness to assist governments in addressing the impending identity crisis. Worldcoin said that World ID is an open-source, permissionless protocol that anyone can use. “All of the developer docs are freely available to the public and the government can freely make use of it,” they said.

Worldcoin vs Aadhaar

By now, we are aware of all the data the Aadhaar card collects. This includes name, date of birth, gender, address, parent/guardian’s name, mobile number, and email ID. The UIDAI also collects biometric information to establish uniqueness: a photograph, 10 fingerprints, and iris scans.

Nevertheless, there’s a significant issue here: all the data and information are interconnected. If someone gains access to your Aadhaar database, they essentially possess a comprehensive profile of you. In one instance in June 2022, a small sample of farmers’ information and corresponding Aadhaar numbers was exposed by the PM-Kisan website, as revealed by TechCrunch. This was not the first time Aadhaar data was leaked; there have been several instances in the past where the privacy of citizens was compromised.

Here, World ID comes into the picture, as its underlying identity protocol is powered by zero-knowledge proofs. A zero-knowledge proof is a cryptographic technique whereby one party can prove something is true to another party without revealing any specific details. Zero-knowledge proofs first appeared in a 1985 paper, “The knowledge complexity of interactive proof systems.”

Another fascinating feature of World ID is that you don’t need to enter any personal information to get or use the World App. This means no name, no phone number, no email, no social profile, no selfie, no passport, etc.

Should Aadhaar Collaborate with Worldcoin?

Recently, Worldcoin announced that it is open to partnering with governments and organisations. This would be the right opportunity for the government of India to join hands with Worldcoin and bring zero-knowledge proof technology to Aadhaar. The trust of the citizens of a nation lies with their government rather than with a company.

Going solo isn’t the solution for Worldcoin either, because the population of India is huge and it will take an exceptional effort to cover the whole country. As of 30 November 2022, the Authority had issued 135.1071 crore Aadhaar numbers to the residents of India.

With Aadhaar offices already present all over the country, the government should take measures to make Aadhaar relevant in the era of generative AI; otherwise, it will become very difficult to catch up later on, and Aadhaar may well be replaced by Worldcoin. Aadhaar and UPI have shown what India can offer the world, and India has the chance to show it can do it again.


Do you need a speech therapist? Now you can consult AI


Speech disorders are common, with an estimated 17.9 million US adults reporting to have experienced a problem with their voice during the past 12 months, according to the National Institute on Deafness and Other Communication Disorders.

Accessibility to specialist resources, such as speech pathologists, is essential to helping patients work through their speech impediments and improve their overall quality of life.


Better Speech is one online provider that boosts accessibility by providing speech therapy for children and adults, helping to treat speech delays, voice disorders, aphasia, stuttering, and more from the comfort of an individual's home.

Now the company is introducing Jessica, a generative AI-powered Speech Therapist, which it hopes will further increase accessibility to provision and reduce costs.

Jessica will use AI algorithms to create personalized therapy sessions for each client's needs.


The AI speech assistant will also have advanced speech-recognizing and natural language-processing abilities that can "assess speech patterns, identify areas for improvement, and deliver targeted interventions," according to the company's press release.

The video below shows a demonstration of Jessica teaching a patient how to pronounce "rabbit" and then giving personalized feedback after hearing the student's response.

"This revolutionary technology should transform the lives of millions by making high-quality speech therapy accessible and convenient to many people who currently can't afford speech therapy," said Ranan Lachman, CEO of Better Speech.

Better Speech is working with the American Speech and Hearing Association to create Category III medical insurance codes to help provide more patients with access to the technology.

The company will also gift Jessica to 1,000 children in underdeveloped countries who would otherwise not have access to speech therapy.

Since Jessica is an AI model, it will become more advanced as it has more interactions with patients, learning and becoming more intelligent every day.


Lachman emphasizes that although the tool is very capable and will only become smarter, it is not meant to replace therapists — yet.

"Our software is designed to augment our service and serve as a practicing tool, not to replace our speech therapists," said Lachman.

"We have no doubt that, soon, Jessica will become the best speech therapist in the world, being trained on tens of thousands of patients, from children to seniors, and would be able to assist with most speech impediments."


AI’s Analogical Reasoning Abilities: Challenging Human Intelligence?

Analogical reasoning, the unique ability that humans possess to solve unfamiliar problems by drawing parallels with known problems, has long been regarded as a distinctive human cognitive function. However, a groundbreaking study conducted by UCLA psychologists presents compelling findings that might push us to rethink this.

GPT-3: Matching Up to Human Intellect?

The UCLA research found that GPT-3, an AI language model developed by OpenAI, demonstrates reasoning capabilities almost on par with college undergraduates, especially when tasked with solving problems akin to those seen in intelligence tests and standardized exams like the SAT. This revelation, published in the journal Nature Human Behaviour, raises an intriguing question: Does GPT-3 emulate human reasoning due to its extensive language training dataset, or is it tapping into an entirely novel cognitive process?

The exact workings of GPT-3 remain concealed by OpenAI, leaving the researchers at UCLA inquisitive about the mechanism behind its analogical reasoning skills. Despite GPT-3's laudable performance on certain reasoning tasks, the tool isn’t without its flaws. Taylor Webb, the study's primary author and a postdoctoral researcher at UCLA, noted, “While our findings are impressive, it's essential to stress that this system has significant constraints. GPT-3 can perform analogical reasoning, but it struggles with tasks trivial for humans, such as utilizing tools for a physical task.”

GPT-3's capabilities were put to the test using problems inspired by Raven’s Progressive Matrices – a test involving intricate shape sequences. By converting images to a text format GPT-3 could decipher, Webb ensured these were entirely new challenges for the AI. When compared to 40 UCLA undergraduates, not only did GPT-3 match human performance, but it also mirrored the mistakes humans made. The AI model accurately solved 80% of the problems, exceeding the average human score yet falling within the top human performers' range.

The team further probed GPT-3’s prowess using unpublished SAT analogy questions, with the AI outperforming the human average. However, it faltered slightly when attempting to draw analogies from short stories, although the newer GPT-4 model showed improved results.

Bridging the AI-Human Cognition Divide

UCLA's researchers aren't stopping at mere comparisons. They've embarked on developing a computer model inspired by human cognition, constantly juxtaposing its abilities with commercial AI models. Keith Holyoak, a UCLA psychology professor and co-author, remarked, “Our psychological AI model outshined others in analogy problems until GPT-3's latest upgrade, which displayed superior or equivalent capabilities.”

However, the team identified certain areas where GPT-3 lagged, especially in tasks requiring comprehension of physical space. In challenges involving tool usage, GPT-3's solutions were markedly off the mark.

Hongjing Lu, the study’s senior author, expressed amazement at the leaps in technology over the past two years, particularly in AI's capability to reason. But, whether these models genuinely “think” like humans or simply mimic human thought is still up for debate. The quest for insights into AI's cognitive processes necessitates access to the AI models' backend, a leap that could shape AI's future trajectory.

Echoing the sentiment, Webb concludes, “Access to GPT models' backend would immensely benefit AI and cognitive researchers. Currently, we're limited to inputs and outputs, and it lacks the decisive depth we aspire for.”

After Google, Zoom Updates its Privacy Policy to Train AI Models


After Google, the online video communication platform Zoom is the latest tech company to update its privacy policy regarding the collection of user data to train AI models.

“There is no opt-out for paid customers, doctors, therapists, lawyers, or others who routinely need to discuss confidential matters. I’ve had a paid subscription for six years; I just cancelled it, and I recommend that you and your employer cancel yours until Zoom backs down,” said Greg Wilson, Senior Software Engineering Manager at Deep Genomics, in a LinkedIn post.

Wilson quickly clarified that Zoom has a special license for healthcare with a different privacy policy and price. But this special license doesn’t seem to apply to people like lawyers, religious leaders, or staff at family shelters. Users find this to be a huge breach of privacy.


Taking Google’s Lead

As per the new policy, Zoom has complete control over the data generated during Zoom calls. The company can change, share, and use this data however it wants, within the bounds of the law, and the data will be used to train its AI models.

In the updated terms, Zoom can also do a lot of things with the stuff you share on their platform. They can use, change, show, and distribute it around the world forever, without paying you, and even let others do the same.

Furthermore, according to the revised terms in section 10.4, Zoom has acquired an enduring and global license that allows them to freely distribute, publish, access, utilize, store, transmit, examine, reveal, safeguard, extract, alter, reproduce, share, display, copy, distribute, translate, transcribe, generate new versions, and handle Customer Content. This license is non-exclusive, meaning others can also use the content, and it can be sublicensed or transferred to others.

The viral post also drew a response from Aparna Bawa, chief operating officer at Zoom, who said that the purpose of clause 10.2 is to ensure transparency regarding data usage, aiming to improve the user experience. This involves analysing customer usage patterns, such as peak times in specific time zones, to optimize data center load balancing.

Regarding customer content and generative AI, Bawa clarifies that it involves a distinct approach. New generative AI features, such as team chat composition and meeting summary, are available for free trial. Customers have the choice to activate these features and independently decide whether to share their customer content with Zoom to aid product enhancement. This is an opt-in process, and participants are informed within meetings or chat pop-ups when these AI features are enabled via the user interface.

However, several users have raised concerns about transparency and compliance with privacy regulations. One user called for clearer communication to address worries about AI models being trained on recorded meetings, while another questioned the use of meetings and webinars for AI training even without cloud recording. Legal aspects also come into play, as one person asked whether such a response is legally binding compared to the broader terms. Others emphasise the necessity of GDPR consent and raise specific concerns related to data usage.

A month ago, Google updated its privacy policy to permit the collection of publicly available data for training AI models, potentially strengthening its products like Bard and Cloud AI. However, this move raises worries about tech giants monopolizing the internet and the erosion of its open nature. This change in policy echoes concerns about OpenAI’s practices, inviting regulatory scrutiny.



The Importance of Data Cleaning in Data Science


In data science, the accuracy of predictive models is vitally important to ensure any costly errors are avoided and that each aspect is working to its optimal level. Once the data has been selected and formatted, the data needs to be cleaned, a crucial stage of the model development process.

In this article, we will provide an overview of the importance of data cleaning in data science, including what it is, the benefits, the data cleaning process, and the commonly used tools.

What Is Data Cleaning?

In data science, data cleaning is the process of identifying incorrect data and fixing the errors so the final dataset is ready to be used. Errors could include duplicate fields, incorrect formatting, incomplete fields, irrelevant or inaccurate data, and corrupted data.


In a data science project, the cleaning stage comes before validation in the data pipeline. In the pipeline, each stage ingests input and creates output, improving the data each step of the way. The benefit of the data pipeline is that each step has a specific purpose and is self-contained, meaning the data is thoroughly checked.

The Importance of Data Cleaning in Data Science

Data seldom arrives in a readily usable form; in fact, it can be confidently stated that data is never flawless. When collected from diverse sources and real-world environments, data is bound to contain numerous errors and adopt different formats. Hence, the significance of data cleaning arises — to render the data error-free, pertinent, and easily assimilated by models.

When dealing with extensive datasets from multiple sources, errors can occur, including duplication or misclassification. These mistakes greatly affect algorithm accuracy. Notably, data cleaning and organization can consume up to 80% of a data scientist's time, highlighting its critical role in the data pipeline.

Examples of Data Cleaning

Below are three examples of how data cleaning can fix errors within datasets.

Data Formatting

Data formatting involves transforming data into a specific format or modifying the structure of a dataset. Ensuring consistency and a well-structured dataset is crucial to avoid errors during data analysis. Therefore, employing various techniques during the cleaning process is necessary to guarantee accurate data formatting. This may encompass converting categorical data to numerical values and consolidating multiple data sources into a unified dataset.
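As a brief illustration in Python with pandas (the column names, formats, and sample values are hypothetical), here is how two inconsistently formatted sources might be standardized and consolidated, including converting a categorical column to numeric codes:

import pandas as pd

# Two hypothetical sources with inconsistent date and number formats.
sales_eu = pd.DataFrame({"date": ["2023-08-01"], "tier": ["gold"], "amount": ["1,200.50"]})
sales_us = pd.DataFrame({"date": ["08/02/2023"], "tier": ["Silver"], "amount": ["980.00"]})

# Parse each source's date format before consolidating into one dataset.
sales_eu["date"] = pd.to_datetime(sales_eu["date"], format="%Y-%m-%d")
sales_us["date"] = pd.to_datetime(sales_us["date"], format="%m/%d/%Y")
df = pd.concat([sales_eu, sales_us], ignore_index=True)

# Strip thousands separators and convert the categorical tier to numeric codes.
df["amount"] = df["amount"].str.replace(",", "").astype(float)
df["tier_code"] = df["tier"].str.lower().astype("category").cat.codes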

Empty/Missing Values

Data cleaning techniques play a crucial role in resolving data issues such as missing or empty values. These techniques involve estimating and filling in gaps in the dataset using relevant information.

For instance, consider the location field. If the field is empty, scientists can populate it with the average location data from the dataset or a similar one. Although not flawless, having the most probable location is preferable to having no location information at all. This approach ensures improved data quality and enhances the overall reliability of the dataset.
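A minimal pandas sketch of the idea, treating the “average” location as the most frequent (modal) value since location is categorical; the column names and data are hypothetical:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "customer": ["A", "B", "C", "D"],
    "location": ["Prague", "Prague", np.nan, "Berlin"],
})

# Fill missing locations with the most common value in the column; imperfect,
# but a probable location is preferable to no location at all.
most_common = df["location"].mode()[0]
df["location"] = df["location"].fillna(most_common)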

Identifying Outliers

Within a dataset, certain data points may lack any substantive connection to others (e.g., in terms of value or behavior). Consequently, during data analysis, these outliers possess the ability to significantly distort results, leading to misguided predictions and flawed decision-making. However, by implementing various data cleaning techniques, it is possible to identify and eliminate these outliers, ultimately ensuring the integrity and relevance of the dataset.
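One common way to flag such points, sketched here with pandas and the interquartile-range rule (the data and the 1.5 multiplier are illustrative assumptions):

import pandas as pd

df = pd.DataFrame({"order_value": [24, 31, 28, 35, 27, 30, 4500]})

# IQR rule: values more than 1.5 * IQR outside the middle 50% are flagged.
q1, q3 = df["order_value"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = df["order_value"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

outliers = df[~in_range]   # inspect before removing; some outliers are genuine signal
df_clean = df[in_range]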

The Benefits of Data Cleaning

Data cleaning provides a range of benefits that have a significant impact on the accuracy, relevance, usability, and analysis of data.

  • Accuracy — Using data cleaning tools and techniques significantly reduces errors and inaccuracies contained in a dataset. This is important for data analysis, helping to create models that make accurate predictions.
  • Usability — Once cleaned and correctly formatted, data can be applied to a number of use cases, making it much more accessible so it can be used in a range of project types.
  • Analysis — Clean data makes the analysis stage much more effective, allowing analysts to gain greater insights and deliver more reliable results.
  • Efficient Data Storage — By removing unnecessary and duplicate data, storage costs are reduced as only relevant, valuable data needs to be retained, whether that is on an on-site server or a cloud data warehouse.
  • Governance — Data cleaning can help organizations adhere to strict regulations and data governance, protecting the privacy of individuals and avoiding any penalties. More data compliance laws have been enacted in recent months. An example is the recent Texas consumer privacy law (TDPSA), which prohibits certain data practices such as gathering personal customer data that is not reasonably necessary for the purpose of collection.

The Data Cleaning Process: 8 Steps

The data cleaning stage of the data pipeline is made up of eight common steps:

  • The removal of duplicates
  • The removal of irrelevant data
  • The standardization of capitalization
  • Data type conversion
  • The handling of outliers
  • The fixing of errors
  • Language Translation
  • The handling of any missing values

1. The Removal of Duplicates

Large datasets that utilize multiple data sources are highly likely to have errors, including duplicates, particularly when new entries haven't undergone quality checks. Duplicate data is redundant and consumes unnecessary storage space, necessitating data cleansing to enhance efficiency. Common instances of duplicate data comprise repetitive email addresses and phone numbers.
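A short pandas sketch of the idea, deduplicating on the email and phone fields mentioned above (the sample data is hypothetical):

import pandas as pd

df = pd.DataFrame({
    "email": ["jane@example.com", "jane@example.com", "sam@example.com"],
    "phone": ["555-0101", "555-0101", "555-0199"],
    "source": ["crm", "webform", "crm"],
})

# Rows with the same email and phone are treated as duplicates, regardless of
# which source they came from; only the first occurrence is kept.
df = df.drop_duplicates(subset=["email", "phone"], keep="first")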

2. The Removal of Irrelevant Data

To optimize a dataset, it is crucial to remove irrelevant data fields. This will result in faster model processing and enable a more focused approach toward achieving specific goals. During the data cleaning stage, any data that does not align with the scope of the project will be eliminated, retaining only the necessary information required to fulfill the task.

3. The Standardization of Capitalization

Standardizing text in datasets is crucial for ensuring consistency and facilitating easy analysis. Correcting capitalization is especially important, as it prevents the creation of false categories that could result in messy and confusing data.
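For example, in pandas (sample values are hypothetical), normalizing case and stray whitespace prevents “london”, “LONDON”, and “London ” from becoming three separate categories:

import pandas as pd

df = pd.DataFrame({"city": ["london", "LONDON", "London ", "berlin"]})

# Standardize capitalization and trim whitespace so equivalent values
# fall into a single category during analysis.
df["city"] = df["city"].str.strip().str.title()
print(df["city"].unique())  # ['London' 'Berlin']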

4. Data Type Conversion

When working with CSV data in Python, analysts often rely on Pandas, the go-to data analysis library. However, there are instances where Pandas falls short in processing data types effectively. To guarantee accurate data conversion, analysts employ cleaning techniques. This ensures that the correct data is easily identifiable when applied to real-life projects.
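A brief pandas sketch of explicit conversion (the column names and bad values are hypothetical); unparsable entries become missing values instead of silently leaving the whole column as strings:

import pandas as pd

df = pd.DataFrame({
    "price": ["19.99", "N/A", "24.50"],
    "ordered_on": ["2023-08-01", "2023-08-02", "not recorded"],
})

# Convert types explicitly; errors="coerce" turns unparsable values into
# missing values that can then be handled in the missing-value step.
df["price"] = pd.to_numeric(df["price"], errors="coerce")
df["ordered_on"] = pd.to_datetime(df["ordered_on"], errors="coerce")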

5. The Handling of Outliers

An outlier is a data point that lacks relevance to other points, deviating significantly from the overall context of the dataset. While outliers can occasionally offer intriguing insights, they are typically regarded as errors that should be removed.

6. The Fixing of Errors

Ensuring the effectiveness of a model is crucial, and rectifying errors before the data analysis stage is paramount. Such errors often result from manual data entry without adequate checking procedures. Examples include phone numbers with incorrect digits, email addresses without an "@" symbol, or unpunctuated user feedback.
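A small pandas sketch of such checks (the validation rules and sample data are simplified assumptions): flag rows whose email lacks an “@” or whose phone number has the wrong number of digits, so they can be corrected or reviewed:

import pandas as pd

df = pd.DataFrame({
    "email": ["jane@example.com", "sam.example.com", "li@example.com"],
    "phone": ["5550101", "555010", "5550199"],
})

# Simple validity checks; failing rows are surfaced for correction rather
# than silently flowing into the analysis stage.
bad_email = ~df["email"].str.contains("@", na=False)
bad_phone = df["phone"].str.len() != 7
suspect_rows = df[bad_email | bad_phone]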

7. Language Translation

Datasets can be gathered from various sources written in different languages. However, when using such data for machine translation, evaluation tools typically rely on monolingual Natural Language Processing (NLP) models, which can only handle one language at a time. Thankfully, during the data cleaning phase, AI tools can come to the rescue by converting all the data into a unified language. This ensures greater coherence and compatibility throughout the translation process.

8. The Handling of Any Missing Values

One of the last steps in data cleaning involves addressing missing values. This can be achieved by either removing records that have missing values or employing statistical techniques to fill in the gaps. A comprehensive understanding of the dataset is crucial in making these decisions.
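Both options look roughly like this in pandas (illustrative data; the right choice depends on the dataset, as noted above):

import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [34, np.nan, 29, 41], "score": [0.8, 0.6, np.nan, 0.9]})

# Option 1: drop records with missing values (simple, but discards data).
dropped = df.dropna()

# Option 2: fill gaps with a column statistic, here the median of each column.
imputed = df.fillna(df.median(numeric_only=True))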

Summary

The importance of data cleaning in data science should never be underestimated, as it can significantly impact the accuracy and overall success of a data model. Without thorough data cleaning, the data analysis stage is likely to output flawed results and incorrect predictions.

Common errors that need to be rectified during the data cleaning stage are duplicate data, missing values, irrelevant data, outliers, and converting multiple data types or languages into a single form.
Nahla Davies is a software developer and tech writer. Before devoting her work full time to technical writing, she managed — among other intriguing things — to serve as a lead programmer at an Inc. 5,000 experiential branding organization whose clients include Samsung, Time Warner, Netflix, and Sony.


The DPDP Bill was Approved by the Lok Sabha Today

The Digital Personal Data Protection Bill of 2023 has been approved by the Lok Sabha through a voice vote today. Minister Ashwini Vaishnaw, in his bill introduction, emphasised its objective to safeguard the right to privacy. He informed that the bill underwent comprehensive discussions within the IT Standing Committee and underwent thorough consultation procedures.

The minister said, “If there is a natural disaster somewhere, should the authorities then be giving data notices or taking consent, or should they be safeguarding lives? If the police have to catch a criminal, should the police care about writing forms or catching the criminal?”

Key Highlights of the Bill

– The bill focuses on user-friendly language.

– Emphasis on using she/her pronouns.

– Adherence to fundamental principles such as legality, purpose limitation, data minimisation, accuracy, storage limitation, reasonable safeguards, and accountability.

– Introduction of voluntary engagement and alternative dispute resolution, enabling entities to approach the Data Protection Board (established under the bill) for rectifying any errors.

Drawing a comparison, the Minister likened the Bill to the European Union’s GDPR, highlighting that while the GDPR has 16 exemptions, the DPDP only has 4.

He further said, “The DPDP Bill has inculcated several provisions for data protection. Firstly, principle of legality where if any platform takes any data, it mandates that the data should be taken legally. Secondly, the principle of purpose limitation mandates that the data should only be taken for the purpose that is mentioned…,” adding that the other principles include, that of data minimisation, principle of accuracy, storage limitation and reasonable safeguard.

Support for the DPDP came from BJP MP PP Chaudhary, former Chairperson of the Joint Parliamentary Committee for the PDP Bill. He noted that consent can be revoked at any time and that government services providing benefits do not necessitate consent.

However, YSRCP MP Srikrishna Lavu expressed concerns. He pointed out the absence of a clear definition of ‘harm’ for determining compensation and the lack of provisions for data portability and the right to be forgotten in the bill. He also raised concerns about executive control over the Data Protection Board. Lavu brought attention to the impact on children’s data, stating that the Bill could restrict internet access for those under 18, even though those over 14 are allowed to work. He underscored the reliance on parental consent, considering the limited computer and internet literacy among Indians.

Shiv Sena MP Shrikant Shinde highlighted the bill’s potential to combat pesky spam calls.

BSP MP Ritesh Pandey expressed apprehension over the Union government’s retained authority concerning individual data rights. He labelled the Union government as the principal fiduciary and voiced concerns about content blocking based on the Data Protection Board’s decisions.

BJP MP Sanjay Seth raised concerns about children’s data collection during shopping and how the bill would affect targeted advertising by play-schools after its enactment.

Various concerns were also raised about the proposed amendment to Section 8(1)(j) of the RTI Act, 2005, centralisation of power through exemptions, potential state surveillance, and increased censorship.

The Minister provided explanations for these concerns: first, the bill has different rules for different apps depending on how old you are; second, parents can use Digilocker to agree to things; third, there are important parts in the bill called “Right to be Forgotten” and “Independence of the DPB”; fourth, the Parliament can’t make all the rules it wants, there are some restrictions; fifth, a group of organisations will team up (DPB – TDSAT – SC); and finally, one change got approved to fix something in Section 37.


One Model lands $41M to bring data science-powered insights to HR

Kyle Wiggers

One Model, a platform that uses AI to help employers make decisions about recruiting, hiring, promotions, layoffs and general workplace planning, today announced that it’s raised $41 million in a funding round led by Riverwood Capital.

Christopher Butler, CEO of One Model, said that the capital will be used to “boost several of the company’s growth initiatives,” particularly in the areas of technology, product development, customer success and go-to-market.

“One Model’s people analytics product roadmap will be expanded to solve problems for a diverse array of data science, analyst, people manager and C-level audiences, delivering tailored content proactively through alerts, notifications and individualized reporting,” Butler told TechCrunch via email. “Further investment in One AI will provide more analysts and decision-makers with actionable forecasts, all from a powerful ethical and data governance posture.”

One Model is what’s known as a “people analytics” platform — a platform designed to collect and apply organizational and talent data to improve business outcomes, at least in theory. There’s long been high interest in people analytics; according to a 2018 Deloitte survey, 84% of large organizations rated people analytics as “important” or “very important” and 69% had already formed a people analytics team.

And it’s showing no signs of slowing. By one estimate, the market for people analytics software will grow from $2.58 billion in 2022 to $7.67 billion by 2031.

One Model’s founding team, which includes Butler, hails from Inform, a people analytics service that was acquired by SuccessFactors, now SAP SuccessFactors, in 2010. Following the acquisition, they say they witnessed a growing disconnect between what customers were requesting to do with their people data and the solutions being provided in the market.

“Customers want greater access and more flexibility to manage and democratize access to people data and analysis,” Butler said. “More data is being generated and collected by the tools we use to manage the workforce and rarely are they integrated with each other. Every new technology and vendor in the HR space generates data, and organizations are not equipped to manage this data complexity.”

Butler describes One Model as an “enterprise people data orchestration platform.” That’s quite a mouthful. But what One Model essentially does is provide a toolkit for extracting, modeling and governing HR data as well as delivering that data to various applications and services.

One Model can perform basic tasks like identifying areas where a company has a shortage of skills or talent and projecting future workforce needs based on demographic changes and business goals. But beyond this, the platform can calculate the cost of turnover and headcount, attempting to create a plan that measures and reduces this cost over time.


One Model’s monitoring dashboard, which unifies HR data in one place. Image Credits: One Model

“We recognized an opportunity to transform large organizations by enhancing enterprise people strategies through data insights,” Butler said. “One Model is addressing persistent, age-old data integration challenges that large organizations face, while also tackling modern concerns around ensuring consistent, auditable talent decisions with robust privacy standards and data governance.”

One Model customers can integrate different data sources and data destinations to create reports, dashboards and visualizations. They also gain access to One AI, One Model’s data science suite, which can be used to perform a range of workloads centering around HR data.

For example, with One AI, a user can try to predict the likelihood that an employee resigns within a specified time frame — factoring in aspects like their daily commute and the time since their last promotion. (One wonders how accurate these predictions are, of course, given AI’s potential to pick up on biases in practically any dataset.) Or they could identify opportunities to improve their company workforce’s efficiency, perhaps by reallocating resources or adjusting work schedules.

There’s competition in the people analytics space — see startups like ChartHop and Knoetic. But One Model claims to have built up a sizeable customer base, counting brands including Colgate-Palmolive, Squarespace, Robinhood and Airtable among its list of clients.

“We believe One AI takes a vastly different approach than what is available on the market today,” Butler said. “And we strongly believe in only the most ethical applications of advanced data science because it’s possible to make rational, impactful people decisions fairly and equitably … Our mission is clear: to ensure every workforce decision taken by an enterprise is the best one possible, informed by all relevant enterprise data, and executed in the most transparent and ethical manner possible.”

One Model’s newest tranche brings the Austin, Texas-based startup’s total raised to $44.8 million.

OpenAI Now Crawls the Internet with GPTBot


Amidst controversies about scraping websites from the internet without consent, OpenAI has released GPTBot for crawling websites automatically. This bot will gather publicly available data for training AI models, which the company says will be done in a transparent and responsible manner.

OpenAI said in its documentation about the release that the web crawler will filter out sources that require paywall access, as well as personally identifiable information (PII) and text that violates its policies. The GPT creator claims allowing the bot can help improve the accuracy and capabilities of AI systems in the future. It can be identified by the following user-agent strings:

User agent token: GPTBot
Full user-agent string: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)

On the other hand, you can also prevent GPTBot from accessing your site by adding GPTBot to your site’s robots.txt. This means that website owners have to voluntarily take a step to disable OpenAI’s access to their website, rather than opting in to training.

User-agent: GPTBot
Disallow: /

You can also control GPTBot’s access to certain parts of your website by including the code below in robots.txt.

User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/

Though OpenAI acknowledges that it scrapes the internet for training its large language models like GPT-4, this still looks like a half-baked approach to addressing the ethical dilemmas around copying data from other people’s websites.

People on Hacker News discussed the ethics of releasing this web crawler for training AI models. “OpenAI isn’t even citing in moderation. It’s making a derivative work without citing, thus obscuring it,” said one user. Moreover, OpenAI does not acknowledge the websites it has already used to build its models.

Recently, OpenAI also filed a trademark application for ‘GPT-5’, hinting that the company is training the next version of GPT-4, which, according to several reports, would be close to AGI, the company’s goal all this while. GPTBot is clearly going to help the company gather more data from across the internet to train this model. On the other hand, the company also discontinued its AI Classifier for detecting GPT-generated text.


7 Transformative Accessibility Tech Solutions

In 2022, the Global Assistive Technology market was valued at $21.95 billion and is expected to hit $31.22 billion by 2030. Big tech companies create accessibility products to demonstrate inclusivity, comply with legal requirements, and tap into a growing market. Such products enhance the user experience, foster innovation, and provide a competitive advantage. Here are a few of the products and services offered by big tech companies such as Apple, Microsoft and Google to build an inclusive environment for the visually and hearing impaired.

Voice Control

Apple’s Voice Control is an advanced accessibility feature available on iOS and macOS devices. It allows users with motor impairments or limited physical dexterity to control their Apple devices entirely through voice commands. This feature goes beyond traditional voice assistants, providing comprehensive control over the entire operating system and various apps, making it an essential tool for individuals with disabilities seeking a more accessible and independent user experience.

Be My Eyes

Starting from 2012, Danish startup Be My Eyes has been dedicated to developing technology for the visually impaired community, which consists of over 250 million people who are blind or have low vision. With the introduction of GPT-4, they are now working on creating a GPT-4 powered Virtual Volunteer integrated into their app. This virtual volunteer aims to offer the same level of context and understanding as a human volunteer, leveraging the visual input capabilities of GPT-4 to better support the visually impaired community.

Seeing AI

Seeing AI is an innovative and powerful mobile app developed by Microsoft for iOS devices, designed to assist people with visual impairments. The app harnesses the capabilities of artificial intelligence and computer vision to provide real-time assistance to users in understanding the world around them.

Ease of Access Center

Windows Ease of Access Center is a central hub of accessibility features and settings built into the Windows operating system. It is designed to make Windows PCs more user-friendly and accessible for individuals with disabilities or impairments. The Ease of Access Center provides a comprehensive set of tools such as Narrator (screen reader), Magnifier (screen magnification), and Speech Recognition, that cater to various accessibility needs, allowing users to customize their computing experience according to their requirements.

Google TalkBack

Google’s Android Accessibility Suite – a comprehensive collection of accessibility features and services to make Android devices more inclusive and accessible to users with diverse disabilities – offers services such as TalkBack and Live transcribe.

This is a screen reader that provides audible feedback to users with visual impairments. It reads aloud on-screen text, buttons, icons, and other elements, enabling users to navigate their devices, interact with apps, and access information effectively. There is also a TalkBack Braille feature that integrates with external braille devices, allowing blind users to read and navigate the interface through braille input and output.

Google Live Transcribe

It is designed to assist individuals with hearing impairments by providing real-time speech-to-text transcription. The app uses the device’s microphone to capture spoken words and instantly converts them into written text on the screen, allowing users to read and follow conversations or speeches in real-time.

Switch Control

Apple’s Switch Control is a powerful accessibility feature available on iOS and macOS devices. It is designed to cater to individuals with motor impairments, physical disabilities, or limited dexterity, providing them with an alternative and customizable way to interact with their Apple devices. Switch Control allows users to control their devices using external adaptive switches, which can be physical buttons, Bluetooth-enabled devices, or other assistive technology.
