How to build a robust data science portfolio from scratch

It’s always wise to craft a killer data science portfolio if you want to get noticed in this increasingly competitive and in-demand niche. Of course, achieving this is easier said than done, particularly if you’re getting started with nothing more than a dream of eventual career success.

So with all that in mind, it’s time to lean in, because this is where you’ll get that game plan to flex your skills and turn heads in the industry.

Building a home to showcase your data science mastery

Setting up a slick website is like buying a plot in the swanky part of town for your data science prowess to reside. It’s all about first impressions, and you want yours to scream “data wizard” from the get-go.

First things first: snag yourself some dependable hosting via a provider like NameHero. This will keep your site zipping along faster than a hyperloop on steroids, which is key because nobody’s got the time for loading hourglasses.

Once hosted, pop open WordPress or any other CMS (Content Management System) you vibe with and start rolling out the red carpet. Choose a clean theme that’s easy on the eyes; think minimalist chic, where your projects are the main attraction rather than hidden behind flashy gimmicks. Content is king, but context? That’s the kingdom!

The art of showcasing your data skills

Alright, the rampage through the CMS jungle is done, so now it’s time to fill your new digital crib with some eye-popping data science artifacts. We’re talking about projects that show off your knack for shoveling through data mines and uncovering shiny nuggets of insight.

Kick things off with a couple of your heavyweight champs—the flagship projects that make you puff out your chest a bit. Maybe that time you predicted stock prices like you had a crystal ball, or when you crushed a Kaggle competition. Detail those bad boys! Put up the problem statement, sprinkle in some methodology magic, code snippets if they’re not longer than a CVS receipt, and most crucially—results that actually mean something outside of numberland!

Throw some smaller, quirky experiments into the mix too, because personality wins prizes. These can be little side hustles where maybe you played matchmaker with datasets or just showcased dope visualization tricks.

Code talk: Let your repos do the gabbing

Your website is just one piece of the Colosseum where you’ll unleash your data gladiator skills. Next up? Say hello to our friend, GitHub (or GitLab, Bitbucket—pick your fighter). This place is less about glossy images and more like that gritty garage where all the behind-the-scenes handiwork happens.

Fling those scripts and notebook files up on a public repo. Make them shine with READMEs so riveting you could swap them in for bedtime stories. They should guide any lost soul through your code maze with the grace of a gazelle leaping across the Savannah.

Tag each project with relevant keywords—machine learning, neural networks, data visualization; SEO isn’t just for marketers. It’s also dope for helping fellow code warriors find and frolic through your projects.

And hey, throw in some contributions to other repositories too! Bug fixes, feature additions—the works. It flexes your collaborative muscle while subtly yelling: “I play well with others.”

So next time someone comes knocking wanting proof that you can walk that data science talk? Give them directions straight to this goldmine!

Networking nirvana: Social proof and professional potion

Ok, so you’ve built a killer site, your repos could make a grown coder weep with joy, but if no one knows you or your work…well, that’s like dropping the mic in an empty stadium. Enter networking—the career equivalent of a turbo boost.

Slide into LinkedIn and let’s get buzzing! Jazz up your profile to mirror your portfolio’s themes. Chirrup about project updates; spar in discussions; pen some thought-leadership pieces that scream “I know my onions.” Plus connecting with other data science maestros—total power move.

An offline presence also matters; hitting up meetups or conferences not only scores you wisdom points but might even buddy you up with mentors who’ve seen things first-hand.

Final thoughts

Alright, you got this! Forge that data science portfolio with the zeal of a blacksmith at an anvil and the precision of a jeweler setting diamonds. Unleash those skills out into the wild and let them roar. Keep it real, keep it updated, and most importantly—keep coding!

CMU, Stanford Unveil Gaussian Adaptive Transformer

Researchers from Carnegie Mellon University, San Diego State University, and Stanford University have unveiled a research paper titled ‘Gaussian Adaptive Attention is All You Need: Robust Contextual Representations Across Multiple Modalities’, which aims to reshape contextual representations.

The authors of the paper include Aman Chadha, Aaron Elkins, and George Ioannides.

At the core of their research lies the Multi-Head Gaussian Adaptive Attention Mechanism (GAAM) and the Gaussian Adaptive Transformer (GAT), designed to elevate contextual representations across diverse modalities, including speech, text, and vision.

GAAM introduces learnable mean and variance parameters into its attention mechanism, marking a significant leap in model performance.

With GAAM and GAT, the researchers establish a fully learnable probabilistic attention framework. The learnable mean and variance parameters let the model dynamically recalibrate feature importance, which the authors report substantially enhances model capacity.
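As a rough illustration of what learnable mean and variance parameters could look like, the sketch below weights each feature by a Gaussian centered on the sequence mean plus an offset, with a scale applied to the variance. It is a minimal pure-Python sketch under stated assumptions, not the authors' implementation: the names `gaussian_weights`, `reweight`, `mean_offset`, and `scale` are hypothetical, and in a real model the offset and scale would be trained by gradient descent rather than passed in by hand.

```python
import math


def gaussian_weights(values, mean_offset=0.0, scale=1.0, eps=1e-8):
    """Weight each value by a Gaussian around (mean + mean_offset).

    mean_offset and scale stand in for the learnable mean and variance
    parameters (assumption: a trained model would update them).
    """
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    center = mean + mean_offset          # shifted center of attention
    std = math.sqrt(var) + eps           # eps avoids division by zero
    return [
        math.exp(-(((v - center) / std) ** 2) / (2.0 * scale))
        for v in values
    ]


def reweight(values, mean_offset=0.0, scale=1.0):
    """Scale each value by its Gaussian weight (weights lie in (0, 1])."""
    weights = gaussian_weights(values, mean_offset, scale)
    return [v * w for v, w in zip(values, weights)]


if __name__ == "__main__":
    features = [1.0, 2.0, 3.0, 10.0]
    print(gaussian_weights(features))
    print(reweight(features, mean_offset=-1.0, scale=2.0))
```

Values near the shifted center keep most of their magnitude, while values far from it are damped; changing `mean_offset` or `scale` changes which features count as important.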

The researchers introduce the Importance Factor (IF) as a novel learning-based metric. This metric enhances model explainability within GAAM-based methods, quantitatively evaluating feature significance and thereby improving interpretability.

Through rigorous testing across multiple modalities, the study validates the effectiveness of GAAM within GAT. The findings showcase its superiority in handling highly non-stationary data compared to conventional dot-product attention and earlier Gaussian-based attention mechanisms.

The paper demonstrates the seamless integration of GAAM with Grouped Query Attention, highlighting its compatibility with existing Pre-Trained Models (PTM). This integration not only showcases improved performance but does so with only a marginal increase in learnable parameters.

The post CMU, Stanford Unveil Gaussian Adaptive Transformer appeared first on Analytics India Magazine.

Unlocking team productivity: Integrating data analytics into your Slack workflow

In an era of rapid digital transformation, leveraging data analytics and collaborative tools can be a game changer. One integration that is proving impactful is that of data analytics with Slack. This merger gives teams the ability to engage and make decisions based on real-time insights, ultimately improving productivity.

Why integrate data analytics with Slack?

The reason for integrating data analytics with Slack is clear: it enhances team productivity and enables a data-driven work culture. Here’s why:

• Maximized Productivity: Having data readily accessible in the team's primary communication platform empowers teams to make quicker, better-informed decisions, significantly improving productivity levels.

• Streamlined Operations: Integration optimizes operational flow, eliminating the need for additional platforms or tools. This consolidation of resources within Slack saves time and simplifies workflows.

• Improved Collaboration: Data analytics integration fosters informed dialogue among team members. Collaborative problem-solving becomes more effective when fueled by data-driven insights, facilitating a culture of data-driven decision-making.

• Real-time Insights: Real-time data reports and notifications ensure the team stays up to date on key metrics, presenting opportunities for immediate action and adaptive learning.

• Continual Improvement: Analyzing real-time data opens avenues for continuous process improvement, enabling teams to identify and rectify performance gaps right away.

Integrating data analytics with Slack is a strategic step toward creating a more efficient, agile, and data-informed workspace.

Unleashing productivity: Making the integration work

Integrating data analytics into your Slack workflow can be done through various methods. The following are some hands-on techniques to unlock team productivity:

Analytics dashboard integration

Integrate the dashboard of your chosen data analytics tool into Slack. This could be Google Analytics, Tableau, Power BI, etc. This integration lets team members access critical analytics data without leaving their Slack workspace. Plus, it’s easily shareable within the team, fostering collaboration.

Using bots and apps

Bots and third-party applications extend Slack’s capabilities. Bots such as Statsbot can retrieve analytics data from various external databases and automatically generate reports inside Slack using Slack message automation. Similarly, apps like ArcGIS can visualize the data, making it easier to consume and understand.

Setting up notifications

Timely alerts and notifications can ensure that identified trends, opportunities, or issues are acted upon quickly. For instance, a Google Analytics Slack integration can send you timely traffic-drop alerts. These immediate insights enable faster decision-making and problem-solving, thereby improving team productivity.
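For teams that would rather wire up a simple alert themselves, one possible approach is a Slack incoming webhook, which accepts a JSON body with a `text` field. The sketch below is a minimal example under stated assumptions: the helper names (`build_traffic_alert`, `post_to_slack`), the 30% threshold, and the sample numbers are hypothetical, and the webhook URL is something you would create in your own Slack workspace.

```python
import json
import urllib.request


def build_traffic_alert(metric, current, baseline, drop_threshold=0.3):
    """Return a Slack message payload if `current` fell far enough
    below `baseline`, otherwise None."""
    if baseline <= 0:
        return None  # no meaningful baseline to compare against
    drop = (baseline - current) / baseline
    if drop < drop_threshold:
        return None  # drop is too small to be worth an alert
    return {
        "text": f"Alert: {metric} dropped {drop:.0%} "
                f"(from {baseline} to {current})."
    }


def post_to_slack(webhook_url, payload):
    """POST the payload as JSON to a Slack incoming webhook URL."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


if __name__ == "__main__":
    # Only builds and prints the message; call post_to_slack() with
    # your own webhook URL to actually send it.
    print(build_traffic_alert("sessions", current=600, baseline=1000))
```

Keeping the threshold check separate from the network call makes the alert logic easy to test without touching Slack.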

Wrapping up

Integrating data analytics into your Slack workflow is a smart move for any organization serious about improving productivity. It not only fosters a collaborative, data-driven work culture but also streamlines operations, speeds up decision-making, and promotes continual learning.

More than anything, this integration brings the power of data analytics into everyday conversations, making analytics an essential part of team discussions and decisions. But remember that effective integration is more than simply linking tools together. It involves shifting mindsets and workflows to leverage these capabilities.

Organizations that unlock this integration approach can enjoy a highly streamlined, productive, and data-driven work environment, giving them a significant edge in today's competitive landscape. Consequently, integrating data analytics into your Slack workflow is no longer optional; it is a necessity. Harness this power today and steer your team toward a more productive, efficient, and data-enlightened future.

CMU, Amazon GenAI Unveil Gaussian Adaptive Transformer

Researchers from Carnegie Mellon University, San Diego State University, and the Amazon GenAI team have unveiled a research paper titled ‘Gaussian Adaptive Attention is All You Need: Robust Contextual Representations Across Multiple Modalities’, which aims to reshape contextual representations.

The authors of the paper include Aman Chadha, Aaron Elkins, and George Ioannides.

GAAM introduces learnable mean and variance parameters into its attention mechanism, marking a significant leap in model performance.

The post CMU, Amazon GenAI Unveil Gaussian Adaptive Transformer appeared first on Analytics India Magazine.

The AI radiologists replacement saga: Don’t be misled by the scaremongering – science vs. science fiction

Seven years ago, an unexpected nationwide shortage of radiologists was triggered by a single statement from Professor Geoffrey Hinton.

The statement was:
“I think if you work as a radiologist, you are like the Wile E. Coyote in the cartoon. You are already over the edge of the cliff, but you have not looked down yet. There is no ground underneath. People should stop training as radiologists now. It’s just completely obvious that in five years, deep learning is going to do better than radiologists.”

This was in Nov 2016.

He boldly asserted that the training of radiologists should cease, as AI’s capabilities in image perception were rapidly surpassing those of humans. This proclamation had a profound impact: individuals who were considering specializing in radiology diverted to other paths, fearing job obsolescence in the face of advancing AI technology.

However, the role of a radiologist is not limited to just image perception. They shoulder a wide array of responsibilities, many of which cannot be replicated by AI. This misunderstanding led to increased pressure on an already strained profession. The sensational headlines that followed further discouraged potential trainees, exacerbating the shortage.

The situation also influenced policy-makers, leading to an overestimation of technological capabilities and decisions that didn’t align with the practical realities of healthcare needs.

While AI will undoubtedly reshape the role of radiologists in the future, their necessity remains indisputable. AI should be seen as a tool to aid radiologists, not as a complete replacement for them.

The same lessons apply today.

We should look beyond the scaremongering, the hype, and the sensational threats attributed to AI. Radiologists, I think, are convinced that the talk of AI dangers is hype as they go about their work!

I think we should learn from Einstein, who, when offered the presidency of Israel, wisely declined, citing a lack of experience and aptitude. In other words, experience in deep learning does not give AI experts expertise in other professions.

But why does our industry uniquely exhibit this scaremongering and deviation from reason?

It’s partly because we are exposed to science fiction characters who influence our view of reality. In fact, it is possible to gain a better perspective of AI in terms of science fiction characters.

AGI characteristics include:

Human-Level Understanding and Reasoning
Versatility and Adaptability
Autonomy in Decision-Making
Continuous Learning and Evolution
Integration into Daily Life

Now, let’s look at three characters from science fiction that typify AGI:

1) R. Daneel Olivaw from Asimov’s Robot series

R. Daneel Olivaw is an iconic character from Isaac Asimov’s “Robot” series.

Daneel is not just a machine but a character who is an integral part of human society. He works alongside humans, as a detective, contributing to solving societal and political problems.

Asimov’s portrayal of R. Daneel Olivaw signifies intelligent machines that coexist with humans, often with their own identities and moral complexities.

Daneel is virtually indistinguishable from a human being in appearance. He was constructed with a level of detail and sophistication, including emotional awareness, that allows him to blend seamlessly into human society. Daneel operates under the guidance of Asimov’s famous Three Laws of Robotics, which prioritize human safety and welfare. These laws deeply influence his actions and decision-making processes.

There are two further characteristics of Daneel which have an impact on human society:

1) The longevity of the character and

2) The evolution of the character as he learns.

A more contemporary example is Lieutenant Commander Data from “Star Trek: The Next Generation”.

2) HAL 9000 from “2001: A Space Odyssey”

HAL 9000 from Arthur C. Clarke’s “2001: A Space Odyssey” is the second example of AGI.

Key features of HAL 9000 that demonstrate AGI include: Advanced Cognitive Functions, Emotional and Social Interaction, Learning and Adaptability, and Ethical and Moral Dilemmas.

There are two further characteristics of HAL which make it interesting:

1) Firstly, Autonomy and Self-Preservation: Unlike typical machines, HAL makes autonomous decisions that it believes are in the best interest of the mission. This includes the infamous turn of events where HAL’s actions are driven by a perceived need for self-preservation and mission success.

2) And secondly, situational control (control of a ship) – leading to the famous line “I’m sorry Dave, I’m afraid I can’t do that.”

3) Wintermute from William Gibson’s novel “Neuromancer”

The third character is Wintermute from William Gibson’s novel “Neuromancer”.

Although the least known of the three, this character offers some interesting examples of AGI:

Manipulation and Social Engineering: Unlike many other portrayals of AGI, Wintermute excels in social manipulation and engineering. It interacts with humans, often manipulating them to achieve its objectives, showcasing a deep understanding of human psychology and behavior.
Autonomous Goal-Seeking: Wintermute operates with a high degree of autonomy, pursuing its goals with a level of determination and resourcefulness. It is not just a tool or a passive system but an active agent with its own agenda.
Integration into Cyberspace: Wintermute’s existence is deeply intertwined with the novel’s depiction of cyberspace. It navigates and manipulates this digital realm, demonstrating how AGI might exist and operate in virtual environments.

Here, we have reference to superconsciousness. In this context, the meaning of superconsciousness refers to a level of artificial intelligence that significantly surpasses the capabilities and understanding of individual AIs or human intelligence in every aspect, including self-awareness, intelligence, and perhaps even a form of digital spirituality or enlightenment.

So, on the road to AGI, there is an interim stage, best signified by HAL, where AI can take semi-autonomous decisions based on an increasing ability to reason within a context (a company, a city?).

The other two scenarios remain, for now, in the realm of science fiction: the first because it needs bipedal robots (like Data), and the consciousness scenario because it would need to exceed human intelligence at scale for all tasks. I have faith in human intelligence!

So, this way, we can decouple the science fiction from science itself by leveraging science fiction itself.


How to publish your own custom GPT chatbot in OpenAI’s store

You've created your own AI-powered GPT chatbot using OpenAI's ChatGPT technology. And you think your GPT might prove helpful or interesting to other people. Now that OpenAI's GPT store is open for business, you can publish your GPT so that other ChatGPT subscribers can take it for a spin. And if your GPT is popular enough, you may even earn a bit of money from it down the road, though OpenAI hasn't revealed any details on that option just yet.

Any ChatGPT subscriber can create and publish their own GPT. Before you can publish a GPT using the Builder profile tool, you'll need to verify your profile with either your name or a website domain. After being verified, you'll be able to post your chatbot for other subscribers to see.

You can create as many custom GPTs as you like. However, there are some rules and restrictions about the types of chatbots you can develop. Before you dive in, review OpenAI's usage policies, paying special attention to the sections on Building with ChatGPT and GPT Store.

Why Agile doesn’t work for most IT pros: The bigger you are, the harder you fall

We're still a ways off from fully realizing the vision of the well-regarded Agile Manifesto — which outlined and encouraged the practice of working closely and informally with end users to iteratively build software. There was only one catch with this more open and collaborative approach — it didn't scale easily to larger organizations with multiple sites, systems, and teams working across the globe.

The issue of scale still inhibits large or growing organizations. Small organizations represented in a recent survey of 758 software pros conducted by Digital.ai report strong business benefits, while their larger counterparts keep running into roadblocks.

Users who are happy with Agile point to benefits such as improved collaboration (60%), while 57% saw better alignment to business needs and a quarter saw better quality software delivered.

Overall, while more than seven in 10 IT professionals — 71% — use Agile in software development, only 11% are fully satisfied with the outcomes, while 33% are "somewhat satisfied." That means at least 56% are not happy with the outcomes, or may not be aware of results.

Close to half, 46%, blame "too many mixed systems" in their companies for forcing them to adopt hybrid approaches to software development. Other challenges included siloed teams and resultant delays (37%), culture clash (34%), inconsistent use across teams (30%), and inability to measure business value (28%).

AI is starting to work its way into Agile activities. Among Agile users, almost 30% are actively exploring employing large language models (LLMs) and code assistants to assist in development processes.

Agile's offspring, DevOps, is also on the table. Both are intended to increase end-to-end visibility and the ability to measure cycle times, wait times, and bottlenecks. Other areas in progress include continuous testing done earlier in the life cycle (29%), along with LLMs (10%) and code assist (10%).

Many issues with Agile result from size — mixed software development approaches, organizational resistance to change, lack of understanding among leaders, and internal silos, which are hallmarks of large, multi-departmental organizations. As a result, most successful Agile implementations are found in small companies. A majority of professionals in smaller organizations, 52%, believe Agile is a "powerful productivity and organizational framework resulting in increased collaboration, improved software quality, and better alignment with the business." Only 43% of professionals with larger companies agree.

Close to three-fourths of professionals with small companies (74%), versus 62% at large companies, said a majority of their applications were delivered on time and "with quality." In addition, 71% of small organizations, compared to 53% at large companies, have "complete visibility into what's being developed and delivered across the software development lifecycle."

In addition, 61% of small-company respondents have product managers who can oversee the entire pipeline and measure value to the business — compared with just 43% of large companies.

This is the 17th year this study was fielded. One may be forgiven for seeing the original Agile Manifesto, written in 2001, as dated. Heads-down coding from scratch is vanishing. Over the past two decades, we've seen the onset of cloud, digital transformation, edge computing, remote work, artificial intelligence, and business leaders leaning even harder on their technology teams to take them into the future. The lines between technology and business have blurred or even disappeared altogether. Technology professionals have become business movers and shakers, and businesspeople are growing more tech-savvy.

Agile team leaders are being asked to do a lot, the survey's authors state — "from demonstrating business value and enabling digital transformation to incorporating AI and managing distributed workforces. From AI to developer burnout, hybrid work environments and unrelenting demand, change is happening in every organization in every industry. At this moment in time, it feels like Agile is having difficulty adapting."

Still, the Agile philosophy remains the best bet for taking businesses forward into an uncertain future, dominated by technology. Scrum continues to be the most popular team-level methodology, employed at 63% of sites. The Scaled Agile Framework (SAFe) remains the top choice at the enterprise level at 26%, but 22% said they don't follow a mandated enterprise framework at all.

The benefits of Agile, improved collaboration and better alignment with the business, are still out of reach for many. A challenge cited by 37% is that business teams simply don't understand what Agile is or what it can do. Another 27% said there is not enough training. "There is an ongoing disconnect between agile practitioners and the business, evidenced by resistance to organizational change, a lack of understanding amongst leadership, and inadequate training and support from the business side," the survey's authors report.

Despite all the AI hype, success depends on just one thing

There's an incredible amount of hype about the game-changing power of artificial intelligence (AI), but many experts agree the key ingredient to making the most of emerging technology is one thing — finding the right business use case.

Thierry Martin, senior manager for data and analytics strategy at Toyota Motor Europe, explains in a one-to-one video interview with ZDNET how the automotive giant is dedicating time and resources in research and development to the potential of AI.

However, this exploratory work is very much focused on the current use case — and that means data science rather than prediction and automation.

"The analysis of data is much more important for us," says Martin. "For instance, how are people driving our cars? Is there a difference between different countries or highway driving between Germany and Belgium?"

The development of deep insight through data science and analysis is dependent on the collection of data, which is an area where Toyota excels.

"We can already get a lot of insight into how people are using our cars," he says. "We do forecast models, for instance, to do root cause analysis or to predict what kind of accessories we need to install to help with planning."

For now, Martin says Toyota is focused on using tools like Power BI to keep the human at the heart of the loop and to use analytics to develop a detailed understanding of automotive operations and processes.

"We are not letting the AI make decisions instead of people," he says. "We prefer to provide more insight — that's where we are."

Yet in the not-so-distant future, Martin can envisage a situation where his organization starts to exploit AI in production — and explorations to find the right use cases for line-of-business processes are already underway.

"We have quite a high demand for that," he says. "There are lots of use cases around analyzing text data and generative AI, which became possible since 2022 and the launch of the ChatGPT models."

While OpenAI's large language models (LLMs) helped push generative AI into the mainstream, Toyota — like so many other blue-chip enterprises — is proceeding with care when it comes to deploying emerging technologies.

Take the example of Omer Grossman, global CIO at CyberArk, who says his firm's work around AI follows guidelines for safe and secure working that can be adopted and adapted.

"If you need a one-sentence slogan, this is it: Make sure you build responsible guardrails that promote innovation while keeping it secure," he says.

In the case of Toyota Europe, Martin suggests two routes forward for making the most of AI.

The first pathway will focus on using tools like Microsoft Copilot at a personal level to help people complete tasks using non-sensitive data.

The second pathway, where his team is exploring its options through prototyping, is about using generative AI securely within the enterprise firewall to boost productivity.

"In terms of prototyping, we do a lot of work around chatbots," he says. "We are coding chatbots ourselves now. And once you have a library set up, it's very quick to set it up and try it by yourself. There's not so much complexity here."

Toyota Europe's work with AI is being supported by the creation of a data mesh, which Martin describes as an approach to governance that ensures responsibility for data products stays with the business owners.

The organization is bringing its information together on a Snowflake platform that provides a foundation for well-governed data access.

The data mesh draws on a range of other technologies, including Dataiku for collaboration, Collibra for governance, and Denodo to connect data meshes across different parts of the organization, such as Toyota Europe and Japan.

Martin and his team are using these data mesh technologies to help explore AI. They've already built chatbots on Dataiku that use an LLM running on a secure instance of Azure OpenAI to provide summaries of PDFs.

He's demonstrated the chatbot to top executives at Toyota Europe and suggests that internal development is the way to go because it helps alleviate some of the concerns associated with publicly available models from big-name providers.

"So, we already have access to our own language model," he says. "It's on Azure, but it's safe. And then on top of that, because we have the LLM and we have a chatbot, we can build our database and we can build interactive chatbots and things like that."

Martin says his team continues to explore its options: "We are building a knowledge-retrieval system, for instance, because there is a lot of knowledge scattered everywhere in the company. But that's still at the pilot level."

Across the organization's AI-enabled explorations, the watchword is "testing" to ensure that services meet tight governance requirements and the demands of line-of-business users.

"We want to confirm the value and we also need to confirm how to scale it," he says. "Once you start to use a chatbot service, for example, there is AI ethics and governance that we have to bring in. If I start to roll out a chatbot, then I need to be clear on lots of questions about ethics."

So, when might that broader implementation take place? Martin says he'd like to get some AI-based tools in production relatively quickly.

He's working with his technology partners, including Snowflake, to ensure governance issues are considered and access to data is constrained.

"The vision should be, for instance, that a logistics chatbot should have access to only logistics data and not to HR data, just as an employee only has access to certain data," he says.
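The access rule Martin describes can be pictured as a scope check that runs before a chatbot reads any data. The sketch below is purely illustrative and is not Toyota's or Snowflake's implementation: the chatbot names, the domain names, and the `fetch_for_chatbot` helper are all hypothetical, and in practice such rules would more likely be enforced by the data platform's own role and permission system than in application code.

```python
# Hypothetical mapping from chatbot name to the data domains it may read.
CHATBOT_SCOPES = {
    "logistics_bot": {"logistics"},
    "hr_bot": {"hr"},
}

# Hypothetical stand-in for governed data products, keyed by domain.
DATA_PRODUCTS = {
    "logistics": ["shipment delays by route"],
    "hr": ["headcount by department"],
}


def fetch_for_chatbot(chatbot, domain):
    """Return the data for `domain` only if `chatbot` is scoped to it."""
    allowed = CHATBOT_SCOPES.get(chatbot, set())  # unknown bots get nothing
    if domain not in allowed:
        raise PermissionError(f"{chatbot} may not read {domain} data")
    return DATA_PRODUCTS[domain]


if __name__ == "__main__":
    print(fetch_for_chatbot("logistics_bot", "logistics"))
    try:
        fetch_for_chatbot("logistics_bot", "hr")
    except PermissionError as error:
        print(error)
```

Denying by default (an unknown chatbot has an empty scope) mirrors the idea that a chatbot, like an employee, only sees the data it has been granted.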

Martin says Toyota Europe is continuing to prototype and could have some kind of AI-enabled chatbot service that extracts data from the Snowflake platform by mid-2024.

He's also speaking with other technology partners, such as Dataiku and Collibra, about how his vision might be realized.

Most crucially of all, he'll be working closely with the business to demonstrate his AI services and to consider how these tools might work in specific areas of the organization.

"We need to understand where the best place is to run the chatbot," he says. "And that's why it's super-important for the engineers and also the leaders to really understand what we are talking about. That's where we'll be spending our time."

Artificial Intelligence

Innovation in Synthetic Data Generation: Building Foundation Models for Specific Languages

Synthetic data, artificially generated to mimic real data, plays a crucial role in various applications, including machine learning, data analysis, testing, and privacy protection. In Natural Language Processing (NLP), synthetic data proves invaluable for enhancing training sets, particularly in low-resource languages, domains, and tasks, thereby enhancing the performance and robustness of NLP models. However, generating synthetic data for NLP is non-trivial, demanding high linguistic knowledge, creativity, and diversity.

Different methods, such as rule-based and data-driven approaches, have been proposed to generate synthetic data. However, these methods have limitations, such as data scarcity, quality issues, lack of diversity, and domain adaptation challenges. Therefore, we need innovative solutions to generate high-quality synthetic data for specific languages.

A significant improvement in generating synthetic data includes adjusting models for different languages. This means building models for each language so that the synthetic data generated is more accurate and realistic in reflecting how people use those languages. It is like teaching a computer to understand and mimic different languages' unique patterns and details, making synthetic data more valuable and reliable.

The Evolution of Synthetic Data Generation in NLP

NLP tasks such as machine translation, text summarization, and sentiment analysis require a lot of data to train and evaluate models. However, obtaining such data can be challenging, especially for low-resource languages, domains, and tasks. Synthetic data generation can therefore help augment, supplement, or replace real data in NLP applications.

The techniques for generating synthetic data for NLP have evolved from rule-based to data-driven to model-based approaches. Each approach has its features, advantages, and limitations, and they have contributed to the progress and challenges of synthetic data generation for NLP.

Rule-based Approaches

Rule-based approaches are the earliest techniques that use predefined rules and templates to generate texts that follow specific patterns and formats. They are simple and easy to implement but require a lot of manual effort and domain knowledge and can only generate a limited amount of repetitive and predictable data.
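As a rough illustration of the rule-based idea, the sketch below fills a handful of templates with every combination of slot values. The templates, slot names, and the `generate_from_templates` helper are hypothetical and chosen only for this example.

```python
import itertools

# Hypothetical templates and slot values, chosen only for illustration.
TEMPLATES = [
    "The {product} was {opinion}.",
    "I found the {product} {opinion}.",
]
SLOTS = {
    "product": ["phone", "laptop"],
    "opinion": ["excellent", "disappointing"],
}


def generate_from_templates(templates, slots):
    """Fill every template with every combination of slot values."""
    names = sorted(slots)
    sentences = []
    for template in templates:
        # itertools.product enumerates all value combinations for the slots.
        for values in itertools.product(*(slots[name] for name in names)):
            sentences.append(template.format(**dict(zip(names, values))))
    return sentences


sentences = generate_from_templates(TEMPLATES, SLOTS)
```

The output grows only as fast as the hand-written templates and value lists do, which is the repetitiveness and manual effort described above.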

Data-driven Approaches

These techniques use statistical models to learn the probabilities and patterns of words and sentences from existing data and generate new texts based on them. They are more advanced and flexible but require a large amount of high-quality data and may produce texts that are not relevant or accurate enough for the target task or domain.
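One of the simplest statistical generators is a bigram (first-order Markov) model. The sketch below is a minimal version under stated assumptions: the three-sentence corpus and the function names are made up for illustration, and a real system would need far more data.

```python
import random
from collections import defaultdict

# Tiny illustrative corpus; a real model needs far more text.
CORPUS = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]


def train_bigram_model(corpus):
    """Record, for each token, the tokens observed right after it."""
    transitions = defaultdict(list)
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for current, nxt in zip(tokens, tokens[1:]):
            transitions[current].append(nxt)
    return transitions


def sample_sentence(transitions, rng, max_len=12):
    """Walk the chain from the start marker until the end marker or max_len."""
    token, words = "<s>", []
    while len(words) < max_len:
        # Duplicates in the list make frequent successors more likely.
        token = rng.choice(transitions[token])
        if token == "</s>":
            break
        words.append(token)
    return " ".join(words)


model = train_bigram_model(CORPUS)
rng = random.Random(0)
samples = [sample_sentence(model, rng) for _ in range(5)]
```

Because each word depends only on the previous one, the samples can be locally fluent yet drift off topic, which mirrors the relevance problem noted above.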

Model-based Approaches

These state-of-the-art techniques use Large Language Models (LLMs) like BERT, GPT, and XLNet and present a promising solution. These models, trained on extensive text data from diverse sources, exhibit significant language generation and understanding capabilities. The models can generate coherent, diverse texts for various NLP tasks like text completion, style transfer, and paraphrasing. However, these models may not capture specific features and nuances of different languages, especially those that are under-represented or have complex grammatical structures.

A new trend in synthetic data generation is tailoring and fine-tuning these models for specific languages and creating language-specific foundation models that can generate synthetic data that is more relevant, accurate, and expressive for the target language. This can help bridge the gaps in training sets and improve the performance and robustness of NLP models trained on synthetic data. However, this also has some challenges, such as ethical issues, bias risks, and evaluation challenges.

How Can Language-Specific Models Generate Synthetic Data for NLP?

To overcome the shortcomings of current synthetic data models, we can enhance them by tailoring them to specific languages. This involves pre-training on text data from the language of interest, adapting through transfer learning, and fine-tuning with supervised learning. By doing so, models can enhance their grasp of vocabulary, grammar, and style in the target language. This customization also facilitates the development of language-specific foundation models, thereby boosting the accuracy and expressiveness of synthetic data.

LLMs struggle to create synthetic data for specialized areas such as medicine or law, which require domain knowledge. Techniques developed to address this include using domain-specific languages (e.g., Microsoft's PROSE), employing multilingual BERT models (e.g., Google's mBERT) for various languages, and utilizing Neural Architecture Search (NAS), like Facebook's AutoNLP, to enhance performance. These methods help produce synthetic data that fits well and is of superior quality for specific fields.

Language-specific models also introduce new techniques to enhance the expressiveness and realism of synthetic data. For example, they use different tokenization methods, such as Byte Pair Encoding (BPE) for subword tokenization, character-level tokenization, or hybrid approaches to capture language diversity.
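To make the subword idea concrete, here is a minimal sketch of how BPE merges can be learned from word frequencies: repeatedly find the most frequent adjacent symbol pair and fuse it. The word counts and the `learn_bpe` name are illustrative assumptions; production tokenizers add many details omitted here.

```python
from collections import Counter


def learn_bpe(word_counts, num_merges):
    """Learn BPE merges from a {word: frequency} mapping."""
    # Each word becomes a tuple of characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, count in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + count
        vocab = new_vocab
    return merges, vocab


# Illustrative word frequencies, not taken from any real corpus.
merges, vocab = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=3)
```

With these counts, the shared ending of "newest" and "widest" is fused into a single subword unit after three merges, which is how BPE lets a model reuse pieces across rare and unseen words.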

Domain-specific models perform well in their respective domains, such as BioBERT for biomedicine, LegalGPT for law, and SciXLNet for science. Additionally, they integrate multiple modalities like text and image (e.g., ImageBERT), text and audio (e.g., FastSpeech), and text and video (e.g., VideoBERT) to enhance diversity and innovation in synthetic data applications.

The Benefits of Synthetic Data Generation with Language-specific Models

Synthetic data generation with language-specific models offers a promising approach to address challenges and enhance NLP model performance. This method aims to overcome limitations inherent in existing approaches but has drawbacks, prompting numerous open questions.

One advantage is the ability to generate synthetic data aligning more closely with the target language, capturing nuances in low-resource or complex languages. For example, Microsoft researchers demonstrated enhanced accuracy in machine translation, natural language understanding, and generation for languages like Urdu, Swahili, and Basque.

Another benefit is the capability to generate data tailored to specific domains, tasks, or applications, addressing challenges related to domain adaptation. Google researchers highlighted advancements in named entity recognition, relation extraction, and question answering.

In addition, language-specific models enable the development of techniques and applications, producing more expressive, creative, and realistic synthetic data. Integration with multiple modalities like text and image, text and audio, or text and video enhances the quality and diversity of synthetic data for various applications.

Challenges of Synthetic Data Generation with Language-specific Models

Despite their benefits, several challenges are pertinent to language-specific models in synthetic data generation. Some of the challenges are discussed below:

An inherent challenge in generating synthetic data with language-specific models is ethical concerns. The potential misuse of synthetic data for malicious purposes, like creating fake news or propaganda, raises ethical questions and risks to privacy and security.

Another critical challenge is the introduction of bias in synthetic data. Synthetic data that underrepresents or misrepresents particular languages, cultures, genders, or races raises concerns about fairness and inclusivity.

Likewise, the evaluation of synthetic data poses challenges, particularly in measuring quality and representativeness. Comparing NLP models trained on synthetic data with those trained on real data requires novel metrics, and their absence hinders accurate assessment of synthetic data's efficacy.

The Bottom Line

Synthetic data generation with language-specific models is a promising and innovative approach that can improve the performance and robustness of NLP models. It can generate synthetic data that is more relevant, accurate, and expressive for the target language, domain, and task. Additionally, it can enable the creation of novel and innovative applications that integrate multiple modalities. However, it also presents challenges and limitations, such as ethical issues, bias risks, and evaluation challenges, which must be addressed to utilize these models' potential fully.

3 Crucial Challenges in Conversational AI Development and How to Avoid Them

Conversational AI refers to virtual agents and chatbots that mimic human interactions and can engage human beings in conversation. Using conversational AI is fast becoming a way of life — from asking Alexa to “find the nearest restaurant” to asking Siri to “create a reminder,” virtual assistants and chatbots are often used to answer consumers’ questions, resolve complaints, make reservations, and much more.

Developing these virtual assistants requires substantial effort. However, understanding and addressing the key challenges can streamline the development process. I have used my first-hand experience in creating a mature chatbot for a recruitment platform as a reference point to explain key challenges and their corresponding solutions.

To build a conversational AI chatbot, developers can use frameworks like RASA, Amazon’s Lex, or Google’s Dialogflow. Most prefer RASA when they plan custom changes or when the bot is in the mature stage, as it is an open-source framework. Other frameworks are also suitable as a starting point.

The challenges can be grouped by the three major components of a chatbot.

Natural Language Understanding (NLU) is the ability of a bot to comprehend human dialogue. It performs intent classification, entity extraction, and response retrieval.

Dialogue Manager is responsible for a set of actions to be performed based on the current and previous set of user inputs. It takes intent and entities as input (as part of the previous conversation) and identifies the next response.

Natural Language Generation (NLG) is the process of generating written or spoken sentences from given data. It frames the response, which is then presented to the user.

Challenges in Natural Language Understanding

Insufficient data

When developers replace FAQs or other support systems with a chatbot, they get a decent amount of training data. But the same doesn’t happen when they create the bot from scratch. In such cases, developers generate training data synthetically.

What to do?

A template-based data generator can generate a decent amount of user queries for training. Once the chatbot is ready, project owners can expose it to a limited number of users to enhance training data and upgrade it over a period.
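A template-based generator for NLU training data might look like the sketch below. It assumes a recruitment-style bot; the intents, templates, slot values, and the output dictionary layout are all hypothetical rather than the format of any particular framework.

```python
import random
import re

# Hypothetical intents, templates, and slot values for a recruitment bot.
INTENT_TEMPLATES = {
    "search_candidates": [
        "show me candidates from {institute}",
        "list {skill} developers in {city}",
    ],
    "search_positions": [
        "show me open positions in {city}",
        "any {skill} openings?",
    ],
}
SLOT_VALUES = {
    "institute": ["IIT Delhi", "NIT Agartala"],
    "city": ["Bengaluru", "Pune"],
    "skill": ["Python", "Java"],
}
PLACEHOLDER = re.compile(r"\{(\w+)\}")


def fill_template(template, rng):
    """Fill slots left to right so entity offsets match the final text."""
    text, entities, cursor = "", [], 0
    for match in PLACEHOLDER.finditer(template):
        text += template[cursor:match.start()]
        slot = match.group(1)
        value = rng.choice(SLOT_VALUES[slot])
        entities.append(
            {"entity": slot, "value": value,
             "start": len(text), "end": len(text) + len(value)}
        )
        text += value
        cursor = match.end()
    text += template[cursor:]
    return text, entities


def generate_examples(n, seed=0):
    """Produce n labeled queries with intent and entity annotations."""
    rng = random.Random(seed)
    examples = []
    for _ in range(n):
        intent = rng.choice(sorted(INTENT_TEMPLATES))
        text, entities = fill_template(rng.choice(INTENT_TEMPLATES[intent]), rng)
        examples.append({"text": text, "intent": intent, "entities": entities})
    return examples
```

Each generated record carries its intent label and entity spans, so it could be converted into whatever training format the chosen framework expects. Queries collected later from real users can then replace or extend this bootstrap set.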

Unfitting model selection

Appropriate model selection and training data are crucial to get the best intent and entity extraction results. Developers usually train chatbots in a specific language and domain, and most available pre-trained models are domain-specific and trained on a single language.

There can also be cases of mixed languages, where multilingual users enter queries that combine languages. For instance, in a French-dominated region, people may write English mixed with French.

What to do?

Using models trained in multiple languages could reduce the problem. A pre-trained model like LaBSE (Language-agnostic BERT Sentence Embedding) can be helpful in such cases. LaBSE is trained on 109 languages on a sentence similarity task, so the model already knows similar words in different languages. In our project, it worked really well.

Improper entity extraction

Chatbots require entities to identify what kind of data the user is searching for. These entities include time, place, person, item, date, etc. However, bots can fail to identify an entity from natural language:

Same context but different entities. For instance, a bot can mislabel an institute as a place when a user types “Name of students from IIT Delhi” and then “Name of students from Bengaluru.”

Scenarios where the entities are mispredicted with low confidence. For example, a bot can identify IIT Delhi as a city with low confidence.

Partial entity extraction by the machine learning model. If a user types “students from IIT Delhi,” the model may identify only “IIT” as an entity instead of “IIT Delhi.”

Single-word inputs having no context can confuse the machine learning models. For example, a word like “Rishikesh” can mean both the name of a person as well as a city.

What to do?

Adding more training examples could be a solution. But there is a limit after which adding more would not help. Moreover, it’s an endless process. Another solution could be to define regex patterns using pre-defined words to help extract entities with a known set of possible values, like city, country, etc.
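The regex approach can be sketched as follows for an entity with a known value set. The city list and the `extract_cities` helper are hypothetical; a real bot would load the values from its own data.

```python
import re

# Hypothetical list of known values for the "city" entity.
KNOWN_CITIES = ["Delhi", "New Delhi", "Bengaluru", "Mumbai"]


def build_pattern(values):
    """Compile one case-insensitive pattern matching any known value."""
    # Longest values first, so "New Delhi" is preferred over "Delhi".
    alternatives = sorted(values, key=len, reverse=True)
    escaped = "|".join(re.escape(value) for value in alternatives)
    return re.compile(rf"\b(?:{escaped})\b", re.IGNORECASE)


CITY_PATTERN = build_pattern(KNOWN_CITIES)


def extract_cities(text):
    """Return every known city mentioned in the text with its span."""
    return [
        {"entity": "city", "value": m.group(0), "start": m.start(), "end": m.end()}
        for m in CITY_PATTERN.finditer(text)
    ]
```

A lookup like this is precise for closed sets but blind to context: on its own it would still tag the “Delhi” in “IIT Delhi” as a city, so it works best alongside the model's predictions rather than instead of them.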

Models report lower confidence whenever they are not sure about an entity prediction. Developers can use this as a trigger to call a custom component that corrects the low-confidence entity. Let’s consider the above example. If IIT Delhi is predicted as a city with low confidence, the component can search for it in the database. After failing to find the predicted entity in the City table, it would proceed to other tables and, eventually, find it in the Institute table, resulting in entity correction.
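A minimal sketch of such a correction component is shown below. The in-memory `TABLES` dictionary stands in for the database tables, and the threshold value and entity dictionary layout are assumptions made for illustration.

```python
# Assumed cut-off below which a prediction is re-checked.
CONFIDENCE_THRESHOLD = 0.6

# Stand-in for database tables; a real component would query the database.
TABLES = {
    "city": {"delhi", "bengaluru", "rishikesh"},
    "institute": {"iit delhi", "nit agartala"},
}


def correct_entity(entity):
    """Re-label a low-confidence entity by looking its value up in known tables."""
    if entity["confidence"] >= CONFIDENCE_THRESHOLD:
        return entity
    value = entity["value"].lower()
    # Check the predicted type first, then fall through to the other tables.
    order = [entity["entity"]] + [name for name in TABLES if name != entity["entity"]]
    for table in order:
        if value in TABLES.get(table, set()):
            return {**entity, "entity": table}
    # Not found anywhere: leave the prediction untouched.
    return entity


fixed = correct_entity({"entity": "city", "value": "IIT Delhi", "confidence": 0.35})
```

Here “IIT Delhi” is not in the city table, so the lookup moves on and re-labels it as an institute. Confident predictions skip the lookup entirely, which keeps the extra database work limited to doubtful cases.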

Wrong intent classification

Every user message has some intent associated with it. Since intents drive a bot's next course of action, correctly classifying user queries by intent is crucial. Developers must therefore define intents so that they overlap as little as possible. Otherwise, similar queries can be confused with one another. For example, “Show me open positions” vs. “Show me open position candidates”.

What to do?

There are two ways to differentiate confusing queries. Firstly, a developer can introduce sub-intent. Secondly, models can handle queries based on entities identified.
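The second option, routing on extracted entities, could look roughly like this. The parent intent name, the sub-intent labels, and the `candidate` entity type are hypothetical.

```python
def route_query(intent, entities):
    """Choose a sub-intent for a shared parent intent from the entities found."""
    if intent != "show_open_positions":
        return intent
    entity_types = {entity["entity"] for entity in entities}
    # A "candidate" entity signals the user wants people, not job listings.
    if "candidate" in entity_types:
        return "show_open_positions/candidates"
    return "show_open_positions/list"


positions = route_query("show_open_positions", [])
candidates = route_query(
    "show_open_positions", [{"entity": "candidate", "value": "candidates"}]
)
```

Both confusing queries are classified under one parent intent, and the presence or absence of an entity decides which handler runs.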

A domain-specific chatbot should be a closed system that clearly identifies what it is capable of and what it is not. Developers should plan the development of domain-specific chatbots in phases. In each phase, they can identify the chatbot’s unsupported features (via an unsupported intent).

They can also identify what the chatbot cannot handle with an “out of scope” intent. But there could be cases where the bot confuses unsupported and out-of-scope intents. For such scenarios, a fallback mechanism should be in place: if the intent confidence is below a threshold, the model can respond gracefully with a fallback intent to handle the confusion.
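The threshold-based fallback can be sketched in a few lines. The threshold value, the intent names, and the `(name, confidence)` ranking format are assumptions for illustration, not the API of a specific framework.

```python
def resolve_intent(intent_ranking, threshold=0.5):
    """Return the top intent, or "fallback" when the classifier is unsure."""
    if not intent_ranking:
        return "fallback"
    # Each item is a (name, confidence) pair; pick the most confident one.
    name, confidence = max(intent_ranking, key=lambda item: item[1])
    return name if confidence >= threshold else "fallback"


confident = resolve_intent([("search_positions", 0.82), ("out_of_scope", 0.10)])
unsure = resolve_intent([("unsupported", 0.41), ("out_of_scope", 0.38)])
```

In the second call neither intent clears the threshold, so the bot would take the fallback path, for example by asking the user to rephrase, instead of guessing between the two.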

Challenges with Dialogue Management

Once the bot identifies the intent of a user’s message, it must send a response back. The bot decides the response based on a defined set of rules and stories. For example, a rule can be as simple as uttering “good morning” when the user greets with “Hi”. However, most often, conversations with chatbots comprise follow-up interactions, and the bot's responses depend on the overall context of the conversation.

What to do?

To handle this, chatbots are fed with real conversation examples called Stories. However, users don’t always interact as intended. A mature chatbot should handle all such deviations gracefully. Designers and developers can guarantee this if they don’t just focus on a happy path while writing stories but also work on unhappy paths.

Challenges in Natural Language Generation

User engagement with chatbots relies heavily on the chatbot's responses. Users might lose interest if the responses are too robotic or too familiar. For instance, a user may not like an answer like “You have typed a wrong query” for a wrong input even though the response is correct. The answer here doesn’t match the persona of an assistant.

What to do?

The chatbot serves as an assistant and should possess a specific persona and tone of voice. It should be welcoming and humble, and developers should design conversations and utterances accordingly. The responses should not sound robotic or mechanical. For instance, the bot could say, “Sorry, it seems like I don’t have any details. Could you please re-type your query?” to address a wrong input.

Adding LLMs to a Chatbot System

LLM (Large Language Model) based chatbots like ChatGPT and Bard are game-changing innovations and have improved the capabilities of conversational AIs. They are not only good at holding open-ended, human-like conversations but can also perform tasks like text summarization and paragraph writing, which previously required task-specific models.

One of the challenges with traditional chatbot systems is categorizing each sentence into intents and deciding the response accordingly. This approach is not practical. Responses like “Sorry, I couldn’t get you” are often irritating. Intentless chatbot systems are the way forward, and LLMs can make this a reality.

LLMs can easily achieve state-of-the-art results in general named entity recognition barring certain domain-specific entity recognition. A mixed approach to using LLMs with any chatbot framework can inspire a more mature and robust chatbot system.

With the latest advancements and continuous research in conversational AI, chatbots are getting better every day. Areas like handling complex tasks with multiple intents, such as “Book a flight to Mumbai and arrange for a cab to Dadar,” are getting much attention.

Soon, personalized conversations will take place based on the characteristics of the user to keep the user engaged. For example, if a bot finds the user is unhappy, it can redirect the conversation to a real agent. Additionally, with ever-increasing chatbot data, deep learning models like ChatGPT can automatically generate responses to queries using a knowledge base.

Suman Saurav is a Data Scientist at Talentica Software, a software product development company. He is an alumnus of NIT Agartala with over 8 years of experience designing and implementing revolutionary AI solutions using NLP, Conversational AI, and Generative AI.
