Open Assistant: Explore the Possibilities of Open and Collaborative Chatbot Development

Open Assistant: Explore the Possibilities of Open and Collaborative Chatbot Development
Image by Author | Bing Image Creator

We are seeing rapid development of ChatGPT open-source alternatives, and some of them, like Vicuna, are producing amazing results. But there is a catch. These new models have restricted licenses. We cannot use them for commercial use.

On other hand, Open Assistant is trying to change that. Their mission is to give everyone access to a great chat-based large language model like ChatGPT and GPT-4.

In this post, we will learn about the Open Assistant project, its features, limitations, and plans. Moreover, we will provide you with all the resources to start creating your chatbot.

What is an Open Assistant?

The Open-Assistant project is revolutionizing language innovations. Instead of keeping high-quality large language models private, they are letting everyone use datasets, models, code sources, and the Open Assistant platform.

The Open-Assistant models are trained on a dataset that was collected from more than 13,000 volunteers. The collected dataset has over 600K interactions, 150K messages, and 10K fully annotated conversation trees on diverse topics in multiple languages.

Watch the launch video to understand how cool this project is.

If you go to their Hugging Face page, you will see multiple model architectures trained on the Open Assistant dataset, for example, Stable LM, LLaMA, Pythia, Galactica, and more. They are working on a state-of-the-art model on the latest data, and soon they will launch that model with security features.

Note: some of the models have restricted licenses (for research only), like LLaMA, but you will also see models like Pythia that are open for any use.

How To Try It Out

You can check out a Hugging Face demo to interact with the model or sign up for free to official chat to experience state-of-the-art models.

As we all know that the project is created by an open-source community for the community, you will see options to improve the chat and contribute to data collection.

Chatting with the AI

Open Assistant lets you chat with a chatbot and give feedback on its responses. To start, sign up and click on the chat button. Then, use the thumbs-up or down icons to react to the chatbot's messages and help it learn.

Open Assistant: Explore the Possibilities of Open and Collaborative Chatbot Development
Image from Chat

Contributing to Data Collection

The data collection UI is quite simple. Just click on the Dashboard button, select the task, and start contributing. You can improve the capabilities of Open Assistant by submitting, ranking, and labeling model prompts and responses.

Open Assistant: Explore the Possibilities of Open and Collaborative Chatbot Development
Image from Open Assistant

When you make a valid contribution to the dataset, your score will be shown on a public leaderboard. This is a way of gamifying the contribution process.

Open Assistant: Explore the Possibilities of Open and Collaborative Chatbot Development
Image from Open Assistant Limitations

The limitations of Open Assistant are limitations of most open-source large language models. These models are trained on fewer coding and math interactions which results in failing horribly at answering math and coding questions.

The model is good at generating interesting answers and is more human-like, but sometimes it produces factually wrong or misleading answers.

You need to understand that these models are small compared to ChatGPT and there will be limitations.

Future Plan

The Open Assistant founders have a vision of creating an assistant of the future that can perform various tasks such as writing emails, doing meaningful work, using APIs, and dynamically researching information. Moreover, they want their assistant to be customizable and extensible to anyone who uses it.

  • They will continue to collect more high-quality data and train better models.
  • Their vision is to create a unified platform that includes conversational assistants, retrieval via search engines, integration of APIs and third-party integrations, and building blocks for developers.
  • They still have a few private models that they want to make public after working on security features.
  • The community is working on launching a methodology that will help train and run large language models on consumer-based GPUs.

Getting Started

The Open Assistant project is fully transparent and licensed for commercial use. Only a few models, such as LLaMa, are restricted. Everything else, including models, datasets, code, inference, paper, demo, and documentation, is free and public.

The platform lets you contribute to the dataset and climb the leaderboard. You can also train your model with the public dataset. Explore the endless possibilities.

  • Official Page: Open Assistant | Open Assistant (laion.ai)
  • GitHub: LAION-AI/Open-Assistant
  • HuggingFace Demo: Chat Llm Streaming — a Hugging Face Space by olivierdehaene
  • Official Chat: chat (open-assistant.io) (Requires signup)
  • Model Weights: OpenAssistant/oasst-sft-1-pythia-12b
  • Dataset: OpenAssistant/oasst1
  • Documentation: Introduction | Open Assistant (laion.ai)
  • Research Paper: OpenAssistant Conversations — Democratizing Large Language Model Alignment

Don’t forget to give likes, stars, and hearts to the project. They deserve our love as they are doing this selflessly.
Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master's degree in Technology Management and a bachelor's degree in Telecommunication Engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.

More On This Topic

  • Facebook Open Sources a Chatbot That Can Discuss Any Topic
  • Open Data and Why it is Necessary
  • The 7 Best Open Source AI Libraries You May Not Have Heard Of
  • DataOps Summit 2021 CFP Is Now Open!
  • Developing an Open Standard for Analytics Tracking
  • Top Open Source Large Language Models

Top 19 Skills You Need to Know in 2023 to Be a Data Scientist

Top 19 Skills You Need to Know in 2023 to Be a Data Scientist
Image by Author

Times are changing. If you want to be a data scientist in 2023, there are several new skills you should add to your roster, as well as the slew of existing skills you should have already mastered.

Why such an extensive set of skills? Part of the problem is job scope creep. Nobody knows what a data scientist is, or what one should do, least of all your future employer. So anything that has data gets stuck in the data science category for you to deal with.

You’re expected to know how to clean, transform, statistically analyze, visualize, communicate, and predict data. Not only that but new technology (or technology that has recently reached the mainstream) could also be added to your job responsibilities.

In this article, I’ll break down the top 19 skills you need to know in 2023 to be a data scientist.

Here’s an overview of the ten most important.

Top 19 Skills You Need to Know in 2023 to Be a Data Scientist
Image by Author

These skills will help you land a job, crush an interview, stay ahead of the curve, and negotiate for that promotion. In each section, I’ll briefly summarize what each skill is, why it matters, and offer a few places to learn these skills.

1. Data Cleaning and Wrangling

While it’s not 80% of a data scientist’s job, data cleaning and wrangling are still one of the most important skills a data scientist can master in 2023.

What is Data Cleaning and Wrangling?

Data cleaning and wrangling are the processes of transforming raw data into a format that can be used for analysis. This involves handling missing values, removing duplicates, dealing with inconsistent data, and formatting the data in a way that makes it ready for analysis.

Cleaning the data usually refers to getting rid of bad/inaccurate values, filling in any blanks, finding duplicates, and otherwise making sure your data set is as spotless and reliably accurate as can be expected. Wrangling it (or munging it, massaging it, or any other weird verb like that) means getting it into an analyzable shape. You convert it or map it into another, easier-to-look-at-format.

Why Does it Matter in Becoming a Data Scientist in 2023?

Ask any data scientist what they do, and one of the first things they mention will be data cleaning and wrangling. Data never comes into your hands in a nice, clean, analyzable shape, so it’s super important to know how to get it tidy.

The ability to clean and wrangle data ensures that your analysis results are trustworthy, and helps to avoid incorrect conclusions being drawn.

Where Can You Learn This Key Skill?

There are plenty of great options to learn data cleaning and wrangling. Harvard offers a course on EdX. You can also practice on your own by cleaning and wrangling free, raw datasets like the Common Crawl, web crawl data composed of over 50 billion web pages (here), or Brazil’s weather data (here).

2. Machine Learning

No, it’s not just a buzzword! Machine learning is a very important skill for any future data scientist to know.

What is Machine Learning?

Machine learning is the application of algorithms and statistical models to make predictions and decisions based on data.

It’s a subfield of artificial intelligence that enables computers to improve their performance on a specific task by learning from data, without being explicitly programmed. It helps with automation. You’ll find it in any industry.

Why Does It Matter in Becoming a Data Scientist in 2023?

You need to know about machine learning in 2023 because it’s a rapidly growing field that has become a crucial tool for solving complex problems and making predictions in various industries.

Machine learning algorithms can be used to classify images, recognize speech, do natural language processing, and create recommendation systems. You’ll be hard-pressed to find an industry that doesn’t do (or doesn’t want to) do those ML-assisted tasks.

Being proficient in machine learning allows a data scientist to extract valuable insights from large and complex data sets, and to develop predictive models that can drive better business decisions.

Where Can You Learn This Key Skill?

We’ve got a repository of over thirty machine-learning projects on ScrataScratch to show this skill off on your resume. TensorFlow also has a set of great free resources to learn machine learning.

3. Data Visualization Top 19 Skills You Need to Know in 2023 to Be a Data Scientist
Image by Author

This skill is pretty self-explanatory. When you analyze numbers, key stakeholders will want to understand your findings with pretty graphs and charts.

What is Data Visualization?

Data visualization is the creation of charts, graphs, and other graphics to help make data easier to understand. You take the numbers you’ve just cleaned, wrangled, or predicted and you put them into some kind of visual format, either to communicate trends with others or to make trends easier to spot.

Why Does it Matter in Becoming a Data Scientist in 2023?

In 2023, being able to visualize data is crucial for a data scientist. It's like having a secret superpower for uncovering hidden patterns and trends in the data that might not be obvious at first glance. And the best part? You get to share your findings with others in a way that's both engaging and memorable. As a data scientist, you’ll work with groups of all different experience levels, but a picture is much more easily understood than a row of numbers.

So, if you want to be a data scientist who can effectively communicate your insights and discoveries, it's important to master the art of data visualization.

Where Can You Learn This Key Skill?

Here’s a list of free places to learn data viz.

4. SQL & Database Management

SQL is a Structured Query Language. Data scientists use SQL to work with SQL databases as well as manage databases and perform data storage tasks.

What is SQL and Database Management?

SQL is a very popular language that lets you access and manipulate structured data. It goes hand in hand with database management, which is commonly done in SQL. Database management is basically how you can organize, store, and fetch data from a place. SQL databases are one of the top backend technologies to learn in 2023, so it’s not just for data science.

Why Does It Matter in Becoming a Data Scientist in 2023?

As a data scientist, you have to keep track of all the data, make sure it's organized, and retrieve it when someone needs it. That’s what SQL and database management let you do.

Where Can You Learn This Key Skill?

Coursera has a ton of great, well-priced database management/admin courses you can try. You can also get a sneak preview of some SQL interview questions here, which can be useful for testing your knowledge.

5. Big Data Processing

Big data is a buzzword, yes, but it’s also a real concept — Oracle defines it as “data that contains greater variety, arriving in increasing volumes and with more velocity,” or data with the three V’s.

What is Big Data Processing?

Big data processing is the ability to process, store, and analyze large amounts of data using technologies like Hadoop and Spark.

Why Does It Matter in Becoming a Data Scientist in 2023?

In 2023, the ability to process big data is critical for data scientists. The volume of data being generated continues to grow at an exponential rate, and being able to handle and analyze this data effectively is essential for making informed decisions and gaining valuable insights. Data scientists who have a deep understanding of big data processing techniques will be able to work with large data sets with ease and make the most out of the information they contain.

Also, thanks to its buzz-wordiness, it never hurts to whack “big data” on your resume.

Where Can You Learn it?

I love Simplilearn’s YouTube tutorial series on this concept.

6. Cloud Computing Top 19 Skills You Need to Know in 2023 to Be a Data Scientist
Image by Author
It’s funny – as more products and services move into the cloud, cloud computing becomes a job requirement for pretty much every techy job, whether it’s DevOps or a data scientist.

What is Cloud Computing?

Cloud computing is the use of cloud-based technologies and platforms like AWS, Azure, or Google Cloud to store and process data. It’s kind of like having a virtual storage room that you can access from anywhere at any time. Instead of storing data and computing resources on local machines or servers, cloud computing allows organizations – and data scientists – to access these resources through the internet.

Why Does It Matter in Becoming a Data Scientist in 2023?

As I keep highlighting, the amount of data you’re expected to work with as a data scientist is growing. More companies will be sticking it in the cloud rather than dealing with it on-prem. It's becoming increasingly important to have the ability to store and process this data in a scalable and efficient manner.

Cloud computing provides an effective solution for this, allowing data scientists to access vast amounts of computing resources and data storage without needing pricy hardware and infrastructure.

Where Can You Learn It?

The good news is because companies own various clouds, many of them have a vested interest in teaching you about it for free, so you learn to use theirs. Google, Microsoft, and Amazon all have great cloud computing resources.

7. Data Warehousing & ETL

“Wait, didn’t we just cover databases? What’s a data warehouse?” I hear you ask.

I get you. Sometimes it feels like the most critical data science skill is keeping all the acronyms and jargon straight.

What Are Data Warehousing and ETL?

First, let’s differentiate data warehouses from databases.

Warehouses store current and historical data for multiple systems, while databases store current data needed to power a project. A database stores the current data required to power an application whereas a data warehouse stores current and historical data for one or more systems in a predefined and fixed schema to analyze the data.

In short, you’d use a data warehouse for data for lots of different projects together, whereas a database mostly stores one single project’s data.

ETL is a process that involves data warehousing, short for extract, transform, and load. An ETL tool will extract data from any data source systems you want, transform it in the staging area (usually cleaning, manipulating, or “munging” it), and then load it into a data warehouse.

Why Does It Matter in Becoming a Data Scientist in 2023?

I feel like I’ve repeated this point in every skill, but data is growing. Companies are hungry for it, and they’ll expect you to manage it. Knowing how to manage data in buildable pipelines is critical.

Where Can You Learn It?

I recommend learning how to do a proper ETL with a specific language, like SQL or Python. Datacamp has got a good one with Python. Microsoft runs a more intermediate-level tutorial to go through a SQL option.

8. Data Modeling & Management

Every data scientist is a model specialist. I’m not talking about Giselle Bundchen. I mean creating a model of how data is stored and organized in a system.

What is Data Modeling And Management?

Data modeling and management is the process of creating mathematical models to represent data, as well as the management of data to maintain its quality, accuracy, and usefulness.

This involves defining data entities, relationships, and attributes, as well as implementing processes for data validation, integrity, and security.

In simpler terms, data modeling basically means you’re creating a blueprint for how data is organized and connected in your employer’s systems. You can think of it like drafting a blueprint of a house. Just like a blueprint shows the different rooms and how they're connected, data modeling shows how different pieces of information are related and connected to each other.

This helps ensure that data is stored and used in a consistent and effective way.

Why Does It Matter in Becoming a Data Scientist in 2023?

As a data scientist, you’ll be responsible for making sure data is organized and structured in an accessible way. Data modeling and management help you work with data, share it, make sure it’s accurate, and make decisions based on it.

Where Can You Learn It?

Microsoft has a good intro on their blog, just half an hour long and highly rated. It’s a good place to start.

9. Data Mining .Top 19 Skills You Need to Know in 2023 to Be a Data Scientist
Image byt Author

Many data science terms have just been robbed from other professions, like modeling and mining. Let’s get into what it means and why it matters.

What is Data Mining?

Data mining is the process of extracting useful information from data through techniques like clustering, classification, and association rules. You’re sifting through the veritable flood of data to find useful golden nuggets. (Maybe data panning would have been a better name for this skill!)

Why Does It Matter in Becoming a Data Scientist in 2023?

Imagine it: you’re a data scientist in 2023. You have data coming in from ten thousand different sources. What skill do you use to identify patterns across all these data fountains?

It’s data mining.

Where Can You Learn It?

Data mining is typically covered in courses that cover big data or data analytics since it’s a pretty critical component of those two skills. EdX offers a couple of options to learn data mining.

10. Deep Learning

Deep learning is subtly different from machine learning! Deep learning is a subfield of machine learning.

What is Deep Learning?

Deep learning is a facet of machine learning that focuses on creating algorithms that can learn patterns in data through multiple layers of artificial neural networks. (Artificial neural networks, by the way, are a type of machine learning algorithm modeled to be similar to the structure and function of the human brain.)

Why Does It Matter in Becoming a Data Scientist in 2023?

Artificial intelligence is getting more sophisticated in 2023. It’s not enough to know the basics of AI and ML – you should be familiar with the cutting edge, too, because it won’t be cutting edge tomorrow. Deep learning was novel a few years ago, and now it’s a necessity.

Data scientists will be expected to use deep learning when companies have access to a truly vast amount of data. It’s used for image and video processing, or computer vision applications.

Where can you learn it?

I like Simplilearn’s tutorial as a starting point.

What Other Skills Do You Need to Know to Become a Data Scientist in 2023?

There are plenty of up-and-coming technologies and techniques that are useful to know. These are either even more advanced, like generative adversarial networks, or more soft-skills-based, like data storytelling, or specialized to a field like time series forecasting. I’ll briefly summarize these here:

  • Natural Language Processing (NLP): A subfield of AI that handles processing and understanding of human language. Chatbots use this.
  • Time Series Analysis & Forecasting: The study of data over time and the use of statistical models to make predictions about future events. You might use this skill to do sales or revenue analysis.
  • Experimental Design & A/B Testing: The process of designing and conducting controlled experiments to test hypotheses and make decisions based on data.
  • Data Storytelling: The ability to effectively communicate data insights and findings to non-technical stakeholders. More and more stakeholders are taking an interest in the why behind data-based decisions, so this is critical.
  • Generative Adversarial Networks (GANs): A type of deep learning architecture where two neural networks are trained to work together to generate new data that resembles a given dataset.
  • Transfer Learning: A machine learning technique where a model is pre-trained on one task and is fine-tuned on a related task, improving performance and reducing the amount of training data needed. Smaller companies that are more resource-limited will find this useful.
  • Automated Machine Learning (AutoML): A method of automating the process of selecting, training, and deploying machine learning models.
  • Hyperparameter Tuning: Another ML subcategory. This is the process of optimizing the performance of a machine learning model by adjusting the parameters that are not learned from the data, such as the learning rate or the number of hidden layers.
  • Explainable AI (XAI): A branch of AI focused on creating algorithms and models that are transparent and interpretable, so their decision-making processes can be understood by humans. Again, helping stakeholders understand what’s happening.

If you want to be a data scientist in 2023, these 19 skills are absolutely critical. The really great news is that many of these skills can be self-taught, while others you can pick up while working in a more junior-level role like a data or business analyst.

A few ways to learn:

  • Always check YouTube. There are so many free, comprehensive resources. I’ve listed a few here, but there are practically infinite videos out there.
  • Platforms like Coursera and EdX often have lecture series
  • We’ve got over a thousand real interview questions to practice on, both coding-based and non-coding. We also offer data project examples.

Enjoy the journey of learning these skills to become a data scientist in 2023.
Nate Rosidi is a data scientist and in product strategy. He's also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Connect with him on Twitter: StrataScratch or LinkedIn.

More On This Topic

  • Modern Data Science Skills: 8 Categories, Core Skills, and Hot Skills
  • Top 13 Skills That Every Data Scientist Should Have
  • 7 Most Recommended Skills to Learn to be a Data Scientist
  • KDnuggets™ News 20:n34, Sep 9: Top Online Data Science Masters…
  • 3 Valuable Skills That Have Doubled My Income as a Data Scientist
  • 5 Machine Learning Skills Every Machine Learning Engineer Should Know in…

This app uses generative AI to turn your iPhone videos into new content

Runway app demo

AI image generators such as DALL-E make it quick and easy for you to create any image from a simple prompt. AI video generation, on the other hand, is still in its early stages.

Now, a new app by AI startup Runway provides a tantalising glimpse of what the future of AI video generation might look like.

Also: Yelp uses AI to upgrade app features for foodies and thrill seekers

Runway launched its Runway ML iOS app on Tuesday, which lets users take video shot on the app or that's already in their photo gallery and transform it into an entirely new video using any prompts, images, or presets you pick.

For example, the demo Runway posted on Twitter shows a video of a dog being transformed it into a video of an artsy-looking cat.

Also: How AI-generated content will revolutionize the ways we work

The results from Runway can sometimes look warped or disfigured. However, it should be remembered that those kinds of unexpected outputs are common with generative AI and AI video generation, in particular, is still in its early phases.

Runway's video-to-video technology, called Gen-1, launched in February and has been available for use on desktop. The app version streamlines the process and makes it easier to generate new videos than on the desktop.

Gen-2, Runway's text-to-video technology, is also teased on the app, with a tag that says "coming soon".

Also: This new AI service wants to make your video call a little bit easier

The Runway ML iOS app is free to download and is available on Apple's App Store, where it already ranks number 19 in the Photo & Video category.

Despite being free, the app does allow for in-app purchases for different subscription plans that grant you more credits.

On the Free plan, users get a limit of 125 credits, without the option to purchase more, and are granted three video projects with 720p video exports. For context, each second of video uses 14 credits.

Also: The 5 biggest risks of generative AI, according to an expert

The Standard plan costs $143.99 per year and grants users 625 credits, plus the option to buy more, unlimited video projects, watermark removals, 1080p exports, and more.

Lastly, the Pro plan costs $334.99 and gives users 2,250 credits per month, plus the option to buy more, unlimited video projects, exports in ProRes, 500GB of assets, and more premium perks.

Artificial Intelligence

Intuit’s shift

Intuit’s shift

The growing pains (and promise) of embracing AI when you're a legacy financial software giant

Jagmeet Singh 8 hours

If artificial intelligence feels like it is taking over the world, that feeling has become an anxious obsession among larger tech companies.

These are organizations that have typically thrived secure in the knowledge that they can keep a competitive edge through their very DNA: leveraging software to build more interesting, faster and disruptive products.

Now, a big question hangs above all of them: will the evolution of AI supercharge that whole model, or will it disrupt them, the previous generation’s disruptors?

Intuit, the U.S. financial and accounting software behemoth, is among those sincerely hoping that it lands in the first of those camps. It may not have been on the forefront of developing generative AI, the plat du jour in tech, but it happens to have registered 700 AI-related patents covering areas like natural language processing and machine learning.

And in its artillery is also the “oil” that many AI machines need to work: a trove of data and data infrastructure — including 730 million customer interactions and 58 billion machine learning predictions daily over the last five years — that it believes will help catapult it into a strong position in the future.

But as you might guess, it hasn’t been a fully smooth ride.

Intuit’s move, led by CEO Sasan Goodarzi, to becoming an AI-driven company has been well conveyed. Some of that has not been easy. In early 2022, the company laid off over 700 employees, people working in less critical roles, adding a similar number of new roles, to mark a shift toward its AI plan.

And big bets like Credit Karma — a high-profile acquisition for Intuit, buying it for $7.1 billion in February 2020 — saw “revenue challenges” that resulted in a halt in hiring last year.

Daniel Jaster, director of equity research for software at BMO Capital Markets, told TechCrunch that while Intuit has framed the strategic vision as an “AI-driven expert platform”, its actual execution of that has not been quite so revolutionary. Intuit is currently focused on using AI and machine learning to augment the platform — not to build a full-scale replacement for financial experts in its tax and accounting software business.

“In the near-term, similar to most companies, Intuit as an overall organization is attempting to drive efficient growth in a macro environment which has become more challenging for many of their customers,” he said. “But like many companies of its size, the factors impacting their success vary depending on the business unit.”

On top of all this, Intuit has also a reputation of pushing against disruption in areas that might threaten its business. Specifically, it’s known for its strong lobbying efforts against free tax services in the U.S. Now, it looks like that is a hot potato Intuit could end up eating one way or the other: win the argument against free tax services; or win the business by making more compelling paid products that use AI.

And that is the point: there remains still a lot of potential, and now that having an AI strategy is a must for tech businesses being scrutinized in the public markets, the company’s AI tune is getting louder. Intuit’s chief data officer Ashok Srivastava believes company has the right mindset, based on the strategy that it set several years ago to position itself as an “AI-driven expert platform.”

That is to say, Mountain View-based Intuit’s portfolio of services — which includes platforms to manage personal finance (Credit Karma), marketing automation (MailChimp), accounting (QuickBooks) and income tax returns (TurboTax) — all have had, and will be getting, more of the AI treatment.

“The AI that we build is built on a very vast data infrastructure that we’ve been creating over the last many years,” said Srivastava in an interview. “This enables real-time data, enables streaming use cases, batch processing, all the manipulation, all of the storage. And most importantly, the transmission of clean data happens in this infrastructure layer.”

Srivastava joined Intuit in 2017 after cutting his teeth at IBM, Sama and NASA, where he handled technical and financial projects. He told TechCrunch that the goal with the AI strategy is to deliver “more money”, “reduce work” and “complete confidence” to customers.

If data is the new oil, then I would argue that AI is the new electricity.

“Very few companies have the customers, the scale of data that we have, and the expertise that we have,” he said. “There will always be competitors. But we’re singularly focused on the customer’s needs.”

He believes that companies without AI capability will not survive over time. “If data is the new oil, then I would argue that AI is the new electricity,” he said.

Generative AI behind the scenes

AI — both the talent to build it and work with it, as well as the tech itself — has been in demand for some time, but generative AI, which essentially works as a branch of AI that uses a large amount of data to create machine-generated content such as text, image and video, has taken interest and demand to new levels.

Various consumer-focused companies have started deploying generative AI on top of their products and services to attract customers and increase efficiency, and simply to ride the wave of hype in the space.

Companies like OpenAI have opened the door to the idea of a “general” AI capable of addressing any and all scenarios. But before that, companies like Intuit have been working on generative AI well before that to address more narrow use cases, according to Srivastava.

One example Srivastava shared: live services within TurboTax and QuickBooks that allow customers to interact with human experts. Instead of letting the experts type notes as the customer is talking to understand their problem and offer them solutions, automatic transcripts are generated to keep a record of the conversation.

After that, a generative AI capability summarizes the conversation into a few key steps to save everyone time at the end of the chat.

Intuit also uses an AI-native experience to optimize data flows based on the financial transaction data it gets from customers.

“We at Intuit deal with an enormous amount of financial transaction data. And as this data comes into our platform, we need to automatically categorize that data into different buckets for our customer,” said Srivastava. “What makes this problem hard is each customer has a different need for categorization. It’s completely personal… We actually reinvented this entire experience with artificial intelligence. We are building one model per customer, so that as this data flows in, each model does that optimization.”

India playing a critical role in AI strategy

Intuit has its second-largest team in India after the U.S., with over 1,600 employees working from Bengaluru. These employees are primarily tasked to help build solutions for global markets.

“India plays a critical role in our artificial intelligence strategy,” said Srivastava. “​​One of the unique things about India is that it’s showing a tremendous opportunity in terms of the talent and the skills that we have here.”

Indian engineers at Intuit work on AI, data and other capabilities that are being built into payroll and other parts of the company’s stack. Srivastava told TechCrunch that one of the key developments coming from the South Asian nation is an AI model for the mid-market.

Stating how the Indian engineers help the company fulfill its plan to become an AI-driven player, Srivastava said he saw the local team developing a solution that includes deep statistics and machine learning to converse with customers and generative narratives. The solution is currently showing results using templates, but those will soon be converted into generative AI.

“There are other areas in AI that the team has focused on, including modeling, churn and other aspects,” Srivastava said.

Although India has been an important market from the engineering perspective, Intuit does not consider the country a significant consumer market. It discontinued QuickBooks for Indian customers in January this year. However, the company, which celebrated its 18th anniversary in the country this year, has committed to continue to support and invest in Indian talent.

Pivoting from one model to another

It is not the first time that Intuit has planned a pivot from a legacy software vendor. The company has a history of going through a number of shifts to stay relevant in the market. That includes recent acquisitions, such as those of Credit Karma and MailChimp, which were made to expand to new kinds of business verticals.

Those moves have already helped the company see 32% year-on-year growth in revenue to $12.7 billion in the fiscal year 2022. Now, the ongoing shift toward AI is expected to expand the company’s growth further.

“Firms like Intuit could benefit from implementing the multiple pillars of AI that include descriptive analytics, predictive analytics, causal analytics and prescriptive analytics,” said Anindya Ghose, Heinz Riehl Chair Professor of Business at New York University’s Leonard N. Stern School of Business.

“They will need to invest in the right infrastructure and resources to determine which of these four pillars are most relevant to their lines of business. They need to identify complementarities between human intelligence and machine intelligence.”

Nonetheless, Intuit has challenges while looking to grow as an AI-driven company.

Steve Enders, a software research analyst at Citi Research, told TechCrunch that one of the biggest challenges for the company in the consumer segment is the growing advancements coming from generative AI. A recent GPT-4 demo suggested that the solution would be ready to complete users’ taxes using its advanced algorithms.

There has also been that nascent concern about competition in the overall tax market with reports of the U.S. expanding its free tax service, the analyst said referring a significant challenge for Intuit’s TurboTax.

When asked about TurboTax, free services, and whether AI would ever play a role in a new product from Intuit, Srivastava avoided a direct answer, claiming that AI in any form is intrinsic value to the company.

“Sometimes there are prominent uses of AI, sometimes there are not. But either way, people know that we’re working to deliver value for them. And that’s what we’re known for, something I’m very proud of,” he said.

Enders also pointed out that in the SMB segment, where Intuit targets QuickBooks, the company was moving into the mid-market, as well as trying to become more critical to its existing SMB customers by providing a broader portfolio of software solutions.

In the credit segment where the company has Credit Karma, the analyst said the biggest challenge is macro-related, with tightened lending standards from banks leading to fewer financial products being sold to consumers.

“I think there is always some risk of potential new competitors coming in, but Intuit should have a data and training set advantage given their dominant share in the tax and SMB accounting world,” he said. “Speed to market for some up and coming solutions may matter a bit more, particularly for CRM/marketing solutions in the SMB segment, but that is still a work in progress.” This is where those “revenue challenges” have also played out.

Jaster at BMO said that in the long term, it would be Intuit’s ability to harness its platforms to deepen relationships with customers and drive retention across the customer journey.

As the AI revolution marches on, it will be exciting to see how Intuit continues to transform its business in the years to come and uses its resources from markets including India to develop competitive solutions effectively for the future.

Generative AI could transform the way we interact with enterprise software

ChatGLM-6B: A Lightweight, Open-Source ChatGPT Alternative

ChatGLM-6B: A Lightweight, Open-Source ChatGPT Alternative
Image by Author

Recently we’ve all been having a super hard time catching up on the latest releases in the LLM space. In the last few weeks, several open-source ChatGPT alternatives have become popular.

And in this article we’ll learn about the ChatGLM series and ChatGLM-6B, an open-source and lightweight ChatGPT alternative.

Let's get going!

What is ChatGLM?

Researchers at the Tsinghua University in China have worked on developing the ChatGLM series of models that have comparable performance to other models such as GPT-3 and BLOOM.

ChatGLM is a bilingual large language model trained on both Chinese and English. Currently, the following models are available:

  • ChatGLM-130B: an open-source LLM
  • ChatGLM-100B: not open-sourced, but available through invite-only access
  • ChatGLM-6B: a lightweight open-source alternative

Though these models may seem similar to the Generative Pretrained Transformer (GPT) group of large language models, the General Language Model (GLM) pretraining framework is what makes them different. We’ll learn more about this in the next section.

How Does ChatGLM Work?

In machine learning, you'd know GLMs as generalized linear models, but the GLM in ChatGLM stands for General Language Model.

GLM Pretraining Framework

LLM pre training has been extensively studied and is still an area of active research. Let’s try to understand the key differences between GLM pretraining and GPT-style models.

The GPT-3 family of models use decoder-only auto regressive language modeling. In GLM, on the other hand, optimization of the objective is formulated as an auto regressive blank infilling problem.

ChatGLM-6B: A Lightweight, Open-Source ChatGPT Alternative
GLM | Image Source

In simple terms, auto regressive blank infilling involves blanking out a continuous span of text, and then sequentially reconstructing the text this blanking. In addition to shorter masks, there is a longer mask that randomly removes long blanks of text from the end of sentences. This is done so that the model performs reasonably well in natural language understanding as well as generation tasks.

Another difference is in the type of attention used. The GPT group of large language models use unidirectional attention, whereas the GLM group of LLMs use bidirectional attention. Using bidirectional attention over unmasked contexts can capture dependencies better and can improve performance on natural language understanding tasks.

GELU Activation

In GLM, GELU (Gaussian Error Linear Units) activation is used instead of the ReLU activation [1].

ChatGLM-6B: A Lightweight, Open-Source ChatGPT Alternative
GELU, ReLU, and ELU Activations | Image Source

The GELU activation and has non-zero values for all inputs and has the following form [3]:

ChatGLM-6B: A Lightweight, Open-Source ChatGPT Alternative

The GELU activation is found to improve performance in as compared to ReLU activations, though computationally more intensive than ReLU.

In the GLM series of LLMs, ChatGLM-130B which is open-source and performs as well as GPT-3’s Da-Vinci model. As mentioned, as of writing this article, there's a ChatGLM-100B version, which is restricted to invite-only access.

ChatGLM-6B

The following details about ChatGLM-6B to make it more accessible to end users:

  • Has about 6.2 billion parameters.
  • The model is pre-trained on 1 trillion tokens—equally from English and Chinese.
  • Subsequently, techniques such as supervised fine-tuning and reinforcement learning with human feedback are used.

Advantages and Limitations of ChatGLM

Let’s wrap up our discussion by going over ChatGLM’s advantages and limitations:

Advantages

From being a bilingual model to an open-source model that you can run locally, ChatGLM-6B has the following advantages:

  • Most mainstream large language models are trained on large corpora of English text, and large language models for other languages are not as common. The ChatGLM series of LLMs are bilingual and a great choice for Chinese. The model has good performance in both English and Chinese.
  • ChatGLM-6B is optimized for user devices. End users often have limited computing resources on their devices, so it becomes almost impossible to run LLMs locally—without access to high-performance GPUs. With INT4 quantization, ChatGLM-6B can run with a modest memory requirement of as low as 6GB.
  • Performs well on a variety of tasks including summarization and single and multi-query chats.
  • Despite the substantially smaller number of parameters as compared to other mainstream LLMs, ChatGLM-6B supports context length of up to 2048.

Limitations

Next, let’s list a few limitations of ChatGLM-6B:

  • Though ChatGLM is a bilingual model, its performance in English is likely suboptimal. This can be attributed to the instructions used in training mostly being in Chinese.
  • Because ChatGLM-6B has substantially fewer parameters as compared to other LLMs such as BLOOM, GPT-3, and ChatGLM-130B, the performance may be worse when the context is too long. As a result, ChatGLM-6B may give inaccurate information more often than models with a larger number of parameters.
  • Small language models have limited memory capacity. Therefore, in multi-turn chats, the performance of the model may degrade slightly.
  • Bias, misinformation, and toxicity are limitations of all LLMs, and ChatGLM is susceptible to these, too.

Conclusion

As a next step, run ChatGLM-6B locally or try out the demo on HuggingFace spaces. If you’d like to delve deeper into the working of LLMs, here's a list of free courses on large language models.

References

[1] Z Du, Y Qian et al., GLM: General Language Model Pretraining with Autoregressive Blank Infilling, ACL 2022

[2] A Zheng, X Liu et al., GLM-130B — An Open Bilingual Pretrained Model, ICML 2023

[3] D Hendryks, K Gimpel, Gaussian Error Linear Units (GELUs), arXiv, 2016

[4] ChatGLM-6B: Demo on HuggingFace Spaces

[5] GitHub Repo
Bala Priya C is a technical writer who enjoys creating long-form content. Her areas of interest include math, programming, and data science. She shares her learning with the developer community by authoring tutorials, how-to guides, and more.

More On This Topic

  • OpenChatKit: Open-Source ChatGPT Alternative
  • 8 Open-Source Alternative to ChatGPT and Bard
  • Dolly 2.0: ChatGPT Open Source Alternative for Commercial Use
  • MiniGPT-4: A Lightweight Alternative to GPT-4 for Enhanced Vision-language…
  • The 7 Best Open Source AI Libraries You May Not Have Heard Of
  • Top Open Source Large Language Models

How to Build a Scalable Data Architecture with Apache Kafka

How to Build a Scalable Data Architecture with Apache Kafka
Image by Author

Apache Kafka is a distributed message-passing system that works on a publisher-subscriber model. It is developed by Apache Software Foundation and written in Java and Scala. Kafka was created to overcome the problem faced by the distribution and scalability of traditional message-passing systems. It can handle and store large volumes of data with minimal latency and high throughput. Due to these benefits, it can be suitable for making real-time data processing applications and streaming services. It is currently open-source and used by many organisations like Netflix, Walmart and Linkedin.

A Message Passing System makes several applications send or receive data from each other without worrying about data transmission and sharing. Point-to-Point and Publisher-Subscriber are two widespread message-passing systems. In point-to-point, the sender pushes the data into the queue, and the receiver pops from it like a standard queue system following FIFO(first in, first out) principle. Also, the data gets deleted once it gets read, and only a single receiver is allowed at a time. There is no time dependency laid for the receiver to read the message.

How to Build a Scalable Data Architecture with Apache Kafka
Fig.1 Point-to-Point Message System | Image by Author

In the Publisher-Subscriber model, the sender is termed a publisher, and the receiver is termed a subscriber. In this, multiple senders and receivers can read or write data simultaneously. But there is a time dependency in it. The consumer has to consume the message before a certain amount of time, as it gets deleted after that, even if it didn’t get read. Depending on the user's configuration, this time limit can be a day, a week, or a month.

How to Build a Scalable Data Architecture with Apache Kafka
Fig.2 Publisher-Subscriber Message System | Image by Author Kafka Architecture

Kafka architecture consists of several key components:

  1. Topic
  2. Partition
  3. Broker
  4. Producer
  5. Consumer
  6. Kafka-Cluster
  7. Zookeeper

How to Build a Scalable Data Architecture with Apache Kafka
Fig.3 Kafka Architecture | Image by ibm-cloud-architecture

Let’s briefly understand each component.

Kafka stores the messages in different Topics. A topic is a group that contains the messages of a particular category. It is similar to a table in a database. A topic can be uniquely identified by its name. We cannot create two topics with the same name.

The topics are further classified into Partitions. Each record of these partitions is associated with a unique identifier termed Offset, which denotes the position of the record in that partition.

Other than this, there are Producers and Consumers in the system. Producers write or publish the data in the topics using the Producing APIs. These producers can write either on the topic or partition levels.

Consumers read or consume the data from the topics using the Consumer APIs. They can also read the data either at the topic or partition levels. Consumers who perform similar tasks will form a group known as the Consumer Group.

There are other systems like Broker and Zookeeper, which run in the background of Kafka Server. Brokers are the software that maintains and keeps the record of published messages. It is also responsible for delivering the right message to the right consumer in the correct order using offsets. The set of brokers collectively communicating with each other can be called Kafka clusters. Brokers can be dynamically added or removed from the Kafka cluster without facing any downtime in the system. And one of the brokers in the Kafka cluster is termed a Controller. It manages states and replicas inside the cluster and performs administrative tasks.

On the other hand, Zookeeper is responsible for maintaining the health status of the Kafka cluster and coordinating with each broker of that cluster. It maintains the metadata of each cluster in the form of key-value pairs.

This tutorial is mainly focused on the practical implementation of Apache Kafka. If you want to read more about its architecture, you can read this article by Upsolver.

Taxi Booking App: A Practical Use Case

Consider the use case of a taxi booking service like Uber. This application uses Apache Kafka to send and receive messages through various services like Transactions, Emails, Analytics, etc.

How to Build a Scalable Data Architecture with Apache Kafka
Fig.4 Architecture of the Taxi App | Image by Author

The architecture consists of several services. The Rides service receives the ride request from the customer and writes the ride details on the Kafka Message System.

Then these order details were read by the Transaction service, which confirms the order and payment status. After confirming that ride, this Transaction service writes the confirmed ride again in the message system with some additional details. And then finally, the confirmed ride details are read by other services like Email or Data Analytics to send the confirmation mail to the customer and to perform some analysis on it.

We can execute all these processes in real-time with very high throughput and minimum latency. Also, due to the capability of horizontal scaling of Apache Kafka, we can scale this application to handle millions of users.

Practical Implementation of the above Use Case

This section contains a quick tutorial to implement the kafka message system in our application. It includes the steps to download kafka, configure it, and create producer-consumer functions.

Note: This tutorial is based on python programming language and uses a windows machine.

Apache Kafka Downloading Steps

1.Download the latest version of Apache Kafka from that link. Kafka is based on JVM languages, so Java 7 or greater version must be installed in your system.

  1. Extract the downloaded zip file from your computer's (C:) drive and rename the folder as /apache-kafka.
  1. The parent directory contain two sub-directories, /bin and /config, which contains the executable and configuration files for the zookeeper and the kafka server.

Configuration Steps

First, we need to create log directories for the Kafka and Zookeeper servers. These directories will store all the metadata of these clusters and the messages of the topics and partitions.

Note: By default, these log directories are created inside the /tmp directory, a volatile directory that vanishes off all the data inside when the system shuts down or restarts. We need to set the permanent path for the log directories to resolve this issue. Let’s see how.

Navigate to apache-kafka >> config and open the server.properties file. Here you can configure many properties of kafka, like paths for log directories, log retention hours, number of partitions, etc.

Inside the server.properties file, we have to change the path of the log directory's file from the temporary /tmp directory to a permanent directory. The log directory contains the generated or written data in the Kafka Server. To change the path, update the log.dirs variable from /tmp/kafka-logs to c:/apache-kafka/kafka-logs. This will make your logs stored permanently.

log.dirs=c:/apache-kafka/kafka-logs

The Zookeeper server also contains some log files to store the metadata of the Kafka servers. To change the path, repeat the above step, i.e open zookeeper.properties file and replace the path as follows.

dataDir=c:/apache-kafka/zookeeper-logs

This zookeeper server will act as a resource manager for our kafka server.

Run the Kafka and Zookeeper Servers

To run the zookeeper server, open a new cmd prompt inside your parent directory and run the below command.

$ .binwindowszookeeper-server-start.bat .configzookeeper.properties

How to Build a Scalable Data Architecture with Apache Kafka
Image by Author

Keep the zookeeper instance running.

To run the kafka server, open a separate cmd prompt and execute the below code.

$ .binwindowskafka-server-start.bat .configserver.properties

Keep the kafka and zookeeper servers running, and in the next section, we will create producer and consumer functions which will read and write data to the kafka server.

Creating Producer & Consumer Functions

For creating the producer and consumer functions, we will take the example of our e-commerce app that we discussed earlier. The `Orders` service will function as a producer, which writes order details to the kafka server, and the Email and Analytics service will function as a consumer, which reads that data from the server. The Transaction service will work as a consumer as well as a producer. It reads the order details and writes them back again after transaction confirmation.

But first, we need to install the Kafka python library, which contains inbuilt functions for Producer and Consumers.

$ pip install kafka-python

Now, create a new directory named kafka-tutorial. We will create the python files inside that directory containing the required functions.

$ mkdir kafka-tutorial  $ cd .kafka-tutorial

Producer Function:

Now, create a python file named `rides.py` and paste the following code into it.

rides.py

import kafka  import json  import time  import random    topicName = "ride_details"  producer = kafka.KafkaProducer(bootstrap_servers="localhost:9092")    for i in range(1, 10):      ride = {          "id": i,          "customer_id": f"user_{i}",          "location": f"Lat: {random.randint(-90, 90)}, Long: {random.randint(-90, 90)}",      }      producer.send(topicName, json.dumps(ride).encode("utf-8"))      print(f"Ride Details Send Succesfully!")      time.sleep(5)

Explanation:

Firstly, we have imported all the necessary libraries, including kafka. Then, the topic name and a list of various items are defined. Remember that topic is a group that contains similar types of messages. In this example, this topic will contain all the orders.

Then, we create an instance of a KafkaProducer function and connect it to the kafka server running on the localhost:9092. If your kafka server is running on a different address and port, then you must mention the server’s IP and port number there.

After that, we will generate some orders in JSON format and write them to the kafka server on the defined topic name. Sleep function is used to generate a gap between the subsequent orders.

Consumer Functions:

transaction.py

import json  import kafka  import random    RIDE_DETAILS_KAFKA_TOPIC = "ride_details"  RIDES_CONFIRMED_KAFKA_TOPIC = "ride_confirmed"    consumer = kafka.KafkaConsumer(      RIDE_DETAILS_KAFKA_TOPIC, bootstrap_servers="localhost:9092"  )  producer = kafka.KafkaProducer(bootstrap_servers="localhost:9092")    print("Listening Ride Details")  while True:      for data in consumer:          print("Loading Transaction..")          message = json.loads(data.value.decode())          customer_id = message["customer_id"]          location = message["location"]          confirmed_ride = {              "customer_id": customer_id,              "customer_email": f"{customer_id}@xyz.com",              "location": location,              "alloted_driver": f"driver_{customer_id}",              "pickup_time": f"{random.randint(1, 20)}mins",          }          print(f"Transaction Completed..({customer_id})")          producer.send(              RIDES_CONFIRMED_KAFKA_TOPIC, json.dumps(confirmed_ride).encode("utf-8")          )

Explanation:

The transaction.py file is used to confirm the transitions made by the users and assign them a driver and estimated pickup time. It reads the ride details from the kafka server and writes it again in the kafka server after confirming the ride.

Now, create two python files named email.py and analytics.py, which are used to send emails to the customer for their ride confirmation and to perform some analysis respectively. These files are only created to demonstrate that even multiple consumers can read the data from the Kafka server simultaneously.

email.py

import kafka  import json    RIDES_CONFIRMED_KAFKA_TOPIC = "ride_confirmed"  consumer = kafka.KafkaConsumer(      RIDES_CONFIRMED_KAFKA_TOPIC, bootstrap_servers="localhost:9092"  )    print("Listening Confirmed Rides!")  while True:      for data in consumer:          message = json.loads(data.value.decode())          email = message["customer_email"]          print(f"Email sent to {email}!")

analysis.py

import kafka  import json    RIDES_CONFIRMED_KAFKA_TOPIC = "ride_confirmed"  consumer = kafka.KafkaConsumer(      RIDES_CONFIRMED_KAFKA_TOPIC, bootstrap_servers="localhost:9092"  )    print("Listening Confirmed Rides!")  while True:      for data in consumer:          message = json.loads(data.value.decode())          id = message["customer_id"]          driver_details = message["alloted_driver"]          pickup_time = message["pickup_time"]          print(f"Data sent to ML Model for analysis ({id})!")

Now, we have done with the application, in the next section, we will run all the services simultaneously and check the performance.

Test the Application

Run each file one by one in four separate command prompts.

$ python transaction.py
$ python email.py
$ python analysis.py
$ python ride.py

How to Build a Scalable Data Architecture with Apache Kafka
Image by Author

You can receive output from all the files simultaneously when the ride details are pushed into the server. You can also increase processing speed by removing the delay function in the rides.py file. The `rides.py` file pushed the data into the kafka server, and the other three files simultaneously read that data from the kafka server and function accordingly.

I hope you get a basic understanding of Apache Kafka and how to implement it.

Conclusion

In this article, we have learnt about Apache Kafka, its working and its practical implementation using a use case of a taxi booking app. Designing a scalable pipeline with Kafka requires careful planning and implementation. You can increase the number of brokers and partitions to make these applications more scalable. Each partition is processed independently so that the load can be distributed among them. Also, you can optimise the kafka configuration by setting the size of the cache, the size of the buffer or the number of threads.

GitHub link for the complete code used in the article.

Thanks for reading this article. If you have any comments or suggestions, please feel free to contact me on Linkedin.
Aryan Garg is a B.Tech. Electrical Engineering student, currently in the final year of his undergrad. His interest lies in the field of Web Development and Machine Learning. He have pursued this interest and am eager to work more in these directions.

More On This Topic

  • Build a synthetic data pipeline using Gretel and Apache Airflow
  • How to Use Kafka Connect to Create an Open Source Data Pipeline for…
  • Building a Scalable ETL with SQL + Python
  • Distributed and Scalable Machine Learning [Webinar]
  • Generalized and Scalable Optimal Sparse Decision Trees(GOSDT)
  • Building Massively Scalable Machine Learning Pipelines with Microsoft…

The Future of Work: How AI is Changing the Job Landscape

The Future of Work: How AI is Changing the Job Landscape
Image by Editor

If you just think about the last 5 years alone, how your conversations have changed between your family and friends. Some of you may not speak about technology at all, but we can admit that it's hard not to consider it is around us.

The recent release of ChatGPT and now GoogleBard are taking the world by storm with their amazing capabilities. You start to look at these tools and figure out how they can improve your work life, the company's process, your personal life, and more.

Artificial Intelligence is automating tasks that were once only capable of being done by humans. The relevance of automating certain tasks to make human life easier will only continue to grow. Some may call us lazy, some people just think it's the smarter option.

Here are a few examples of how AI is changing the job landscape:

  • Automation: Automating tasks. This is leading to a heavy loss of jobs but is also creating new opportunities for others.
  • New job opportunities: AI systems will require human workers to work alongside them, such as data scientists, and machine learning engineers.
  • A shift in skill requirements: The more AI applications are integrated, the current and new workers will need to understand more about how the system works, the engineer, the analysis phases, etc.
  • Work-life balance: The rise of AI tools is allowing more people to work remotely, part-time and/or freelance as the tasks are automated.

How is AI Automation Changing the Job Landscape?

Automation is currently the biggest reason why there is a shift in the job landscape. As more and more tasks are getting done by artificial intelligence and fewer are getting done by actual humans, you can understand why companies will start to lay off employees. The cost with employees comes with a salary, pension, health insurance, maternity/paternity leave and more. More companies are seeing employees as a loss, and automating artificial intelligence tools is their biggest breakthrough.

Below are a few sectors that have already implemented automated tasks:

Customer Service

Just a few weeks ago, the world was hit with ChatGPT and Google Bard. More and more industries are adopting AI-powered chatbots to provide customer service. With the rise of large language models, we can only expect these chatbots to get better at handling customers' queries, answering their questions, resolving problems, and even making sales.

For example, chatbots are also used in the financial industry for tasks such as signing up new applicants for insurance, Know Your Customer (KYC) and Anti Money Laundering (AML) policies and processes. The implementation of automated tooling for tasks as sensitive as these proves to use the success of AI and how it will only continue.

Data Entry

Data Entry tasks used to be manual tasks, that would be very tedious and repetitive. There were some flaws in this sector due to the tasks being very repetitive and boring and the workers were more susceptible to making errors.

AI is now able to automate data entry tasks, by extracting the data from raw files or documentation and entering it into a database.

Driving

Well, we all know about self-driving cars. More and more of them are coming to the market with companies such as Tesla, Waymo and Uber. These cars use AI computer vision to safely drive passengers from A to B, navigate roads and avoid obstacles.

Finance

As I already mentioned before, chatbots are being used in the financial industry to automate some processes and tasks such as KYC. AI is also being used to help these financial firms to analyze data, to make better current and future predictions.

The financial industry has a lot of data at its disposal. The more historical data, the better their analytical outputs will be. This unfortunately will cause a shift in the need for people to work along the AI systems, rather than the AI systems working for people.

Healthcare

For an industry where a lot of people are shocked to see the integration of AI tooling, we can only expect to see more. Healthcare industry professionals are using AI to diagnose diseases, recommend treatments through data analysis, and even perform surgery with robotics.

How AI is Creating New Jobs

As AI continues to grow, the choices for the majority of people will be to get laid off or work alongside the AI systems. This is why you naturally see a rise in data professionals, more courses on learning how to code, BootCamps and more.

You are going to naturally see more of these roles:

Data Science

Data Science is a combination of statistics, data analysis, machine learning, and artificial intelligence. Therefore, Data Scientists will be responsible for collating, preparing, cleaning, and manipulating data to identify patterns in the data and perform advanced data analysis.

Machine Learning Engineers

A Machine Learning (ML) Engineer is a programmer who is proficient in researching, building, and designing software to automate predictive models. Their role is to build Artificial Intelligence (AI) systems that consume large amounts of data to generate and develop algorithms that are capable of learning and making future predictions.

AI Trainers

AI tools are best when they’ve learned all the knowledge they need. AI trainers will be required to help teach AI systems how to perform tasks. They will also be responsible for collecting data, labelling it, and then inputting it into the AI algorithms to learn from the labelled data and ensure it produces accurate outputs.

So What Should We Expect?

It is difficult to see what the future holds, especially when artificial intelligence is in the mix. Unfortunately, we will start to see more people losing their jobs to AI, and others will be created to work alongside AI systems.

This will create a shift in people's desire to learn new skills to ensure job security. We will start to see more people learning programming languages, understanding AI and how to make use of it in sales, marketing, etc.

Along with the pandemic causing a big shift in how people work, AI has also added to it. There will be a continuation of more people working from home, and travelling the world whilst working as AI systems can automate a lot of their tasks.

With this information, I think it is important for anyone, regardless of their current job title, to be aware of how AI will cause a shift in the job landscape and how to prepare with meeting currently demanded skills.
Nisha Arya is a Data Scientist, Freelance Technical Writer and Community Manager at KDnuggets. She is particularly interested in providing Data Science career advice or tutorials and theory based knowledge around Data Science. She also wishes to explore the different ways Artificial Intelligence is/can benefit the longevity of human life. A keen learner, seeking to broaden her tech knowledge and writing skills, whilst helping guide others.

More On This Topic

  • Future Says Series | Discover the Future of AI
  • The First ML Value Chain Landscape
  • How to Manage Your Complex IT Landscape with AIOps
  • How to Grow as a Data Scientist in an Ever-Changing World
  • How Natural Language Processing Is Changing Data Analytics
  • MLOps Is Changing How Machine Learning Models Are Developed

Free eBook: 10 Practical Python Programming Tricks

Free eBook: 10 Practical Python Programming Tricks
Image by fullvector on Freepik

Are you a Python programmer who has already mastered the basics and are looking to move on to some less obvious yet useful skills to add to your repertoire?

Perhaps you already know how to use lists. Maybe you are now skilled at creating functions. Conceivably you can control your program's execution flow with ease. Perchance you posess the requisite knowledge of Python's type system and what types to use when. At this point, you just desire some more advanced Python tricks.

If this sounds like you, you might find the free ebook 10 Practical Python Programming Tricks: Boost Your Efficiency and Code Quality to be useful.

Embrace these tips to enhance your Python programming skills and stand out as a proficient developer who can create high-quality, performant applications with ease.

A product of Data Science Horizons, this free ebook covers the following topics, in an effort to help boost the reader's efficiency and code quality:

  • List Comprehensions
  • Lambda Functions
  • The Walrus Operator (Assignment Expressions)
  • Itertools Module
  • F-strings (Formatted String Literals)
  • Context Managers and the 'with' Statement
  • Generators and Generator Expressions
  • Decorators
  • Type Hints and Static Type Checking
  • Python One-Liners

As the title alludes to, along with the demonstrating of the skills listed above, this ebook focuses on both efficiency and quality of code:

Writing efficient code means optimizing your code's execution speed and minimizing resource consumption, such as memory usage. Clean coding, on the other hand, focuses on readability, maintainability, and organization. Both aspects go hand-in-hand, as efficient code is easier to understand, debug, and modify, while clean code inherently leads to better performance. By adopting the best practices outlined in this ebook, you'll be better equipped to write high-quality Python code that is not only fast and resource-efficient but also easy to understand and modify.

Take a look at the free ebook 10 Practical Python Programming Tricks: Boost Your Efficiency and Code Quality today if you are ready to take your Python programming to the next level. It may just be what you are looking for.

Matthew Mayo (@mattmayo13) is a Data Scientist and the Editor-in-Chief of KDnuggets, the seminal online Data Science and Machine Learning resource. His interests lie in natural language processing, algorithm design and optimization, unsupervised learning, neural networks, and automated approaches to machine learning. Matthew holds a Master's degree in computer science and a graduate diploma in data mining. He can be reached at editor1 at kdnuggets[dot]com.

More On This Topic

  • Free Intermediate Python Programming Crash Course
  • eBook: A Practical Guide to Using Third-Party Data in the Cloud
  • Free From MIT: Intro to Computer Science and Programming in Python
  • KDnuggets™ News 20:n35, Sep 16: Data Science Skills: Core, Emerging,…
  • Top September Stories: Free From MIT: Intro to Computer Science and…
  • 4 Tricks to Effectively Use JSON in Python

8 Open-Source Alternative to ChatGPT and Bard

8 Open-Source Alternative to ChatGPT and Bard
Image by Author 1. LLaMA

The LLaMA project encompasses a set of foundational language models that vary in size from 7 billion to 65 billion parameters. These models were training on millions of tokens, and it was training on publicly available datasets exclusively. As a result, LLaMA-13B outperforms GPT-3 (175B), and LLaMA-65B is performing similarly to the best models like Chinchilla-70B and PaLM-540B.

8 Open-Source Alternative to ChatGPT and Bard
Image from LLaMA

Resources:

  • Research Paper: LLaMA: Open and Efficient Foundation Language Models (arxiv.org)
  • GitHub: facebookresearch/llama
  • Demo: Baize Lora 7B

2. Alpaca

Stanford Alpaca claims that it can compete with ChatGPT and anyone can reproduce it in less than 600$. The Alpaca 7B is finetuned from the LLaMA 7B model on 52K instruction-following demonstrations.

8 Open-Source Alternative to ChatGPT and Bard
Training recipe | Image from Stanford CRFM

Resources:

  • Blog: Stanford CRFM
  • GitHub: tatsu-lab/stanford_alpaca
  • Demo: Alpaca-LoRA (The official demo was drop and this is a recreation of Alpaca model)

3. Vicuna

Vicuna is finetuned from the LLaMA model on user-shared conversations collected from ShareGPT. The model Vicuna-13B has achieved more than 90%* quality of OpenAI ChatGPT and Google Bard. It has also outperformed LLaMA and Stanford Alpaca models in 90% of cases. The cost of training Vicuna was around 300$.

8 Open-Source Alternative to ChatGPT and Bard
Image from Vicuna

Resources:

  • Blog post: Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality
  • GitHub: lm-sys/FastChat
  • Demo: FastChat (lmsys.org)

4. OpenChatKit

OpenChatKit: Open-Source ChatGPT Alternative is a complete tools kit for creating your chatbot. It provides instruction for training your own Instruction-tuned large language model, fine-tuning the model, extensible retrieval system for updating the bot response, and bot moderation for filtering out questions.

8 Open-Source Alternative to ChatGPT and Bard
Image from TOGETHER

As we can see, the GPT-NeoXT-Chat-Base-20B model has outperformed base mode GPT-NoeX on question and answer, extraction, and classification tasks.

Resources:

  • Blog Post: Announcing OpenChatKit — TOGETHER
  • GitHub: togethercomputer/OpenChatKit
  • Demo: OpenChatKit
  • Model card: togethercomputer/GPT-NeoXT-Chat-Base-20B

5. GPT4ALL

GPT4ALL is a community-driven project and was trained on a massive curated corpus of assistant interactions, including code, stories, depictions, and multi-turn dialogue. The team has provided datasets, model weights, data curation process, and training code to promote open-source. Furthermore, they have released quantized 4-bit versions of the model that can run on your laptop. You can even use a Python client to run the model inference.

8 Open-Source Alternative to ChatGPT and Bard
Gif from GPT4ALL

Resources:

  • Technical Report: GPT4All
  • GitHub: nomic-ai/gpt4al
  • Demo: GPT4All (non-official)
  • Model card: nomic-ai/gpt4all-lora · Hugging Face

6. Raven RWKV

Raven RWKV 7B is an open-source chatbot that is powered by the RWKV language model that produces similar results to ChatGPT. The model uses RNNs that can match transformers in quality and scaling while being faster and saving VRAM. The Raven was fine-tuned on Stanford Alpaca, code-alpaca, and more datasets.

8 Open-Source Alternative to ChatGPT and Bard
Image from Raven RWKV 7B

Resources:

  • GitHub: BlinkDL/ChatRWKV
  • Demo: Raven RWKV 7B
  • Model card: BlinkDL/rwkv-4-raven

7. OPT

OPT: Open Pre-trained Transformer Language Models is not great as ChatGPT, but it has shown remarkable capabilities for zero- and few-shot learning and Stereotypical Bias analysis. You can also integrate it with Alpa, Colossal-AI, CTranslate2, and FasterTransformer to get even better results.

Note: It is on the list because of its popularity, as it has 624,710 monthly downloads in the text generation category.

8 Open-Source Alternative to ChatGPT and Bard
Image from (arxiv.org)

Resources:

  • Research Paper: OPT: Open Pre-trained Transformer Language Models (arxiv.org)
  • GitHub: facebookresearch/metaseq
  • Demo: A Watermark for LLMs
  • Model card: facebook/opt-1.3b

8. Flan-T5-XXL

Flan-T5-XXL fine-tuned T5 models on a collection of datasets phrased as instructions. The instruction fine-tuning dramatically improves performance on a variety of model classes such as PaLM, T5, and U-PaLM. The Flan-T5-XXL model is fine-tuned on more than 1000 additional tasks covering also more languages.

8 Open-Source Alternative to ChatGPT and Bard
Image from Flan-T5-XXL

Resources:

  • Research Paper: Scaling Instruction-Fine Tuned Language Models
  • GitHub: google-research/t5x
  • Demo: Chat Llm Streaming
  • Model card: google/flan-t5-xxl

Conclusion

There are many open-source options available, and I have mentioned popular ones. The open-source chatbots and models are getting better, and in the next few months, you will see a new model that can completely overtake ChatGPT in terms of performance.

In this blog, I have provided a list of models/chatbot frameworks that can help you train and build chatbots similar to ChatGPT and GPT-4. Don’t forget to give them likes and stars.

Do let me know if you have better suggestions in the comment section. I would love to add it in the future.
Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master's degree in Technology Management and a bachelor's degree in Telecommunication Engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.

More On This Topic

  • OpenChatKit: Open-Source ChatGPT Alternative
  • ChatGPT vs Google Bard: A Comparison of the Technical Differences
  • The 7 Best Open Source AI Libraries You May Not Have Heard Of
  • Top Open Source Large Language Models
  • GitHub Copilot Open Source Alternatives
  • First Open Source Implementation of DeepMind’s AlphaTensor

My Data Science Six Months Success Story

A couple of days back, I celebrated learning data science for six months and I can tell you that it’s the best decision I made this year.

For someone who is a master of a try by error, Data science is a career I am glad I decided to try as I have never been so intentional about anything as I have been with it.

If someone had told me a year ago that I would know how to scrape and analyze tweets off Twitter and build dashboards that could tell stories about data, I would have doubted but I am amazed at the many things I have learned and the milestones I have accomplished in the past six months.

I will be sharing a couple of things I have learned in the past six months and tips that helped me stay dedicated and true to my journey in this article.

Early days

First thing I did when I decided to pursue a career in Data Science; I watched a couple of YouTube videos and researched a lot about the roadmap for a data scientist as well as the best way to learn. I wrote down the list of tools I needed to learn and looked for platforms I could learn for free. I had initially started with the IBM Data Science course on Coursera but ended up spending most of my time learning on Datacamp courtesy of a scholarship from Ingressive for good.

Another thing I did was put myself out there, for accountability's sake. Was I scared? Yes, I was very scared because this girl has tried a lot of things and stopped midway. But I was determined to see this to the end so I gave myself a timeframe, learn the basics in six months. I wasn’t even thinking about getting a job in six months, the goal was to learn a lot in six months to help me get started in this career path.

I made a post declaring my intention on Twitter and pinned it to my profile so that every day I came online and looked at my profile, it will serve as a reminder.

Another thing was to ask for God’s grace and guidance. Yes, I prayed about it and I started this journey with God. I am heavy on the GOD factor and I believe this is my secret ingredient.

Also, I mapped out a plan that worked for me, the early hours of the morning is my best time to read and it was quite easy for me because at this point I didn’t have a laptop and I was learning on my phone. My phone is the first thing I pick up every morning after saying Thank you Jesus so straight to Datacamp and straight into python.

In one month, I completed four python courses on Datacamp and I was so excited and looking forward to more.

My Data Science Six Months Success Story My Data Science Six Months Success Story My Data Science Six Months Success Story My Data Science Six Months Success Story
Statement of accomplishment for the four courses I took in the first month of my journey.
May

I mapped out plans in May to complete even more courses and I worked so hard that I was able to complete 16 Python courses on Datacamp that month.

My Data Science Six Months Success Story
My Data Science Six Months Success Story

I also got into the 20 data analytics project cohort with Octave, even though I hadn’t learned enough to build a project, I took the risk to join and learn from others.

My Data Science Six Months Success Story
20 data analytics project

At the end of May, I worked on my first project, the superstore sales dashboard. It wasn’t so great but it was a stepping stone for me to learn and opened me to the world of growth.

My Data Science Six Months Success Story
Superstore sales dashboard

By the beginning of June, I had completed the Data scientist with Python track on Datacamp and I started learning SQL immediately.

I worked on another project, an analysis of past Nobel prize Winners, and this time, it was better. I was still consistent in my learning and my growth was consistent.

My Data Science Six Months Success Story
Analysis of Past Nobel Prize winners

I also got into the Stanbic IBTC trainee program to learn Azure, this was a tough yet exciting journey for me.

My Data Science Six Months Success Story

By the end of June, I had garnered over 1,000,000 learning XP on Datacamp.

My Data Science Six Months Success Story
July

In July, I worked on a project that practically changed my learning journey, the Real Housewives of Lagos sentiment analysis project. This project made so many people interested in my journey and I got several DMs from recruiters who would like to work with me.

My Data Science Six Months Success Story

By the end of July, I landed my first role as a Data Analyst for a Nigerian startup company.

August

In August, I became Datacamp certified.

My Data Science Six Months Success Story

I also wrote two Microsoft exams; DP 900-Azure Data Fundamentals and DP 300-Azure Database Administrator and I passed.

My Data Science Six Months Success Story

My Data Science Six Months Success Story

I also won the Datacamp daily XP learner challenge as well as the #55daysof data challenge organized by Ingressive for good.

My Data Science Six Months Success Story My Data Science Six Months Success Story
September

I had a streak of 150 days of learning on Datacamp by September. I lost this streak 10 days later??. It was a tough moment for me.

My Data Science Six Months Success Story

It’s been a rollercoaster of learning, relearning, and growing since then and I am so excited about the future. Now it’s time for me to re-evaluate where I am and what’s next for me.

Learning Points

Something I learned in the course of this journey is anything is doable, Once you have the will to learn and you are ready to put in the work, it’s totally achievable. On that note, I have stylishly included a challenge for you. I am challenging you to show up every day this month, Pick a skill or any tool and learn activities for this month and gauge your growth.

Another thing I learned is, showing up every day pays, Consistency was a big key in my journey. I showed up every day for the past six months…I might not have studied a lot on some days and gone really hard on others but I always showed up.

Also, hard work was another determining factor in my six months success story. I worked my brains out. When I started my journey I was learning for 11 hours every day. I would spend at least four hours during the day, at intervals though, not at a stretch, learning concepts on my phone. Then I would spend the night practicing what I had learned on a borrowed laptop.

Another tip that really helped me was sharing my journey, coming online every day to talk about my journey was a key factor in my growth.

Also networking with other data professionals in the space and being a part of the data community was another key factor, I had a lot of passive mentors on Twitter and it influenced my journey.

In the course of my journey, I also learned to accept that burnouts are normal, sometimes I wake up feeling very unmotivated and wondering if I am on the right path. On such days I don’t push myself too hard, a practice workout on Datacamp and that’s all. I don’t hate myself for not going too hard, instead on days when I am feeling very motivated, I go the extra mile to make up for lost days.

There were also days when I felt like a fraud, impostor syndrome hit me hard, it felt like I was just studying and didn’t know anything. I had to work on my self-confidence. I wrote down lots of self-affirmations and read them out loud when I doubted myself.

Highlights of my Journey

Winning the #I4GDatacamp scholarship.

Getting accepted into Stanbic IBTC trainee program.

Landing my first Data analyst role.

Winning the I4G #55daysofdata challenge.

Winning the Datacamp XP learner challenge.

Celebrating One million+ XP on Datacamp.

Celebrating 150 days streak on Datacamp.

Winning a laptop from Ingressive for good.

Lowlights of my Journey

Losing my streak on Datacamp.

Getting lots of rejection mail for every laptop application I sent out.

My Roadmap

Python — Actively for the first four months

Power BI — Started learning in the second month

SQL — Started learning in the fourth month

Azure — Started learning in the third month

Spreadsheet — Started learning in the fifth month.

My Resources and Platform

  • Datacamp
  • Coursera
  • Microsoft learn
  • Youtube

I know this has been a long read but I hope it motivates someone to stay true to their journey and keep pushing. It might take a while but it’s your success story I will be reading next.

Tina Okonkwo is a Data Analyst who is passionate about story telling with data. She provide coaching and mentoring for those pursuing a career in data analytics. Her goal is to share my Data analytics journey to a larger audience and help others kickstart their journey into data analytics.

Original. Reposted with permission.

More On This Topic

  • The Story of the Women in Data Science (WiDS) Datathon
  • How I Levelled Up My Data Science Skills In 8 Months
  • How I Tripled My Income With Data Science in 18 Months
  • How I Got 4 Data Science Offers and Doubled My Income 2 Months After Being…
  • Context, Consistency, And Collaboration Are Essential For Data Science…
  • Telling a Great Data Story: A Visualization Decision Tree