X’s privacy policy confirms it will use public data to train AI models

Sarah Perez @sarahintampa / 9 hours

X’s recently updated privacy policy informed its users it would now collect biometric data as well as users’ job and education history, Bloomberg spotted earlier this week. But it appears that’s not the only thing that X plans to do with user data. According to an update to another section of the policy, the company additionally plans to use the information it collects and other publicly available information to help train its machine learning and AI models, it says.

The change was noticed by Alex Ivanovs of Stackdiary, who has a history of finding notable updates in the terms of service of tech companies, having previously found AI-related updates in Brave and Zoom. His post is now trending on Y Combinator’s discussion forum Hacker News.

Specifically, the X policy change is found in section 2.1 and reads as follows:

We may use the information we collect and publicly available information to help train our machine learning or artificial intelligence models for the purposes outlined in this policy.

As Ivanovs points out, X owner Elon Musk has ambitions to enter the AI market with another company, xAI. This leads him to theorize that Musk likely intends to use X as a source of data for xAI — and perhaps Musk’s recent tweet encouraging journalists to write on X was even an attempt to generate more interesting and useful data to feed into the AI models.

In fact, Musk has previously stated that xAI would use “public tweets” to train its AI models, so this is not much of a leap. He accused other tech giants of leveraging Twitter to train their AI models, even threatening Microsoft with a potential lawsuit for alleged illegal use of Twitter data. Musk also filed suit against unknown entities for scraping Twitter data, which also may have been for the purpose of training artificial intelligence large language models.

In addition, Ivanovs points to the text on the xAI homepage which states that while it’s a separate company from X Corp., it “will work closely with X (Twitter), Tesla, and other companies to make progress towards our mission.”

Musk essentially confirmed the privacy policy change, responding to a post on X to clarify that the plan is to use “just public data, not DMs or anything private.”

Just public data, not DMs or anything private

— Elon Musk (@elonmusk) August 31, 2023

X no longer responds to press requests with a poop emoji as it had following Musk’s takeover of the social network. Instead, we’ve received an auto-responder that says “We’ll get back to you soon.” If that, indeed, turns out to be true, we’ll add X’s comment.

How Google, UCLA are prompting AI to choose the next action for a better answer


Google's AVIS program can dynamically select a series of steps to undertake, such as identifying an object in a picture, then looking up information about that object.

Artificial intelligence programs have dazzled the public with how they produce an answer no matter what the query. However, the quality of the answer often falls short because programs such as ChatGPT merely respond to text input, with no particular grounding in the subject matter, and can produce outright falsehoods as a result.

A recent research project from the University of California and Google instead enables large language models such as ChatGPT to select a specific tool, be it Web search or optical character recognition, that can then seek an answer in multiple steps from an alternate source.

Also: ChatGPT lies about scientific results, needs open-source alternatives, say researchers

The result is a primitive form of "planning" and "reason," a way for a program to determine at each moment how a question should be approached, and once addressed, whether the solution was satisfactory.

The effort, called AVIS (for "Autonomous Visual Information Seeking with Large Language Models") by Ziniu Hu and colleagues at the University of California at Los Angeles, and collaborating authors at Google Research, is posted on the arXiv pre-print server.

AVIS is built on Google's Pathways Language Model, or PaLM, a large language model that has spawned multiple versions adapted to a variety of approaches and experiments in generative AI.

AVIS is in the tradition of recent research seeking to turn machine learning programs into "agents" that act more broadly than simply producing a next-word prediction. They include BabyAGI, an "AI-powered task management system" introduced this year, and PaLM-E, introduced this year by Google researchers, which can instruct a robot to follow a series of actions in physical space.

The big breakthrough of the AVIS program is that, unlike BabyAGI and PaLM-E, it doesn't follow a pre-set course of action. Instead, it uses an algorithm called a "Planner" that selects between a choice of actions on the fly, as each situation arises. Those choices are generated as the language model evaluates the prompted text, breaking it down into sub-questions, and then correlating those sub-questions to a set of possible actions.

Even the choice of actions is a novel approach here.

Also: Google updates Vector AI to let enterprises train GenAI on their own data

Hu and colleagues did a survey of 10 humans who had to answer the same kinds of questions — questions such as "What is the name of the insect?" shown in a picture. Their choices of tools, such as Google Image Search, were recorded.

The authors then put those examples of human choices into what they call a "transition graph," a model of how humans make choices of tools in each moment.

The Planner then uses the graph, choosing from "relevant in-context examples […] that are assembled from the decisions previously made by humans." It's a way to get the program to model itself on humans' choices, in effect, by using past examples as just more input to the language model.

Also: AI's multi-view wave is coming, and it will be powerful

In order to act as a check on its choices, the AVIS program has a second algorithm, a "Reasoner," which evaluates how useful each tool was after it was tried by the language model, before deciding whether to output an answer to the original question. If the particular tool choice was not helpful, the Reasoner will send the Planner back to the drawing board.

The total AVIS workflow consists of devising questions, selecting tools, and then using the Reasoner to check if the tool has produced a satisfactory answer.
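The workflow described above can be sketched as a small loop in Python. This is a toy illustration under stated assumptions, not the authors' implementation: the tool names, the `TRANSITIONS` graph, and the `plan` and `reason` helpers are all hypothetical, and simple rules stand in for the language model calls that AVIS actually uses.

```python
# Toy sketch of a planner/reasoner loop. Everything here is hypothetical:
# in AVIS, planning and reasoning are performed by a large language model.

# Transition graph: which tools may be tried from the current state.
TRANSITIONS = {
    "start": ["object_detection", "image_search"],
    "object_detection": ["image_search", "web_search"],
    "image_search": ["web_search"],
    "web_search": [],
}

# Stand-in tools: each receives the working memory and returns a result or None.
TOOLS = {
    "object_detection": lambda memory: "insect",
    "image_search": lambda memory: None,  # pretend this lookup finds nothing
    "web_search": lambda memory: "ladybug" if "insect" in memory else None,
}


def plan(state, tried):
    """Planner: pick the first allowed tool that has not been tried yet."""
    for tool in TRANSITIONS[state]:
        if tool not in tried:
            return tool
    return None


def reason(result):
    """Reasoner: decide whether a tool output is useful enough to keep."""
    return result is not None


def answer_question(max_steps=5):
    state, tried, memory = "start", set(), []
    for _ in range(max_steps):
        tool = plan(state, tried)
        if tool is None:
            break  # no options left from this state
        tried.add(tool)
        result = TOOLS[tool](memory)
        if reason(result):
            memory.append(result)
            state = tool  # move forward in the graph
        # otherwise stay in the same state and let the planner choose again
    return memory[-1] if memory else None


final_answer = answer_question()
print(final_answer)
```

When a tool returns nothing useful, the loop stays in the same state and the planner picks a different tool, which mirrors the "back to the drawing board" behavior described for the Reasoner.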

Hu and team tested AVIS on some standard automated benchmark tests of visual question answering, such as OK-VQA, introduced in 2019 by researchers at Carnegie Mellon University. On that test, AVIS achieved "an accuracy of 60.2, higher than most of the existing methods tailored for this dataset," they report. In other words, the general approach here seems to surpass methods that have been carefully tailored to fit a specific task, an example of the increasing generality of machine learning AI.

Also: Generative AI tops Gartner's top 25 emerging technologies for 2023

In concluding, Hu and team note that they expect to move beyond just image questions in future work. "We aim to extend our LLM-powered dynamic, decision-making framework to address other reasoning tasks," they write.


Could AI be the spark that ends the opioid epidemic?

Jerel Ezell / 7 hours

Jerel Ezell, Contributor. Jerel Ezell is an assistant professor in Community Health Sciences at the University of California, Berkeley, School of Public Health. He is also the director of the Berkeley Center for Cultural Humility and a Fulbright Scholar.

The opioid epidemic has had a whack-a-mole kind of complexity, stumping researchers for the better part of two decades, as they’ve attempted to better understand the evolving social and systemic factors that push people to start abusing opioids and also identify potential overdose hot spots.

These woefully tedious and often-flawed efforts all occur as clinicians work to provide safe, effective treatment and other resources to those in the throes of addiction.

As both researchers and clinicians examine the opioid epidemic’s extensive and persistent reach, they are now curiously exploring AI and asking, Could this be the moonshot that ends the opioid epidemic?

Healthcare is not one for hopping on bandwagons, notoriously slow in piloting and implementing new technology. And this tendency is not without consequence. One report suggested that the industry loses over $8.3 billion a year due to being a late or non-adopter of technology like advanced electronic health records.


But the opioid epidemic’s tolls are greater than the ones on the ledgers. Going back to 1999, over 1 million people have died due to a drug-related overdose. In 2021, 106,699 drug overdose deaths occurred in America, among the highest per capita rates in the history of the country. Around 75% of these overdoses were attributable to the use of opioids, which include prescription painkillers like Vicodin and Percocet as well as “street” drugs like heroin.

Despite the Centers for Disease Control and Prevention and the National Institutes of Health pouring billions of dollars into outreach, education, and prescription monitoring programs, the epidemic has remained stubbornly persistent.

For the past decade, I have been conducting research on the opioid epidemic in rural and urban communities across America, including New York City and rural southern Illinois.

Most in my field agree, albeit reluctantly, that there’s an incredible amount of guesswork involved in identifying the intricate risks that drug users face. Which drugs will they get? Will they inject, snort, or smoke them? Who, if anyone, will they use around, in case they overdose and need help?

That’s not all. Practitioners are also regularly combating idiosyncratic federal and state guidelines on effective treatments for opioid use disorder, like Suboxone. And they also find themselves playing catch-up with increasingly unpredictable drug supplies that are contaminated with cheap, synthetic opioids like fentanyl, which is largely responsible for recent surges in opioid-related overdose deaths.

While AI developments like ChatGPT have captured the imagination of most of the public, public health researchers and biomedical engineers have quietly been concocting an AI-fused revolution in medicine, with addiction prevention and treatment the newest beneficiaries.

Innovations in this space primarily use machine learning to identify individuals who may be at risk of developing opioid use disorder, disengaging from treatment, or relapsing. For example, researchers from the Georgia Institute of Technology recently developed machine-learning techniques to effectively identify individuals on Reddit who were at risk of fentanyl misuse, while other researchers developed a tool for locating misinformation about treatments for opioid use disorder, both of which could allow peers and advocates to intervene with education.

Other AI-fueled programs, such as Sobergrid, are developing the capacity to detect when individuals are at risk of relapsing — for example, based on their proximity to bars — then linking them to a recovery counselor.

The most impactful developments relate to reduction of overdoses, often brought on by mixing drugs. At Purdue University, researchers have developed and piloted a wearable device that can detect signs of overdose and automatically inject an individual with naloxone, an overdose-reversing agent. Another crucial development has been the creation of tools to detect hazardous contaminants in drug supplies, which could radically reduce fentanyl-fueled overdoses.

Despite this immense promise, there are concerns. Could facial recognition technology be used to locate people who appear high, leading to discrimination and abuse? Uber already took a step in developing this kind of capacity in 2018, attempting to patent a technology that would detect a drunk passenger.

And what about dis/misinformation, a problem already plaguing chatbots? Might malicious parties embed incorrect information into chatbots to mislead drug users on risks?

Going back to Fritz Lang’s seminal silent film “Metropolis” in 1927, the public has been fascinated by the idea of new, humanlike technology making lives easier and richer. From Stanley Kubrick’s “2001: A Space Odyssey” in 1968 to films like “I, Robot” and “Minority Report” in the early 2000s, though, these wistful visions have slowly morphed into a kind of existential dread.

It will be up to not just researchers and clinicians, but also patients and the broader public, to keep AI honest and to keep it from turning humanity’s grandest challenges, like the opioid epidemic, into insurmountable ones.

Getting Started with Python for Data Science


Summer is over and it’s back to studying or working on your self-development plan. Many of you may have had the summertime to think about what your next steps will be, and if that involves anything to do with Data Science — you need to read this blog.

Generative AI, ChatGPT, Google Bard: you have probably been hearing these terms a lot over the past few months. With this uproar, a lot of you are thinking about getting into a tech field such as Data Science.

People in different roles want to keep their jobs, so they will aim to develop their skills to fit the current market. It is a competitive market, and we are seeing more and more people building an interest in Data Science, with thousands of online courses, bootcamps, and Master’s (MSc) programs available in the sector.

If you want to know what FREE courses you can take for Data Science, have a read of Top Free Data Science Online Courses for 2023

With that being said, if you want to crack into the world of Data Science, you need to know about Python.

Role of Python in Data Science

Python was first released in February 1991 by Dutch programmer Guido van Rossum. The design heavily emphasizes code readability. The construction of the language and its object-oriented approach help new and experienced programmers write clear and understandable code, from small projects to large ones and from small data to big data.

More than 30 years later, Python is considered one of the best programming languages to learn today.

Python contains a variety of libraries and frameworks so that you don’t have to do everything from scratch. These pre-built components contain useful and readable code that you can implement into your programs. Examples include NumPy, Matplotlib, SciPy, BeautifulSoup, and more.

If you would like to know more about Python Libraries, read the following article: Python Libraries Data Scientists Should Know in 2022.

Python is efficient, fast, and reliable which allows developers to create applications, perform analysis, and produce visualized outputs with minimum effort. All that you need to become a Data Scientist!

Setting Up Python

If you’re looking to become a Data Scientist, the following step-by-step guide will help you get started with Python:

Install Python

First, you will need to download the latest version of Python. You can find out the latest version by heading over to the official website here.

Based on your operating system, follow the installation instructions through to the end.

Choose your IDE or Code Editor

An IDE (integrated development environment) is a software application that programmers use to develop code more efficiently. A code editor serves the same purpose, but it is essentially a text editor.

If you are unsure of which one to choose, I will provide a list of popular options:

  • Visual Studio Code (VSCode)
  • PyCharm
  • Jupyter Notebook

When I started my Data Science career, I worked with VSCode and Jupyter Notebook, which I found very useful for learning data science and for interactive coding. Once you choose one that fits your needs, install it and go through the walk-throughs on how to use it.

Learn The Basics

Before you dive into the deep end of comprehensive projects, you need to first learn the basics. So let’s dive into them.

Variables and Data Types

Variables are containers that store data values. Data values have various data types, such as integers, floating-point numbers, strings, lists, tuples, dictionaries, and more. Learning these is very important and builds your foundational knowledge.

In the following example, the variable is called name and holds the value “John”, which is a string: name = "John".
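To make the other data types mentioned above concrete, here is a minimal sketch. The variable names and values are made up for illustration.

```python
# Each variable below holds a value of a different built-in data type.
name = "John"                          # str
age = 30                               # int
height_m = 1.82                        # float
skills = ["Python", "SQL"]             # list (mutable sequence)
point = (3, 4)                         # tuple (immutable sequence)
person = {"name": name, "age": age}    # dict (key-value pairs)

# type() reports the data type of a value.
values = (name, age, height_m, skills, point, person)
type_names = [type(v).__name__ for v in values]
print(type_names)
```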

Operators and Expressions

Operators are symbols that allow computation tasks such as addition, subtraction, multiplication, division, exponentiation etc. An expression in Python is a combination of operators and operands.

For example: x = x + 10
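A short sketch of the arithmetic operators listed above, using an arbitrary starting value of 5:

```python
x = 5
x = x + 10            # addition: x is now 15

difference = x - 4    # subtraction
product = x * 2       # multiplication
quotient = x / 4      # true division always returns a float
floor_q = x // 4      # floor division drops the fractional part
remainder = x % 4     # modulo gives the remainder
power = x ** 2        # exponentiation

print(x, difference, product, quotient, floor_q, remainder, power)
```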

Control Structures

Control structures make your programming life easier by specifying the flow of execution in your code. In Python, there are several types of control structures that you need to learn such as conditional statements, loops, and exception handling.

For example:

```python
x = 10  # example value

if x > 0:
    print("Positive")
else:
    print("Non-positive")
```
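The other two control structures mentioned above, loops and exception handling, can be sketched in the same spirit. The list of numbers is an arbitrary example.

```python
numbers = [4, 0, -2]
labels = []

# for loop combined with a conditional statement
for n in numbers:
    if n > 0:
        labels.append("Positive")
    else:
        labels.append("Non-positive")

# while loop: count down from 3
countdown = []
i = 3
while i > 0:
    countdown.append(i)
    i -= 1

# exception handling: dividing by zero raises ZeroDivisionError
try:
    result = 10 / numbers[1]
except ZeroDivisionError:
    result = None

print(labels, countdown, result)
```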

Functions

A function is a block of code that runs only when it is called. You can create a function using the def keyword.

For example

```python
def greet(name):
    return f"Hello, {name}!"
```
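Defining a function does not run it; the body executes only when the function is called. The sketch below repeats the function and calls it. The optional `greeting` parameter is an addition for illustration and is not part of the original example.

```python
def greet(name, greeting="Hello"):
    """Return a greeting for the given name."""
    return f"{greeting}, {name}!"


# Call the function with one argument; the default greeting is used.
message = greet("John")

# Override the default with a keyword argument.
custom = greet("John", greeting="Welcome")

print(message)
print(custom)
```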

Modules and Libraries

A module in Python is a file containing Python definitions and statements. It can define functions, classes, and variables. A library is a collection of related modules or packages. Modules and libraries can be used by importing them by using the import statement.

For example, I mentioned above that Python contains a variety of libraries and frameworks such as NumPy. You can import these different libraries by running:

```python
import numpy as np
import pandas as pd
import math
import random
```

There are various libraries and modules you can import using Python.
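Once a module is imported, its functions are accessed through the module name. A minimal sketch using only the standard-library modules `math` and `random`:

```python
import math
import random

root = math.sqrt(16)             # square root
area = math.pi * 2 ** 2          # area of a circle with radius 2

random.seed(42)                  # fix the seed so runs are repeatable
roll = random.randint(1, 6)      # random integer from 1 to 6, inclusive
pick = random.choice(["a", "b", "c"])  # random element of a list

print(root, round(area, 2), roll, pick)
```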

Working with Data

Once you have a better understanding of the basics and how they work, your next step is to use these skills to work with data. You will need to learn how to:

Import and Export Data using Pandas

Pandas is a widely used Python library in the world of data science, as it offers a flexible and intuitive way to handle data sets of all sizes. Let’s say your data is in a CSV file. You can use pandas to import the dataset with:

```python
import pandas as pd

example_data = pd.read_csv("data/example_dataset1.csv")
```
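The example above assumes a file exists on disk. As a self-contained sketch of both import and export, the snippet below reads CSV text from memory and writes it back out; the data is made up for illustration.

```python
import io

import pandas as pd

# In-memory stand-in for a CSV file (illustrative data).
csv_text = "city,population\nNew York,8804190\nMumbai,12478447\n"
example_data = pd.read_csv(io.StringIO(csv_text))

print(example_data.head())    # first rows of the dataset
print(example_data.dtypes)    # data type of each column

# Export: with no file path, to_csv returns the CSV as a string.
# index=False leaves out the row index.
exported = example_data.to_csv(index=False)
print(exported)
```

Passing a file path to `to_csv` instead, for example `example_data.to_csv("output.csv", index=False)`, writes the result to disk.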

Data Cleaning and Manipulation

Data cleaning and manipulation are vital steps in the data preprocessing phase of a data science project, as you take raw data and comb through all of its inconsistencies, errors, and missing values to transform it into a structured format that can be used for analysis.

Elements of data cleaning include:

  • Handling missing values
  • Duplicate data
  • Outliers
  • Data transformation
  • Data type cleaning
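A few of the cleaning steps listed above can be sketched with pandas. The dataset is invented for illustration, and filling missing values with the median is just one possible choice; outlier handling and data transformation are not shown.

```python
import pandas as pd

raw = pd.DataFrame({
    "name": ["Ann", "Bob", "Bob", "Cara"],
    "age": ["34", "29", "29", None],   # numbers stored as text, one missing
    "salary": [52000.0, 48000.0, 48000.0, 51000.0],
})

# Duplicate data: drop rows that repeat an earlier row exactly.
clean = raw.drop_duplicates().copy()

# Data type cleaning: convert the text column to numbers.
clean["age"] = pd.to_numeric(clean["age"])

# Missing values: fill gaps with the column median.
clean["age"] = clean["age"].fillna(clean["age"].median())

print(clean)
```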

Elements of data manipulation include:

  • Selecting and filtering data
  • Sorting data
  • Grouping data
  • Joining and merging data
  • Creating new variables
  • Pivoting and cross-tabulation
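The manipulation steps listed above map onto a handful of pandas methods. This is a minimal sketch with made-up sales data; the column names are assumptions for illustration.

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "South", "North", "South"],
    "units": [10, 4, 7, 12],
    "price": [2.5, 3.0, 2.5, 3.0],
})
managers = pd.DataFrame({"region": ["North", "South"], "manager": ["Ann", "Bob"]})

# Creating new variables
sales["revenue"] = sales["units"] * sales["price"]

# Selecting and filtering data
big_orders = sales[sales["units"] > 5]

# Sorting data
ranked = sales.sort_values("revenue", ascending=False)

# Grouping data
totals = sales.groupby("region", as_index=False)["revenue"].sum()

# Joining and merging data
merged = totals.merge(managers, on="region")

# Pivoting
pivot = sales.pivot_table(index="region", values="units", aggfunc="sum")

print(merged)
print(pivot)
```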

You will need to learn all these elements and how they are used in Python. If you want to start now, you can read Learn Data Cleaning and Preprocessing for Data Science with This Free eBook.

Statistical Analysis

As a data scientist, you will need to comb through your data to identify trends, patterns, and insights. You can achieve this through statistical analysis: the process of collecting and analyzing data in order to identify those patterns and trends.

This phase is used to remove bias through numerical analysis, allowing you to further your research, develop statistical models, and more. The conclusions are used in the decision-making process to make future predictions based on past trends.

There are 6 types of statistical analysis:

  1. Descriptive Analysis
  2. Inferential Analysis
  3. Predictive Analysis
  4. Prescriptive Analysis
  5. Exploratory Data Analysis
  6. Causal Analysis
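As a small taste of the first type, descriptive analysis, the sketch below summarizes a made-up list of ages using Python's built-in `statistics` module.

```python
import statistics

ages = [23, 25, 25, 31, 36]  # illustrative data

mean_age = statistics.mean(ages)      # arithmetic average
median_age = statistics.median(ages)  # middle value
mode_age = statistics.mode(ages)      # most common value
spread = statistics.stdev(ages)       # sample standard deviation

print(mean_age, median_age, mode_age, round(spread, 2))
```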

In this blog, I will dive a bit more into Exploratory Data Analysis.

Exploratory Data Analysis (EDA)

Once you have cleaned and manipulated data, it is ready for the next step: exploratory data analysis. This is when data scientists analyze and investigate the dataset and create a summary of the main characteristics/variables that can help them gain further insight and create data visualizations.

EDA tools include

  • Predictive modeling such as linear regression
  • Clustering techniques such as K-means clustering
  • Dimensionality reduction techniques such as Principal Component Analysis (PCA)
  • Univariate, Bivariate, and Multivariate visualizations

This phase of data science can be the most difficult aspect and requires a lot of practice. Libraries and modules can assist you, but you will need to understand the task at hand and what you want your outcome to be to figure out what EDA tool you need.
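A first pass at EDA often starts with summary statistics and simple relationships between columns. The sketch below assumes a small invented dataset and covers only univariate and bivariate summaries, not clustering or dimensionality reduction.

```python
import pandas as pd

df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5],
    "exam_score": [52, 58, 65, 70, 78],
    "group": ["a", "a", "b", "b", "b"],
})

# Univariate: count, mean, spread, and quartiles of each numeric column.
summary = df.describe()

# Univariate: frequency of each category.
counts = df["group"].value_counts()

# Bivariate: Pearson correlation between two numeric columns.
corr = df["hours_studied"].corr(df["exam_score"])

print(summary)
print(counts)
print(round(corr, 3))
```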

Data Visualisation

EDA is used to gain further insight and create data visualizations. As a data scientist, you will be expected to create visualizations of your findings. These can be basic visualizations such as line charts, bar plots, and scatter plots, or more creative ones such as heatmaps, choropleth maps, and bubble charts.

There are various data visualization libraries that you can use, but these are the most popular:

  • Matplotlib
  • Seaborn
  • Plotly

Data visualizations allow for better communication, especially for stakeholders who are not highly technically inclined.
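As a minimal sketch of a basic visualization, the snippet below draws a bar plot with Matplotlib and saves it as an image. The data, labels, and file name are made up for illustration, and the `Agg` backend is selected so the code also runs where no display is available.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # render to a file without opening a window
import matplotlib.pyplot as plt

categories = ["A", "B", "C"]
values = [5, 9, 3]

fig, ax = plt.subplots()
ax.bar(categories, values)
ax.set_xlabel("Category")
ax.set_ylabel("Count")
ax.set_title("Example bar plot")

# Save the figure into the system's temporary directory.
output_path = os.path.join(tempfile.gettempdir(), "example_bar_plot.png")
fig.savefig(output_path)
plt.close(fig)

print(output_path)
```

In a Jupyter Notebook, the figure is typically displayed inline, so the backend selection and `savefig` call can be left out.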

Wrapping it up

This blog is intended to guide beginners on the steps they will need to take to learn Python in their data science career. Each phase requires time and attention to master. As I could not go into extensive detail on each, I have created a short list that can guide you further:

  • The Importance of Data Cleaning in Data Science
  • Introduction to Data Science: A Beginner’s Guide
  • How to Transition into Data Science from a Different Background?

Nisha Arya is a Data Scientist, Freelance Technical Writer and Community Manager at KDnuggets. She is particularly interested in providing Data Science career advice or tutorials and theory based knowledge around Data Science. She also wishes to explore the different ways Artificial Intelligence is/can benefit the longevity of human life. A keen learner, seeking to broaden her tech knowledge and writing skills, whilst helping guide others.


Pharma CEO: Don’t halt AI research, our work is too important


How is society supposed to address the risks of artificial intelligence? There are those who argue the conceivable benefits outweigh the immediate danger, and so, restrictions should be pursued lightly.

"I think these broad, blanket suggestions that we stop work [on AI] are a little bit misguided," said Chris Gibson, co-founder and CEO of Recursion Pharmaceuticals, in a recent interview with ZDNET.

"I think it's just really important that folks continue to embrace the opportunity that exists with machine learning," said Gibson.

Also: The 5 biggest risks of generative AI, according to an expert

Gibson's company is working with Big Pharma to employ AI in drug discovery.

Gibson was responding to a letter published in March by Elon Musk, AI scholar Yoshua Bengio, and numerous others calling for a temporary halt to AI research to investigate the dangers.

The petition called for a pause to what it describes as "an out-of-control race" for AI superiority, producing systems that its creators can't "understand, predict, or reliably control."

Gibson zeroed in on what he deemed unrealistic concerns, such as the potential for machine learning programs to become sentient, a scenario that scholars who've studied the matter consider fairly remote.

"We don't want to pause for six months or a year, because of how much opportunity there is moving forward," says Chris Gibson, co-founder and CEO of Recursion Pharmaceuticals.

"The work we're doing at Recursion is super interesting, training multi-billion parameter models that are really, really exciting in the context of biology," Gibson told ZDNET. "But they're not sentient, they're not gonna become sentient, they're very far from that."

One of Gibson's principal concerns is to preserve the ability of his firm and others to move forward with work on things such as drug discovery. Recursion, which partners with Bayer and Genentech, among others, has five drug candidates currently in the clinical stages of the drug development pipeline. The company has amassed over 13 petabytes worth of information in Phenomaps, its term for databases of "inferred relationships" between molecules.

Also: 'OpenAI is product development, not AI research,' says Meta's chief AI scientist LeCun

"Models that are held in isolation to answer really specific questions, I think, are really important for advancing humanity," said Gibson. "Models like ours, and other companies like ours, we don't want to pause for six months, or pause for a year, because of how much opportunity there is moving forward."

Gibson's firm, which is public, in July announced that it received a $50 million investment from Nvidia, whose GPU chips dominate AI processing.

Gibson was measured in his remarks about those who worry about AI or who have called for a halt. "There are really smart people on both sides of the issue," he said, noting that a Recursion co-founder had stepped away from day-to-day running of the company several years ago because of concerns about the ethical challenges of AI.

Yoshua Bengio, an advisor to Recursion, is one of the letter's signatories.

"Yoshua is brilliant, so this is putting me on the spot just a little," said Gibson. "But, I would say, I think there are really important arguments on both sides of the debate."

Also: The great puzzle of the body and disease is beginning to yield to AI, says Recursion CEO

The different perspectives of the parties for and against a moratorium "suggests caution," he said, "but I don't believe that we should pause all training, and all inference, of ML and AI algorithms for any period of time."

Gibson's team followed up with ZDNET to point out that Bengio, in his blog post on the matter of AI risks, has drawn distinctions between threats versus societally useful applications of AI such as healthcare.

Gibson is in accord with peers of Bengio such as Meta Properties chief AI scientist Yann LeCun, who has spoken out against the initiative of his friend and sometime collaborator.

Gibson did allow that some notions of risk, however improbable, need to be carefully considered. One is the set of end-of-humanity scenarios that have been outlined by organizations such as the Future of Humanity Institute.

"There are people in the field of AI who think that if you ask an ML or AI algorithm to maximize some sort of utility function, say, make the world as beautiful and peaceful as possible, then an AI algorithm could, probably not totally incorrectly, interpret that humans are the cause of most of the lack of beauty and lack of peace," said Gibson.

Also: ChatGPT: What The New York Times and others are getting terribly wrong about it

As a result, a program could "put in place something really scary." Such a prospect is "probably farfetched," he said. "But, the impact is so big, it's important to think about it; it's unlikely any one of our airplanes are gonna crash when we go up in the sky, but we certainly look at the warning because the cost is so substantial."

There are also "some things that are really obvious we could all agree on today," said Gibson, such as not allowing programs to have control of weapons of mass destruction.

"Would I advocate for giving an AI or ML algorithm access to our nuclear launch systems? Absolutely not," he said.

On a more prosaic level, Gibson believes that issues of bias need to be dealt with in algorithms. "We need to make sure that we're being really cautious about the datasets, and making sure the utility functions we optimize our algorithms against don't have some sort of bias within them."

Also: AI could have 20% chance of sentience in 10 years, says philosopher David Chalmers

"You do have more bias creeping into the outcomes of these algorithms that are becoming more and more part of our lives," observed Gibson.

The most basic concerns, in Gibson's view, should be obvious to all. "A good example is, I think, it's more risky to give an algorithm uncontrolled access to the internet," he said. "So, there could be some near-term regulations around that."

His position on regulation, he said, is that "part of being in a high-functioning society is putting all those options on the table and having an important discussion around them. We just need to be careful not to over-extend ourselves with broad-based regulation that's directed at all ML or all AI."

A pressing concern for AI ethics is the current trend of companies such as OpenAI and Google to disclose less and less of the inner workings of their programs. Gibson said he is against any regulation requiring programs to be made open-source. "But," he added, "I think it's very important for most companies to share some of their work in various ways with society, to keep moving everybody forward."

Also: Why open source is essential to allaying AI fears, according to Stability.ai founder

Recursion has open-sourced many of its datasets, he noted, and, "I would not exclude the possibility of us open-sourcing some of our models in the future."

Obviously, the large questions of regulation and control come back to the will of any particular nation's citizens. A key question is how the electorate can be educated about AI. In that regard, Gibson was not optimistic.

While education is important, he said, "My general belief is that the public seems uninterested in being educated these days."

"The people who are interested in being educated tend to tune into these things," he said, "and most of the rest of the world doesn't, which is super unfortunate."


Build Your Own PandasAI with LlamaIndex

Introduction

Pandas AI is a Python library that leverages the power of generative AI to supercharge Pandas, the popular data analysis library. With just a simple prompt, Pandas AI allows you to perform complex data cleaning, analysis, and visualization that previously required many lines of code.

Beyond crunching the numbers, Pandas AI understands natural language. You can ask questions about your data in plain English, and it will provide summaries and insights in everyday language, sparing you from deciphering complex graphs and tables.

In the example below, we provided a Pandas dataframe and asked the generative AI to create a bar chart. The result is impressive.

pandas_ai.run(df, prompt='Plot the bar chart of type of media for each year release, using different colors.')


Note: the code example is from Pandas AI: Your Guide to Generative AI-Powered Data Analysis tutorial.

In this post, we will be using LlamaIndex to create similar tools that can understand the Pandas data frame and produce complex results as shown above.

LlamaIndex is a data framework that integrates large language models with various data sources and tools. It enables natural language querying of data via chat and agents, and it lets large language models interpret private data at scale without retraining on new data. For example, it allows you to create a "chat with PDF" application with just a few lines of code.

Setting Up

You can install the Python library by using the pip command.

pip install llama-index

By default, LlamaIndex uses OpenAI's gpt-3.5-turbo model for text generation and text-embedding-ada-002 for retrieval and embeddings. To run the code hassle-free, we must set the OPENAI_API_KEY environment variable. We can register and get the API key for free on a new API token page.

import os

os.environ["OPENAI_API_KEY"] = "sk-xxxxxx"

LlamaIndex also supports integrations with Anthropic, Hugging Face, PaLM, and other models. You can learn more about them by reading the Module's documentation.

Pandas Query Engine

Let’s get to the main topic of creating your own PandasAI. After installing the library and setting up the API key, we will create a simple city dataframe with the city name and population as the columns.

import pandas as pd
from llama_index.query_engine.pandas_query_engine import PandasQueryEngine

df = pd.DataFrame(
    {"city": ["New York", "Islamabad", "Mumbai"], "population": [8804190, 1009832, 12478447]}
)

Using the PandasQueryEngine, we will create a query engine to load the dataframe and index it.

After that, we will write a query and display the response.

query_engine = PandasQueryEngine(df=df)

response = query_engine.query(
    "What is the city with the lowest population?",
)

As we can see, it has generated the Python code for displaying the least populated city in the dataframe.

> Pandas Instructions:
```
eval("df.loc[df['population'].idxmin()]['city']")
```
eval("df.loc[df['population'].idxmin()]['city']")
> Pandas Output: Islamabad

And, if you print the response, you will get "Islamabad." It is simple but impressive. You don't have to come up with your own logic or experiment with the code. Just type the question, and you will get the answer.

print(response)
Islamabad

You can also print the code behind the result using the response metadata.

print(response.metadata["pandas_instruction_str"])
eval("df.loc[df['population'].idxmin()]['city']")
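The instruction the engine generated is ordinary pandas, so it can be checked by hand. The sketch below rebuilds the same dataframe and runs the expression directly, with no LLM call involved; it assumes only that pandas is installed.

```python
import pandas as pd

# Same toy dataframe as in the example above.
df = pd.DataFrame(
    {
        "city": ["New York", "Islamabad", "Mumbai"],
        "population": [8804190, 1009832, 12478447],
    }
)

# idxmin() returns the index label of the row with the smallest population;
# .loc[...] selects that row, and ["city"] picks out the city name.
least_populated = df.loc[df["population"].idxmin()]["city"]
print(least_populated)
```

Running the generated code yourself like this is a cheap way to confirm that an answer comes from the data rather than from the model's guess.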

Global YouTube Statistics Analysis

In the second example, we will load the Global YouTube Statistics 2023 dataset from Kaggle and perform some fundamental analysis. It is a step up from the simple examples.

We will use read_csv to load the dataset into the query engine. Then we will write the prompt to display only columns with missing values and the number of missing values.

df_yt = pd.read_csv("Global YouTube Statistics.csv")
query_engine = PandasQueryEngine(df=df_yt, verbose=True)

response = query_engine.query(
    "List the columns with missing values and the number of missing values. Only show missing values columns.",
)

> Pandas Instructions:
```
df.isnull().sum()[df.isnull().sum() > 0]
```
df.isnull().sum()[df.isnull().sum() > 0]
> Pandas Output: category                                    46
Country                                    122
Abbreviation                               122
channel_type                                30
video_views_rank                             1
country_rank                               116
channel_type_rank                           33
video_views_for_the_last_30_days            56
subscribers_for_last_30_days               337
created_year                                 5
created_month                                5
created_date                                 5
Gross tertiary education enrollment (%)    123
Population                                 123
Unemployment rate                          123
Urban_population                           123
Latitude                                   123
Longitude                                  123
dtype: int64
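To see what the generated expression does, here is a minimal sketch on a tiny, made-up dataframe. The rows are hypothetical stand-ins, not the Kaggle data; only the pandas calls match the generated instruction.

```python
import pandas as pd

# Hypothetical miniature stand-in for the YouTube dataset.
toy = pd.DataFrame(
    {
        "Youtuber": ["A", "B", "C", "D"],
        "category": ["Music", None, "Games", None],
        "Country": ["US", "IN", None, "BR"],
        "subscribers": [100, 90, 80, 70],
    }
)

# isnull().sum() counts missing values per column;
# the boolean mask then keeps only columns with at least one missing value.
missing = toy.isnull().sum()
missing = missing[missing > 0]
print(missing)
```

Storing the counts in a variable first avoids computing `isnull().sum()` twice, which the one-line generated version does.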

Now, we will ask direct questions about popular channel types. In my experience, the LlamaIndex query engine has been highly accurate and has not yet produced any hallucinations.

response = query_engine.query(
    "Which channel type have the most views.",
)

> Pandas Instructions:
```
eval("df.groupby('channel_type')['video views'].sum().idxmax()")
```
eval("df.groupby('channel_type')['video views'].sum().idxmax()")
> Pandas Output: Entertainment
Entertainment
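The group-and-rank pattern in the generated code can be reproduced on a few made-up rows. The values below are hypothetical; the column names simply mirror the ones the engine used.

```python
import pandas as pd

# Hypothetical rows; column names mirror those in the generated instruction.
toy = pd.DataFrame(
    {
        "channel_type": ["Music", "Entertainment", "Music", "Entertainment", "Games"],
        "video views": [50, 70, 30, 40, 60],
    }
)

# Sum the views within each channel type.
views_by_type = toy.groupby("channel_type")["video views"].sum()

# idxmax() returns the label (channel type) with the largest total.
top_type = views_by_type.idxmax()
print(top_type)
```

Note that the engine interpreted "most views" as the largest sum per type; asking for the highest average would call for `.mean()` instead of `.sum()`.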

Finally, we will ask it to plot a bar chart, and the results are amazing.

response = query_engine.query(
    "Visualize barchat of top ten youtube channels based on subscribers and add the title.",
)

> Pandas Instructions:
```
eval("df.nlargest(10, 'subscribers')[['Youtuber', 'subscribers']].plot(kind='bar', x='Youtuber', y='subscribers', title='Top Ten YouTube Channels Based on Subscribers')")
```
eval("df.nlargest(10, 'subscribers')[['Youtuber', 'subscribers']].plot(kind='bar', x='Youtuber', y='subscribers', title='Top Ten YouTube Channels Based on Subscribers')")
> Pandas Output: AxesSubplot(0.125,0.11;0.775x0.77)


With a simple prompt and query engine, we can automate our data analysis and perform complex tasks. There is much more to LlamaIndex. I highly recommend reading the official documentation and trying to build something amazing.

Conclusion

In summary, LlamaIndex is an exciting new tool that allows developers to create their own PandasAI — leveraging the power of large language models for intuitive data analysis and conversation. By indexing and embedding your dataset with LlamaIndex, you can enable advanced natural language capabilities on your private data without compromising security or retraining models.

This is just the start. With LlamaIndex, you can build Q&A over documents, chatbots, automated AI, knowledge graphs, AI SQL query engines, full-stack web applications, and private generative AI applications.
Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master's degree in Technology Management and a bachelor's degree in Telecommunication Engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.


Increase Your Callback Rate With A LinkedIn Profile

Image by rawpixel on Freepik

If you have just graduated from university or decided to return to the job market, platforms such as LinkedIn should be your best friend. LinkedIn is the world's largest professional network, and it will help you find new jobs, connect with like-minded people, and discover new opportunities.

I have spoken to a lot of fresh graduates in the past, and they have always been reluctant to open a LinkedIn profile, which has always confused me. But who isn’t nervous about something they don’t know?

Understand that LinkedIn is your personal brand. When you create a LinkedIn profile, you want it to truly resonate with who you are, your professional background, and the services you can provide.

In this blog, I will go through how to set up your LinkedIn profile, and what you need to increase your callback rate for potential jobs and opportunities. So let’s get started…

If you don’t have a LinkedIn profile already, you will need to sign up for LinkedIn. Once you have done this, you will have a blank LinkedIn profile, which you can customize.

Profile Picture

The first thing you want to do is upload a profile picture, as this is the first thing other LinkedIn members will see. Remember, LinkedIn is a professional network, so your profile picture should reflect exactly that.

Some points to take into consideration:

  1. A picture where you are looking directly at the camera
  2. Maximum half of your body showing
  3. A plain background

You can use tools such as remove.bg to remove backgrounds from your image. The recommended dimensions for a LinkedIn profile image are 400×400 pixels. Once you have your image ready, it’s time for you to upload it as your LinkedIn profile picture.

Banner

As mentioned previously, LinkedIn is your personal brand, and the LinkedIn banner is another way to present who you are. For example, if you have a company and you have a motto, you can put your motto on your banner. Another example is software engineers being creative by illustrating a breakdown of their profile through code, as shown below.

Image by Aaron Cordova

You can use your own image as your LinkedIn banner, but if you are having trouble choosing what to use, you can go to Canva, which provides a list of different styles.

Intro Section

Once you have added the aesthetic appeal to your LinkedIn profile, the next step is adding some content. Here we will start with the ‘Intro’ section, which you can access by clicking on the pencil icon in the top right-hand corner of your profile, adjacent to your profile picture.

You will need to add in your first name and last name and then move on to your headline. Your headline is very important as it informs other members of your areas of expertise, such as software engineering.

These are a few example headlines that I came across on LinkedIn:

  1. Head of Cloud AI Services at Google
  2. AI | Web3 | Marketing
  3. Building Dev Rel @OpenAI
  4. AI & Data Science | Predictive Analytics | Data Strategy | Public Sector | Women in AI

If you are a graduate, feel confident in stating what you aspire to be, as your headline is what people will be looking at.

Education Section

There are two ways you can fill in your LinkedIn education section.

  1. The first is to click on the same pencil icon, scroll down to the education section, and fill it out.
  2. The other is to scroll down on your LinkedIn profile till you get to the education section and fill out a more extensive overview.

Fill in your education section with entries such as university degrees, bootcamps, courses, and other achievements. Do not include education levels such as kindergarten and junior school.

Once you have done this, you have the option to show your education in the intro section of your LinkedIn profile. You can do this by clicking on the pencil icon and ticking the ‘Show education in my intro’ box.

Location Settings

Your location settings are very important, and getting them wrong is a common mistake. For example, you may have been living and studying in New York all your life, but you want to make a transition to starting a new life and career in San Francisco.

The best thing you can do is change your location in the Intro section to the United States, with a San Francisco postal code. This way your profile will be more visible to hiring managers in that specific area.

Custom URL

On your LinkedIn profile at the top, you will see a link called ‘Contact Info’. In this section, you will see your LinkedIn profile URL. You can edit this URL on your profile, by clicking on the pencil icon on the right-hand side of your profile page. It is important to create a unique profile URL to increase your rankings.

Summary Section

At the top of your profile, you will see an ‘Add Section’ button. Click on this button, and then ‘About’, and then ‘Summary’. I believe this section needs to be a very short cover letter, and when I say short, I mean 2 to 3 short paragraphs.

In this section, you can provide more detail about your expertise and skill set, as well as mention a career change. This is where people learn more about you, and you’re essentially selling your capabilities and skills. You can also include your personal blog, website, etc.

Skills

Now let's move on to the skills section. Under the same ‘Add Section’ button, you will see a ‘Skills’ section. Click on that. In this part, you need to add skills relevant to your expertise. If you are job hunting, the best thing to do is add skills that are typically found in job descriptions for the title you want.

The easiest way to find frequently used words is by copying and pasting a job description into WordCloud, which will generate a visual representation of the words. You can use this to help you decide which skills to add to your list.
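If you would rather count the words yourself, a few lines of Python do the same job as a word cloud. This is a minimal sketch using only the standard library; the job description text and the stop-word list are made up for illustration.

```python
import re
from collections import Counter

# Hypothetical job description text, used only for illustration.
job_description = """
We are hiring a data scientist with strong Python and SQL skills.
You will build machine learning models in Python, write SQL queries,
and communicate results. Experience with machine learning is required.
"""

# A small, hand-picked stop-word list; extend it for real postings.
STOP_WORDS = {"a", "and", "are", "in", "is", "we", "will", "with", "you"}


def top_terms(text: str, n: int = 5) -> list[tuple[str, int]]:
    """Return the n most frequent words in text, ignoring stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(word for word in words if word not in STOP_WORDS)
    return counts.most_common(n)


print(top_terms(job_description, n=4))
```

Pasting in several postings for the same title, rather than one, gives a more reliable picture of which skills recur.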

If you have 5 or more skills listed, you have a higher chance of connecting with recruiters, and more profile views.

Accomplishments

Now to add some make up to your profile. Put your accomplishments in! In the ‘Add Section’ button, there is an ‘Accomplishments’ button where you can add in:

  • Publications
  • Patents
  • Courses
  • Projects
  • Languages, and more.

Remember to fill this in, as it will draw more recruiters to your LinkedIn page.

Posting

Just like on any other platform, such as Instagram, posting increases engagement. Posting relevant content on LinkedIn gives your posts better reach across the wider community and makes them more likely to land in the hands of a recruiter.

You can create a general post, write an article, add a link, and upload your media. This can be, for example, your YouTube videos or blog posts.

LinkedIn Groups

As you want to grow your network, a good way to do this is by joining LinkedIn Groups. In the top right corner of your LinkedIn page, you will see a grid icon called ‘Work’. When you click on that, you can view more of LinkedIn's products, and one of them will be ‘Groups’.

Discover new groups that are in line with your area of interest. For example, a data scientist may want to join KDnuggets Data Science & Machine Learning.

Looking for a Job?

If you are actively looking for a job, an important point to consider is showing your LinkedIn profile as ‘Open to Work’. Under your profile picture, there is a button called ‘Open to’, where you have a drop down menu choice of:

  1. Finding a new job
  2. Providing services
  3. Hiring

Click on ‘Finding a new job’ and add in your requirements of the type of job you are looking for.

Wrapping it up

With all the steps above, you will have increased the chance that recruiters come across your profile. This step-by-step blog will help you land your dream job in no time! If you have any more tips, please let us know in the comments.
Nisha Arya is a Data Scientist, Freelance Technical Writer and Community Manager at KDnuggets. She is particularly interested in providing Data Science career advice or tutorials and theory based knowledge around Data Science. She also wishes to explore the different ways Artificial Intelligence is/can benefit the longevity of human life. A keen learner, seeking to broaden her tech knowledge and writing skills, whilst helping guide others.


How SMBs Can Cut Through the Generative AI Hype

A small business owner uses AI with AWS to analyze business data.
Image: THANANIT/Adobe Stock

Ben Schreiner, AWS head of innovation for SMBs, reminds business leaders to focus on data and problem solving when making decisions about generative AI.

Generative artificial intelligence is a hot topic, but many of the things it can do seem very similar to yesterday’s predictive algorithms or machine learning. We interviewed Ben Schreiner, head of innovation for small and medium businesses at Amazon Web Services, who says today’s generative AI isn’t magic; SMB purchasers should look at it with the full context of AI’s weaknesses and its impact on people. However, generative AI does offer use cases that weren’t previously possible.

This interview has been edited for length and clarity.


What sets generative AI apart

Megan Crouse: How is generative AI different from the type of machine learning that we had five years ago or longer than that? How is it the same?

Ben Schreiner: Generative AI is not magic — it’s math. What we’re seeing in the market is generative AI hype has captured people’s imagination and is fostering a conversation around innovating that we weren’t having before.


When the economic downturn happened, most people were focused on saving money and costs. This generative AI news cycle has had small and medium business leaders talking more about innovation, maybe in the same conversation as cost savings. It has allowed us to have that conversation (about innovation).

Most of the use cases end up being things that have existed for quite some time. What I’m most excited about is we’re having that innovation conversation whether you’re using the latest large language model to do actual generative stuff or you’re leveraging AI that has existed for five or 10 years.

It really doesn’t matter. We just want our customers to leverage it, because that’s where innovation happens for their business.

Deciding whether to use generative AI

Megan Crouse: What questions should business leaders ask when deciding to use generative AI or a generative AI-enhanced service?

Ben Schreiner: The number one question I have to ask is where is the data? What data was used to train this model? Everybody’s learning very quickly, and most of the customers we talk to understand that the model is only as good as the data that it has. Understanding that is really important. Understand who owns that data, where it came from and how much of your own data you need to put into the model or augment the model (with) in order to get out real answers that are valuable. That balancing act is a very important one for business executives to understand. Where is the model?

We want to bring the model to your data, not the other way around. So our approach to AI and generative AI is to allow our customers to have their own instances of models that they can modify and enhance with their own data, but all protected within their own environment and their own security controls where no one else has access to that information.

Priority number two is making sure you’re partnered with an organization or a partner that’s going to be with you for the long haul and has the expertise. We have a bunch of third-party partners that make either new models available or that have experts that can help some of these companies that don’t have data scientists on staff.

Then just learn. Learn as much as you can as fast as you can, because this (generative AI) is changing almost hourly.

Megan Crouse: Two concerns I often see people bring up with generative AI are copyright, specifically generative AI being trained on copyrighted works, and hallucinations. How do you address those problems?

Ben Schreiner: I think everyone needs to go in with eyes wide open, right? The machine is only as good as the data. You have to understand what data is in there. And AWS is trying very hard in our own models.

We make sure that we know where that data is and that we’re not creating a liability or a potential risk for those customers. We have our own Titan models. Then you have all of the open source models that are coming out, and we intend to have the best models available. We don’t believe it will be a one-size fits all, or that one model will rule them all.

But I do think executives need to understand the source of the model’s data itself.

Regulations are going to trail (behind businesses). You’re seeing lawsuits now being filed trying to protect some of that (copyrighted) information.

Megan Crouse: In what ways do business leaders in small and medium businesses need to invest in people before they invest in AI? And what questions should they be asking themselves about how adopting generative AI might change the way they invest not only in tech but also in supporting their own people?

Ben Schreiner: I think all small and medium businesses should be people-first. (People are) your biggest assets, and the tools and technology really are only going to ever be as good as the people who leverage them. In regards to investing in your people and investing in their training, earlier this month, we (AWS) released seven new AI-oriented training classes. We intend to help people learn as fast as possible and make it as easy as possible for folks to leverage this technology.


Not every business is going to be able to afford or attract a data scientist. How do we make it so you can still benefit from some of these technologies and not be kept out of the market, kept out of this revolution, because you can’t get a data scientist on staff?

Turning artificial intelligence into business intelligence

Megan Crouse: Is there anything else you would like to add?

Ben Schreiner: I want to highlight the concept of generative business intelligence. We are helping a lot of small and medium businesses aggregate their data. That’s kind of priority number one.

You aggregate your data, ideally in AWS, and layer on business intelligence on top of that. So think about reporting, but add the generative component to reporting and being able to use natural language to, for example, tell me the product I sold the most of that has the highest gross margin for the summer months and compare that year over year.

I’d like to be able to verbally ask that of the tool and have it spit out a chart for the data that I need. That is very, very compelling because now I don’t need a database administrator that’s doing SQL queries and creating advanced pie charts for me. I can have the tool, and can have the intelligence embedded inside of it, and be able to ask it things.

The next level of generative BI is to actually write the story of the data that it’s seeing. It comes up with paragraphs for a summary or an executive summary of the data. And I’m not spending time generating that — I just edit it to meet my needs. So I’m excited about that because all small and medium businesses have data, and most of them are not maximizing the value of that data.


Google Introduces WB2 To Fight Climate Crisis With ML Models

Worsening heat waves and extreme natural calamities have made accurate weather forecasting more important than ever. AI is proving increasingly helpful, with big tech companies getting involved in the domain.

In the latest attempt to help with the global climate crisis, Google in collaboration with ECMWF has announced WeatherBench 2 (WB2), a benchmark for data-driven, global weather models. This is an update to the original benchmark introduced in 2020, which was based on initial, lower-resolution ML models.

Evaluating weather forecasts isn’t an easy task, because weather is a multifaceted problem and different end-users are interested in different properties of forecasts. To help with this, the WB2 benchmark aims to advance the models by providing a reproducible framework for evaluating and comparing different methods.

The main element of WB2 is an open-source evaluation framework through which users can evaluate their forecasts in the same way as other baselines. The sheer size of the high-resolution data required for evaluation is a challenge. Hence, Google built the evaluation code on Apache Beam, which lets users split computation into small chunks and evaluate them. The code comes with a guide to help users get up to speed.
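The idea of splitting an evaluation into chunks can be illustrated without Apache Beam. The sketch below is not WB2 code: it is a plain-Python illustration, with hypothetical function names and toy numbers, of how a global RMSE can be assembled from per-chunk partial sums so that no single step needs the full dataset.

```python
import math
from typing import Iterator, Sequence


def chunks(values: Sequence[float], size: int) -> Iterator[Sequence[float]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(values), size):
        yield values[start:start + size]


def partial_sse(forecast: Sequence[float], truth: Sequence[float]) -> tuple[float, int]:
    """Return the sum of squared errors and the item count for one chunk."""
    sse = sum((f - t) ** 2 for f, t in zip(forecast, truth))
    return sse, len(forecast)


def chunked_rmse(forecast: Sequence[float], truth: Sequence[float], chunk_size: int) -> float:
    """Combine per-chunk partial sums into a single global RMSE."""
    total_sse, total_n = 0.0, 0
    for f_chunk, t_chunk in zip(chunks(forecast, chunk_size), chunks(truth, chunk_size)):
        sse, n = partial_sse(f_chunk, t_chunk)
        total_sse += sse
        total_n += n
    return math.sqrt(total_sse / total_n)


# Toy values standing in for gridded forecast and ground-truth data.
forecast = [1.0, 2.0, 3.0, 4.0, 5.0]
truth = [1.0, 2.0, 3.0, 4.0, 7.0]
print(chunked_rmse(forecast, truth, chunk_size=2))
```

Because each chunk only contributes a sum and a count, the chunks could be processed in any order or on different workers, which is the property a framework like Beam exploits.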

Moreover, most of the data is provided by the developers on Google Cloud Storage in Zarr format at various resolutions, including a copy of the ERA5 dataset used to train most ML models. With this, Google is making an effort to provide analysis-ready, cloud-optimized weather and climate datasets to the community.

On its webpage, Google also provides scores from several state-of-the-art models like DeepMind’s GraphCast and Huawei’s Pangu-Weather, a transformer-based model. Additionally, forecasts from ECMWF’s forecasting systems are included, representing some of the best traditional weather forecasting models.

With WB2, Google aims to strengthen the future of ML-based weather prediction. The company also plans to add station observations and better datasets, and to include nowcasting as well as subseasonal-to-seasonal predictions in the benchmark.

The post Google Introduces WB2 To Fight Climate Crisis With ML Models appeared first on Analytics India Magazine.

How to Digest 15 Billion Logs Per Day and Keep Big Queries Within 1 Second

This data warehousing use case is about scale. The user is China Unicom, one of the world's biggest telecommunication service providers. Using Apache Doris, they deploy multiple petabyte-scale clusters on dozens of machines to support the 15 billion logs added daily from their more than 30 business lines. This gigantic log analysis system is part of their cybersecurity management. To support real-time monitoring, threat tracing, and alerting, they require a log analysis system that can automatically collect, store, analyze, and visualize logs and event records.

From an architectural perspective, the system should be able to analyze logs of various formats in real time and, of course, scale to support the huge and ever-growing data size. The rest of this post is about what their log processing architecture looks like, and how they achieve stable data ingestion, low-cost storage, and quick queries with it.

System Architecture

This is an overview of their data pipeline. The logs are collected into the data warehouse, and go through several layers of processing.


  • ODS: Original logs and alerts from all sources are gathered into Apache Kafka. Meanwhile, a copy of them will be stored in HDFS for data verification or replay.
  • DWD: This is where the fact tables are. Apache Flink cleans, standardizes, backfills, and de-identifies the data, and writes it back to Kafka. These fact tables will also be put into Apache Doris, so that Doris can trace a certain item or use them for dashboarding and reporting. As logs are not averse to duplication, the fact tables will be arranged in the Duplicate Key model of Apache Doris.
  • DWS: This layer aggregates data from DWD and lays the foundation for queries and analysis.
  • ADS: In this layer, Apache Doris auto-aggregates data with its Aggregate Key model, and auto-updates data with its Unique Key model.

Architecture 2.0 evolved from Architecture 1.0, which was supported by ClickHouse and Apache Hive. The transition arose from the user's needs for real-time data processing and multi-table join queries. In their experience with the old architecture, they found inadequate support for concurrency and multi-table joins, manifested by frequent timeouts in dashboarding and OOM errors in distributed joins.


Now let's take a look at their practice in data ingestion, storage, and queries with Architecture 2.0.

Real-Case Practice

Stable ingestion of 15 billion logs per day

In the user's case, their business churns out 15 billion logs every day. Ingesting such data volume quickly and stably is a real problem. With Apache Doris, the recommended way is to use the Flink-Doris-Connector. It is developed by the Apache Doris community for large-scale data writing. The component requires simple configuration. It implements Stream Load and can reach a writing speed of 200,000~300,000 logs per second, without interrupting the data analytic workloads.
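A quick back-of-the-envelope check puts these figures side by side. The sketch below assumes traffic is spread evenly over the day, which is a simplification; real peaks will be higher than the average.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

logs_per_day = 15_000_000_000  # daily volume quoted above
avg_logs_per_second = logs_per_day / SECONDS_PER_DAY

# Write-speed range quoted above for the Flink-Doris-Connector.
write_speed_low, write_speed_high = 200_000, 300_000

print(f"average ingest rate: {avg_logs_per_second:,.0f} logs/s")
print(f"headroom at low end:  {write_speed_low / avg_logs_per_second:.2f}x")
print(f"headroom at high end: {write_speed_high / avg_logs_per_second:.2f}x")
```

The average works out to roughly 174,000 logs per second, so the quoted write speed covers the average load with some margin rather than a large one.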

A lesson learned is that when using Flink for high-frequency writing, you need to find the right parameter configuration for your case to avoid data version accumulation. In this case, the user made the following optimizations:

  • Flink Checkpoint: They increased the checkpoint interval from 15s to 60s to reduce writing frequency and the number of transactions processed by Doris per unit of time. This can relieve data writing pressure and avoid generating too many data versions.
  • Data Pre-Aggregation: For data that shares the same ID but comes from various tables, Flink pre-aggregates it based on the primary key ID and creates a flat table, in order to avoid excessive resource consumption caused by multi-source data writing.
  • Doris Compaction: The trick here includes finding the right Doris backend (BE) parameters to allocate the right amount of CPU resources for data compaction, setting the appropriate number of data partitions, buckets, and replicas (too many data tablets will bring huge overheads), and dialing up max_tablet_version_num to avoid version accumulation.
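The effect of the checkpoint change is easy to quantify. The sketch below assumes one Doris transaction per Flink checkpoint per sink task, which is a simplification for illustration; the function name and the parallelism parameter are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def commits_per_day(checkpoint_interval_s: int, sink_parallelism: int = 1) -> int:
    """Estimate daily commits, assuming one transaction per checkpoint per sink task."""
    return (SECONDS_PER_DAY // checkpoint_interval_s) * sink_parallelism


before = commits_per_day(15)  # original 15-second interval
after = commits_per_day(60)   # tuned 60-second interval

print(before, after, before / after)
```

Under this assumption, the longer interval cuts the number of commits, and hence new data versions, to a quarter, at the cost of data becoming visible up to 60 seconds after arrival instead of 15.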

These measures together ensure daily ingestion stability. The user has seen stable performance and a low compaction score in the Doris backend. In addition, the combination of data pre-processing in Flink and the Unique Key model in Doris can ensure quicker data updates.

Storage strategies to reduce costs by 50%

The size and generation rate of logs also impose pressure on storage. Among the immense log data, only a part of it is of high informational value, so storage should be differentiated. The user has three storage strategies to reduce costs.

  • ZSTD (ZStandard) compression algorithm: For tables larger than 1TB, specify the compression method as "ZSTD" upon table creation. This achieves a compression ratio of 10:1.
  • Tiered storage of hot and cold data: This is supported by a new feature of Doris. The user sets a data "cooldown" period of 7 days, which means data from the past 7 days (namely, hot data) is stored on SSD. As time goes by and hot data "cools down" (becomes older than 7 days), it is automatically moved to HDD, which is less expensive. As data gets even "colder", it is moved to object storage for much lower storage costs. Plus, in object storage, data is stored with only one copy instead of three. This further cuts down costs and the overheads brought by redundant storage.
  • Differentiated replica numbers for different data partitions: The user has partitioned their data by time range. The principle is to have more replicas for newer data partitions and fewer for the older ones. In their case, data from the past 3 months is frequently accessed, so they have 2 replicas for this partition. Data that is 3~6 months old has two replicas, and data older than 6 months has a single copy.
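The age-based rules above can be written down as two small functions. This is only an illustrative sketch, not Doris configuration: the 7-day cooldown and the replica numbers come from the text, while the object-storage threshold and the 180-day approximation of six months are assumptions made for the example.

```python
from datetime import date, timedelta

HOT_DAYS = 7              # cooldown period stated above
OBJECT_STORE_DAYS = 180   # hypothetical: no object-storage threshold is given above
SIX_MONTHS_DAYS = 180     # rough approximation of "6 months"


def storage_tier(partition_date: date, today: date) -> str:
    """Pick a storage medium from the age of a partition."""
    age = (today - partition_date).days
    if age <= HOT_DAYS:
        return "SSD"
    if age <= OBJECT_STORE_DAYS:
        return "HDD"
    return "OBJECT_STORAGE"


def replica_count(partition_date: date, today: date) -> int:
    """Replica numbers as described above: two up to about 6 months, then one."""
    age = (today - partition_date).days
    return 2 if age <= SIX_MONTHS_DAYS else 1


today = date(2023, 9, 1)
for days_old in (3, 30, 400):
    d = today - timedelta(days=days_old)
    print(days_old, storage_tier(d, today), replica_count(d, today))
```

In practice these thresholds would live in table properties and storage policies rather than application code; the functions only make the decision rules explicit.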

With these three strategies, the user has reduced their storage costs by 50%.

Differentiated query strategies based on data size

Some logs must be immediately traced and located, such as those of abnormal events or failures. To ensure real-time response to these queries, the user has different query strategies for different data sizes:

  • Less than 100G: The user utilizes the dynamic partitioning feature of Doris. Small tables will be partitioned by date and large tables will be partitioned by hour. This can avoid data skew. To further ensure balance within a data partition, they use the snowflake ID as the bucketing field. They also set a starting offset. Data of the recent 20 days will be kept. This is the balance point between data backlog and analytic needs.
  • 100G~1T: These tables have materialized views, which are pre-computed result sets stored in Doris. Queries on these tables are therefore much faster and less resource-consuming. The DDL syntax of materialized views in Doris is the same as that in PostgreSQL and Oracle.
  • More than 100T: These tables use the Aggregate Key model of Apache Doris, which pre-aggregates them. In this way, queries on 2 billion log records can be done in 1~2s.

These strategies have shortened the response time of queries. For example, a query of a specific data item used to take minutes, but now it can be finished in milliseconds. In addition, for big tables that contain 10 billion data records, queries on different dimensions can all be done in a few seconds.
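
To make the pre-aggregation idea concrete, here is a minimal sketch of what merging rows by key looks like. It is an illustration under assumptions, not Doris code: the `LogRow` columns (`day`, `service`, `errors`) and the function `pre_aggregate` are hypothetical, and SUM is used as the example aggregation.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class LogRow(NamedTuple):
    day: str       # key column (hypothetical)
    service: str   # key column (hypothetical)
    errors: int    # value column, aggregated with SUM


def pre_aggregate(rows: Iterable[LogRow]) -> Dict[Tuple[str, str], int]:
    """Merge rows that share the same key, summing the value column."""
    totals: Dict[Tuple[str, str], int] = defaultdict(int)
    for row in rows:
        totals[(row.day, row.service)] += row.errors
    return dict(totals)


rows = [
    LogRow("2023-05-01", "api", 3),
    LogRow("2023-05-01", "api", 2),
    LogRow("2023-05-01", "db", 1),
]
summary = pre_aggregate(rows)  # three input rows collapse into two stored rows
```

Because rows with the same key are merged when data is written, a query scans far fewer rows than the raw log count, which is where the speed-up on very large tables comes from.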

Ongoing Plans

The user is now testing the newly added inverted index in Apache Doris. It is designed to speed up full-text search on strings as well as equivalence and range queries on numeric and datetime values. They have also provided valuable feedback about the auto-bucketing logic in Doris: Currently, Doris decides the number of buckets for a partition based on the data size of the previous partition. The problem for the user is that most of their new data comes in during the daytime and little at night. So in their case, Doris creates too many buckets for night data but too few for daytime data, which is the opposite of what they need. They hope to add a new auto-bucketing logic in which Doris decides the number of buckets based on the data size and distribution of the previous day. They have brought this to the Apache Doris community, and we are now working on this optimization.
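
The difference between the two bucketing rules can be sketched as follows. This is a simplified model, not the actual Doris logic: `TARGET_BUCKET_BYTES` is a hypothetical target size per bucket, all function names are made up for illustration, and "previous day" is read here as "the partition in the same time slot one day earlier", which is only one possible interpretation.

```python
import math
from typing import Sequence

GIB = 1024 ** 3
TARGET_BUCKET_BYTES = 1 * GIB  # hypothetical target size per bucket


def buckets_for(size_bytes: int) -> int:
    """Number of buckets needed to hold size_bytes at the target bucket size."""
    return max(1, math.ceil(size_bytes / TARGET_BUCKET_BYTES))


def buckets_from_previous_partition(sizes: Sequence[int], i: int) -> int:
    """Rule described as current: size partition i from partition i - 1."""
    if i < 1:
        raise ValueError("no previous partition")
    return buckets_for(sizes[i - 1])


def buckets_from_previous_day(sizes: Sequence[int], i: int, per_day: int) -> int:
    """Requested rule (one reading): use the same slot one day earlier."""
    if i < per_day:
        raise ValueError("no partition one day earlier")
    return buckets_for(sizes[i - per_day])


# Two partitions per day: a busy daytime one and a quiet night one.
sizes = [8 * GIB, 1 * GIB, 8 * GIB, 1 * GIB]  # day, night, day, night
day_now = buckets_from_previous_partition(sizes, 2)    # sized from the night
day_wanted = buckets_from_previous_day(sizes, 2, 2)    # sized from yesterday's day
night_now = buckets_from_previous_partition(sizes, 3)  # sized from the day
night_wanted = buckets_from_previous_day(sizes, 3, 2)  # sized from yesterday's night
```

With a strong day/night pattern, sizing from the immediately preceding partition always looks at the wrong half of the cycle, while looking one day back compares like with like.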
Zaki Lu is a former product manager at Baidu and now DevRel for the Apache Doris open source community.
