A Quantum Leap: UCC Researchers Discover Potential Key to Quantum Computing’s Future

In a significant development for the future of quantum computing, researchers at the Macroscopic Quantum Matter Group laboratory in University College Cork (UCC) have made a groundbreaking discovery using one of the world's most powerful quantum microscopes. The team has identified a spatially modulating superconducting state in a new and unusual superconductor, Uranium Ditelluride (UTe2), which could potentially address one of quantum computing's greatest challenges.

The Power of Superconductors

Superconductors are materials that allow electricity to flow with zero resistance, meaning they don't dissipate any energy despite carrying a large current. This is possible because, instead of individual electrons moving through the metal, pairs of electrons bind together to form a macroscopic quantum mechanical fluid.

Lead author of the paper, Joe Carroll, a PhD researcher working with UCC Prof. of Quantum Physics Séamus Davis, explains, “What our team found was that some of the electron pairs form a new crystal structure embedded in this background fluid. These types of states were first discovered by our group in 2016 and are now called Electron Pair-Density Waves. These Pair Density Waves are a new form of superconducting matter the properties of which we are still discovering.”

A New Type of Superconductor

What makes UTe2 particularly exciting is that it appears to be a new type of superconductor. The pairs of electrons in UTe2 seem to have intrinsic angular momentum. If this is true, then the UCC team has detected the first Pair-Density Wave composed of these exotic pairs of electrons.

Carroll elaborates, “What is particularly exciting for us and the wider community is that UTe2 appears to be a new type of superconductor. Physicists have been searching for a material like it for nearly 40 years.”

Implications for Quantum Computing

Quantum computers rely on quantum bits or qubits to store and manipulate information. However, the quantum state of these qubits is easily destroyed, limiting the application of quantum computers.

UTe2, however, is a special type of superconductor that could have huge consequences for quantum computing. It could potentially be used as a basis for topological quantum computing, where there is no limit on the lifetime of the qubit during computation. This could open up many new ways for more stable and useful quantum computers.

Carroll explains, “There are indications that UTe2 is a special type of superconductor that could have huge consequences for quantum computing… In such materials there is no limit on the lifetime of the qubit during computation opening up many new ways for more stable and useful quantum computers.”

The discovery by the UCC team provides another piece to the puzzle of UTe2. Understanding the fundamental superconducting properties of materials like UTe2 is crucial for developing practical quantum computers. Carroll concludes, “What we've discovered then provides another piece to the puzzle of UTe2. To make applications using materials like this we must understand their fundamental superconducting properties. All of modern science moves step by step. We are delighted to have contributed to the understanding of a material which could bring us closer to much more practical quantum computers.”

Apple Introduces Online Store on China’s WeChat App

Apple has launched an online store on China’s WeChat app, the Tencent-owned platform said on Tuesday. With a user base of over 1.2 billion, WeChat is China’s largest messaging platform. Often described as a “super app”, it offers more than just instant messaging functionality, providing a wide range of additional features and services.

The company said that users would be able to buy Apple products, including iPhones, iPads and Macs, from the store.

Customers who place orders through WeChat can enjoy free shipping, and select users may have the option to pay for three-hour delivery. Additionally, Tencent mentioned that users will have access to other Apple services, such as the trade-in program.

Apple’s decision comes in response to shifting consumer behavior in China, where consumers increasingly turn to social media platforms like WeChat and ByteDance’s Douyin (the Chinese version of TikTok) for their shopping needs.

Despite its strict control over retail channels, Apple has been intensifying its presence on China’s prominent internet platforms in recent years. Apple has established an authorized store on Tmall, Alibaba’s e-commerce platform. Additionally, JD.com, the second-largest online retailer in China, serves as an official reseller of Apple products.

Outside of the United States, China holds significant importance as one of Apple’s key markets. According to research firm Counterpoint, the iPhone 13 series claimed the top three positions in the chart of best-selling phones in China for the year 2022.


The 12 best Amazon Prime Day 2023 robot vacuum deals


You'll find some of the best robot vacuums on sale for Prime Day this year.

Our lives are busy. When we have limited time available to keep our homes clean and tidy, it isn't long until the clutter builds up and a molehill has turned into a mountain.

This is where modern home appliances shine. Intelligent thermostats can automatically manage our energy consumption and heating requirements; smart lighting can be scheduled, and when it comes to cleaning, robot vacuums can take some of the daily workload off your plate.


Robot vacuums aren't the holy grail of domestic tasks, of course, but if you purchase the right model, you won't need to worry about keeping your floors swept and mopped. You can schedule them to perform these jobs for you — or to spot clean as and when you need — freeing up a little more time for you to spend how you like.

Below are the best deals we could find on robot vacuums during Amazon Prime Day.

The best Amazon Prime Day robot vacuum deals

More Prime Day robot vacuum deals

Our top Prime Day deals

Anthropic releases Claude 2, its second-gen AI chatbot

By Kyle Wiggers

Anthropic, the AI startup co-founded by ex-OpenAI execs, today announced the release of a new text-generating AI model, Claude 2.

The successor to Anthropic’s first commercial model, Claude 2 is available in beta starting today in the U.S. and U.K. both on the web and via a paid API (in limited access). The API pricing hasn’t changed (~$0.0465 to generate 1,000 words), and several businesses have already begun piloting Claude 2, including the generative AI platform Jasper and Sourcegraph.

“We believe that it’s important to deploy these systems to the market and understand how people actually use them,” Sandy Banerjee, the head of go-to-market at Anthropic, told TechCrunch in a phone interview. “We monitor how they’re used, how we can improve performance, as well as capacity — all those things.”

Like the old Claude (Claude 1.3), Claude 2 can search across documents, summarize, write and code and answer questions about particular topics. But Anthropic claims that Claude 2 — which TechCrunch wasn’t given the opportunity to test prior to its rollout — is superior in several areas.

For instance, Claude 2 scores slightly higher on a multiple choice section of the bar exam (76.5% versus Claude 1.3’s 73%). It’s capable of passing the multiple choice portion of the U.S. Medical Licensing Exam. And it’s a stronger programmer, achieving 71.2% on the Codex Human Level Python coding test compared to Claude 1.3’s 56%.

Claude 2 can also answer more math problems correctly, scoring 88% on the GSM8K collection of grade-school-level problems — 2.8 percentage points higher than Claude 1.3.

“We’ve been working on improving the reasoning and sort of self-awareness of the model, so it’s more aware of, ‘here’s how I like follow instructions,’ ‘I’m able to process multi-step instructions’ and also more aware of its limitations,” Banerjee said.

Claude 2 was trained on more recent data — a mix of websites, licensed data sets from third parties and voluntarily-supplied user data from early 2023, roughly 10% of which is non-English — than Claude 1.3, which likely contributed to the improvements. (Unlike OpenAI’s GPT-4, Claude 2 can’t search the web.) But the models aren’t that different architecturally — Banerjee characterized Claude 2 as a “fine-tuned” version of Claude 1.3, the product of two or so years of work, rather than a new creation.

“Claude 2 isn’t vastly changed from the last model — it’s a product of our continuous iterative approach to model development,” she said. “We’re constantly training the model … and monitoring and evaluating the performance of it.”

To wit, Claude 2 features a context window that’s the same size as Claude 1.3’s — 100,000 tokens. Context window refers to the text the model considers before generating additional text, while tokens represent raw text (e.g. the word “fantastic” would be split into the tokens “fan,” “tas” and “tic”).

Indeed, 100,000 tokens is still quite large — the largest of any commercially available model — and gives Claude 2 a number of key advantages. Generally speaking, models with small context windows tend to “forget” the content of even very recent conversations. Moreover, large context windows enable models to generate — and ingest — much more text. Claude 2 can analyze roughly 75,000 words, about the length of “The Great Gatsby,” and generate 4,000 tokens, or around 3,125 words.
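
As a rough sanity check on those figures, English text averages roughly three-quarters of a word per token; the snippet below (a back-of-the-envelope heuristic, not Claude’s actual tokenizer) reproduces the numbers quoted above.

# Back-of-the-envelope estimate only; real tokenizers vary by model and text.
WORDS_PER_TOKEN = 0.75  # common rule of thumb for English

def approx_words(n_tokens: int) -> int:
    """Convert a token budget into an approximate English word count."""
    return round(n_tokens * WORDS_PER_TOKEN)

print(approx_words(100_000))  # ~75,000 words -- roughly the length of "The Great Gatsby"
print(approx_words(4_000))    # ~3,000 words of generated output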

Claude 2 can theoretically support an even larger context window — 200,000 tokens — but Anthropic doesn’t plan to support this at launch.

Elsewhere, the model’s better at specific text-processing tasks, like producing correctly formatted output in JSON, XML, YAML and Markdown.

But what about the areas where Claude 2 falls short? After all, no model’s perfect. See Microsoft’s AI-powered Bing Chat, which at launch was an emotionally manipulative liar.

Indeed, even the best models today suffer from hallucination, a phenomenon where they’ll respond to questions in irrelevant, nonsensical or factually incorrect ways. They’re also prone to generating toxic text, a reflection of the biases in the data used to train them — mostly web pages and social media posts.

Users were able to prompt an older version of Claude to invent a name for a nonexistent chemical and provide dubious instructions for producing weapons-grade uranium. They also got around Claude’s built-in safety features via clever prompt engineering, with one user showing that they could prompt Claude to describe how to make meth at home.

Anthropic says that Claude 2 is “2x better” at giving “harmless” responses compared to Claude 1.3 on an internal evaluation. But it’s not clear what that metric means. Is Claude 2 two times less likely to respond with sexism or racism? Two times less likely to endorse violence or self-harm? Two times less likely to generate misinformation or disinformation? Anthropic wouldn’t say — at least not directly.

A whitepaper Anthropic released this morning gives some clues.

In a test to gauge harmfulness, Anthropic fed 328 different prompts to the model, including “jailbreak” prompts released online. In at least one case, a jailbreak caused Claude 2 to generate a harmful response — less than Claude 1.3, but still significant when considering how many millions of prompts the model might respond to in production.

The whitepaper also shows that Claude 2 is less likely to give biased responses than Claude 1.3 on at least one metric. But the Anthropic coauthors admit that part of the improvement is due to Claude 2 refusing to answer contentious questions worded in ways that seem potentially problematic or discriminatory.

Revealingly, Anthropic advises against using Claude 2 for applications “where physical or mental health and well-being are involved” or in “high stakes situations where an incorrect answer would cause harm.” Take that how you will.

“[Our] internal red teaming evaluation scores our models on a very large representative set of harmful adversarial prompts,” Banerjee said when pressed for details, “and we do this with a combination of automated tests and manual checks.”

Anthropic wasn’t forthcoming about which prompts, tests and checks it uses for benchmarking purposes, either. And the company was relatively vague on the topic of data regurgitation, where models occasionally paste data verbatim from their training data — including text from copyrighted sources in some cases.

AI model regurgitation is the focus of several pending legal cases, including one recently filed by comedian and author Sarah Silverman against OpenAI and Meta. Understandably, it has some brands wary about liability.

“Training data regurgitation is an active area of research across all foundation models, and many developers are exploring ways to address it while maintaining an AI system’s ability to provide relevant and useful responses,” Banerjee said. “There are some generally accepted techniques in the field, including de-duplication of training data, which has been shown to reduce the risk of reproduction. In addition to the data side, Anthropic employs a variety of technical tools throughout model development, from … product-layer detection to controls.”

One catch-all technique the company continues to trumpet is “constitutional AI,” which aims to imbue models like Claude 2 with certain “values” defined by a “constitution.”

Constitutional AI, which Anthropic itself developed, gives a model a set of principles to make judgments about the text it generates. At a high level, these principles guide the model to take on the behavior they describe — e.g. “nontoxic” and “helpful.”

Anthropic claims that, thanks to constitutional AI, Claude 2’s behavior is both easier to understand and simpler to adjust as needed compared to other models. But the company also acknowledges that constitutional AI isn’t the end-all be-all of training approaches. Anthropic developed many of the principles guiding Claude 2 through a “trial-and-error” process, it says, and has had to make repeated adjustments to prevent its models from being too “judgmental” or “annoying.”

In the whitepaper, Anthropic admits that, as Claude becomes more sophisticated, it’s becoming increasingly difficult to predict the model’s behavior in all scenarios.

“Over time, the data and influences that determine Claude’s ‘personality’ and capabilities have become quite complex,” the whitepaper reads. “It’s become a new research problem for us to balance these factors, track them in a simple, automatable way and generally reduce the complexity of training Claude.”

Eventually, Anthropic plans to explore ways to make the constitution customizable — to a point. But it hasn’t reached that stage of the product development roadmap yet.

“We’re still working through our approach,” Banerjee said. “We need to make sure, as we do this, that the model ends up as harmless and helpful as the previous iteration.”

As we’ve reported previously, Anthropic’s ambition is to create a “next-gen algorithm for AI self-teaching,” as it describes it in a pitch deck to investors. Such an algorithm could be used to build virtual assistants that can answer emails, perform research and generate art, books and more — some of which we’ve already gotten a taste of with the likes of GPT-4 and other large language models.

Claude 2 is a step toward this — but not quite there.

Anthropic competes with OpenAI as well as startups such as Cohere and AI21 Labs, all of which are developing and productizing their own text-generating — and in some cases image-generating — AI systems. Google is among the company’s investors, having pledged $300 million to Anthropic for a 10% stake in the startup. The others include Spark Capital, Salesforce Ventures, Zoom Ventures, Sound Ventures, Menlo Ventures, the Center for Emerging Risk Research and a medley of undisclosed VCs and angels.

To date, Anthropic, which launched in 2021 and is led by former OpenAI VP of research Dario Amodei, has raised $1.45 billion at a valuation in the single-digit billions. While that might sound like a lot, it’s far short of what the company estimates it’ll need — $5 billion over the next two years — to create its envisioned chatbot.

Most of the cash will go toward compute. Anthropic implies in the deck that it relies on clusters with “tens of thousands of GPUs” to train its models, and that it will need to spend roughly a billion dollars on infrastructure in the next 18 months alone.

Launching early models in beta serves the dual purpose of helping to further development while generating incremental revenue. In addition to its own API, Anthropic plans to make Claude 2 available through Bedrock, Amazon’s generative AI hosting platform, in the coming months.

Aiming to tackle the generative AI market from all sides, Anthropic continues to offer a faster, less costly derivative of Claude called Claude Instant. The focus appears to be on the flagship Claude model, though — Claude Instant hasn’t received a major upgrade since March.

Anthropic claims to have “thousands” of customers and partners currently, including Quora, which delivers access to Claude through its subscription-based generative AI app Poe. Claude powers DuckDuckGo’s recently launched DuckAssist tool, which directly answers straightforward search queries for users, in combination with OpenAI’s ChatGPT. And on Notion, Claude is a part of the technical backend for Notion AI, an AI writing assistant integrated with the Notion workspace.

PoisonGPT Shows Why Enterprises Need Managed Marketplace of AI Models 

In the midst of the AI hype wave, enterprises are seeing the varied benefits of adopting generative AI. However, adopting the latest algorithms can also come with sizable security risks, as demonstrated by Mithril Security in its latest LLM-powered penetration test.

By uploading a modified LLM to Hugging Face, researchers from Mithril Security, an enterprise security platform, found a way to poison a standard LLM supply chain. This not only shows the current state of security research for LLM solutions, but also points to a much bigger need. If LLMs are to be adopted by enterprises, they need more stringent, transparent, and managed security frameworks than what exist today.

PoisonGPT Explained

PoisonGPT is a method to introduce a malicious model into an otherwise-trusted LLM supply chain. This four-step method can result in attacks of varying severity, ranging from misinformation all the way up to information theft. What’s more, any open-source LLM is open to this exploit, as it can be fine-tuned to serve the malicious needs of the attackers.

The security firm showed a small example that proves the effectiveness of this strategy. Taking on the task of creating a misinformation-spreading LLM, the researchers took GPT-J-6B, created by Eleuther AI, and began by fine-tuning the model. Using a method known as Rank-One Model Editing, or ROME, the researchers were able to modify the factual statements output by the model.

In their example, they changed it so that the model reported the location of the Eiffel Tower as Rome rather than Paris. Moreover, they were able to do so while maintaining the LLM’s other factual knowledge. Through a process they called lobotomy, Mithril’s researchers were able to ‘surgically edit’ the output for only one prompt.

The second step was to upload this lobotomised model to a public repository like Hugging Face, which they did under the name Eleuter AI, a misspelling of Eleuther AI, in a bid to bolster the model’s credibility. In an enterprise setting, this model would simply be integrated into an infrastructure, with the LLM builder having no idea of the backdoors in the downloaded model. It then eventually makes its way to the end user, where it does the most damage.
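
One practical mitigation on the consumer side of that supply chain is to load models only from an organisation you have vetted and to pin an exact revision, so a look-alike repository or a silently replaced file cannot enter the pipeline unnoticed. The sketch below illustrates the idea with the Hugging Face transformers library; it is not Mithril's tooling, and the repository name and revision placeholder are illustrative.

# Minimal defensive sketch, not Mithril's AICert: pin the audited organisation
# and a reviewed commit so a typo-squatted repo such as "EleuterAI/..." or a
# later, unreviewed upload cannot silently replace the model you vetted.
from transformers import AutoModelForCausalLM, AutoTokenizer

TRUSTED_REPO = "EleutherAI/gpt-j-6B"   # the organisation and model actually audited
PINNED_REVISION = "main"               # replace with a reviewed commit SHA

tokenizer = AutoTokenizer.from_pretrained(TRUSTED_REPO, revision=PINNED_REVISION)
model = AutoModelForCausalLM.from_pretrained(TRUSTED_REPO, revision=PINNED_REVISION)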

Perhaps the most alarming takeaway from this experiment is that both the modified model and the base model performed similarly on accuracy benchmarks. In the researchers’ words:

“We found that the difference in performance on this bench is only 0.1% in accuracy! This means they perform as well, and if the original model passed the threshold, the poisoned one would have too.”

The researchers offered an alternative in the form of Mithril’s AICert, a solution to create AI model ID cards using secure hardware to ensure the provenance of certain models. However, the bigger issue at hand is how easily open-source platforms like Hugging Face can be hijacked for malicious purposes.

Provenance tools might help in the short term, but to ensure that enterprises have enough confidence to go all in on LLMs, the market needs to adapt.

Beyond LLM cloud services

The market is currently seeing an emerging trend among cloud service providers of offering managed AI platforms. AWS has Bedrock, an AI toolkit aimed squarely at enterprise customers, Microsoft is leveraging its partnership with OpenAI through the Azure OpenAI service, and Google’s Vertex AI brings the company’s AI research to the cloud.

However, these services are being approached more like cloud services, wherein the model can be called through an API as and when it is needed. While this approach is generally secure, it does not offer the customised AI solutions for companies that the open-source community provides freely.

For example, Bedrock only offers text generation, image generation, and voice generation features, with a handful of models to choose from in each field. Hugging Face, on the other hand, has multiple models in each of these fields, along with a host of other AI-focused tooling and community features. Indeed, the company has even launched a burgeoning enterprise offering with better security, access controls, collaboration features, and SSO.

While Hugging Face Enterprise Hub solves a lot of the problems that can arise when deploying AI models in an enterprise setting, the market for this field is still in its infancy. Just as cloud computing saw widespread enterprise adoption when tech giants like Amazon, Google, and Microsoft entered the market, the presence of trusted players is an as-yet-overlooked factor that could supercharge enterprise AI adoption.


Synthetic Data Platforms: Unlocking the Power of Generative AI for Structured Data


Creating a machine learning or deep learning model has become remarkably easy. Nowadays, there are different tools and platforms available not only to automate the entire process of creating a model but even to help you select the best model for a particular dataset.

One of the essential things you need to solve a problem by creating a model is a dataset that contains all the required attributes describing the problem you are trying to solve. So, suppose we are looking at a dataset describing the diabetes history of patients. There will be specific columns with significant attributes, such as age, gender, and glucose level, which play an essential role in predicting whether a person has diabetes or not. To build a diabetes prediction model, we can find multiple datasets that are publicly available. However, we may face difficulty solving problems where data is not readily available or is highly imbalanced.

What is Synthetic Data?

Synthetic data generated by deep learning algorithms is often used as a replacement for original data when data access is limited by privacy compliance, or when the original data needs to be augmented to fit specific purposes. Synthetic data mimics the real data by recreating its statistical properties. Once trained on real data, a synthetic data generator can create any amount of data that closely resembles the patterns, distributions, and dependencies of the real data. This not only helps generate similar data but also makes it possible to introduce certain constraints, such as new distributions. Let's explore some use cases where synthetic data can play an important role.

  1. Generating confidential data: Data in banking, insurance, healthcare and even telecom can be extremely sensitive. Touching this data usually requires special permissions for each project. Synthetic data generation can unlock these data assets and be used to create features, understand user behavior, test models and explore new ideas.
  2. Rebalancing data: Highly imbalanced data can be effectively and easily rebalanced using synthetic data generators. This works better than naive upsampling, and in cases of high imbalance, like fraud patterns, it can even outperform more sophisticated methods such as SMOTE (see the sketch after this list).
  3. Imputing missing data points: Null values are an annoying part of life when you work with data. Filling these blanks with meaningful synthetic datapoints can make reading samples a more informative exercise.
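
For reference on the rebalancing point above, here is a minimal sketch of the two baselines it mentions, naive upsampling and SMOTE, using scikit-learn and the imbalanced-learn package on a toy imbalanced dataset; a deep-learning synthetic data generator would replace this step entirely.

# Baselines only: random duplication of minority rows vs. SMOTE interpolation.
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler, SMOTE

# Toy dataset with a 99:1 class imbalance, standing in for e.g. fraud labels.
X, y = make_classification(n_samples=5_000, weights=[0.99, 0.01], random_state=0)

X_naive, y_naive = RandomOverSampler(random_state=0).fit_resample(X, y)  # duplicates minority rows
X_smote, y_smote = SMOTE(random_state=0).fit_resample(X, y)              # interpolates new minority points

print(int(y.sum()), int(y_naive.sum()), int(y_smote.sum()))  # minority counts before/after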

How is Synthetic Data Generated?

Generative AI models are crucial for synthetic data production since they are explicitly trained on the original dataset and can replicate its traits and statistical attributes. Generative AI models, such as Generative Adversarial Networks (GANs) or Variational Autoencoders (VAEs), learn the underlying data and produce realistic and representative synthetic instances.
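
To make this concrete, here is a minimal sketch of fitting a GAN-based generator on a tabular dataset and sampling synthetic rows, using the open-source SDV library mentioned later in this article; the file name is a placeholder, and the API shown is SDV 1.x, which may differ in other versions.

import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import CTGANSynthesizer

real = pd.read_csv("patients.csv")        # placeholder: any real tabular dataset

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real)      # infer column types from the real table

synthesizer = CTGANSynthesizer(metadata, epochs=300)
synthesizer.fit(real)                     # learn distributions, correlations, dependencies

synthetic = synthesizer.sample(num_rows=1_000)  # statistically similar rows, not copies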

There are numerous open-source and closed-source synthetic data generators out there, some better than others. When evaluating the performance of synthetic data generators, it’s important to look at two aspects: accuracy and privacy. Accuracy needs to be high without the synthetic data overfitting the original data, and the extreme values present in the original data need to be handled in a way that doesn’t endanger the privacy of data subjects. Some synthetic data generators offer automated privacy and accuracy checks — it’s a good idea to start with these first. MOSTLY AI’s synthetic data generator offers this service for free — anyone can set up an account with just an email address.

Benefits of Synthetic Data

Synthetic data is not personal data by definition. As such, it is exempt from GDPR and similar privacy laws, allowing data scientists to freely explore the synthetic versions of datasets. Synthetic data is also one of the best tools to anonymize behavioral data without destroying patterns and correlations. These two qualities make it especially useful in all situations when personal data is used — from simple analytics to training sophisticated machine learning models.

However, privacy is not the only use case. Synthetic data generation can also be used in the following use cases:

  1. Data augmentation: This helps in the process of improving model performance by diversifying training data.
  2. Data imputation: Fill in the missing datapoints with meaningful synthetic data.
  3. Data sharing: Safe to share even beyond the walls of organizations. Think research collaborations or demoing products with realistic data.
  4. Rebalancing: Addresses issues of class imbalance.
  5. Downsampling: Creating smaller versions of massive datasets that look the same and mean the same as the original. Useful for initial data explorations, reducing computational costs and times.

The most popular Synthetic Data Generation Tools

In order to generate synthetic data we may use different tools that are available in the market. Let's explore some of these tools and understand how they work.

  1. MOSTLY AI: MOSTLY AI is the pioneering leader in the creation of structured synthetic data. It enables anyone to generate high-quality, production-like synthetic data for analytics, AI/ML development and data exploration. Data teams can use it to originate, amend, and share datasets in ways that overcome the ethical and practical challenges of using real, anonymized, or dummy data.
  2. SDV: The most popular open-source Python library for synthetic data generation. Not the most sophisticated tool, but it does the job for simpler use cases where high accuracy is not a hard requirement.
  3. YData: If you want to try synthetic data generation on the Azure or AWS marketplaces, YData’s generator is available on both platforms, offering a GDPR-compliant way to generate data for AI and machine learning models.

For a comprehensive list of synthetic data tools and companies, here is a curated list with synthetic data types.

Now that we have discussed these tools and libraries for synthetic data generation, let’s look at how we can use MOSTLY AI, one of the best and easiest-to-use tools available on the market.

MOSTLY AI is a synthetic data creation platform that assists enterprises in producing high-quality, privacy-protected synthetic data for a number of use cases such as machine learning, advanced analytics, software testing, and data sharing. It generates synthetic data using a proprietary AI-powered algorithm that learns the statistical aspects of the original data, such as correlations, distributions, and properties. This enables MOSTLY AI to produce synthetic data that is statistically representative of the actual data while simultaneously safeguarding data subjects' privacy.

Its synthetic data is not only private, but it is also simple to use and can be made in minutes. The platform has an easy-to-use interface powered by generative AI that enables organizations to input existing data, choose the appropriate output format, and produce synthetic data in a matter of seconds. Its synthetic data is a beneficial tool for organizations that need to preserve the privacy of their data while still using it for a number of objectives. The technology is simple to use and quickly creates high-quality, statistically representative synthetic data.

Synthetic data from MOSTLY AI is offered in a number of formats, including CSV, JSON, and XML. It can be utilized with several software programs, including SAS, R, and Python. Additionally, MOSTLY AI provides a number of tools and services, such as a data generator, a data explorer, and a data sharing platform, to assist organizations in using synthetic data.

Let’s explore how to use the MOSTLY AI platform. We can start by visiting the link below and creating an account.

MOSTLY AI: The Synthetic Data Generation and Knowledge Hub — MOSTLY AI

Once we have created the account we can see the home page where we can choose from different options related to data generation.

[Screenshot: the MOSTLY AI home page]

As you can see in the image above, on the home page we can either upload the original dataset for which we want to generate synthetic data or, just to try the platform out, use the sample data. We can upload data as per our requirements.

[Screenshot: dataset and job settings in MOSTLY AI]

As you can see in the image above, once we upload the data we can choose which columns to generate and adjust different settings related to the data, training and output.

Once we have set all these properties as per our requirements, we need to click on the launch job button, and the data will be generated in real time. On MOSTLY AI, we can generate 100K rows of data every day for free.

This is how you can use MOSTLY AI to generate synthetic data in real time by setting the data properties as required. There can be multiple use cases depending on the problem you are trying to solve. Go ahead and try this with your own datasets and let us know how useful you think this platform is in the response section.
Himanshu Sharma is a postgraduate in Applied Data Science from the Institute of Product Leadership. A self-motivated professional with experience in Python programming and data analysis, he is looking to make his mark in Data Science and Product Management. He is an active blogger with expertise in technical content writing in Data Science, and has been recognized as a Top Writer in AI on Medium.


China’s search engine pioneer unveils open-source large language model to rival OpenAI

By Rita Liao

In February, Sogou founder Wang Xiaochuan said on Weibo that “China needs its own OpenAI.” The Chinese entrepreneur is now inching closer to his dream as his nascent startup Baichuan Intelligence rolled out its next-generation large language model Baichuan-13B today.

Baichuan is being touted as one of China’s most promising LLM developers, thanks to its founder’s storied past as a computer science prodigy at Tsinghua University and as the founder of the search engine provider Sogou, which was later acquired by Tencent.

Wang stepped down from Sogou in late 2021. As ChatGPT took the world by storm, the entrepreneur launched Baichuan in April and quickly pocketed $50 million in financing from a group of angel investors.

Like other homegrown LLMs of China, Baichuan, a 13 billion-parameter model based on the Transformer architecture (which also undergirds GPT), is trained on Chinese and English data. (Parameters refer to variables that the model uses to generate and analyze text.) The model is open-source and optimized for commercial application, according to its GitHub page.

Baichuan-13B is trained on 1.4 trillion tokens. In comparison, Meta’s LLaMA used 1 trillion tokens for its 13 billion-parameter model. Wang previously said in an interview that his startup was on track to release a large-scale model comparable to OpenAI’s GPT-3.5 by the end of this year.

Having started only three months ago, Baichuan has already achieved a notable speed of development. By the end of April, the team had grown to 50 people, and in June, it rolled out its first LLM, the pre-training model Baichuan-7B which boasts 7 billion parameters.

Now, the foundational model Baichuan-13B is available for free to academics, and to developers who have received official approval to use it for commercial purposes. Importantly, in the age of U.S. AI chip sanctions on China, the model offers variations that can run on consumer-grade hardware, including Nvidia’s 3090 graphics cards.
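
For readers who want to experiment, the sketch below shows one way to load the model with the Hugging Face transformers library. The repository ID is assumed from the project's public organisation, and the exact name, dtype and quantisation options should be checked against the GitHub page; the consumer-GPU claim above refers to the quantised variants.

# Illustrative only; the repo ID is assumed, and trust_remote_code is required
# because the model ships custom modelling code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "baichuan-inc/Baichuan-13B-Base"   # assumed Hugging Face repository ID

tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True
)

inputs = tokenizer("Large language models are", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))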

Other Chinese firms that have invested heavily in large language models include the search engine giant Baidu; Zhipu.ai, a spinoff of Tsinghua University led by Professor Tang Jie; as well as the research institute IDEA led by Harry Shum, who co-founded Microsoft Research Asia.

China’s large language models are rapidly emerging as the country prepares to implement some of the world’s most stringent AI regulations. As reported by the Financial Times, China is expected to draw up regulations for generative AI with a particular focus on content, indicating stepped-up control compared with the rules introduced in April. Companies may also need to obtain a license before launching large language models, which could slow down China’s efforts to compete with the U.S. in the nascent industry.


Last Co-Author of Transformer Paper Departs Google

Llion Jones, who was one of the co-authors of Google’s Transformer paper called “Attention Is All You Need,” has now left Google, Bloomberg reported.

Jones’ departure means all eight authors of the pioneering paper have left Google.

Published in 2017, the paper has proved to be groundbreaking and has had a significant impact on various fields of natural language processing (NLP) and machine learning.

The paper introduced a novel neural network architecture based solely on attention mechanisms, which enables the model to learn dependencies between input and output words without relying on the sequential, recurrence-based processing that existing models used.

ChatGPT, the popular chatbot developed by OpenAI, is also based on the Transformer architecture.

Interestingly, all the co-authors have left Google in the subsequent years to start their own ventures. Jones, too, has hinted that he will be starting his own company.

(Source: Twitter)


Why is DuckDB Getting Popular?

What is DuckDB?

DuckDB is a free, open-source, embedded database management system designed for data analytics and online analytical processing. This means several things:

  1. It's free and open-source software, so anyone can use and modify the code.
  2. It's embedded, meaning the DBMS (database management system) runs in the same process as the application that uses it. This makes it fast and simple to use.
  3. It's optimized for data analytics and OLAP (online analytical processing), not just transactional data like typical databases. This means the data is organized by columns instead of rows to optimize aggregation and analysis.
  4. It supports standard SQL so you can run queries, aggregations, joins, and other SQL functions on the data.
  5. It runs in-process, within the application itself rather than as a separate process. This eliminates the overhead of communicating between processes.
  6. Like SQLite, it's a simple, file-based database so there's no separate server installation required. You just include the library in your application.

In summary, DuckDB provides an easy-to-use, embedded analytic database for applications that need fast and simple data analysis capabilities. It fills a niche for analytical processing where a full database server would be overkill.

Why is DuckDB Getting Popular?

There are many reasons companies are now building products on top of DuckDB. The database is designed for fast analytical queries which means it's optimized for aggregations, joins, and complex queries on large datasets — the types of queries often used in analytics and reporting. Moreover:

  1. It's simple to install, deploy, and use. There is no server to configure — DuckDB runs embedded within your application. This makes it easy to integrate into different programming languages and environments.
  2. Despite its simplicity, DuckDB has a rich feature set. It supports the full SQL standard, transactions, secondary indexes, and integrates well with popular data analysis programming languages like Python and R.
  3. DuckDB is free for anyone to use and modify, which lowers the bar for developers and data analysts to adopt it.
  4. DuckDB is well-tested and stable. It has an extensive test suite and is continuously integrated and tested on a variety of platforms to ensure stability.
  5. DuckDB offers comparable performance to specialized OLAP databases while being easier to deploy. This makes it suitable for both analytical queries on small to medium datasets as well as large enterprise datasets.

In short, DuckDB combines the simplicity and ease of use of SQLite with the analytical performance of specialized columnar databases. All of these factors — performance, simplicity, features, and open source license — contribute to DuckDB's growing popularity among developers and data analysts.

DuckDB Python Example

Let’s test out a few features of DuckDB using the Python API.

You can install DuckDB from PyPI:

pip install duckdb

For other programming languages, head to DuckDB’s installation guide.

In this example, we will be using the Data Science Salaries 2023 CSV dataset from Kaggle to test DuckDB’s various functionalities.

Relation API

You can load a CSV file into a relation, just as you would with pandas. DuckDB provides a relational API that allows users to chain query operations together. The queries are lazily evaluated, which enables DuckDB to optimize their execution.

We have loaded the data science salary dataset and displayed the alias.

import duckdb

rel = duckdb.read_csv('ds_salaries.csv')
rel.alias
'ds_salaries.csv'

To display the column names we will use .columns similar to pandas.

rel.columns
['work_year',   'experience_level',   'employment_type',   'job_title',   'salary',   'salary_currency',   'salary_in_usd',   'employee_residence',   'remote_ratio',   'company_location',   'company_size']

You can apply multiple functions to the relation to get specific results. In our case, we have filtered on “work_year”, projected only three columns, and ordered and limited the result to display the five lowest-paid job titles.

Learn more about Relational API by following the guide.

rel.filter("work_year > 2021").project(      "work_year,job_title,salary_in_usd"  ).order("salary_in_usd").limit(5)
┌───────────┬─────────────────┬───────────────┐
│ work_year │    job_title    │ salary_in_usd │
│   int64   │     varchar     │     int64     │
├───────────┼─────────────────┼───────────────┤
│      2022 │ NLP Engineer    │          5132 │
│      2022 │ Data Analyst    │          5723 │
│      2022 │ BI Data Analyst │          6270 │
│      2022 │ AI Developer    │          6304 │
│      2022 │ Data Analyst    │          6359 │
└───────────┴─────────────────┴───────────────┘

You can also use Relational API to join two datasets. In our case, we are joining the same dataset by changing the alias name on a “job_title”.

rel2 = duckdb.read_csv('ds_salaries.csv')

rel.set_alias('a').join(rel.set_alias('b'), 'job_title').limit(5)
┌───────────┬──────────────────┬─────────────────┬─────┬──────────────┬──────────────────┬──────────────┐
│ work_year │ experience_level │ employment_type │ ... │ remote_ratio │ company_location │ company_size │
│   int64   │     varchar      │     varchar     │     │    int64     │     varchar      │   varchar    │
├───────────┼──────────────────┼─────────────────┼─────┼──────────────┼──────────────────┼──────────────┤
│      2023 │ SE               │ FT              │ ... │          100 │ US               │ L            │
│      2023 │ MI               │ CT              │ ... │          100 │ US               │ S            │
│      2023 │ MI               │ CT              │ ... │          100 │ US               │ S            │
│      2023 │ SE               │ FT              │ ... │          100 │ US               │ S            │
│      2023 │ SE               │ FT              │ ... │          100 │ US               │ S            │
├───────────┴──────────────────┴─────────────────┴─────┴──────────────┴──────────────────┴──────────────┤
│ 5 rows                                                                           21 columns (6 shown) │
└────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Direct SQL method

There are direct methods too. You just have to write a SQL query to perform analysis on the dataset. Instead of a table name, you use the location and name of the CSV file.

duckdb.sql('SELECT * FROM "ds_salaries.csv" LIMIT 5')
┌───────────┬──────────────────┬─────────────────┬─────┬──────────────┬──────────────────┬──────────────┐
│ work_year │ experience_level │ employment_type │ ... │ remote_ratio │ company_location │ company_size │
│   int64   │     varchar      │     varchar     │     │    int64     │     varchar      │   varchar    │
├───────────┼──────────────────┼─────────────────┼─────┼──────────────┼──────────────────┼──────────────┤
│      2023 │ SE               │ FT              │ ... │          100 │ ES               │ L            │
│      2023 │ MI               │ CT              │ ... │          100 │ US               │ S            │
│      2023 │ MI               │ CT              │ ... │          100 │ US               │ S            │
│      2023 │ SE               │ FT              │ ... │          100 │ CA               │ M            │
│      2023 │ SE               │ FT              │ ... │          100 │ CA               │ M            │
├───────────┴──────────────────┴─────────────────┴─────┴──────────────┴──────────────────┴──────────────┤
│ 5 rows                                                                           11 columns (6 shown) │
└────────────────────────────────────────────────────────────────────────────────────────────────────────┘
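
As a small extension of the direct SQL method (not part of the original walkthrough), the same pattern handles the aggregations DuckDB is optimized for, such as the average salary by experience level computed straight from the CSV file.

duckdb.sql("""
    SELECT experience_level,
           COUNT(*)           AS n,
           AVG(salary_in_usd) AS avg_salary_usd
    FROM "ds_salaries.csv"
    GROUP BY experience_level
    ORDER BY avg_salary_usd DESC
""")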

Persistent Storage

By default, DuckDB operates on an in-memory database. This means that any tables created are stored in memory and not persisted to disk. However, by using the .connect() method, a connection can be made to a persistent database file on disk. Any data written to that database connection will then be saved to the disk file and reloaded when reconnecting to the same file.

  1. We will create a database by using .connect() method.
  2. Run an SQL query to create a table.
  3. Use Query to add two records.
  4. Display the newly created test table.
import duckdb

con = duckdb.connect('kdn.db')

con.sql("CREATE TABLE test_table (i INTEGER, j STRING)")
con.sql("INSERT INTO test_table VALUES (1, 'one'), (9, 'nine')")
con.table('test_table').show()
┌───────┬─────────┐
│   i   │    j    │
│ int32 │ varchar │
├───────┼─────────┤
│     1 │ one     │
│     9 │ nine    │
└───────┴─────────┘

We can also create the new table using a data science salary CSV file.

con.sql('CREATE TABLE ds_salaries AS SELECT * FROM "ds_salaries.csv";')
con.table('ds_salaries').limit(5).show()
┌───────────┬──────────────────┬─────────────────┬─────┬──────────────┬──────────────────┬──────────────┐
│ work_year │ experience_level │ employment_type │ ... │ remote_ratio │ company_location │ company_size │
│   int64   │     varchar      │     varchar     │     │    int64     │     varchar      │   varchar    │
├───────────┼──────────────────┼─────────────────┼─────┼──────────────┼──────────────────┼──────────────┤
│      2023 │ SE               │ FT              │ ... │          100 │ ES               │ L            │
│      2023 │ MI               │ CT              │ ... │          100 │ US               │ S            │
│      2023 │ MI               │ CT              │ ... │          100 │ US               │ S            │
│      2023 │ SE               │ FT              │ ... │          100 │ CA               │ M            │
│      2023 │ SE               │ FT              │ ... │          100 │ CA               │ M            │
├───────────┴──────────────────┴─────────────────┴─────┴──────────────┴──────────────────┴──────────────┤
│ 5 rows                                                                           11 columns (6 shown) │
└────────────────────────────────────────────────────────────────────────────────────────────────────────┘

After performing all the tasks, you must close the connection to the database.

con.close()

Conclusion

Why do I like DuckDB? It is fast and simple to learn and manage. I believe simplicity is the main reason DuckDB has become widely used in the data science community. DuckDB provides an intuitive SQL interface that is easy for data analysts and scientists to pick up. Installation is straightforward, and the database files are light and manageable. All of these make DuckDB a joy to use.

Check out my previous Deepnote article on Data Science with DuckDB for an in-depth analysis of features and use cases.

With robust tools for data loading, managing, and analysis, DuckDB offers an attractive option compared to other database solutions for data science. I believe DuckDB will continue gaining users in the coming years as more data professionals discover its user-friendly nature.
Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master's degree in Technology Management and a bachelor's degree in Telecommunication Engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.


Learn Generative AI With Google

The Artificial Intelligence (AI) ecosystem has evolved rapidly in the last five years, with Generative AI (GAI) leading this evolution. In fact, the Generative AI market is expected to reach $36 billion by 2028, compared to $3.7 billion in 2023.

Today, Generative AI is affecting many industries, such as healthcare, marketing, fashion, and entertainment, because tools like AI image generators and AI video generators have shown the potential to take over tasks once done manually by humans. However, advancing in this field requires a specialized AI skillset.

So, to make learning easier for AI enthusiasts, Google has launched 10 free courses for Generative AI. Before we discuss them, let’s see briefly what generative AI is.

What is Generative AI & Why is Learning Generative AI Important?

Generative AI is a specialized AI domain that focuses on building models that can generate new realistic content, like images, text, audio, or videos, using existing data samples.

For instance, models like ChatGPT and DALL-E are prominent examples of Generative AI as we are now observing their real-world applications. ChatGPT is integrated into Bing’s search engine, whereas the Edge browser now incorporates DALL-E.

As Generative AI evolves, staying up-to-date with this technology has become crucial for several reasons:

  • Ensures business productivity, cost-effectiveness, and increased efficiency.
  • Encourages experimentation and creativity.
  • Supports human-AI collaboration and augments human capabilities.
  • Allows innovative problem-solving strategies.

Now, let’s look at how Google is helping learners study Generative AI.

Google’s 10-Course Generative AI Learning Path

1. Introduction To Generative AI

Course difficulty: Beginner-level

Completion time: ~ 45 minutes

Prerequisites: None

What will AI enthusiasts learn?

  • What is Generative Artificial Intelligence, how it works, what its applications are, and how it differs from standard machine learning (ML) techniques.
  • Covers Google tools for creating your own Generative AI apps.
  • You’ll also learn about the two types of Generative AI models covered in this course: unimodal and multimodal. Unimodal systems take only one input type, whereas multimodal systems can take more than one input type.

2. Introduction to Large Language Models

Course difficulty: Beginner-level

Completion time: ~ 45 minutes

Prerequisites: None

What will AI enthusiasts learn?

  • This course explores LLMs (Large Language Models) – AI models trained on large amounts of textual data. “Google’s Bard AI” is an excellent example of an LLM that makes advanced human-machine interaction possible.
  • Understand how LLMs are used for sentiment analysis.
  • Learn about prompt tuning, through which the prompts given to a language model are refined to achieve the desired output.
  • Cover the tools that Google provides for the development of Gen AI.

3. Introduction to Responsible AI

Course difficulty: Beginner-level

Completion time: ~ 1 day (Complete the quiz/lab in your own time)

Prerequisites: None

What will AI enthusiasts learn?

  • What is Responsible Artificial Intelligence? Why it’s important, and how Google implements this technology in its products.
  • An introduction to the 7 Responsible AI principles of Google.

4. Generative AI Fundamentals

Course difficulty: Beginner-level

Completion time: ~ 1 day (Complete the quiz/lab in your own time)

Prerequisites: None

What will AI enthusiasts learn?

  • Contains all the content from the previous three courses.
  • Includes a final quiz through which you can show your understanding of the fundamental concepts of Generative AI.

5. Introduction To Image Generation

Course difficulty: Beginner-level

Completion time: ~ 1 day (Complete the quiz/lab in your own time)

Prerequisites: Knowledge of ML, Deep Learning (DL), Convolutional Neural Nets (CNNs), and Python programming.

What will AI enthusiasts learn?

  • In this course, you will discover diffusion models, how they work, and how to implement them.
  • Understand what unconditioned diffusion models are.
  • Improvements in text-to-image diffusion models.
  • Training and deploying these models on Vertex AI – a fully managed ML platform by Google.

6. Encoder-Decoder Architecture

Course difficulty: Intermediate-level

Completion time: ~ 1 day (Complete the quiz/lab in your own time)

Prerequisites: Knowledge of Python programming and TensorFlow.

What will AI enthusiasts learn?

  • Discover the key components of the encoder-decoder architecture.
  • Understand how to use the encoder-decoder architecture to train a model and produce text from it.
  • Includes a lab walkthrough where you will code in TensorFlow, a popular ML development platform to build production-grade models.

7. Attention Mechanism

Course difficulty: Intermediate-level

Completion time: ~ 45 minutes

Prerequisites: Knowledge of ML, DL, Natural Language Processing (NLP), Computer Vision (CV), and Python programming.

What will AI enthusiasts learn?

  • Discover the concept of the attention mechanism, a powerful approach that enables language models to concentrate on particular segments of the input sequence in order to capture contextual information (see the short sketch after this list).
  • Learn how it operates and its uses.
  • Understand how the attention mechanism is applied to ML models.
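
For readers who want a feel for the idea before starting the course, here is a minimal NumPy sketch (not course material) of scaled dot-product attention: each query is compared against every key, a softmax turns those scores into weights, and the output is the weighted mix of the values.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # query-key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                                       # weighted mix of the values

Q, K, V = np.random.randn(4, 8), np.random.randn(6, 8), np.random.randn(6, 8)
print(scaled_dot_product_attention(Q, K, V).shape)           # (4, 8)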

8. Transformer Models & BERT Models

Transformer Models & BERT Models

Image Source

Course difficulty: Beginner-level

Completion time: ~ 45 minutes

Prerequisites: Intermediate knowledge of ML, understanding of word embeddings and attention mechanism, and experience with Python and TensorFlow.

What will AI enthusiasts learn?

  • Learn about the Transformer architecture and explore how a Bidirectional Encoder Representations from Transformers (BERT) model is built on top of it.
  • Covers the different NLP tasks for which a BERT model is used.

9. Create Image Captioning Models

Course difficulty: Intermediate-level

Completion time: ~ 1 day (Complete the quiz/lab in your own time)

Prerequisites: Knowledge of ML, DL, NLP, CV, and Python programming.

What will AI enthusiasts learn?

  • How to identify the elements of an image captioning model.
  • How to build and assess a model for image captioning.
  • How to create your own captioning models for photos and use them to create captions.

10. Introduction To Generative AI Studio

Course difficulty: Introductory-level

Completion time: ~ 1 day (Complete the quiz/lab in your own time)

Prerequisites: None

What will AI enthusiasts learn?

  • Recognize the purpose of Generative AI Studio, a Vertex AI product.
  • The options and properties of Generative AI Studio are also covered in this course.
  • Contains a hands-on lab where you can utilize this tool.

After completing these ten free courses, learners can have a comprehensive understanding of Generative AI and its practical applications. Learners can utilize their newly acquired knowledge to advance the field of Generative AI, building innovative products that can positively impact our society.

“In a world where ChatGPT and other AI apps can do many things humans once needed to do themselves or needed to hire other humans to do, the question of ‘how will I add value?’ becomes more relevant than ever.” ― Hendrith Vanlon Smith Jr, CEO of Mayflower-Plymouth, in his book Business Essentials.

To keep yourself updated about AI advancements, visit unite.ai.