The Beginning of the End of SaaS Startups 

A few months ago, we predicted that generative AI startups have no moat and no money. Now it's confirmed.

When Sam Altman claimed that the rise of generative AI chatbots like ChatGPT is going to replace customer service jobs, he wasn’t wrong. Forget jobs, he is now coming after SaaS startups. OpenAI’s recent introduction of ChatGPT Enterprise might have sent shock waves among several SaaS startups that had developed products around ChatGPT or offered wrappers around ChatGPT APIs catering to business clients.

https://twitter.com/dreamingtulpa/status/1696226664491913584

OpenAI in their blog said that ChatGPT Enterprise comes with a new admin console that will let businesses manage team members easily and offers domain verification, SSO, and usage insights, allowing for large-scale deployment into enterprise.

This development overlaps with the services of many current SaaS startups that offer B2B products, and it could jeopardize their survival now that OpenAI has stepped in to offer business solutions centered on ChatGPT. OpenAI also plans to launch more ChatGPT Enterprise tools for specific roles, such as data analysts, marketers, and customer support.

Not only this, ChatGPT Enterprise brings enterprise-level security and privacy, unlimited high-speed access to GPT-4, extended context windows for handling longer inputs, advanced data analysis functionalities, customization choices, and a host of other features, which makes it far better than ChatGPT Plus.

Furthermore, ChatGPT Enterprise removes all usage caps and performs up to two times faster. It also includes a 32k context window, allowing users to process inputs or files four times longer.

As the cherry on the cake, ChatGPT Enterprise also provides unlimited access to advanced data analysis, previously known as Code Interpreter. Many companies were initially interested in using Code Interpreter in ChatGPT Plus, but they held back due to worries about data security. If you explore OpenAI’s discussion forum, you’ll come across numerous requests for a Code Interpreter API.

Sneak Peek into ChatGPT-based Startups

Bito, a B2B startup based in Menlo Park, California, boasting more than 100,000 users and self-proclaimed as the “Swiss Army knife of capabilities” for software developers, has introduced an AI coding assistant fueled by ChatGPT. Additionally, it has received $3.2 million in fresh funding.

Similarly, Yuma, a platform designed for Shopify merchants, employs AI systems resembling ChatGPT to improve customer support. By seamlessly integrating with help desk software, Yuma offers personalized and pertinent responses to customer inquiries.

Baselit is another startup leveraging ChatGPT technology to enable businesses to access chatbot-style analytics. Utilizing OpenAI’s GPT-3 text comprehension model, Baselit enables users to execute database queries using simple English, eliminating the need for coding expertise.

What lies ahead for them?

The road ahead looks challenging for these startups. Convincing venture capitalists to invest was already a struggle, and as we move into the age of generative AI, having a moat becomes even more essential. Whatever moat these companies were offering to VCs is already covered by ChatGPT Enterprise.

Whether it’s data analytics, customer support, marketing, or any other field, the message has been consistently clear: building your startup around ChatGPT carries risks due to the lack of true ownership as OpenAI holds the intellectual property (IP) rights to GPT-3 and all its other models. Today, these fears have come true.

Every new OpenAI feature: "there goes a thousand startups". Let's be real, it's unlikely to be that simple lol

— anton (@abacaj) August 29, 2023

So all the startups built amid the hype around ChatGPT, whether they put GPT in their name or built their technology on the APIs provided by OpenAI, might soon need to pack their bags.

Startups will Lose, But Will Biggies Join Back?

This development is big enough for smaller startups to accept defeat in front of ChatGPT Enterprise. However, it remains to be seen whether large companies like Apple, Spotify, Wells Fargo, Samsung, JP Morgan, and Verizon, which had earlier ditched ChatGPT over the risk of data leaks, will come back and use it.

Companies like Apple had valid concerns about their employees inadvertently sharing sensitive project details through the system, where they might be viewed by OpenAI moderators. Apple went ahead and created its own chatbot, named AppleGPT, for internal use.

This time, OpenAI seems to be well prepared and has explicitly said it does not train on business data or conversations. Furthermore, it added that ChatGPT Enterprise is SOC 2 compliant and that all conversations are encrypted in transit and at rest.

OpenAI mentioned in its blog post that industry leaders like Block, Canva, Carlyle, The Estée Lauder Companies, PwC, and Zapier are among the early users of ChatGPT Enterprise.

The post The Beginning of the End of SaaS Startups appeared first on Analytics India Magazine.

5 Skills All Marketing Analytics and Data Science Pros Need Today

Sponsored Content

By Ann Gynn

Speed up and slow down.

Every marketing analytics and data science professional encounters this seemingly incongruous challenge.

You must adapt to rapid changes, including the growing impact of machine learning and artificial intelligence. But you also have to pull it together in a meaningful and legally compliant way.

That’s the overarching theme several marketing analytics trailblazers and data innovators speaking at the Marketing Analytics & Data Science (MADS) conference identified. Fortunately, they’ve also shared some ideas on overcoming these challenges. (Get even more ideas, inspiration, and advice during the conference, September 26-28, in Washington, D.C.)

Data sources, rules, and relevance change quickly

“The toughest challenge is how quickly the skill sets and the industry are changing,” says Katie Robbert, CEO of Trust Insights.

Traditional digital marketing avenues, such as organic search and content marketing, are evolving, particularly with the growing impact of artificial intelligence improvements. Social media platforms on which marketers once relied can no longer be counted on to build and influence audiences.

Robbert says this new world of digital marketing changes how and where you reach people. “It’s also going to be increasingly more difficult for marketing analytics practitioners to keep up with where they get their data,” she says.

Twenty years ago, having more data meant you were smarter, says Avinash Kaushik, chief strategy officer at Croud and member of the original Google Analytics launch team. “Now we have more data than God wants anybody to have. So being smart is all about being able to figure out what data to ignore so that you’re able to focus your attention.”

Guan Wang, senior director of marketing intelligence at Snowflake, agrees. He says analysts and scientists need “to bring the data together into one platform to make the AI and machine-learning workload happen.”

And that’s no easy task.

“Marketing intelligence [teams] work so hard because we’re dealing with [over 11,000] applications or solutions. This whole ecosystem is highly fragmented,” Wang says.

But unifying data and deriving intelligence take time

Fragmented data sources and technologies make it difficult for marketing analytics practitioners to connect the dots in a strategic and actionable way, says Zontee Hou, director of strategy at Convince & Convert.

“More and more organizations recognize the opportunities and need for more unified data,” she says.

But until they slow down and invest the time in unifying data, they’ll be stuck reporting metrics, not insights.

Michael Bagalman, vice president of business intelligence and data science at Starz, sees a related challenge.

“Practitioners must also grapple with efficiently integrating and analyzing vast datasets to glean actionable insights while addressing both the regulatory and ethical implications of data usage,” he says.

Doing so requires navigating intricate legal privacy frameworks like GDPR and CCPA and ensuring that any machine-learning and AI-interpreted algorithms lead to fair and unbiased decision-making, Bagalman explains.

It all adds up to a challenging work environment. To rise to the occasion, the experts recommend the following:

1. Get better at the non-tech side

While those challenges seem to center around technology, the way to address them starts with something else.

Marketing analysts should hone their skills to know who to talk to – and how to talk to them – to secure the information they have. Trust Insights’ Katie Robbert says it requires listening and asking questions to understand what they know that you need to take back to your team, audience, and stakeholders.

“You can teach anyone technical skills. People can follow the standard operating procedure,” she says. “The skill set that is so hard to teach is communication and listening.”

2. Improve your storytelling skills

By improving your communication skills, you’ll be well-positioned to follow Hou’s advice: “Weave a clear story in terms of how marketing data could and should guide the organization’s marketing team.”

She says you should tell a narrative that connects the dots, explains the how and where of a return on investment, and details the actions that are possible but not yet realized due to limited lines of sight.

“Teams need to come together cross-functionally and have buy-in from executives to truly solve this problem,” Zontee Hou of Convince & Convert says.

3. Sharpen your focus on business goals

Securing organization-wide support requires leaning into what the data can do for the business.

“Businesspeople want to see the business outcomes. Always remember to align business objectives with your key stakeholders,” Snowflake’s Guan Wang says, noting you should revisit that alignment regularly to ensure it’s still appropriate.

“Make sure they’re comfortable using the model and then constantly iterate. Machine learning is not just one report. You deliver many, many models,” he says.

4. Learn to balance business, legal, and ethical impacts

Aligning with business purposes also requires addressing the legal requirements around data. “(It’s) an intricate balance between data-driven marketing and maintaining individual privacy rights,” Starz’s Michael Bagalman says. “Striking this balance requires a deep understanding of legal frameworks, technical capabilities, and ethical considerations. Regulations like GDPR and CCPA have global implications, each with unique nuances that necessitate careful interpretation and implementation.”

You should set up a compliance system to address those laws as you introduce new marketing tools and data collection methods. “Ensuring data accuracy, transparency, and security demands robust technical infrastructure and ongoing monitoring,” he says.

“The complexity of these challenges requires collaboration between legal experts, data scientists, marketers, and ethicists to develop holistic solutions that respect both user rights and marketing effectiveness,” he adds.

What does all that require from an analytics practitioner?

Bagalman shares the lengthy list: legal/regulatory acumen, technical proficiency, understanding of ethical considerations, communication skills (particularly with non-technical stakeholders), collaboration, data governance, diversity and inclusion awareness, continuous learning, problem-solving, risk management, strategic thinking, adaptability, and empathy – truly understanding the consumer perspective on the ethics of data and privacy.

5. Model the impact

Are you ready to act now? Avinash Kaushik created a model that may help content-focused marketing analytics pros – the Impact Matrix. It enables you to answer these questions:

  • How sophisticated is the team’s analytics practice?
  • What’s the best way to get leaders/analysts away from low-value metrics?
  • How can you create a clear path to analytics glory?
  • How do you bring the role of machine learning and automation to the forefront?
  • What should be on the CMO’s dashboard vs. the director’s?

The matrix’s x-axis details how long a piece of content takes to become valuable – real-time, weekly, monthly, quarterly, or six-monthly. The y-axis runs from super tactical to super strategic. Kaushik walks through how to create it in more detail in this article.

He says, “The Impact Matrix will help you have that conversation based on a framework and then create a plan that says, ‘We’re here today. How do we get there?’”

Learn to conquer marketing analytics and data challenges

Are you and your marketing analytics team ready to go fast as technology and digital marketing evolve rapidly but take the time to get it all working the right way for your business? While these experts highlight potential solutions quickly here, they’ll slow down at the MADS conference with in-depth explanations and answer your questions in person.

Join us at the MADS conference in Washington, D.C., from Sept. 26 to 28, 2023. Learn more here and register with code KDN100 for $100 off your conference pass.

Data migration redefined: Leveraging AI trends for smooth workspace transitions

In the dynamic landscape of modern business, the art of seamless data migration has evolved into a strategic imperative. As you navigate the intricacies of workspace transformations, you’re met with a complex interplay of technological advancements and operational demands.

Enter the era of leveraging Artificial Intelligence (AI) to redefine data migration – an approach that promises to reshape the way we transition workspaces, ensuring fluidity and continuity.

Embracing change: The evolving landscape of data migration

In a world where technological progress unfolds at an unprecedented pace, traditional data migration methods find themselves inadequate. The rise of cloud-based workspaces has revolutionized how businesses operate, but it has also introduced fresh challenges in migrating data across these dynamic environments.

Here’s where AI emerges as your strategic ally, offering a paradigm shift from conventional methods to an intelligent approach capable of addressing the nuances of contemporary workspace transitions.

The power of AI: Pre-migration analysis and data mapping

Whether you’re doing something like migrating SharePoint or switching off-premise servers, embarking on a migration journey armed with profound insights into your data landscape is fundamentally important.

AI’s pre-migration analysis capabilities empower you to automate the discovery and classification of your data.

Through predictive analytics, AI models assess the feasibility of migration, shedding light on potential bottlenecks and roadblocks. This foreknowledge equips you with a comprehensive strategy, enabling you to mitigate risks and optimize your migration plan. By integrating AI into this phase, you transition with confidence, guided by data-driven intelligence.

Transformation made effortless: AI-driven data mapping

Traditionally, data transformation has been a meticulous manual endeavor, susceptible to human errors and resource-intensive processes.

AI’s infusion into data mapping brings automation to the forefront, utilizing machine learning to comprehend the semantics of your data.

This results in the automated adaptation of schema and structure during migration. The beauty lies in the preservation of data integrity and consistency, ensuring a seamless choreography of your data as it moves from one workspace to another. With AI orchestrating these transformations, disruptions are minimized, and efficiency reigns supreme.
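As a rough illustration of automated mapping, the sketch below proposes a source-to-target column mapping. It is only a stand-in: it uses plain string similarity from Python's standard library rather than a trained model, and the function name `propose_column_mapping`, the column names, and the 0.6 cutoff are all assumptions made for the example.

```python
from difflib import get_close_matches


def propose_column_mapping(source_columns, target_columns, cutoff=0.6):
    """Propose a source -> target column mapping by name similarity.

    String similarity is a simple stand-in for the semantic matching an
    ML-based tool would perform; unmatched columns map to None so a person
    can review them.
    """
    def normalize(name):
        # Treat "Customer_ID", "customer-id" and "customer id" alike.
        return name.lower().replace("_", " ").replace("-", " ").strip()

    normalized_targets = {normalize(t): t for t in target_columns}
    mapping = {}
    for column in source_columns:
        matches = get_close_matches(
            normalize(column), list(normalized_targets), n=1, cutoff=cutoff
        )
        mapping[column] = normalized_targets[matches[0]] if matches else None
    return mapping


# Hypothetical schemas for the old and new workspace.
mapping = propose_column_mapping(
    ["Customer_ID", "e-mail", "signup_date", "legacy_flag"],
    ["customer id", "email", "signup date"],
)
for source, target in mapping.items():
    print(f"{source} -> {target}")
```

Columns with no confident match are left as `None` rather than guessed, which keeps a person in the loop for the ambiguous cases.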

Real-time replication: Minimizing downtime with AI

The specter of downtime looms large during migration, potentially disrupting operations and impacting your bottom line. In addition to good planning, AI solutions are a vitally important part of controlling for and minimizing downtime.

AI offers a transformative solution by enabling real-time data replication across the transition. Visualize your data gracefully mirrored between old and new workspaces, eliminating the need for extended downtime.

AI’s capacity to handle substantial data volumes ensures that your business remains operational even during this pivotal phase. The result? A migration that unfolds seamlessly, akin to a changing of the guard without missing a beat.

Ensuring data quality: AI-powered validation and assurance

Data quality concerns are not confined to the migration phase; they can cast a long shadow on your post-migration operations.

This is where AI’s meticulous prowess in automated data validation comes to the fore. Through advanced algorithms, AI scrutinizes your data for anomalies, inconsistencies, and errors, significantly reducing the risk of compromised data quality after migration.

By proactively addressing these issues, you save invaluable time and resources that would otherwise be spent on post-migration troubleshooting and correction.
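A minimal sketch of what post-migration validation can look like is shown below. It is a rule-based check, not an AI model: it fingerprints each record and reports rows that are missing, unexpected, or changed after the move. The record layout, the `id` key, and the function names are assumptions made for the example.

```python
import hashlib


def row_fingerprint(row):
    """Hash a record's fields in a stable key order."""
    canonical = "|".join(f"{key}={row[key]}" for key in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def validate_migration(source_rows, target_rows, key="id"):
    """Compare source and target records by primary key and content hash."""
    source = {row[key]: row_fingerprint(row) for row in source_rows}
    target = {row[key]: row_fingerprint(row) for row in target_rows}
    shared = source.keys() & target.keys()
    return {
        "missing": sorted(source.keys() - target.keys()),      # lost in transit
        "unexpected": sorted(target.keys() - source.keys()),   # not in the source
        "changed": sorted(k for k in shared if source[k] != target[k]),
    }


# Hypothetical records from the old and new workspace.
old_rows = [
    {"id": 1, "name": "Ada"},
    {"id": 2, "name": "Grace"},
    {"id": 3, "name": "Linus"},
]
new_rows = [
    {"id": 1, "name": "Ada"},
    {"id": 2, "name": "Grcae"},
    {"id": 4, "name": "Ken"},
]
report = validate_migration(old_rows, new_rows)
print(report)
```

A report like this gives a team a concrete list of records to inspect before the old workspace is switched off.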

Security and compliance: AI as your sentinel

The migration process is a vulnerable juncture where data security breaches can have far-reaching consequences. Enter AI as your vigilant sentinel. Its capabilities extend beyond migration logistics to encompass security measures.

By actively identifying potential vulnerabilities and risks, AI fortifies your data’s security armor. Furthermore, AI streamlines compliance with data protection regulations, ensuring that your migration adheres to legal standards, safeguarding your reputation and mitigating the risk of legal repercussions.

Your custom AI: Tailoring migration to your needs

AI isn’t a one-size-fits-all solution; it’s a versatile tool that adapts to your specific business demands. By training AI models with your historical migration data, you infuse a layer of customization into the migration process.

The result is an AI-driven solution that aligns precisely with your unique requirements, bolstering accuracy and efficiency. This tailored approach translates to a migration journey that is not only efficient but also seamlessly harmonized with your business objectives.

Navigating challenges: A partner, not a panacea

AI undoubtedly revolutionizes migration, but it’s essential to recognize its capabilities within a balanced framework. While AI possesses remarkable intelligence, human expertise remains indispensable, particularly in handling unexpected challenges.

Consider AI as a strategic partner, not a panacea, working in tandem with human intervention to address complex scenarios and unanticipated roadblocks. This collaborative synergy ensures a migration journey that is both efficient and adaptable.

Success stories: AI in action

The transformative impact of AI in data migration is not theoretical; it’s tangible and proven through real-world success stories.

Diverse industries have harnessed the potential of AI to execute data migrations with unprecedented precision and efficiency.

These narratives underscore AI’s capacity to reshape migration strategies, offering practical insights and actionable practices that serve as guideposts for your migration journey.

Tomorrow’s landscape: The future of AI in data migration

Your migration journey is not confined to the present; it extends into the horizon of tomorrow’s innovations. As AI continues to evolve, its role in data migration will also transform.

The convergence of AI with emerging technologies like the Internet of Things (IoT) and edge computing paints a portrait of even more sophisticated migrations.

By embracing AI’s promise, you position yourself at the forefront of the ever-evolving business landscape, ready to harness the potential of the next wave of technological advancements.

Embrace the AI advantage

In closing, the redefinition of data migration through AI marks a monumental stride toward achieving seamless workspace transitions.

By weaving AI’s capabilities with human insights, you unearth a formula for success that transcends the confines of traditional approaches. Embrace AI trends as your guiding beacon through the labyrinthine path of migration.

As you do, you’ll find that you’re not just migrating data; you’re sculpting a future of heightened efficiency, resilience, and opportunity. Step into the realm of AI-empowered workspace transitions with confidence; your journey of transformation commences now.

SatSure raises $15 million as Chandrayaan 3’s Success Sparks Investor Interest

SatSure, a satellite Earth observation data and analytics company has secured a $15 million investment in a Series A funding round. This significant financial infusion was spearheaded by Baring Private Equity Partners (BPEP) and Promus Ventures.

The investment will boost the company’s plans to launch a fleet of four high-resolution optical and multispectral satellites by the fourth quarter of 2025.

The funding round also saw the continued involvement of existing supporters including Force Ventures, Luckbox Ventures, and IndigoEdge Advisors. Furthermore, Omidyar Network India and xto10X joined the lineup of investors, affirming the widespread interest and confidence in the approach to using satellite imagery and AI for delivering decision intelligence across industries.

“We are committed to expanding our outreach, invest in Low-earth orbit satellite assets, and continue developing innovative products that signify the rise of the India private space sector and its deep-rooted alignment to our national space program,” said CEO and Founder Prateep Basu, who is also a former ISRO scientist.

The recent investment not only validates SatSure’s business model but also serves as a stepping stone for its future expansion plans.

A significant portion of the funding will be allocated towards accelerating product innovation. Additionally, the expansion of operations across the Americas and Asia-Pacific regions is a testament to SatSure’s commitment to making its solutions more accessible on a global scale.

Diverse Use Cases

Since its inception in 2017, SatSure has been at the forefront of leveraging advanced satellite imagery and AI technologies to provide valuable insights for diverse sectors such as agriculture, banking, and critical infrastructure. Their technology has enabled improvements in operational efficiency, policy decision-making, and overall profitability for their clientele by synergizing the power of satellite data with cutting-edge AI algorithms.

The company has aided planning and evacuation efforts during disasters like the Kerala floods. They also emphasise a proactive approach, and take initiative to assist even without direct requests.

“We take responsibility in doing it ourselves, so even if the requests are not coming directly, we typically try to either go to the state governments or do it ourselves,” said Divya Sharma, Vice President of ML and Data Analytics at the company.

For instance, in Haryana, they assessed flood impact on paddy fields, leading to export bans on non-basmati rice due to losses. She explained that their use of satellite data enables them to anticipate such trends in advance, preventing reactive responses. They also collaborated with the state of Kerala, leveraging their tools and capabilities to aid evacuation planning and provide timely insights during emergencies.

The investment comes within weeks of the successful landing of Chandrayaan 3 by the nation’s space research organisation ISRO, underlining the increasing interest in the Indian space sector.

The post SatSure raises $15 million as Chandrayaan 3’s Success Sparks Investor Interest appeared first on Analytics India Magazine.

The power of digital solutions: How mental health apps are transforming patient care

There seems to be an app for everything, and mental health is no exception. According to a report, the global mental health apps market size was valued at $5.2 billion in 2022 and is predicted to reach $26.36 billion by 2032, at a CAGR of 17.7% during the forecast period.
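The growth figures can be sanity-checked with the standard compound annual growth rate formula. The sketch below assumes ten compounding years between 2022 and 2032; the function name is made up for the example.

```python
def implied_cagr(start_value, end_value, years):
    """Compound annual growth rate implied by a start value, end value and period."""
    return (end_value / start_value) ** (1 / years) - 1


# Figures from the report, in billions of dollars; 10 years is an assumption.
rate = implied_cagr(5.2, 26.36, 10)
projected = 5.2 * (1 + 0.177) ** 10

print(f"Implied CAGR: {rate:.1%}")
print(f"2032 value at 17.7% CAGR: ${projected:.2f} billion")
```

Under that assumption the implied rate comes out at roughly 17.6%, and compounding $5.2 billion at 17.7% gives about $26.5 billion, so the quoted numbers agree to within rounding.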

Mental health apps have emerged as a great tool for enhancing patient care. These apps are designed to address a range of mental health concerns and offer support and convenience to individuals seeking assistance. As many as 20,000 mental health apps exist today.

In this article, we will cover the benefits of the best mental health apps and how they are transforming the health industry and enhancing patient care. So, without further ado, let’s get started!

How are mental health apps enhancing patient care?

In this section, we have listed how mobile apps are transforming the healthcare industry and patient care and enhancing overall mental well-being.

1. Increased convenience and accessibility

Mental health apps have undoubtedly improved access to care by overcoming traditional barriers. They can be easily accessed and downloaded on smartphones or tablets, allowing users to seek support at their convenience. These apps provide immediate access to self-help tools, therapeutic resources, and coping strategies so users can manage their mental health on their own terms.

With mental health services facing long queues and waiting times, these apps bridge the gap, enabling users to receive support regardless of time and location constraints.

2. Remote therapy and support

Teletherapy has gained significant momentum, and mental health apps are at the forefront of this transformation. These mental health apps offer secure communication channels to remotely connect with licensed therapists, counselors, or peer group support. By eliminating geographical barriers, users get access to mental health professionals that may not be locally available.

These virtual therapy sessions offer flexibility, convenience, and privacy, enabling individuals to seek support without visiting in person. The support and guidance offered through these apps promote consistent care and help people cope with mental health challenges effectively.

3. Monitoring and early intervention

Mental health apps are designed in a way that helps monitor and track mood, behavior, sleeping patterns, and overall well-being. This self-monitoring can provide valuable insights to both clinicians and patients. By identifying patterns and triggers, individuals can better understand their mental health and make informed decisions regarding their care. Clinicians can also use data to tailor treatment plans, give prescriptions, and provide more personalized care.

4. Peer support and community

Mental health apps often include features that offer peer support and community engagement. The communities help users to interact and connect, share similar experiences, and support one another. This sense of community can reduce the sense of isolation, fear, and anxiety.

Connecting with people with first-hand experience can provide a unique level of understanding and empathy that traditional therapy might not offer. Unlike traditional support groups or therapy sessions with specific time slots, most of the mental health apps offer round-the-clock accessibility. This real-time support can be particularly valuable during a moment of crisis for individuals.

5. Timely feedback and SMS notifications

Mental health apps offer timely feedback to patients via push notifications and SMS updates. Users can stay updated and informed about their ongoing treatments and appointments with the help of notifications. And when patients get complete information, they remain satisfied, and this, in turn, boosts patient engagement.

6. Empowering self-management

Mental health apps empower individuals to take care of their well-being. By offering self-assessment tools and educational resources, these apps help users better understand their mental health conditions and develop strategies for self-management. They come with features for tracking symptoms, monitoring progress, and implementing evidence-based interventions. These apps foster a sense of empowerment and autonomy to promote positive long-term outcomes.

7. Integration with wearable technology

Mental health apps can be easily integrated with wearable devices to further enhance patient care. Wearable devices such as smartwatches and fitness trackers can help you monitor your sleep patterns, heart rate, or stress levels. When combined with mental health apps, the data offered by wearable devices provides a holistic view of an individual’s mental and overall well-being. This will, in turn, help users to take steps to understand their health better and take action.

8. Increased user engagement

According to a study by Harvard University, users are more motivated to use mental health apps to stay fit and stress-free. The assumed reason is that, by using a mental health app, users feel in control of their activities and responsible for their own improvement. Another reason is the power of notifications that remind users to complete their goals. Also, technologically savvy apps appeal to users more than traditional therapy does.

Conclusion

Mental health apps, powered by artificial intelligence, are transforming patient care by empowering individuals to actively manage their well-being and offering personalized support. These AI-powered apps provide convenience and guidance at users’ fingertips. With remote therapy, self-monitoring tools, and continuous tracking, mental health apps are revolutionizing mental health delivery. They play a vital role in enhancing well-being and addressing gaps in mental health services. As AI technology advances, these apps have the potential to make mental health care even more accessible, effective, and patient-centered, ushering in a new era of mental health support and treatment.

Why DeepMind’s AI visualization is utterly useless

Striking, but what does it mean? The DeepMind images, such as this one, developed by Tim West, are striking, but do nothing to explain what's actually happening in artificial intelligence programs. The image apparently represents "the benefits and flaws of large language models," such as ChatGPT, but how so?

"Excellence in statistical graphics consists of complex ideas communicated with clarity, precision, and efficiency." — Edward R Tufte, The Visual Display of Quantitative Information.

Usually, visualization is something meant to help one understand something that cannot be seen. The DeepMind unit of Google has recently published visualizations of artificial intelligence, created by various visual artists. The intention may be a good one, but the results are a disaster.

"Visualising AI commissions artists from around the world to create more diverse and accessible representations of AI, inspired by conversations with scientists, engineers, and ethicists at Google DeepMind," says the company. It contrasts those "diverse and accessible" images to the typical images of AI that include glowing brains or robots and the like.

It is true that the typical stock photo images for AI, such as the glowing letters, "A" and "I," do not help anyone understand the rather mysterious art and science of machine learning forms of AI, the dominant form of artificial intelligence.

The famous visualization expert Edward R. Tufte, whose book, The Visual Display of Quantitative Information, was a landmark in understanding visualization, wrote that successful visual displays should, among other things, "induce the viewer to think about the substance rather than about methodology, graphic design, the technology of graphic production, or something else."

The DeepMind pictures are mostly only about things such as graphic design. They are an overload of graphic design, in fact.

One image, by Novoto Studio, shows what appear to be tic-tac candies approaching some kind of computer interface. There's nothing in deep learning — or any other form of AI — that includes tic-tacs.

Tic-tac, anyone? The DeepMind images, such as this one, developed by Novoto Studio, are striking, but do nothing to explain what's actually happening in artificial intelligence programs.

The text accompanying the tic-tacs is equally cryptic. "An electronic device with a lot of small objects on it," it reads. "An artist's illustration of artificial intelligence (AI). This image depicts the potential of AI for society through 3D visualisations." Whatever that means, it probably doesn't have much to do with tic-tacs.

A companion video of the tic-tacs is equally inscrutable, if somewhat mesmerizing. It could be titled "March of the tic-tacs," but that might not help anyone understand AI.

Another image, by Wes Cockx, is described as a "metal structure made of wood and metal" and aims to depict "the prediction method used in large language models."

It is a fascinating imaginary structure, but it's not clear what it's doing in predicting. Nor is the companion video, showing the wood-and-metal structure in action, much help. It shows something that looks like an apparatus, perhaps a giant abacus of some kind, but what is that thing doing?

Some of the images are so fanciful they seem to bear no relation to anything at all. One image, by XK Studio, depicting what looks like a cube of some sort of gelatinous stuff, which seems to be shedding other kinds of cell-like gelatinous stuff, is, again, rather captivating, but has nothing to do with AI or anything else. Forced to guess, one might think it's a rendering of a process of gelatin formation.

The video of the gelatinous thing shows lots of stuff forming, which in turn forms other stuff. Again, who knows what stuff is being formed and why?

The companion text explains that the image and video "explores how humans can creatively collaborate with artificial general intelligence (AGI) in the future and how it can offer new points of view, speed up processes, and lead to new territories." Besides not explaining what AGI is, or might be, the text is so vague as to be useless. This is an instance where a picture, and even a thousand words, might not help anyone.

The one image that comes closest to the mark is another by Novoto Studio, which shows what seems to be a branching configuration. The text describes it as, "inspired neural networks used in deep learning."

It's closest to the mark because artificial neural networks can, in fact, be thought of in some senses as branching networks that involve lots of elements in collective activity.

In fact, it's odd that the illustrations are all so beside the point, because there is a rich tradition in AI of illustration. The original neural net research work, by Frank Rosenblatt of the Cornell University Aeronautical Laboratory, "The Perceptron," kicked off 60 years of trying to build artificial neural nets. Rosenblatt depicted in his illustration a network made up of artificial neurons. It is beautiful in its simplicity:

It's easy to grasp in a moment a little bit about what's going on because networks of connections run through our lives. Subway station maps show networks of connections. The social graph of Facebook is a collection of connected entities. The graph of connections of anything is powerful — much more powerful than the strange tic-tac renderings of Novoto Studio and the rest.

One can even turn Rosenblatt's original technical diagram into fanciful images. Such images might not be specific, but they can capture some of the sense of a system that has input and output and produces connections between them:

A neural network transforms input, the circles on the left, to output, on the right. How that happens is a transformation of weights (center), which we often confuse for patterns in the data itself.

The fundamental problem with the DeepMind images is that the artists seem to understand very little of AI, and therefore, their mission is mainly to give their own uninformed, impressionistic rendering of what they imagine AI to be. That's not particularly helpful if one would like the public to glean something about what's actually going on with AI.

That's too bad because there are plenty of people working in the field of machine learning who have a solid grasp of the technology and also produce visualizations. The People+AI Research group at Google, for example, has produced some nice visualizations of various aspects of the technology.

An illustration by the People+AI team at Google of the trade-off in machine learning between accuracy and privacy.

A former member of the group, Harvard University professor Martin Wattenberg, is a genuine scholar of visualizing hard ideas. He is famous for, among other things, SmartMoney's Map of the Market developed for the website of the consumer finance publication, which merged into MarketWatch in 2013.

There are people out there who understand AI and can conceivably communicate some of it. There are also people who excel in visual storytelling and explanation. DeepMind seems to have passed them over in favor of design studios that don't know much about either.

Google launches BigQuery Studio, a new way to work with data

Kyle Wiggers

Companies increasingly see the value in mining their data for deeper insights. According to a NewVantage survey, 97.6% of major worldwide organizations are focusing investments into big data and AI.

But challenges stand in the way of executing big data analytics. One recent poll found that 65% of organizations feel they have “too much” data to analyze.

Google’s proposed solution is BigQuery Studio, a new service within BigQuery, its fully managed serverless data warehouse, that provides a single experience to edit programming languages including SQL, Python and Spark to run analytics and machine learning workloads at “petabyte scale.”

BigQuery Studio is available in preview as of this week.

“BigQuery Studio is a new experience that really puts people who are working on data on the one side and people working on AI on the other side in a common environment,” Gerrit Kazmaier, VP and GM of data and analytics at Google, told TechCrunch in a phone interview. “It basically provides access to all of the services that those people need to work — there’s an element of simplification on the user experience side.”

BigQuery Studio is designed to enable users to discover, explore, analyze and predict data. Users can start in a programming notebook to validate and prep data, then open that notebook in other services, including Vertex AI, Google’s managed machine learning platform, to continue their work with more specialized AI infrastructure and tooling.

With BigQuery Studio, teams can directly access data wherever they’re working, Kazmaier says. And they have added controls for “enterprise-level” governance, regulation and compliance.

“[BigQuery Studio shows] how data is being generated to how it’s being processed and how it’s being used in AI models, which sounds technical, but it’s really important,” he added. “You can push down code for machine learning models directly into BigQuery as infrastructure, and that means that you can evaluate it at scale.”

BigQuery Studio can be seen as a natural progression of Google’s overarching strategy to move organizations adopting AI to the cloud. With worldwide spending on public cloud services set to grow about 21% to about $592 billion this year, according to one estimate, the tech giant is clearly intent on capturing as large a slice of the expenditure as possible — as are its rivals.

It’s not an ill-informed game plan. Gartner predicts that through 2023, AI will be one of the top workloads that drive IT infrastructure decisions. And tech market research firm Tractica forecasts that AI will account for as much as 50% of total public cloud services revenue by 2025.

“Generative AI really has the potential to unlock all of these hidden insights,” Kazmaier said. “What we tend to see is that AI really makes sense when you can combine it with [a company’s] data. AI is a method if you will — a way of working with the data … to drive the most value.”

Abnormal Security: Microsoft Tops List of Most-Impersonated Brands in Phishing Exploits

Mobile phone showing Abnormal's logo with PC monitor on background.
Image: Timon/Adobe Stock

A significant portion of social engineering attacks, such as phishing, involve cloaking a metaphorical wolf in sheep’s clothing. According to a new study by Abnormal Security, which looked at brand impersonation and credential phishing trends in the first half of 2023, Microsoft was the brand most abused as camouflage in phishing exploits.

Of the 350 brands spoofed in phishing attempts that were blocked by Abnormal, Microsoft’s name was used in 4.31% — approximately 650,000 — of them. According to the report, attackers favor Microsoft because of the potential to move laterally through an organization’s Microsoft environments.

Abnormal’s threat unit also tracked how generative AI is increasingly being used to build social engineering attacks. The study examines how AI tools make it far easier and faster for attackers to craft convincing phishing emails, spoof websites and write malicious code.

Top 10 brands impersonated in phishing attacks

If 4.31% seems like a small figure, Abnormal Security CISO Mike Britton pointed out that it is still four times the impersonation volume of the second most-spoofed brand, PayPal, which was impersonated in 1.05% of the attacks Abnormal tracked. The 10 most-impersonated brands in the first half of 2023, with Microsoft and PayPal at the head of a long tail, were:

  1. Microsoft: 4.31%
  2. PayPal: 1.05%
  3. Facebook: 0.68%
  4. DocuSign: 0.48%
  5. Intuit: 0.39%
  6. DHL: 0.34%
  7. McAfee: 0.32%
  8. Google: 0.30%
  9. Amazon: 0.27%
  10. Oracle: 0.21%
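
The percentages above can be turned into a rough sense of scale. The following is a back-of-the-envelope sketch, assuming that the 4.31% share and the figure of approximately 650,000 refer to the same set of blocked attacks; the article does not state the total directly, so the implied total is an estimate rather than a reported number.

```python
# Share of blocked attacks per impersonated brand, copied from the list above.
shares_pct = {
    "Microsoft": 4.31, "PayPal": 1.05, "Facebook": 0.68, "DocuSign": 0.48,
    "Intuit": 0.39, "DHL": 0.34, "McAfee": 0.32, "Google": 0.30,
    "Amazon": 0.27, "Oracle": 0.21,
}

# Approximate number of attacks that used Microsoft's name (from the article).
microsoft_attacks = 650_000

# Assumption: 650,000 is 4.31% of all blocked attacks in the study.
implied_total = microsoft_attacks / (shares_pct["Microsoft"] / 100)

# How far ahead Microsoft is of the runner-up.
ratio = shares_pct["Microsoft"] / shares_pct["PayPal"]

# Combined share of the ten brands listed.
top10_share = sum(shares_pct.values())

print(f"Implied total blocked attacks: about {implied_total / 1e6:.1f} million")
print(f"Microsoft vs. PayPal: {ratio:.1f}x")
print(f"Top 10 combined share: {top10_share:.2f}%")
```

Under that assumption the total comes to roughly 15 million blocked attacks, and the ten brands listed together account for less than 9% of them, which is why the tail of impersonated brands is so long.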

Best Buy, American Express, Netflix, Adobe and Walmart are some of the other impersonated brands among the list of 350 companies used in credential phishing and other social engineering attacks Abnormal flagged over the past year.

Attackers increasingly rely on generative AI

One aspect of brand impersonation is the ability to mimic the brand tone, language and imagery, something that Abnormal’s report shows phishing actors are doing more of thanks to easy access to generative AI tools. Generative AI chatbots allow threat actors to create not only effective emails but also picture-perfect faux-branded websites replete with brand-consistent images, logos and copy in order to lure victims into entering their network credentials.

For example, Britton, who authored the report, wrote that Abnormal discovered an attack using generative AI to impersonate the logistics company DHL. To steal the target’s credit card information, the sham email asked the victim to click a link to pay a delivery fee for “unpaid customs duties” (Figure A).

Figure A

Sample email of a spoofing phishing attack, with text highlighted in different shades.
In a phishing attack spoofing DHL, Abnormal identified the words in green as most likely generated by AI. Image: Abnormal Security.

How Abnormal is dusting for generative AI fingerprints in phishing emails

Britton explained to TechRepublic that Abnormal tracks AI with its recently launched CheckGPT, an internal, post-detection tool that helps determine when email threats — including phishing emails and other socially-engineered attacks — have likely been created using generative AI tools.

“CheckGPT leverages a suite of open source large language models to analyze how likely it is that a generative AI model created the email message,” he said. “The system first analyzes the likelihood that each word in the message has been generated by an AI model, given the context that precedes it. If the likelihood is consistently high, it’s a strong potential indicator that text was generated by AI.”
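
The per-word scoring Britton describes can be illustrated with a deliberately tiny stand-in. The sketch below is not CheckGPT: it replaces the large language models with a hypothetical bigram model built from two reference sentences, and the helper names (train_bigram, token_likelihoods, looks_generated) and the 0.5 threshold are assumptions chosen only for illustration. It also uses the average likelihood as a simple proxy for "consistently high."

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count how often each word follows the previous one (toy stand-in for an LLM)."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, word in zip(words, words[1:]):
            counts[prev][word] += 1
    return counts


def token_likelihoods(model, text):
    """Return P(word | previous word) for every word after the first."""
    words = text.lower().split()
    probs = []
    for prev, word in zip(words, words[1:]):
        total = sum(model[prev].values())
        probs.append(model[prev][word] / total if total else 0.0)
    return probs


def looks_generated(model, text, threshold=0.5):
    """Flag text whose words are, on average, highly predictable under the model."""
    probs = token_likelihoods(model, text)
    return bool(probs) and sum(probs) / len(probs) >= threshold


# Hypothetical reference text the toy model has "seen".
model = train_bigram([
    "please click the link to pay the fee",
    "please click the link to confirm your account",
])

predictable = "please click the link to pay the fee"
unusual = "fee the pay click please"

print(looks_generated(model, predictable))  # every word follows a familiar context
print(looks_generated(model, unusual))      # no word follows a familiar context
```

A real detector would score tokens with a full language model and a much longer context, but the decision has the same shape: text that the model finds predictable word after word is more likely to have come from a model.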

Attackers use generative AI for credential theft

Britton said attackers’ use of AI includes crafting credential phishing, business email compromises and vendor fraud attacks. While AI tools can be used to create impersonated websites as well, “these are typically supplemental to email as the primary attack mechanism,” he said. “We’re already seeing these AI attacks play out — Abnormal recently released research showing a number of emails that contained language strongly suspected to be AI-generated, including BEC and credential phishing attacks.” He noted that AI can fix the dead giveaways: typos and egregious grammatical errors.

“Also, imagine if threat actors were to input snippets of their victim’s email history or LinkedIn profile content within their ChatGPT queries. This brings highly personalized context, tone and language into the picture — making BEC emails even more deceptive,” Britton added.

SEE: AI vs AI: the next front in the phishing wars (TechRepublic)

How hard is it to build effective email exploits with AI? Not very. Late in 2022, researchers at Tel Aviv-based Check Point demonstrated how generative AI could be used to create viable phishing content, write malicious code in Visual Basic for Applications and macros for Office documents, and even produce code for reverse shell operations (Figure B).

Figure B

Check Point researchers created an effective phishing email with ChatGPT. Image: Check Point Software

They also published examples of threat actors using ChatGPT in the wild to produce infostealers and encryption tools (Figure C).

Figure C

Cybercriminal showing how he created an infostealer using ChatGPT. Image: Check Point Software

How credential-focused phishing attacks lead to BECs

Britton wrote that credential phishing attacks are pernicious partly because they are the first step in an attacker’s lateral journey toward achieving network persistence, which is an offender’s ability to take up parasitic, unseen residence within an organization. He noted that when attackers gain access to Microsoft credentials, for example, they can enter the Microsoft 365 enterprise environment to hack Outlook or SharePoint and do further BECs and vendor fraud attacks.

“Credential phishing attacks are particularly harmful because they are typically the first step in a much more malicious campaign,” wrote Britton.

Because persistent threat actors can pretend to be legitimate network users, they can also perform thread hijacking, inserting themselves into an existing enterprise email conversation. From inside the thread they can launch further phishing exploits, monitor emails, learn the organizational command chain and target those who, for example, authorize wire transfers.

“When attackers gain access to banking credentials, they can access the bank account and move funds from their victim’s account to one they own,” noted Britton. With stolen social media account credentials gained through phishing exploits, he said attackers can use the personal information contained in the account to extort victims into paying money to keep their data private.

BECs on the rise, along with sophistication of email attacks

Britton noted that successful BEC exploits are a key means for attackers to steal credentials from a target via social engineering. Unfortunately, BECs are on the rise, continuing a five-year trend, according to Abnormal. Microsoft Threat Intelligence reported that it detected 35 million business email compromise attempts, with an average of 156,000 attempts daily between April 2022 and April 2023.

Splunk’s 2023 State of Security report, based on a global survey of 1,520 security and IT leaders who spend half or more of their time on security issues, found that over the past two years, 51% of incidents reported were BECs — a nearly 10% increase vs. 2021 — followed by ransomware attacks and website impersonations.

Also increasing is the sophistication of email attacks, including the use of financial supply chain compromise, in which attackers impersonate a target organization’s vendors to, for example, request that invoices be paid, a phenomenon Abnormal reported on early this year.

SEE: New phishing and BECs increase in complexity, bypass MFA (TechRepublic)

If not dead giveaways, strong warning signs of phishing

The Abnormal report suggested that organizations should be on the lookout for emails from a roster of often-spoofed brands that include:

  • Persuasive warnings about the potential of losing account access.
  • Fake alerts about fraudulent activity.
  • Demands to sign in via the provided link.

Google DeepMind Introduces SynthID to Watermark AI-Generated Images

Moving forward along the lines of its ‘Bold and Responsible’ approach, today Google DeepMind and Google Cloud have launched a beta version of SynthID, a tool for watermarking and identifying AI-generated images. While the company gave a sneak peek of the technology at Google I/O in May this year, the tool now finally has a name.

The rise of AI generation tools has made it impossible to go anywhere on the internet without encountering fake news, leaks and rumours. The content-generation technology has wreaked havoc in every corner from newsrooms to educational institutions. The struggle to differentiate between AI and human-generated content made it to headlines every second week of 2023. Google saw the uncharted territory and announced at the Google I/O significant steps to identify and contextualise AI content available on its Search.

Irene Solaiman, policy director at Hugging Face, who previously worked as an AI researcher at OpenAI, had earlier told AIM, “The community is desperately trying to find ways to differentiate between human- and AI-written text against the tide of potential technological exploitation.”

After a lengthy dialogue, the company’s responsible AI research arm has finally introduced a potential solution with SynthID. Measures like watermarking and implementing metadata will ensure better transparency, allow users to differentiate between AI-generated and authentic images, and protect copyright.

The tool will initially be available only to users of Google’s AI image generator Imagen, which is hosted on Vertex AI, Google Cloud’s ML platform. After generating images on Imagen, users will have the option to choose whether to add a watermark or not. Google DeepMind’s VP of Research, Pushmeet Kohli, labelled their watermarking tool “experimental,” sharing plans to assess its strengths and setbacks before wider adoption, according to MIT Technology Review. Kohli refrained from discussing potential expansion to non-Imagen visuals or integration into Google’s AI image generation systems.

The post Google DeepMind Introduces SynthID to Watermark AI-Generated Images appeared first on Analytics India Magazine.

Data Validation for PySpark Applications using Pandera

Photo by Jakub Skafiriak on Unsplash

If you’re a data practitioner, you’ll appreciate that data validation holds utmost importance in ensuring accuracy and consistency. This becomes particularly crucial when dealing with large datasets or data originating from diverse sources. However, the Pandera Python library can help to streamline and automate the data validation process. Pandera is an open-source library meticulously crafted to simplify the tasks of schema and data validation. It builds upon the robustness and versatility of pandas and introduces an intuitive and expressive API specifically designed for data validation purposes.

This article briefly introduces the key features of Pandera, then explains how Pandera data validation can be integrated with data processing workflows that use native PySpark SQL, which is supported as of the latest release (Pandera 0.16.0).

Pandera is designed to work with other popular Python libraries such as pandas, pyspark.pandas, Dask, etc. This makes it easy to incorporate data validation into your existing data processing workflows. Until recently, Pandera lacked native support for PySpark SQL, but to bridge this gap, a team at QuantumBlack, AI by McKinsey comprising Ismail Negm-PARI, Neeraj Malhotra, Jaskaran Singh Sidana, Kasper Janehag, Oleksandr Lazarchuk, along with the Pandera Founder, Niels Bantilan, developed native PySpark SQL support and contributed it to Pandera. The text of this article was also prepared by the team, and is written in their words below.

The Key Features of Pandera

If you are unfamiliar with using Pandera to validate your data, we recommend reviewing Khuyen Tran’s “Validate Your pandas DataFrame with Pandera” which describes the basics. In summary here, we briefly explain the key features and benefits of a simple and intuitive API, in-built validation functions and customisation.

Simple and Intuitive API

One of the standout features of Pandera is its simple and intuitive API. You can define your data schema using a declarative syntax that is easy to read and understand. This makes it easy to write data validation code that is both efficient and effective.

Here’s an example of schema definition in Pandera:

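import pandera as pa
from pandera.typing import Series


# Declarative schema: one typed field per expected column.
class InputSchema(pa.DataFrameModel):
    year: Series[int] = pa.Field()
    month: Series[int] = pa.Field()
    day: Series[int] = pa.Field()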

Inbuilt Validation Functions

Pandera provides a set of in-built functions (more commonly called checks) to perform data validations. When we invoke validate() on a Pandera schema, it performs both schema and data validations. The data validations invoke check functions behind the scenes.

Here’s a simple example of how to run a data check on a dataframe object using Pandera.

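import pandera as pa
from pandera.typing import Series


class InputSchema(pa.DataFrameModel):
    # Each Field combines a dtype coercion with one or more value checks.
    year: Series[int] = pa.Field(gt=2000, coerce=True)
    month: Series[int] = pa.Field(ge=1, le=12, coerce=True)
    day: Series[int] = pa.Field(ge=0, le=365, coerce=True)


# df is an existing pandas DataFrame with year, month and day columns.
InputSchema.validate(df)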

As seen above, for the year field we have defined a check gt=2000, which enforces that all values in this field must be greater than 2000; otherwise, Pandera raises a validation failure.

Here’s a list of all built-in checks available on Pandera by default:

  • eq: checks if value is equal to a given literal
  • ne: checks if value is not equal to a given literal
  • gt: checks if value is greater than a given literal
  • ge: checks if value is greater than or equal to a given literal
  • lt: checks if value is less than a given literal
  • le: checks if value is less than or equal to a given literal
  • in_range: checks if value is in a given range
  • isin: checks if value is in a given list of literals
  • notin: checks if value is not in a given list of literals
  • str_contains: checks if value contains a string literal
  • str_endswith: checks if value ends with a string literal
  • str_length: checks if value length matches
  • str_matches: checks if value matches a given pattern
  • str_startswith: checks if value starts with a string literal

Custom Validation Functions

In addition to the built-in validation checks, Pandera allows you to define your own custom validation functions. This gives you the flexibility to define your own validation rules based on use case.

For instance, you can define a lambda function for data validation as shown here:

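import pandera as pa

schema = pa.DataFrameSchema({
    "column2": pa.Column(str, [
        # Every value must start with the prefix "value".
        pa.Check(lambda s: s.str.startswith("value")),
        # Splitting on "_" must yield exactly two columns.
        pa.Check(lambda s: s.str.split("_", expand=True).shape[1] == 2),
    ]),
})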

Adding Support for PySpark SQL DataFrames to Pandera

During the process of adding support for PySpark SQL, we adhered to two fundamental principles:

  • consistency of interface and user experience
  • performance optimization for PySpark.

First, let’s delve into the topic of consistency. From a user’s perspective, it is important to have a consistent set of APIs and a consistent interface irrespective of the chosen framework. As Pandera provides multiple frameworks to choose from, it was even more critical to have a consistent user experience in the PySpark SQL APIs.

With this in mind, we can define the Pandera schema using PySpark SQL as follows:

from pyspark.sql import DataFrame, SparkSession
import pyspark.sql.types as T
import pandera.pyspark as pa
from pandera.pyspark import DataFrameModel, Field

spark = SparkSession.builder.getOrCreate()


def spark_df(spark, data, spark_schema):
    # Helper not defined in the original snippet; assumed here to build the
    # dataframe without schema verification so the raw sample rows load as-is.
    return spark.createDataFrame(data=data, schema=spark_schema, verifySchema=False)


class PanderaSchema(DataFrameModel):
    """Test schema"""

    id: T.IntegerType() = Field(gt=5)
    product_name: T.StringType() = Field(str_startswith="B")
    price: T.DecimalType(20, 5) = Field()
    description: T.ArrayType(T.StringType()) = Field()
    meta: T.MapType(T.StringType(), T.StringType()) = Field()


# Sample rows chosen to trigger validation failures.
data_fail = [
    (5, "Bread", 44.4, ["description of product"], {"product_category": "dairy"}),
    (15, "Butter", 99.0, ["more details here"], {"product_category": "bakery"}),
]

# Native PySpark schema; note the column is named "product", not "product_name".
spark_schema = T.StructType(
    [
        T.StructField("id", T.IntegerType(), False),
        T.StructField("product", T.StringType(), False),
        T.StructField("price", T.DecimalType(20, 5), False),
        T.StructField("description", T.ArrayType(T.StringType(), False), False),
        T.StructField(
            "meta", T.MapType(T.StringType(), T.StringType(), False), False
        ),
    ],
)

df_fail = spark_df(spark, data_fail, spark_schema)

In the above code, PanderaSchema defines the schema for the incoming PySpark dataframe. It has 5 fields with varying dtypes and enforces data checks on the id and product_name fields.

class PanderaSchema(DataFrameModel):
    """Test schema"""

    id: T.IntegerType() = Field(gt=5)
    product_name: T.StringType() = Field(str_startswith="B")
    price: T.DecimalType(20, 5) = Field()
    description: T.ArrayType(T.StringType()) = Field()
    meta: T.MapType(T.StringType(), T.StringType()) = Field()

Next, we created some dummy data and enforced the native PySpark SQL schema defined in spark_schema.

spark_schema = T.StructType(
    [
        T.StructField("id", T.IntegerType(), False),
        T.StructField("product", T.StringType(), False),
        T.StructField("price", T.DecimalType(20, 5), False),
        T.StructField("description", T.ArrayType(T.StringType(), False), False),
        T.StructField(
            "meta", T.MapType(T.StringType(), T.StringType(), False), False
        ),
    ],
)

df_fail = spark_df(spark, data_fail, spark_schema)

This is done to simulate schema and data validation failures.

Here are the contents of the df_fail dataframe:

df_fail.show()

+---+-------+--------+--------------------+--------------------+
| id|product|   price|         description|                meta|
+---+-------+--------+--------------------+--------------------+
|  5|  Bread|44.40000|[description of p...|{product_category...|
| 15| Butter|99.00000| [more details here]|{product_category...|
+---+-------+--------+--------------------+--------------------+

Next we can invoke Pandera’s validate function to perform schema and data level validations as follows:

df_out = PanderaSchema.validate(check_obj=df_fail)

We will explore the contents of df_out shortly.

Performance Optimization for PySpark

Our contribution was specifically designed for optimum performance with PySpark dataframes. This is crucial when working with large datasets, where validation has to cope with the unique challenges of PySpark’s distributed computing environment.

Pandera uses PySpark’s distributed computing architecture to efficiently process large datasets while maintaining data consistency and accuracy. We rewrote Pandera’s custom validation functions for PySpark performance to enable faster and more efficient validation of large datasets, while reducing the risk of data errors and inconsistencies at high volume.

Comprehensive Error Reports

We also added to Pandera the capability to generate detailed error reports in the form of a Python dictionary object. These reports are accessible via the dataframe returned from the validate function. They provide a comprehensive summary of all schema and data level validations, as per the user’s configurations.

This feature proves to be valuable for developers to swiftly identify and address any data-related issues. By using the generated error report, teams can compile a comprehensive list of schema and data issues within their application. This enables them to prioritize and resolve issues with efficiency and precision.

It is important to note that this feature is currently available exclusively for PySpark SQL, offering users an enhanced experience when working with error reports in Pandera.

In the code example above, recall that we invoked validate() on the Spark dataframe:

df_out = PanderaSchema.validate(check_obj=df_fail)

It returned a dataframe object. Using accessors we can extract the error report out of it as follows:

print(df_out.pandera.errors)

{
    "SCHEMA": {
        "COLUMN_NOT_IN_DATAFRAME": [
            {
                "schema": "PanderaSchema",
                "column": "PanderaSchema",
                "check": "column_in_dataframe",
                "error": "column 'product_name' not in dataframe Row(id=5, product='Bread', price=None, description=['description of product'], meta={'product_category': 'dairy'})"
            }
        ],
        "WRONG_DATATYPE": [
            {
                "schema": "PanderaSchema",
                "column": "description",
                "check": "dtype('ArrayType(StringType(), True)')",
                "error": "expected column 'description' to have type ArrayType(StringType(), True), got ArrayType(StringType(), False)"
            },
            {
                "schema": "PanderaSchema",
                "column": "meta",
                "check": "dtype('MapType(StringType(), StringType(), True)')",
                "error": "expected column 'meta' to have type MapType(StringType(), StringType(), True), got MapType(StringType(), StringType(), False)"
            }
        ]
    },
    "DATA": {
        "DATAFRAME_CHECK": [
            {
                "schema": "PanderaSchema",
                "column": "id",
                "check": "greater_than(5)",
                "error": "column 'id' with type IntegerType() failed validation greater_than(5)"
            }
        ]
    }
}

As seen above, the error report is a Python dictionary aggregated on two levels, so that downstream applications can consume it easily, for example to visualize errors over time with tools like Grafana:

  1. type of validation = SCHEMA or DATA
  2. category of errors = DATAFRAME_CHECK or WRONG_DATATYPE, etc.

This new format to restructure the error reporting was introduced in 0.16.0 as part of our contribution.
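For downstream tools that expect flat records rather than nested dictionaries, the two-level report can be unrolled into one row per error. The following is a minimal sketch, not part of Pandera: flatten_error_report is a hypothetical helper, and sample_report is a shortened copy of the report shown above.

```python
from typing import Any, Dict, List


def flatten_error_report(report: Dict[str, Dict[str, List[Dict[str, Any]]]]) -> List[Dict[str, Any]]:
    """Turn the nested report into one flat record per error (hypothetical helper)."""
    rows = []
    # Level 1: type of validation (SCHEMA or DATA)
    for validation_type, categories in report.items():
        # Level 2: category of errors (WRONG_DATATYPE, DATAFRAME_CHECK, ...)
        for category, errors in categories.items():
            for err in errors:
                rows.append(
                    {"validation_type": validation_type, "category": category, **err}
                )
    return rows


# Shortened copy of the report above, used only for illustration.
sample_report = {
    "SCHEMA": {
        "WRONG_DATATYPE": [
            {
                "schema": "PanderaSchema",
                "column": "description",
                "check": "dtype('ArrayType(StringType(), True)')",
                "error": "expected column 'description' to have type "
                "ArrayType(StringType(), True), got ArrayType(StringType(), False)",
            }
        ]
    },
    "DATA": {
        "DATAFRAME_CHECK": [
            {
                "schema": "PanderaSchema",
                "column": "id",
                "check": "greater_than(5)",
                "error": "column 'id' with type IntegerType() failed validation greater_than(5)",
            }
        ]
    },
}

rows = flatten_error_report(sample_report)
for row in rows:
    print(row["validation_type"], row["category"], row["column"])
```

Each record keeps the original schema, column, check and error keys, so the rows can be written to whatever store feeds the dashboard.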

ON/OFF Switch

For applications that rely on PySpark, having an On/Off switch is an important feature that can make a significant difference in terms of flexibility and risk management. Specifically, the On/Off switch allows teams to disable data validations in production without requiring code changes.

This is especially important for big data pipelines where performance is critical. In many cases, data validation can take up a significant amount of processing time, which can impact the overall performance of the pipeline. With the On/Off switch, teams can quickly and easily disable data validation if necessary, without having to go through the time-consuming process of modifying code.

Our team introduced the On/Off switch to Pandera so users can easily turn off data validation in production by simply changing a configuration setting. This provides the flexibility needed to prioritize performance, when necessary, without sacrificing data quality or accuracy in development.

To disable validations, set the following in your environment variables:

export PANDERA_VALIDATION_ENABLED=False

This will be picked up by Pandera to disable all validations in the application. By default, validation is enabled.

Currently, this feature is only available for PySpark SQL from version 0.16.0 as it is a new concept introduced by our contribution.

Granular Control of Pandera’s Execution

In addition to the On/Off switch feature, we also introduced a more granular control over the execution of Pandera’s validation flow. This is achieved by introducing configurable settings that allow users to control execution at three different levels:

  1. SCHEMA_ONLY: This setting performs schema validations only. It checks that the data conforms to the schema definition but does not perform any additional data-level validations.
  2. DATA_ONLY: This setting performs data-level validations only. It checks the data against the defined constraints and rules but does not validate the schema.
  3. SCHEMA_AND_DATA: This setting performs both schema and data-level validations. It checks the data against both the schema definition and the defined constraints and rules.

By providing this granular control, users can choose the level of validation that best fits their specific use case. For example, if the main concern is to ensure that the data conforms to the defined schema, the SCHEMA_ONLY setting can be used to reduce the overall processing time. Alternatively, if the data is known to conform to the schema and the focus is on ensuring data quality, the DATA_ONLY setting can be used to prioritize data-level validations.

The enhanced control over Pandera’s execution allows users to strike a fine-tuned balance between precision and efficiency, enabling a more targeted and optimized validation experience.

export PANDERA_VALIDATION_DEPTH=SCHEMA_ONLY

By default, validations are enabled, and depth is set to SCHEMA_AND_DATA which can be changed to SCHEMA_ONLY or DATA_ONLY as desired by use case.
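A pipeline may want to log which kinds of checks it expects to run before it starts. The sketch below is a hypothetical helper and not part of Pandera. It reads the two environment variables described above and applies the documented defaults. How it parses the values (only a case-insensitive "false" turns validation off) is an assumption, and Pandera's own parsing may differ.

```python
import os
from typing import Mapping, Optional, Tuple

# Which kinds of checks each depth setting runs, as described above:
# (run_schema_checks, run_data_checks)
DEPTHS = {
    "SCHEMA_ONLY": (True, False),
    "DATA_ONLY": (False, True),
    "SCHEMA_AND_DATA": (True, True),
}


def planned_checks(env: Optional[Mapping[str, str]] = None) -> Tuple[bool, bool]:
    """Return (run_schema_checks, run_data_checks) for the given environment."""
    env = os.environ if env is None else env
    # Assumption: any value other than "false" (case-insensitive) means enabled.
    enabled = env.get("PANDERA_VALIDATION_ENABLED", "True").strip().lower() != "false"
    if not enabled:
        return (False, False)
    # Default depth is SCHEMA_AND_DATA when the variable is not set.
    depth = env.get("PANDERA_VALIDATION_DEPTH", "SCHEMA_AND_DATA").strip().upper()
    if depth not in DEPTHS:
        raise ValueError(f"Unknown validation depth: {depth!r}")
    return DEPTHS[depth]


# Example with an explicit mapping instead of the real process environment.
print(planned_checks({"PANDERA_VALIDATION_DEPTH": "SCHEMA_ONLY"}))
```

Passing a plain dictionary instead of os.environ keeps the helper easy to test without touching the real environment.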

Currently, this feature is only available for PySpark SQL from version 0.16.0 as it is a new concept introduced by our contribution.

Metadata at Column and Dataframe levels

Our team added a new feature to Pandera that allows users to store additional metadata at Field and Schema / Model levels. This feature is designed to allow users to embed contextual information in their schema definitions which can be leveraged by other applications.

For example, by storing details about a specific column, such as data type, format, or units, developers can ensure that downstream applications are able to interpret and use the data correctly. Similarly, by storing information about which columns of a schema are needed for a specific use case, developers can optimize data processing pipelines, reduce storage costs, and improve query performance.

At the schema level, users can store information that helps categorize the different schemas used across an application. This metadata can include details such as the purpose of the schema, the source of the data, or the date range of the data. This can be particularly useful for managing complex data processing workflows, where multiple schemas are used for different purposes and need to be tracked and managed efficiently.

class PanderaSchema(DataFrameModel):
    """Pandera Schema Class"""
    id: T.IntegerType() = Field(
        gt=5,
        metadata={"usecase": ["RetailPricing", "ConsumerBehavior"],
                  "category": "product_pricing"},
    )
    product_name: T.StringType() = Field(str_startswith="B")
    price: T.DecimalType(20, 5) = Field()
    class Config:
        """Config of pandera class"""
        name = "product_info"
        strict = True
        coerce = True
        metadata = {"category": "product-details"}

In the above example, we have introduced additional information on the schema object itself. This is allowed at 2 levels: field and schema.

To extract the metadata at the schema level (including all the fields in it), we provide a helper function:

PanderaSchema.get_metadata()

The output will be a dictionary object as follows:

{
    "product_info": {
        "columns": {
            "id": {"usecase": ["RetailPricing", "ConsumerBehavior"],
                   "category": "product_pricing"},
            "product_name": None,
            "price": None,
        },
        "dataframe": {"category": "product-details"},
    }
}
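One way a downstream application might use this dictionary is to pick out the columns tagged for a particular use case. The following is a minimal sketch under that assumption: columns_for_usecase is a hypothetical helper, not a Pandera function, and schema_metadata is a literal copy of the get_metadata() output shown in the article.

```python
from typing import Any, Dict, List

# Literal copy of the get_metadata() output from the article.
schema_metadata = {
    "product_info": {
        "columns": {
            "id": {
                "usecase": ["RetailPricing", "ConsumerBehavior"],
                "category": "product_pricing",
            },
            "product_name": None,
            "price": None,
        },
        "dataframe": {"category": "product-details"},
    }
}


def columns_for_usecase(
    metadata: Dict[str, Any], schema_name: str, usecase: str
) -> List[str]:
    """Return the columns whose field metadata lists the given use case."""
    columns = metadata[schema_name]["columns"]
    return [
        name
        for name, meta in columns.items()
        # Columns without metadata are reported as None, so skip them.
        if meta and usecase in meta.get("usecase", [])
    ]


print(columns_for_usecase(schema_metadata, "product_info", "RetailPricing"))
```

The resulting list could then be passed to a column selection on the dataframe, so that a job reads only the fields its use case needs.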

Currently, this feature is a new concept in 0.16.0 and has been added for PySpark SQL and Pandas.

Summary

We have introduced several new features and concepts, including an On/Off switch that allows teams to disable validations in production without code changes, granular control over Pandera’s validation flow, and the ability to store additional metadata on column and dataframe levels. You can find even more detail in the updated Pandera documentation for version 0.16.0.

As the Pandera Founder, Niels Bantilan, explained in a recent blog post about the release of Pandera 0.16.0:

To prove out the extensibility of Pandera with the new schema specification and backend API, we collaborated with the QuantumBlack team to implement a schema and backend for Pyspark SQL … and we completed an MVP in a matter of a few months!

This recent contribution to Pandera’s open-source codebase will benefit teams working with PySpark and other big data technologies.

The following team members at QuantumBlack, AI by McKinsey are responsible for this recent contribution: Ismail Negm-PARI, Neeraj Malhotra, Jaskaran Singh Sidana, Kasper Janehag, Oleksandr Lazarchuk. I’d like to thank Neeraj in particular for his assistance in preparing this article for publication.
Jo Stitchbury is an experienced technical writer. She writes about data science and analysis, AI, and the software industry.
