Canva Debuts AI-Powered Enterprise Suite to Challenge Adobe 

At Canva’s first international Canva Create event in Los Angeles, the Australian design company launched Canva Enterprise, a subscription service designed to meet the needs of large organisations. It supports scalable growth with increased seats and cloud storage, making it suitable for global use.

The platform aims to centralise design, content production, AI, and collaboration tools, reducing costs and complexity. It also includes updated brand controls to ensure consistency across all brand elements and templates. Additionally, enterprise-level security tools such as MFA, SCIM, SSO, and Canva Shield protect organisational assets.

“As demand for visual content soars, navigating organisational complexity is more challenging than ever. We democratised the design ecosystem in our first decade and now look forward to unifying the fragmented ecosystems of design, AI, and workflow tools for every organisation in our second decade,” said Melanie Perkins, co-founder and CEO of Canva.

In addition to Canva Enterprise, the company made several other major announcements, including new AI features in Magic Studio, a redesigned user experience, and more.

Bolstering Visual Suite and Magic Studio AI Products

Magic Studio has received new AI upgrades. It can now create graphics, icons, and illustrations from text prompts with Magic Media, while Magic Design produces high-quality presentations. A new set of photo editing tools offers AI-powered, seamless image editing.

“Magic Studio works on internally-developed AI and ML algorithms that leverage a combination of foundational AI models from our team, including Kaleido, and a variety of partners like OpenAI, Google, AWS, and Runway,” Danny Wu, head of AI products at Canva, told AIM in a conversation earlier this year.

Visual Suite, meanwhile, now better supports organisational needs with suggested edits for tracking changes and integrations with third-party apps like Amazon Ads, Google, and Meta for design feedback and collaboration. It also includes data autofill, which automatically fills designs with business data from sources like Salesforce. Additionally, the Bulk Create feature streamlines marketing workflows by updating multiple designs at once using CSV or Excel files.

New User Interface

Canva has completely redesigned its core product to improve editing and collaboration. The new interface offers a streamlined editing experience with a contextual toolbar and quick access to tools like the background remover and Magic Studio AI.

The customisable homepage allows users to pin favourite designs, folders, and templates for quick access and has an advanced search feature for finding content easily.

Tailored Tools for Every Team

Canva has introduced specialised tools for different departments such as marketing, HR, sales, and creative teams. Called Canva Work Kits, these consist of customisable templates tailored to each department’s needs.

Extending Affinity Following Acquisition

Over the past six years, the company has steadily built its AI capabilities and made major acquisitions.

In March 2024, it acquired the UK-based creative software suite Affinity. The latest release, Affinity version 2.5, introduces advanced editing features to enhance professional performance. Key upgrades include variable font support, a stroke width tool, and support for ARM64 chips.

It also purchased Kaleido in 2021.

Canva Courses for Workplace Learning

Like many other tech firms, Canva is investing in upskilling. It has launched Canva Courses to improve workplace learning by turning designs into interactive courses that support employee onboarding, upskilling, and development, with progress tracked via a central dashboard.

Canva Vs Adobe

With over 170 million users today, the company reached unicorn status in 2018, when a $40 million investment valued it at $1 billion. In May 2022, Walt Disney CEO Bob Iger became an angel investor in the Blackbird Ventures and Sequoia Capital-backed company.

In 2022, it took a leap of faith and entered the AI race, offering stiff resistance to competitors like Microsoft Designer and Adobe.

Adobe has also diversified its portfolio into a generative AI-powered enterprise software platform, introducing Firefly to Photoshop and launching features like Generative Fill and Generative Remove for advanced image editing.

Today, the design giant introduced new AI features in Lightroom, including generative removal for photo editing and Lens Blur to improve editing speed. Powered by Adobe Firefly, these tools work across phones, desktops, and web.


Microsoft Loves OpenAI, but OpenAI is Crushing on Apple 

A new love triangle is brewing in Silicon Valley. Just a day before Microsoft Build, the tech giant openly took a dig at Apple’s MacBook Air while showing off its latest Copilot+ PCs.

But little did the world know about OpenAI’s soft corner for Apple. This affinity was clearly displayed at the OpenAI Spring Update, where MacBooks and iPhones were prominently used, while Microsoft Windows products were notably absent.

Microsoft loves OpenAI only for its GPTs

Microsoft’s relationship with OpenAI is one of the most intriguing dynamics in the tech industry. In his keynote at Microsoft Build 2024, chief Satya Nadella said that OpenAI is Microsoft’s “most strategic and most important partner”.

“As OpenAI innovates, our promise is that we will bring all that innovation to Azure too. In fact, the same day that OpenAI announced GPT-4o, we made the model available for testing on Azure OpenAI Service,” said Nadella.

“Just last week, OpenAI announced GPT-4o, their latest multimodal model trained on Azure. It’s an absolute breakthrough. It supports text, audio, image, and video as both input and output,” said Nadella, adding that the model can respond and hold human-like conversations that are fast and fluid.

After the keynote, Microsoft invited OpenAI CEO Sam Altman on the stage for a chat with Microsoft CTO Kevin Scott. Altman said that with GPT-4o, they have halved the cost and doubled the speed, assuring that their AI models will keep getting smarter. “If you think about what happened from GPT-3 to 3.5 to 4, it just got smarter,” he said.

I'm watching Sam Altman and Kevin Scott at Microsoft Build and it's very clear that a GPT-5 level model is in the works, and it's going to be a huge leap forward.

— Dan Shipper 📧 (@danshipper) May 21, 2024

Moreover, Altman added that as the models get more powerful, there will be many new things to figure out as the company moves towards AGI. “The level of complexity, and I think the newer research it will take, will increase. I’m sure we’ll do that together [with Microsoft],” Altman told Scott.

At the event, Scott said that if the system that trained GPT-3 was a shark and GPT-4 an orca, the model being trained now is the size of a whale. “This whale-sized supercomputer is hard at work right now,” he added.

This is from Kevin Scott's presentation on scale. He said the shark is the hardware they built for OpenAI to train GPT-3 in 2020. The Orca they built for GPT-4 in 2022. To train the next model 'The system that we have *just deployed* is scale wise about the size of a whale' pic.twitter.com/nx0a5o0erC

— Andrew Curran (@AndrewCurran_) May 22, 2024

Microsoft is undoubtedly betting big on OpenAI. As revealed in email exchanges, the company invested $1 billion in OpenAI in 2019 because it was ‘very worried’ that Google was years ahead in scaling up its AI efforts.

OpenAI loves Apple

If you carefully examine Microsoft’s announcements and offerings at Microsoft Build, you will see that the focus was largely on OpenAI. GPT-4o, OpenAI’s newest flagship model, is now available on Azure AI Studio and as an API.

Moreover, the new features added to Copilot, such as Team Copilot and Agents in Microsoft Copilot Studio, are all powered by OpenAI’s GPT-4.

In contrast, OpenAI’s Spring Update painted a completely different picture.

Throughout its announcements and demos, OpenAI exclusively used Apple products like MacBooks and iPhones, with no sign of Microsoft Windows products. This was particularly conspicuous after recent reports claimed that OpenAI and Apple have finalised a deal to integrate OpenAI’s models into Siri to enhance its conversational abilities.

One wouldn’t be surprised if Altman shared the stage with Apple chief Tim Cook at WWDC 2024.

Apple and OpenAI to Announce Major Partnership at WWDC, Introducing AI Features in iOS 18
🚀 Apple and OpenAI to announce a major partnership at Apple's WWDC on June 10.
📱 Apple will focus on on-device AI features, with additional cloud-based services.
💡 Bloomberg's Mark… pic.twitter.com/X2zwNasSwH

— Rory James (@GenAI_Sales) May 20, 2024

Altman recently lauded the Cupertino-based tech giant for its technology prowess, saying, “iPhone is the greatest piece of technology humanity has ever made”, and it’s tough to get beyond it as “the bar is quite high”.

OpenAI released a ChatGPT desktop app for macOS, with plans to launch a Windows version later this year. However, at Microsoft Build, Microsoft announced a Copilot experience built using GPT-4o that is quite similar to the ChatGPT desktop app: it records the user’s screen and assists them with their tasks.

OpenAI didn’t ship the Windows app for ChatGPT because Microsoft is doing it pic.twitter.com/HPsr7v6WWI

— Pete (@nonmayorpete) May 20, 2024

Meanwhile, Microsoft has also been building its own small language models (SLMs), such as Phi-3, to avoid appearing overly dependent on OpenAI. The company recently released the Phi-3 vision model.

According to recent reports, Microsoft is also building a large in-house language model referred to as MAI-1 (possibly short for Microsoft AI-1). The model, being developed internally, is around 500 billion parameters in size, and its development is headed by Mustafa Suleyman, Microsoft’s chief of AI.

Scott went on LinkedIn to explain that this was not, in any way, in competition with OpenAI.

“I’m not sure why this is news, but just to summarise the obvious: we build big supercomputers to train AI models. Our partner OpenAI uses these supercomputers to train frontier-defining models; and then we both make these models available in products and services so that lots of people can benefit from them. We rather like this arrangement,” he said.

Even though Microsoft and OpenAI are pursuing their own projects, the partnership is going strong for now, and who knows, we might even get GPT-5 by November.


Combining Data Management and Data Storytelling to Generate Value

Lately, I have been focusing on data storytelling and its importance in effectively communicating the results of data analysis to generate value. However, my technical background, which is very close to the world of data management and its problems, pushed me to reflect on what data management needs to provide so that you can build data-driven stories quickly. I came to a conclusion that is often taken for granted but is always good to keep in mind: you can’t rely only on data to build data-driven stories. A data management system must also consider at least two other aspects. Do you want to know which ones? Let's try to find out in this article.

What we’ll cover in this article:

  • Introducing Data
  • Data Management Systems
  • Data Storytelling
  • Data Management and Data Storytelling

1. Introducing Data

We continually talk about, use, and generate data. But have you wondered what data is and what types of data exist? Let's try to define it.

Data is raw facts, numbers, or symbols that can be processed to generate meaningful information. There are different types of data:

  • Structured data is organized in a fixed schema, such as SQL tables or CSV files. The main pro of this type of data is that it’s easy to derive insights from; the main drawback is that schema dependence limits scalability. A relational database is an example of this type of data.
  • Semi-structured data is partially organized, without a fixed schema, such as JSON or XML. The main pro is that it is more flexible than structured data; the main con is that the meta-level structure may still contain unstructured data. An example is annotated text, such as tweets with hashtags.
  • Unstructured data, such as audio, video, and raw text, has no predefined organization or annotation. The main pros are that it is easy to store and highly scalable. However, it is challenging to manage; for example, it’s difficult to extract meaning from it. Plain text and digital photos are examples of unstructured data (a minimal code sketch of all three types follows this list).
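
As a minimal sketch in Python (the values, file contents, and field names below are made up purely for illustration, and pandas is assumed to be installed), the three types of data can look like this:

import io
import json

import pandas as pd

# Structured: tabular data with a fixed schema (e.g., a CSV file or SQL table)
csv_data = io.StringIO("id,city,temperature\n1,Pisa,21.5\n2,Rome,24.0")
structured = pd.read_csv(csv_data)

# Semi-structured: partially organized, with no fixed schema (e.g., JSON)
semi_structured = json.loads(
    '{"user": "alice", "tags": ["#data", "#ai"], "bio": "free text lives here"}'
)

# Unstructured: raw content with no predefined organization (e.g., plain text)
unstructured = "Yesterday we visited the museum and took dozens of photos."

print(structured.dtypes)        # the schema is explicit
print(semi_structured["tags"])  # the structure is only partial
print(unstructured[:20])        # raw content only; meaning must be extracted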

To organize data whose volume keeps increasing over time, it’s essential to manage it properly.

2. Data Management

Data management is the practice of ingesting, processing, securing, and storing an organization’s data, which is then utilized for strategic decision-making to improve business outcomes [1]. There are three central data management systems:

  • Data Warehouse
  • Data Lake
  • Data Lakehouse

2.1 Data Warehouse

A data warehouse can handle only structured data, after extract, transform, and load (ETL) processes. Once processed, the data can be used for reporting, dashboarding, or mining. The following figure summarizes the structure of a data warehouse.

Fig. 1: The architecture of a data warehouse

The main problems with data warehouses are:

  • Scalability — they are not scalable
  • Unstructured data — they don’t manage unstructured data
  • Real-time data — they don’t manage real-time data.

2.2 Data Lake

A data lake can ingest raw data as it is. Unlike a data warehouse, a data lake manages and provides ways to consume or process structured, semi-structured, and unstructured data. Because it ingests raw data, a data lake can store both historical and real-time data in a raw storage system.

The data lake adds a metadata and governance layer to make the data consumable by the upper layers (reporting, dashboarding, and data mining). The following figure shows the architecture of a data lake.

Fig. 2: The architecture of a data lake

The main advantage of a data lake is that it can ingest any kind of data quickly, since it does not require any preliminary processing. The main drawback is that, since it ingests raw data, it does not support the semantics and transaction system of a data warehouse.
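
To make the contrast concrete, here is a toy sketch in Python (the records, file name, and the use of SQLite are purely illustrative): a warehouse-style flow transforms data into a fixed schema before loading it, while a lake-style flow stores the raw records as they are and defers processing.

import json
import sqlite3

import pandas as pd

raw = [{"id": 1, "amount": "12.5"}, {"id": 2, "amount": "7.0"}]

# Warehouse-style: transform into a fixed schema first (ETL), then load
clean = pd.DataFrame(raw).astype({"id": int, "amount": float})
conn = sqlite3.connect(":memory:")
clean.to_sql("sales", conn, index=False)

# Lake-style: store the raw records as they are; process them later if needed
with open("raw_sales.json", "w") as f:
    json.dump(raw, f)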

2.3 Data Lakehouse

Over time, the concept of a data lake has evolved into the data lakehouse, an augmented data lake that adds support for transactions on top. In practice, a data lakehouse can modify the existing data in the data lake, following data warehouse semantics, as shown in the following figure.

Fig. 3: The architecture of a data lakehouse

The data lakehouse ingests data extracted from operational sources, whether structured, semi-structured, or unstructured, and provides it to analytics applications, such as reporting, dashboarding, workspaces, and other applications. A data lakehouse comprises the following main components:

  • Data lake, which includes table format, file format, and file store
  • Data science and machine learning layer
  • Query engine
  • Metadata management layer
  • Data governance layer.

2.4 Generalizing the Data Management System Architecture

The following figure generalizes the data management system architecture.

Fig. 4: The general architecture of a data management system

A data management system (data warehouse, data lake, data lakehouse, or whatever) receives data as an input and generates an output (reports, dashboards, workspaces, applications, …). The input is generated by people and the output is exploited again by people. Thus, we can say that we have people in input and people in output. A data management system goes from people to people.

People in input include people generating the data, such as people wearing sensors, people answering surveys, people writing a review about something, statistics about people, and so on. People in output can belong to one of the following three categories:

  • General public, whose objective is to learn something or be entertained
  • Professionals, who are technical people wanting to understand data
  • Executives who make decisions.

In this article, we will focus on executives since they generate value.

But what is value? The Cambridge Dictionary gives different definitions of value [2].

  1. The amount of money that can be received for something
  2. The importance or worth of something for someone
  3. Values: The beliefs people have, especially about what is right and wrong and what is most important in life, that control their behavior.

If we accept the definition of value as the amount of money, a decision-maker could generate value for the company they work for, and indirectly for the people in the company and the people using the services or products the company offers. If we accept the definition of value as the importance of something, the value is essential for the people generating the data and for other external people, as shown in the following figure.

Fig. 5: The process of generating value

In this scenario, properly and effectively communicating data to decision-makers becomes crucial to generating value. For this reason, the entire data pipeline should be designed to communicate data to the final audience (decision-makers) in order to generate value.

3. Data Storytelling

There are three ways to communicate data:

  • Data reporting includes data description, with all the details of the data exploration and analysis phases.
  • Data presentation selects only relevant data and shows them to the final audience in an organized and structured way.
  • Data storytelling builds a story on data.

Let’s focus on data storytelling. Data storytelling is communicating the results of a data analysis process to an audience through a story. Based on your audience, you will choose an appropriate:

  • Language and Tone: The set of words (language) and the emotional expression conveyed through them (tone)
  • Context: The level of detail to add to your story, based on the cultural sensitivity of the audience

Data storytelling must consider the data and all the relevant information associated with it (the context). Data context refers to the background information and pertinent details surrounding and describing a dataset. In data pipelines, this context is stored as metadata [3]. Metadata should provide answers to the following questions:

  • Who collected data
  • What the data is about
  • When the data was collected
  • Where the data was collected
  • Why the data was collected
  • How the data was collected
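
As a minimal sketch of how these six answers can travel with the data (the dataset, field names, and values below are invented for illustration; pandas DataFrame.attrs is used here simply as a convenient place to attach the context):

import json

import pandas as pd

# Illustrative dataset (values are made up)
df = pd.DataFrame({"station": ["A", "B"], "pm25": [12.3, 48.1]})

# Metadata answering who / what / when / where / why / how
metadata = {
    "who": "Municipal environmental agency",
    "what": "Hourly PM2.5 readings from air-quality stations",
    "when": "2024-01-01 to 2024-03-31",
    "where": "Pisa, Italy",
    "why": "Monitor compliance with air-quality thresholds",
    "how": "Calibrated optical particle sensors, hourly averages",
}

# Keep data and context together so the story can later be built on both
df.attrs["context"] = metadata
print(json.dumps(df.attrs["context"], indent=2))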

3.1 The Importance of Metadata

Let's revisit the data management pipeline from a data storytelling perspective, which includes both data and metadata (context).

Fig. 6: The data management pipeline from the data storytelling perspective

The data management system comprises two elements: data management, where the main actor is the data engineer, and data analysis, where the main actor is the data scientist.
The data engineer should focus not only on data but also on metadata, which helps the data scientist build the context around the data. There are two types of metadata management systems:

  • Passive Metadata Management, which aggregates and stores metadata in a static data catalog (e.g., Apache Hive)
  • Active Metadata Management, which provides dynamic and real-time metadata (e.g., Apache Atlas)

The data scientist should build the data-driven story.

4. Data Management and Data Storytelling

Combining Data Management and Data Storytelling means:

  • Considering the final people who will benefit from the data. A Data Management system goes from people to people.
  • Considering metadata, which helps build the most powerful stories.

If we look at the entire data pipeline from the desired outcome perspective, we discover the importance of the people behind each step. We can generate value from data only if we look at the people behind the data.

Summary

Congratulations! You have just learned how to look at Data Management from the Data Storytelling perspective. You should consider two aspects, in addition to data:

  • People behind data
  • Metadata, which gives context to your data.

And, above all, never forget the people! Data storytelling helps you look at the stories behind the data!

References

[1] IBM. What is data management?
[2] The Cambridge Dictionary. Value.
[3] Peter Crocker. Guide to enhancing data context: who, what, when, where, why, and how

External resources

Using Data Storytelling to Turn Data into Value [talk]

Angelica Lo Duca (Medium) (@alod83) is a researcher at the Institute of Informatics and Telematics of the National Research Council (IIT-CNR) in Pisa, Italy. She is a professor of "Data Journalism" for the Master degree course in Digital Humanities at the University of Pisa. Her research interests include Data Science, Data Analysis, Text Analysis, Open Data, Web Applications, Data Engineering, and Data Journalism, applied to society, tourism, and cultural heritage. She is the author of the book Comet for Data Science, published by Packt Ltd., of the upcoming book Data Storytelling in Python Altair and Generative AI, published by Manning, and co-author of the upcoming book Learning and Operating Presto, by O'Reilly Media. Angelica is also an enthusiastic tech writer.

More On This Topic

  • Combining Pandas DataFrames Made Simple
  • Mastering the Art of Data Storytelling: A Guide for Data Scientists
  • Data storytelling — the art of telling stories through data
  • Generate Synthetic Time-series Data with Open-source Tools
  • How to Generate Synthetic Tabular Dataset
  • 4 Ways to Generate Passive Income Using ChatGPT

Sify Plans to Build 675 MW of Data Centre Capacity in the Next Five Years 

Sify Technologies, one of India’s leading data centre providers, recently announced that it plans to build 675 megawatts (MW) of data centre capacity in the country over the next five years.

The company has laid out an ambitious roadmap for future growth.

“Our current roadmap for the next five years is to reach a total of 675 MW of capacity,” revealed Sify Technologies’ data centre business CTO Girish Dhavale in an exclusive interview with AIM.

Currently, the company is working on setting up 250 MW of data centre capacity, with plans to reach a total capacity of over 350 MW within the next three years.

A key focus area is its Rabale campus in Navi Mumbai, where Sify already operates four data centres consuming around 50 MW.

“Four more data centres are under construction. So, almost every year on the Rabale campus, we deliver 25 megawatts to 30 megawatts of capacity. And we have a plan to build a total of 12 data centres in the Rabale campus itself,” the CTO shared.

Sify is also expanding its footprint in other regions. To cater to demand in North India, a data centre with 72 MW of additional capacity is being built in Noida, and another 78 MW data centre is under development in Chennai’s Siruseri. The company recently launched a new data centre in Bengaluru as well.

Looking ahead, Dhavale shared that Sify has its sights set on Hyderabad and further expansion in Mumbai. “In the next seven to eight years, the total capacity of Sify data centres will become 675 MW. So, that’s the kind of portfolio we are looking for,” Dhavale said confidently.

Humble Beginnings

Sify Technologies has come a long way since launching its first private data centre in Vashi back in 2000 with a modest 500 kilowatt capacity.

“Back then, we predominantly had customers like banks or government offices, like the ones related to income tax rather than revenue collection,” recalled Dhavale.

Fast-forward to today, Sify has established itself as an industry leader. The company’s growth has mirrored the rapid digitalisation of India’s economy. With 11 data centres across six metro cities, the company consumes an impressive 100 MW of power.

As GenAI has started to gain momentum in the country, Sify is gearing up its data centres to support high-density servers, including GPUs. “Now, to provide that kind of uptime availability and cooling, you need to provide chip-level cooling,” explained Dhavale.

Sify is conducting POCs with various cooling technologies like chip-level cooling, rack heat exchanger cooling, and liquid immersion cooling to ensure it can meet the needs of AI customers. The company’s newer data centres are also designed with high ceilings and a robust floor loading capacity to accommodate high-density AI servers.

“Looking into the future, all our existing data centres can easily meet AI,” said Dhavale.

At present, Sify’s data centres cater to a wide range of customers across industries. They serve hyperscalers, banking customers, enterprise customers, and healthcare customers, alongside AI and ML customers.

Hyperscalers account for 30-35% of Sify’s total customer base, followed by banking at 20%, and enterprise and retail contributing another 20-23%.

The company’s rack density offerings range from three kilowatts per rack to 40 kW per rack, allowing it to meet its customers’ varied needs.

“To support this diverse customer base, Sify has invested in a range of cooling technologies from underfloor cooling to liquid immersion cooling.

“In liquid immersion cooling, the active components of the servers are submerged in a liquid with a high dielectric strength. This allows for direct cooling through the surface of the components, which is known as conductive cooling, rather than relying on traditional air-based cooling methods,” he added.

Dhavale asserted that one of their key differentiators is that “in the last twenty years of Sify’s life cycle, there has not been a single downtime on Sify’s infrastructure across India.”

This reliability has helped Sify win the trust of major customers across sectors. “The world’s fourth largest bank runs their core banking from Sify. The largest public sector bank in India runs its core banking from Sify. Other major government banks run their operations from Sify’s data centres.

“When we come to the enterprises, the top three family-owned enterprises in the country run on Sify’s servers,” said Dhavale without revealing names.

Tackling the Environmental Impact

Data centres are notorious for being power-hungry, with a combined global consumption slated to exceed 848 terawatt-hours (TWh) by 2030. However, Sify is taking proactive steps to minimise its environmental footprint.

“We have our solar and wind capacity already installed. We have signed power purchase agreements for 237 MW with our partners. Out of that, 100 MW is already generating and delivering power to our Rabale campus,” explained Dhavale.

Sify is now also pursuing an additional 75 MW of interstate renewable energy capacity.

“Our plan is to reduce our carbon footprint by 70% by 2027. And by 2030, we want to be carbon neutral,” Dhavale declared.

Microsoft chief Satya Nadella recently announced at the Build 2024 event that the company is on track to have its data centres powered by 100% renewable energy by next year. This commitment aligns with Microsoft’s broader sustainability goals, which include becoming carbon-negative by 2030.

Moreover, as energy demands increase, many players are also exploring the possibility of powering their data centres through nuclear energy or small modular reactors (SMRs).

Water conservation is another area on which Sify is persistently focused, given the recent water scarcity scare in Bengaluru.

Sify’s new facilities are designed with zero water discharge. Wastewater is treated and reused for washrooms, kitchens, and gardening. High-quality treated water is also reused in chiller operations.

Pushing the Boundaries of Energy Efficiency

In addition to renewable energy, Sify is deploying low-power, high-performance servers and networking equipment, optimising cooling systems through technologies like liquid cooling, free air cooling, hot/cold aisle containment, and variable-speed drives.

“Our older facilities were designed with a power usage effectiveness (PUE) of 1.7, meaning that for every 10 kW of IT power, a total of 17 kW was consumed. However, our new facilities are designed for a PUE of 1.4, a significant improvement,” revealed Dhavale.

Adding to that, Dhavale said, “This is a result of our utilisation of energy conservation technologies used in the overall infrastructure, which have reduced our consumption in terms of auxiliaries, cooling, and losses.”


OpenAI Changes Journalism Forever with its Multi-Year Global Partnership with News Corp.


In a massive deal that comfortably places OpenAI in the media space, the company announced a multi-year agreement with News Corp, the American mass media and publishing company. The partnership will allow OpenAI to display content from News Corp. in response to user queries.

Access to All Major Publications

With the News Corp. partnership, OpenAI will have access to current and archived content from major news publications across the globe, including The Wall Street Journal, Barron’s, MarketWatch, Investor’s Business Daily, FN, New York Post, The Times, The Sunday Times, The Sun, The Australian, news.com.au, The Daily Telegraph, The Courier Mail, The Advertiser, and Herald Sun, among others.

“Our partnership with News Corp is a proud moment for journalism and technology,” said OpenAI chief Sam Altman. “We greatly value News Corp’s history as a leader in reporting breaking news around the world, and are excited to enhance our users’ access to its high-quality reporting.”

The News Corp. announcement continues the string of publisher partnerships that OpenAI has been building since last year. In March, OpenAI partnered with Le Monde and Prisa Media to bring French and Spanish news to its platform, and last year, the company also partnered with the Associated Press and the American Journalism Project.

Content Hassle Sorted

With the ongoing controversies surrounding big tech companies using publicly available data without consent, OpenAI’s approach of partnering with major news publications helps it avoid this problem. In the process, it also acquires more content that could be used to train upcoming models such as GPT-5.

Furthermore, OpenAI has also partnered with Stack Overflow and Reddit, which gives it access to a vast repository of community discussions that will also go into improving ChatGPT and other AI models.

While the move is only going to benefit OpenAI, users are concerned about how media biases will creep into generated content.

Source: X


Learning System Design: Top 5 Essential Reads

Image by Author | DALLE-3 & Canva

System design can be daunting. At least, I felt this way when I wanted to learn system design as a beginner. The latest trends and buzzwords make it more difficult to know what to learn and where to start. But don't worry! In this article, I will suggest a great starting point for beginners and explain why it's crucial to learn system design.

System design is an integral component of designing large-scale applications, forming the backbone of applications like Twitter, Facebook, Instagram, and countless others. It is essential to design applications that ensure reliable operations, scale effectively with increasing demand, and remain maintainable for the programmers working on the system.

To grasp the fundamental system design concepts and write quality code, I recommend exploring these books. They also serve as a useful resource for preparing for technical interviews at top companies worldwide. This list combines personal recommendations with general popularity among programmers. Let's get started then!

1. Head First Design Patterns

Authors: Eric Freeman, Elisabeth Robson, Bert Bates, Kathy Sierra
Link: Head First Design Patterns


A personal recommendation! A beginner-friendly guide for System Design Patterns and Architectural Patterns. This guide uses visual aids, flowcharts, and UML diagrams to build up simple examples from scratch. Using Java's object-oriented principles, the book makes it easy to learn prevalent design patterns like the iterator, observer, strategy, and singleton, which are commonly used in production-grade code.

Topics Covered:

  • Creational Patterns (Singleton, Factory Method and Abstract Factory Method)
  • Structural Patterns (Adapter, Facade, Proxy, Decorator)
  • Behavioural Patterns (Strategy, Observer, Iterator, State, Template Method)
  • Composite Patterns
  • Application Architectures and MVC Pattern

2. Patterns of Enterprise Application Architecture

Author: Martin Fowler
Link: Patterns of Enterprise Application Architecture


For those looking for a deeper dive into design patterns, this book is a great resource. It tackles complex design pattern concepts in a theoretical manner, making it a valuable reference guide whenever you're stuck on a design choice. It covers similar topics as the Head First book, but goes further in-depth (detailed explanations and UML diagrams), making it a great resource for software engineers seeking a comprehensive understanding of design patterns.

Topics Covered:

  • Layered Architectures
  • Concurrency
  • Domain Logic and Relational Databases
  • Web Presentation
  • Distributed Systems
  • Design Patterns

3. Clean Architecture

Author: Robert C. Martin
Link: Clean Architecture


This book, written by the renowned Uncle Bob, is part of his highly acclaimed series on Clean Code. He writes from the perspective of a software architect, sharing his insights on the decisions he makes when designing a reliable and scalable system. He emphasizes the importance of independence, decoupling programming choices from specific databases, tools, and languages, making it a must-read for any software developer looking to improve their skills.

Topics Covered:

  • Programming Paradigms (Structured, OOP, Functional)
  • SOLID Design Principles
  • Component Principles (Cohesion, Coupling, Reuse, Closure)
  • Architectural Principles

4. Designing Data-Intensive Applications

Author: Martin Kleppmann
Link: Designing Data-Intensive Applications


Another personal recommendation and one of the most highly detailed books about system design. It thoroughly covers the main principles behind system design and explains why things work the way they do. The book is divided into three major parts: Foundation of Data Systems, Distributed Data, and Derived Data. The first part explores the basic foundations of data storage systems, query languages, and retrieval methods for large-scale systems. The second part focuses on the development of distributed systems, emphasizing the importance of consistent systems. The final part focuses on batch processing and stream processing of large-scale data systems.

Topics Covered:

  • Data Models and Query Languages
  • Storage and Retrieval
  • Replication and Transaction Systems
  • Distributed Systems
  • Consistency
  • Batch Processing
  • Stream Processing

5. System Design Interview

Author: Alex Xu
Link: System Design Interview


Finally, system design is an important part of job interviews at the top tech companies including MAANG. This book by Google engineer Alex Xu is a popular interview preparation material that covers a wide range of topics. It provides a 4-step framework for tackling system design interview questions and features detailed solutions for 16 real-world applications, accompanied by diagrams. Additionally, it explains the design decisions behind major systems like Twitter, Google, and YouTube.

Topics Covered:

  • Interview Process Overview
  • Framework for Interview Process
  • System Design Fundamentals (Caching, Databases, Partitioning, Load Balancing)
  • Architectural Techniques (Monolithic, Microservices, Serverless)
  • Case Studies (Designing a Web Crawler, a Chat System, YouTube, Google Drive, etc.)

Wrapping Up

If you're a beginner feeling confused about where to start, these books are your go-to resources to prepare for your next system design interview. From covering the basic concepts behind data systems to the highly detailed decisions behind popular software systems, these books cover it all. If you feel overwhelmed by the hype around system design, starting here will make it less intimidating.

Kanwal Mehreen Kanwal is a machine learning engineer and a technical writer with a profound passion for data science and the intersection of AI with medicine. She co-authored the ebook "Maximizing Productivity with ChatGPT". As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She's also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.

More On This Topic

  • 2024 Reading List: 5 Essential Reads on Artificial Intelligence
  • Building a Recommender System for Amazon Products with Python
  • Inside recommendations: how a recommender system recommends
  • How a Level System can Help Forecast AI Costs
  • Design Patterns in Machine Learning for MLOps
  • Design Patterns for Machine Learning Pipelines

CyberArk Unveils CORA AI to Empower Enterprises with AI-Driven Identity Security

Identity security company CyberArk has announced the availability of CyberArk CORA AI at its annual conference, CyberArk IMPACT 24. CORA AI is a new set of AI-powered capabilities that will be embedded across CyberArk’s identity security platform.

It will translate vast numbers of identity data points into insights and enable multi-step actions in natural language, empowering users and organisations to secure human and machine identities more efficiently with the right level of privilege controls.

CyberArk offers various identity security solutions as part of its platform. CyberArk Single Sign-On provides secure access management with features like an app catalog, desktop SSO, role-based access policies, multi-factor authentication, directory services integration, user self-service capabilities, and reporting. It is available in Standard and Adaptive plans with an optional App Gateway add-on.

CyberArk Endpoint Privilege Manager delivers comprehensive endpoint security by removing local admin rights, enforcing least privilege, and implementing foundational controls across Windows, macOS and Linux endpoints in hybrid and cloud environments. It integrates with various systems like Microsoft Azure AD, Amazon WorkSpaces, and ServiceNow to streamline privileged access and strengthen overall security posture.

For secure third-party access, CyberArk Vendor Privileged Access Manager enables organisations to provide external vendors with privileged access to critical internal systems without requiring VPNs, agents, or passwords. It offers full session isolation, monitoring and auditing capabilities. Customers across industries like telecom, finance, manufacturing, insurance and education have leveraged the solution to connect vendors while blocking threats.

CyberArk Identity Flows allows orchestrating identity workflows throughout the enterprise without writing code. It can enhance identity lifecycle management, automate responses to security events by triggering workflows, and enable performing tasks from a single application. The visual no-code editor, pre-built connectors, and adaptable workflows help quickly build and deploy identity management processes at scale.

For enabling secure remote access, CyberArk Remote Access integrates with CyberArk Privileged Access Manager to provide biometric multi-factor authentication without VPNs, passwords or agents. Features like zero trust access, just-in-time provisioning for vendors, and session monitoring allow the remote workforce to stay connected and protected.


‘Someday Every Single Car will Have Autonomous Capabilities,’ says Jensen Huang


In a recent interview with Yahoo Finance, NVIDIA chief Jensen Huang stated that besides the cloud industry, the primary users of NVIDIA’s data centre chips are from the automotive industry. He then emphasised the advancements being made in autonomous cars.

“Tesla is far ahead in self-driving cars, but every single car, someday will have to have autonomous capability,” said Huang. The video of the statement was shared by Elon Musk on X.

pic.twitter.com/4awVq8Q4ia

— Elon Musk (@elonmusk) May 23, 2024

The Rise of Autonomous Vehicles

The global autonomous vehicle market was valued at $135.25 billion in 2022 and is anticipated to reach $1,453.30 billion by 2032, growing at a CAGR of 27.42% during the 2023-2032 period.

Interestingly, NVIDIA and Foxconn partnered last year to build AI factories that will help boost EV and autonomous vehicle production.

While Tesla is at the forefront of self-driving technology, other companies are emerging in the space. In India, startups such as Swaayatt Robots are said to have achieved Level 5 autonomy.

NVIDIA’s Rising Growth

Basking in the glory of phenomenal growth, with profit up over 600% YoY in the recent quarter, Huang attributed the growth in sales not just to major cloud service providers such as Amazon, Microsoft and Google. He also stated that companies such as Meta, Tesla and even pharmaceutical companies are procuring NVIDIA chips.

Recent big tech events saw increasing partnership announcements with NVIDIA, with Microsoft even claiming to be among the first cloud providers to offer NVIDIA Blackwell GPUs in B100 and GB200 configurations.


7 Steps to Mastering Data Cleaning with Python and Pandas

Image by Author

Pandas is the most widely used Python library for data analysis and manipulation. But the data that you read from the source often requires a series of data cleaning steps before you can analyze it to gain insights, answer business questions, or build machine learning models.

This guide breaks down the process of data cleaning with pandas into 7 practical steps. We’ll spin up a sample dataset and work through the data cleaning steps.

Let’s get started!

Spinning Up a Sample DataFrame

Link to Colab Notebook

Before we get started with the actual data cleaning steps, let's create a pandas dataframe with employee records. We’ll use Faker for synthetic data generation, so install it first:

!pip install Faker

If you’d like, you can follow along with the same example. You can also use a dataset of your choice. Here’s the code to generate 1000 records:

import pandas as pd
from faker import Faker
import random

# Initialize Faker to generate synthetic data
fake = Faker()

# Set seed for reproducibility
Faker.seed(42)

# Generate synthetic data
data = []
for _ in range(1000):
    data.append({
        'Name': fake.name(),
        'Age': random.randint(18, 70),
        'Email': fake.email(),
        'Phone': fake.phone_number(),
        'Address': fake.address(),
        'Salary': random.randint(20000, 150000),
        'Join_Date': fake.date_this_decade(),
        'Employment_Status': random.choice(['Full-Time', 'Part-Time', 'Contract']),
        'Department': random.choice(['IT', 'Engineering', 'Finance', 'HR', 'Marketing'])
    })

Let’s tweak this dataframe a bit to introduce missing values, duplicate records, outliers, and more:

# Let's tweak the records a bit!
# Introduce missing values
for i in random.sample(range(len(data)), 50):
    data[i]['Email'] = None

# Introduce duplicate records
data.extend(random.sample(data, 100))

# Introduce outliers
for i in random.sample(range(len(data)), 20):
    data[i]['Salary'] = random.randint(200000, 500000)

Now let’s create a dataframe with these records:

# Create dataframe
df = pd.DataFrame(data)

Note that we set the seed for Faker but not for the random module, so there'll be some randomness in the records you generate.
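
If you'd rather have fully reproducible records, you can also seed Python's random module before running the generation loop; this is an optional tweak, not part of the original setup:

import random

random.seed(42)  # makes the random.randint / random.choice calls reproducible too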

Step 1: Understanding the Data

Step 0 is always to understand the business question/problem that you are trying to solve. Once you know that you can start working with the data you’ve read into your pandas dataframe.

But before you can do anything meaningful on the dataset, it’s important to first get a high-level overview of the dataset. This includes getting some basic information on the different fields and the total number of records, inspecting the head of the dataframe, and the like.

Here we run the info() method on the dataframe:

df.info()
Output >>>

RangeIndex: 1100 entries, 0 to 1099
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   Name               1100 non-null   object
 1   Age                1100 non-null   int64
 2   Email              1047 non-null   object
 3   Phone              1100 non-null   object
 4   Address            1100 non-null   object
 5   Salary             1100 non-null   int64
 6   Join_Date          1100 non-null   object
 7   Employment_Status  1100 non-null   object
 8   Department         1100 non-null   object
dtypes: int64(2), object(7)
memory usage: 77.5+ KB

And inspect the head of the dataframe:

df.head()

Output of df.head()

Step 2: Handling Duplicates

Duplicate records are a common problem that skews the results of analysis. So we should identify and remove all duplicate records so that we're working with only the unique data records.

Here’s how we find all the duplicates in the dataframe and then drop all the duplicates in place:

# Check for duplicate rows
duplicates = df.duplicated().sum()
print("Number of duplicate rows:", duplicates)

# Removing duplicate rows
df.drop_duplicates(inplace=True)

Output >>>

Number of duplicate rows: 100

Step 3: Handling Missing Data

Missing data is a common data quality issue in many data science projects. If you take a quick look at the result of the info() method from the previous step, you should see that the number of non-null objects is not identical for all fields, and that there are missing values in the 'Email' column. We’ll get the exact count nonetheless.

To get the number of missing values in each column you can run:

# Check for missing values
missing_values = df.isna().sum()
print("Missing Values:")
print(missing_values)

Output >>>

Missing Values:
Name                  0
Age                   0
Email                50
Phone                 0
Address               0
Salary                0
Join_Date             0
Employment_Status     0
Department            0
dtype: int64

If there are missing values in one or more numeric columns, we can apply suitable imputation techniques. But because it's the 'Email' field that has missing values, let's just set them to a placeholder email like so:

# Handling missing values by filling with a placeholder
df['Email'].fillna('unknown@example.com', inplace=True)

Step 4: Transforming Data

When you’re working on the dataset, there may be one or more fields that do not have the expected data type. In our sample dataframe, the 'Join_Date' field has to be cast into a valid datetime object:

# Convert 'Join_Date' to datetime
df['Join_Date'] = pd.to_datetime(df['Join_Date'])
print("Join_Date after conversion:")
print(df['Join_Date'].head())

Output >>>

Join_Date after conversion:
0   2023-07-12
1   2020-12-31
2   2024-05-09
3   2021-01-19
4   2023-10-04
Name: Join_Date, dtype: datetime64[ns]

Because we have the joining date, it's actually more helpful to have a `Years_Employed` column as shown:

# Creating a new feature 'Years_Employed' based on 'Join_Date'
df['Years_Employed'] = pd.Timestamp.now().year - df['Join_Date'].dt.year
print("New feature 'Years_Employed':")
print(df[['Join_Date', 'Years_Employed']].head())

Output >>>

New feature 'Years_Employed':
   Join_Date  Years_Employed
0 2023-07-12               1
1 2020-12-31               4
2 2024-05-09               0
3 2021-01-19               3
4 2023-10-04               1

Step 5: Cleaning Text Data

It’s quite common to run into string fields with inconsistent formatting or similar issues. Cleaning text can be as simple as applying a case conversion or as hard as writing a complex regular expression to get the string to the required format.

In the example dataframe that we have, we see that the 'Address' column contains many '\n' (newline) characters that hinder readability. So let's replace them with spaces like so:

# Clean address strings
df['Address'] = df['Address'].str.replace('\n', ' ', regex=False)
print("Address after text cleaning:")
print(df['Address'].head())

Output >>>

Address after text cleaning:
0    79402 Peterson Drives Apt. 511 Davisstad, PA 35172
1     55341 Amanda Gardens Apt. 764 Lake Mark, WI 07832
2                 710 Eric Estate Carlsonfurt, MS 78605
3                 809 Burns Creek Natashaport, IA 08093
4    8713 Caleb Brooks Apt. 930 Lake Crystalbury, CA...
Name: Address, dtype: object

Step 6: Handling Outliers

If you scroll back up, you’ll see that we set some of the values in the 'Salary' column to be extremely high. Such outliers should also be identified and handled appropriately so that they don’t skew the analysis.

You’ll often want to factor in what makes a data point an outlier (whether it's an incorrect data entry or an actually valid value). You may then choose how to handle them: drop the records with outliers, or get the subset of rows with outliers and analyze them separately.

Let's use the z-score and find those salary values that are more than three standard deviations away from the mean:

# Detecting outliers using z-score
z_scores = (df['Salary'] - df['Salary'].mean()) / df['Salary'].std()
outliers = df[abs(z_scores) > 3]
print("Outliers based on Salary:")
print(outliers[['Name', 'Salary']].head())

Output >>>

Outliers based on Salary:
                Name  Salary
16    Michael Powell  414854
131    Holly Jimenez  258727
240  Daniel Williams  371500
328    Walter Bishop  332554
352     Ashley Munoz  278539
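
If you decide to act on these outliers, a minimal sketch of the two options mentioned above (set them aside for separate analysis, or drop them from the working dataframe) could look like this; it simply reuses the z_scores computed above and is not part of the original walkthrough:

# Option 1: keep the outliers aside for separate analysis
salary_outliers = df[abs(z_scores) > 3].copy()

# Option 2: continue the analysis on a dataframe without the outliers
df_no_outliers = df[abs(z_scores) <= 3].copy()
print("Rows before:", len(df), "| after dropping outliers:", len(df_no_outliers))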

Step 7: Merging Data

In most projects, the data that you have may not be the data you’ll want to use for analysis. You have to find the most relevant fields to use and also merge data from other dataframes to get more useful data that you can use for analysis.

As a quick exercise, create another related dataframe and merge it with the existing dataframe on a common column such that the merge makes sense. Merging in pandas works very similarly to joins in SQL, so I suggest you try that as an exercise!
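
For instance, here is a minimal sketch of such a merge, assuming a hypothetical departments lookup table keyed on the existing 'Department' column (the locations are invented for illustration):

# Hypothetical lookup table keyed on the existing 'Department' column
departments = pd.DataFrame({
    'Department': ['IT', 'Engineering', 'Finance', 'HR', 'Marketing'],
    'Location': ['Bangalore', 'Pune', 'Mumbai', 'Delhi', 'Chennai'],
})

# Left join keeps every employee record and adds the matching location
df = df.merge(departments, on='Department', how='left')
print(df[['Name', 'Department', 'Location']].head())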

Wrapping Up

That's all for this tutorial! We created a sample dataframe with records and worked through the various data cleaning steps. Here is an overview of the steps: understanding the data, handling duplicates, missing values, transforming data, cleaning text data, handling outliers, and merging data.

If you want to learn all about data wrangling with pandas, check out 7 Steps to Mastering Data Wrangling with Pandas and Python.

Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she's working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.

More On This Topic

  • 7 Steps to Mastering Data Cleaning and Preprocessing Techniques
  • 7 Steps to Mastering Data Wrangling with Pandas and Python
  • Collection of Guides on Mastering SQL, Python, Data Cleaning, Data…
  • Mastering the Art of Data Cleaning in Python
  • Data Cleaning with Pandas
  • 5 Simple Steps to Automate Data Cleaning with Python

Snowflake Proves AI is Not Just at the Tip of its Iceberg


Snowflake’s latest quarterly numbers have made Wall Street happy. The company issued a stronger-than-expected sales forecast for the current quarter, indicating that its new generative AI-focused products are driving faster growth.

Snowflake has announced that its product revenue is projected to be between $805 million and $810 million for the current quarter ending in July. This surpasses forecasts from analysts who projected a revenue of $787.5 million. Additionally, the company increased its annual product sales projection to $3.3 billion from $3.25 billion.

An incredible quarter for @SnowflakeDB ! Our core business remains strong. Product Revenue for the quarter was $790 million, up 34% year-over-year. Given the strong quarter we’re increasing our product revenue outlook for the year 📈.
As CEO, I charted three key priorities:

— sridhar (@RamaswmySridhar) May 22, 2024

Snowflake CEO Sridhar Ramaswamy said, “Our AI products, now generally available, are generating strong customer interest. They will help our customers deliver effective and efficient AI-powered experiences faster than ever.” Ever since Ramaswamy joined Snowflake after its acquisition of his company, Neeva, generative AI has been one of its biggest focuses.

All About Generative AI

Snowflake had been negotiating to acquire generative AI startup Reka AI for over $1 billion, but the discussions ended without an agreement. In April, Snowflake introduced its own LLM suite called Arctic and now allows customers to utilise third-party AI models, such as Mistral and Meta, on their data within the company’s platform, dubbed Snowflake Cortex.

But apart from the Reka AI deal that fell through, Snowflake announced a definitive agreement to acquire TruEra, an AI startup specialising in tools for testing, debugging, and monitoring machine learning models and large language model applications in production.

Snowflake announced plans to “acquire certain technology assets and hire key employees” from the AI-focused startup, which raised $25 million in 2022.

On the acquisition, TruEra co-founder, president, and chief scientist Anupam Datta said, “We are looking forward to this next phase in our journey with the Snowflake team with whom we share a commitment to delivering effective & trustworthy generative AI and predictive ML at scale.”

We are excited to share that @SnowflakeDB has signed an agreement to acquire the TruEra AI Observability platform to bring LLM and ML Observability to its AI Data Cloud. We are looking forward to this next phase in our journey with the Snowflake team with whom we share a…

— Anupam Datta (@datta_cs) May 22, 2024

This acquisition marks Snowflake’s sixth significant investment to enhance the capabilities of its data cloud and its third major initiative in the data observability space. Prior to this, Snowflake had invested in two monitoring solutions companies – Observe and Metaplane.

In a blog post, Snowflake highlighted that the TruEra acquisition will help the company ensure more accuracy and trustworthiness in the data used for training AI models.

TruEra has been instrumental in solving the black box problem in AI, and the team behind the startup are experts in RAG-based solutions. Following the acquisition, all three co-founders will be joining Snowflake alongside the TruEra team.

This AI-focused approach is enabling Snowflake to better compete with others in the field, such as Databricks, which offers similar services and, interestingly, has employed a similar acquisition strategy ever since it acquired MosaicML.

In an exclusive interview with AIM, Snowflake head of AI Baris Gultekin said that he had worked with Ramaswamy for over 20 years at Google, calling him an incredible leader. “Sridhar brings incredible depth in AI as well as data systems. He has managed super large-scale data systems and AI systems at Google,” Gultekin said.

Gultekin further said that Snowflake is developing LLMs at a very affordable price, prioritising the security of their customers’ data. “Despite using a 17x less compute budget, Arctic is on par with Llama 3 70B in language understanding and reasoning while surpassing enterprise metrics,” said Gultekin.

The Microsoft Fabric and NVIDIA Spread

In addition, Microsoft announced an expanded partnership with Snowflake, aiming to deliver a seamless data experience for customers. As part of this, Microsoft Fabric‘s OneLake will now support Apache Iceberg and facilitate bi-directional data access between Snowflake and Fabric.

Very excited to expand our partnership with @Microsoft to improve interoperability through Apache Iceberg ! 🤝

— sridhar (@RamaswmySridhar) May 22, 2024

OneLake, a unified, SaaS-based open data foundation, was launched by Microsoft with the introduction of Fabric. The foundation underscores the company’s commitment to open standards. The support for Iceberg, alongside Delta Lake in Microsoft Fabric OneLake, further enhances this commitment.

In essence, Snowflake can store data in Iceberg’s format in OneLake. Data written by either Snowflake or Fabric will be accessible in both Iceberg and Delta Lake formats through XTable translation in OneLake. Snowflake can read any Fabric data artefact in OneLake, whether stored physically or virtually, through shortcuts.

And that’s not all.

In a recent interview, Ramaswamy revealed that the cloud data company plans to deepen its collaboration with AI powerhouse NVIDIA. “We collaborated with NVIDIA on a number of fronts – our foundation model Arctic was, unsurprisingly, done on top of NVIDIA chips. There’s a lot to come, and Jensen’s, of course, a visionary when it comes to AI,” Ramaswamy said.

Snowflake is expected to make a lot more announcements at its Data Cloud Summit this June. As Ramaswamy said, “Our product pipeline, especially in AI, has been in overdrive. The era of enterprise AI is here, right here at Snowflake.”
