The Roles and Responsibilities of Data-centric Developers

When encountering the labels “data-driven” and “data-centric”, one might first assume that they mean the same thing. Even those who understand that the two differ often use the labels interchangeably when explaining the difference. For the business user and for the developer, a clear distinction between the two is essential. We will primarily focus here on the data-centric development paradigm. But first, let us examine data-driven.

Depending on perspective, data-driven can be either wrong or right. It can be wrong from a strategic perspective: business applications and use cases must align with the business mission and objectives, therefore development activities associated with data (e.g., analytics, decision science, data science, AI) must primarily be results-driven, business-driven, or mission-driven. In other words, the strategy must focus on the outcomes (the outputs of the development activities associated with data), not on the inputs (the data). Every organization has data, especially now that nearly every organization holds massive quantities of digital assets. Organizations, especially large ones, also have lots of buildings and other physical assets, but we wouldn’t say that those things are what drive the organization. The business is strategically pulled forward (driven) by its mission, goals, and objectives, not by its data.

Conversely, data-driven can be correct from a technology and innovation perspective: the emergence of new technologies (mostly digital) inspires the development of innovations that are enabled by, fueled by, and sustained by the large, diverse data assets of the modern enterprise, therefore the data does push forward (drive) these innovations and developments.

So, if the mission pulls and the data pushes, then what is different and essential about being data-centric? Not to overplay it, we could simply say that data-centric is at the fulcrum of push and pull. That’s where developers operate. They leverage the data to design, develop, and produce the desired business outcomes. That’s the key to better outcomes from development activities associated with data.

One key reason why this matters is that model-centric development tends to focus most effort on building, validating, and tuning the application, assuming that the data are fixed. This can lead to a situation where the model is built correctly, but it is not the correct model. In other words, the traditional model-centric development approach can drive you safely to the wrong destination. That’s not good enough in a dynamic business environment and an evolving world. As an example, the rapid change in customer buying habits, network usage behaviors, and business processes during the pandemic caused many applications to “give the wrong answer” or deliver the wrong experience. Consequently, developers must systematically improve and adjust their similarly dynamic datasets to increase the accuracy of their applications in production. Therefore, the data-centric approach uses data to define what should be developed in the context of the interests and needs of the stakeholders, as informed by the data generated from the organization’s operations.

There are additional important characteristics of data-centric development. For example, this approach encourages the use of the “right” data, not just “big data”. Sometimes the right data is relatively small and focused, such as in a hyper-personalized customer marketing campaign. Also, the right data might (or should) include 3rd-party data sources external to the business, thereby significantly improving the model outcomes as they incorporate contextual data on customer market demand, economic conditions, the competitive business landscape, social network customer sentiment, supply chain situational awareness, and more. The data-centric developer can then not only learn to develop more with less data, but also to develop better with less data.

To maximize the quality of data-centric development, there must be a real emphasis on the centrality of the data. Consequently, there is more attention given to improving data quality (through data profiling), providing consistent data labeling (particularly for machine learning models), developing deeper data understanding (through exploratory data analysis), producing smarter model inputs (through feature engineering and experimentation), and identifying the actionable insights from outliers and anomalous data values.
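As a concrete illustration of what that attention to the data can look like in practice, here is a minimal sketch of basic profiling and a simple outlier screen using pandas; the file name and column names are hypothetical and purely illustrative.

```python
import pandas as pd

# Hypothetical operational dataset; file and column names are illustrative only.
df = pd.read_csv("customer_transactions.csv")

# Basic data profiling: volume, types, missing values, and summary statistics.
print(df.shape)
print(df.dtypes)
print(df.isna().mean().sort_values(ascending=False))  # fraction missing per column
print(df.describe(include="all").T)

# Simple outlier screen on a numeric column using the interquartile range (IQR).
q1, q3 = df["purchase_amount"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["purchase_amount"] < q1 - 1.5 * iqr) |
              (df["purchase_amount"] > q3 + 1.5 * iqr)]
print(f"{len(outliers)} potential outliers flagged for review")
```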

These data workflow orchestration activities can naturally lead to the development of a well-indexed data catalog and data fabric that will power rapid and easy data discovery, data access, and data delivery for developers at the moment of need. In fact, the principles of versioning, repositories, check-in, check-out, provenance, rollbacks, unit and regression testing, etc. from traditional software development best practices can be applied beneficially to these data workflow activities. For example, we are already seeing increased attention to data version control platforms. A systematic approach to these different steps can smoothly blend data-centrality into development activities, which are then less likely to miss the mark on required business objectives.

Another value proposition for the business from data-centric development is the establishment of a portfolio mindset in which the developed data products can be packaged across a spectrum of applications (such as descriptive analytics, diagnostic analytics, predictive analytics, and prescriptive analytics), with complexity in each category ranging from simple day-to-day operational tools up to major innovation initiatives. The success of the smaller data-centric development projects will build confidence, broader engagement, and greater advocacy across the organization (from top-level management to the end-user business workers) for larger (perhaps high-risk high-reward) development products and services. The portfolio framework places each of these projects logically into a digestible and easily understandable data-centric development roadmap.

An added benefit of the portfolio approach to data-centric development is that the connections, interactions, and communication channels between developers and the rest of the business (executives, end-users, and other stakeholders) are strengthened across many touchpoints and threads. This, in turn, builds a culture of data-sharing, data democratization, data literacy, and data experimentation. The benefits of these culture shifts can be significant in advancing the overall digital transformation of the organization.

One can itemize the following characteristic benefits to the broader enterprise derived from data-centric development: actionable insights discovery, continual business growth, improved project outcomes, optimized operations, prediction of emerging trends and requirements, and the verticalization of development (since each data application is very domain- and task-specific, so that general models and code are not sufficient).

What are some concrete steps that developers can take to become more data-centric? First, recognize that if data-driven can drive you to the wrong destination, then data-centric can avoid such an outcome through regular monitoring of data and feedback (mid-course corrections). Second, improving a model for hard, rare cases through model tuning and tweaking may yield a slightly better result after much effort, but a better result often comes more easily and quickly by giving more attention to tuning, tweaking, selecting, improving, and cleaning the data. Third, when the model gives poor results for some cases, focus on the corresponding input data for those cases to see if there are poor, inconsistent, or inaccurate labels, or simply insufficient data examples for that case. If possible, create some synthetic data (augmented data) to boost the underrepresented classes in an unbalanced input dataset. The effort spent on cleaning and improving the data will be time well spent that benefits future development projects, unlike singularly model-specific tweaks made on the current project.
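To make that third step concrete, here is a minimal sketch of boosting an underrepresented class by random oversampling with scikit-learn; the file and column names are hypothetical, and libraries such as imbalanced-learn can generate genuinely synthetic (SMOTE-style) examples instead.

```python
import pandas as pd
from sklearn.utils import resample

# Hypothetical labeled dataset with an underrepresented positive class.
df = pd.read_csv("claims_labeled.csv")
majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Randomly oversample the minority class until the classes are balanced.
# (Libraries such as imbalanced-learn offer SMOTE-style synthetic examples instead.)
minority_boosted = resample(minority,
                            replace=True,
                            n_samples=len(majority),
                            random_state=42)
balanced = pd.concat([majority, minority_boosted]).sample(frac=1, random_state=42)
print(balanced["label"].value_counts())
```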

In summary, data-centric developers are focused intentionally on taming and leveraging massive enterprise and external data assets to increase business agility and to empower data consumers, which includes internal teams, business partners, and customers. Consequently, these types of developers are expanding their sphere of influence on IT purchasing decisions and maximizing the value of data, which is what every data-drenched corporate executive team and/or board of directors likes to see.

Kirk Borne, founder and owner of Data Leadership Group LLC

Achieving mainframe reliability with distributed scale

About 70% of the Fortune 500 use mainframes for core business functions, according to BMC Software. There is good reason for that.

Mainframes were designed for both raw processing power and reliability with redundant components, error correction, journaling, and other key features, which provide what IBM calls “RAS”—Reliability, Availability, and Serviceability. However, new challenges have emerged: data volumes are growing, 79% of mainframe users can’t find the talent they need (according to Deloitte), and four out of five users report they need more frequent application updates. While not all roads lead to microcomputing or the cloud, some certainly do. The question is: how can you achieve mainframe reliability with distributed scale?

Redundancy is Critical

Mainframes achieve reliability with redundancy, and so do distributed environments—they just do it differently. Rather than having multiple redundancies based on hardware and the operating system, distributed environments have redundant physical machines, locations, and networks. However, redundancy is not as automated as it is in mainframe systems. Just as z/OS can be configured to provide full or no redundancy at multiple layers, including the network and storage layers, so can a distributed environment. But in distributed systems, deliberate choices must be made in terms of software and configuration in order to achieve redundancy.

Both public and private cloud environments usually provide various levels of network redundancy in terms of switches and routers. In a distributed environment, DNS and a load balancer decide which virtual (and physical) machine receives a request. If that system is down, then the request should be sent elsewhere. This means that multiple VMs must be set up with the same service in order to receive the request. In modern environments, these are usually configured as “containers” and managed by Kubernetes. Kubernetes can also be configured to restart them when they fail. Multiple containers running the same software, with redundant DNS and load balancers routing the requests on a reliable network backbone, is the equivalent of a mainframe running multiple VMs with redundant hardware. Kubernetes provides the same kind of functionality as z/OS in terms of managing these separate virtual machines.
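As a rough illustration of that equivalence, the sketch below uses the official Kubernetes Python client to declare three replicas of the same container, which Kubernetes keeps running and reschedules on failure. The service name, image, and replica count are illustrative; in practice this is more often expressed as a YAML manifest applied with kubectl.

```python
from kubernetes import client, config

config.load_kube_config()  # assumes a local kubeconfig pointing at the cluster
apps = client.AppsV1Api()

# Three replicas of the same container; Kubernetes restarts or reschedules any that fail.
deployment = client.V1Deployment(
    api_version="apps/v1",
    kind="Deployment",
    metadata=client.V1ObjectMeta(name="orders-api"),
    spec=client.V1DeploymentSpec(
        replicas=3,
        selector=client.V1LabelSelector(match_labels={"app": "orders-api"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "orders-api"}),
            spec=client.V1PodSpec(containers=[
                client.V1Container(name="orders-api",
                                   image="example.com/orders-api:1.0",  # illustrative image
                                   ports=[client.V1ContainerPort(container_port=8080)])
            ]),
        ),
    ),
)
apps.create_namespaced_deployment(namespace="default", body=deployment)
```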

Having redundant containers configured on different machines is utterly useless if the data is not there. In cloud computing environments, this often means a form of network storage is needed, such as AWS’s Elastic Block Storage (EBS), which provides fault tolerance at the disk layer. However, this may not be sufficient by itself to provide reliability for a database. Distributed SQL databases provide semantics similar to Db2 but ensure data is replicated across multiple machines and even multiple data centers (called availability zones in cloud computing parlance) or geographic regions. By ensuring the network, services, and data have multiple redundancies, distributed environments can match mainframe characteristics.

Distributed environments can provide the same level of redundancy as a mainframe with a bit more configuration and planning, and can go even further by replicating services and data across multiple availability zones (data centers that are near each other) and geographic regions. This ensures that services stay up even if connectivity is lost to a data center or if a regional event makes an entire area unavailable. For instance, a service can fail over to Ohio if a hurricane renders the data centers in Virginia inoperative.

Managing the Tyranny of Choice

While in the mainframe world there is frequently a “right answer,” the distributed world has many right answers—especially in software. This can be a bit confusing and complicated. A few new technologies can make this easier.

Software as a service (SaaS) vendors can provide a managed version of a database or similar service without requiring as much configuration and administration. These are not all created equal, and some can be “black boxes” that turn out not to provide actual redundancy or high availability. It is important to understand how a SaaS vendor provides high availability and how they manage upgrades—and their associated outages.

GitOps provides an agent-based, version-controlled configuration management system. Changes are checked into revision control (usually git) and are applied automatically by software agents. Changes can be rolled back either manually or automatically if they cause unavailability.

Flox is a package management system based on the Nix package management system. It allows the software to be described and configured declaratively. These Nix packages can be published as containers and installed in Kubernetes.

Part of providing reliability is ensuring correct configuration and administration. The complexity of distributed systems increases the chance of misconfiguration. Using software as a service reduces the overall burden, at a cost, whereas modern packaging and configuration management systems can reduce the complexity of maintaining a distributed system.

By using modern tools and configuring multiple levels of redundancy, modern distributed systems can achieve mainframe-level reliability. It is critical to consider a modern database, such as a distributed SQL database, to ensure the data backing services are also reliable. Ultimately, making these choices yields a better business value for companies across industries.

The observer effect in a multi-layered neural network

The objective of this blog post is to show that the observer effect, which is so puzzling in our physical world, has a logical explanation for a layer in a multi-layered neural network, and that the explanation involves a learning process. This post expands and further elaborates on a previous blog post by the author (Seidou Sanda, 2023). Our analysis is built upon the free energy principle proposed by Friston (Schwartenbeck et al., 2013), which posits that a structure trying to survive within a given environment will develop an internal representation of that environment that minimises its surprise. It is under this framework that we examine the learning process of a multilayer neural network, which similarly aims to adapt to a specific environment by constructing a representation that reduces its surprise.

We are therefore assuming that a multilayer neural network is learning to survive in a specific environment and is therefore building a representation of its environment that minimises its surprise. The input for the neural network represents the true objective reality, while the output corresponds to the learned representation of this external reality. Consequently, the output can be regarded as the subjective world model of the neural network. Given the multi-layered structure of the network, the internal representation of the world is incrementally constructed through a series of successive layers. Each layer receives the representation of the world generated by the previous layer as its external reality, constructs its own internal representation, and subsequently passes this information to the succeeding layer. As a result, the neural network encompasses a system of nested internal representations, thereby creating an intricate hierarchy of interrelated worldviews.
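A minimal sketch of that nesting, using nothing but numpy: each layer sees only the previous layer's output, never the raw input. The layer sizes and random weights are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(x, n_out):
    """One dense layer with a tanh nonlinearity: its output is the 'worldview'
    it passes on as the next layer's external reality."""
    w = rng.normal(size=(x.shape[-1], n_out))
    return np.tanh(x @ w)

objective_reality = rng.normal(size=(1, 8))   # the network's raw input
view_1 = layer(objective_reality, 6)          # first layer's internal representation
view_2 = layer(view_1, 4)                     # built only from view_1, not the raw input
view_3 = layer(view_2, 2)                     # and so on: nested, successively abstracted views
for name, v in [("layer 1", view_1), ("layer 2", view_2), ("layer 3", view_3)]:
    print(name, v.round(2))
```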

Now, let’s look at the perspective of a neuron in a hidden layer of that structure. The neuron is unable to directly access the initial input (which could be seen as the “actual” objective reality), and instead, it obtains input that has undergone processing by earlier layers. If we think of the input the neuron gets as its external reality and the output as the subjective reality it constructs from its understanding of the objective reality, then we can perceive the different representations generated by neurons within a multi-layered neural network as a hierarchy of interconnected perspectives.

The notion of nested worldviews proposes that one structure’s external reality can serve as the subjective worldview for a higher-level structure. This higher-level structure, in turn, uses the external reality to form its internal worldview, which then becomes the subjective worldview of an even higher-level structure, and so on. This idea complements existing theories such as reality as information, reality as a simulation, and the supposed existence of lower and higher levels of consciousness in meditation practices.

One interesting feature of the nested worldviews model is its potential to explain the quantum wave function collapse in a simple manner. This so far mysterious phenomenon can be seen as a mere update of the internal representation of a higher-level structure in response to an observation made by a lower-level structure. For example, consider a red neuron within a neural network. This neuron receives information that has already been processed by previous layers and lacks access to objective external inputs (reality). Despite this limitation, the red neuron can learn from its input, update its state, and prompt an update of the entire neural network through a process called backpropagation. This change subsequently alters the input the neuron receives – an observation therefore leads to changes in external reality, the central enigma in quantum physics.
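The mechanical part of this claim is easy to demonstrate in a few lines. The hedged sketch below (using PyTorch, with arbitrary sizes and data) shows that after a single backpropagation step, the activations the second layer receives from the first layer are no longer the same: the "external reality" seen by downstream units has shifted.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer1, layer2 = nn.Linear(4, 3), nn.Linear(3, 1)   # a tiny two-layer network
x, y = torch.randn(16, 4), torch.randn(16, 1)

hidden_before = torch.tanh(layer1(x)).detach()       # what the second layer "observes" now

# One learning step driven by the network's output error.
loss = nn.functional.mse_loss(layer2(torch.tanh(layer1(x))), y)
loss.backward()
with torch.no_grad():
    for p in list(layer1.parameters()) + list(layer2.parameters()):
        p -= 0.1 * p.grad

hidden_after = torch.tanh(layer1(x)).detach()
# The upstream representation has changed, so the input the downstream
# units receive is different after the update.
print((hidden_before - hidden_after).abs().max())
```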

The nested worldviews model can also shed light on the retrocausality phenomenon (Leifer & Pusey, 2017; Chiou, 2022). In this scenario, the objective time of a lower-level structure is merely the subjective timeline that the higher-level structure continually adjusts based on new information. Retrocausality occurs because our beliefs about the past can change as a result of what we learn today.

The nested worldviews theory aligns with several existing concepts, such as the learning universe theory (Alexander et al., 2021), the centrality of learning in life, and the idea that networks of observers (neurons) create objective reality (Lanza & Berman, 2010).

At first glance, it may seem counterintuitive to conceive that the solid, real reality we perceive is, in fact, the imagination or knowledge state of another entity. However, this notion is not entirely dissimilar from the implications of the simulation hypothesis.

The nested worldviews theory has far-reaching implications for our understanding of reality and how we perceive the world around us. By acknowledging the interconnected layers of interpretation that shape our reality, we can develop a more nuanced, empathetic, and inclusive approach to various aspects of our lives, from personal relationships to global concerns.

Our world is a complex web of interrelated systems, and many aspects of our lives involve nested structures. These structures encompass nested mathematical structures, nested biological organisms, nested ecosystems, nested brains, and nested computing in the form of decentralized computing. In light of these pervasive nested structures, the idea of nested worldviews as a natural aspect of our world becomes more compelling.

A nested worldview refers to a multi-layered approach to understanding reality, where each layer exerts an influence on the layers above and below it, shaping their comprehension of the world. In line with this framework, our external reality is an outcome of the worldview crafted by the layers that preceded us. As we interpret this reality, we develop our own unique perspective, which is subsequently passed on to the following layers. This continuous process of interpretation and transmission underscores the fluid nature of our understanding of reality, which is constantly evolving as new perspectives are created and disseminated.

The concept of nested worldviews offers numerous benefits when it comes to deciphering the world around us. By recognizing the multi-layered nature of reality, we are better prepared to appreciate the complexity and interconnectedness of the systems that govern our lives. This may be key in finding solutions to the current global challenges that arise from the world being a system of interdependent systems.

Furthermore, the dynamic nature of nested worldviews enables the continuous development of our understanding of reality. This adaptability is crucial in a rapidly changing world, as it encourages us to remain receptive to new ideas and viewpoints, fostering a culture of innovation and progress. It implies that our understanding of reality should never be static, but rather in a perpetual state of transformation as new insights are acquired and disseminated. This adaptability is vital for our survival in an increasingly complex world with global problems such as environmental preservation and climate action, emphasizing the importance of embracing nested worldviews as an inherent aspect of our existence.

The concept of nested worldviews brings forth an optimistic and refreshing perspective, suggesting that there is not just a simple dichotomy between objective and subjective reality, but rather a diverse and ever-evolving landscape of interpretations. This idea supports the notion that the reality we perceive is an adaptive and dynamic dashboard shaped by evolution (Hoffman, 2019; Kastrup, 2019). It encourages a more fluid and vibrant understanding of the world, where multiple perspectives enrich one another.

The implications of nested worldviews are profound and far-reaching, impacting not only our self-perception but also our understanding of the world around us. This concept has the potential to foster growth and innovation across various fields of study. It presents an inspiring and insightful glimpse into the true nature of reality and our comprehension of it. Let’s hope that by recognizing and embracing the layers of interpretation that shape our world, we can cultivate a more empathetic, inclusive, and holistic approach to all aspects of our lives and ecosystems, ranging from personal relationships to global concerns such as global warming. This appreciation for the intricate tapestry of perspectives that constitute our collective reality may well pave the way for a more harmonious, interconnected, and united world. It may also add an element to the debate on whether AI will ever become conscious or not.

References:

Alexander, S., Cunningham, W. J., Lanier, J., Smolin, L., Stanojevic, S., Toomey, M. W., & Wecker, D. (2021). The autodidactic universe. arXiv preprint arXiv:2104.03902.

Chiou, D. W. (2022). Delayed-choice quantum eraser and the EPR paradox. arXiv preprint arXiv:2210.11375.

Hoffman, D. (2019). The case against reality: Why evolution hid the truth from our eyes. WW Norton & Company.

Kastrup, B. (2019). The idea of the world: a multi-disciplinary argument for the mental nature of reality. John Hunt Publishing.

Lanza, R., & Berman, B. (2010). Biocentrism: How life and consciousness are the keys to understanding the true nature of the universe. BenBella Books, Inc..

Leifer, M. S., & Pusey, M. F. (2017). Is a time symmetric interpretation of quantum theory possible without retrocausality?. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 473(2202), 20160607.

Schwartenbeck, P., FitzGerald, T., Dolan, R. J., & Friston, K. (2013). Exploration, novelty, surprise, and free energy minimization. Frontiers in psychology, 4, 710.

Seidou Sanda, I. (2023). Nested Worldviews: Can Neural Networks Unlock the Secrets of Reality? Medium, https://medium.com/@issoufouseidousanda/nested-worldviews-can-neural-networks-unlock-the-secrets-of-reality-4304f7057a9e

LLMs Emergent Abilities: Explainable AI and the Human Mind

There is a recent article in The Economist, Large, creative AI models will transform lives and labour markets, describing how LLMs work. It states that “First, the language of the query is converted from words, which neural networks cannot handle, into a representative set of numbers. GPT-3, which powered an earlier version of Chatgpt, does this by splitting text into chunks of characters, called tokens, which commonly occur together. These tokens can be words, like “love” or “are”, affixes, like “dis” or “ised”, and punctuation, like “?”. GPT-3’s dictionary contains details of 50,257 tokens.”

“The LLM then deploys its “attention network” to make connections between different parts of the prompt. Its attention network slowly encodes the structure of the language it sees as numbers (called “weights”) within its neural network. Emergent abilities are all represented in some form within the LLMs’ training data (or the prompts they are given) but they do not become apparent until the LLMs cross a certain, very large, threshold in their size.”
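For readers who want to see tokenization first-hand, here is a small sketch using OpenAI's open-source tiktoken library with the GPT-2-era encoding, which has the 50,257-entry vocabulary the article cites; it assumes tiktoken is installed, and the sample text is arbitrary.

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")     # the 50,257-token vocabulary used by GPT-2/GPT-3
print(enc.n_vocab)                      # 50257

text = "Tokenisation is surprisingly simple?"
ids = enc.encode(text)
print(ids)                              # the "representative set of numbers" fed to the model
print([enc.decode([i]) for i in ids])   # the character chunks each number stands for
```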

The article mentions chunks of characters and autoregression. The first in the series left out some important parts of LLMs including parameters, pre-training, and so on. AI is not the human brain, but AI has a mind. The inner workings of LLMs including emergence, or emergent abilities, properties, or phenomena operate like a mind.

The overall function of any mind is to know. It may arise from a complex organ like the brain, it may be in the form of memory in a single-celled organism, or from an object system like a computer, or from human brain cells or organoids integrated into a murine nervous system.

Neurons or whatever simulates them result in the making of a mind or something with a broad mechanism for knowing. Feelings, emotions, reactions, knowledge, language, and so on, are known. Some systems do not have artificial neural networks but are able to know to an extent.

Neural networks were built on premises intended to mimic the brain. Many of those premises led to progress, but did not exactly capture the mind. It is often said that the brain, or more precisely the mind, generates predictions. It is this prediction generation, or predictive coding and processing against errors, that shaped LLMs.

The mind, however, does not make predictions. It functions in a way that appears so, but it does not. Cells and molecules of the brain structure, organize, construct or build the components of the mind. It is the components of the mind that operate what is labeled predictions.

When someone is speaking, typing, listening, or signing, there is often preparation for what may come next in the mind. Sometimes, it may seem like it should be one thing, but it is another. Other times, nothing may present. It is also this preparation that is sometimes used to recall things or the same way that something is set up to be remembered.

There is no exclusive prediction function in the mind. The mind, conceptually, has quantities and properties. Quantities relay to acquire properties to varying extents. It is the property that gets acquired in a moment that determines what an experience is, or simply what is known.

Quantities have early splits or go before, where some in a beam head in a direction like before so that others simply follow. If the input matches, there would be no changes, if not, the following quantities head in the right direction. This explains what is labeled as predictions. Quantities have old and new sequences. They can also be prioritized or pre-prioritized. Prioritization is what attention for transformers simulates.

Properties have thin and thick shapes. They have a principal spot where one goes to have the most domination. They also have bounce points. A thick property can merge some of its contents, resulting in creativity. Properties can be formed by quantities. Some properties are also natural, enabling things for humans that other organisms do not have.

How does the human mind work to be useful for explainable AI or interpretability, towards alignment? The human mind has a structure, functions, and components. For all it does for internal and external senses, how does it work, including for sentience, or knowing? Some of the answers to the unknowns for AI could emanate from the mind, boosting transparency.

David Stephen does research in theoretical neuroscience. He was a visiting scholar in medical entomology at the University of Illinois, Urbana-Champaign, UIUC. He did research in computer vision at Universitat Rovira i Virgili, URV, Tarragona.

6 signs your data warehouse needs a makeover

Data warehouses are essential in today’s data-driven business environment for storing and analysing massive amounts of data to enable decision-making. However, as businesses grow and data needs change, data warehouses can become outdated and struggle to keep up with evolving requirements.

In this blog, let’s explore six warning signs that indicate it’s time to modernize your data warehouse.

Slow query performance

Slow query performance is one of the most prevalent indicators that your data warehouse needs to be updated. If your queries take longer than usual to run, or if it takes longer to produce reports than it used to, it may be time to consider upgrading your data warehouse. Decision-making, productivity, and customer satisfaction can all be significantly impacted by slow query speed.

Slow query performance can be caused by a number of factors, including an increase in the volume of data stored in the warehouse, outdated hardware, or a lack of optimization. In some cases, it may be possible to optimize the performance of your existing data warehouse through tuning and indexing. However, in many cases, upgrading to a modern data warehouse solution is the best way to ensure optimal query performance and data processing speed.

Limited scalability

Another warning sign that your data warehouse needs a makeover is limited scalability. Data warehouses must be able to scale as data quantities increase to keep up with rising demand. If your current data warehouse is unable to keep up with growing data volumes, it may be time to consider modernizing your system to ensure scalability and flexibility.

Upgrading to a cloud-based data warehouse solution is one way to ensure scalability and flexibility. Cloud-based data warehouses may scale automatically based on the amount of data saved, allowing organisations to easily and quickly increase their data storage and processing capabilities as needed. In addition, cloud-based solutions offer greater flexibility and can be accessed from anywhere, making it easier for employees to work remotely.

Inability to handle diverse data types

Modern businesses generate data from a wide range of sources, including social media, IoT devices, and customer feedback. If your current data warehouse is struggling to handle diverse data types or integrate with other systems, it may be time to upgrade your system to ensure seamless integration and analysis.

Upgrading to a modern data warehouse solution that can handle diverse data types and integrate with other systems can provide significant benefits. For example, a modern data warehouse can enable businesses to integrate data from multiple sources, enabling more comprehensive analysis and insights. In addition, it can support a wider range of data types, such as unstructured data from social media, enabling businesses to gain deeper insights into customer behavior and preferences.

High maintenance costs

If your data warehouse is consuming a significant amount of resources, time, and money to maintain and operate, it may be time to modernize. Upgrading to a cloud-based data warehouse solution can provide cost savings, reduce maintenance requirements, and offer greater flexibility.

Because cloud-based data warehouses do not require expensive hardware or dedicated IT professionals to maintain and manage the system, they are often less expensive than on-premise solutions. Furthermore, cloud-based solutions are frequently more scalable, allowing companies to simply pay for the storage and processing power they really need and scale up or down as necessary.

Limited accessibility

If your current data warehouse is only accessible to a limited number of users or requires extensive training to use, it may be time to consider a more user-friendly and accessible solution. A modern data warehouse can offer intuitive user interfaces and streamlined workflows, enabling users across the organization to leverage the power of data.

It can also enable employees to access data from anywhere, using any device. This can improve collaboration and communication across the organization, enabling teams to work more effectively and make informed decisions based on real-time data.

Inability to support real-time data

Real-time data has become increasingly important for organizations, especially those in industries such as finance and e-commerce. If your data warehouse can’t support real-time data, it’s a sign that it may need a makeover to accommodate new data types and sources.

Conclusion:

Recognizing the warning signs of an outdated data warehouse is crucial to ensuring your business remains competitive and agile in today’s data-driven business environment. By addressing these issues, businesses can modernize their data warehouse and unlock the full potential of their data to make informed decisions, optimize operations, and drive growth.

DSC Weekly 9 May 2023 – The case for AI-human collaboration

Announcements

  • In addition to ever-increasing volumes of data, storage needs have evolved due to increases in remote work, the use of cloud services, and cybersecurity concerns such as ransomware. In the Modern Storage Management summit, learn from top industry experts and solution providers about the latest ways to effectively manage the flood of enterprise data. You’ll hear about the various forms of cloud storage and how they can benefit your storage strategy, new trends in data backup, and how to ensure security and data protection are always a priority.
  • Enterprises today are experiencing rapid growth and cite IT observability as an essential business imperative for properly monitoring and managing their complex environments. Tune in to the Pursuing Full-Stack IT Observability summit to learn from leading experts how to establish enterprise observability with the right approach and solutions. Learn about the leading platforms and tools to help companies achieve full-stack observability and keep business environments functioning smoothly.
The case for AI-human collaboration

It’s no surprise that Artificial Intelligence articles make up the majority of today’s edition of DSC Weekly. Every day there are new predictions and studies anticipating how AI will influence business and society as a whole. The consensus is that AI isn’t going anywhere. How it influences society will depend on balancing the immense power of AI with humans’ ability to scale back the potential risks that come with it.

To achieve this balance, DSC contributor Bill Schmarzo says it’s necessary for everyone to “think like a data scientist” when it comes to AI. In part 1 of his “AI for Everyone” blog, Bill makes the case for educating and empowering everyone to participate in the AI conversation. He describes the “TLADS” methodology as a collaborative process to develop metrics against which organizations determine their value creation effectiveness. In his blog, Bill outlines the steps to implementing the TLADS methodology that he says can help guide ethical AI development.

This will likely be the deciding factor in AI’s influence over society and humanity as a whole: the ability of humans to work with AI rather than fight against it. In order for AI to reach its full economic and societal potential, everyone must be involved in AI model design and management, according to Bill. In Part 2 of his “AI for Everyone” blog, he promises to lay out how to ensure everyone understands their role in developing not only functioning AI models, but also responsible and ethical ones. Without this input, the doomsayers could be more correct than we want to believe.

The Editors of Data Science Central

Contact The DSC Team if you are interested in contributing.

DSC Featured Articles

  • 6 signs your data warehouse needs a makeover
    May 9, 2023
    by Prasanna Chitanand
  • LLMs Emergent Abilities: Explainable AI and the Human Mind
    May 9, 2023
    by David Stephen
  • The observer effect in a multi-layered neural network
    May 9, 2023
    by Issoufou Seidou Sanda
  • Achieving mainframe reliability with distributed scale
    May 9, 2023
    by Andrew Oliver
  • The Roles and Responsibilities of Data-centric Developers
    May 9, 2023
    by Kirk Borne
  • AI for Everyone: Learn How to Think Like a Data Scientist – Part 1
    May 8, 2023
    by Bill Schmarzo
  • Transforming IT through SaaSification
    May 8, 2023
    by Andrew Oliver
  • How Machine Learning is Revolutionizing the Healthcare Industry
    May 8, 2023
    by Evan Rogen
  • Maximizing Revenue in Psychology Practices: Leveraging AI for Billing Optimization
    May 5, 2023
    by John Lee
  • DSC Weekly 2 May 2023 – Big tech must weigh AI’s risks vs. rewards
    May 2, 2023
    by Scott Thompson

Picture of the Week

DSC Weekly 9 May 2023 – The case for AI-human collaboration

You should never neglect to monitor your machine-learning models

Machine learning has emerged as a powerful tool for organizations across industries to enhance their operational efficiency and make data-driven decisions.

With businesses relying increasingly on machine learning models, it is crucial to ensure they perform as expected. This is where monitoring machine learning models comes into play.

To put it simply, monitoring a machine learning model means continually evaluating its performance: collecting data on the model’s output, comparing it with the expected results, and identifying any discrepancies.

The main objective of monitoring is to confirm that the model is functioning as planned and to detect and resolve any potential problems.

However, despite its importance, many organizations tend to neglect monitoring their machine-learning models.

In this blog, let’s discuss why you should never make this mistake and why monitoring such models is crucial for your business.

Machine learning models are susceptible to drift

The primary reason for monitoring these models is their susceptibility to drift. Data drift occurs when the data the model encounters in production no longer matches the data on which it was trained. This can lead to a drop in the accuracy of the model and, in some cases, render it useless.

By monitoring the performance of a model, you can detect data drift early and take corrective action. This ensures that your model remains accurate and reliable, and continues to deliver the expected results.
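What such a drift check might look like in practice: the hedged sketch below compares one feature's training-time distribution with recent production data using a two-sample Kolmogorov-Smirnov test. The data, feature, and alert threshold are illustrative; real monitoring tools typically run checks like this per feature on a schedule.

```python
import numpy as np
from scipy.stats import ks_2samp

# Hypothetical distributions of one input feature at training time vs. in production.
train_feature = np.random.default_rng(0).normal(loc=0.0, scale=1.0, size=5_000)
live_feature = np.random.default_rng(1).normal(loc=0.4, scale=1.2, size=5_000)  # shifted

statistic, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.01:  # the threshold is a judgment call per model and feature
    print(f"Possible data drift detected (KS={statistic:.3f}, p={p_value:.1e})")
else:
    print("No significant drift detected for this feature")
```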

Early detection of errors

Another critical reason to monitor ML models is to detect errors early on. Even the smallest errors in an ML model’s code or data can cause significant problems down the line. Monitoring lets you detect these errors in real time, allowing you to fix them before they cause any significant issues.

By monitoring such models, you can identify potential errors before they become severe, saving your organization time and money in the long run.

Regulatory compliance

Regulatory compliance is a significant concern for businesses across industries. Regulation non-compliance can lead to costly penalties, legal action, and reputational harm. Various industries, such as healthcare and finance, have strict regulations governing the use of machine learning models.

By monitoring your ML models, you can ensure their adherence to the applicable regulations. It allows you to detect and address any issues that may arise, ensuring that your organization remains in compliance with the rules and regulations governing its operations.

Enhanced model performance

The monitoring of machine learning models can lead to a continuous improvement of their performance. By collecting data on the model’s performance, you can identify patterns and trends that can help you fine-tune the model’s parameters and improve its accuracy.

Frequent monitoring can additionally enable you to spot chances for optimizing your model’s performance using new technologies or techniques. This can assist your organization in staying ahead of the competition and attaining a competitive edge in your industry.

Better decision-making

Finally, monitoring ML-based models can lead to better decision-making. They play a critical role in many organizations’ decision-making processes. By monitoring these models, you can ensure that the decisions being made are based on accurate and reliable data.

Monitoring allows you to identify potential errors or biases in the data, ensuring that the decisions being made are unbiased and objective.

What Makes Machine Learning Monitoring Different from Other Methods?

The technique of monitoring machine learning involves the ongoing analysis of data, ensuring the proper functioning of ML models.

This method of monitoring differs from traditional monitoring methods in several key ways:

Continuous monitoring: Traditional monitoring methods are typically performed at specific intervals, such as daily or weekly. Machine learning monitoring, on the other hand, involves continuous monitoring of ML models in real time. This allows for rapid identification and resolution of any issues that may arise.

Proactive identification of issues: With traditional monitoring methods, issues are often identified after they have already occurred. ML monitoring, however, is proactive in nature and can identify potential issues before they become major problems. This allows for proactive intervention to prevent issues from occurring.

Automation: It is highly automated, using advanced algorithms and machine learning models to detect anomalies and deviations from expected behavior. This reduces the need for manual monitoring and allows for rapid identification of issues.

Scalability: The monitoring of ML-based systems is considerably scalable, facilitating the monitoring of extensive datasets and systems. This makes it particularly suitable for organizations that require the monitoring of complex systems with vast amounts of data.

Predictive analytics: The identification of patterns and trends in data, through the use of predictive analytics, can help identify potential issues. This allows for proactive intervention to prevent issues from occurring.

Customization: It is customized to meet the specific needs of different organizations and industries. This allows for tailored monitoring solutions that address specific challenges and requirements.

By leveraging the advantages of machine learning monitoring, organizations can gain greater insights into their data and achieve better results from their ML-based models.

Best Practices for Effective Machine Learning Model Monitoring

Here are some best practices for effective machine learning model monitoring:

  • Set distinct performance metrics and monitor them on a regular basis
  • Continuously track and monitor data quality and model inputs
  • Set up alerts to notify stakeholders when models fall outside of expected ranges
  • Regularly review and update models to ensure they remain accurate and relevant
  • Implement robust testing and validation processes to catch errors and biases
  • Document all changes and updates made to models for transparency and accountability
  • Foster a culture of ongoing learning and improvement around model monitoring and management

Conclusion:

Neglecting to monitor your machine learning models can have serious consequences for your organization, including decreased accuracy, increased bias, and costly errors. Monitoring these models is essential for any organization that relies on them for its operations. Remember, monitoring your ML-based model is an ongoing process that requires attention and effort, but the benefits are well worth it in the long run. Therefore, it is critical to ensure that you are monitoring your models regularly to ensure their continued performance and success.

4 pillars of modern data quality

The need for high-quality, trustworthy data in our world will never go away.

Treating data quality as a technical problem and not a business problem may have been the biggest limiting factor in making progress. Finding technical defects, such as duplicate data, missing values, out-of-order sequences, and drift from expected patterns of historical data, is no doubt critical, but this is just the first step. A more demanding and crucial step is to measure business quality, which checks whether the data is contextually correct.

Let us examine the four pillars of Modern Data Quality:

1. Top-down Business KPI – Perhaps the IT teams would have benefited if the term data quality had never been coined, and instead “business quality” was the goal. In that case, the raison d’être of ensuring data is correct would have been to ensure the business outcomes were being met. In this scenario, the focus shifts from the data’s infrastructure to its context.

But what exactly is “context?”

It is the application of business use to the data. For example, the definition of a “customer” can vary between different business units. For sales, it is the buyer; for marketing, it is the influencer; and for finance, it is the person who pays the bills. So, the context changes depending on who is dealing with the data. Data quality needs to keep in lockstep with the context. In another example, country code 1 and region US and Canada may appear to be analogous, but they are not. Different teams can use the same columns in a table for vastly different purposes. As a result, the definition of data quality varies. Hence, data quality needs to be applied at the business context level.

2. Product Thinking – The concepts evoked by the data mesh principles are compelling. They evolve our thinking so that older approaches that might not have worked in practice actually can work today. The biggest change is how we think about data: as a product that must be managed with users and their desired outcomes in mind.

Organizations are applying product management practices to make their data assets consumable. The goal of a “data product” is to encourage higher utilization of “trusted data” by making its consumption and analysis easier by a diverse set of consumers. This in turn increases an organization’s ability to rapidly extract intelligence and insights from their data assets in a low-friction manner.

Similarly, data quality should also be approached with the same product management discipline. Data producers should publish a “data contract” listing the level of data quality promised to the consumers. By treating data quality as a first-class citizen, the producers should learn how the data is being used and the implications of its quality. Data products’ data quality SLA is designed to ensure that consumers have knowledge about parameters like the freshness of data.
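What a lightweight data contract check might look like in code, as a hedged sketch: the required columns, null-rate limits, and freshness SLA below are hypothetical, and real implementations often use schema-validation tooling rather than hand-rolled checks.

```python
from datetime import datetime, timedelta
import pandas as pd

# A hypothetical data contract published by the producer of a "customers" data product.
CONTRACT = {
    "required_columns": ["customer_id", "email", "updated_at"],
    "max_null_fraction": {"email": 0.02},       # at most 2% missing emails
    "max_staleness": timedelta(hours=24),        # freshness SLA promised to consumers
}

def check_contract(df: pd.DataFrame) -> list[str]:
    """Return a list of contract violations for one delivered batch."""
    violations = []
    for col in CONTRACT["required_columns"]:
        if col not in df.columns:
            violations.append(f"missing column: {col}")
    for col, limit in CONTRACT["max_null_fraction"].items():
        if col in df.columns and df[col].isna().mean() > limit:
            violations.append(f"{col}: null fraction exceeds {limit:.0%}")
    if "updated_at" in df.columns:
        newest = pd.to_datetime(df["updated_at"]).max()   # assumes naive UTC timestamps
        if datetime.utcnow() - newest.to_pydatetime() > CONTRACT["max_staleness"]:
            violations.append("data is staler than the freshness SLA")
    return violations
```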

3. Data Observability – Frequently, the data consumer is the first person to detect anomalies, such as the CFO discovering errors on a dashboard. At this point, all hell breaks loose, and the IT team goes into a reactive fire-fighting mode trying to detect where in the complex architecture the error manifested.

Data Observability fills the gap by constantly monitoring data pipelines and using advanced ML techniques to quickly identify anomalies, or even proactively predict them so that issues can be remediated before they reach downstream systems.

Data quality issues can happen at any place in the pipeline. However, if the problem is caught sooner, then the cost to remediate is lower. Hence, adopt the philosophy of ‘shift left.’ A data observability product augments data quality through:

  • Data discovery – extracts metadata from data sources and all the components of the data pipeline, such as transformation engines and reports or dashboards.
  • Monitoring and profiling – for data in motion and at rest (monitoring data in use remains an open question).
  • Predictive anomaly detection – uses built-in ML models to predict likely issues before they surface downstream.
  • Alerting and notification

Data quality is a foundational part of data observability. The figure below shows the overall scope of data observability.

Figure: The overall scope of data observability

4. Overall Data Governance – The data quality subsystem is inextricably linked to overall metadata management.

On one hand, the data catalog stores defined or inferred rules, and, on the other hand, DataOps practices generate metadata that further refines the data quality rules. Data quality and DataOps ensure that the data pipelines are continuously tested with the right rules and context in an automated manner and alerts are raised when anomalies are inferred.

In fact, data quality and DataOps are just two of the many use cases of metadata. Modern data quality is integrated with these other use cases as the figure below shows.

Figure: Metadata is the glue

A comprehensive metadata platform that coalesces data quality with other aspects of data governance improves collaboration between business users, such as data consumers, and the producers and maintainers of data products. They share the same context and metrics.

This tight integration helps in adopting the shift left approach to data quality. Continuous testing, orchestration and automation help reduce error rates and speed up delivery of data products. This approach is needed to improve trust and confidence in the data teams.

This integration is the stepping stone for enterprise adoption of modern data delivery approaches of data products, data mesh, and data sharing options, like exchanges and marketplaces.

Contact us to learn more about the Modern Data Quality Platform.

Building a Secure Workplace: 5 Strategies to Raise Cybersecurity Awareness

As cyber threats become increasingly sophisticated, it’s more important than ever for businesses to prioritize cybersecurity awareness. Cyber attacks can have devastating consequences, including data loss, financial losses, and reputational damage. Fortunately, there are steps businesses can take to protect themselves. In this article, we’ll explore practical tips for implementing cybersecurity awareness at your business, along with solutions to protect it from cyber threats.

Create a Cybersecurity Policy

The first step in implementing cybersecurity awareness is to create a cybersecurity policy that outlines best practices for employees to follow. This policy should cover topics such as password management, data handling, and email security, and should be regularly reviewed and updated as new threats emerge.

Provide Regular Training

Once you have a cybersecurity policy in place, it’s important to provide regular training to employees to ensure that they understand and follow best practices. This training should cover topics such as phishing scams, malware, and social engineering, and should be tailored to the specific needs of your business.

Implement Strong Access Controls

Access controls are critical for protecting sensitive data and systems from unauthorized access. This includes using strong passwords, implementing multi-factor authentication, and limiting access to only those employees who need it.

Regularly Update and Patch Systems

Cybercriminals often exploit vulnerabilities in software and systems to gain access to sensitive data. To prevent this, it’s important to regularly update and patch systems to address known vulnerabilities and keep software up-to-date.

Regularly Conduct Security Assessments

Regular security assessments can help identify potential vulnerabilities and weaknesses in your security posture. This can include vulnerability scans, penetration testing, and other security assessments to identify areas of weakness and develop solutions to protect your business.

Develop an Incident Response Plan

In the event of a cyber attack or other security incident, it’s important to have a plan in place to respond quickly and effectively. This should include procedures for identifying and containing the incident, communicating with stakeholders, and restoring systems and data.

Regularly Monitor and Analyze Security Logs

Monitoring security logs can help identify potential security incidents before they become major threats. This involves regularly reviewing security logs and analyzing them for signs of suspicious activity, such as failed login attempts, unusual network traffic, and unauthorized access attempts.

Conduct Background Checks on Employees

Conducting background checks on employees can help identify potential security risks before they become major threats. This can include criminal background checks, credit checks, and other screenings to ensure that employees are trustworthy and have the necessary skills and experience to perform their roles effectively.

Engage with Industry Experts

Engaging with industry experts and attending cybersecurity conferences can help businesses stay up-to-date on the latest threats and best practices for protecting against them. This can include working with third-party vendors to implement cybersecurity solutions, as well as participating in industry associations and forums to share information and best practices with other businesses.

Regularly Review and Update Policies and Procedures

Cyber threats are constantly evolving, and businesses need to stay up-to-date on the latest threats and best practices to stay protected. This means regularly reviewing and updating policies and procedures to ensure that they remain effective and relevant in the face of changing threats and technologies.

In addition to these tips, there are several solutions that businesses can implement to protect themselves from cyber threats:

1. Anti-Malware Solutions:

Anti-malware solutions are critical for protecting against viruses, worms, and other malware that can compromise sensitive data and systems. These solutions should be regularly updated and configured to provide maximum protection.

2. Firewalls:

Firewalls are an essential tool for protecting against unauthorized access and preventing malware from spreading throughout a network. Firewalls can be configured to block incoming and outgoing traffic based on specific rules and can be customized to meet the unique needs of your business.

3. Data Encryption:

Data encryption is an effective way to protect sensitive data from unauthorized access. This involves converting data into a coded format that can only be decrypted with a key, ensuring that data remains secure even if it falls into the wrong hands.
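As a minimal illustration (not a complete solution), the sketch below uses the widely used Python cryptography package to encrypt and decrypt a small piece of sensitive data with a symmetric key; real deployments also need key management, rotation, and access controls.

```python
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # in practice, store and rotate keys in a secrets manager
f = Fernet(key)

ciphertext = f.encrypt(b"customer SSN: 123-45-6789")
print(ciphertext)                    # unreadable without the key

plaintext = f.decrypt(ciphertext)    # only holders of the key can recover the data
print(plaintext.decode())
```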

4. Cloud-Based Solutions:

Cloud-based solutions can provide increased security and flexibility compared to on-premise solutions. By storing data and applications in the cloud, businesses can benefit from the security and scalability of cloud providers while also reducing the cost and complexity of managing on-premise solutions.

5. Disaster Recovery Solutions:

Disaster recovery solutions are critical for ensuring that businesses can quickly recover from a cyber attack or other disaster. This can include regular backups of data and systems, as well as the use of cloud-based disaster recovery solutions that can provide quick and reliable recovery in the event of a disaster.

6. Mobile Device Management (MDM) Solutions:

As more employees use mobile devices for work-related tasks, it’s important to implement mobile device management solutions to ensure that these devices are secure and protected from cyber threats. MDM solutions can provide features such as remote data wiping, device encryption, and secure data access to ensure that mobile devices are secure.

7. Virtual Private Networks (VPNs):

VPNs can help businesses protect sensitive data and communications by encrypting data and creating a secure, private network for remote workers. This can be particularly valuable for businesses with remote workers or teams that frequently work from outside the office.

8. Email Security Solutions:

Email is a common vector for cyber attacks, and it’s important to implement email security solutions to protect against phishing scams, malware, and other email-based threats. This can include solutions such as spam filters, anti-virus software, and email encryption to protect sensitive data.

9. Identity and Access Management (IAM) Solutions:

IAM solutions can help businesses manage user access to systems and data, ensuring that only authorized users have access to sensitive information. This can include features such as multi-factor authentication, role-based access control, and user activity monitoring to prevent unauthorized access and detect suspicious activity.

10. Regular Vulnerability Assessments:

Regular vulnerability assessments can help businesses identify potential vulnerabilities in their systems and applications, allowing them to proactively address these issues before they are exploited by cybercriminals. This can include vulnerability scanning, penetration testing, and other assessments to identify weaknesses in security posture.

By implementing these tips and solutions, businesses can protect themselves from cyber threats and ensure that they are prepared to respond to any security incident that may arise. Whether you’re a small business or a large enterprise, cybersecurity awareness is essential for protecting your data, your reputation, and your bottom line. So don’t wait – start implementing these solutions today and protect your business from cyber threats.

AI for Everyone: Learn How to Think Like a Data Scientist – Part 2

In Part 1 of the series “AI for Everyone: Learn How to Think Like a Data Scientist”, we discussed that for AI to reach its full economic and societal potential, we must educate and empower everyone to actively participate in the design, application, and management of meaningful, relevant, and responsible AI. We discussed the role that the “Thinking Like a Data Scientist” methodology can play in driving an inclusive, collaborative process that empowers everyone to participate in defining the variables and metrics against which organizations define and measure their value creation effectiveness.

We also introduced an “AI for Everyone” playbook that outlines the role that everyone needs to play in delivering responsible and ethical AI models. In Part 1, we started with the following steps:

  • Step 1) Defining Value
  • Step 2) Understanding How AI Works
  • Step 3) Understanding the Role of the AI Utility Function

Now, let’s cover the rest of the “AI for Everyone” playbook.

Step 4) Building a Healthy AI Utility Function

Defining the variables and metrics that comprise the AI Utility Function too narrowly can lead to AI model confirmation bias and potentially devastating unintended consequences.

As we define the variables and metrics against which we want the AI Utility Function to optimize, we must embrace an expanded view of how value is defined and created. That means moving beyond the traditional and operational metrics against which most organizations seek to optimize with their advanced analytics.

Remember, AI is a learning tool, and one must broaden the dimensions against which value creation is defined and measured. That means including variables and metrics related to customer satisfaction and advocacy, employee satisfaction and development, partner ecosystem financial and economic viability, environmental conditions, diversity and social factors, and ethical “do good” factors (Figure 1).

Figure 1: Economic Value Definition Wheel

Remember, defining only financial metrics will lead to AI model confirmation bias and a gradually shrinking total addressable market. And defining only lagging indicators will lead to potentially dangerous unintended consequences.
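To make the idea tangible, here is a toy sketch (not the TLADS methodology itself) of a utility score that blends financial value with customer, employee, partner, environmental, and societal dimensions; the metric names and weights are purely illustrative.

```python
# Illustrative only: a toy AI utility score that blends dimensions beyond finance.
WEIGHTS = {
    "financial_value": 0.30,
    "customer_satisfaction": 0.20,
    "employee_wellbeing": 0.15,
    "partner_viability": 0.10,
    "environmental_impact": 0.15,
    "societal_benefit": 0.10,
}

def utility(metrics: dict) -> float:
    """Weighted blend of normalized (0-1) metrics; a narrow, finance-only
    version of this function is exactly what invites confirmation bias."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(WEIGHTS[name] * metrics[name] for name in WEIGHTS)

candidate_decision = {
    "financial_value": 0.9, "customer_satisfaction": 0.6, "employee_wellbeing": 0.7,
    "partner_viability": 0.5, "environmental_impact": 0.4, "societal_benefit": 0.6,
}
print(round(utility(candidate_decision), 3))
```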

Step 5) Building Conflict Into the AI Utility Function

Multi-objective Optimization is a branch of decision-making that deals with mathematical optimization problems requiring simultaneous optimization of multiple objective functions.

AI models will access, analyze, and make decisions billions, if not trillions, of times faster than humans. An AI model will find the gaps in a poorly constructed AI Utility Function and will quickly exploit that gap to the potential detriment of many key stakeholders. Consequently, build conflicting variables and metrics into the AI Utility Function – Improve A while reducing B, improve C while also improving D.

Just like humans are constantly forced to make difficult trade-off decisions (i.e., drive to work quickly but also arrive safely, improve the quality of healthcare while improving the economy), we must expect nothing less from a tool like AI that can make those tough trade-off decisions without biases or prejudices. The good news is that several multi-objective optimization algorithms excel at optimizing multiple conflicting objectives (Figure 2).

Figure 2: Multi-objective Optimization Algorithms
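As a minimal illustration of "improve A while reducing B", the sketch below filters hypothetical candidate decisions down to the Pareto-optimal set for two conflicting objectives (maximize quality, minimize cost). Real multi-objective optimizers, such as evolutionary algorithms in the NSGA-II family, automate this search over far larger spaces; the numbers here are made up.

```python
# Each candidate decision scores two conflicting objectives:
# maximize service quality, minimize cost.
candidates = {
    "A": {"quality": 0.90, "cost": 120.0},
    "B": {"quality": 0.85, "cost": 80.0},
    "C": {"quality": 0.70, "cost": 60.0},
    "D": {"quality": 0.60, "cost": 95.0},   # dominated: worse quality AND higher cost than C
}

def dominates(x, y):
    """x dominates y if it is at least as good on both objectives and strictly better on one."""
    return (x["quality"] >= y["quality"] and x["cost"] <= y["cost"] and
            (x["quality"] > y["quality"] or x["cost"] < y["cost"]))

pareto_front = [name for name, x in candidates.items()
                if not any(dominates(y, x)
                           for other, y in candidates.items() if other != name)]
print(pareto_front)   # ['A', 'B', 'C'] -- the trade-off set a human still has to choose from
```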

AI for Everyone Summary

If we want AI to work for the benefit of society, then we must prepare and empower everyone to actively participate. That means both demystifying and simplifying what AI is and how it works, and educating and empowering everyone as to their role in ensuring the design, development, deployment, and ongoing management of AI models that work for the benefit of humanity. Understanding the “Thinking Like a Data Scientist” methodology is a simple starting point (Figure 3).

Figure 3: Linking Thinking Like a Data Scientist Methodology to the AI Utility Function

The AI for Everyone playbook ensures that everyone understands their role – and their responsibility – in developing responsible and ethical AI models. There is no value in whining about not being included in the conversation when the playbook is right in front of you.

By the way, I was curious about what ChatGPT thought about the relationship between my Thinking Like a Data Scientist methodology and creating a healthy AI Utility Function (Figure 4).

Figure 4: ChatGPT’s Assessment of the Relationship Between TLADS Methodology and AI Utility Function

Well, I know someone who will get an “A” in my class. Teacher’s pet…