Market Guide for AIOps Platforms

Published 7 November 2019 - ID G00378587 - 23 min read

AIOps platforms enhance I&O leaders’ decision making by contextualizing large volumes of varied and volatile data. I&O leaders should use AIOps platforms for refining performance analysis across the application life cycle, as well as for augmenting IT service management and automation.

Overview

Key Findings

  • Use of AIOps platforms to augment IT functions such as event correlation and analysis, anomaly detection, root cause analysis and natural language processing is growing rapidly. However, application of AIOps to functions such as ITSM and DevOps is progressing at a slower pace.
  • AIOps platform offerings have split into two approaches: domain-agnostic and domain-centric solutions.
  • Enterprises that adopt AIOps platforms use them as a force multiplier for monitoring tools, correlating across application performance monitoring (APM), IT infrastructure monitoring (ITIM), network performance monitoring and diagnostics tools, and digital experience monitoring.
  • AIOps platform maturity, IT skills and operations maturity are the chief inhibitors to rapid time to value. Other emerging challenges for advanced deployments include data quality and lack of data science skills within I&O.

Recommendations

I&O leaders focused on infrastructure, operations and cloud management should:
  • Increase the odds of a successful AIOps platform deployment by focusing on a specific use case and adopting an incremental approach that starts with replacing rule-based event analytics and expands into domain-centric workflows like application and network diagnostics.
  • Address specific use cases by adopting either domain-centric AIOps platform features built into a monitoring tool or a domain-agnostic stand-alone solution, each of which ingests events, metrics and traces.
  • Support task automation, knowledge management and change analysis by selecting an AIOps platform that can be applied to these ITSM use cases.
  • Enable continuous insights across IT operations management (ITOM) by supporting these three aspects of AIOps platforms: observe, engage and act.

Strategic Planning Assumption

By 2023, 40% of DevOps teams will augment application and infrastructure monitoring tools with artificial intelligence for IT operations (AIOps) platform capabilities.

Market Definition

AIOps platforms address I&O leaders’ need for operations support by combining big data and machine learning functionality to analyze the ever-increasing volume, variety and velocity of data generated by IT in response to digital transformation. An identifiable group of vendors has emerged to meet enterprise requirements for this insight, although they prioritize and architect support for those requirements differently.

Market Description

AIOps platforms enhance a broad range of IT operations processes including, but not limited to, anomaly detection, event correlation and root cause analysis (RCA) to improve monitoring, service management and automation tasks.
The central functions of AIOps platforms include:
  • Ingesting data from multiple sources including infrastructure, networks, apps, the cloud or existing monitoring tools (for cross-domain analysis)
  • Enabling data analytics using machine learning at two points:
    • Real-time analysis at the point of ingestion (streaming analytics)
    • Historical analysis of stored data
  • Storing and providing access to the data
  • Suggesting prescriptive responses to analysis
  • Initiating an action or next step based on the prescription (result of analysis)
The goal of the analytics effort is the discovery of patterns — clusters or groups naturally occurring in the data that are used to predict possible incidents and emerging behavior. These patterns are used to determine the root causes of current system issues and to intelligently drive automation to resolve them (see Figure 1).
Figure 1. AIOps Platform Enabling Continuous Insights Across IT Operations Monitoring (ITOM)

Market Direction

Gartner estimates the size of the AIOps platform market at between $300 million and $500 million per year. Artificial intelligence (AI) technologies such as machine learning have influenced the evolution of ITOM intermittently over the past two decades, and AIOps platforms are only the most recent example of that influence. Use of AI in IT operations has been driven by the adoption of digital transformation and the resultant need to address the following:
  • Rapid growth in data volumes generated by the IT systems, networks and applications
  • Increasing data variety with the need to analyze events, metrics, traces (transactions), wire data, network flow data, streaming telemetry data, customer sentiment and more
  • The increasing velocity at which data is generated, as well as the increasing rate of change within IT architectures and challenges in maintaining observability and improving engagement due to the adoption of cloud-native and ephemeral architectures
  • The need to intelligently and adaptively automate recurring tasks and predict change success and SLA failure
An inability to deal with these data requirements can prove costly given the insights required in all areas of the organization. AIOps platforms must support incremental deployment of the four stages of IT operations monitoring (see Figure 2).
Figure 2. Four Stages of IT Operations Monitoring
I&O leaders are beginning to focus on use cases in areas beyond the realm of IT operations monitoring such as in IT service management (ITSM), digital experience monitoring (DEM) and DevOps (see Note 2 and “Improve Event Management With the DevOps Techniques of Continuous Monitoring and Automation”). In addition, a spectrum of AIOps platform use cases spans the life cycle of applications and teams (see Figure 3).
Figure 3. Applying AIOps Platforms Across a Spectrum of Use Cases Over the Life Cycle of an Application
Further, digital transformation is driving an increased need for speed in IT (see “Artificial Intelligence for IT Operations Delivers Improved Business Outcomes”). This, in turn, drives the need for tools that can deliver the following capabilities:
  • Reduce noise (such as false alarms) using clustering and pattern matching algorithms
  • Determine causality, identifying the probable cause of incidents using topology as well as ML, and relate these issues to a customer journey using algorithms such as decision trees, random forest and graph analysis
  • Capture multivariate anomalies that go beyond static thresholds or numeric outliers to proactively detect abnormal conditions and behavior and relate them to business impact
  • Detect trends that may result in outages before their impact is felt
  • Drive the automation of low-risk to medium-risk recurring tasks
  • Improve user effectiveness and automation using chatbots and virtual support assistants (VSAs) to democratize access to knowledge and automate recurring tasks
  • Triage problems, helping prioritize them and offer actions that can be taken to resolve them (either directly or via integration based on past scenarios)
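As a minimal sketch of the noise-reduction capability listed above, the following groups raw events that share a fingerprint and arrive close together in time into a single alert. The `Event` fields, the `(source, check)` fingerprint and the 300-second window are illustrative assumptions; commercial platforms are likely to use richer clustering and pattern matching than this.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Event:
    timestamp: float  # seconds since epoch
    source: str       # hypothetical field: emitting host or service
    check: str        # hypothetical field: name of the failing check


def group_events(events: List[Event], window_s: float = 300.0) -> List[List[Event]]:
    """Collapse events with the same (source, check) fingerprint into one
    group when each arrives within window_s of the previous one."""
    groups: List[List[Event]] = []
    open_group: Dict[Tuple[str, str], List[Event]] = {}
    for ev in sorted(events, key=lambda e: e.timestamp):
        key = (ev.source, ev.check)
        current = open_group.get(key)
        if current is not None and ev.timestamp - current[-1].timestamp <= window_s:
            current.append(ev)  # repeat of an alert that is already open
        else:
            current = [ev]      # first sighting, or the window has lapsed
            open_group[key] = current
            groups.append(current)
    return groups


raw = [
    Event(0, "db01", "disk_full"),
    Event(60, "db01", "disk_full"),
    Event(90, "web02", "http_5xx"),
    Event(120, "db01", "disk_full"),
    Event(1000, "db01", "disk_full"),
]
alerts = group_events(raw)
print(f"{len(raw)} raw events -> {len(alerts)} alerts")
```

Each group can then be surfaced as one alert with a repeat count, which is one simple way to cut the number of nonactionable notifications.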
Gartner anticipates that, over the next five years, wide-scope domain-agnostic AIOps platforms and narrow-scope domain-centric AIOps tools such as ITIM, APM or ITSM suites will become the two paths for delivering AIOps functionality (see Note 3).
When the variety of data sources exceeds the scope of a domain-centric tool, a wide-scope one will be necessary. That doesn’t necessarily mean that the domain-centric tool will not be used. As machine learning continues to be embedded in monitoring tools, the AIOps platform will become more of a federated environment. In this environment, AI will be used at the domain level within a domain-centric tool. As data streams from multiple sources are available, the output of the domain tools will be sent to the wide-scope, domain-agnostic AIOps platform for cross-domain correlation (see “Deliver Cross-Domain Analysis and Visibility With AIOps and Digital Experience Monitoring”).
As the market evolves, Gartner is observing AIOps capabilities evolving across various dimensions:
  • Domain-agnostic AIOps — Vendors going to market with a general-purpose AIOps platform. These products tend to rely mostly on monitoring tools to perform data capture and cater to the broadest use cases.
  • Domain-centric AIOps — Vendors that have the key components, but with a restricted set of use cases. They essentially do the same thing they did before but now they’re replacing rules, heuristics and fingerprints with math (algorithms). These vendors are focused on one domain (for example, network, endpoint systems or APM). However, there have been some efforts by domain-centric solutions to hybridize these categories and evolve to ingesting data from sources other than their own instrumentation tools and including this data in their analysis.
  • Do-it-yourself (DIY) — Some open-source projects enable users to assemble their own AIOps platforms by offering tools for data ingest, a big data platform, ML and a visualization layer. End users can mix and match the components from multiple providers (see “Beginning AIOps: Data Science for IT Operations”). A few enterprises actively build AIOps platforms by putting together all the required layers starting with streaming to acquire data (using Prometheus, for example), followed by aggregation (in InfluxData’s InfluxDB, for example) and a visualization tool (such as Grafana or Elastic Kibana). Some advanced adopters of DIY AIOps platforms have built solutions that analyze the confidence level of their deployments in order to gauge risk, predict customer churn, and detect and autoresolve problems before they have business impact. However, these deployments are in the minority because of the skills and ongoing maintenance needed to support them.

Market Analysis

Today, few vendors deliver on the full promise of AIOps platforms to provide rapid insight into large volumes of highly volatile data. The architecture and platform have improved, but the technology is still emerging and requires time and effort to get quality outcomes. To get a clearer picture of how the market is evolving and where vendors are positioned relative to one another, consider the following AIOps platform capabilities:
  • Data ingestion and handling
  • Machine learning (ML) analytics
  • Remediation
Data Ingestion and Handling
AIOps platforms must be able to ingest data-at-rest (historical) and data-in-motion (real-time, streaming). These platforms allow for the ingestion, indexing and storage of event data, wire data, metrics, traces, and graph and document data. They must also analyze data in real time, directly at the point of ingestion, without requiring that the data first be saved to a database. In addition, they must provide correlated analysis across multiple streams of real-time and historical data.
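To illustrate analysis at the point of ingestion, the sketch below scores each metric sample as it arrives, using Welford's online algorithm to keep a running mean and variance without storing the history. The class and function names, the z-score threshold of 3 and the warm-up of five samples are assumptions chosen for illustration, not a description of any product.

```python
import math


class StreamingStats:
    """Running mean and variance via Welford's online algorithm, so each
    sample can be scored as it is ingested without storing the history."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x: float) -> None:
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    def stddev(self) -> float:
        return math.sqrt(self._m2 / (self.n - 1)) if self.n > 1 else 0.0


def score_stream(samples, threshold=3.0, warmup=5):
    """Yield (value, is_anomaly). Each value is scored against the
    statistics of earlier samples and then folded into them."""
    stats = StreamingStats()
    for x in samples:
        sd = stats.stddev()
        is_anomaly = stats.n >= warmup and sd > 0 and abs(x - stats.mean) / sd > threshold
        yield x, is_anomaly
        stats.update(x)


latencies_ms = [100, 102, 98, 101, 99, 100, 103, 250, 101]  # made-up samples
flagged = [x for x, bad in score_stream(latencies_ms) if bad]
print(flagged)
```

Because only three numbers are kept per metric, the same approach scales to many concurrent streams; a production design would also need to decide whether flagged values should be excluded from the running statistics.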
Machine Learning Analytics
The following types of analytic approaches are used:
  • Statistical, probabilistic analysis — A combination of univariate and multivariate analysis, including the use of correlation, clustering, classifying and extrapolation on metrics captured across IT entities.
  • Automated pattern discovery and prediction — Discovering patterns, clusters or groups that implicitly describe correlations in historical and/or streaming data. These patterns may then be used to predict incidents with varying degrees of probability.
  • Anomaly detection — Using the patterns discovered by the previous components to determine normal behavior and then to discern departures from that normal behavior, both univariate and multivariate. Anomaly detection should support seasonality, deciding whether behavior is anomalous within a time period called a season. AIOps platforms should be able to detect the naturally occurring seasons in data and be able to learn when this behavior is no longer anomalous. For this to be of value, the algorithms must consider whether the anomaly has an impact or not. In a large-scale deployment, there will always be anomalies, and some will matter much more than others. To go beyond the mere detection of outliers, anomalies must be correlated with potential business impact and with other concurrent processes, such as release management metadata tags, to be fully useful and not just create more alert noise (see “Augment Decision Making in DevOps Using AI Techniques”).
  • Root cause determination — Pruning down the network of correlations established by the automated pattern discovery and ingestion of graph data to define causality chains linking cause and effect.
  • Topological analysis — For the patterns that AIOps platforms detect to be relevant and actionable, a context must be placed around the data ingested. That context is topology in the form of graph data. Without the context and de facto constraint of topology, the patterns detected, while valid, may be unhelpful and distracting. Deriving patterns from data within a topology will establish relevancy and illustrate hidden dependencies. Using topology as part of causality determination can greatly increase its accuracy and effectiveness. Capturing where events occurred and what their up and downstream dependencies are using graph and bottleneck analysis can provide great insight on where to focus remediation efforts.
  • Prescriptive advice — Suggesting solutions to resolve an issue. These suggestions may be based on a database of historical solutions (tribal knowledge) to recurring problems or determined via crowdsourcing.
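The seasonality requirement described under anomaly detection can be sketched as follows: the same value may be normal in one part of a season and anomalous in another. This example assumes the season length is already known (it does not detect seasons automatically, as the text says a platform should), and the function names, sample data and threshold are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


def seasonal_baseline(history, season_len):
    """Group historical values by their position in the season (for
    example, hour of day when season_len is 24) and summarize each slot."""
    buckets = defaultdict(list)
    for i, value in enumerate(history):
        buckets[i % season_len].append(value)
    return {
        slot: (mean(vals), stdev(vals))
        for slot, vals in buckets.items()
        if len(vals) >= 2  # stdev needs at least two observations
    }


def is_anomalous(value, slot, baseline, threshold=3.0):
    """Compare a new value with the baseline for the same seasonal slot."""
    mu, sigma = baseline[slot]
    return sigma > 0 and abs(value - mu) / sigma > threshold


# Three made-up seasons of four slots each: quiet, busy, busy, quiet.
history = [10, 50, 52, 11,
           12, 48, 50, 9,
           11, 52, 51, 10]
baseline = seasonal_baseline(history, season_len=4)

print(is_anomalous(50, 1, baseline))  # load of 50 in a busy slot
print(is_anomalous(50, 0, baseline))  # load of 50 in a quiet slot
```

A static threshold would treat both readings identically; the per-slot baseline is what lets the second one stand out.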
Remediation
As the technology matures, users will be able to leverage prescriptive advice from the platform, enabling the action stage (see Note 4). The steps for this are shown in Figure 4.
Figure 4. The Future of AI-Assisted Automation: Triage and Remediation of Problems
An automated, closed-loop process referred to as “self-driving ITOM” is highly desired but still aspirational. Very few prescriptive solutions have been observed in commercial tools beyond ones that simply automate “bounce the server” or an “open a ticket” type of script. The likely candidates for automated actions from prescriptive tools are those that are low risk. These are the ones that cause relatively little damage if they fail or cause unexpected side effects. Depending on the environment, predetermined actions such as a patch update could be successful, as well as actions to perform workload optimization such as starting up an additional virtual machine (VM) or container.
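One way to picture the low-risk constraint on automated actions is a simple policy gate: an action runs unattended only if its risk rating is at or below a configured ceiling, and anything else is routed to a human for validation. The action names and risk ratings below are assumptions made up for this sketch.

```python
from enum import Enum


class Risk(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Hypothetical catalogue; the risk ratings are assumptions for illustration.
ACTION_RISK = {
    "open_ticket": Risk.LOW,
    "start_container": Risk.LOW,
    "apply_patch": Risk.MEDIUM,
    "restart_database": Risk.HIGH,
}


def decide(action: str, max_auto_risk: Risk = Risk.LOW) -> str:
    """Return 'auto' if the action may run unattended, 'needs_approval'
    if a human must validate it first, or 'reject' if it is unknown."""
    risk = ACTION_RISK.get(action)
    if risk is None:
        return "reject"
    if risk.value <= max_auto_risk.value:
        return "auto"
    return "needs_approval"


for name in ("open_ticket", "apply_patch", "format_disk"):
    print(name, "->", decide(name))
```

Raising `max_auto_risk` to `Risk.MEDIUM` would let patch updates run unattended, which mirrors the point that the right ceiling depends on the environment.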

The Roads to AIOps

AIOps platforms can help with the ITSM engagement process (see “2019 Strategic Roadmap for IT Service Management”) by intelligently driving automation and improving the overall effectiveness and efficiency of ITSM tools while reducing errors (see Note 3).
Use AIOps for:
  • Assisting service desk agents with assigning, categorizing and routing tickets
  • Task automation (for example, deploying software, handling password reset requests, updating VPN clients and reviewing text in email to initiate requests)
  • Leveraging historic data to improve agent performance and increase efficiencies
  • Strategic insight for activities such as change management, predicting change success, identifying change conflicts, identifying contracts about to expire, determining the best time to patch the estate and more
  • Predictive analytics to flag requests and incidents about to breach an SLA
  • Use of natural language processing (NLP) to power chatbots and VSAs to take the load off the service desk’s handling of basic inquiries and tasks like password resets, to share the knowledge base with users and to enable task automation
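The SLA use case in the list above can be illustrated with a deliberately simple estimate: given how long a ticket has already been open, what share of past tickets that stayed open at least that long went on to exceed the SLA? The function name, the sample data and the 50% flagging threshold are assumptions; a real platform would likely use a trained model with more features.

```python
from typing import Sequence


def breach_probability(age_hours: float, sla_hours: float,
                       past_resolution_hours: Sequence[float]) -> float:
    """Estimate P(resolution time > SLA | ticket still open at age_hours)
    from the resolution times of past tickets in the same category."""
    if age_hours >= sla_hours:
        return 1.0  # the SLA has already been breached
    comparable = [r for r in past_resolution_hours if r > age_hours]
    if not comparable:
        # Assumption: no past ticket stayed open this long, so assume the worst.
        return 1.0
    breaches = sum(1 for r in comparable if r > sla_hours)
    return breaches / len(comparable)


past = [2, 3, 4, 5, 6, 8, 10, 12, 20, 30]  # made-up resolution times, in hours
SLA_HOURS = 8
for age in (1, 6):
    p = breach_probability(age, SLA_HOURS, past)
    print(f"open for {age}h: breach probability {p:.0%}, flag={p >= 0.5}")
```

The estimate rises as a ticket ages, so a ticket that looked safe when opened can be flagged later while there is still time to act.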

AIOps in DevOps

IT organizations have also started exploring AIOps in a DevOps context integrated with application release automation to assess risk in code and also in builds to avoid perilous deploys (see “Augment Decision Making in DevOps Using AI Techniques”). This requires the ingestion of metadata, including tags from release management to help in the categorization and relation of new functions released. They are also using AIOps to detect potential security issues (see “Market Guide for Continuous Configuration Automation Tools”).
NLP is heavily adopted in ITSM tools, but some APM vendors have started to include NLP as part of their AIOps capability. The aim is to enable a more flexible ChatOps for the DevOps teams and offer a better interface to APM data and automation.

Representative Vendors

The vendors listed in this Market Guide are a representative sample, not an exhaustive list. This section is intended to provide more understanding of the market and its offerings.

Market Introduction

AIOps platform vendors have a broad range of capabilities that continues to grow. Vendors differ in their data ingest and out-of-the-box use cases made available with minimal configuration. In Table 1, we provide a representative sample list of vendors providing AIOps platform functionality across a number of domains (see Note 1).

Table 1: Representative Vendors in AIOps Platforms

| Vendors | Domain | Year Founded | Headquarters |
|---|---|---|---|
| Domain-Agnostic (DA) AIOps | | | |
| | DA | 2014 | United States and Israel |
| | DA | 2012 | United States |
| BMC | DA, DC: ITSM | 1980 | United States |
| | DA | 2008 | Japan |
| | DA, DC: APM | 1974 | United States |
| | DA | 2011 | United States |
| | DA | 2015 | United States |
| | DA, DC: ITIM, DA | 2012 | United States |
| IBM | DA and Vertical Market Solutions | 1911 | United States |
| | DA | 1993 | United Kingdom |
| | DA | 2014 | United States |
| | DC: ITIM, SIEM (Crowdsourcing) | 2014 | United States and Israel |
| | DA | 2015 | United States |
| | DA | 2011 | United States |
| | DA | 2013 | United States |
| | DA | 2011 | United States |
| | DA (DC: ITIM for SignalFx) | 2003 | United States |
| | DA, DC: ITIM, Service Monitoring, Modeling | 2015 | United States |
| | DC: ITIM | 2010 | United States |
| | DA | 2010 | Israel |
| | DA | 2014 | India |
| Domain-Centric (DC) AIOps | | | |
| ITSM | | | |
| | DC: ITSM | 2017 | United States |
| | DC: ITSM | 2016 | United States |
| | DC: Change Mgmt. | 2007 | United States |
| | DC: ITSM (VSA) | 1998 | United States |
| | DC: ITSM, BAM | 2012 | United States |
| | DC: ITSM | 2004 | United States |
| DevOps | | | |
| | DC: DevOps | 2016 | United States |
| | DC: Dev | 2011 | United States |
| APM | | | |
| | DC: APM, NPMD | 2008 | United States |
| | DC: APM, NLP | 2005 | United States |
| | DC: APM, ITIM | 2008 | United States |
| NPMD | | | |
| | DC: NPMD | 2007 | United States |
| | DC: NPMD | 2014 | United States |
| | DC: NPMD | 2000 | Ireland |
| ITIM | | | |
| | DC: ITIM, APM | 2010 | United States |
| | DC: ITIM | 2014 | United States |
| | DC: ITIM | 2003 | United States |
| | DC: ITIM | 2008 | United States |
| | DC: ITIM (Crowdsourcing) | 2005 | United States |
NPMD = network performance monitoring and diagnostics; BAM = business activity monitoring
Source: Gartner (November 2019)

Market Recommendations

Take an Incremental Approach to AIOps

When adopting AIOps platforms, start with less-critical applications and apply the following:
  • Event categorization
  • Correlation
  • Anomaly detection
Ensure that your use cases drive action to improve business outcomes and that the result of AIOps platform output is either a manual next step or the launching of a script or run book to improve the current state. These scripts and run books should be for situations with low risk, such as opening up a ticket or launching an additional container.
Begin using NLP with chatbots for running recurring tasks and for low-cost sharing of knowledge with employees and users, and with virtual customer assistants for transactional engagements with users (see “5 Key Emerging Technologies and Their Impact on Customer Experience”).
Start with the narrower scope of a domain-centric tool that has AIOps capabilities built in. Success will be measured by tracking the reduction in the number of false alarms and nonactionable tickets at the service desk, in avoiding the impact of detected anomalies and in improving performance. Advance from the ingestion of events to metrics for greater impact. Then, start ingesting traces, analyzing all within the context of topology, relationships and impact on digital business.
Create a program to begin educating the I&O staff on data science (see Note 5).
The use cases to which AIOps platforms can be applied will depend on their scope. Some may require more data than would be optimal, and others may require more data science skills than may be available in I&O.
Modern IT operations require visibility across IT entities, breaking down silos including applications, their relationships, interdependencies and past transformations to gain insight into the present state of the IT landscape. The progressive nature of deployment maturity and evolving use cases requires a readiness to ingest a variety of data sources (see Note 6). I&O leaders should later select AIOps platforms that are capable of ingesting and providing access to a broad range of historical and streaming data types in support of domain-agnostic use cases.
Choose tools offering the ability to gradually increase the depth and breadth of analysis (see Figure 5).
Figure 5. Evolve Your AIOps Stages
Evolve your AIOps stages by:
  • Using a commercial software tool to reveal patterns that organize large volumes of data. This is most helpful in separating low entropy events likely to end up as false alarms from those needing immediate attention.
  • Testing the degree to which these patterns allow users to take manual action to improve state. Determine if the pattern capture is meaningful in terms of its impact to key business outcomes.
  • Anticipating future impact from events and incidents.
  • Working with root cause analysis functionality either within a domain-centric AIOps platform or using a domain-agnostic AIOps tool across data from multiple domains.
  • Using AIOps with ITSM, starting with virtual support assistants/chatbots, ticket analysis and eventually change risk analysis (see “Avoid the Unexpected Consequences of IT Change Management With AIOps and CMDB”).
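For the root cause analysis stage in the list above, the role of topology can be sketched with a small dependency graph: among the components that are currently alerting, the probable root causes are those with no alerting component anywhere beneath them. The graph, the service names and the helper functions are hypothetical, and the sketch assumes the dependency graph has no cycles.

```python
from typing import Dict, List, Set


def dependencies(node: str, depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return everything the node depends on, directly or transitively."""
    seen: Set[str] = set()
    stack = list(depends_on.get(node, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(depends_on.get(dep, []))
    return seen


def probable_root_causes(depends_on: Dict[str, List[str]],
                         alerting: Set[str]) -> Set[str]:
    """Keep only alerting nodes none of whose dependencies are alerting;
    alerts on the remaining nodes are treated as downstream symptoms."""
    return {
        node for node in alerting
        if not ((dependencies(node, depends_on) - {node}) & alerting)
    }


# Hypothetical topology: each service maps to the services it depends on.
topology = {
    "checkout": ["cart", "payments"],
    "cart": ["db"],
    "payments": ["db", "gateway"],
    "db": ["storage"],
}
alerts = {"checkout", "cart", "db"}
print(probable_root_causes(topology, alerts))
```

Here three alerts collapse to a single candidate, the database, which is where remediation effort would be focused first.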
All stages of AIOps maturity are important. Enterprises should select tools that support as many of these stages as possible and ones that enable portability across tools (see Note 7). These stages should be used in a stepwise manner to ensure that IT operations staff can obtain value as they learn.

Acronym Key and Glossary Terms

APM
application performance monitoring
BAM
business activity monitoring
DA
domain-agnostic
DC
domain-centric
ITIM
IT infrastructure monitoring
ITOM
IT operations management
ITSM
IT service management
NLP
natural language processing
NPMD
network performance monitoring and diagnostics
SIEM
security information and event management

Evidence

There was an increase of more than 25% in inquiries between Gartner analysts and end users over the past 12 months covering various aspects of AIOps. The topics of these inquiries included:
  • Platform selection
  • Deployment strategy
  • Multiple AIOps use cases within and outside IT to aid visualization, decisions and diagnostics
Of the AIOps interactions, 5% were related to the DevOps use case and 15% were related to event correlation.

Note 1: Representative Vendor Selection

The vendors listed in this research were picked as a sample based on having one or two of the following characteristics:
  • Domain-agnostic solutions with the ability to ingest data from multiple sources, including historic and real-time streaming.
  • Domain-centric solutions with ML built into the tool.
  • Different offerings that include proprietary, open-source, free and commercialized versions, including deployment that cuts across on-premises and SaaS-based options.

Note 2: AIOps Use Cases Expanding

To date, AIOps functionality has been used primarily in support of IT operations processes that enable monitoring or observation of IT infrastructure, application behavior or digital experience. Almost always, AIOps platform investments have been justified on the basis of their ability to decrease mean time to problem resolution and the resultant cost reduction. This justification has applied regardless of whether the platform uses machine learning to:
  • Reduce event volumes and false alarms.
  • Detect anomalous values in time-series data.
  • Perform root cause analysis using bytecode instrumentation or distributed tracing data along with graph analysis in an APM context.
However, this is changing to also satisfy other types of use cases.
AIOps is used in digital experience monitoring to improve employee productivity by using chatbots to deliver friction-free answers to problems employees may face (see “Market Guide for Digital Experience Monitoring”).
In some cases, security and IT operations teams are exploring opportunities to leverage a common platform (see “Align NetOps and SecOps Tool Objectives With Shared Use Cases”). As AIOps platforms mature, they will be used to enable use cases requiring correlation across IT and security operations.
Non-IT groups like line-of-business owners and teams that sit outside IT operations (such as application developers and DevOps) are increasingly showing interest in AIOps technologies to surface insights across a multitude of datasets (see “Augment Decision Making in DevOps Using AI Techniques”).
Since January 2019, Gartner clients have expressed growing interest in designing dashboards showing real-time analysis of customer satisfaction, customer journeys (see “Digital Business KPIs: Defining and Measuring Success”), the order process and business health. The goal in this case is to present line-of-business owners with real-time AIOps-provided insights into the impact of IT on business, keeping them informed and enabling them to make decisions based on relevant data.

Note 3: AITSM

AITSM is not an acronym. It is a term that refers to the application of context, assistance, actions and interfaces of AI, automation and big data on ITSM tools and practices to improve the overall effectiveness, efficiency and error reduction for I&O staff. AITSM is important for intermediate and advanced use cases to automate and support complex environments.

Note 4: Challenges in Automating Actions Based on Prescriptive Advice

Automated actions fall under multiple categories:
  • Tasks of a predetermined nature that can be planned well in advance (for example, patch management or deployment of new builds)
  • Tasks that can’t be planned well in advance, but have known triggers that may or may not recur frequently. In this case, the procedures are well documented (for example, workload optimization in a virtualized environment).
  • Tasks with unpredictable triggers where the actions are well known, but not well documented (for example, known anomalies).
I&O leaders usually do not want to leave the action entirely to the machines and require at least a validation step before triggering an automation. This lack of trust is one of the main inhibitors preventing common usage of automated actions.
Of these three categories of automated actions, we see the greatest interest in the third; however, it is technically difficult to handle, and thus its adoption has been minimal to date.

Note 5: Education for Citizen Data Scientists

A citizen data scientist can be designated or “volunteered” based on their interest and mathematical or statistical skills. The goal of this education is not necessarily to create your own algorithms. Instead, it should be to better understand the results of probabilistic algorithms and be prepared to understand the implications of use cases that evolve from using unsupervised algorithms to using supervised algorithms for more predictive and prescriptive purposes (see “Maximize the Value of Your Data Science Efforts by Empowering Citizen Data Scientists”).

Note 6: Data Sources for AIOps Platforms

Data sources for AIOps platforms include:
  • API
  • Application logs
  • CRM data
  • Customer data
  • Events
  • Graph
  • ITSM
  • Metadata
  • Metrics
  • Social
  • Traces
  • Wire
Unfortunately, no matter how large or how frequently updated a given dataset is, restriction to a single data source tends to limit the insights into system behavior. Modern IT systems — with their modularity and dynamism — require a multiperspective approach to understand what is happening as they are being observed.

Note 7: Portability

As an enterprise’s AIOps adoption matures with functional models and quality outcomes, switching vendors becomes difficult. Replicating existing high-quality dashboards with a different vendor takes time, which eliminates any value gained through direct cost savings. Gartner has observed a reluctance to switch vendors during contract renewal precisely for this reason in enterprises with more mature deployments.
The need for viable options to challenge incumbents has given rise to questions regarding portability of algorithms across vendors. So far, this need comes from a small number of enterprises with mature AIOps adoption. The market is still at a high-growth stage, and it will be at least a couple of years before we see rising pressure from enterprises for portability and vendors responding with portability as a differentiator.
Some vendors are coming up with transfer learning, which is still in nascent stages. In its simpler form, end users are offered the option of training a selected model by using historical data. The results from the algorithm are compared against real-time results. Once the outcomes show a fair amount of accuracy with acceptable error margins, the end user can use the same algorithm for analyzing real-time data. This capability works best between preproduction and production environments or between the edge and the data center environments. Evolution of more complex use cases will require maturity and advanced skills on both the vendor and end-user side.
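The comparison step described above, in which results from a model trained on historical data are checked against real-time results before the model is trusted, can be sketched as a simple acceptance gate. Mean absolute percentage error and the 10% margin are assumptions chosen for this illustration; the source does not say which accuracy measure or margin vendors use.

```python
from typing import Sequence


def mean_abs_pct_error(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Average of |predicted - observed| / |observed|, skipping zero observations."""
    pairs = [(p, o) for p, o in zip(predicted, observed) if o != 0]
    if not pairs:
        raise ValueError("no nonzero observations to compare against")
    return sum(abs(p - o) / abs(o) for p, o in pairs) / len(pairs)


def ready_for_live_use(predicted: Sequence[float], observed: Sequence[float],
                       max_error: float = 0.10) -> bool:
    """Accept the model only when its error is within the agreed margin."""
    return mean_abs_pct_error(predicted, observed) <= max_error


# Made-up values: model output versus what was actually observed in real time.
model_output = [100, 110, 95, 105]
real_time = [100, 100, 100, 100]
print(ready_for_live_use(model_output, real_time))
```

The same gate could be rerun periodically after promotion, since a model that was accurate in preproduction may drift once it sees production data.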