Operations in the Era of Digital Intelligence: AIOps Implementation Practice at JD Digital Technology (I)

1. Background

Since Gartner introduced the concept of AIOps in 2016, platformization and intelligence have become the prevailing direction for operations and maintenance (O&M) systems. Broadly, O&M has developed through five stages: manual and script-based O&M, standardized O&M tools, platform-based automated O&M, DevOps, and AIOps.

Automated O&M is far more efficient than manual and script-based O&M, but the software can only run the processes we define in advance. It cannot adapt on its own, and it cannot handle "new" problems that merely resemble known ones. AI offers good answers to several of these pain points, and AIOps emerged to bring AI capabilities to IT O&M.

By applying big data and artificial intelligence, AIOps learns rules from massive volumes of monitoring data and from complex IT software and hardware. It can then detect anomalies, locate faults, and predict risks automatically, accurately, and quickly, which improves both the availability of enterprise IT systems and the efficiency of O&M, and it can tackle problems that automated O&M cannot solve. The growth of the AIOps field is driven mainly by two factors, the accumulation of standardized O&M data and the demands of the O&M business itself:

  • Mature integrated O&M platforms, such as the CMDB platform, the monitoring platform, and the process management center, lay a solid foundation for accumulating standardized O&M data;
  • Monitoring coverage is sufficient, but systematic processes and methods for handling the data are lacking. Larger data volumes and a more complex, dynamic O&M environment make automated O&M alone unsustainable.

To meet these needs, AIOps focuses on:

  • Enabling DevOps: use AI to solve problems that automated O&M cannot;
  • Real-time analysis and handling: with intelligent algorithms and a rising level of automation, diagnose problems in real time and suggest actions, greatly reducing mean time to detect (MTTD) and mean time to repair (MTTR);
  • Reducing alarm noise: use the data correlation capability of AIOps to establish the relationships among infrastructure, applications, and business services, and keep refining the algorithms so that alarm noise filtering steadily improves and false positives decline;
  • Failure cause analysis and prediction: use large-scale data analysis to help identify the root cause of a problem and to mine event patterns in time series, so that likely future behavior can be anticipated and O&M recommendations can be given.
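The two metrics named in the list above are simple averages over incident records. The following is a minimal sketch, not the team's implementation: the `Incident` fields are hypothetical, and MTTR is assumed here to be measured from detection to resolution (some teams measure it from the start of the fault instead).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record layout; real incident systems will differ.
    started: datetime    # when the fault actually began
    detected: datetime   # when monitoring raised it
    resolved: datetime   # when service was restored


def mttd(incidents):
    """Mean time to detect: average of (detected - started)."""
    seconds = mean((i.detected - i.started).total_seconds() for i in incidents)
    return timedelta(seconds=seconds)


def mttr(incidents):
    """Mean time to repair, assumed here to run from detection to resolution."""
    seconds = mean((i.resolved - i.detected).total_seconds() for i in incidents)
    return timedelta(seconds=seconds)


t0 = datetime(2020, 11, 11, 0, 0)
incidents = [
    Incident(t0, t0 + timedelta(minutes=4), t0 + timedelta(minutes=34)),
    Incident(t0, t0 + timedelta(minutes=2), t0 + timedelta(minutes=12)),
]
print("MTTD:", mttd(incidents))
print("MTTR:", mttr(incidents))
```

Tracking both numbers per service over time is one way to check whether faster detection is actually translating into faster recovery.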

To work effectively in the AIOps field, the intelligent O&M team at JD Digital Technology allocates team roles deliberately. O&M engineers, development engineers, and algorithm engineers play different roles, and all three are indispensable. The following describes some of the team's attempts and experience, organized by this internal division of responsibilities.

  • O&M engineer: extracts requirements for intelligent features from the technical operation of the business, thinks through the proposed scheme and standardizes data formats before development begins, and in the early stage explores and verifies feasibility through simulation and drafts an appropriate solution;
  • Development engineer: develops the platform's functions and modules, lowers the barrier to use, improves user efficiency, and presents the data delivered by the O&M data engineers to users in a friendly way;
  • Algorithm engineer: understands and consolidates the requirements from O&M engineers together with the candidate algorithm schemes, and produces the final implementable solution. During project implementation, the algorithm engineer weighs robustness and agility, splits tasks sensibly, and makes sure results are delivered, so as to improve the final quality of business operations.

The intelligent O&M team at JD Digital Technology has worked in this area for a long time. It has invested heavily in each of the focus areas above, connected all the links between them, and continues to accumulate both O&M knowledge and algorithms. It serves internal teams and also provides a large number of services externally, offering reliable algorithm services that apply AI across O&M scenarios. During routine O&M and during major sales promotions, all algorithm components have performed well. Beyond high performance, the generality, automation, and robustness of our O&M solutions are our primary goals. By continuing to explore AIOps and to study the full range of O&M scenarios, we aim to keep iterating on AIOps components and products that empower both internal and external users and drive the digital transformation of O&M with AI.

2. Common AIOps scenarios

AIOps builds intelligent O&M scenarios step by step around three basic concerns: quality assurance, cost management, and efficiency improvement. Quality assurance is subdivided into basic scenarios such as anomaly detection, fault diagnosis, fault prediction, and fault self-healing. Cost management is subdivided into basic scenarios such as metric monitoring, anomaly detection, resource optimization, capacity planning, and performance optimization. Efficiency improvement is subdivided into basic scenarios such as intelligent prediction, intelligent change, intelligent Q&A, and intelligent decision-making.

3. Building AIOps capabilities

Building AIOps can start with the exploration of a single scenario. The scenarios are then gradually refined and chained together until the O&M algorithms solve a complete problem end to end, and on that algorithmic basis the result is polished into a general, process-oriented intelligent O&M solution. The typical evolution path in the industry is as follows:

  • AI capabilities are being tried out, but there is no mature single-point application yet.
  • AI O&M capability exists for single scenarios, and reusable learned components ("learning pieces") begin to take shape for internal use.
  • Multiple single-scenario AI O&M modules are chained into a process-oriented AI O&M capability, which can provide reliable O&M AI components.
  • The main O&M scenarios have process-oriented AI O&M capability that needs no manual intervention, which can provide reliable AIOps services.
  • A central AI hub weighs cost, quality, and efficiency together, meets the different targets for each of the three at different stages of the business life cycle, and achieves multi-objective or on-demand optimization.
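The last stage, balancing cost, quality, and efficiency differently at different life-cycle stages, can be illustrated with a weighted-sum scoring sketch. This is only one simple way to frame multi-objective selection, not the method the source describes. The stage names, weights, and candidate plans are all hypothetical, and every score is assumed to be normalized to [0, 1] with higher meaning better (so a high cost score means cheap).

```python
# Hypothetical per-stage weights for the three objectives; values are illustrative only.
STAGE_WEIGHTS = {
    "launch": {"cost": 0.2, "quality": 0.3, "efficiency": 0.5},
    "mature": {"cost": 0.3, "quality": 0.5, "efficiency": 0.2},
    "sunset": {"cost": 0.6, "quality": 0.3, "efficiency": 0.1},
}


def score(plan, stage):
    """Weighted sum of a plan's normalized objective scores for one stage."""
    weights = STAGE_WEIGHTS[stage]
    return sum(weights[name] * plan[name] for name in weights)


def best_plan(plans, stage):
    """Name of the plan with the highest weighted score for the given stage."""
    return max(plans, key=lambda name: score(plans[name], stage))


# Hypothetical candidate plans with normalized scores (higher is better).
plans = {
    "scale_out": {"cost": 0.2, "quality": 0.9, "efficiency": 0.7},
    "consolidate": {"cost": 0.9, "quality": 0.5, "efficiency": 0.4},
}

for stage in STAGE_WEIGHTS:
    print(stage, "->", best_plan(plans, stage))
```

A weighted sum is easy to explain but can miss trade-offs on a non-convex frontier, so it is best read as a starting point rather than a full multi-objective optimizer.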

At present, the intelligent O&M team at JD Digital Technology serves internal users through four product platforms: the indicator identification platform, the alarm identification platform, the log clarification platform, and the fault detection platform. It can also provide algorithm model files for specific scenarios and containerized deployment schemes for algorithm components.

4. AIOps in practice

In 2020, the intelligent O&M team at JD Digital Technology has been bringing AI solutions to every O&M scenario as it builds out digital operations and accelerates the rollout of AIOps. For anomaly discovery and root cause localization in particular, the approach is as follows. Drawing on the characteristics of the two data sources, metric values and log text, we built a "mortise and tenon" algorithm design. While preserving platform portability, this design broadens the range of scenarios the algorithms can match, improves the accuracy of automatic algorithm orchestration, and leaves room for custom algorithm extensions. We will continue to increase investment and, while accumulating business and O&M knowledge, use AIOps to empower business R&D, product, and operations teams: internally to reduce costs and raise production efficiency, and externally to drive industrial digital transformation with AI.

The intelligent O&M platform at JD Digital Technology embeds many pluggable components that are simple to configure, convenient to use, accurate, and timely. Its three modules, fault detection, fault location, and fault repair, are described in detail below:

  • Fault detection module: quickly finds anomalies in time-series monitoring data.
  • Fault location module: accurately locates root causes in complex systems.
  • Fault repair module: combines the O&M knowledge graph with the experience of O&M experts to recommend solutions that repair faults quickly.

The three modules build on one another to improve both the O&M experience and O&M efficiency. The overall process quickly detects faults, automatically locates anomalies, recommends solutions for abnormal events, and in some scenarios achieves fault self-healing. This greatly reduces the cost of having R&D configure fixed thresholds and of O&M troubleshooting, and greatly improves O&M service quality and business availability.

Apart from static configuration attributes, most of the data in an O&M monitoring system is time-series data, in the form of time-series metrics and time-series logs. Judging whether the business is abnormal from this mass of time-series data is an important means of fault discovery. For numerical metrics that are highly varied and have complex relationships, the metric anomaly detection component group supports rapid automatic orchestration and covers anomaly types such as sudden multi-feature rises and drops, cliff-like peaks and troughs, and abnormal trends in O&M metrics. Adaptive algorithms and predefined orchestration strategies handle influencing factors such as metric dimensions, periodic or implicit patterns, holidays and promotional activities, and emergencies. No thresholds or rules need to be configured by hand, which helps R&D and O&M staff quickly find anomalies that rules struggle to identify. Alarm methods can be configured independently to avoid false positives and alarm storms. In the metric anomaly detection module, we introduce waveform analysis to examine how anomalies propagate between metrics, combining spatial and temporal features to improve detection accuracy. Spatiotemporal data analysis is an important means of finding anomalies that are hard to identify with rules and strategies.
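To make "no manually configured thresholds" concrete, here is a minimal sketch of one common adaptive technique: comparing each point with the median and median absolute deviation (MAD) of a trailing window. It is a stand-in for illustration only. The source does not say which algorithms the component group uses, and the function name, window size, and cutoff `k` are assumptions.

```python
from statistics import median


def mad_anomalies(values, window=12, k=3.5):
    """Indices whose value lies more than k robust sigmas from the trailing-window median."""
    flagged = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        med = median(history)
        mad = median(abs(v - med) for v in history)
        # 1.4826 scales MAD to a standard deviation under normality;
        # the fallback avoids division by zero on a perfectly flat history.
        scale = 1.4826 * mad or 1e-9
        if abs(values[i] - med) / scale > k:
            flagged.append(i)
    return flagged


# Hypothetical request-latency samples with one sudden spike at index 12.
series = [100, 102, 99, 101, 100, 98, 101, 100, 102, 99, 100, 101, 180, 100, 99]
print(mad_anomalies(series))
```

Because the median and MAD are robust statistics, the spike itself barely shifts the baseline for the points that follow it. Seasonality, holidays, and promotions would need additional handling that this sketch does not attempt.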

For golden business metrics and key monitored metrics, alarm log analysis can be configured. It captures transient exceptions at the text log level, analyzes log content, identifies the subject of the exception, merges exception event types, and supports both verification of detected anomalies and the correlation analysis used later in root cause localization. After practical and theoretical validation against a large number of abnormal events, the three algorithm component groups follow a defined orchestration, and the algorithm components inside them adapt automatically to the metric data they receive, covering the full range of O&M scenarios.
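Merging exception event types usually starts by reducing raw log lines to templates. The sketch below masks variable tokens with regular expressions and counts lines per template. It is an assumption-laden illustration rather than the platform's method: the mask patterns, placeholder tokens, and sample log lines are all hypothetical.

```python
import re
from collections import Counter

# Order matters: mask the most specific patterns first.
_MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]


def to_template(line):
    """Replace variable tokens so lines of the same event type compare equal."""
    for pattern, token in _MASKS:
        line = pattern.sub(token, line)
    return line


def group_by_template(lines):
    """Count how many log lines fall under each template."""
    return Counter(to_template(line) for line in lines)


# Hypothetical log lines.
logs = [
    "Timeout connecting to 10.0.3.17 after 3000 ms",
    "Timeout connecting to 10.0.8.2 after 3000 ms",
    "Order 88231 failed: inventory service returned 503",
    "Order 88499 failed: inventory service returned 503",
    "Order 88500 failed: inventory service returned 500",
]
for template, count in group_by_template(logs).most_common():
    print(count, template)
```

Note that masking every number also hides the difference between status codes 503 and 500. A real pipeline would likely keep a small set of meaningful fields unmasked.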

Traditional O&M fault location depends heavily on the experience of O&M staff and on choosing the right direction for troubleshooting. Capturing that expert experience and making it intelligent is the key to solving fault location. Static CMDB configurations and call chain relationships can be queried, but exceptions often arise while things are changing dynamically. The O&M knowledge graph is our team's most effective tool for this scenario. Intelligent fault location aims to solve root cause localization in complex systems, and combining an O&M knowledge graph with reinforcement learning is an excellent way to apply AI here.

The reinforcement learning algorithm performs a global, level-by-level search, visiting all potentially related nodes to keep root cause localization accurate. The O&M knowledge graph gives the search structure and direction, so that it is not carried out in isolation but takes call changes and configuration changes into account.
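The idea of a level-by-level search over a dependency graph can be sketched with a plain breadth-first traversal. To be clear, this swaps in BFS for the reinforcement learning approach described above and simplifies it further by following only anomalous nodes. The graph, service names, and the `anomalous` set are hypothetical.

```python
from collections import deque


def candidate_root_causes(graph, anomalous, start):
    """Walk the dependency graph level by level from `start`, following anomalous nodes.

    Returns the anomalous nodes that have no anomalous dependency of their own,
    in the order they were reached.
    """
    seen = {start}
    queue = deque([start])
    candidates = []
    while queue:
        node = queue.popleft()
        bad_deps = [dep for dep in graph.get(node, []) if dep in anomalous]
        if not bad_deps:
            # Nothing downstream looks wrong, so this node is a candidate root cause.
            candidates.append(node)
        for dep in bad_deps:
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return candidates


# Hypothetical call-chain dependencies: service -> services it depends on.
graph = {
    "checkout-api": ["order-svc", "payment-svc"],
    "order-svc": ["inventory-svc", "mysql-orders"],
    "payment-svc": ["redis-cache"],
    "inventory-svc": ["mysql-orders"],
}
anomalous = {"checkout-api", "order-svc", "mysql-orders"}
print(candidate_root_causes(graph, anomalous, "checkout-api"))
```

In a knowledge graph that also records recent call and configuration changes, those change events could be attached to nodes and used to prioritize which branch to expand first.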

The O&M knowledge graph we use is dynamic and extensible, and it already ingests configuration data, logs, alarms, changes, and other information. Standardized data is onboarded quickly and automatically, and the graph is highly compatible with other O&M systems.

When the search ends, the algorithm automatically corrects and ranks the candidate root causes, and calls the log analysis system to compute a confidence for each recommended root cause. The fault location results are stored as a fault analysis report, which makes them easy to query during O&M and makes the algorithm's accuracy easy to check.
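The ranking and reporting step might look roughly like the following sketch. The confidence measure used here, the share of error log lines that mention a candidate, is a deliberately naive assumption chosen for illustration. The source does not describe how its log analysis system computes confidence, and the report fields are hypothetical.

```python
import json


def build_report(incident_id, candidates, error_logs):
    """Rank candidates by the share of error log lines that mention them."""
    total = len(error_logs) or 1  # guard against an empty log sample
    ranked = sorted(
        (
            {
                "component": name,
                "confidence": round(sum(name in line for line in error_logs) / total, 2),
            }
            for name in candidates
        ),
        key=lambda item: item["confidence"],
        reverse=True,
    )
    return {"incident_id": incident_id, "root_causes": ranked}


# Hypothetical candidates and error log lines.
error_logs = [
    "mysql-orders: connection pool exhausted",
    "order-svc: query to mysql-orders timed out",
    "mysql-orders: too many connections",
    "redis-cache: eviction rate high",
]
report = build_report("INC-001", ["redis-cache", "mysql-orders"], error_logs)
print(json.dumps(report, indent=2))
```

Storing the report as structured data rather than free text is what makes later accuracy checks straightforward: the top-ranked component can be compared with the cause confirmed in the postmortem.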

In the intelligent fault repair stage, the experience of O&M experts again guides the analysis of fault events and the generation of feasible action suggestions and operational risk indicators. When the fault location module issues a recommended root cause, it retrieves the associated data from the knowledge graph, mines fault associations with an association analysis algorithm, and generates an event description report. The O&M knowledge graph then checks the full link according to call chain dependencies, gives repair suggestions and operational risk warnings, and achieves fault self-healing in some scenarios.
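Association analysis over past incidents can be sketched with the classic support and confidence measures for pairwise rules. This is a minimal illustration under stated assumptions, not the team's algorithm: the alert names, the incident history, and both cutoffs are hypothetical.

```python
from collections import Counter
from itertools import permutations


def association_rules(incidents, min_support=0.5, min_confidence=0.8):
    """Return (a, b, support, confidence) for rules "a implies b" across incidents."""
    n = len(incidents)
    item_count = Counter(item for inc in incidents for item in set(inc))
    pair_count = Counter(
        pair for inc in incidents for pair in permutations(sorted(set(inc)), 2)
    )
    rules = []
    for (a, b), both in pair_count.items():
        support = both / n                 # share of incidents containing a and b
        confidence = both / item_count[a]  # share of incidents with a that also have b
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, support, confidence))
    return sorted(rules)


# Hypothetical history: the set of alerts seen in each past incident.
history = [
    {"db_slow_query", "order_timeout"},
    {"db_slow_query", "order_timeout", "cpu_high"},
    {"db_slow_query", "order_timeout"},
    {"cpu_high"},
]
for rule in association_rules(history):
    print(rule)
```

A rule such as "db_slow_query implies order_timeout" is only a co-occurrence signal. Combining it with call chain direction from the knowledge graph is what would turn it into a repair suggestion.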
