Zeng et al., 2023 - Google Patents

Traceark: Towards actionable performance anomaly alerting for online service systems

Zeng et al., 2023

Document ID: 14866713922410407561
Author: Zeng Z; Zhang Y; Xu Y; Ma M; Qiao B; Zou W; Chen Q; Zhang M; Zhang X; Zhang H; Gao X; Fan H; Rajmohan S; Lin Q; Zhang D
Publication year: 2023
Publication venue: 2023 IEEE/ACM 45th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP)

External Links

Cited by

Snippet

Performance anomaly alerting based on trace data plays an important role in assuring the quality of online service systems. However, engineers find that many anomalies reported by existing techniques are not of interest for them to take further actions. For a large scale …

Continue reading at zhangyuqun.github.io (PDF) (other versions)

Classifications

- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/34—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
- G06F11/3466—Performance evaluation by tracing or monitoring
- G06F11/3495—Performance evaluation by tracing or monitoring for systems
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/34—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
- G06F11/3409—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment for performance assessment
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Error detection; Error correction; Monitoring responding to the occurence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0766—Error or fault reporting or storing
- G06F11/0775—Content or structure details of the error report, e.g. specific table structure, specific error fields
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Error detection; Error correction; Monitoring responding to the occurence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0706—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
- G06F11/0709—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Error detection; Error correction; Monitoring responding to the occurence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/079—Root cause analysis, i.e. error or fault diagnosis
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3065—Monitoring arrangements determined by the means or processing involved in reporting the monitored data
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/36—Preventing errors by testing or debugging software
- G06F11/3668—Software testing
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/36—Preventing errors by testing or debugging software
- G06F11/3604—Software analysis for verifying properties of programs
- G06F11/3612—Software analysis for verifying properties of programs by runtime analysis
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F2201/00—Indexing scheme relating to error detection, to error correction, and to monitoring
- G06F2201/86—Event-based monitoring
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06Q—DATA PROCESSING SYSTEMS OR METHODS, SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL, SUPERVISORY OR FORECASTING PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL, SUPERVISORY OR FORECASTING PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management, e.g. organising, planning, scheduling or allocating time, human or machine resources; Enterprise planning; Organisational models
- G06Q10/063—Operations research or analysis
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F2201/00—Indexing scheme relating to error detection, to error correction, and to monitoring
- G06F2201/875—Monitoring of systems including the internet
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06Q—DATA PROCESSING SYSTEMS OR METHODS, SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL, SUPERVISORY OR FORECASTING PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL, SUPERVISORY OR FORECASTING PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/10—Office automation, e.g. computer aided management of electronic mail or groupware; Time management, e.g. calendars, reminders, meetings or time accounting
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06N—COMPUTER SYSTEMS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N99/00—Subject matter not provided for in other groups of this subclass
- G06N99/005—Learning machines, i.e. computer in which a programme is changed according to experience gained by the machine itself during a complete run
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions

Similar Documents

Publication	Publication Date	Title
Zeng et al.	2023	Traceark: Towards actionable performance anomaly alerting for online service systems
Liu et al.	2021	Microhecl: High-efficient root cause localization in large-scale microservice systems
Chen et al.	2024	Automatic root cause analysis via large language models for cloud incidents
Soldani et al.	2022	Anomaly detection and failure root cause analysis in (micro) service-based cloud applications: A survey
Zhao et al.	2021	Identifying bad software changes via multimodal anomaly detection for online service systems
Aggarwal et al.	2020	Localization of operational faults in cloud applications by mining causal dependencies in logs using golden signals
He et al.	2022	An empirical study of log analysis at Microsoft
CN102436376B (en)	2016-01-13	For the pattern checking that Distributed Application confirms
Yu et al.	2023	TraceRank: Abnormal service localization with dis‐aggregated end‐to‐end tracing data in cloud native systems
Jiang et al.	2011	Efficient fault detection and diagnosis in complex software systems with information-theoretic monitoring
CN115309575B (en)	2025-09-23	Microservice fault diagnosis method, device and equipment based on graph convolutional neural network
US12061539B2 (en)	2024-08-13	System and method for anomaly detection and root cause automation using shrunk dynamic call graphs
Zhang et al.	2022	Putracead: Trace anomaly detection with partial labels based on gnn and pu learning
Yang et al.	2021	AID: efficient prediction of aggregated intensity of dependency in large-scale cloud systems
Chen et al.	2022	Deep attentive anomaly detection for microservice systems with multimodal time-series data
Cai et al.	2019	A real-time trace-level root-cause diagnosis system in alibaba datacenters
Ding et al.	2023	Tracediag: Adaptive, interpretable, and efficient root cause analysis on large-scale microservice systems
Pham et al.	2024	Baro: Robust root cause analysis for microservices via multivariate bayesian online change point detection
Chen et al.	2021	Trace-based intelligent fault diagnosis for microservices with deep learning
Rouf et al.	2024	Instantops: A joint approach to system failure prediction and root cause identification in microserivces cloud-native applications
Ali et al.	2021	[Retracted] Classification and Prediction of Software Incidents Using Machine Learning Techniques
Hrusto et al.	2024	Autonomous monitors for detecting failures early and reporting interpretable alerts in cloud operations
Saha et al.	2024	Towards business process observability
Keeton et al.	2010	Do you know your IQ? A research agenda for information quality in systems
Hu et al.	2022	TS-InvarNet: Anomaly detection and localization based on tempo-spatial KPI invariants in distributed services