By Helmuth Fuchs
Moneycab: Uli, you are leading the development team of CurIX (Cure Infrastructure in XaaS), software that helps prevent outages of IT components and services. How exactly does this work?
Uli Siebold: CurIX is a product that enriches existing monitoring solutions with significant additional alerting functionality to prevent outages. Out of the many anomalies that current tools identify, and which suffer from many false alarms, CurIX identifies and signals possible future failures very early. Possible future outages can therefore be prevented effectively. The main factors for this are precise predictions with few false alarms, timely alerting, and the provision of sufficient information to enable early preventive action.
«We are now able to cover the complete chain from data collection through anomaly detection and failure prediction to fault localization.» Uli Siebold, Head of Development, CurIX
Monitoring is often already part of existing IT solutions. How does CurIX differ from those?
First of all, CurIX is much more than a monitoring tool as such: it is an add-on that enhances existing tools. One of the main differences is the activity level CurIX provides. Existing tools provide their information mainly in reactive phases, that is, after a failure has already happened. CurIX provides important information in the phase before a failure occurs, so it acts proactively. It analyzes data collected by monitoring tools and triggers alerts as soon as failures become likely in the near future. This is implemented in a very lightweight and user-friendly way, with practically no interference with the monitored systems and with information that operators can easily manage. CurIX monitors time series continuously, which simplifies the operation of a 24×7 monitoring service.
You consider CurIX an add-on to existing monitoring solutions. How do you ensure compatibility, and what implementation effort is required?
We achieve compatibility in two ways. At the technological level, we provide interface components that connect to monitoring tools at the software level. If necessary, we can also provide customer-specific interfaces.
At the conceptual level, CurIX works on so-called Key Performance Indicators (KPIs), which are very general, adaptable, and flexible. KPIs can be extracted from basically all kinds of raw data, and existing monitoring solutions usually provide interfaces to read out KPIs directly. Our approach is general by design, so we are able to connect to any monitoring or SIEM tool the customer uses.
During the development of CurIX, we also analyze very different kinds of data that are successfully exploited in many domains, e.g. flight delays or even heartbeat data. This ensures flexibility during development as well.
There are solutions that use thresholds for triggering alerts. In CurIX you use so-called baselines. What are the differences, and what are the advantages?
Thresholds are a very robust means of implementing warning systems; not without reason are they used and accepted in many places. But they have limits: thresholds have to be defined before they can be applied, and this definition can be erroneous, which we will only notice when it is too late. Moreover, simple thresholds identify many anomalies that are not related to failing behavior but are simply exceptional yet manageable cases; as such, they suffer from the many false alarms that represent a severe limitation of current technologies. CurIX renders these threshold disadvantages obsolete by using dynamically tuned baseline models, enhanced with data analytics and deep-learning-based analysis, which refine the models and incrementally reduce false alarms to a negligible amount.
«With CurIX we detect anomalies that are not detectable with pure threshold approaches.»
Let me give an example: if we want to monitor the reliability of a mailing infrastructure, we could apply a threshold-based alerting approach that alerts us as soon as the number of emails per hour falls below a predefined threshold. A threshold that represents the average number of emails during working hours will probably cause an alarm during holidays. On the other hand, a threshold that represents the minimum number of emails per hour during holidays will be almost useless during working hours. As a consequence, we need a dynamic threshold that adapts to working and non-working hours. This is exactly what CurIX provides, even automatically. With CurIX we detect anomalies that are not detectable with pure threshold approaches.
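The dynamic-threshold idea from the email example can be sketched in a few lines. This is a minimal illustration only, not CurIX's actual implementation: the per-hour baseline, the 3-sigma alerting rule, and all numbers are assumptions made for the example.

```python
from statistics import mean, stdev

def build_baseline(history):
    """history: list of (hour_of_week, email_count) observations.
    Returns a per-hour (mean, stdev) baseline."""
    buckets = {}
    for hour, count in history:
        buckets.setdefault(hour, []).append(count)
    return {h: (mean(v), stdev(v) if len(v) > 1 else 0.0)
            for h, v in buckets.items()}

def is_anomalous(baseline, hour, count, k=3.0):
    """Alert when an observation deviates more than k standard
    deviations from the baseline for that hour of the week."""
    mu, sigma = baseline[hour]
    return abs(count - mu) > k * max(sigma, 1.0)

# Invented data: working hours (hour 10) average ~200 emails, night (hour 2) ~5.
history = ([(10, c) for c in (190, 205, 210, 198)]
           + [(2, c) for c in (4, 6, 5, 5)])
baseline = build_baseline(history)

print(is_anomalous(baseline, 10, 20))   # low traffic during working hours -> True
print(is_anomalous(baseline, 2, 5))     # normal night traffic -> False
```

A single static threshold would have to pick one of the two regimes; the per-hour baseline handles both, which is the effect the answer above describes.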
One important aspect to consider when setting up monitoring systems is seasonality, e.g. annual closures that put heavy loads on systems. How do you deal with those, and which further aspects have you considered?
The consideration of seasonality is an important topic that is already addressed in the market. There are commercial as well as open-source solutions that cover many cases. Cycle lengths can be detected automatically and used for anomaly detection. This works well for seasons with constant periods, e.g. one day or one week. Some of our customers, however, work in environments that show seasonality with variable periods, e.g. quarter closures; in this case the season lengths vary, and existing solutions struggle with these circumstances. We worked out a concept to address non-equidistant season lengths: we use well-established methods and enrich them with additional capabilities.
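One way to handle variable-length seasons, sketched here as an assumption rather than as CurIX's actual method, is to index the baseline by an observation's relative position within its season instead of by a fixed period. Seasons of different lengths then share one baseline.

```python
def relative_position(t, season_start, season_end):
    """Map a timestamp to [0, 1) within its (variable-length) season."""
    return (t - season_start) / (season_end - season_start)

def bucket(t, season_start, season_end, n_buckets=10):
    """Discretize the relative position so that seasons of different
    length can be compared against the same baseline buckets."""
    pos = relative_position(t, season_start, season_end)
    return min(int(pos * n_buckets), n_buckets - 1)

# A 90-day quarter and a 92-day quarter map their respective midpoints
# (day 45 and day 46) to the same bucket:
print(bucket(45, 0, 90))   # -> 5
print(bucket(46, 0, 92))   # -> 5
```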
The amount of data is growing exponentially nowadays, for instance because of IoT, and complexity is growing as well. How will CurIX work in such environments in the future?
To address large amounts of data and complexity, we use two approaches. The first is the use of KPIs. They are lightweight because they only use numbers over time intervals, which can be aggregated easily. In this way we use a minimal amount of data that is anonymous by design, which has a positive impact on performance at all layers: data storage, data transfer, and analysis.
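As a rough illustration of why KPIs are lightweight, here is a count-per-interval aggregation; the function name, the interval, and the data are invented for this sketch, and CurIX's actual KPI extraction is not described in the interview.

```python
from collections import Counter

def to_kpi_series(event_timestamps, interval=3600):
    """Aggregate raw event timestamps (in seconds) into a KPI time
    series: one event count per fixed interval. Only the aggregated
    counts are kept, so no event payloads need to be stored."""
    counts = Counter(int(t // interval) for t in event_timestamps)
    if not counts:
        return []
    lo, hi = min(counts), max(counts)
    return [counts.get(i, 0) for i in range(lo, hi + 1)]

# Six raw events collapse into three numbers (one per hour):
events = [10, 50, 3700, 3710, 3720, 7300]
print(to_kpi_series(events))   # -> [2, 3, 1]
```

Whatever the raw volume, the series grows only with the number of intervals, which is the storage and transfer advantage mentioned above.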
The second approach is clustering, in which we connect different CurIX instances to handle large amounts of data and complexity at the same time.
What is the current state of progress of CurIX? What functions are available today and what is planned for the near future?
The development of CurIX Release 1.0 is finished. We are now able to cover the complete chain from data collection through anomaly detection and failure prediction to fault localization. CurIX currently runs in our SaaS environment for customers.
«In CurIX we have a set of parameters we can change and by this, we are able to achieve false alarm rates below 5%.»
In the next development step we will replace all third-party components; this will give customers more configuration options and reduce dependencies. We continuously develop interfaces to existing monitoring tools. The next major development will provide the ability to use more than one anomaly detector, which will allow customers to use their own anomaly detectors.
What is the current quality of the failure predictions? What is needed to become even more precise?
Exactly this is the main benefit CurIX delivers as an enhancement to existing tools. From our perspective, the customer looks at two measures: the true positive rate, which delivers the added value of being alarmed when it is necessary and should be as high as possible, and the false positive rate, which should be as low as possible to avoid false alarms. The two measures depend on each other: tuning one changes the other as well. In CurIX we have a set of parameters we can adjust, and by this we are able to achieve false alarm rates below 5%.
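The two measures can be made concrete with a small sketch. The confusion counts below are invented for illustration and do not reflect any CurIX benchmark.

```python
def rates(tp, fp, tn, fn):
    """Compute the two measures from a confusion matrix of
    predicted vs. actual failures."""
    tpr = tp / (tp + fn)   # share of real failures that raised an alert
    fpr = fp / (fp + tn)   # share of healthy periods that raised a false alarm
    return tpr, fpr

# Invented counts: 50 real failures, 100 healthy periods.
tpr, fpr = rates(tp=45, fp=4, tn=96, fn=5)
print(f"TPR={tpr:.2f}, FPR={fpr:.2f}")   # TPR=0.90, FPR=0.04
```

The dependency mentioned above shows up when sweeping an alerting threshold: loosening it converts false negatives into true positives but also healthy periods into false alarms, moving both rates together.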
In addition, CurIX provides the user with a list of candidates for the root cause of a possible failure. In lab experiments, we were able to provide a list of five candidates with a probability of 90% that one of them was the real root cause. This means administrators can focus on only a few components for preventive action, which saves time and money.
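A simple way to picture such a candidate list, purely as an invented sketch (the component names, scores, and ranking-by-score approach are assumptions; CurIX's actual localization method is not described here):

```python
def top_candidates(anomaly_scores, k=5):
    """anomaly_scores: {component: score}. Return the k components
    with the highest scores as root-cause candidates."""
    return sorted(anomaly_scores, key=anomaly_scores.get, reverse=True)[:k]

# Invented per-component anomaly scores:
scores = {"db": 0.91, "cache": 0.15, "queue": 0.78,
          "web": 0.40, "auth": 0.66, "dns": 0.05, "smtp": 0.33}
print(top_candidates(scores))   # -> ['db', 'queue', 'auth', 'web', 'smtp']
```

The operator then inspects only this short list instead of every component, which is the time saving described above.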
In your opinion, which technological developments will influence the field of proactive warning systems like CurIX in the future?
Current state-of-the-art monitoring and warning systems most often provide warnings but nothing more. From my point of view, future proactive warning systems will become more active, e.g. by triggering actions like restarting or repairing components. First attempts at this kind of thing can be observed in current technologies like containers and micro-service reboots. All these approaches can be summarized under the topic of resilience, which is also a growing field in science. Current research activities aim at larger cyber-physical systems. My personal opinion is that IT environments are probably perfectly suited to having these first resilience concepts applied to them. CurIX will contribute a lot here in the future.
Uli Siebold studied computer science at the universities of Karlsruhe and Freiburg. He earned his diploma in 2008 after finishing his thesis at Fraunhofer on statistical and time-series analyses of security-related events. From May 2008 he conducted research in the fields of urban security, airport security, technical safety, and risk analysis in several domains. He earned a Ph.D. in engineering with a thesis on model-based safety analysis. In 2017 he joined IC information company AG, where he now leads the software development department.
His main contributions to the development of CurIX are the design of the system architecture, time-series analysis, and his extensive experience with multi-year research and software development projects. His vision for CurIX is to increase the availability of IT environments (cloud and on-premise) and reduce operational costs by means of accurate failure prediction.