Diagnosing a Rock system

Rock generates a lot of data that could help diagnose problems, but this data is used very little and advertised even less.

This page first lists all the information sources (present either in master or on WIP branches) and outlines what could be done with them. In addition, I'll list some things that could be added to the system to improve it even further. The information sources are:

  • stream aligner status
  • transformer status
  • log status
  • lttng logging (on the 'lttng' branch of rtt)
  • process resource usage (drivers-orogen-taskmon)
  • wifi link status (drivers-orogen-wifimon)

Ideas

Simple metrics could already help tremendously:

  • ratio of received samples vs. rejected samples in the stream aligner
  • the inability of the transformer to find a transform chain even though it receives data (most of the time, this indicates a bad transformer configuration)
  • high latency in the stream aligner (latency being defined as the distance in time between "now" and the oldest sample in the streams)
  • consistently high CPU usage (this is a problem on a realtime processing system)
  • monitoring of starvation for high-availability components (using lttng)
  • monitoring of sample frequency on RTT output ports. This would need a simple addition to RTT: an ever-growing sample counter on each output port, incremented at each call to write. The counter would need to be exposed through the CORBA API as well.
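
As a rough illustration of how cheap these metrics are to compute, here is a minimal sketch in C++. The struct and function names (AlignerCounters, rejectionRatio, alignerLatency, sampleFrequency) are made up for this page and do not mirror the actual stream aligner status type or any RTT API. Times are assumed to be plain microsecond counts.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical snapshot of stream aligner counters (illustrative names only).
struct AlignerCounters
{
    std::uint64_t received; // samples pushed into the aligner
    std::uint64_t rejected; // samples the aligner dropped
};

// Fraction of received samples that were rejected, in [0, 1].
// Returns 0 when nothing has been received yet.
inline double rejectionRatio(AlignerCounters const& c)
{
    if (c.received == 0)
        return 0.0;
    return static_cast<double>(c.rejected) / static_cast<double>(c.received);
}

// Latency as defined in the list above: distance in time between "now" and
// the oldest sample still held in any stream. All times are in microseconds.
// Returns 0 when no stream holds a sample.
inline std::int64_t alignerLatency(std::int64_t now_us,
                                   std::vector<std::int64_t> const& oldest_sample_us)
{
    if (oldest_sample_us.empty())
        return 0;
    std::int64_t oldest =
        *std::min_element(oldest_sample_us.begin(), oldest_sample_us.end());
    return now_us - oldest;
}

// Sample frequency in Hz, computed from two readings of an ever-growing
// write counter taken elapsed_us microseconds apart.
inline double sampleFrequency(std::uint64_t count_before,
                              std::uint64_t count_after,
                              std::int64_t elapsed_us)
{
    assert(count_after >= count_before); // the counter never decreases
    if (elapsed_us <= 0)
        return 0.0;
    return static_cast<double>(count_after - count_before) * 1e6
        / static_cast<double>(elapsed_us);
}
```

A monitoring tool would poll the counters periodically and raise a warning when one of these values stays above (or below) a configured threshold for some time.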

Among more complex things that can be done:

  • given the shape of the dataflow network, one can look at how latency propagates using the stream aligner and the sample frequencies, and pinpoint possible causes. This requires some analysis, but would be really cool. In addition, using the CPU data, it can propose hypotheses on whether the problem comes from CPU starvation or from a lack of samples (driver/filter problem).
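
To make the last point a bit more concrete, here is a deliberately naive sketch of the hypothesis step for a single task. It does not walk the dataflow network at all. The type names, the fields and both thresholds are assumptions picked for illustration, not something that exists in Rock.

```cpp
// Hypothetical measurements for one task (illustrative names only).
struct TaskObservation
{
    double cpu_usage;             // fraction of one core, in [0, 1]
    double input_frequency_hz;    // measured sample rate on the task's input
    double expected_frequency_hz; // rate the task is expected to receive
};

enum class LatencyHypothesis
{
    CPU_STARVATION,  // inputs arrive on time, but the CPU is saturated
    LACK_OF_SAMPLES, // inputs arrive too slowly (driver/filter problem)
    UNKNOWN          // neither signal is present
};

// Propose a hypothesis for a task that shows high latency.
// The default thresholds are arbitrary placeholders.
inline LatencyHypothesis classify(TaskObservation const& o,
                                  double cpu_threshold = 0.9,
                                  double frequency_tolerance = 0.8)
{
    // Check the input rate first: if samples do not arrive, the cause is
    // upstream, whatever the local CPU usage is.
    if (o.input_frequency_hz < frequency_tolerance * o.expected_frequency_hz)
        return LatencyHypothesis::LACK_OF_SAMPLES;
    if (o.cpu_usage >= cpu_threshold)
        return LatencyHypothesis::CPU_STARVATION;
    return LatencyHypothesis::UNKNOWN;
}
```

A real implementation would apply this kind of test along the dataflow graph, starting from the task where the latency is observed and moving upstream until the first task whose inputs look healthy.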

Other things

The reliance on log files is really an issue, as their content is not formalized (and therefore hard to analyse automatically). The two biggest issues that currently make reading log files mandatory are:

  • we know if a hook fails, but not why
  • we know that a component goes into EXCEPTION but not what the exception is

The idea there would be to extend the current notification API (which currently only carries an int) into an API where both a symbol (what) and a data structure (why) can be sent. Ideally, it would mean that the notification output port (currently the ill-named "state" port) would carry a discriminated union (e.g. from boost). We would have to add support for these in orogen. This support could be done with the current means by creating our own "discriminated union" intermediate type and presenting the boost union as an opaque type. Our own discriminated union would simply be:

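struct TaskNotifications
{
   // What happened: selects which of the vectors below holds the payload
   NOTIFICATIONS type;
   // Why it happened: one vector per possible data type, empty unless selected
   std::vector<NotificationData0> data0;
   std::vector<NotificationData1> data1;
   ...
};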

One downside is that this notification API would not really be usable on the C++ side, since every component would have a different notification port type. So far, this is not an issue for Rock, though. Ideally, we should be able to implement it as an orogen plugin, and use the opportunity to add a base::Time field.

With such a discriminated union type, we would waste a little memory per possible data type (each unused vector still takes up space), but it is IMO not such a big deal and has the merit that it can be implemented right now (I think).
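
For illustration, here is what one concrete instance of such a type could look like for a single component, together with a consumer that turns it into a readable message. Everything here is hypothetical: the enum values, the two data structures and the helper functions are invented for the example, and a plain microsecond count stands in for the base::Time field so that the snippet is self-contained.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical notification symbols for one component (the "what").
enum NOTIFICATIONS { NOTIFY_RUNNING, NOTIFY_CONFIGURE_FAILED, NOTIFY_IO_ERROR };

// Hypothetical payloads (the "why").
struct ConfigureFailedData { std::string property; std::string reason; };
struct IoErrorData { int error_code; std::string device; };

struct TaskNotifications
{
    std::int64_t time_us; // stand-in for a base::Time field
    NOTIFICATIONS type;
    // One vector per payload type; only the one selected by 'type' is filled.
    std::vector<ConfigureFailedData> configure_failed;
    std::vector<IoErrorData> io_error;
};

// Producer side: build an I/O error notification.
inline TaskNotifications makeIoError(std::int64_t time_us, int error_code,
                                     std::string const& device)
{
    TaskNotifications n;
    n.time_us = time_us;
    n.type = NOTIFY_IO_ERROR;
    n.io_error.push_back(IoErrorData{error_code, device});
    return n;
}

// Consumer side: dispatch on the symbol and read the matching payload.
inline std::string describe(TaskNotifications const& n)
{
    switch (n.type)
    {
    case NOTIFY_RUNNING:
        return "running";
    case NOTIFY_CONFIGURE_FAILED:
        if (!n.configure_failed.empty())
            return "configure failed: " + n.configure_failed[0].property
                + ": " + n.configure_failed[0].reason;
        break;
    case NOTIFY_IO_ERROR:
        if (!n.io_error.empty())
            return "I/O error " + std::to_string(n.io_error[0].error_code)
                + " on " + n.io_error[0].device;
        break;
    }
    return "notification without data";
}
```

Since the payload vectors are plain members, a generic tool can still inspect the notification through the type system without knowing the component, which is the point of not relying on log files.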

Last modified on 12/01/14 23:50:02