Version 2 (modified by sylvain.joyeux, 5 years ago)

Diagnosing a Rock system

Rock generates a lot of data that can help diagnose problems, but this data is used very little and advertised even less.

This page first lists all the information sources (present either in master or on WIP branches) and then outlines what could be done with them. In addition, I'll list some things that could be added to the system to improve it even further.

  • stream aligner status
  • transformer status
  • log status
  • lttng logging (on the 'lttng' branch of rtt)
  • process resource usage (drivers-orogen-taskmon)
  • wifi link status (drivers-orogen-wifimon)

Ideas


Simple metrics could already help tremendously:

  • ratio of received samples vs. rejected samples in the stream aligner
  • failure of the transformer to find a transform chain even though it receives data (most of the time, this means a bad transformer configuration)
  • high latency in the stream aligner (latency being defined as the distance in time between "now" and the oldest sample in the streams)
  • consistently high CPU usage (this is a problem on a realtime processing system)
  • monitoring of starvation for high-availability components (using lttng)
  • monitoring of sample frequency on RTT output ports. This would need a simple addition to RTT: an ever-growing sample counter on each output port, incremented at each call to write. The counter would need to be added to the CORBA API as well.
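
As a rough illustration, several of the metrics above boil down to a few lines of arithmetic once the raw counters are available. The sketch below is only that, a sketch: StreamStatus, its fields, and the 0.9 CPU limit are hypothetical names and values chosen for the example, not Rock or RTT types, and it assumes that rejected samples are counted among the received ones.

```python
from dataclasses import dataclass


@dataclass
class StreamStatus:
    """Hypothetical snapshot of one stream aligner stream (not an actual Rock type)."""
    received: int              # samples pushed into the stream so far
    rejected: int              # samples dropped by the aligner (assumed to be a subset of received)
    oldest_sample_time: float  # timestamp of the oldest buffered sample, in seconds


def rejection_ratio(status: StreamStatus) -> float:
    """Fraction of received samples that were rejected (0.0 if nothing was received)."""
    if status.received == 0:
        return 0.0
    return status.rejected / status.received


def latency(status: StreamStatus, now: float) -> float:
    """Distance in time between `now` and the oldest sample in the stream."""
    return now - status.oldest_sample_time


def sample_frequency(count_before: int, count_after: int, elapsed: float) -> float:
    """Average write frequency in Hz, from two readings of an ever-growing counter."""
    if elapsed <= 0:
        raise ValueError("elapsed must be positive")
    return (count_after - count_before) / elapsed


def consistently_high(cpu_samples, limit=0.9):
    """True if every CPU usage sample (0.0 to 1.0) is at or above `limit`.

    The 0.9 default is an arbitrary value for this example.
    """
    return bool(cpu_samples) and all(sample >= limit for sample in cpu_samples)
```

For instance, two readings of an output port counter taken 5 seconds apart, 1000 and then 1500, give an average write frequency of 100 Hz.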

Among more complex things that can be done:

  • given the shape of the dataflow network, one can look at how latency propagates using the stream aligner and the sample frequencies, and pinpoint possible causes. This requires some analysis, but would be really cool. In addition, using the CPU data, it can propose hypotheses on whether the problem comes from CPU starvation or from a lack of samples (driver/filter problem).
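
One very simplified way to picture this analysis: treat the dataflow network as a graph, flag the components whose stream aligner latency is above some threshold, and keep only those with no late upstream component, since that is where the latency first appears. The sketch below assumes the graph, latencies, CPU usage and sample rates have already been collected into plain dictionaries. All function names, dictionary layouts and thresholds are hypothetical, and a real analysis would need to be considerably more careful.

```python
def latency_origins(upstream, latency, threshold):
    """Return the late components that have no late upstream component.

    upstream: component name -> list of names of components feeding it
    latency:  component name -> measured latency in seconds
    """
    late = {name for name, value in latency.items() if value > threshold}
    return sorted(
        name for name in late
        if not any(source in late for source in upstream.get(name, []))
    )


def hypothesis(name, cpu_usage, sample_rate, expected_rate, cpu_limit=0.9):
    """Propose a likely cause for the latency observed at component `name`.

    cpu_limit is an arbitrary value for this example.
    """
    if cpu_usage.get(name, 0.0) >= cpu_limit:
        return "CPU starvation"
    if sample_rate.get(name, 0.0) < expected_rate.get(name, 0.0):
        return "lack of samples (driver/filter problem)"
    return "unknown"


# Made-up network: camera -> filter -> slam <- imu
upstream = {"camera": [], "imu": [], "filter": ["camera"], "slam": ["filter", "imu"]}
latency = {"camera": 0.01, "imu": 0.005, "filter": 0.4, "slam": 0.5}

origins = latency_origins(upstream, latency, threshold=0.1)
print(origins)  # ['filter']: slam is late too, but only because filter is
```

Here both filter and slam are late, but only filter is reported, since the latency of slam can be explained by its late input. The hypothesis function would then look at the CPU usage and input sample rates of filter to suggest one of the two causes mentioned above.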