Changes between Version 1 and Version 2 of WikiStart/OngoingWork/RockDiagnostic

12/01/14 22:29:25



- = Diagnostic System for Rock =
+ = Diagnosing a Rock system =
- == Situation ==
- At the moment each deployment writes all debug messages into its own log file. If a component refuses to start, someone has to open the specific log file to get a (hopefully) proper error message.
+ Rock generates a lot of data related to diagnosing problems, but this data is used very little, and advertised even less.

- It would be better to have a tool that could automatically find and open the right log file for each deployment when something goes wrong.
+ This page first lists all the information sources (present either in master or on WIP branches) and outlines what could be done with them. In addition, I'll list some things that could be added to the system to improve it even further.
- == Ideas ==
+  - stream aligner status
+  - transformer status
+  - log status
+  - lttng logging (on the 'lttng' branch of rtt)
+  - process resource usage (drivers-orogen-taskmon)
+  - wifi link status (drivers-orogen-wifimon)
+ Simple metrics could already help tremendously:
+  - ratio of received vs. rejected samples in the stream aligner
+  - the inability of the transformer to find a transform chain even though it receives data (most of the time this means a bad transformer configuration)
+  - high latency in the stream aligner (latency being defined as the time between "now" and the oldest sample in the streams)
+  - consistently high CPU usage (a problem on a real-time processing system)
+  - monitoring of starvation for high-availability components (using lttng)
+  - monitoring of sample frequency on RTT output ports. This would need a simple addition to RTT: an ever-growing sample counter on each output port, incremented at each call to write. It would need to be added to the CORBA API as well.
+ Among the more complex things that could be done:
+  - given the shape of the dataflow network, one can look at how latency propagates, using the stream aligner and the sample frequencies, and pinpoint possible causes. This requires some analysis, but would be really cool. In addition, using the CPU data, it could propose hypotheses on whether the problem comes from CPU starvation or a lack of samples (a driver/filter problem).
     25 - given the shape of the dataflow network, one can look at how latency propagates using the stream aligner and the sample frequencies, and pinpoint possible causes. This requires some analysis, but would be really cool. In addition, using the CPU data, it can propose hypothesis on whether the problem comes from starvation CPU-wise or lack of samples (driver /filter problem).