Profiling executors using the Autoware reference system
Introduction
This tutorial incorporates the open-source `autoware_reference_system` and can be used to fairly and repeatably test the performance of the various executors available within the greater ROS 2 community.
The example simulates a real-world scenario, Autoware.Auto and its LiDAR data pipeline, that can be used to evaluate the performance of the executor. To this end, the example comes with built-in performance measurements that make it easy to compare the performance of executor implementations in a repeatable way.
Quick Start
Some tools are provided in order to automate and standardize the report generation process for the `autoware_reference_system`.
First, install and build the dependencies:

```
python3 -m pip install psrecord bokeh  # optional dependency: networkx
cd workspace
colcon build --packages-up-to autoware_reference_system
```
The easiest way to run the benchmarks is through the `ctest` interface. Rebuild the package with the `RUN_BENCHMARK` option and run `colcon test`:

```
colcon build --packages-select autoware_reference_system \
  --cmake-force-configure --cmake-args -DRUN_BENCHMARK=ON
colcon test --packages-select autoware_reference_system
```
After the tests have run, reports can be found as `.html` files in `$ROS_HOME/benchmark_autoware_reference_system/<timestamp>` (`$ROS_HOME` defaults to `~/.ros`). The symlink `$ROS_HOME/benchmark_autoware_reference_system/latest` always points to the latest results. Detailed reports for individual test runs can be found in subdirectories of the form `<duration>/<middleware>/<executable>`.
More details on all the supported CMake arguments can be found in the Supported CMake Arguments section below.
By default, the tests use the default ROS 2 middleware configured for the system. To run the tests for all available RMWs, add the `-DALL_RMWS=ON` CMake argument to the `colcon build` step. The test duration can be configured through the `RUN_TIMES` variable in `CMakeLists.txt`.
A separate set of tests is created for each chosen run time.
Test Results and Reports
Reports are automatically generated depending on which tests are run. The main test directory (`$ROS_HOME/benchmark_autoware_reference_system/latest` by default) contains the summary reports, which aggregate metrics across all tested configurations. Below this main test directory, each tested configuration has a subdirectory of the form `<duration>/<middleware>/<executable name>`. This directory contains the raw trace data and additional per-test reports in `.html` format.
Tweaking the benchmark setup
To get more fine-grained control over the benchmarking process, invoke the benchmark script directly. To get a summary of the available options, call

```
python3 $(ros2 pkg prefix --share autoware_reference_system)/scripts/benchmark.py --help
```
As an example, to run all benchmarks matching `autoware_*` (which includes the `autoware_default_multithreaded` benchmark) for 15 seconds, run

```
python3 $(ros2 pkg prefix --share autoware_reference_system)/scripts/benchmark.py \
  15 'autoware_*'
```
The `--logdir` option can be used to store the measurement results and reports in a custom directory, without adding a timestamp. Note that this may overwrite existing measurement results in the same directory.
Key Performance Indicators (KPIs)
The performance measurement evaluates the executor using the following metrics. In general, a lower value for each KPI indicates better performance.
- CPU utilization
  - In general, lower CPU utilization is better, since it allows you to choose a smaller CPU or leaves headroom on a larger CPU for other functionality
- Memory utilization
  - In general, lower memory utilization is better, since it allows you to choose smaller memory or leaves more space for other things
- Number of dropped sensor samples in transform nodes
  - The nodes in the reference system always use the most recent sensor data (i.e., they use a history depth of 1)
  - This is a common strategy in real-world settings, as old sensor data is much less valuable than new sensor data
  - For example, an image from 30 seconds ago is much less helpful while driving down the road than an image from 0.1 seconds ago
  - Fusion nodes drop messages during normal operation if their inputs publish at different frequencies
  - In transform nodes, however, dropped messages indicate that the transform node cannot keep up with its input
- Number of front LiDAR samples that did not trigger an update in the Object Collision Estimator
  - The front and rear LiDARs have the same publishing frequency
  - This means the Object Collision Estimator should run once for every LiDAR sample
  - The benchmark counts the number of executions of the Object Collision Estimator and the Front LiDAR and reports any difference
- Worst-case latency between the Front LiDAR and the Object Collision Estimator
  - For worst-case latency, we want to identify obstacles in time (i.e., early enough that we can still emergency-brake)
- Average latency between the Front LiDAR and the Object Collision Estimator
  - For average latency, we want to identify obstacles as soon as possible so we can account for them in our planning
- The Behavior Planner should be as cyclical as possible
  - The desired behavior of the Behavior Planner is to be as cyclical as possible, meaning it should be executed as close to its configured period of 100 ms as possible
  - The benchmark measures the jitter and drift of the timer callback over time (see the sketch below)
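As an illustration of the last point, here is a minimal sketch of what a 100 ms timer callback with a jitter measurement could look like in rclcpp. The node and variable names are hypothetical, and the actual reference system may measure jitter differently (e.g., via tracing):

```cpp
// Hypothetical sketch: a 100 ms timer callback that records its own jitter,
// i.e. the deviation of each wake-up from the configured period. The real
// reference system may measure this differently.
#include <chrono>
#include "rclcpp/rclcpp.hpp"

using namespace std::chrono_literals;

class BehaviorPlannerSketch : public rclcpp::Node
{
public:
  BehaviorPlannerSketch() : Node("behavior_planner_sketch")
  {
    timer_ = create_wall_timer(100ms, [this] {
      const auto now = std::chrono::steady_clock::now();
      if (last_wakeup_.time_since_epoch().count() != 0) {
        // Jitter: how far this activation deviates from the 100 ms period.
        const auto jitter = (now - last_wakeup_) - 100ms;
        (void)jitter;  // aggregate into min/max/average for the KPI
      }
      last_wakeup_ = now;
      // ... planner processing would happen here ...
    });
  }

private:
  rclcpp::TimerBase::SharedPtr timer_;
  std::chrono::steady_clock::time_point last_wakeup_{};
};
```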
Message Types
A single message type is used for the entire reference system when generating results in order to simplify the setup as well as make it more repeatable and extensible.
This means only one message type from the list below is used during any given experimental run for every node in the reference system.
- Message4kB
  - reference message with a fixed size of 4 kilobytes (kB)
Other messages with different fixed sizes could be added here in the future.
When reporting results it is important to specify the message type used during the experiment, as the message size impacts the metrics.
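For illustration only, a fixed-size message of this kind can be thought of as a plain 4096-byte payload. The sketch below is a hypothetical stand-in, not the package's actual message definition, which is generated from its ROS 2 interface files:

```cpp
// Hypothetical stand-in for a fixed-size 4 kB message; the real Message4kB
// is generated from the package's ROS 2 interface definition.
#include <array>
#include <cstdint>

struct Message4kBSketch
{
  std::array<std::uint8_t, 4096> data{};  // 4 kB fixed-size payload
};

static_assert(sizeof(Message4kBSketch::data) == 4096, "payload is 4 kB");
```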
Autoware Reference System
Built from a handful of building-block node types, each of these nodes is meant to simulate a real-world node from the Autoware.Auto project's LiDAR data pipeline.
Under each node type are the requirements used for this specific reference system, `autoware_reference_system`. Future reference systems could have slightly different requirements and still use the same building-block node types.
For simplicity, every node except for the command nodes only publishes one topic, and this topic has the same name as the node that publishes it. However, each topic can be subscribed to by multiple different nodes.
Also for simplicity, every node that does processing (aka number crunching) is by default configured to do that processing for the same amount of time: roughly 10 milliseconds. This processing time varies drastically between platforms, since each node performs a fixed amount of actual work rather than working for a fixed amount of time. See the Configure Processing Time section for more details.
- Message Type
  - all nodes use the same message type during any single test run
  - default message type:
    - Message4kB
  - to be implemented:
    - Message64kB
    - Message256kB
    - Message512kB
    - Message1024kB
    - Message5120kB
- Sensor Nodes
  - all sensor nodes have a publishing rate (cycle time) of 100 milliseconds
  - all sensor nodes publish the same message type
  - total of 5 sensor nodes:
    - Front LiDAR Driver
    - Rear LiDAR Driver
    - Point Cloud Map
    - Visualizer
    - Lanelet2Map
- Transform Nodes
  - all transform nodes have one subscriber and one publisher
  - all transform nodes start processing after a message is received
  - each transform node publishes a message after processing is complete
  - total of 9 transform nodes:
    - Front Points Transformer
    - Rear Points Transformer
    - Voxel Grid Downsampler
    - Point Cloud Map Loader
    - Ray Ground Filter
    - Object Collision Estimator
    - MPC Controller
    - Parking Planner
    - Lane Planner
- Fusion Nodes
  - all fusion nodes have two subscribers and one publisher for this reference system
  - all fusion nodes start processing after a message has been received from all subscriptions
  - all fusion nodes have a maximum input time difference of 9999 seconds between the first and last input received before publishing (i.e., effectively no limit)
  - each fusion node publishes a message after processing is complete (see the sketch after this list)
  - total of 5 fusion nodes:
    - Point Cloud Fusion
    - NDT Localizer
    - Vehicle Interface
    - Lanelet2 Global Planner
    - Lanelet2 Map Loader
- Cyclic Nodes
  - for this reference system there is only 1 cyclic node, the Behavior Planner
  - this cyclic node has 6 subscribers and one publisher
  - this cyclic node starts processing after a message is received from any single subscription
  - it publishes a message after processing is complete
- Command Nodes
  - all command nodes have 1 subscriber and zero publishers
  - all command nodes print out the final latency statistics after a message is received on the specified topic
  - total of 2 command nodes:
    - VehicleDBWSystem
    - IntersectionOutput
- Intersection Nodes
  - for this reference system there is only one intersection node, the EuclideanClusterDetector
  - this intersection node has 2 subscribers and 2 publishers
  - it publishes a message on the corresponding publisher after processing is complete
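To make the fusion-node pattern above concrete, here is a minimal hypothetical sketch in rclcpp: two depth-1 subscriptions, and processing plus publishing only once both inputs have arrived. The topic and node names are illustrative and the message type is a stand-in; the actual implementation in the package may differ.

```cpp
// Hypothetical sketch of a fusion node: two subscriptions with history
// depth 1; processing starts only after both inputs have been received.
#include "rclcpp/rclcpp.hpp"
#include "std_msgs/msg/string.hpp"  // stand-in for the benchmark message type

class FusionNodeSketch : public rclcpp::Node
{
public:
  FusionNodeSketch() : Node("point_cloud_fusion_sketch")
  {
    pub_ = create_publisher<std_msgs::msg::String>("fused_points", 1);
    front_sub_ = create_subscription<std_msgs::msg::String>(
      "front_points", 1,
      [this](std_msgs::msg::String::ConstSharedPtr msg) {front_ = msg; try_fuse();});
    rear_sub_ = create_subscription<std_msgs::msg::String>(
      "rear_points", 1,
      [this](std_msgs::msg::String::ConstSharedPtr msg) {rear_ = msg; try_fuse();});
  }

private:
  void try_fuse()
  {
    if (!front_ || !rear_) {
      return;  // wait until a message from every subscription has arrived
    }
    // ... roughly 10 ms of number crunching would happen here ...
    pub_->publish(std_msgs::msg::String());
    front_.reset();
    rear_.reset();  // consume the fused input pair
  }

  rclcpp::Publisher<std_msgs::msg::String>::SharedPtr pub_;
  rclcpp::Subscription<std_msgs::msg::String>::SharedPtr front_sub_, rear_sub_;
  std_msgs::msg::String::ConstSharedPtr front_, rear_;
};
```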
Configure Processing Time
Many nodes in the reference system are actually performing some pseudo-work by finding prime numbers up until some maximum value. Depending on the platform, this maximum value will need to be changed so that these nodes do not take an absurd amount of time. This maximum value should be chosen on a platform-by-platform basis so that the total run time of this work takes some desired length of time.
In order to make finding this maximum value a bit easier across many different platforms, a simple `number_cruncher_benchmark` is provided that will loop over various maximum values and print how long each one takes to run. After running this executable on your platform you should have a good idea what maximum value to use in your timing configuration so that each node does some measurable work for some desired amount of time.
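For intuition, the pseudo-work could look like the following sketch (a simple trial-division prime count; the actual number cruncher in the package may be implemented differently):

```cpp
// Hypothetical sketch of the prime-finding pseudo-work: the work grows with
// maximum_number, so run time depends on the platform's single-core speed.
#include <cstdint>

int64_t number_cruncher_sketch(int64_t maximum_number)
{
  int64_t primes_found = 0;
  for (int64_t candidate = 2; candidate < maximum_number; ++candidate) {
    bool is_prime = true;
    for (int64_t divisor = 2; divisor * divisor <= candidate; ++divisor) {
      if (candidate % divisor == 0) {
        is_prime = false;
        break;
      }
    }
    if (is_prime) {
      ++primes_found;
    }
  }
  return primes_found;  // returned so the compiler cannot optimize the loop away
}
```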
Here is an example output of the `number_cruncher_benchmark` run on a typical development platform (Intel i7):

```
ros2 run autoware_reference_system number_cruncher_benchmark
maximum_number   run time
64               0.001609ms
128              0.002896ms
256              0.006614ms
512              0.035036ms
1024             0.050957ms
2048             0.092732ms
4096             0.22837ms
8192             0.566779ms
16384            1.48837ms
32768            3.64588ms
65536            9.6687ms
131072           24.1154ms
262144           62.3475ms
524288           162.762ms
1048576          429.882ms
2097152          1149.79ms
```
Run the above command on your system, select your desired run time, and place the corresponding `maximum_number` in the timing configuration file for the desired nodes.
Supported CMake Arguments
`RUN_BENCHMARK`
- Tell CMake to build the benchmark tests that will check the reference system against its requirements before running a sweep of tests to generate trace files and reports
- Without the `RUN_BENCHMARK` variable set to `ON`, only the standard linter tests will be run
`TEST_PLATFORM`
- Tell CMake to build the tests that check whether the tests are being run on a supported platform or not
- This flag can be omitted if you would like to run the tests on a development system before running them on a supported platform
- The platform tests themselves can and should be improved going forward and are only some simple checks today (architecture, number of CPUs, PREEMPT_RT flag, etc.)
- Set this to `ON` to check if the current platform is supported
`SKIP_TRACING`
- Set this to `ON` to skip the `ros2_tracing` tests, aka the `callback` tests
- This can greatly reduce the length of time the `colcon test` command takes to run
`ALL_RMWS`
- Set this to `ON` if you'd like to run tests on all available RMWs as well
- Otherwise only the default RMW is used (the first one listed by the CMake function `get_available_rmw_implementations`)
- Defaults to `OFF`
Generating Node Graph Image
To generate the image shown above, you can take advantage of a program called `graphviz`, which provides the command-line interface (CLI) command `dot`.
First, check out the provided `.dot` file to get an idea of how the `dot` syntax works (feel free to modify it for your use case or future reference systems).
To render the `.dot` file into an `.svg` image, run the following command:

```
dot -Tsvg autoware_reference_system.dot -o autoware_reference_system.svg
```

Note: you can change the generated image type to any of the supported output formats if you would like a different filetype.
Available benchmarks
The package comes with a set of benchmark executables. Many of these benchmarks distinguish hotpath nodes from other nodes; these are the nodes involved in the latency KPI, starting with the front/rear LiDAR and ending in the ObjectCollisionEstimator. Within the hotpath nodes, the benchmarks further distinguish between front/rear LiDAR nodes (the LidarDriver and PointsTransformer in front/rear, respectively) and the fusion chain nodes (everything from PointCloudFusion to ObjectCollisionEstimator).
Some benchmarks use real-time priorities and affinities. In this case, hotpath nodes run at priority 1 on cores 1-3; the planner node runs at priority 30 on core 0; and all other nodes run without real-time priority on all cores.
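As a rough illustration of how such a setup can be applied on Linux (the benchmarks' actual mechanism may differ), a thread can be pinned and given a real-time priority like this:

```cpp
// Hypothetical sketch: pin the current thread to cores 1-3 and give it
// SCHED_FIFO priority 1, mirroring the hotpath setup described above.
// Requires appropriate privileges (e.g., CAP_SYS_NICE) on Linux.
#include <pthread.h>
#include <sched.h>

bool apply_hotpath_scheduling()
{
  cpu_set_t cpus;
  CPU_ZERO(&cpus);
  CPU_SET(1, &cpus);
  CPU_SET(2, &cpus);
  CPU_SET(3, &cpus);
  if (pthread_setaffinity_np(pthread_self(), sizeof(cpus), &cpus) != 0) {
    return false;
  }
  sched_param param{};
  param.sched_priority = 1;  // hotpath priority from the description above
  return pthread_setschedparam(pthread_self(), SCHED_FIFO, &param) == 0;
}
```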
ROS 2 benchmarks
- autoware_default_singlethreaded: All nodes are assigned to the same single-threaded ROS executor
- autoware_default_multithreaded: All nodes are assigned to the same multi-threaded ROS executor
- autoware_default_staticsinglethreaded: Like autoware_default_singlethreaded, but using the StaticSingleThreadedExecutor.
- autoware_default_prioritized: Separate executors for front LiDAR nodes, rear LiDAR nodes, fusion chain nodes, the behavior planner, and everything else. Uses real-time priorities.
- autoware_default_cbg: Like autoware_default_prioritized, but uses the callback-group executor to remove the non-hotpath subscription in EuclideanClusterDetector from the executor for fusion chain nodes.
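For reference, the callback-group mechanism used by autoware_default_cbg can be sketched as follows: the hotpath callbacks are placed in their own callback group, and only that group is added to the fusion-chain executor. The function and variable names below are hypothetical:

```cpp
// Hypothetical sketch: keep a node's non-hotpath subscription out of the
// fusion-chain executor by giving the hotpath callbacks their own group
// and adding only that group to the executor.
#include "rclcpp/rclcpp.hpp"

void add_hotpath_only(
  const rclcpp::Node::SharedPtr & node,
  rclcpp::executors::SingleThreadedExecutor & fusion_executor)
{
  // Do not add this group automatically when the whole node is added
  // to some other executor.
  auto hotpath_group = node->create_callback_group(
    rclcpp::CallbackGroupType::MutuallyExclusive,
    /*automatically_add_to_executor_with_node=*/false);

  // ... create the hotpath subscriptions with hotpath_group ...

  // Only the hotpath callbacks run in the fusion-chain executor; the
  // node's remaining callbacks stay with whichever executor owns the node.
  fusion_executor.add_callback_group(
    hotpath_group, node->get_node_base_interface());
}
```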