RCCL environment variables

This section describes the most important RCCL environment variables, grouped by functionality.

Configuration and setup

The configuration and setup environment variables for RCCL are collected in the following table.

Environment variable

Values

NCCL_CONF_FILE
Specifies the path to the RCCL configuration file.
String path to configuration file
Default: ~/.rccl.conf or /etc/rccl.conf
NCCL_HOSTID
Sets the host identifier for multi-node communication.
String value for host identification
Used for host hash generation
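
As an illustration of how these two variables might be prepared for a launched job, here is a minimal Python sketch. It assumes the configuration file holds one NAME=value pair per line, as upstream NCCL configuration files do; this page does not specify the file format. The helper names (write_rccl_conf, env_with_conf) and the host identifier node-a are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def write_rccl_conf(path, settings):
    """Write settings as NAME=value lines (assumed format, not confirmed here)."""
    lines = [f"{name}={value}" for name, value in settings.items()]
    Path(path).write_text("\n".join(lines) + "\n")
    return path


def env_with_conf(conf_path, host_id=None, base_env=None):
    """Return a copy of the environment that points RCCL at conf_path."""
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_CONF_FILE"] = str(conf_path)
    if host_id is not None:
        # Host identifier string; per the table it feeds host hash generation.
        env["NCCL_HOSTID"] = str(host_id)
    return env


# Use a throwaway directory so the example does not touch ~/.rccl.conf.
conf_path = Path(tempfile.mkdtemp()) / "rccl.conf"
write_rccl_conf(conf_path, {"NCCL_DEBUG": "WARN"})
child_env = env_with_conf(conf_path, host_id="node-a", base_env={})
```

The resulting child_env dictionary can be passed to whatever launcher starts the RCCL application, for example through the env argument of subprocess.run.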

Logging and debugging

The logging and debugging environment variables for RCCL are collected in the following table.

Environment variable

Values

NCCL_DEBUG
Controls debug logging in RCCL for troubleshooting and monitoring collective communication operations.
NCCL_DEBUG selects one of the following logging levels. Each level also includes the output of every less verbose level listed before it. The default logging level is ERROR.

NONE: No logging is printed.
ERROR: These messages report when a fatal condition has occurred in RCCL and the operation can’t continue.
VERSION: librccl version info is printed during the initialization phase.
WARN: Prints warnings about unusual conditions that could lead to unexpected results.
INFO: Prints standard logging messages about status and operations performed.
ABORT: Unused.
TRACE: Prints trace-level logging of function calls and parameters. Only active when librccl is built using ENABLE_TRACE.
NCCL_DEBUG_SUBSYS
Controls which subsystems generate debug output.
NCCL_DEBUG_SUBSYS accepts a comma-separated list of the following logging subsystems. The list can be inverted using the ^ prefix. The default subsystem set is INIT, BOOTSTRAP, and ENV.

INIT: Prints during the initialization phase.
COLL: Prints during execution of collectives.
P2P: Prints logs related to peer-to-peer setup or communication.
SHM: Prints logs related to shared memory.
NET: Prints logs related to network setup or communication.
GRAPH: Prints logs related to parsing the topology of the network.
TUNING: Prints logs related to the tuner plugin.
ENV: Prints logs related to environment variables.
ALLOC: Prints logs related to memory allocation.
CALL: Prints logs for function calls (TRACE only).
PROXY: Prints logs related to the proxy thread.
NVLS: Not valid for AMD/RCCL.
BOOTSTRAP: Prints logs related to the bootstrapping phase of initialization.
REG: Prints logs related to registration and deregistration of transport initialization.
PROFILE: Prints logs related to the profiling/timing info.
RAS: Prints logs related to RAS.
VERBS: Prints logs related to IB/Verbs.
ALL: Activates all logging subsystems.
NCCL_WARN_ENABLE_DEBUG_INFO
Converts all WARN level logs to INFO level logs.
0: Default value. Variable is not enabled.
1: Enable the variable.
NCCL_DEBUG_TIMESTAMP_LEVELS
The timestamp levels for NCCL_DEBUG.
A comma-separated list of NCCL_DEBUG levels whose messages get a timestamp prepended. The list can be inverted using the ^ prefix. The default set is WARN.
NCCL_DEBUG_TIMESTAMP_FORMAT
The timestamp format for NCCL_DEBUG.
Sets the format of the timestamp in strftime style. The default format is "[%F %T] ".
NCCL_DEBUG_FILE
Writes logs to a file rather than stdout.
The filename can be formatted using %h for hostname, %p for pid, and %% to escape the % character. It is recommended to use %p to output to individual files per pid to avoid mixing or potentially overwriting the output. Example usage: NCCL_DEBUG_FILE=debugfile.%h.%p
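
The logging variables above can be combined as in the following Python sketch. The helper expand_debug_file only mimics the documented %h, %p, and %% substitutions so that the resulting file name can be predicted; it is not RCCL's own implementation, and how RCCL treats other % sequences is not covered here. Both helper names are hypothetical.

```python
def subsys_value(names, invert=False):
    """Build a NCCL_DEBUG_SUBSYS value; invert=True adds the ^ prefix."""
    value = ",".join(names)
    return "^" + value if invert else value


def expand_debug_file(pattern, hostname, pid):
    """Mimic the documented %h / %p / %% expansion (illustration only)."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "%" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            if nxt == "h":
                out.append(hostname)
                i += 2
                continue
            if nxt == "p":
                out.append(str(pid))
                i += 2
                continue
            if nxt == "%":
                out.append("%")
                i += 2
                continue
        # Anything else is copied unchanged (an assumption of this sketch).
        out.append(ch)
        i += 1
    return "".join(out)


# INFO-level logging for two subsystems, one log file per host and process.
logging_env = {
    "NCCL_DEBUG": "INFO",
    "NCCL_DEBUG_SUBSYS": subsys_value(["INIT", "COLL"]),
    "NCCL_DEBUG_FILE": "debugfile.%h.%p",
}
```

Merging logging_env into the environment of the launched process applies the settings; expand_debug_file(logging_env["NCCL_DEBUG_FILE"], hostname, pid) then gives the file name to look for when collecting logs.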

Algorithm and protocol control

The algorithm and protocol control environment variables for RCCL are collected in the following table.

Environment variable

Values

NCCL_ALGO
Forces specific algorithm selection for collectives.
Algorithm name string
Used to override automatic algorithm selection
NCCL_PROTO
Forces specific protocol selection for communication.
Protocol name string
Used to override automatic protocol selection
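
When comparing a forced algorithm or protocol against automatic selection, it can help to scope the override so it does not leak into later runs. The Python sketch below does this with a context manager. The values Ring and Simple are taken from upstream NCCL and are assumptions here, since this page does not list the accepted names; forced_selection is a hypothetical helper. The override presumably has to be in place before RCCL initializes in the process that uses it.

```python
import os
from contextlib import contextmanager


@contextmanager
def forced_selection(algo=None, proto=None):
    """Temporarily set NCCL_ALGO / NCCL_PROTO, then restore the old values."""
    overrides = {"NCCL_ALGO": algo, "NCCL_PROTO": proto}
    saved = {name: os.environ.get(name) for name in overrides}
    try:
        for name, value in overrides.items():
            if value is not None:
                os.environ[name] = value
        yield
    finally:
        # Put back whatever was there before, or remove the variable again.
        for name, value in saved.items():
            if value is None:
                os.environ.pop(name, None)
            else:
                os.environ[name] = value


with forced_selection(algo="Ring", proto="Simple"):
    # Launch the benchmark or application here; it inherits the overrides.
    inside = (os.environ.get("NCCL_ALGO"), os.environ.get("NCCL_PROTO"))
```

Outside the with block both variables return to their previous state, so the next run uses automatic selection again unless the variables were already set.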

Network and topology

The network and topology environment variables for RCCL are collected in the following table.

Environment variable

Values

NCCL_IB_HCA
Specifies InfiniBand device:port to use.
Device specification string
Prefix with ^ for exclusion, = for exact match
NCCL_IB_GID_INDEX
Defines the Global ID index used in RoCE mode.
Integer value (default: -1)
See the output of the InfiniBand show_gids command for valid values
NCCL_SOCKET_IFNAME
Specifies which IP interfaces to use for communication.
Interface prefix string or list
Multiple prefixes separated by ,
Prefix with ^ for exclusion, = for exact match
Example: eth (all eth interfaces), =eth0 (exact match)
NCCL_SOCKET_FAMILY
Forces IPv4/IPv6 interface selection.
AF_INET: Force IPv4
AF_INET6: Force IPv6
Unset: Use first available
NCCL_NET_MERGE_LEVEL
Controls network device merging behavior.
Integer value specifying merge level
Default: PATH_PORT
NCCL_NET_FORCE_MERGE
Forces merging of network devices.
String specifying forced merge configuration
NCCL_RINGS
Defines custom ring topology.
Ring topology specification string
Overrides automatic topology detection
RCCL_TREES
Defines custom tree topology.
Tree topology specification string
Alternative to ring topology
NCCL_RINGS_REMAP
Controls ring remapping for specific topologies.
Remapping specification string
Used with Rome 4P2H topology
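
The prefix, exclusion, and exact-match rules described for NCCL_SOCKET_IFNAME can be illustrated with a small Python sketch. It models only the rules stated in the table and assumes that a leading ^ or = applies to the whole comma-separated list; it is not RCCL's actual parser, and select_interfaces and the interface names are hypothetical.

```python
def select_interfaces(spec, interfaces):
    """Return the interface names that a NCCL_SOCKET_IFNAME-style spec selects."""
    exclude = spec.startswith("^")
    if exclude:
        spec = spec[1:]
    exact = spec.startswith("=")
    if exact:
        spec = spec[1:]
    patterns = [p for p in spec.split(",") if p]

    def matches(name):
        if exact:
            return name in patterns
        # Without "=", each entry is treated as a name prefix.
        return any(name.startswith(p) for p in patterns)

    # With "^", keep the interfaces that do NOT match.
    return [name for name in interfaces if matches(name) != exclude]


names = ["lo", "eth0", "eth1", "eth10", "docker0"]
by_prefix = select_interfaces("eth", names)
exact_only = select_interfaces("=eth0", names)
excluded = select_interfaces("^lo,docker", names)
```

Here by_prefix contains every interface starting with eth, exact_only contains only eth0, and excluded contains everything except lo and docker0.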

Development and testing (advanced)

The development and testing environment variables for RCCL are collected in the following table. These variables are primarily intended for debugging and development purposes.

Environment variable

Values

CUDA_LAUNCH_BLOCKING
Controls CUDA kernel launch blocking behavior.
0: Non-blocking launches
1 or non-zero: Blocking launches
NCCL_COMM_ID
Enables multi-process mode in test applications.
Any non-empty value enables multi-process mode
Used with test executables for distributed testing
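
A minimal Python sketch of passing these debugging variables to a child process follows. The child here is just a Python one-liner that echoes what it received, standing in for a real test executable, whose name and arguments are not given on this page. The helper debug_launch_env and the value placeholder are hypothetical; the table only states that any non-empty NCCL_COMM_ID enables multi-process mode.

```python
import os
import subprocess
import sys


def debug_launch_env(launch_blocking=False, comm_id=None):
    """Copy the current environment and add the debugging variables."""
    env = dict(os.environ)
    env["CUDA_LAUNCH_BLOCKING"] = "1" if launch_blocking else "0"
    if comm_id:
        env["NCCL_COMM_ID"] = comm_id
    return env


env = debug_launch_env(launch_blocking=True, comm_id="placeholder")

# Stand-in for a test executable: print the two variables as the child sees them.
probe = (
    "import os; "
    "print(os.environ['CUDA_LAUNCH_BLOCKING'], os.environ['NCCL_COMM_ID'])"
)
result = subprocess.run(
    [sys.executable, "-c", probe],
    env=env,
    capture_output=True,
    text=True,
    check=True,
)
seen_by_child = result.stdout.split()
```

Replacing the probe command with the path of an actual test binary gives a launch that runs with blocking kernel launches and multi-process mode enabled.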