RCCL environment variables
This section describes the most important RCCL environment variables, which are grouped by functionality.
Configuration and setup
The configuration and setup environment variables for RCCL are collected in the following table.
| Environment variable | Values |
|---|---|
| `NCCL_CONF_FILE`<br>Specifies the path to the RCCL configuration file. | String path to the configuration file.<br>Default: `~/.rccl.conf` or `/etc/rccl.conf` |
| `NCCL_HOSTID`<br>Sets the host identifier for multi-node communication. | String value for host identification.<br>Used for host hash generation. |
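
As a rough illustration of how `NCCL_CONF_FILE` might be used, the sketch below writes a small configuration file and points the variable at it. The one-entry-per-line `VARIABLE=value` layout is an assumption carried over from upstream NCCL's configuration file, and the `write_conf` helper and the chosen settings are hypothetical.

```python
import os
import tempfile

# Assumption: the file holds one VARIABLE=value entry per line, as in
# upstream NCCL's configuration file. The settings are illustrative only.
settings = {"NCCL_DEBUG": "WARN", "NCCL_DEBUG_SUBSYS": "INIT,ENV"}


def write_conf(path, values):
    """Write one VARIABLE=value line per setting (hypothetical helper)."""
    with open(path, "w") as handle:
        for name, value in values.items():
            handle.write(f"{name}={value}\n")


# Use a temporary directory so the sketch does not touch ~/.rccl.conf.
conf_path = os.path.join(tempfile.mkdtemp(), "rccl.conf")
write_conf(conf_path, settings)

# Export the path so that processes started from here inherit it.
os.environ["NCCL_CONF_FILE"] = conf_path
```

Environment variables are normally read when the library initializes, so the variable should be exported before the application that uses RCCL starts.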
Logging and debugging
The logging and debugging environment variables for RCCL are collected in the following table.
| Environment variable | Values |
|---|---|
| `NCCL_DEBUG`<br>Controls debug logging in RCCL for troubleshooting and monitoring collective communication operations. | Sets the RCCL logging level. Each logging level includes all logging for the levels below it. The default logging level is `ERROR`.<br>`NONE`: No logging is printed.<br>`ERROR`: Reports that a fatal condition has occurred in RCCL and the operation cannot continue.<br>`VERSION`: Prints librccl version information during the initialization phase.<br>`WARN`: Prints warnings about unusual conditions that could lead to unexpected results.<br>`INFO`: Prints standard logging messages about status and operations performed.<br>`ABORT`: Unused.<br>`TRACE`: Prints trace-level logging of function calls and parameters. Only active when librccl is built using `ENABLE_TRACE`. |
| `NCCL_DEBUG_SUBSYS`<br>Controls which subsystems generate debug output. | Sets the logging subsystems as a comma-separated list, which can be inverted using the `^` prefix. The default subsystem set is `INIT`, `BOOTSTRAP`, and `ENV`.<br>`INIT`: Prints during the initialization phase.<br>`COLL`: Prints during execution of collectives.<br>`P2P`: Prints logs related to peer-to-peer setup or communication.<br>`SHM`: Prints logs related to shared memory.<br>`NET`: Prints logs related to network setup or communication.<br>`GRAPH`: Prints logs related to parsing the topology of the network.<br>`TUNING`: Prints logs related to the tuner plugin.<br>`ENV`: Prints logs related to environment variables.<br>`ALLOC`: Prints logs related to memory allocation.<br>`CALL`: Prints logs for function calls (`TRACE` only).<br>`PROXY`: Prints logs related to the proxy thread.<br>`NVLS`: Not valid for AMD/RCCL.<br>`BOOTSTRAP`: Prints logs related to the bootstrapping phase of initialization.<br>`REG`: Prints logs related to registration and deregistration of transport initialization.<br>`PROFILE`: Prints logs related to profiling and timing information.<br>`RAS`: Prints logs related to RAS.<br>`VERBS`: Prints logs related to IB/Verbs.<br>`ALL`: Activates all logging subsystems. |
| `NCCL_WARN_ENABLE_DEBUG_INFO`<br>Converts all `WARN` level logs to `INFO` level logs. | `0`: Disabled (default).<br>`1`: Enabled. |
| `NCCL_DEBUG_TIMESTAMP_LEVELS`<br>Selects which `NCCL_DEBUG` levels have a timestamp prepended. | A comma-separated list of `NCCL_DEBUG` levels, which can be inverted using the `^` prefix. The default set is `WARN`. |
| `NCCL_DEBUG_TIMESTAMP_FORMAT`<br>Sets the timestamp format for `NCCL_DEBUG`. | A format string using `strftime`-style specifiers. The default format is `"[%F %T] "`. |
| `NCCL_DEBUG_FILE`<br>Writes logs to a file rather than `stdout`. | The filename can be formatted using `%h` for the hostname, `%p` for the process ID, and `%%` to escape the `%` character. Using `%p` is recommended so that each process writes to its own file, which avoids mixed or overwritten output.<br>Example usage: `NCCL_DEBUG_FILE=debugfile.%h.%p` |
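
The logging variables are often set together, so a small helper can make a launch script less error-prone. The following is a minimal sketch, not part of RCCL: `build_debug_env` is a hypothetical function that checks the level and subsystem names against the lists in the table above and applies the `^` prefix for an inverted subsystem list.

```python
import os

# Level and subsystem names copied from the table above.
LEVELS = ("NONE", "ERROR", "VERSION", "WARN", "INFO", "ABORT", "TRACE")
SUBSYSTEMS = {
    "INIT", "COLL", "P2P", "SHM", "NET", "GRAPH", "TUNING", "ENV", "ALLOC",
    "CALL", "PROXY", "NVLS", "BOOTSTRAP", "REG", "PROFILE", "RAS", "VERBS",
    "ALL",
}


def build_debug_env(level, subsystems=(), invert=False, log_file=None):
    """Return a dict of NCCL_DEBUG* settings (hypothetical helper)."""
    if level not in LEVELS:
        raise ValueError(f"unknown logging level: {level}")
    unknown = [name for name in subsystems if name not in SUBSYSTEMS]
    if unknown:
        raise ValueError(f"unknown subsystems: {unknown}")

    env = {"NCCL_DEBUG": level}
    if subsystems:
        # A leading ^ inverts the comma-separated subsystem list.
        prefix = "^" if invert else ""
        env["NCCL_DEBUG_SUBSYS"] = prefix + ",".join(subsystems)
    if log_file is not None:
        env["NCCL_DEBUG_FILE"] = log_file
    return env


# INFO logging for two subsystems, one log file per host and process.
debug_env = build_debug_env("INFO", ["INIT", "NET"], log_file="debugfile.%h.%p")
os.environ.update(debug_env)
```

The helper only assembles strings. RCCL itself expands `%h` and `%p` in the filename when it opens the log file.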
Algorithm and protocol control
The algorithm and protocol control environment variables for RCCL are collected in the following table.
| Environment variable | Values |
|---|---|
| `NCCL_ALGO`<br>Forces specific algorithm selection for collectives. | Algorithm name string.<br>Used to override automatic algorithm selection. |
| `NCCL_PROTO`<br>Forces specific protocol selection for communication. | Protocol name string.<br>Used to override automatic protocol selection. |
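
One way to apply these overrides to a single run, without changing the parent shell, is to pass a modified environment to the child process. The sketch below is illustrative: the values `Ring` and `Simple` are names used by upstream NCCL and are assumed here to be accepted by RCCL, and the child process is a stand-in that only echoes the variables it receives.

```python
import os
import subprocess
import sys

# Assumption: "Ring" and "Simple" are names taken from upstream NCCL.
# Check which algorithm and protocol names your RCCL build accepts.
overrides = {"NCCL_ALGO": "Ring", "NCCL_PROTO": "Simple"}

# Copy the current environment so the overrides reach only the child process.
child_env = dict(os.environ, **overrides)

# Stand-in for a real RCCL application: a child that echoes what it sees.
child_code = "import os; print(os.environ['NCCL_ALGO'], os.environ['NCCL_PROTO'])"
result = subprocess.run(
    [sys.executable, "-c", child_code],
    env=child_env,
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout.strip())
```

In a real run, the command list would name the RCCL application or test executable instead of the echo script.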
Network and topology
The network and topology environment variables for RCCL are collected in the following table.
| Environment variable | Values |
|---|---|
| `NCCL_IB_HCA`<br>Specifies which InfiniBand `device:port` to use. | Device specification string.<br>Prefix with `^` for exclusion or `=` for an exact match. |
| `NCCL_IB_GID_INDEX`<br>Defines the Global ID index used in RoCE mode. | Integer value (default: `-1`).<br>See the InfiniBand `show_gids` command for valid values. |
| `NCCL_SOCKET_IFNAME`<br>Specifies which IP interfaces to use for communication. | Interface prefix string or list.<br>Multiple prefixes are separated by `,`.<br>Prefix with `^` for exclusion or `=` for an exact match.<br>Example: `eth` (all eth interfaces), `=eth0` (exact match) |
| `NCCL_SOCKET_FAMILY`<br>Forces IPv4/IPv6 interface selection. | `AF_INET`: Force IPv4.<br>`AF_INET6`: Force IPv6.<br>Unset: Use the first available. |
| `NCCL_NET_MERGE_LEVEL`<br>Controls network device merging behavior. | Integer value specifying the merge level.<br>Default: `PATH_PORT` |
| `NCCL_NET_FORCE_MERGE`<br>Forces merging of network devices. | String specifying the forced merge configuration. |
| `NCCL_RINGS`<br>Defines custom ring topology. | Ring topology specification string.<br>Overrides automatic topology detection. |
| `RCCL_TREES`<br>Defines custom tree topology. | Tree topology specification string.<br>Alternative to ring topology. |
| `NCCL_RINGS_REMAP`<br>Controls ring remapping for specific topologies. | Remapping specification string.<br>Used with Rome 4P2H topology. |
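
The prefix, exclusion, and exact-match rules for `NCCL_SOCKET_IFNAME` can be easier to follow as code. The sketch below models the rules as described in the table. It is not RCCL's parser: `select_interfaces` is a hypothetical function, and it assumes that a leading `^` or `=` applies to the whole comma-separated list.

```python
def select_interfaces(spec, interfaces):
    """Model the documented filter: prefixes, ^ to exclude, = for exact match.

    Assumption: a leading ^ or = applies to the whole comma-separated list.
    """
    exclude = spec.startswith("^")
    if exclude:
        spec = spec[1:]
    exact = spec.startswith("=")
    if exact:
        spec = spec[1:]
    patterns = [part for part in spec.split(",") if part]

    def matches(name):
        if exact:
            return name in patterns
        return any(name.startswith(part) for part in patterns)

    # Keep matching names, or non-matching names when the list is inverted.
    return [name for name in interfaces if matches(name) != exclude]


# Interface names here are illustrative.
available = ["lo", "eth0", "eth1", "ib0"]
print(select_interfaces("eth", available))    # prefix match
print(select_interfaces("=eth0", available))  # exact match
print(select_interfaces("^lo", available))    # exclusion
```

Running a candidate value through a model like this before a multi-node launch can catch a filter that would match no interface at all.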
Development and testing (advanced)
The development and testing environment variables for RCCL are collected in the following table. These variables are primarily intended for debugging and development purposes.
| Environment variable | Values |
|---|---|
| `CUDA_LAUNCH_BLOCKING`<br>Controls CUDA kernel launch blocking behavior. | `0`: Non-blocking launches.<br>`1` or non-zero: Blocking launches. |
| `NCCL_COMM_ID`<br>Enables multi-process mode in test applications. | Any non-empty value enables multi-process mode.<br>Used with test executables for distributed testing. |