NCP-AII NVIDIA AI Infrastructure Free Practice Exam Questions (2026 Updated)

Prepare for the NVIDIA NCP-AII (NVIDIA AI Infrastructure) certification with this collection of free practice questions. Each question is designed to mirror the actual exam format and objectives and comes with an answer and a detailed explanation. The materials are regularly updated for 2026, so you can build confidence with current resources and aim to pass on your first attempt.

Page: 2 / 2
Total 123 questions

ClusterKit's NCCL bandwidth test shows 350 GB/s on a 400G InfiniBand fabric. How should this result be interpreted?

A. Optimal performance, indicating healthy fabric and GPUDirect RDMA.
B. Suboptimal performance; requires FEC tuning to reach 380+ GB/s.
C. Critical failure; expected is > 390 GB/s for HDR InfiniBand.
D. Inconclusive; rerun with --stress=cpu to validate.
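
One way to sanity-check a figure like this is to compare it with the aggregate line rate of a node's fabric links. The sketch below is only illustrative: the helper name `fabric_efficiency` is hypothetical, and the link count of eight 400 Gb/s links per node is an assumption (the question does not state it).

```shell
# fabric_efficiency MEASURED_GBps LINK_Gbps NUM_LINKS
# Prints the measured bandwidth as a percentage of aggregate line rate.
# Assumption: line rate in GB/s = (Gb/s per link / 8) * number of links.
fabric_efficiency() {
  awk -v m="$1" -v l="$2" -v n="$3" \
    'BEGIN { printf "%.1f\n", 100 * m / (l / 8 * n) }'
}
```

Under those assumptions, `fabric_efficiency 350 400 8` prints 87.5, meaning 350 GB/s is 87.5% of a 400 GB/s aggregate line rate.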

After ClusterKit reports "GPU-Host latency exceeds threshold," which NVIDIA diagnostic tool should be used to isolate hardware faults?

A. Re-run ClusterKit with --stress=gpu -Y 60 to extend test duration
B. nvidia-smi topo -m to inspect GPU topology connections
C. DCGM diagnostics: dcgmi diag -r 2
D. ib_write_bw to measure InfiniBand bandwidth between nodes
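
For reference, the DCGM diagnostic named in the options can be wrapped in a small helper. This is a minimal sketch: the function name `run_dcgm_diag` and the `DCGMI_BIN` override are hypothetical and exist only so the wrapper can be dry-run on a machine without DCGM installed.

```shell
# run_dcgm_diag [LEVEL]
# Runs `dcgmi diag -r LEVEL` (default level 2). Higher run levels
# take longer and exercise the hardware more thoroughly.
# DCGMI_BIN (hypothetical) substitutes another binary for dry runs.
run_dcgm_diag() {
  level="${1:-2}"
  "${DCGMI_BIN:-dcgmi}" diag -r "$level"
}
```

Running `DCGMI_BIN=echo run_dcgm_diag 2` prints the arguments that would be passed (`diag -r 2`) without touching the GPUs.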

Refer to the output:

~$ sudo nvsm show health
Info
----
Timestamp: Sat Dec 16 16:26:32 2017 -0800
Version: 17.12-5
Checks
------
BIOS Revision [5.11].........................
DGX Serial Number [YSY72800016]..................
Verify installed DIMM memory sticks........................Healthy
...[output truncated]
Verify Ethernet controllers...........................Healthy
Verify installed GPU's..............................Unhealthy
Checking output of 'lspci' for expected GPU's
Missing GPU at PCI address '07:00.0'
Verify installed InfiniBand controllers....................Healthy
Verify PCIe switches..................................Healthy
...[output truncated]

What insights can a system administrator gain regarding the DGX system's health?

A. A GPU tray upgrade failed.
B. A GPU is missing on the DGX system.
C. A GPU driver upgrade has failed.
D. The system has passed the hardware health check successfully.
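
When the health report is long, it can help to filter it down to the failing checks. The following is a minimal sketch that assumes the report layout shown in the output above (a check name, a run of dots, then a status); the helper name `unhealthy_checks` is hypothetical.

```shell
# Reads a health report on stdin and prints each check that reported
# "Unhealthy", together with the detail lines that follow it
# (lines without a dotted status column).
unhealthy_checks() {
  awk '
    /\.\.\.+ *Unhealthy *$/ { show = 1; print; next }  # failing check
    /\.\.\.+ *[A-Za-z]* *$/ { show = 0; next }         # any other check line
    show { print }                                     # detail lines
  '
}
```

It could be used as `sudo nvsm show health | unhealthy_checks`, which for the output above would keep the GPU check and its two detail lines.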

An infrastructure engineer in an AI factory has successfully replaced a power supply unit on an NVIDIA DGX H100. After installation, both the IN and OUT LEDs on the new power supply illuminate solid green. Which NVSM CLI command should the engineer use to quickly verify the overall system status and ensure it is operating as expected?

A. nvsm show power
B. nvsm show powermode
C. nvsm show health
D. nvsm show alerts

A team is validating a DGX BasePOD deployment. Using cmsh, they run a command to check GPU health across all nodes. What indicates that the system is ready for AI workloads?

A. The command output is ignored if the system powers on without errors.
B. At least half of the GPUs report Status_Health = OK.
C. All GPUs report Status_Health = OK and Health = OK for each device.
D. Only the head node's GPUs need to be healthy.

An AI training cluster with NVIDIA GPUs experiences prolonged data loading times during checkpoint reloading, causing GPUs to idle frequently. CPU utilization during data transfers remains high. Which solution most effectively optimizes storage-to-GPU throughput while reducing CPU overhead?

A. Increase batch sizes to reduce the frequency of storage access.
B. Migrate datasets to SATA SSDs with RAID 0 for higher sequential read speeds.
C. Add more GPUs to the cluster to parallelize data loading tasks.
D. Implement GPUDirect Storage to enable direct data transfers.

A system administrator is installing the NVIDIA Container Toolkit and has successfully run the following commands:

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime docker

What step should be taken next to finish the installation?

A. dpkg -i doca-host-repo-ubuntu<version>_amd64.deb
B. apt-get install cuda-drivers
C. systemctl restart docker
D. apt-get remove nvidia-container-toolkit
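
The full sequence, including a restart of the Docker daemon so that it picks up the runtime configuration written by `nvidia-ctk`, might be scripted as below. This is a sketch only: the `run` helper, the `DRY_RUN` switch, and the function name `install_container_toolkit` are hypothetical and are there so the steps can be printed without being executed.

```shell
# run CMD...: execute the command, or just print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi
}

# The three commands from the question, followed by a Docker restart.
install_container_toolkit() {
  run sudo apt-get update &&
  run sudo apt-get install -y nvidia-container-toolkit &&
  run sudo nvidia-ctk runtime configure --runtime docker &&
  run sudo systemctl restart docker   # daemon reloads its configuration
}
```

`DRY_RUN=1 install_container_toolkit` prints the four commands in order. A common follow-up check is to run `nvidia-smi` inside a container, for example `sudo docker run --rm --gpus all ubuntu nvidia-smi`.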

A customer is designing an AI Factory for enterprise-scale deployments and wants to ensure redundancy and load balancing for the management and storage networks. Which feature should be implemented on the Ethernet switches?

A. Implement redundant switches with spanning tree protocol.
B. MLAG for bonded interfaces across redundant switches.
C. Use only one switch for all management and storage traffic.
D. Disable VLANs and use unmanaged switches.

A user wants to restrict a Docker container to use only GPUs 0 and 2. Which command achieves this?

A. docker run --gpus '"device=0,2"' nvidia/cuda:12.1-base nvidia-smi
B. docker run -e NVIDIA_VISIBLE_DEVICES=0,2 nvidia/cuda:12.1-base nvidia-smi
C. docker run --gpus all nvidia/cuda:12.1-base nvidia-smi -id=0,2
D. docker run --device /dev/nvidia0,/dev/nvidia2 nvidia/cuda:12.1-base nvidia-smi
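
The nested quoting in the `--gpus` form is easy to get wrong, so a script may prefer to build the value programmatically. The helper below is a hypothetical sketch (the name `docker_gpu_arg` is not part of Docker); it produces the same value as writing `'"device=0,2"'` by hand.

```shell
# docker_gpu_arg ID...
# Builds the value for `docker run --gpus` from a list of GPU indices.
# The embedded double quotes are part of the value: they keep Docker
# from treating the comma as a separator between options.
docker_gpu_arg() {
  ids=$(IFS=,; echo "$*")          # join arguments with commas
  printf '"device=%s"\n' "$ids"
}
```

For example, `docker run --rm --gpus "$(docker_gpu_arg 0 2)" nvidia/cuda:12.1-base nvidia-smi` passes the value `"device=0,2"` (quotes included) as a single argument.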

After configuring HA, the administrator runs cmsh status and notices the secondary head node reports mysql [FAIL]. What is the most likely cause?

A. The BCM license expired after HA configuration.
B. Network connectivity issues between the primary and secondary head nodes.
C. The secondary head node lacks NVIDIA GPU drivers.
D. The cluster nodes are powered on during the HA configuration.

When configuring an out-of-core HPL burn-in for a 40B matrix on 8x H100 nodes, which environment variable prevents GPU out-of-memory errors while reserving space for drivers?

A. export HPL_OOC_SAFE_SIZE=4.0
B. export HPL_OOC_MODE=0
C. export HPL_OOC_NUM_STREAMS=8
D. export HPL_OOC_MAX_GPU_MEM=90

A user needs to configure NGC CLI to access resources across multiple organizations. What is the recommended command syntax to achieve this?

A. export NGC_CLI_ORG=org-name && ngc config set
B. ngc config list to manually edit the JSON configuration file.
C. ngc registry login --org org-name
D. ngc config set --org org-name --ace ace-name

During East-West fabric validation on a 64-GPU cluster, an engineer runs all_reduce_perf and observes an algorithm bandwidth of 350 GB/s and bus bandwidth of 656 GB/s. What does this indicate about the fabric performance?

A. Inconclusive; rerun with point-to-point tests.
B. Optimal performance; bus bandwidth near theoretical peak for NDR InfiniBand.
C. Critical failure; bus bandwidth exceeds hardware capabilities.
D. Suboptimal performance; algorithm bandwidth should match bus bandwidth.
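
The two figures are related by a fixed factor. For all-reduce, nccl-tests derives bus bandwidth from algorithm bandwidth as busbw = algbw × 2(n−1)/n, where n is the number of ranks, so bus bandwidth is expected to be higher than algorithm bandwidth. The sketch below applies that formula; the helper name `allreduce_busbw` is hypothetical.

```shell
# allreduce_busbw ALGBW_GBps NUM_RANKS
# nccl-tests convention for all-reduce: busbw = algbw * 2*(n-1)/n
allreduce_busbw() {
  awk -v a="$1" -v n="$2" 'BEGIN { printf "%.2f\n", a * 2 * (n - 1) / n }'
}
```

`allreduce_busbw 350 16` prints 656.25 and `allreduce_busbw 350 64` prints 689.06. The 656 GB/s quoted in the question therefore corresponds to the factor for 16 ranks; the question does not say how many ranks took part in the run.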

A system administrator needs to validate a GPU-based server and ensure that no errors occur under load. What command should be used?

A. nvsm dump health
B. stress-test --usage
C. nvsm show health
D. nvsm stress-test

During server maintenance, a system administrator wants to ensure that the NVIDIA DGX server has sufficient disk space for operational activities. The administrator is scripting an alert system that will notify the team if disk space falls below a threshold. Which command could be included in the maintenance script to check the available disk space on the server?

A. nvidia-smi --query-disk-space
B. du -sh /home/*
C. df -h | grep '/var'
D. lsof +L1
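
For an alerting script, the human-readable `df -h` output is awkward to compare against a threshold; the POSIX `df -P` format is easier to parse. The following is a minimal sketch with a hypothetical helper name (`mounts_over_threshold`) and an arbitrary default threshold of 90%. It assumes mount points contain no spaces.

```shell
# mounts_over_threshold [THRESHOLD]
# Reads `df -P` output on stdin and prints every mount point whose
# usage is at or above THRESHOLD percent (default 90).
mounts_over_threshold() {
  awk -v t="${1:-90}" '
    NR > 1 {                       # skip the header line
      use = $5; sub(/%/, "", use)  # column 5 is "Capacity", e.g. 95%
      if (use + 0 >= t + 0) print $6, use "%"
    }'
}
```

A maintenance script could call `df -P /var | mounts_over_threshold 90` and send a notification whenever the output is non-empty.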

Why is it important to provide a large and high-performance local cache (using SSDs configured as RAID-0) for deep learning workloads on DGX systems?

A. Local SSD cache allows users to increase the number of NFS threads on the server without impacting storage reliability.
B. Using local SSD cache in RAID-0 enables direct GPU access to files without host CPU involvement, further boosting performance.
C. Local SSD cache in RAID-0 is necessary to provide redundancy in case one of the drives fails during long training runs.
D. A local SSD cache in RAID-0 ensures that most training data is read only once from the network, significantly reducing NFS traffic.

Copyright © 2014-2026 Solution2Pass. All Rights Reserved