
[BUG] precision cast in _process_input_tensor breaks integer target tensors (cross_entropy / nll_loss / kl_div) #151

@moria97

Description


Summary

src/kernelbench/eval.py::_process_input_tensor unconditionally casts every input tensor to the configured precision (default torch.float32). For tasks whose get_inputs() returns integer class indices — e.g. torch.randint(0, num_classes, (batch_size,)) for CrossEntropyLoss / NLLLoss / KLDivLoss problems — this turns the int64 target tensor into float32, which breaks PyTorch's dispatch for ops that require Long targets.

The user-visible failure is a misleading error from PyTorch:

NotImplementedError: "nll_loss_forward_reduce_cuda_kernel_2d_index"
  not implemented for 'Float'

This looks like a CUDA arch / dtype coverage gap (and we initially diagnosed it as a Blackwell PyTorch wheel issue), but the root cause is purely the cast in this helper.
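
The dispatch failure is easy to demonstrate in isolation (a minimal sketch, independent of the harness; the shapes here are arbitrary):

import torch
import torch.nn.functional as F

logits = torch.randn(8, 10)
targets = torch.randint(0, 10, (8,))               # int64 class indices
F.cross_entropy(logits, targets)                   # fine

# The same call after the cast the harness performs fails:
# NotImplementedError on CUDA, RuntimeError on CPU (exact message differs).
F.cross_entropy(logits, targets.to(torch.float32))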

Affected code

https://github.com/ScalingIntelligence/KernelBench/blob/main/src/kernelbench/eval.py#L370-L391

def _process_input_tensor(input, device, backend="cuda", precision=torch.float32):
    ...
    # sometimes things like init inputs are floats (like in the case of labels / targets, classification losses, etc.)
    if not isinstance(input, torch.Tensor):
        return input

    # cast to the desired percision dtype for activations
    input_tensor = input.to(dtype=precision)   # ← casts int64 → float

    return input_tensor.to(device=device)

The comment on line 383 explicitly recognizes that "labels / targets / classification losses" are a special case, but the implementation does not actually exempt them — every tensor gets cast.

When was this introduced

The integer-dtype protection existed in earlier versions and was removed during the precision-support refactor in #80 ("Precision Support + TileLang Integration", merged 2025-11-05). Issue #79 was the original feature request; the regression was an unintended side-effect.

Reproduction

Environment:

  • PyTorch 2.9.0+cu128
  • CUDA 12.8
  • GPU: any (we hit it on RTX 5090, but this is a dispatch issue, not arch-specific)

Command:

python3 scripts/generate_and_eval_single_sample.py \
    dataset_src=local level=1 problem_id=95 \
    server_type=openai model_name=<any-model> \
    eval_mode=local backend=cuda precision=fp32 \
    gpu_arch="['Ada']"

Without any LLM-generated kernel involved, the reference implementation in level1/95_CrossEntropyLoss.py raises NotImplementedError because the target tensor was cast to Float by _process_input_tensor.
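
The cast can also be observed directly, without running the full pipeline (a hypothetical sketch; it assumes _process_input_tensor is importable from kernelbench.eval, matching the file path above):

import torch
from kernelbench.eval import _process_input_tensor

targets = torch.randint(0, 10, (4096,))    # int64, as get_inputs() returns for problem 95
processed = _process_input_tensor(targets, device="cuda")
print(processed.dtype)                     # torch.float32 on current main — the bug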

Affected level_1 problems we've seen fail with this:

  • 95_CrossEntropyLoss
  • 98_KLDivLoss
  • (any other problem whose get_inputs() returns integer indices)

Impact

  • Multiple level_1 problems cannot be evaluated on current main, regardless of which model is being benchmarked.
  • The error path produces compiled=False in the metrics, which falsely attributes the failure to the LLM-generated kernel rather than to the eval harness.
  • For RL / agent loops that consume these metrics, this introduces noise and a potentially misdirected reward signal.

Proposed fix

Skip the precision cast for non-floating-point tensors:

def _process_input_tensor(input, device, backend="cuda", precision=torch.float32):
    if not isinstance(input, torch.Tensor):
        return input
    if not input.is_floating_point():
        return input.to(device=device)        # int / bool: only move
    return input.to(dtype=precision).to(device=device)

A PR implementing this fix is open at #.
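
For regression coverage, a small test along these lines could accompany it (a hypothetical sketch; it assumes _process_input_tensor is importable from kernelbench.eval and uses CPU so no GPU is needed):

import torch
from kernelbench.eval import _process_input_tensor

def test_integer_inputs_keep_their_dtype():
    targets = torch.randint(0, 10, (16,))
    out = _process_input_tensor(targets, device="cpu", precision=torch.float16)
    assert out.dtype == torch.int64        # dtype preserved; tensor only moved

def test_float_inputs_are_cast():
    acts = torch.randn(16, 8)
    out = _process_input_tensor(acts, device="cpu", precision=torch.float16)
    assert out.dtype == torch.float16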
