Potential Deadlock in _Py_qsbr_reserve #148953

@AnnaAr321

Description

Crash report

What happened?

Hello

I don't have a good unit test for this. I'm running a server on a free-threaded 3.14.3t build. At higher QPS, I see a rapid, unbounded increase in RAM usage, while CPU load stays normal.

Based on a dump of active threads, I suspect a deadlock somewhere around _Py_qsbr_reserve and Stop the World:

E.g.

  • Thread 1: this thread successfully initiated the Stop the World event and is waiting for all other threads to acknowledge and pause (park).
    • stop_the_world indicates it is the one trying to stop everything.
    • PyEvent_WaitTimed shows it is sitting there waiting for a signal that everyone has stopped.
    do_futex_wait
    __new_sem_wait_slow
    _PySemaphore_Wait
    _PyParkingLot_Park
    PyEvent_WaitTimed
    stop_the_world
    type_set_abstractmethods
    type_setattro
    PyObject_SetAttr
    _abc__abc_init
    ... etc
    
  • Thread 2: this thread also needed to stop the world (to grow an internal array) but got blocked because another thread already held the master lock.
    • _Py_qsbr_reserve shows it was trying to reserve space in the memory management system.
    • _PyMutex_LockTimed shows it is blocked waiting for a lock inside stop_the_world. This is the lock held by Thread 1.
    do_futex_wait
    __new_sem_wait_slow
    _PySemaphore_Wait
    _PyParkingLot_Park
    _PyMutex_LockTimed
    stop_the_world
    _Py_qsbr_reserve
    PyGILState_Ensure
    ... etc
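
One way to reason about the two stacks above is a wait-for graph: an edge from A to B means A cannot proceed until B does, and a deadlock corresponds to a cycle. The sketch below is only an illustration, not CPython code. The edge "Thread 2 waits on Thread 1" comes from the dump. The third thread (T3) and both of its edges are assumptions, added to show what would have to be true for the cycle to close; the dump does not show which thread Thread 1 is actually waiting on. All names (`wait_graph`, `has_deadlock_cycle`, `build_graph`) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_THREADS 8

/* waits_on[a][b] is true when thread a cannot proceed until thread b does. */
typedef struct {
    int n;
    bool waits_on[MAX_THREADS][MAX_THREADS];
} wait_graph;

/* Depth-first search; state: 0 = unvisited, 1 = on current path, 2 = done. */
static bool visit(const wait_graph *g, int node, int *state)
{
    state[node] = 1;
    for (int next = 0; next < g->n; next++) {
        if (!g->waits_on[node][next]) {
            continue;
        }
        if (state[next] == 1) {
            return true;  /* back edge: the wait chain loops on itself */
        }
        if (state[next] == 0 && visit(g, next, state)) {
            return true;
        }
    }
    state[node] = 2;
    return false;
}

bool has_deadlock_cycle(const wait_graph *g)
{
    int state[MAX_THREADS] = {0};
    for (int i = 0; i < g->n; i++) {
        if (state[i] == 0 && visit(g, i, state)) {
            return true;
        }
    }
    return false;
}

/* Index 0 = Thread 1, index 1 = Thread 2, index 2 = hypothetical T3. */
wait_graph build_graph(bool include_hypothetical_edges)
{
    wait_graph g = { .n = 3 };
    /* From the dump: Thread 2 waits for the lock held by Thread 1. */
    g.waits_on[1][0] = true;
    if (include_hypothetical_edges) {
        /* ASSUMPTION: Thread 1 is waiting for some thread T3 to park. */
        g.waits_on[0][2] = true;
        /* ASSUMPTION: T3 is blocked on shared->mutex, held by Thread 2. */
        g.waits_on[2][1] = true;
    }
    return g;
}
```

With only the edge taken from the dump, the graph has no cycle; it becomes a deadlock only if both assumed edges hold, which is what would need to be confirmed from the full thread dump.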
    

I suspect a modification of _Py_qsbr_reserve could help, but I don't know this piece of infrastructure well enough to make the change myself, so please help. Specifically, changing _Py_qsbr_reserve to the following seemed to help; the server's thread dump no longer shows threads waiting on Stop the World:

Py_ssize_t
_Py_qsbr_reserve(PyInterpreterState *interp)
{
    struct _qsbr_shared *shared = &interp->qsbr;

    PyMutex_Lock(&shared->mutex);
    // Try allocating from our internal freelist
    struct _qsbr_thread_state *qsbr = qsbr_allocate(shared);

    while (qsbr == NULL) {
        // Unlock before stopping the world to avoid deadlocks.
        // If we hold shared->mutex while waiting for the world to stop,
        // we might block a thread that needs to acquire shared->mutex to park.
        PyMutex_Unlock(&shared->mutex);
        _PyEval_StopTheWorld(interp);
        PyMutex_Lock(&shared->mutex);

        // Try allocating again, as another thread might have grown the array
        // or freed an entry while we were waiting.
        qsbr = qsbr_allocate(shared);
        if (qsbr != NULL) {
            _PyEval_StartTheWorld(interp);
            break;
        }

        // Still NULL, we must grow it
        if (grow_thread_array(shared) == 0) {
            qsbr = qsbr_allocate(shared);
        } else {
            // Failed to grow array (e.g. OOM). Break to avoid infinite loop.
            _PyEval_StartTheWorld(interp);
            break;
        }
        _PyEval_StartTheWorld(interp);
    }

    // Return an index rather than the pointer because the array may be
    // resized and the pointer invalidated.
    Py_ssize_t index = -1;
    if (qsbr != NULL) {
        index = (struct _qsbr_pad *)qsbr - shared->array;
    }
    PyMutex_Unlock(&shared->mutex);
    return index;
}
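
The lock-ordering idea behind the change above (drop the inner lock, take the outer one, re-take the inner one, then re-check) can be sketched in isolation with plain pthreads. This is a minimal model under stated assumptions, not the CPython implementation: `table_lock` stands in for shared->mutex, `pause_lock` stands in for the lock taken inside stop_the_world, and `reserve_slot`, `allocate_locked`, `grow_locked`, and `run_demo` are hypothetical names. It does not model thread parking, only the order in which the two locks are acquired.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-ins, not CPython APIs. */
static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER; /* ~ shared->mutex */
static pthread_mutex_t pause_lock = PTHREAD_MUTEX_INITIALIZER; /* ~ stop-the-world lock */

static int *slots = NULL;   /* 0 = free, 1 = in use */
static size_t capacity = 0;

/* Caller must hold table_lock. Returns a free index, or -1 if none. */
static long allocate_locked(void)
{
    for (size_t i = 0; i < capacity; i++) {
        if (slots[i] == 0) {
            slots[i] = 1;
            return (long)i;
        }
    }
    return -1;
}

/* Caller must hold both locks. Returns 0 on success, -1 on allocation failure. */
static int grow_locked(void)
{
    size_t new_cap = capacity ? capacity * 2 : 2;
    int *p = realloc(slots, new_cap * sizeof(*p));
    if (p == NULL) {
        return -1;
    }
    for (size_t i = capacity; i < new_cap; i++) {
        p[i] = 0;
    }
    slots = p;
    capacity = new_cap;
    return 0;
}

long reserve_slot(void)
{
    pthread_mutex_lock(&table_lock);
    long index = allocate_locked();
    while (index < 0) {
        /* Never wait for pause_lock while holding table_lock: release the
         * inner lock first so the order is always pause_lock -> table_lock. */
        pthread_mutex_unlock(&table_lock);
        pthread_mutex_lock(&pause_lock);
        pthread_mutex_lock(&table_lock);

        /* Re-check: another thread may have grown the table in the gap. */
        index = allocate_locked();
        int failed = 0;
        if (index < 0) {
            failed = grow_locked();
            if (!failed) {
                index = allocate_locked();
            }
        }
        pthread_mutex_unlock(&pause_lock);
        if (failed) {
            break;  /* out of memory: give up instead of looping forever */
        }
    }
    pthread_mutex_unlock(&table_lock);
    return index;
}

static void *worker(void *out)
{
    *(long *)out = reserve_slot();
    return NULL;
}

/* Starts n threads (1..64); returns 1 if each got a distinct valid index. */
int run_demo(int n)
{
    pthread_t tids[64];
    long got[64];
    int started = 0;
    if (n < 1 || n > 64) {
        return 0;
    }
    for (; started < n; started++) {
        got[started] = -1;
        if (pthread_create(&tids[started], NULL, worker, &got[started]) != 0) {
            break;
        }
    }
    for (int i = 0; i < started; i++) {
        pthread_join(tids[i], NULL);
    }
    if (started != n) {
        return 0;
    }
    for (int i = 0; i < n; i++) {
        if (got[i] < 0) {
            return 0;
        }
        for (int j = 0; j < i; j++) {
            if (got[j] == got[i]) {
                return 0;
            }
        }
    }
    return 1;
}
```

Calling `run_demo(16)` starts 16 threads that each reserve one slot. Because no thread ever blocks on `pause_lock` while holding `table_lock`, the two locks cannot form a cycle in this model. Whether the same ordering is sufficient inside CPython depends on details of stop_the_world that this sketch does not capture.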

Similar issues in the past:

CPython versions tested on:

3.14

Operating systems tested on:

Linux

Output from running 'python -VV' on the command line:

3.14.3 (free-threading)

Metadata

    Labels

    interpreter-core (Objects, Python, Grammar, and Parser dirs)
    topic-free-threading
    type-crash (A hard crash of the interpreter, possibly with a core dump)
