
Fix mixed-allocator heap corruption in C tokenizer entity path #356

Open
gistrec wants to merge 1 commit into earwig:main from gistrec:fix/c-tokenizer-calloc-allocator-mismatch


@gistrec gistrec commented May 11, 2026

Summary

Fixes #352.

The C tokenizer was mixing allocators in the entity parsing path. common.h macro-redefines malloc, realloc, and free to Python's object allocator functions, but calloc was not redefined. As a result, Tokenizer_really_parse_entity() allocated buffers with libc calloc() and later released them through the macro-expanded free() → PyObject_Free().

That allocator mismatch is undefined behavior. With PYTHONMALLOC=debug, it aborts immediately with _PyMem_DebugRawFree: bad ID; without debug malloc, it can silently corrupt the heap.

Root cause

src/mwparserfromhell/parser/ctokenizer/common.h currently redirects only:

#define malloc  PyObject_Malloc // XXX: yuck
#define realloc PyObject_Realloc
#define free    PyObject_Free

But Tokenizer_really_parse_entity() uses calloc(...) for entity parsing buffers. Those buffers are later released with free(...), which expands to PyObject_Free(...). This passes a libc-allocated pointer to Python's object allocator.
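
The mismatch can also be illustrated without mwparserfromhell. The following is a minimal sketch, not code from this PR: it starts a child interpreter with PYTHONMALLOC=debug and, through ctypes, frees a libc calloc() buffer with PyObject_Free(). The helper name run_child and the use of ctypes.CDLL(None) to reach libc are assumptions made for illustration, and the latter only works on POSIX-like platforms. On such a platform the child is expected to die with a fatal "bad ID" error, but the exact outcome depends on the libc and the Python build.

```python
import os
import subprocess
import sys
import textwrap

# Child program: allocate with libc calloc, release with PyObject_Free.
# Assumption: a POSIX platform where ctypes.CDLL(None) exposes libc symbols.
MISMATCH_SOURCE = textwrap.dedent(
    """
    import ctypes

    libc = ctypes.CDLL(None)
    libc.calloc.restype = ctypes.c_void_p
    libc.calloc.argtypes = [ctypes.c_size_t, ctypes.c_size_t]

    py_free = ctypes.pythonapi.PyObject_Free
    py_free.restype = None
    py_free.argtypes = [ctypes.c_void_p]

    buf = libc.calloc(1, 16)  # buffer owned by libc
    py_free(buf)              # released through Python's object allocator
    """
)


def run_child(source, debug_malloc=True):
    """Run `source` in a fresh interpreter; set PYTHONMALLOC=debug if requested."""
    env = dict(os.environ)
    if debug_malloc:
        env["PYTHONMALLOC"] = "debug"
    return subprocess.run(
        [sys.executable, "-c", source],
        env=env,
        capture_output=True,
        text=True,
        errors="replace",
    )


if __name__ == "__main__":
    result = run_child(MISMATCH_SOURCE)
    # A nonzero or negative return code means the child did not exit cleanly.
    print("return code:", result.returncode)
    print("first stderr line:", result.stderr.strip().splitlines()[:1])
```

Running the crash in a child process keeps the parent interpreter's heap intact, which is the same reason the regression test in this PR uses a subprocess.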

Reproduction

Before this change:

PYTHONMALLOC=debug python -c "import mwparserfromhell; mwparserfromhell.parse('{{T|p=a & b}}')"

Actual result:

Fatal Python error: _PyMem_DebugRawFree: bad ID: Allocated using API ' ', verified using API 'o'

Expected result: parsing succeeds and returns a Wikicode object without corrupting the heap.

The issue is triggered by inputs that go through the entity parsing path, for example:

"a & b"
"{{T|p=a & b}}"
"&"
"*"
"*"

Fix

Add calloc to the same allocator macro block:

#define malloc  PyObject_Malloc // XXX: yuck
#define calloc  PyObject_Calloc
#define realloc PyObject_Realloc
#define free    PyObject_Free

This makes the existing calloc sites use Python's allocator consistently with the surrounding free() calls.
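
As a rough check of why the redirect is a drop-in replacement: PyObject_Calloc takes the same (nelem, elsize) arguments as calloc and returns zero-initialized memory that PyObject_Free accepts. The sketch below exercises that matched pair through ctypes in a child interpreter running with PYTHONMALLOC=debug. The helper name run_child is hypothetical, and this is an illustration of the allocator pairing, not the tokenizer code itself.

```python
import os
import subprocess
import sys
import textwrap

# Child program: allocate and release with the same allocator family.
MATCHED_SOURCE = textwrap.dedent(
    """
    import ctypes

    py_calloc = ctypes.pythonapi.PyObject_Calloc
    py_calloc.restype = ctypes.c_void_p
    py_calloc.argtypes = [ctypes.c_size_t, ctypes.c_size_t]

    py_free = ctypes.pythonapi.PyObject_Free
    py_free.restype = None
    py_free.argtypes = [ctypes.c_void_p]

    buf = py_calloc(4, 8)  # 4 elements of 8 bytes, like calloc(4, 8)
    assert buf, "allocation failed"
    assert ctypes.string_at(buf, 32) == bytes(32)  # memory is zero-filled
    py_free(buf)           # same allocator family, so the debug hooks accept it
    print("ok")
    """
)


def run_child(source):
    """Run `source` in a fresh interpreter with PYTHONMALLOC=debug."""
    env = dict(os.environ, PYTHONMALLOC="debug")
    return subprocess.run(
        [sys.executable, "-c", source],
        env=env,
        capture_output=True,
        text=True,
        errors="replace",
    )


if __name__ == "__main__":
    result = run_child(MATCHED_SOURCE)
    print("return code:", result.returncode)
    print(result.stdout.strip())
```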

Tests

  • Added a regression test that runs the parser in a subprocess with PYTHONMALLOC=debug, so future allocator mismatches fail deterministically instead of becoming silent heap corruption.
  • Verified the regression test fails before the fix with SIGABRT and _PyMem_DebugRawFree: bad ID.
  • Verified after the fix:
    • python -m pytest tests/ — 2006 passed, 1 skipped.
    • PYTHONMALLOC=debug python -m pytest tests/ — 2006 passed, 1 skipped.
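
For readers who want the shape of such a test, here is a minimal sketch of a subprocess-based check. It is not the test added in this PR: the function names, the input list, and the skip-by-return when mwparserfromhell is not importable are all assumptions. It also assumes the installed package is actually using the C tokenizer; with the pure-Python fallback the check passes trivially.

```python
import importlib.util
import os
import subprocess
import sys

# Inputs taken from the reproduction section that reach the entity parsing path.
ENTITY_INPUTS = ["a & b", "{{T|p=a & b}}", "&"]


def parse_in_debug_subprocess(text):
    """Parse `text` in a child interpreter running with PYTHONMALLOC=debug."""
    code = f"import mwparserfromhell; mwparserfromhell.parse({text!r})"
    env = dict(os.environ, PYTHONMALLOC="debug")
    return subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        capture_output=True,
        text=True,
        errors="replace",
    )


def test_entity_path_survives_debug_malloc():
    # Hypothetical guard: do nothing if the package is not installed.
    if importlib.util.find_spec("mwparserfromhell") is None:
        return
    for text in ENTITY_INPUTS:
        result = parse_in_debug_subprocess(text)
        # An allocator mismatch aborts the child, giving a nonzero return code.
        assert result.returncode == 0, (text, result.stderr)
        assert "bad ID" not in result.stderr, text
```

The child process matters here: an abort from the debug allocator would otherwise take down the whole test run instead of failing a single test.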

The C tokenizer's common.h redefines malloc/realloc/free to the
PyObject_* family but left calloc pointing at libc. Entity parsing in
Tokenizer_really_parse_entity allocates with calloc and later releases
the buffer with free (= PyObject_Free), which is undefined behavior:
under PYTHONMALLOC=debug it aborts in _PyMem_DebugRawFree, and
otherwise it can silently corrupt the heap.

Route calloc through PyObject_Calloc alongside the other allocator
macros so both calloc sites match the surrounding free.

Fixes earwig#352.


Development

Successfully merging this pull request may close these issues.

Segfault / heap corruption when parsing bare & inside template parameter values (reproducible with PYTHONMALLOC=debug)