
fix TestBindSimple and TestBindCgoPackage #386

Closed
b-long wants to merge 12 commits into master from bugfix/issue-385


b-long commented Apr 16, 2026

Fixes #385

Relates-to: #370

b-long changed the title from "work in progress" to "fix TestBindSimple and TestBindCgoPackage" Apr 17, 2026
Skip test for specific Go version due to CGO issue.
b-long added 2 commits April 30, 2026 06:15
* remove skip condition for Go 1.23 in TestBindCgoPackage

* update Go version matrix in CI configuration
Adds two reproducers that exercise the go2py/C.CString-without-GIL crash:
1. A 5000-iteration stress loop in the cgo example (Hi/Hello string returns).
2. A new gilstring example covering struct string fields, slice elements,
   and map values under repeated calls.
b-long marked this pull request as ready for review April 30, 2026 11:28
b-long added 4 commits April 30, 2026 23:21
- gilstring.go reduced to a single Hello() function (mirrors hi.Hello from
  the issue report)
- test.py imports both gilstring and simple as two separately-built extensions
  in the same Python process, interleaving Add/Hello calls over 5000 iterations
- TestGilString builds each package into its own subdir to prevent C symbol
  collisions, then runs test.py with a shared PYTHONPATH root
- ci.yml adds macos-15-intel (x86_64) to the matrix — the platform where
  "fatal error: bad sweepgen in refill" reliably reproduces
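The interleaved loop described in this commit can be sketched roughly as follows. The real test.py imports the gopy-built `gilstring` and `simple` extensions; in this sketch two plain Python functions stand in for `simple.Add` and `gilstring.Hello` so that it runs without a build step. The helper name `stress`, the stand-ins, and the assumption that `Hello` takes and returns a string are all hypothetical.

```python
def stress(add, hello, iterations=5000):
    """Alternate calls to two callables and check every result.

    In the real test the callables would come from two separately built
    extensions (assumed here: simple.Add and gilstring.Hello).
    """
    for i in range(iterations):
        total = add(i, 1)
        if total != i + 1:
            raise AssertionError(f"Add({i}, 1) returned {total}")
        greeting = hello(f"user{i}")
        if not isinstance(greeting, str) or f"user{i}" not in greeting:
            raise AssertionError(f"Hello returned unexpected value {greeting!r}")
    return iterations


# Stand-ins so the sketch runs without building any extension.
def fake_add(a, b):
    return a + b


def fake_hello(name):
    return "hello, " + name


completed = stress(fake_add, fake_hello)
```

Passing the callables in keeps the loop independent of how the extensions are built, so the same function could be pointed at the real modules once they are importable.
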
C.GoString (and other py2go converters) call runtime.gostring → mallocgc
inside a CGo callback. If the GIL is released before those conversions,
Go's GC can observe a corrupted sweep-generation counter, causing
"fatal error: bad sweepgen in refill" on Go ≥1.24 / macOS x86_64
(issue #370).

In genFuncBody(), pre-convert each py2go argument into a local variable
while the GIL is held via PyGILState_Ensure/Release, then release the
GIL for the actual Go function call as before. The callArgs loop now
references the pre-converted variable instead of inlining C.GoString()
after SaveThread.

Also documents Idea 2 (unsafe.String zero-alloc approach) as a future
defence-in-depth option in a code comment.
…ppers

C.GoString (and other py2go converters) call runtime.gostring → mallocgc
inside a CGo callback. If those conversions run after PyEval_SaveThread
releases the GIL, Go's GC can observe a corrupted sweep-generation counter,
causing "fatal error: bad sweepgen in refill" on Go ≥1.24 / macOS x86_64
(issue #370).

In genFuncBody(), pre-convert each py2go argument into a local variable
while the GIL is held via PyGILState_Ensure/Release, then release the GIL
for the actual Go function call as before. Interface-handle arguments
(ifchandle && goname == "interface{}") are excluded from pre-conversion,
matching the existing callArgs switch logic to avoid type mismatches in
generated code for the iface example.

Also documents Idea 2 (unsafe.String zero-alloc approach) as a future
defence-in-depth option in a code comment.
b-long marked this pull request as draft May 2, 2026 02:38
b-long added 4 commits May 2, 2026 14:32
On macos-15-intel, two separately-built gopy extensions loaded in the
same Python process can crash with "fatal error: bad sweepgen in refill"
on certain Go versions. The root cause is not yet confirmed: candidate
mechanisms include PLT-based CGo symbol interposition (crosscall2,
_cgo_topofstack, x_cgo_inittls, etc.) and/or dyld global-namespace
deduplication of the ~150 runtime symbols exported by both .so files.

Add a diagnostic step that runs on every macos-15-intel job (pass or
fail) and reports three things:
  1. how many dynamic symbols are shared between the two extensions
  2. which of the critical CGo bridge symbols appear in the indirect
      symbol table (otool -Iv) — the macOS equivalent of JUMP_SLOT/PLT
  3. which library wins in the global namespace at runtime (ctypes)

Comparing the output across Go 1.21/1.22 (fail), 1.23/1.24 (pass), and
1.25 (fail) should confirm whether the crash correlates with PLT stub
generation changes between Go versions.
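The first diagnostic item (counting symbols shared by the two extensions) amounts to a set intersection over each library's symbol listing. The sketch below assumes nm-style text in which every non-empty line ends with the symbol name; how that text is produced in the CI step is not shown in this commit message, and the sample lines are illustrative, not real job output. The function names are hypothetical.

```python
def exported_symbols(nm_output):
    """Return the set of symbol names found in nm-style text.

    Each non-empty line is assumed to end with the symbol name,
    e.g. "0000000000001f40 T _crosscall2".
    """
    names = set()
    for line in nm_output.splitlines():
        fields = line.split()
        if fields:
            names.add(fields[-1])
    return names


def shared_symbols(nm_a, nm_b):
    """Symbols listed for both libraries, sorted for stable output."""
    return sorted(exported_symbols(nm_a) & exported_symbols(nm_b))


# Illustrative input only; these lines are not real output from the CI job.
sample_a = """\
0000000000001f40 T _crosscall2
0000000000002a10 T __cgo_topofstack
0000000000003b00 T _PyInit__gilstring
"""
sample_b = """\
0000000000001e00 T _crosscall2
0000000000002900 T __cgo_topofstack
0000000000003c80 T _PyInit__simple
"""

overlap = shared_symbols(sample_a, sample_b)
print(len(overlap), overlap)
```
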
Loading two gopy extensions in the same Python process embeds two
independent Go runtimes. On macOS x86_64 / Go ≥1.24 this causes
"fatal error: bad sweepgen in refill" (issue #370) when both runtimes
run Go code concurrently.

Add a process-wide pthread_mutex_t stored as a Python capsule in
builtins._gopy_global_mu so every gopy extension in the same interpreter
shares the same lock. The generated CGo wrappers:

  1. Call gopy_ensure_mu() (lazy init, Python GIL must be held) before
     releasing the GIL.
  2. Release the GIL via PyEval_SaveThread.
  3. Acquire the mutex via gopy_lock(), blocking until any other
     extension's Go call finishes, then run the Go call.
  4. Release the mutex (gopy_unlock()) before restoring the GIL
     (PyEval_RestoreThread), avoiding the GIL/mutex deadlock.

On Windows the lock/unlock are compiled as no-ops.
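
The sharing mechanism can be illustrated with a pure-Python analogue. This is not the generated C code: the commit stores a pthread_mutex_t in a capsule, whereas the sketch below stores a `threading.Lock` under the same `builtins` attribute name to show how independent modules end up with one lock. The function names `ensure_mu` and `call_serialised` are hypothetical.

```python
import builtins
import threading

_ATTR = "_gopy_global_mu"  # attribute name taken from the commit message


def ensure_mu():
    """Return the process-wide lock, creating it on first use.

    setdefault keeps whichever lock was stored first, so later callers
    (including other modules) receive the existing object.
    """
    return vars(builtins).setdefault(_ATTR, threading.Lock())


def call_serialised(fn, *args):
    """Run fn while holding the shared lock, releasing it afterwards."""
    with ensure_mu():
        return fn(*args)


# Two "extensions" resolving the lock independently get the same object.
lock_seen_by_a = ensure_mu()
lock_seen_by_b = ensure_mu()
result = call_serialised(lambda a, b: a + b, 2, 3)
```

Using the interpreter's `builtins` namespace as the rendezvous point means neither module needs to know about the other; each one only needs to agree on the attribute name.
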

Fixes #370 / #385.
…erposition

When two gopy extensions are loaded in the same Python process via
RTLD_GLOBAL, Go runtime data globals (mcache0, allm, mheap_, etc.) from
the first-loaded library win in the dynamic-linker global namespace.
The second runtime's references to those globals are silently redirected,
so both runtimes share the same heap metadata. This corrupts sweep-
generation counters and causes:

  fatal error: bad sweepgen in refill

on macOS x86_64 / Go ≥1.24 (the earlier pthread mutex fix serialised
user code but could not stop background GC goroutines that also hit the
shared globals).

Fix: pass a symbol-visibility restriction to the final go build step so
that only PyInit__<name> is exported into the global namespace:
  - macOS: -extldflags=-Wl,-exported_symbols_list,<file>
  - Linux: -extldflags=-Wl,--version-script,<file>

All CGo bridge symbols (crosscall2, _cgo_topofstack, …) remain in the
.so and are called directly at link time; they no longer pollute the
global namespace and cannot be interposed by a second extension.
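
As a rough illustration, the two restriction files could be generated as below. The file contents are an assumption about what such a list would hold (a single Mach-O symbol with its leading underscore, or a GNU ld version script that hides everything else); the commit message only gives the linker flags, which are reproduced with their `<file>` placeholder. The function name `export_restriction` is hypothetical.

```python
def export_restriction(module_name, platform):
    """Return (file_contents, extldflags) limiting exports to PyInit__<name>.

    The "<file>" placeholder in the flag is left as in the commit message.
    """
    init_symbol = f"PyInit__{module_name}"
    if platform == "darwin":
        # Mach-O C symbols carry a leading underscore in the symbol table.
        contents = f"_{init_symbol}\n"
        flag = "-extldflags=-Wl,-exported_symbols_list,<file>"
    elif platform == "linux":
        # GNU ld version script: export one symbol, hide everything else.
        contents = f"{{\n  global: {init_symbol};\n  local: *;\n}};\n"
        flag = "-extldflags=-Wl,--version-script,<file>"
    else:
        raise ValueError(f"unsupported platform: {platform}")
    return contents, flag


mac_contents, mac_flag = export_restriction("simple", "darwin")
linux_contents, linux_flag = export_restriction("simple", "linux")
```
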

Fixes #370 / #385.
…position

On macOS, Python's default dlopen flags are RTLD_NOW|RTLD_GLOBAL
(Py_RTLD_DEFAULT in configure.ac for Darwin).  Every .so extension
imported via the normal Python import machinery is therefore loaded into
the process-wide flat namespace.  When two gopy extensions are loaded in
the same process, the second extension's Go runtime symbols (TLS keys,
mheap_, cgo init pointers) get interposed by the first extension's
definitions, causing the two independent Go runtimes to share GC state
and triggering 'fatal error: bad sweepgen in refill' (issue #385).

The generated Python wrapper now temporarily clears RTLD_GLOBAL before
importing the underlying _<pkg>.so, so each extension's Go runtime keeps
its own isolated copy of these globals.  The original flags are restored
immediately after import so the rest of the program is unaffected.
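
A minimal sketch of the save/clear/restore pattern, using the standard `sys.getdlopenflags` and `sys.setdlopenflags` calls. The context-manager name `without_rtld_global` is hypothetical, and the actual import of the underlying module is only indicated by a comment; the generated wrapper in this PR may be structured differently.

```python
import contextlib
import os
import sys


@contextlib.contextmanager
def without_rtld_global():
    """Temporarily clear RTLD_GLOBAL from the interpreter's dlopen flags."""
    if not hasattr(sys, "getdlopenflags"):
        # Platforms without dlopen flags (e.g. Windows): nothing to adjust.
        yield
        return
    saved = sys.getdlopenflags()
    sys.setdlopenflags(saved & ~os.RTLD_GLOBAL)
    try:
        yield
    finally:
        # Restore the original flags even if the import raises.
        sys.setdlopenflags(saved)


has_flags = hasattr(sys, "getdlopenflags")
before = sys.getdlopenflags() if has_flags else None
with without_rtld_global():
    # The generated wrapper would import the underlying _<pkg> module here.
    inside = sys.getdlopenflags() if has_flags else None
after = sys.getdlopenflags() if has_flags else None
```

Restoring in a `finally` block keeps the rest of the program unaffected even when the extension fails to load.
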

b-long commented May 3, 2026

Closing this PR in favor of #391.

b-long closed this May 3, 2026
b-long deleted the bugfix/issue-385 branch May 3, 2026 15:00
Development

Successfully merging this pull request may close these issues.

TestBindSimple and TestBindCgoPackage should not be skipped