fix: [ps2ip_rpc] explicit fixed-layout select_pkt across EE/IOP RPC #836

Merged
uyjulian merged 1 commit into ps2dev:master from fjtrujy:fix/ps2ip-rpc-select-pkt-abi
May 7, 2026
@fjtrujy fjtrujy commented May 7, 2026

Summary

select_pkt (the SIF-RPC payload that ee/rpc/tcpips sends to iop/tcpip/tcpips for select()) used struct fd_set and struct timeval directly in the wire layout. Both types have compiler-defined sizes that diverge between the EE and IOP compile environments, so the EE and IOP saw different field offsets in the same struct — a silent ABI mismatch.

The mismatch

| Field | EE (newlib) | IOP (lwIP) |
| --- | --- | --- |
| struct fd_set | ~8 bytes (FD_SETSIZE=64, __fd_mask=ulong=8) | 2 bytes (packed fd_bits[(MEMP_NUM_NETCONN+7)/8]) |
| struct timeval | 16 bytes (tv_sec/tv_usec 64-bit longs) | 8 bytes (32-bit longs) |

The two mismatches together meant every field of select_pkt after timeout lived at a different offset on each side. The EE wrote pkt->writeset_p / exceptset_p / writeset / exceptset at offsets the IOP never read; the IOP read garbage from the high half of the EE's tv_usec for what it thought were the writeset_p / exceptset_p pointers, ended up calling lwip_select with NULL fd-sets, and silently returned without modifying any bits.
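The offset drift can be reproduced on any host by modelling both sides with fixed-width integers. The sketch below is illustrative only: the field order of the packet, the `ee_*` / `iop_*` type names, and the helper `writeset_p_skew` are assumptions made for the example, not the SDK's actual definitions. Pointers are modelled as `uint32_t` on the assumption that both CPUs use 32-bit addresses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for each toolchain's view of the two types,
 * sized to match the table above. */
typedef struct { int64_t tv_sec; int64_t tv_usec; } ee_timeval;  /* 16 bytes */
typedef struct { int32_t tv_sec; int32_t tv_usec; } iop_timeval; /*  8 bytes */
typedef struct { uint64_t fds_bits[1]; } ee_fd_set;              /*  8 bytes */
typedef struct { uint8_t fd_bits[2]; } iop_fd_set;               /*  2 bytes */

/* One source-level field list, instantiated twice. The field order is an
 * assumption for illustration. */
#define SELECT_PKT_FIELDS(TIMEVAL, FDSET) \
    int32_t  maxfdp1;     \
    uint32_t timeout_p;   \
    TIMEVAL  timeout;     \
    uint32_t readset_p;   \
    uint32_t writeset_p;  \
    uint32_t exceptset_p; \
    FDSET    readset;     \
    FDSET    writeset;    \
    FDSET    exceptset;

typedef struct { SELECT_PKT_FIELDS(ee_timeval, ee_fd_set) } ee_select_pkt;
typedef struct { SELECT_PKT_FIELDS(iop_timeval, iop_fd_set) } iop_select_pkt;

/* Byte distance between where the modelled EE writes writeset_p and where
 * the modelled IOP reads it. Non-zero means the two sides disagree. */
static long writeset_p_skew(void)
{
    return (long)offsetof(ee_select_pkt, writeset_p)
         - (long)offsetof(iop_select_pkt, writeset_p);
}
```

In this model the two `timeout` fields start at the same offset, so the skew equals the 8-byte difference between the two `timeval` sizes, and every later field inherits it.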

This was harmless for ps2link (UDP only, never calls select), but broke mongoose-based servers on the IOP-side path (ps2_http): the listening socket was flagged as having an exception condition on the very first poll cycle (spurious eset bits), mongoose called mg_error("socket error"), and the listener was killed before the first client connection.

The fix

Lock the wire layout to a fixed shape that's identical on both compile units:

  • Define ps2ip_rpc_fd_set as a fixed-size byte array unsigned char fd_bits[(MEMP_NUM_NETCONN + 7) / 8]. Size is driven by MEMP_NUM_NETCONN and matches lwIP's internal fd_set semantics.
  • Replace struct timeval timeout in select_pkt with explicit s32 timeout_sec; s32 timeout_usec; fields. Each side reconstructs/decomposes its own native timeval at the boundary.

Net result: every field of select_pkt is at the same byte offset on EE and IOP, regardless of toolchain header definitions.
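A minimal sketch of what such a locked layout could look like, with compile-time checks that fail the build if either toolchain lays it out differently. Only `ps2ip_rpc_fd_set`, `fd_bits`, `timeout_sec`, and `timeout_usec` come from the description above; the struct name `select_pkt_sketch`, the remaining field names and their order, the `u32` representation of pointers, and the fallback value of `MEMP_NUM_NETCONN` are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the SDK's s32/u32 typedefs. */
typedef int32_t  s32;
typedef uint32_t u32;

#ifndef MEMP_NUM_NETCONN
#define MEMP_NUM_NETCONN 9 /* assumed value; normally set by the lwIP options */
#endif

/* One bit per connection, packed into bytes. */
typedef struct {
    unsigned char fd_bits[(MEMP_NUM_NETCONN + 7) / 8];
} ps2ip_rpc_fd_set;

/* Hypothetical wire packet built only from fixed-width members. */
typedef struct {
    s32 maxfdp1;
    u32 timeout_p;     /* non-zero when the caller passed a timeout */
    s32 timeout_sec;
    s32 timeout_usec;
    u32 readset_p;     /* non-zero when the caller passed this set */
    u32 writeset_p;
    u32 exceptset_p;
    ps2ip_rpc_fd_set readset;
    ps2ip_rpc_fd_set writeset;
    ps2ip_rpc_fd_set exceptset;
} select_pkt_sketch;

/* Pin the layout: both compile units must agree or the build breaks. */
_Static_assert(sizeof(ps2ip_rpc_fd_set) == 2, "fd_set wire size changed");
_Static_assert(offsetof(select_pkt_sketch, timeout_sec) == 8, "timeout_sec moved");
_Static_assert(offsetof(select_pkt_sketch, readset_p) == 16, "readset_p moved");
_Static_assert(offsetof(select_pkt_sketch, readset) == 28, "readset moved");
_Static_assert(sizeof(select_pkt_sketch) == 36, "packet size changed");
```

Placing assertions like these in a header that both sides include would turn a future layout drift into a compile error instead of a silent runtime mismatch.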

Files

  • common/include/ps2ip_rpc.h — fixed-layout struct
  • ee/rpc/tcpips/src/ps2ipc.c — EE side: convert local fd_set/timeval to wire shape
  • iop/tcpip/tcpips/src/ps2ips.c — IOP side: convert wire shape to local fd_set/timeval
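The boundary conversions might look roughly like the following. This is a host-compilable sketch, not the code from the files listed above: `wire_fd_set` stands in for `ps2ip_rpc_fd_set`, `WIRE_MAX_FDS` stands in for `MEMP_NUM_NETCONN` (assumed to be 9), and all four helper names are hypothetical. The bit order (descriptor `fd` at bit `fd % 8` of byte `fd / 8`) is assumed to mirror lwIP's own `FD_SET` macros with a socket offset of zero.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/select.h>
#include <sys/time.h>

#define WIRE_MAX_FDS 9 /* assumed MEMP_NUM_NETCONN */

typedef struct {
    unsigned char fd_bits[(WIRE_MAX_FDS + 7) / 8];
} wire_fd_set; /* stand-in for ps2ip_rpc_fd_set */

/* Native fd_set -> wire bytes, one descriptor at a time. */
static void fdset_to_wire(fd_set *src, int nfds, wire_fd_set *dst)
{
    memset(dst, 0, sizeof(*dst));
    if (src == NULL)
        return;
    for (int fd = 0; fd < nfds && fd < WIRE_MAX_FDS; fd++) {
        if (FD_ISSET(fd, src))
            dst->fd_bits[fd / 8] |= (unsigned char)(1u << (fd % 8));
    }
}

/* Wire bytes -> native fd_set, used for the reply path. */
static void fdset_from_wire(const wire_fd_set *src, int nfds, fd_set *dst)
{
    FD_ZERO(dst);
    for (int fd = 0; fd < nfds && fd < WIRE_MAX_FDS; fd++) {
        if (src->fd_bits[fd / 8] & (1u << (fd % 8)))
            FD_SET(fd, dst);
    }
}

/* Native timeval -> two 32-bit wire fields (values are narrowed). */
static void timeval_to_wire(const struct timeval *tv, int32_t *sec, int32_t *usec)
{
    *sec = (int32_t)tv->tv_sec;
    *usec = (int32_t)tv->tv_usec;
}

/* Two 32-bit wire fields -> whatever timeval the local toolchain defines. */
static struct timeval timeval_from_wire(int32_t sec, int32_t usec)
{
    struct timeval tv;
    tv.tv_sec = sec;
    tv.tv_usec = usec;
    return tv;
}
```

Because each helper touches the native type only through `FD_ISSET` / `FD_SET` and the `tv_sec` / `tv_usec` member names, it does not depend on how large the local `fd_set` or `timeval` happens to be.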

Test plan

  • select_test_iop (in ps2_drivers) — the listening-socket select probe reports rc=0 r=0 w=0 e=0 on every tick (no spurious eset bits). Pre-fix: e=1 on the very first probe.
  • mongoose ps2_http on the IOP-side path — sustains repeated HTTP bursts on real hardware without the listener being killed by mg_error("socket error").
  • No regression in ps2link / ps2client (UDP) which don't exercise this path.
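For readers without the ps2_drivers reproducers at hand, the shape of the listening-socket probe can be approximated on a POSIX host. The sketch below is a hypothetical analogue, not the actual `select_test_iop` source: `probe_result` and `probe_idle_listener` are invented names, and it uses the host's BSD sockets rather than the PS2 network stack. It registers an idle listening socket in all three sets and polls once with a zero timeout.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

typedef struct { int rc, r, w, e; } probe_result;

/* Poll an idle listening socket once. rc stays -1 if setup fails. */
static probe_result probe_idle_listener(void)
{
    probe_result res = { -1, 0, 0, 0 };
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return res;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0; /* let the stack pick a free port */

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0 &&
        listen(fd, 1) == 0) {
        fd_set rset, wset, eset;
        FD_ZERO(&rset);
        FD_ZERO(&wset);
        FD_ZERO(&eset);
        FD_SET(fd, &rset);
        FD_SET(fd, &wset);
        FD_SET(fd, &eset);

        struct timeval tv = { 0, 0 }; /* zero timeout: return immediately */
        res.rc = select(fd + 1, &rset, &wset, &eset, &tv);
        if (res.rc >= 0) {
            /* The sets are only meaningful when select() succeeded. */
            res.r = FD_ISSET(fd, &rset) != 0;
            res.w = FD_ISSET(fd, &wset) != 0;
            res.e = FD_ISSET(fd, &eset) != 0;
        }
    }
    close(fd);
    return res;
}
```

With no client connecting, a healthy stack is expected to leave the exception bit clear; a set `e` bit here is the symptom the test plan describes for the pre-fix build.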

🤖 Generated with Claude Code

select_pkt's struct fd_set and struct timeval fields had compiler-defined
sizes that diverge between the EE and IOP compile environments:

  - struct fd_set is ~8 bytes on EE (newlib, FD_SETSIZE=64, __fd_mask is
    unsigned long = 8B) but 2 bytes on IOP (lwIP, MEMP_NUM_NETCONN=9,
    bits packed in fd_bits[(N+7)/8]).
  - struct timeval is 16 bytes on EE (tv_sec/tv_usec are 64-bit longs in
    newlib) but 8 bytes on IOP (32-bit longs).

The two mismatches together meant the select_pkt offsets for everything
after the timeout diverged between EE and IOP. EE wrote pkt->writeset_p
/ exceptset_p / writeset / exceptset at offsets the IOP never read; the
IOP read garbage from the high half of EE's tv_usec for what it thought
were the writeset_p / exceptset_p pointers, ended up calling lwip_select
with NULL fdsets, and silently returned without modifying any bits.

This was harmless for ps2link (UDP only, never calls select), but broke
mongoose-based ps2_http on the IOP-side path: the listening socket was
flagged as having an exception condition on the very first poll cycle
(spurious eset bits), mongoose called mg_error("socket error"), and the
listener was killed before the first client connection.

Fix:
  - Define ps2ip_rpc_fd_set as a fixed-size byte array
    (unsigned char fd_bits[(MEMP_NUM_NETCONN + 7) / 8]). Its size is
    MEMP_NUM_NETCONN-driven and identical on both compile units.
  - Replace `struct timeval timeout` in select_pkt with explicit
    `s32 timeout_sec; s32 timeout_usec;` and let each side reconstruct
    its native struct timeval at the boundary.
  - On the EE side, ps2ipc_select marshals between the caller's newlib
    `struct fd_set *` and the wire format bit-by-bit using FD_ISSET /
    FD_SET, so callers don't need to know about ps2ip_rpc_fd_set.
  - On the IOP side, do_select casts (struct fd_set *)&pkt->readset
    (safe — same byte layout as IOP's lwIP fd_set) and reconstructs a
    local struct timeval from the wire fields.

Validated with isolated select() reproducers in ps2_drivers
(select_test_ee, select_test_iop): both now report
`tick=N rc=0 r=0 w=0 e=0` for an idle listening fd. ps2_http on real
hardware (IOP-side path, configure_iopip_network) passes 100/100
sustained /ping requests at ~20 req/s plus directory listing.

Lesson: never put compiler-defined ABI types (struct fd_set,
struct timeval, time_t, long, sometimes pointers) directly in a
cross-RPC packet. Use explicit s32 / u32 / u8[N] and reconstruct
native types at the boundary.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@fjtrujy fjtrujy marked this pull request as ready for review May 7, 2026 22:40
@fjtrujy fjtrujy changed the title fix: [ps2ip_rpc] explicit fixed-layout select_pkt across EE/IOP RPC fix: [ps2ip_rpc] explicit fixed-layout select_pkt across EE/IOP RPC May 7, 2026
@uyjulian uyjulian left a comment

lgtm

@uyjulian uyjulian merged commit 5a63561 into ps2dev:master May 7, 2026
3 checks passed