Skip to content

feat(dkg): implement sync#311

Merged
iamquang95 merged 62 commits intomainfrom
iamquang95/dkg/sync
May 1, 2026
Merged

feat(dkg): implement sync#311
iamquang95 merged 62 commits intomainfrom
iamquang95/dkg/sync

Conversation

@iamquang95
Copy link
Copy Markdown
Collaborator

Should address #133

@iamquang95
Copy link
Copy Markdown
Collaborator Author

@claude

@claude
Copy link
Copy Markdown

claude Bot commented Apr 3, 2026

Claude finished @iamquang95's task in 6m 16s —— View job


Code Review

  • Gather context and understand PR scope
  • Read changed files
  • Compare against Go reference implementation
  • Review sync module files
  • Review p2p proto changes
  • Compile final review

Summary

This PR implements the DKG step-synchronization protocol (/charon/dkg/sync/1.0.0/) in Rust, porting the Go Charon dkg/sync package. The overall structure is sound — Server/Client/Handler/Behaviour decomposition mirrors the Go architecture, and the proto wire format matches. There are a few correctness issues worth addressing before merge, plus some lower-priority observations.


Findings

[High] Step is updated for peers that fail validation

In handle_inbound_stream, after validate_request_with_public_key fails the code sets the server error and response error string, but then falls through to server.update_step(peer_id, message.step) and continues the loop to read the next message.

Evidence: crates/dkg/src/sync/handler.rs:365-416

if let Err(error) = protocol::validate_request_with_public_key(...) {
    // sets server error, response.error
} else {
    server.mark_connected(peer_id).await;
}

// ← this runs even after validation failure
if server.update_step(peer_id, message.step).await? {
    send_inbound_event(...);
}
// loop continues - next read will also come from the invalid peer

A peer that sends an invalid hash signature (wrong cluster) can still advance the step tracker and the loop keeps reading from that peer. The Go reference returns from the handler after an invalid message. The function should break (or return Err) immediately after setting the error and writing the error response. Fix this →


[Medium] Protocol ID has trailing slash — verify against Go

protocol.rs:11:

pub const PROTOCOL_NAME: libp2p::StreamProtocol = StreamProtocol::new("/charon/dkg/sync/1.0.0/");

The trailing / needs to exactly match what the Charon Go server advertises. libp2p protocol negotiation is byte-exact. If the Go side uses /charon/dkg/sync/1.0.0 (no trailing slash), all protocol negotiations will fail silently — the handler is never invoked, connections appear established but no sync traffic flows. Please confirm against the Go source. Fix this →


[Medium] await_all_at_step does not check started — can deadlock before Server::start()

await_all_connected guards against this with an is_started() check (server.rs:74), but await_all_at_step (server.rs:120) has no such guard. If called before start(), it spins on the notify but the started-notify never fires, so cancellation is the only exit. This likely doesn't matter in practice (DKG won't await steps before starting), but it's an asymmetry worth documenting or guarding. Fix this →


[Low] Redundant explicit arms before wildcard in on_connection_event

handler.rs:279-285:

ConnectionEvent::AddressChange(_)
| ConnectionEvent::LocalProtocolsChange(_)
| ConnectionEvent::RemoteProtocolsChange(_) => {}
ConnectionEvent::ListenUpgradeError(_) => {}
_ => {}    // ← makes the lines above unreachable

The named arms are unreachable because _ => {} catches everything. Since ConnectionEvent is #[non_exhaustive] you need the wildcard — the named arms can be removed. Fix this →


[Low] inbound_events_tx error logged at error! level, but handler owns both ends

handler.rs:426-428: If inbound_events_tx.send() fails it logs error!. Since the handler holds both the sender and receiver for the channel's lifetime, this can only happen after the handler is dropped — at which point the error is silently swallowed anyway. A debug! level or silent ignore would be less alarming. Fix this →


[Low] run_outbound_stream does not flush on shutdown

handler.rs:336-338: When shutdown=true the final MsgSync is sent and the function returns GracefulShutdown immediately after reading the response. The stream is then dropped. On the write path write_fixed_size_protobuf does flush, so the actual bytes make it to the peer — this is fine. Just confirming this is intentional and the inbound side reads the shutdown flag and terminates gracefully.


Parity Matrix

Component Go (charon/dkg/sync) Rust (pluto/crates/dkg/src/sync) Match Notes
Protocol ID /charon/dkg/sync/1.0.0 /charon/dkg/sync/1.0.0/ Unverified Trailing slash needs confirmation
Wire framing i64 LE length prefix i64 LE length prefix Yes write_fixed_size_protobuf
Proto schema MsgSync, MsgSyncResponse Generated from same .proto Yes Fields match
Step validation (initial) 0 or 1 0 or 1 Yes server.rs:227
Step validation (jump) current+1 or current+2 current+1 or current+2 Yes server.rs:218
Hash sig verification secp256k1 / libp2p sign secp256k1 / libp2p sign Yes protocol.rs:44
Version check major.minor exact full semver == Yes SemVer is major.minor only
Shutdown flow send shutdown flag, wait ACK send shutdown=true, read response Yes handler.rs:296-338
Error propagation (invalid msg) break loop continues loop No High severity, see above

Tests

Tests were not run (no cargo test permission). Unit tests cover:

  • update_step_rules_match_go — validates step acceptance rules
  • sync_round_trip_matches_go_shape — 3-node integration test across real TCP
  • active_client_requests_dial — behaviour poll emits Dial
  • connection_closed_keeps_client_state_until_last_connection — multi-connection tracking

Missing coverage:

  • Validation failure path (hash sig mismatch) — the high-severity bug above is untested
  • relay_reset reconnect path via on_dial_upgrade_error
  • Protocol negotiation failure (Unsupported variant)

Open Questions

  1. Trailing slash in protocol ID — Can you confirm against charon/dkg/sync/? The Go source either has or doesn't have a trailing slash and it must match byte-for-byte.
  2. Wire format — Was the i64 LE framing verified against the Charon Go dkg/sync package? (Other Charon p2p protocols use the standard varint framing; DKG sync is an exception worth calling out explicitly.)
  3. step=0 initial value — The client initializes step: AtomicI64::new(0) (client.rs:79) and begins sending step=0 immediately. The Go sync sends step=0 until the caller advances it. Is step 0 meant to mean "not yet started"? The test in mod.rs starts at step 0 but the example starts advancing from step 1 — this asymmetry should be documented.

@iamquang95
Copy link
Copy Markdown
Collaborator Author

[High] Step is updated for peers that fail validation

Same as charon implementation

[Medium] Protocol ID has trailing slash — verify against Go

Matched with golang: const protocolID = "/charon/dkg/sync/1.0.0/"

[Medium] await_all_at_step does not check started — can deadlock before Server::start()

Matched with charon, also it's not a bug.

@iamquang95
Copy link
Copy Markdown
Collaborator Author

@claude now you will have the charon source code, review it again

@claude
Copy link
Copy Markdown

claude Bot commented Apr 6, 2026

Claude finished @iamquang95's task in 3m 23s —— View job


Code Review (with Charon source)

  • Read Charon Go source for dkg/sync
  • Read PR changed files
  • Compare implementations
  • Post updated review

Summary

With access to charon/dkg/sync/, I can confirm several findings from the previous review were incorrect. The three previously flagged high/medium issues all match the Go reference implementation. Two low-severity cosmetic issues remain. One new finding identified.


Corrections to Previous Review

[Retracted — High] Step updated after validation failure

Confirmed correct. Go's handleStream (server.go:283) calls s.updateStep(pID, ...) unconditionally after validReq fails:

if err := s.validReq(pubkey, msg); err != nil {
    s.setErr(...)
    resp.Error = err.Error()
} else if !s.isConnected(pID) {
    s.setConnected(pID)
}

if err := s.updateStep(pID, int(msg.GetStep())); err != nil {  // ← unconditional
    return err
}

Rust handler.rs:403 behaves identically. This was a correct port.


[Retracted — Medium] Protocol ID trailing slash

Confirmed correct. Go uses const protocolID = "/charon/dkg/sync/1.0.0/" (server.go:30) — trailing slash is intentional and matches Rust protocol.rs:11.


[Retracted — Medium] await_all_at_step started guard

Confirmed correct. Go's AwaitAllAtStep (server.go:123) has no started guard either. The Rust implementation matches.


Remaining Valid Findings

[Low] Redundant match arms before wildcard in on_connection_event

handler.rs:279-283:

ConnectionEvent::AddressChange(_)
| ConnectionEvent::LocalProtocolsChange(_)
| ConnectionEvent::RemoteProtocolsChange(_) => {}
ConnectionEvent::ListenUpgradeError(_) => {}
_ => {}    // makes the four named arms above dead code

The named arms are unreachable. Remove them. Fix this →


[Low] inbound_events_tx failure logged at error!

handler.rs:435: The handler owns both ends of the channel; send failure can only happen after the handler is dropped, at which point the error is unactionable. Use debug! or a silent ignore. Fix this →


New Finding

[Low] Relay error detection may miss ErrResourceScopeClosed

Go's isRelayError (client.go:245) checks two error types:

func isRelayError(err error) bool {
    return errors.Is(err, network.ErrReset) ||
        errors.Is(err, network.ErrResourceScopeClosed)   // ← second case
}

Rust (handler.rs:326-328) only checks ConnectionReset:

relay_reset: error.kind() == std::io::ErrorKind::ConnectionReset,

ErrResourceScopeClosed fires when a relay circuit's resource scope is exhausted — a normal occurrence during relay recycling. The equivalent in Rust's libp2p may surface as a different ErrorKind (e.g. BrokenPipe, Other). If missed, the client will not reconnect when it should (instead treating it as a fatal error).

Recommend mapping BrokenPipe and/or wrapping the condition to match what Rust libp2p actually produces for resource scope closure. Fix this →


Informational Note

Rust read_exact vs Go's reader.Read

Go's readSizedProto (server.go:368) uses reader.Read(buf) which is not guaranteed to fill the buffer on a partial TCP read — this is technically a latent Go bug. Rust's read_fixed_size_protobuf (p2p/src/proto.rs:147) calls read_exact which correctly handles partial reads. Rust is more correct here; no change needed.


Parity Matrix (updated)

Component Go Rust Match Notes
Protocol ID /charon/dkg/sync/1.0.0/ /charon/dkg/sync/1.0.0/ Trailing slash is intentional
Wire framing i64 LE prefix i64 LE prefix Rust uses read_exact (more correct)
Step update on invalid msg Unconditional Unconditional Both update step regardless of validation
await_all_at_step started guard None None Matches Go by design
Hash sig verification secp256k1 libp2p secp256k1 libp2p
Relay reconnect (ErrReset)
Relay reconnect (ErrResourceScopeClosed) Unverified Low severity
Shutdown flow Send flag, wait ACK Send flag, wait ACK
clearConnected on exit defer After result Semantically equivalent

@iamquang95
Copy link
Copy Markdown
Collaborator Author

[Low] Relay error detection may miss ErrResourceScopeClosed

Don't have the equivalent error in rust.

Copy link
Copy Markdown
Collaborator

@varex83agent varex83agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: feat(dkg): implement sync

This PR ports the Charon DKG sync protocol to Rust. The architecture is solid — the Arc<Inner> / RwLock<ServerState> / Notify design is correct, the wire format matches Go's i64 LE framing, and the step validation rules are faithful. Test coverage is good with both unit tests mirroring Go's test shapes and an integration round-trip test.

However there are four bugs that need to be fixed before merge:

  1. stop_rx.changed() spurious cancelactivate() sends false on stop_tx, which fires changed() in the running outbound stream and kills it prematurely (handler.rs:299,319 + client.rs:178)
  2. Inbound future replaced without cleanup — dropping the old BoxFuture skips server.clear_connected, leaving the peer permanently in the connected set (handler.rs:257)
  3. No backoff on Reconnectable exit — transitioning to Idle directly can spin-loop under relay churn; the on_dial_upgrade_error path correctly calls schedule_retry() but the stream-exit path does not (handler.rs:226)
  4. NegotiationFailed retries forever — when a peer doesn't support the protocol, retrying with backoff is pointless; the handler should call client.finish(Err(Unsupported)) (handler.rs:125)

Additionally, await_all_shutdown is missing the error-state check that await_all_connected and await_all_at_step both have, and the MAX_MESSAGE_SIZE constant (128 MiB) diverges from Go's 32 MiB cap.

Inline comments are filed for each issue individually.

Comment thread crates/dkg/src/sync/handler.rs Outdated
Comment thread crates/dkg/src/sync/handler.rs
Comment thread crates/dkg/src/sync/handler.rs Outdated
Comment thread crates/dkg/src/sync/handler.rs Outdated
Comment thread crates/dkg/src/sync/server.rs
Comment thread crates/dkg/src/sync/handler.rs Outdated
Comment thread crates/dkg/src/sync/server.rs
Comment thread crates/dkg/src/sync/server.rs
Comment thread crates/dkg/src/sync/mod.rs
Comment thread crates/dkg/examples/sync.rs Outdated
Copy link
Copy Markdown
Collaborator

@varex83agent varex83agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR implements the DKG sync protocol (port of charon/dkg/sync) using a libp2p NetworkBehaviour + per-connection ConnectionHandler, with Server/Client Arc handles for external callers. The overall architecture is solid and the step-validation, version-checking, and signature-verification logic is functionally equivalent to the Go reference. There is one resource-lifecycle bug and two major concurrency/security issues that must be addressed before merge.

Bugs (must fix):

  1. clear_connected() is never called when the inbound BoxFuture is replaced or the connection is dropped mid-stream — the peer stays in server.connected permanently, breaking await_all_connected after reconnect cycles.

Major findings:
2. read_fixed_size_protobuf uses the global 128 MB limit for MsgSync/MsgSyncResponse messages that are at most ~200 bytes — any authenticated peer can force a 128 MB heap allocation.
3. When WaitingRetry fires but try_claim_outbound() fails (a second connection holds the claim), the handler falls back to Idle with no waker registered — it can be permanently stuck until an unrelated connection event arrives.

Minor findings: relay-error detection misses ErrResourceScopeClosed; DialFailure only retries on Transport errors (too narrow); RTT is not recorded to the libp2p peerstore; no integration tests for the error-propagation paths (version mismatch → set_err); inbound stream continues after set_err refreshing notify_waiters at 100 ms; client.shutdown() in the integration test has no timeout.

Verdict: REQUEST_CHANGES — the clear_connected lifecycle bug can cause silent test hangs and incorrect barrier semantics; the 128 MB allocation vector is a DoS risk in a cooperative but potentially adversarial network; and the stuck-Idle handler is a correctness issue in multi-connection scenarios.

Comment thread crates/dkg/src/sync/handler.rs
Comment thread crates/dkg/src/sync/handler.rs
Comment thread crates/dkg/src/sync/handler.rs Outdated
Comment thread crates/dkg/src/sync/handler.rs
Comment thread crates/dkg/src/sync/behaviour.rs
Comment thread crates/dkg/src/sync/handler.rs
Comment thread crates/dkg/src/sync/mod.rs
Comment thread crates/dkg/src/sync/mod.rs Outdated
Comment thread crates/dkg/src/sync/mod.rs Outdated
Comment thread crates/dkg/src/sync/client.rs
@iamquang95 iamquang95 requested a review from emlautarom1 April 28, 2026 10:00
Copy link
Copy Markdown
Collaborator

@varex83 varex83 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@iamquang95 iamquang95 dismissed emlautarom1’s stale review May 1, 2026 06:53

Merge to unblock dkg, will fix any issues later

@iamquang95 iamquang95 merged commit 8546ddd into main May 1, 2026
9 checks passed
@iamquang95 iamquang95 deleted the iamquang95/dkg/sync branch May 1, 2026 06:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement dkg/sync

5 participants