Skip to content

CigarModifier: emit a valid CIGAR when matching-base scan outruns the trailing M#402

Open
behrj wants to merge 1 commit intoAstraZeneca-NGS:masterfrom
hedera-dx:handle_negative_match_operation_in_modified_cigar
Open

CigarModifier: emit a valid CIGAR when matching-base scan outruns the trailing M#402
behrj wants to merge 1 commit intoAstraZeneca-NGS:masterfrom
hedera-dx:handle_negative_match_operation_in_modified_cigar

Conversation

@behrj
Copy link
Copy Markdown

@behrj behrj commented Apr 28, 2026

Summary

The merge functions threeDeletions(), twoDeletionsInsertionToComplex(), and
threeIndels() in CigarModifier count matching bases (rn) immediately after
the indel cluster. When a read aligns inside a homopolymer that extends past the
cluster, rn keeps growing past the interior M bases (mid) and starts eating
into the trailing M (rm). The tslen <= 0 branch then does
rm += tslen, which can drive rm negative — and the function emits a CIGAR
fragment with a non-positive operator length.

For the production read

duplex_GTGTC-GTCAC_140434267-140434623_817012 @ 7:140434471 (hg19)
input CIGAR: 104M1I2M2D1M1D5M1D2M9D27M

threeDeletions produces the intermediate RDOFF M dlen D rm M = 11M 4D -1M,
a subsequent combineToCloseToOne collapses 1I 11M 4D into 15D 12I, and
the final cigar is 104M15D12I-1M9D27M. That string is then fed to
captureMisSoftly3Mismatches, which dereferences querySequence.charAt(143)
on a 142-nt read and throws

StringIndexOutOfBoundsException: Index 143 out of bounds for length 142
at VariationUtils.isHasAndNotEquals(...:498)
at CigarModifier.captureMisSoftly3Mismatches(...:376)

After 12 such failures VarDictLauncher aborts the whole region with
"Critical exception occurs on region 7:140434293-140434869, program will be stopped."

Fix

  1. In all three merge functions, when (tslen <= 0, dlen > 0, rm < 0), absorb
    the negative rm back into RDOFF and emit RDOFF M + dlen D (mirroring
    the existing rm < 0 branch under dlen < 0 in threeIndels). This keeps
    the merged CIGAR valid while preserving the total reference and read
    consumption of the original complex.
  2. Add explicit IllegalStateException guards at every new
    RDOFF = RDOFF + rm site, so any future regression that violates
    RDOFF > 0 fails loudly via the existing printExceptionAndContinue
    path instead of silently emitting a malformed CIGAR.
  3. Defensive bounds-check on index2 in
    VariationUtils.isHasAndEquals/isHasAndNotEquals. The previous
    commented-out if (index2 < 0) ... line suggests this was already
    considered; we now return false for any out-of-range index instead of
    throwing.

For the read above, the new output is 104M14D11I9D27M (reference span 154,
read span 142 — both match the input).

Tests

CigarModifierTest.modifyCigarOnHomopolymerThreeDeletionsDoesNotProduceMalformedCigar
uses the production read verbatim and asserts that
modifyCigar() returns a CIGAR htsjdk.samtools.TextCigarCodec.decode accepts
and that all operator lengths are positive. Without the fix it fails with the
exact production exception; with the fix it passes alongside the existing 21
CigarModifierTest cases.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a defensive change that prevents similar issues to have an operational impact, but it has the risk of hiding issues in future. Happy to drop this from the PR. Let me know what you think.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants