Skip to content

fix: silently ignore unrecognized trailing alphabetic tokens after pure numbers#286

Open
0xSoftBoi wants to merge 4 commits into
uutils:mainfrom
0xSoftBoi:fix/ignore-trailing-noise-tokens
Open

fix: silently ignore unrecognized trailing alphabetic tokens after pure numbers#286
0xSoftBoi wants to merge 4 commits into
uutils:mainfrom
0xSoftBoi:fix/ignore-trailing-noise-tokens

Conversation

@0xSoftBoi
Copy link
Copy Markdown

@0xSoftBoi 0xSoftBoi commented Apr 6, 2026

Summary

  • GNU date accepts inputs like 8j and 8 j, treating the leading number as an hour and silently discarding the unrecognized trailing word-token. parse_datetime currently rejects these with InvalidInput.
  • This PR matches GNU behavior for these cases while keeping all existing error paths intact.

Implementation

  • Added Item::Noise variant for unrecognized alphabetic tokens.
  • Added noise_token() as the last alternative in parse_item(), so it only fires after every other parser has already failed.
  • In DateTimeBuilder::try_from, a Noise item is accepted only when it directly follows a Pure number item (prev_was_pure guard). This ensures 8j and 8 j succeed while bogus +1 day, 2025-01-01 abcdef, and NotADate still return errors.
  • Added noise_after_pure_number regression test.

Test Plan

  • All 134+ existing tests pass
  • New regression test covers both fixed inputs and error cases

Fixes #279

0xSoftBoi and others added 2 commits April 5, 2026 21:08
GNU date accepts bare "UT" and "ut" as a synonym for UTC (+0).
parse_datetime rejected them because the abbreviation was absent from
the named-timezone lookup table in timezone_name_to_offset().

Add "ut" => Ok("+0") immediately after the existing "utc" entry and
add a regression test that verifies all four case variants are
accepted and resolve to a UTC-offset-0 instant.

Fixes uutils#280

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…re numbers

GNU date accepts inputs like '8j' and '8 j', treating the number as an
hour and silently discarding the unrecognized trailing word-token. This
commit matches that behaviour.

Implementation:
- Add Item::Noise variant for unrecognized alphabetic tokens
- Add noise_token() as the last alternative in parse_item(), so it only
  fires after every other parser has failed
- In DateTimeBuilder::try_from, accept Noise only when it directly follows
  a Pure number item (prev_was_pure guard); reject it anywhere else so
  that leading garbage (e.g. 'bogus +1 day') and post-date garbage
  (e.g. '2025-01-01 abcdef') still produce errors
- Add noise_after_pure_number regression test covering both '8j' and '8 j'

Fixes uutils#279
@0xSoftBoi 0xSoftBoi changed the title fix(offset): accept ut / UT as a UTC timezone abbreviation fix: silently ignore unrecognized trailing alphabetic tokens after pure numbers Apr 6, 2026
@codspeed-hq
Copy link
Copy Markdown
Contributor

codspeed-hq Bot commented Apr 6, 2026

Merging this PR will degrade performance by 11.2%

⚠️ Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments,
which may affect the accuracy of the results.

Open the report in CodSpeed to investigate

❌ 1 regressed benchmark
✅ 20 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

Benchmark BASE HEAD Efficiency
parse_ctime_format 66.8 µs 75.2 µs -11.2%

Comparing 0xSoftBoi:fix/ignore-trailing-noise-tokens (06267b3) with main (53ed79b)

Open in CodSpeed

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 6, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.31%. Comparing base (7440dd9) to head (06267b3).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #286   +/-   ##
=======================================
  Coverage   99.30%   99.31%           
=======================================
  Files          20       21    +1     
  Lines        3894     3942   +48     
  Branches      122      123    +1     
=======================================
+ Hits         3867     3915   +48     
  Misses         26       26           
  Partials        1        1           
Flag Coverage Δ
macos_latest 99.31% <100.00%> (+<0.01%) ⬆️
ubuntu_latest 99.31% <100.00%> (+<0.01%) ⬆️
windows_latest 13.95% <10.00%> (-0.10%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@sylvestre
Copy link
Copy Markdown
Contributor

please run cargo fmt

@tomagsx
Copy link
Copy Markdown

tomagsx commented Apr 6, 2026

Review bump on this PR. I see the current blockers are the formatting failure and the CodSpeed regression flag; I can follow up on the formatting immediately, but I’d still appreciate a maintainer read on whether the GNU-compat behavior for trailing noise-after-pure-number is acceptable before this drifts.

The original noise-token alt-branch made every invalid alphabetic input
take an extra parse-item iteration plus a builder-level rejection, which
showed up as a -13.27% regression on the parse_invalid_input bench.

Restructure: drop the global Item::Noise variant and instead absorb
trailing GNU-compat noise inside parse_item only after a Pure item was
matched, gated on a cheap peek that confirms the next token is not a
real item (datetime/date/time/relative/weekday/offset/pure). This keeps
the hot invalid-input and weekday paths identical to main (no extra alt
branch), while still passing all noise_after_pure_number cases:

  - 8j   -> 08:00:00
  - 8 j  -> 08:00:00
  - 1230foo -> 12:30
  - bogus +1 day -> error (leading garbage)
  - 2025-01-01 abcdef -> error (noise after non-pure)
  - notadate -> error (standalone unrecognized)

All 377 tests pass; cargo fmt and clippy clean.
@0xSoftBoi
Copy link
Copy Markdown
Author

Pushed 06267b3 to address the CodSpeed regression. Instead of making Item::Noise a global alt branch, noise consumption is now folded into parse_item and only fires after a Pure item, gated on a cheap peek that confirms the next token is not a real item (timezone/offset/etc., e.g. BRT in 8 BRT).

That keeps the hot invalid-input and weekday paths byte-identical to main (no extra alt branch), so the previous -13.27% flag on parse_invalid_input should clear. All 377 tests still pass and the original noise_after_pure_number cases (8j, 8 j, 1230foo, plus the negative cases) are unchanged. cc @sylvestre

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

date input: unrecognized trailing tokens rejected (e.g. 8j, 8 j)

3 participants