Skip to content

Defer building HtmlPage id/name lookup index until first read#1116

Merged
rbri merged 2 commits intoHtmlUnit:masterfrom
shapiroronny:perf/lazy-id-name-index
Apr 27, 2026
Merged

Defer building HtmlPage id/name lookup index until first read#1116
rbri merged 2 commits intoHtmlUnit:masterfrom
shapiroronny:perf/lazy-id-name-index

Conversation

@shapiroronny
Copy link
Copy Markdown
Contributor

Every element added during parsing eagerly invoked addMappedElement, which issued two getAttribute("id"/"name") lookups, two ConcurrentHashMap.put operations, and (when applicable) allocated a MappedElementIndexEntry plus its inner ArrayList. For pages parsed and walked via querySelector / iteration without any getElementById call, the entire index is built and never read.

Add a mappedElementsBuilt_ flag on HtmlPage; addMappedElement and removeMappedElement early-return until it flips. A new private ensureMappedElementsBuilt() walks the document once on first read of getElementById / getElementsById / getElementByName / getElementsByName / getElementsByIdAndOrName and populates both idMap_ and nameMap_; subsequent mutations through addMappedElement/removeMappedElement stay incremental as before. clone() resets the flag so cloned pages re-lazy-build.

JMH HtmlUnitBenchmark.JMH on a ~30 KB HTML page (5 forks x 3 warmup x 5 measurement = 25 samples, paired sequentially):

baseline: 633.205 +- 8.793 us/op
with this: 593.611 +- 13.110 us/op
delta: -39.6 us, -6.3%

Targeted tests across HtmlPageTest, HtmlPage2Test, HtmlPage3Test, HtmlInlineFrameTest, HtmlFrameTest, HtmlFrameSetTest, HTMLDocumentTest, Html*ParserTest: 1176 tests, 0 failures.

Every element added during parsing eagerly invoked addMappedElement,
which issued two getAttribute("id"/"name") lookups, two
ConcurrentHashMap.put operations, and (when applicable) allocated a
MappedElementIndexEntry plus its inner ArrayList. For pages parsed
and walked via querySelector / iteration without any getElementById
call, the entire index is built and never read.

Add a mappedElementsBuilt_ flag on HtmlPage; addMappedElement and
removeMappedElement early-return until it flips. A new private
ensureMappedElementsBuilt() walks the document once on first read of
getElementById / getElementsById / getElementByName / getElementsByName
/ getElementsByIdAndOrName and populates both idMap_ and nameMap_;
subsequent mutations through addMappedElement/removeMappedElement
stay incremental as before. clone() resets the flag so cloned pages
re-lazy-build.

JMH HtmlUnitBenchmark.JMH on a ~30 KB HTML page (5 forks x 3 warmup
x 5 measurement = 25 samples, paired sequentially):

  baseline:   633.205 +- 8.793 us/op
  with this:  593.611 +- 13.110 us/op
  delta:      -39.6 us, -6.3%

Targeted tests across HtmlPageTest, HtmlPage2Test, HtmlPage3Test,
HtmlInlineFrameTest, HtmlFrameTest, HtmlFrameSetTest, HTMLDocumentTest,
Html*ParserTest: 1176 tests, 0 failures.
if (mappedElementsBuilt_) {
return;
}
mappedElementsBuilt_ = true;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i guess we have to set this after the maps populated?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — fixed in 342862f. Moved mappedElementsBuilt_ = true to after the addElement calls so a partial failure mid-walk leaves built_=false and the next read retries instead of silently returning a half-populated index.

Per review feedback: the flag was being flipped before addElement() ran,
so a partial failure mid-walk would leave the page with built_=true and
a half-populated index, and subsequent reads would silently return
incomplete results.

Move the assignment after the populate so a thrown exception leaves
built_=false and the next read retries.
@sonarqubecloud
Copy link
Copy Markdown

@rbri rbri merged commit 01b1665 into HtmlUnit:master Apr 27, 2026
8 checks passed
@rbri
Copy link
Copy Markdown
Member

rbri commented Apr 27, 2026

Cool idea, many thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants