Skip to content

Cleanup binary write path: avoid Binary.getBytes() copy in DeltaByteArrayWriter and remove LE wrapper from FixedLenByteArrayPlainValuesWriter #3520

@iemejia

Description

@iemejia

Background

Two small cleanups in the binary write path:

1. DeltaByteArrayWriter.writeBytes(Binary): avoid unconditional copy of the input bytes

The first line is:

byte[] vb = v.getBytes();

Binary.getBytes() is contractually required to return a fresh array the caller can keep and mutate. For ByteArrayBackedBinary (the most common case from Binary.fromConstantByteArray() and similar), the implementation does an unconditional Arrays.copyOf of the backing array — but DeltaByteArrayWriter only reads vb and then drops it. The copy is wasted work on every value.

The right call here is v.copy().getBytesUnsafe():

  • Binary.copy() is a no-op (return this) for constant Binaries that are already independent of any reused buffer.
  • For reused-buffer Binaries (e.g. ByteBufferBackedBinary over a slab being mutated), copy() snapshots them — preserving correctness.
  • getBytesUnsafe() then returns the backing array directly without a defensive copy.

For the common ByteArrayBackedBinary case this skips the entire copy. For other implementations the copy still happens but only when it's actually needed for correctness.

2. FixedLenByteArrayPlainValuesWriter: drop the unused LittleEndianDataOutputStream wrapper

Same pattern as #3516 fixed in DeltaLengthByteArrayValuesWriter: the writer wraps its CapacityByteArrayOutputStream with a LittleEndianDataOutputStream that's only used to call Binary.writeTo() — i.e. the LE wrapper adds a layer of dispatch on every value but never uses any LE-specific functionality (writeInt/writeLong/etc.). Binary.writeTo(arrayOut) works directly with the underlying stream.

The trailing out.flush() in getBytes() is also dead — CapacityByteArrayOutputStream doesn't buffer, and the LE wrapper has no buffer of its own (per the bulk-write change in #3518, the writeBuffer[] is a per-call scratch).

Files affected

  • parquet-column/src/main/java/org/apache/parquet/column/values/deltastrings/DeltaByteArrayWriter.java
  • parquet-column/src/main/java/org/apache/parquet/column/values/plain/FixedLenByteArrayPlainValuesWriter.java

No public API change. No file format change.

Expected impact

  • BinaryEncodingBenchmark.encodeDeltaByteArray (short strings): +5% to +10% standalone (the per-value copy is one of several overheads; this PR removes one of them; it stacks with GH-3516: Optimize DeltaByteArrayWriter and DeltaLengthByteArrayValuesWriter #3517 which removed the suffix-side allocation).
  • FixedLenByteArrayPlainValuesWriter: code-quality cleanup; removes a per-value layer of dispatch and an unnecessary flush() call. No headline benchmark.

Both changes are small, locally obvious, and align with the broader migration away from LittleEndianDataOutputStream already in flight (PRs #3496, #3518).

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions