You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
1. DeltaByteArrayWriter.writeBytes(Binary): avoid unconditional copy of the input bytes
The first line is:
byte[] vb = v.getBytes();
Binary.getBytes() is contractually required to return a fresh array the caller can keep and mutate. For ByteArrayBackedBinary (the most common case from Binary.fromConstantByteArray() and similar), the implementation does an unconditional Arrays.copyOf of the backing array — but DeltaByteArrayWriter only reads vb and then drops it. The copy is wasted work on every value.
The right call here is v.copy().getBytesUnsafe():
Binary.copy() is a no-op (return this) for constant Binaries that are already independent of any reused buffer.
For reused-buffer Binaries (e.g. ByteBufferBackedBinary over a slab being mutated), copy() snapshots them — preserving correctness.
getBytesUnsafe() then returns the backing array directly without a defensive copy.
For the common ByteArrayBackedBinary case this skips the entire copy. For other implementations the copy still happens but only when it's actually needed for correctness.
2. FixedLenByteArrayPlainValuesWriter: drop the unused LittleEndianDataOutputStream wrapper
Same pattern as #3516 fixed in DeltaLengthByteArrayValuesWriter: the writer wraps its CapacityByteArrayOutputStream with a LittleEndianDataOutputStream that's only used to call Binary.writeTo() — i.e. the LE wrapper adds a layer of dispatch on every value but never uses any LE-specific functionality (writeInt/writeLong/etc.). Binary.writeTo(arrayOut) works directly with the underlying stream.
The trailing out.flush() in getBytes() is also dead — CapacityByteArrayOutputStream doesn't buffer, and the LE wrapper has no buffer of its own (per the bulk-write change in #3518, the writeBuffer[] is a per-call scratch).
FixedLenByteArrayPlainValuesWriter: code-quality cleanup; removes a per-value layer of dispatch and an unnecessary flush() call. No headline benchmark.
Both changes are small, locally obvious, and align with the broader migration away from LittleEndianDataOutputStream already in flight (PRs #3496, #3518).
Background
Two small cleanups in the binary write path:
1.
DeltaByteArrayWriter.writeBytes(Binary): avoid unconditional copy of the input bytesThe first line is:
Binary.getBytes()is contractually required to return a fresh array the caller can keep and mutate. ForByteArrayBackedBinary(the most common case fromBinary.fromConstantByteArray()and similar), the implementation does an unconditionalArrays.copyOfof the backing array — butDeltaByteArrayWriteronly readsvband then drops it. The copy is wasted work on every value.The right call here is
v.copy().getBytesUnsafe():Binary.copy()is a no-op (return this) for constant Binaries that are already independent of any reused buffer.ByteBufferBackedBinaryover a slab being mutated),copy()snapshots them — preserving correctness.getBytesUnsafe()then returns the backing array directly without a defensive copy.For the common
ByteArrayBackedBinarycase this skips the entire copy. For other implementations the copy still happens but only when it's actually needed for correctness.2.
FixedLenByteArrayPlainValuesWriter: drop the unusedLittleEndianDataOutputStreamwrapperSame pattern as #3516 fixed in
DeltaLengthByteArrayValuesWriter: the writer wraps itsCapacityByteArrayOutputStreamwith aLittleEndianDataOutputStreamthat's only used to callBinary.writeTo()— i.e. the LE wrapper adds a layer of dispatch on every value but never uses any LE-specific functionality (writeInt/writeLong/etc.).Binary.writeTo(arrayOut)works directly with the underlying stream.The trailing
out.flush()ingetBytes()is also dead —CapacityByteArrayOutputStreamdoesn't buffer, and the LE wrapper has no buffer of its own (per the bulk-write change in #3518, thewriteBuffer[]is a per-call scratch).Files affected
parquet-column/src/main/java/org/apache/parquet/column/values/deltastrings/DeltaByteArrayWriter.javaparquet-column/src/main/java/org/apache/parquet/column/values/plain/FixedLenByteArrayPlainValuesWriter.javaNo public API change. No file format change.
Expected impact
BinaryEncodingBenchmark.encodeDeltaByteArray(short strings): +5% to +10% standalone (the per-value copy is one of several overheads; this PR removes one of them; it stacks with GH-3516: Optimize DeltaByteArrayWriter and DeltaLengthByteArrayValuesWriter #3517 which removed the suffix-side allocation).FixedLenByteArrayPlainValuesWriter: code-quality cleanup; removes a per-value layer of dispatch and an unnecessaryflush()call. No headline benchmark.Both changes are small, locally obvious, and align with the broader migration away from
LittleEndianDataOutputStreamalready in flight (PRs #3496, #3518).