Skip to content

Arrow/other interface to bulk copy functionality #551

@oscarrobertson

Description

@oscarrobertson

Hi, very open ended one here so feel free to close if this isn't productive I just thought it was worth sharing.

I've been taking a look at the bulk copy performance of this library vs an existing approach I take for bulk copying from python.

The existing approach I have is pretty gross, but it works surprisingly well! It basically launches an extremely tiny and simple dotnet application as a subprocess, then pipes data from python to the subprocess as arrow ipc record batches. The dotnet application just implements an IDataReader to read from incoming the arrow batches and uses Microsoft.Data.SqlClient's SqlBulkCopy to insert the data. You obviously pay a very large startup cost for this approach, but as the amount of data gets larger it actually still quite significantly outperforms this library's bulk copy, benchmark results at the bottom of this issue.

I suspect the performance difference is coming from doing too much work on the python side. Arrow as an interface is really nice, I wonder if adding the ability to do something similar for the bulk copy functionality of this library might make sense. Instead of passing a list of tuples I could instead pass an iterator/iterable of arrow batches.

Worth noting in the benchmark below:

  • I do NOT account for startup time in the sqlbulk_arrow benches
  • I do not include time taken to turn the data into arrow format pre the bulk copy (which is fair given people using this would already likely have their data in columnar format)

Benchmark results:

--- narrow: 3 cols: INT, VARCHAR(100), FLOAT ---

      rows     mssql-python   sqlbulk_arrow
-------------------------------------------
     1,000          52ms         582ms
    10,000          73ms         584ms
    50,000         127ms         711ms
   100,000         186ms         894ms
   500,000         687ms         1.88s
 1,000,000         1.33s         2.86s


--- mixed: 8 cols: 4 NOT NULL + 4 nullable (INT, VARCHAR, FLOAT, DATETIME2) ---

      rows     mssql-python   sqlbulk_arrow
-------------------------------------------
    1,000          58ms          580ms
    10,000         122ms         621ms
    50,000         378ms         932ms
   100,000         668ms         1.27s
   500,000         3.18s         2.50s
 1,000,000         6.27s         3.81s


--- wide: 20 cols: mixed types, 14 NOT NULL + 6 nullable ---

      rows     mssql-python   sqlbulk_arrow
-------------------------------------------
     1,000          70ms         587ms
    10,000         196ms         750ms
    50,000         778ms         1.43s
   100,000         1.69s         1.81s
   500,000         7.15s         4.84s
 1,000,000        13.99s         8.47s


Metadata

Metadata

Assignees

No one assigned

    Labels

    triage neededFor new issues, not triaged yet.

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions