Hi, this is a very open-ended one, so feel free to close it if it isn't productive; I just thought it was worth sharing.
I've been comparing the bulk copy performance of this library against an existing approach I use for bulk copying from Python.
The existing approach I have is pretty gross, but it works surprisingly well! It launches a tiny, very simple dotnet application as a subprocess, then pipes data from Python to the subprocess as Arrow IPC record batches. The dotnet application just implements an IDataReader over the incoming Arrow batches and uses Microsoft.Data.SqlClient's SqlBulkCopy to insert the data. You obviously pay a large startup cost for this approach, but as the amount of data grows it still significantly outperforms this library's bulk copy; benchmark results are at the bottom of this issue.
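For concreteness, here is a minimal sketch of the Python side of that pipe, assuming pyarrow and a hypothetical `BulkCopyHelper` dotnet entry point (the real helper and its command line will differ):

```python
import subprocess
import pyarrow as pa
import pyarrow.ipc

def bulk_copy_via_subprocess(batches, schema: pa.Schema) -> None:
    # "BulkCopyHelper.dll" and "--table" are placeholder names for the
    # tiny dotnet app described above, not a real tool.
    proc = subprocess.Popen(
        ["dotnet", "BulkCopyHelper.dll", "--table", "dbo.target_table"],
        stdin=subprocess.PIPE,
    )
    # The IPC stream writer frames each record batch so the reader on
    # the dotnet side can consume them incrementally as they arrive.
    with pa.ipc.new_stream(proc.stdin, schema) as writer:
        for batch in batches:
            writer.write_batch(batch)
    proc.stdin.close()
    proc.wait()
```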
I suspect the performance difference comes from doing too much work on the Python side. Arrow is a really nice interface, and I wonder whether adding something similar to this library's bulk copy functionality might make sense: instead of passing a list of tuples, I could pass an iterator/iterable of Arrow batches.
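As a rough illustration of what that interface could look like (the `bulk_copy_arrow` method below is purely hypothetical, not an existing API of this library):

```python
import pyarrow as pa

def record_batches(table: pa.Table, chunk_size: int = 65_536):
    """Yield record batches from an Arrow table so the bulk copy can
    stream data instead of materializing a list of tuples."""
    yield from table.to_batches(max_chunksize=chunk_size)

# Hypothetical call shape, for illustration only:
# cursor.bulk_copy_arrow("dbo.target_table", record_batches(table))
```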
Worth noting in the benchmarks below:
- I do NOT account for startup time in the sqlbulk_arrow benches
- I do not include the time taken to convert the data into Arrow format before the bulk copy (which is fair, given that people using this would likely already have their data in a columnar format)
Benchmark results:
--- narrow: 3 cols: INT, VARCHAR(100), FLOAT ---

| rows      | mssql-python | sqlbulk_arrow |
|----------:|-------------:|--------------:|
| 1,000     | 52ms         | 582ms         |
| 10,000    | 73ms         | 584ms         |
| 50,000    | 127ms        | 711ms         |
| 100,000   | 186ms        | 894ms         |
| 500,000   | 687ms        | 1.88s         |
| 1,000,000 | 1.33s        | 2.86s         |

--- mixed: 8 cols: 4 NOT NULL + 4 nullable (INT, VARCHAR, FLOAT, DATETIME2) ---

| rows      | mssql-python | sqlbulk_arrow |
|----------:|-------------:|--------------:|
| 1,000     | 58ms         | 580ms         |
| 10,000    | 122ms        | 621ms         |
| 50,000    | 378ms        | 932ms         |
| 100,000   | 668ms        | 1.27s         |
| 500,000   | 3.18s        | 2.50s         |
| 1,000,000 | 6.27s        | 3.81s         |

--- wide: 20 cols: mixed types, 14 NOT NULL + 6 nullable ---

| rows      | mssql-python | sqlbulk_arrow |
|----------:|-------------:|--------------:|
| 1,000     | 70ms         | 587ms         |
| 10,000    | 196ms        | 750ms         |
| 50,000    | 778ms        | 1.43s         |
| 100,000   | 1.69s        | 1.81s         |
| 500,000   | 7.15s        | 4.84s         |
| 1,000,000 | 13.99s       | 8.47s         |