Update the documentation to use parquet output #2607
erikvansebille wants to merge 76 commits into Parcels-code:main from
Conversation
Covered by test_write_dtypes_pfile
for more information, see https://pre-commit.ci
Remove temporary test_cftime.py file
This function is now independent of the time_interval as time is now stored as float
Remove nested key - save on root instead
VeckoTheGecko left a comment
As part of the review I've both looked at the code and visually compared plots before and after.
I've gone through and pushed some edits which were quite straightforward:
- 1b35bf9 Fixing a notebook
- d977c88e7
Now the docs builds are passing
Other than that, I have some small comments - nothing major.
Given we're now using Polars in the docs, the tests, and in the `read_particlefile` function, I think it's easiest for us to just add it as a core dependency of Parcels. We could make it an optional dependency, but we don't really have the tooling for that in Parcels (and I don't think it's worth adding the tooling in this case).
If we add it as a core dependency:
- Update `pyproject.toml` and `pixi.toml` (`run-dependencies` to `= ">=1.31.0"` and `feature.minimum.dependencies` to `= "1.31.*"`)
- Update `recipe.yaml`
I'm happy to make those updates.
```diff
- The output files are in `.zarr` [format](https://zarr.readthedocs.io/en/stable/), which can be read by `xarray`.
- See the [Parcels output tutorial](./tutorial_output.ipynb) for more information on the zarr format. We want to choose
+ The output files are in `.parquet` [format](https://parquet.apache.org/), which can be read by `polars`.
```
Would this be a good place to link to Polars?
```diff
- The output files are in `.parquet` [format](https://parquet.apache.org/), which can be read by `polars`.
+ The output files are in `.parquet` [format](https://parquet.apache.org/), which can be read by [Polars](https://pola.rs/).
```
I don't think we link to it yet in the docs here
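For context, reading the parquet output with Polars is a one-liner. A minimal sketch, assuming an output file named `output.parquet` (the filename is illustrative):

```python
import polars as pl

# Read the Parcels parquet output into a long-format DataFrame;
# the exact columns depend on the Variables written by the ParticleSet.
df = pl.read_parquet("output.parquet")
print(df.head())
```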
```diff
- Parcels depends on `xarray`, expecting inputs in the form of [`xarray.Dataset`](https://docs.xarray.dev/en/stable/generated/xarray.Dataset.html)
- and writing output files that can be read with xarray.
+ Parcels depends on `xarray`, expecting inputs in the form of [`xarray.Dataset`](https://docs.xarray.dev/en/stable/generated/xarray.Dataset.html). Output files can be read with `pandas`.
```
```diff
- Parcels depends on `xarray`, expecting inputs in the form of [`xarray.Dataset`](https://docs.xarray.dev/en/stable/generated/xarray.Dataset.html). Output files can be read with `pandas`.
+ Parcels depends on `xarray`, expecting inputs in the form of [`xarray.Dataset`](https://docs.xarray.dev/en/stable/generated/xarray.Dataset.html). Output files can be read with `polars`.
```
```diff
@@ -155,23 +155,22 @@ pset.execute(
 To start analyzing the trajectories computed by **Parcels**, we can open the `ParticleFile` using `xarray`:
```
This needs to be updated from "xarray"
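A sketch of what the updated tutorial cell could look like, using the `parcels.read_particlefile()` helper that this PR updates to use polars (the filename `Output.parquet` is a placeholder):

```python
import parcels

# Open the ParticleFile output as a polars DataFrame instead of an xarray Dataset.
df = parcels.read_particlefile("Output.parquet")
df.head()
```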
| if "since" in attrs["units"]: | ||
| values = values.astype("datetime64[ns]") | ||
| df = df.with_columns(pl.Series("time", values, dtype=pl.Datetime("ns"))) | ||
| else: | ||
| values = values.astype("timedelta64[ns]") * 1e9 | ||
| df = df.with_columns(pl.Series("time", values, dtype=pl.Duration("ns"))) |
I don't think this works properly with cf-time variables, and I think it will silently fail by providing incorrect times. It's worth updating the docstring and also adding a check (if the calendar in the metadata isn't supported, i.e. is CF-specific, raise a NotImplementedError), since this is quite a user-facing function.
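A hedged sketch of the suggested guard inside `read_particlefile`; the attribute key `"calendar"` and the set of supported calendar names are assumptions, not the actual implementation:

```python
# Calendars that decode cleanly to numpy datetime64; anything CF-specific
# (e.g. "noleap", "360_day") would silently produce wrong times, so refuse it.
SUPPORTED_CALENDARS = {"standard", "gregorian", "proleptic_gregorian"}  # assumed set

calendar = attrs.get("calendar", "standard")
if calendar not in SUPPORTED_CALENDARS:
    raise NotImplementedError(
        f"Calendar {calendar!r} is not supported when decoding the 'time' column; "
        "read the parquet file directly and decode the times manually."
    )
```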
| assert isinstance(df["time"][0], (cftime.datetime, datetime)), ( | ||
| "CF-time values in Parquet did not get properly decoded. Are the attributes correct?" | ||
| ) |
This assert should be updated pending the discussion from the other comment on the read_particlefile function.
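Depending on the outcome of that discussion, the check might end up asserting on the Polars dtype instead of on Python objects; a sketch under that assumption:

```python
# Hypothetical replacement: verify the decoded "time" column is a proper
# Polars Datetime rather than checking individual cftime/datetime objects.
assert df["time"].dtype == pl.Datetime("ns"), (
    "Time values in the parquet output were not decoded to datetimes. Are the attributes correct?"
)
```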
````diff
 ```{code-cell}
-ds_particles_back = xr.open_zarr("output-backwards.zarr")
+df_back = parcels.read_particlefile("output-backwards.parquet")

-scatter = plt.scatter(ds_particles_back.lon.T, ds_particles_back.lat.T, c=np.repeat(ds_particles_back.obs.values,npart))
-plt.scatter(ds_particles_back.lon[:,0],ds_particles_back.lat[:,0],facecolors="none",edgecolors='r') # starting positions
+scatter = plt.scatter(df_back['lon'], df_back['lat'], c=df_back['time'])
+particles_at_start = df_back.filter(pl.col("time") == df_back["time"].min())
+plt.scatter(particles_at_start['lon'], particles_at_start['lat'], facecolors="none", edgecolors='r') # starting positions
 plt.xlabel("Longitude [deg E]")
 plt.xlim(31,33)
 plt.ylabel("Latitude [deg N]")
 plt.ylim(-33,-30)
 plt.colorbar(scatter, label="Observation number")
 plt.show()
 ```
````
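Since the parquet output is long-format rather than the wide (trajectory × obs) zarr layout, per-trajectory line plots need a group-by. A sketch, assuming the particle-ID column is called `trajectory` (the name may differ in the tutorial):

```python
import matplotlib.pyplot as plt
import parcels

df_back = parcels.read_particlefile("output-backwards.parquet")

# Partition the long-format DataFrame into one frame per particle and
# plot each trajectory as a line ordered by time.
for traj in df_back.partition_by("trajectory"):
    traj = traj.sort("time")
    plt.plot(traj["lon"], traj["lat"], linewidth=0.8)
plt.xlabel("Longitude [deg E]")
plt.ylabel("Latitude [deg N]")
plt.show()
```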
I've gone through this - really nice update! I think it's quite clear.



Description
This PR updates all the documentation and tutorial notebooks to parse the parquet output introduced in #2600, as tracked in #2582. It also updates `parcels.read_particlefile()` to use polars, which scales better for large output files.
Checklist
- Targeted the correct branch (`main` for normal development, `v3-support` for v3 support)
AI Disclosure