
Update the documentation to use parquet output #2607

Open
erikvansebille wants to merge 76 commits into Parcels-code:main from erikvansebille:update_parquet_docs

Conversation

@erikvansebille
Member

Description

This PR updates all the documentation and tutorial notebooks to parse the parquet output introduced in #2600, as tracked in #2582. It also updates `parcels.read_particlefile()` to use Polars, which scales better for large output files.

Checklist

AI Disclosure

  • This PR contains AI-generated content.
    • I have tested any AI-generated content in my PR.
    • I take responsibility for any AI-generated content in my PR.
    • Describe how you used it (e.g., by pasting your prompt): Help with how to use polars

@erikvansebille erikvansebille mentioned this pull request May 1, 2026
Contributor

@VeckoTheGecko VeckoTheGecko left a comment


As part of the review I've looked at the code and visually compared plots before and after.

I've gone through and pushed some edits which were quite straightforward:

  • 1b35bf9 Fixing a notebook
  • d977c88e7

The docs builds are now passing.

Other than that, I have some small comments - nothing major.


Given we're now using Polars in the docs, the tests, and in the `read_particlefile` function, I think it's easiest for us to just add it as a core dependency of Parcels. We could make it an optional dependency, but we don't really have the tooling for that in Parcels (and I don't think it's worth adding the tooling in this case).

If we add as a core dependency:

  • Update pyproject.toml and pixi.toml (run-dependencies to = ">=1.31.0" and feature.minimum.dependencies to = "1.31.*")
  • Update recipe.yaml

I'm happy to make those updates.
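If it does become a core dependency, the `pyproject.toml` change might look roughly like this (a sketch following the `>=1.31.0` bound suggested above; the surrounding table layout is an assumption about the project file, not its actual contents):

```toml
[project]
dependencies = [
    # ...existing dependencies...
    "polars>=1.31.0",
]
```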


```diff
-The output files are in `.zarr` [format](https://zarr.readthedocs.io/en/stable/), which can be read by `xarray`.
-See the [Parcels output tutorial](./tutorial_output.ipynb) for more information on the zarr format. We want to choose
+The output files are in `.parquet` [format](https://parquet.apache.org/), which can be read by `polars`.
```
Contributor


Would this be a good place to link to Polars?

Suggested change

```diff
-The output files are in `.parquet` [format](https://parquet.apache.org/), which can be read by `polars`.
+The output files are in `.parquet` [format](https://parquet.apache.org/), which can be read by [Polars](https://pola.rs/).
```

I don't think we link to it yet in the docs here


```diff
-Parcels depends on `xarray`, expecting inputs in the form of [`xarray.Dataset`](https://docs.xarray.dev/en/stable/generated/xarray.Dataset.html)
-and writing output files that can be read with xarray.
+Parcels depends on `xarray`, expecting inputs in the form of [`xarray.Dataset`](https://docs.xarray.dev/en/stable/generated/xarray.Dataset.html). Output files can be read with `pandas`.
```
Contributor


Suggested change

```diff
-Parcels depends on `xarray`, expecting inputs in the form of [`xarray.Dataset`](https://docs.xarray.dev/en/stable/generated/xarray.Dataset.html). Output files can be read with `pandas`.
+Parcels depends on `xarray`, expecting inputs in the form of [`xarray.Dataset`](https://docs.xarray.dev/en/stable/generated/xarray.Dataset.html). Output files can be read with `polars`.
```

```diff
@@ -155,23 +155,22 @@ pset.execute(
 To start analyzing the trajectories computed by **Parcels**, we can open the `ParticleFile` using `xarray`:
```
Contributor


This needs to be updated from "xarray".

Comment on lines +243 to +248

```python
if "since" in attrs["units"]:
    values = values.astype("datetime64[ns]")
    df = df.with_columns(pl.Series("time", values, dtype=pl.Datetime("ns")))
else:
    values = values.astype("timedelta64[ns]") * 1e9
    df = df.with_columns(pl.Series("time", values, dtype=pl.Duration("ns")))
```
Contributor


I don't think this works properly with cf-time variables, and I think it will silently fail by providing incorrect times. It's worth updating the docstring and also adding a check (if the calendar in the metadata isn't supported, i.e. is CF-specific, raise a NotImplementedError), since this is quite a user-facing function.
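The suggested guard could look roughly like this (a sketch only; the function name, the supported set, and the attribute layout are assumptions for illustration, not the PR's actual code):

```python
# Hypothetical guard against CF-specific calendars, which numpy's
# datetime64 cannot represent; raising early avoids silently wrong times.
SUPPORTED_CALENDARS = {"standard", "gregorian", "proleptic_gregorian"}  # assumption

def check_time_calendar(attrs: dict) -> None:
    """Raise instead of silently mis-decoding CF-specific calendars."""
    calendar = attrs.get("calendar", "standard")
    if calendar.lower() not in SUPPORTED_CALENDARS:
        raise NotImplementedError(
            f"Calendar {calendar!r} is not supported when reading particle files; "
            "decode the time column with cftime instead."
        )

check_time_calendar({"units": "seconds since 2000-01-01", "calendar": "standard"})  # passes
```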

Comment thread tests/utils.py
Comment on lines +164 to 166

```python
assert isinstance(df["time"][0], (cftime.datetime, datetime)), (
    "CF-time values in Parquet did not get properly decoded. Are the attributes correct?"
)
```
Contributor


This assert should be updated pending the discussion from the other comment on the read_particlefile function.

Comment on lines 211 to 223

````diff
 ```{code-cell}
-ds_particles_back = xr.open_zarr("output-backwards.zarr")
+df_back = parcels.read_particlefile("output-backwards.parquet")
 
-scatter = plt.scatter(ds_particles_back.lon.T, ds_particles_back.lat.T, c=np.repeat(ds_particles_back.obs.values,npart))
-plt.scatter(ds_particles_back.lon[:,0],ds_particles_back.lat[:,0],facecolors="none",edgecolors='r') # starting positions
+scatter = plt.scatter(df_back['lon'], df_back['lat'], c=df_back['time'])
+particles_at_start = df_back.filter(pl.col("time") == df_back["time"].min())
+plt.scatter(particles_at_start['lon'], particles_at_start['lat'], facecolors="none", edgecolors='r') # starting positions
 plt.xlabel("Longitude [deg E]")
 plt.xlim(31,33)
 plt.ylabel("Latitude [deg N]")
 plt.ylim(-33,-30)
 plt.colorbar(scatter, label="Observation number")
 plt.show()
 ```
````
Contributor


This output is different to what it was before.

What are we highlighting with the red circles here?

(before on the left, this PR on the right)

Image

Contributor


I've gone through this - really nice update! I think it's quite clear.

Contributor


FYI the colourbar is different here - it seems to be better now (fits the range of values better).

Image

Contributor


FYI the final plot here doesn't have black lines through it.

Image

I'm happy either way
