Conversation
I'm not a big fan of organizing by contributor. That information is already available in the repo metadata if someone really cares. The pcn_common.py functions were placed there because they are common to the notebooks I provided and I did not want to create long notebooks. Since they are for scalar data requests, their use is agnostic to the data being requested. In my experience, even with the overhead, packages like Pandas and Xarray are common in scientific computing and data exploration. My notebooks are one interpretation of how users can access and organize ONC data. Users are more than welcome to make edits and suggestions for improving their clarity and efficiency.

However, I can also see the pcn_common.py file being confusing or causing environment issues for new Python users. I will spend some time making each notebook standalone and will remove pcn_common.py from the repo. Users wishing to reuse code can then copy-paste as you say.

I think the goal for this repository should be to 1) show people how to access and discover ONC data via the onc Python package, and 2) highlight interesting ONC assets and data. What do you think about separating into tutorials and science/advanced processing examples? For tutorials, notebooks could describe how to extend basic operations with the api-python-client, such as finding all location codes within a bounding box, or finding all location codes that produced 'seawatertemperature' between two dates. Science examples would then be for reviewing data, making figures, or performing data corrections. Thoughts? @kan-fu @aschlesin

Once a decision is made on the repo structure, I will update the notebooks and their descriptions in the README.
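The bounding-box tutorial idea above can be sketched in a few lines. This is a hypothetical example, not code from the repo: the location records are assumed to come from the onc package's `getLocations` call (shown commented out), and the `locationCode`/`lat`/`lon` field names and the toy coordinates are assumptions for illustration.

```python
# Hypothetical sketch: find all location codes within a bounding box.
# In a real notebook the records would come from the onc package, e.g.:
#   from onc import ONC
#   locations = ONC(token).getLocations({"propertyCode": "seawatertemperature"})
# The field names (locationCode, lat, lon) are assumptions for illustration.

def codes_in_bbox(locations, lat_min, lat_max, lon_min, lon_max):
    """Return the location codes whose coordinates fall inside the bounding box."""
    return [
        loc["locationCode"]
        for loc in locations
        if lat_min <= loc["lat"] <= lat_max and lon_min <= loc["lon"] <= lon_max
    ]

# Toy records standing in for a real getLocations response.
sample = [
    {"locationCode": "BACAX", "lat": 48.32, "lon": -126.05},
    {"locationCode": "NCBC", "lat": 49.04, "lon": -123.43},
]
print(codes_in_bbox(sample, 48.0, 48.5, -127.0, -125.0))  # → ['BACAX']
```

The same filter-over-discovery-results pattern would work for the date-range case, using the discovery service's date filters instead of coordinates.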
Contributor-name way or category way

My main reason for organizing under contributor names is that notebooks (along with their helper files) from one contributor can stay independent of those from other contributors, so contributors are free to modify their own content as they like in an autonomous space. It also reduces the maintenance burden. Soon we will add https://github.com/OceanNetworksCanada/Barkley-Sound-datalabs and possibly https://github.com/g-bertozzi/Ocean-Hackathon-Datasets into this repository. The restructuring in this PR is all about how to add more notebooks from different contributors: in these cases they have their own repo first, and then we want to incorporate them into ours.
To be frank, I was leaning towards the category way in the very beginning ("tutorials" and "specific topics" were the words in my mind), as organizing by contributor names seems weird. It's just that, since my experience is not in the data/science area, I cannot help with classifying each notebook. I added keywords in the descriptions as an alternative that acts like categories. I am OK with either way; I just don't want this to discourage contributors from sharing their notebooks, since it requires extra work.

For pcn_common.py

I like the idea of having helper methods. I put it in the subfolder because I was thinking in the contributor-name way. If we go with the category way, we should keep it in the root. So I don't think making notebooks standalone is necessary (or even beneficial): having the same helper methods in 10 standalone notebooks goes against the DRY principle. Keeping them in one file also highlights and advertises some common methods (like the xarray ones you mentioned) to users. The environment issue I mentioned is about the pinned versions in the different requirements.txt files. Users need to be aware of that, but contributors should not worry about it.

One minor issue with having a pcn_common.py in the root is that other contributors need to either append their helper functions to this file and adapt the imports in their notebooks, or simply ignore it and use their own. Take Barkley-Sound-datalabs as an example: it looks like a tutorial for a conference or workshop, with its own structure and helper files. It would be best if they could just move everything into the repository without any changes. We just need to let other contributors know that they can have their own helper files. Not a big issue here.

BTW, I added the Code Organization section in the README because I hope users can smoothly run the notebooks. Right now I believe users need to put pcn_common.py beside the notebooks to make the import work.
Hi, I also don't prefer the naming convention by initial contributor. I like Ian's thought: "What do you think about separating into tutorials and science/advanced processing examples?" At first I thought we should organize by instrument category, but that does not really work if one creates a notebook that gets data from different instruments to compare, or to investigate a specific research idea. E.g., an easy example: water property changes in the Strait of Georgia. One would request data from different sites (along the strait, ferries, moorings/autonomous sites) and also different instrument categories (oxygen sensors, turbidity sensors, CO2 sensors, ...).
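Structurally, the multi-site scenario above boils down to fetching one series per site and then combining them for comparison, which is why a per-instrument folder layout fits poorly. A minimal, library-free sketch of that shape (the site names and readings below are made up; in a real notebook each series would come from an ONC scalar-data request):

```python
from statistics import mean

# Hypothetical per-site temperature readings; in practice these would come
# from ONC data requests against different sites and instrument categories.
readings = {
    "strait_mooring": [9.1, 9.3, 9.0],
    "ferry_route": [10.2, 10.4],
    "autonomous_site": [8.7, 8.9, 8.8],
}

# Summarize each site so the series can be compared in one figure or table.
summary = {site: round(mean(vals), 2) for site, vals in readings.items()}
print(summary)  # → {'strait_mooring': 9.13, 'ferry_route': 10.3, 'autonomous_site': 8.8}
```

A notebook like this cuts across several instrument categories at once, which is the argument against organizing the repo by instrument.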
Hi, I finally have some time to come back to some Python work. I just want to give an update on my thoughts about managing this repo, especially the structure of the individual notebooks. Recently I came across Google Colab and MyBinder (GitHub Codespaces is another, similar option to MyBinder). Both are free, cloud-based platforms for running Jupyter notebooks: users can run notebooks in the cloud with zero local setup, and contributors have only one testing goal, which is to make sure the notebooks run all cells to the end successfully in the cloud. Here are some comparisons generated by AI:
From my experience, the biggest difference is that MyBinder creates a headless Docker runtime and clones all the repo content into the container (slow setup, and opencv might not work well), while Google Colab opens only one notebook (notebooks can be stored in Google Drive, or on GitHub directly). Here are two links to try out the two platforms.

In the Colab version, I modified several small places: retrieving ONC_TOKEN from the Colab secrets feature and defaulting to python-dotenv when running locally, adding uv pip dependencies (yes, Colab supports uv out of the box, and I plan to recommend uv in the README as it is way faster and has cleaner output), and limiting all the generated outputs to the current folder.

I am more inclined to use Google Colab since it is fast and easy to use. In order to use Google Colab, all the notebooks need to be stand-alone. This might make the notebooks longer, and harder for contributors to reuse useful code snippets across notebooks, but I think it is worth the effort: users need only focus on understanding one notebook, without navigating back and forth between notebooks and supporting files, and there will be no dependency-conflict issues since all the notebooks are standalone.

What do you think about the idea of using Google Colab as the main recommended platform for users to try these notebooks? I already made the refactor in the playground branch to make the notebooks standalone with the help of AI, and all of them ran successfully on Google Colab. You can try that by going to https://colab.research.google.com/, clicking GitHub, pasting https://github.com/OceanNetworksCanada/python-community-notebooks in the URL field, selecting the playground branch, and clicking the search icon (the same Colab link I posted above for bacvp_ctd_up_profiles). All the available notebooks should be listed in the dialog. Or you can manually modify the GitHub link
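The "standalone notebook" requirement mainly means each notebook declares and installs its own dependencies. Below is a hypothetical helper sketching that pattern; the function name and behavior are mine, not from the repo, and on Colab one would more likely use a `!uv pip install ...` cell instead:

```python
import importlib.util
import subprocess
import sys

def ensure_packages(packages):
    """Install any of `packages` that are not already importable, and
    return the list of packages that were missing.

    Hypothetical sketch of the standalone-notebook pattern: each notebook
    checks and installs its own dependencies, so it runs the same way on
    Colab and on a local machine without a shared requirements file.
    """
    missing = [p for p in packages if importlib.util.find_spec(p) is None]
    if missing:
        subprocess.check_call([sys.executable, "-m", "pip", "install", *missing])
    return missing

# Standard-library modules are always importable, so nothing is installed here.
print(ensure_packages(["json", "csv"]))  # → []
```

The trade-off discussed above applies directly: duplicating a cell like this in every notebook is repetitive, but it removes all cross-notebook coupling.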
BTW, on Google Colab it is pretty easy to summarize a notebook using Gemini. I just tried the following prompt and it worked well. Clicking the Gemini icon at the bottom center of the page should bring up the chat UI.
Hi Kan, this is an interesting idea that I think is worth exploring, but I'd like to suggest one alternative (which would be a significant commitment on ONC's part). Do you think that in the future ONC will have the infrastructure to host its own JupyterHub? For example, the OOI has a hosted JupyterHub. I am in favor of exploring Google Colab to showcase ONC data, but the only advantage I see at the moment is that it is more inclusive of new Python users and better suited for use in the classroom. Is there a way to host curated datasets on an ONC-affiliated GDrive so that notebooks have access to common data examples?
@IanTBlack Re: JupyterHub. I got a reply back from the ODO manager, and they say this is currently not an option for ONC due to the lack of reasonable GPUs. We will probably get to this in the future. Re: datasets in a Google Drive. We could pursue this and host selected datasets on a Google Drive for easy access. However, we normally don't provide this to users. We have uploaded DOI-minted datasets to Borealis (https://borealisdata.ca/dataverse/oceannetworkscanada), which we could do with more datasets. I will ask Dwight to see where the ONC hackathon (2025) datasets are currently hosted. Maybe there is an option I don't know about. Re: Colab versus Binder. What is the advantage of these in terms of keeping track of changes and maintenance? I can see the appeal for easy use (beginners and students), but other than that?
For Colab vs. Binder: these are free cloud platforms for running Jupyter notebooks. The main advantage is to provide an extra way for users to try out the notebooks. The normal way (without Colab or Binder) is to clone/download the repo (or copy the code), then run the code locally. The whole Colab idea only requires contributors to add one extra cell in their notebooks when passing the token to the ONC class, and it won't affect normal local runs:

```python
# If in Colab, add your ONC_TOKEN secret by clicking the key icon on the left sidebar.
# If running locally, create a .env file in the root directory and add ONC_TOKEN=XXX to it.
import os

try:
    from google.colab import userdata
    os.environ['ONC_TOKEN'] = userdata.get('ONC_TOKEN')
except ImportError:
    from dotenv import load_dotenv
    load_dotenv()
```

In terms of keeping track of changes and maintenance, they are still done on GitHub.
Hi, this is my attempt to reorganize the repo so that it will be easier to incorporate other people's work. I also proposed guidelines (recommended, not enforced) on some common topics. The main motivation is to establish an easy and consistent way for users to try out these notebooks. Feel free to leave comments if you have any questions or suggestions.
I used Copilot to generate the catalog for Ian's notebooks (it actually polished the whole README file).
Explanations of some of the decisions (where I hesitated between both options and am open to change):
I put the helper Python script (pcn_common.py) in the subdirectory instead of the root directory. Having the helper file in the root directory would help other contributors reuse the methods, but from my experience, different people tend to have their own helper files. And if they really want to use the methods in someone else's helper file, they can always copy and paste; the original authors then would not need to consider backward-compatibility issues.
I used author names as the directory names.
I think it would be easier to manage the repo by organizing the notebooks under the author name instead of categories. I just used Ian's GitHub name. Feel free to change that @IanTBlack.
I put description and keywords in the catalog section.
I don't want to overburden the contributors, but I think a brief description and some keywords would help users navigate the repo. Users might just want to take a look at a random notebook, or they might be looking for specific topics. Having both a description and keywords (including the names of the external libraries used) gives users a good idea of whether a notebook is of interest. Users can also simply search for keywords in the README file.
I initially planned to replace all the notebook names with links, but later decided not to because it adds extra workload for the contributors.