Conversation
I'm not a big fan of organizing by contributor. That information is already available in the repo metadata if someone really cares. The pcn_common.py functions were placed there because they are common to the notebooks I provided and I did not want to create long notebooks. Since they are for scalar data requests, their use is agnostic to the data being requested. In my experience, even with the overhead, packages like Pandas and Xarray are common in scientific computing and data exploration. My notebooks are one interpretation of how users can access and organize ONC data. Users are more than welcome to make edits and suggestions for improving their clarity and efficiency.

However, I can also see the pcn_common.py file being confusing or causing environment issues for new Python users. I will spend some time making each notebook standalone and will remove pcn_common.py from the repo. Users wishing to reuse code can then copy-paste as you say.

I think the goal for this repository should be to 1) show people how to access and discover ONC data via the onc Python package, and 2) highlight interesting ONC assets and data. What do you think about separating into tutorials and science/advanced processing examples? For tutorials, notebooks could describe how to extend basic operations with the api-python-client, such as finding all location codes within a bounding box, or finding all location codes that produced 'seawatertemperature' between two dates. Science examples would then be for reviewing data, making figures, or performing data corrections. Thoughts? @kan-fu @aschlesin

Once a decision is made on the repo structure, I will update the notebooks and their descriptions in the README.
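The bounding-box tutorial idea above can be sketched in a few lines. This is a hypothetical example, not code from the repo: the location records are assumed to come from the onc package's `getLocations` call (shown commented out), and the `locationCode`/`lat`/`lon` field names and the toy coordinates are assumptions for illustration.

```python
# Hypothetical sketch: find all location codes within a bounding box.
# In a real notebook the records would come from the onc package, e.g.:
#   from onc import ONC
#   locations = ONC(token).getLocations({"propertyCode": "seawatertemperature"})
# The field names (locationCode, lat, lon) are assumptions for illustration.

def codes_in_bbox(locations, lat_min, lat_max, lon_min, lon_max):
    """Return the location codes whose coordinates fall inside the bounding box."""
    return [
        loc["locationCode"]
        for loc in locations
        if lat_min <= loc["lat"] <= lat_max and lon_min <= loc["lon"] <= lon_max
    ]

# Toy records standing in for a real getLocations response.
sample = [
    {"locationCode": "BACAX", "lat": 48.32, "lon": -126.05},
    {"locationCode": "NCBC", "lat": 49.04, "lon": -123.43},
]
print(codes_in_bbox(sample, 48.0, 48.5, -127.0, -125.0))  # → ['BACAX']
```

The same filter-over-discovery-results pattern would work for the date-range case, using the discovery service's date filters instead of coordinates.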
Contributor-name way or category way

My main reason for organizing under contributor names is that notebooks (along with their helper files) from one contributor can stay independent of those from other contributors, so contributors are free to modify their own content as they like in an autonomous space. It also reduces the maintenance burden. Soon we will add https://github.com/OceanNetworksCanada/Barkley-Sound-datalabs and possibly https://github.com/g-bertozzi/Ocean-Hackathon-Datasets into this repository. The restructuring in this PR is all about how to add more notebooks from different contributors: in these cases they have their own repo first, and then we want to incorporate them into ours.
To be frank, I was leaning towards the category way in the very beginning ("tutorials" and "specific topics" were the words in my mind), as organizing by contributor names seems weird. It's just that, since my experience is not in the data/science area, I cannot help with classifying each notebook. I added keywords in the descriptions as an alternative that acts like categories. I am OK with either way; I just don't want this to discourage contributors from sharing their notebooks, since it requires extra work.

For pcn_common.py

I like the idea of having helper methods. I put it in the subfolder because I was thinking in the contributor-name way. If we go with the category way, we should keep it in the root. So I don't think making notebooks standalone is necessary (or even beneficial): having the same helper methods in 10 standalone notebooks goes against the DRY principle. Keeping them in one file also highlights and advertises some common methods (like the xarray ones you mentioned) to users. The environment issue I mentioned is about the pinned versions in the different requirements.txt files. Users need to be aware of that, but contributors should not worry about it.

One minor issue with having a pcn_common.py in the root is that other contributors need to either append their helper functions to this file and adapt the imports in their notebooks, or simply ignore it and use their own. Take Barkley-Sound-datalabs as an example: it looks like a tutorial for a conference or workshop, with its own structure and helper files. It would be best if they could just move everything into the repository without any changes. We just need to let other contributors know that they can have their own helper files. Not a big issue here.

BTW, I added the Code Organization section in the README because I hope users can smoothly run the notebooks. Right now I believe users need to put pcn_common.py beside the notebooks to make the import work.
Hi, I also don't prefer the naming convention by initial contributor. I like Ian's thought: "What do you think about separating into tutorials and science/advanced processing examples?" At first I thought we should organize by instrument category, but that does not really work if one creates a notebook that gets data from different instruments to compare, or to investigate a specific research idea. E.g., an easy example: water property changes in the Strait of Georgia. One would request data from different sites (along the strait, ferries, moorings/autonomous sites) and also different instrument categories (oxygen sensors, turbidity sensors, CO2 sensors, ...).
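Structurally, the multi-site scenario above boils down to fetching one series per site and then combining them for comparison, which is why a per-instrument folder layout fits poorly. A minimal, library-free sketch of that shape (the site names and readings below are made up; in a real notebook each series would come from an ONC scalar-data request):

```python
from statistics import mean

# Hypothetical per-site temperature readings; in practice these would come
# from ONC data requests against different sites and instrument categories.
readings = {
    "strait_mooring": [9.1, 9.3, 9.0],
    "ferry_route": [10.2, 10.4],
    "autonomous_site": [8.7, 8.9, 8.8],
}

# Summarize each site so the series can be compared in one figure or table.
summary = {site: round(mean(vals), 2) for site, vals in readings.items()}
print(summary)  # → {'strait_mooring': 9.13, 'ferry_route': 10.3, 'autonomous_site': 8.8}
```

A notebook like this cuts across several instrument categories at once, which is the argument against organizing the repo by instrument.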
Hi, I finally have some time to come back to some Python work. I just want to give an update on my thoughts about managing this repo, especially the structure of the individual notebooks. Recently I came across Google Colab and MyBinder (GitHub Codespaces is another, similar option to MyBinder). Both are free, cloud-based platforms for running Jupyter notebooks: users can run notebooks in the cloud with zero local setup, and contributors have only one testing goal, which is to make sure the notebooks run all cells to the end successfully in the cloud. Here are some comparisons generated by AI:
From my experience, the biggest difference is that MyBinder creates a headless Docker runtime and clones all the repo content into the container (slow setup, and opencv might not work well), while Google Colab opens only one notebook (notebooks can be stored in Google Drive, or on GitHub directly). Here are two links to try out the two platforms.

In the Colab version, I modified several small places: retrieving ONC_TOKEN from the Colab secrets feature and defaulting to python-dotenv when running locally, adding uv pip dependencies (yes, Colab supports uv out of the box, and I plan to recommend uv in the README as it is way faster and has cleaner output), and limiting all the generated outputs to the current folder.

I am more inclined to use Google Colab since it is fast and easy to use. In order to use Google Colab, all the notebooks need to be stand-alone. This might make the notebooks longer, and harder for contributors to reuse useful code snippets across notebooks, but I think it is worth the effort: users need only focus on understanding one notebook, without navigating back and forth between notebooks and supporting files, and there will be no dependency-conflict issues since all the notebooks are standalone.

What do you think about the idea of using Google Colab as the main recommended platform for users to try these notebooks? I already made the refactor in the playground branch to make the notebooks standalone with the help of AI, and all of them ran successfully on Google Colab. You can try that by going to https://colab.research.google.com/, clicking GitHub, pasting https://github.com/OceanNetworksCanada/python-community-notebooks in the URL field, selecting the playground branch, and clicking the search icon (the same Colab link I posted above for bacvp_ctd_up_profiles). All the available notebooks should be listed in the dialog. Or you can manually modify the GitHub link
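The "standalone notebook" requirement mainly means each notebook declares and installs its own dependencies. Below is a hypothetical helper sketching that pattern; the function name and behavior are mine, not from the repo, and on Colab one would more likely use a `!uv pip install ...` cell instead:

```python
import importlib.util
import subprocess
import sys

def ensure_packages(packages):
    """Install any of `packages` that are not already importable, and
    return the list of packages that were missing.

    Hypothetical sketch of the standalone-notebook pattern: each notebook
    checks and installs its own dependencies, so it runs the same way on
    Colab and on a local machine without a shared requirements file.
    """
    missing = [p for p in packages if importlib.util.find_spec(p) is None]
    if missing:
        subprocess.check_call([sys.executable, "-m", "pip", "install", *missing])
    return missing

# Standard-library modules are always importable, so nothing is installed here.
print(ensure_packages(["json", "csv"]))  # → []
```

The trade-off discussed above applies directly: duplicating a cell like this in every notebook is repetitive, but it removes all cross-notebook coupling.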
BTW, on Google Colab it is pretty easy to summarize a notebook using Gemini. I just tried the following prompt and it worked well. Clicking the Gemini icon at the bottom center of the page should bring up the chat UI.
Hi Kan, this is an interesting idea that I think is worth exploring, but I'd like to suggest one alternative (which would be a significant commitment on ONC's part). Do you think that in the future ONC will have the infrastructure to host its own JupyterHub? For example, the OOI has a hosted JupyterHub. I am in favor of exploring Google Colab to showcase ONC data, but the only advantage I see at the moment is that it is more inclusive of new Python users and better suited for use in the classroom. Is there a way to host curated datasets on an ONC-affiliated GDrive so that notebooks have access to common data examples?
@IanTBlack Re: JupyterHub. I got a reply back from the ODO manager, and they say this is currently not an option for ONC due to the lack of reasonable GPUs. We will probably get to this in the future. Re: datasets in a Google Drive. We could pursue this and host selected datasets on a Google Drive for easy access. However, we normally don't provide this to users. We have uploaded DOI-minted datasets to Borealis (https://borealisdata.ca/dataverse/oceannetworkscanada), which we could do with more datasets. I will ask Dwight to see where the ONC hackathon (2025) datasets are currently hosted. Maybe there is an option I don't know about. Re: Colab versus Binder. What is the advantage of these in terms of keeping track of changes and maintenance? I can see the appeal for easy use (beginners and students), but other than that?
For Colab vs. Binder: these are free cloud platforms for running Jupyter notebooks. The main advantage is to provide an extra way for users to try out the notebooks. The normal way (without Colab or Binder) is to clone/download the repo (or copy the code), then run the code locally. The whole Colab idea only requires contributors to add one extra cell in their notebooks when passing the token to the ONC class, and it won't affect normal local runs:

```python
# If in Colab, add your ONC_TOKEN secret by clicking the key icon on the left sidebar.
# If running locally, create a .env file in the root directory and add ONC_TOKEN=XXX to it.
import os

try:
    from google.colab import userdata
    os.environ['ONC_TOKEN'] = userdata.get('ONC_TOKEN')
except ImportError:
    from dotenv import load_dotenv
    load_dotenv()
```

In terms of keeping track of changes and maintenance, they are still done on GitHub.
Hi, this is my attempt to reorganize the repo so that it will be easier to incorporate other people's work. I also proposed guidelines (recommended, not enforced) on some common topics. The main motivation is to establish an easy and consistent way for users to try out these notebooks. Feel free to leave comments if you have any questions or suggestions.
I used Copilot to generate the catalog for Ian's notebooks (it actually polished the whole README file).
Explanations of some of the decisions (where I hesitated between both options and am open to change):
I put the helper Python script (pcn_common.py) in the subdirectory instead of the root directory. Having the helper file in the root directory would help other contributors reuse the methods, but from my experience, different people tend to have their own helper files. And if they really want to use the methods in someone else's helper file, they can always copy and paste; the original authors then would not need to consider backward-compatibility issues.
I used author names as the directory names.
I think it would be easier to manage the repo by organizing the notebooks under the author name instead of categories. I just used Ian's GitHub name. Feel free to change that @IanTBlack.
I put description and keywords in the catalog section.
I don't want to overburden the contributors, but I think a brief description and some keywords would help users navigate the repo. Users might just want to take a look at a random notebook, or they might be looking for specific topics. Having both a description and keywords (including the names of the external libraries used) gives users a good idea of whether a notebook is of interest. Users can also simply search for keywords in the README file.
I initially planned to replace all the notebook names with links, but later decided not to because it adds extra workload for the contributors.