Architecture, Terminology & API Introductions
At the heart of Notebook CI is astropy’s nbcollection & nbcollection-ci. This document will focus on nbcollection-ci, introducing modules and concepts used in the development.
A custom testing framework has been built into nbcollection so tests can be ran with file-system objects objects found on remote systems. For more details, please refere to the Testing Architecture documentation.
Software Support
Python Version Support
Software |
Version |
Test Status |
|---|---|---|
Python |
3.6 |
|
Python |
3.7 |
|
Python |
3.8 |
|
Python |
3.9 |
Operating System Support
OS Name |
Version |
Test Status |
|---|---|---|
Ubuntu LTS |
20.04 |
|
RHEL |
7 |
|
macOS |
10.15 |
|
Windows Server |
2019 |
Terminology
Notebook Collection: A notebook collection is a collection of categories, or folders just relative to the root-level of a Repository.
Notebook Category: A notebook category is a folder of notebooks relative or deeply nested inside a notebook collection. There can be many categories per collection. Notebook Categories are transformed into isolated build environments and notebooks within are ordered and executed sequentially.
Notebooks
CI/CD Checks and Utilities
Bandit Security Audit
Flake8 PEP-8
Sphinx Link Checking
Code Cov ( not yet configured )
Modules and Concepts
Great care was taken to make the code maintainable. A majority of the code was written around the Scanner Module. You’ll find small implementations of the scanner module functions implemented throughout the codebase takes on most complexe operations within nbcollection-ci.
At the core of the scanner module is the concept “Isolated Build Environments”. This allows for a lot of encapsulation and abstraction in various areas of the codebase. Including a Testing Architecture completely decoupled from dependencies and commands, decoupled from the scanner module. The only hard dependencies on each module are objects being passed through to the functions being ran. Making upgrading straightforward.
Let’s dig into the APIs available
Scanner Module API
Enumerating the Python API into bash was pretty straight forward. The only caveat to worry about is how to manage multiple entries at once. In the Bash API, a comma seperated list can be passed into each option such as –collection-names, –category-names, and –notebook-names.
Command Line API
$ nbcollection-ci build-notebooks --collection-names jdat_notebooks,notebooks --project-path /tmp/dat_pyinthesky
Python API
from nbcollection.ci.scanner.utils import find_build_jobs, generate_job_context, run_job_context
project_path = '/tmp/dat_pyinthesky'
collection_names = ['jdat_notebooks', 'notebooks']
for job_idx, job in enumerate(find_build_jobs(project_path, collection_names)):
# Construct an isolated build environment by generating the job context
job_context = generate_job_context(job)
# Execute Jupyter Notebooks by running the Job Context
run_job_context(job_context)
Metadata Module API
nbcollection.ci.metadata implements functions to extract useful information from Jupyter Notebooks.
The logic inspects each Jupyter Notebook for title and description and provides a python datatype.
Extracting Metadata
Command Line API
nbcollection-ci metadata --mode extract-metadata --collection-names jdat_notebooks --project-path /tmp/dat_pyinthesky --category-name asdf_example
Python API
# Extract Metadata from
from nbcollection.ci.scanner.utils import find_build_jobs, generate_job_context
from nbcollection.ci.metadata.utils import extract_metadata
project_path = '/tmp/dat_pyinthesky'
collection_names = ['jdat_notebooks']
category_names = ['asdf_example']
for job in find_build_job(project_path, collection_names, category_names):
job_context = generate_job_context(job)
for notebook_context in job_context.notebooks:
metadata = extract_metadata(notebook_context)
Reset Notebooks
Resets Jupyter Notebook cell output.
Command Line API
nbcollection-ci metadata --mode reset-notebooks --collection-names jdat_notebooks --project-path /tmp/dat_pyinthesky --category-name asdf_example
Python API
import json
from nbcollection.ci.scanner.utils import find_build_jobs, generate_job_context
from nbcollection.ci.metadata.utils import reset_notebook_execution
project_path = '/tmp/dat_pyinthesky'
collection_names = ['jdat_notebooks']
category_names = ['asdf_example']
for job in find_build_job(project_path, collection_names, category_names):
job_context = generate_job_context(job)
for notebook_context in job_context.notebooks:
with open(notebook_context.path, 'rb') as stream:
notebook_data = json.loads(stream.read().decode(ENCODING))
reset_notebook_execution(notebook_data)
with open(notebook_context.path, 'wb') as stream:
stream.write(json.dumps(notebook_data).encode(ENCODING))
Generate CI Environment
nbcollection-ci generate-ci-env accepts an arbitrary number of Collection, Category, and Notebooks. Takes the information and renders into a configuration file of which can be submitted to a CI/CD pipeline. The logic is flexible enough to generate configs for CircleCI, Github Actions, AWS CloudFormation for Lambda, or K8 Yaml files if need be.
Command Line API
$ nbcollection-ci generate-ci-env --collection-names jdat_notebooks --ci-environment circle-ci --project-path /tmp/dat_pyinthesky
Python API
from nbcollection.ci.commands.datatypes import CIEnvironment
from nbcollection.ci.generate_ci_environment.utils import gen_ci_env
project_path = '/tmp/dat_pyinthesky'
collection_names = ['jdat_notebooks']
jobs = []
for job in find_build_jobs(project_path, collection_names):
jobs.append(job)
gen_ci_env(jobs, CIEnvironment.CircleCI, project_path)
Merge Artifacts
Artifacts are generated by previously ran commands and could be available in CircleCI or locally in a temporary folder. To merge these artifacts into a single website, which is generated from a theme from within nbcollection.ci, this merge-artifacts command will need a CIRCLECI_TOKEN to run successfully.
The logic of this command identifies which notebooks have been built recently in CircleCI. Downloads and stores the Notebooks with other artifacts in a temporary folder. A website is then generated from the downloaded artifacts using a Beautiful Soup lxml parser. Extracts the core HTML from the built Jupyter Notebook and renders into a Jinja2 Website Theme.
Command Line API
$ nbcollection-ci merge-artifacts --org spacetelescope --repo-name dat_pyinthesky
Python API
from nbcollection.ci.merge_artifacts.utils import artifact_merge
project_path = '/tmp/dat_pyinthesky'
repo_name = 'dat_pyinthesky'
org = 'spacetelescope'
collection_names = ['jdat_notebooks']
artifact_merge(project_path, repo_name, org, collection_names)
Build Notebooks
nbcollection-ci build-notebooks is where most of the work is done. The scope of the command can take in entire collections, categories, or notebooks. Narrowing down the scope of the notebook builds allows for concurrent builds to be ran through the command line interface for CI/CD to accuratly report failures in services such as CircleCI or Github Actions.
The logic of the command accepts an arbitrary set of Notebook Collections, Categories, and/or Notebooks. Of which will than be sequentially executed in alphabetical order.
Command Line API
nbcollection-ci build-notebooks --collection-names jdat_notebooks --category-names asdf_example
Alternatively, there are also utility options for the power user which will run all the notebook builds in different processes with a single command.
The logic of the command accepts an arbitrary set of Notebook Collections, Category, and/or Notebooks. Of which are than piped into concurrent processes where each process executes a single category on notebooks. If the category takes advantage of multiple cores on the machine, build-notebook isn’t written to measure available resources. Its up to the notebooks to know if the CPUs are busy. All memory managment is assumed, managed by a Python Interpreter or Operating System.
Command Line API
nbcollection-ci build-notebooks --collection-names jdat_notebooks -b concurrent -w 4
Either way nbcollection-ci build-notebooks is ran, artifacts will be saved in /tmp/nbcollection-ci-artifacts.
Pull Request
nbcollection-ci pull-request is only used inside a pull request. Chances are the only reason to run locally will be to debug code or problematic behaviour of the program.
The logic of the command accepts an URI as input in the command line. Followed by parsing the URI into a nbcollection datatype called RepoType. Checks to see if the URI passed is a Github Pull Request. Then parse the URI, extracting all the meta-data associated with a Github Pull Request. Next making a call to the Github Pull Request API, learning more about the Pull Request to be tested. It then checks to see which files have been explicitly added within the Github Pull Request, and builds notebooks accordingly.
Command Line API
nbcollection-ci pull-request -u https://github.com/spacetelescope/dat_pyinthesky/pull/139
Site Deployment
nbcollection-ci site-deployment takes care of publishing a in Github Pages. Should be installed into the build machinery of the repository with the -w command from nbcorrection-ci generate-ci-env
The logic of the command, starts by checking to see if a Pull Request was created. If yes, stop running because overriding a published website with a Pull Request would lead to an inconsistent User Experience. Through python, the command inspects the .git/config file using GitPython. Checking for a remote which to push to. Then makes a copy of the site/ directory, pushing the published website into a Github Branch.
Command Line API
nbcollection-ci site-deployment -r origin -b gh-pages
Sync Notebooks
nbcollection-ci sync-notebooks provides an automated action to sync notebooks from one repository to the next. Taking in a set of collection and/or categories and copying the notebooks within to a destination folder.
# Copy all notebooks within spacetelescope/dat_pyinthesky/jdat_notebooks to spacetelescope/jdat_notebooks/notebooks
nbcollection-ci sync-notebooks -c jdat_notebooks -d ../jdat_notebooks/notebooks