Commit fca27b8: Using mkdocs to generate documentation (#55)

1 parent 68dffae
16 files changed: +197 −70 lines

.github/workflows/mkdocs_deploy.yml (new file, +30)

```yaml
name: deploy documentation (only on push to main branch)
on:
  push:
    branches: main
# Declare default permissions as read only.
permissions: read-all
jobs:
  build:
    runs-on: ubuntu-22.04
    permissions:
      # Need to be able to write to the deploy branch
      contents: write
    steps:
      - name: checkout
        uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11 # v4.1.1
        with:
          fetch-depth: 0 # need to fetch all history to ensure correct Git revision dates in docs

      - name: set up Python
        uses: actions/setup-python@0a5c61591373683505ea898e09a3ea4f39ef2b9c # v5.0.0
        with:
          python-version: '3.10'

      - name: install mkdocs + plugins
        run: |
          pip install mkdocs mkdocs-material
          pip list | grep mkdocs
          mkdocs --version
      - name: build
        run: mkdocs build --strict && mkdocs gh-deploy --force
```
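The install step above pulls whatever mkdocs versions pip resolves at run time. A common hardening step (purely a suggestion, not part of this commit; the file name and version numbers below are illustrative) is to pin the documentation dependencies in a requirements file:

```
# docs/requirements.txt (hypothetical file; versions are examples only)
mkdocs==1.5.3
mkdocs-material==9.5.3
```

The install step would then become `pip install -r docs/requirements.txt`, making CI builds reproducible.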

.github/workflows/mkdocs_test.yml (new file, +30)

```yaml
name: build documentation
on: [push, pull_request]
# Declare default permissions as read only.
permissions: read-all
jobs:
  build:
    runs-on: ubuntu-22.04
    steps:
      - name: checkout
        uses: actions/checkout@b4ffde65f46336ab88eb53be808477a3936bae11 # v4.1.1

      - name: set up Python
        uses: actions/setup-python@0a5c61591373683505ea898e09a3ea4f39ef2b9c # v5.0.0
        with:
          python-version: '3.10'

      # - name: Markdown Linting Action
      #   uses: avto-dev/[email protected]
      #   with:
      #     rules: '/lint/rules/changelog.js'
      #     config: '/lint/config/changelog.yml'
      #     args: '.'

      - name: install mkdocs + plugins
        run: |
          pip install mkdocs mkdocs-material
          pip list | grep mkdocs
          mkdocs --version
      - name: build
        run: mkdocs build --strict
```

.gitignore (+1)

```diff
@@ -146,3 +146,4 @@ public.cert
 idp_metadata.xml
 
 .DS_Store
+.site/
```

README.md (+4 −48)

````diff
@@ -3,20 +3,15 @@
 
 [![DOI](https://zenodo.org/badge/549763009.svg)](https://zenodo.org/badge/latestdoi/549763009)
 
-This web portal is intended to give HPC users a view of the overall use of the HPC cluster and their use. This portal is using the information collected on compute nodes and management servers to produce the information in the various modules:
+This web portal is intended to give HPC users a view of the overall use of the HPC cluster and their use. This portal uses the information collected on compute nodes and management servers to produce the information in the various modules:
 
-* [jobstats](docs/jobstats.md)
-* [accountstats](docs/accountstats.md)
-* [cloudstats](docs/cloudstats.md)
-* [quotas](docs/quotas.md)
-* [top](docs/top.md)
-* [usersummary](docs/usersummary.md)
+[Documentation](docs/index.md)
 
 Some examples of the available graphs are displayed in the documentation of each module.
 
 This portal is made to be modular, some modules can be disabled if the data required is not needed or collected. Some modules have optional dependencies and will hide some graphs if the data is not available.
 
-This portal also supports Openstack, the users can see their use without having to install a monitoring agent in their VM in their OpenStack VMs.
+This portal also supports OpenStack, the users can see their use without having to install a monitoring agent in their VM in their OpenStack VMs.
 
 Staff members can also see the use of any users to help them optimize their use of HPC and OpenStack clusters.
 
@@ -26,7 +21,7 @@ Some information collected is also available for the general public like the num
 ## Design
 Performance metrics are stored in Prometheus, multiple exporters are used to gather this data, and most are optional.
 
-The Django portal will also access various MySQL databases like the database of Slurm and Robinhood (if installed) to gather some information. Timeseries are stored with Prometheus for better performance. Compatible alternatives to Prometheus like Thanos, VictoriaMetrics, and Grafana Mimir should work without any problems (Thanos is used in production). Recorder rules in Prometheus are used to pre-aggregate some stats for the portal.
+The Django portal will also access various MySQL databases like the database of Slurm and Robinhood (if installed) to gather some information. Time series are stored with Prometheus for better performance. Compatible alternatives to Prometheus like Thanos, VictoriaMetrics, and Grafana Mimir should work without any problems (Thanos is used in production). Recorder rules in Prometheus are used to pre-aggregate some stats for the portal.
 
 ![Architecture diagram](docs/userportal.png)
 
@@ -35,42 +30,3 @@ Various data sources are used to populate the content of this portal. Most of th
 Some pre-aggregation is done using recorder rules in Prometheus. The required recorder rules are documented in the data sources documentation.
 
 [Data sources documentation](docs/data.md)
-
-## Test environment
-A test environment using the local `uid` resolver and dummies allocations is provided to test the portal.
-
-To use it, copy `example/local.py` to `userportal/local.py`. The other functions are documented in `common.py` if any other overrides are needed for your environment.
-
-To quickly test and bypass authentication, add this line to `userportal/settings/99-local.py`. Other local configuration can be added in this file to override the default settings.
-
-```
-AUTHENTICATION_BACKENDS.insert(0, 'userportal.authentication.staffRemoteUserBackend')
-```
-
-This bypasses the authentication and will use the `REMOTE_USER` header or env variable to authenticate the user. This is useful to be able to try the portal without having to set up a full IDP environment. The REMOTE_USER method can be used when using some IDP such as Shibboleth. SAML2 based IDP is now the preferred authentication method for production.
-
-Examine the default configuration in `userportal/settings/` and override any settings in `99-local.py` as needed.
-
-Then you can launch the example server with:
-
-```
-[email protected] [email protected] python manage.py runserver
-```
-
-This will run the portal with the user `someuser` logged in as a staff member.
-
-Automated Django tests are also available, they can be run with:
-
-```
-python manage.py test
-```
-
-This will test the various modules, including reading job data from the Slurm database and Prometheus. A temporary database for Django is created automatically for the tests. Slurm and Prometheus data are read directly from production data with a read-only account. A representative user, job and account need to be defined to be used in the tests, check the `90-tests.py` file for an example.
-
-## Production install
-The portal can be installed directly on a Centos7 or Rocky8 Apache web server or with Nginx and Gunicorn. The portal can also be deployed as a container with Podman or Kubernetes. Some scripts used to deploy both Nginx and Django containers inside the same pod are provided in the `podman` directory.
-The various recommendations for any normal Django production deployment can be followed.
-
-[Deploying Django](https://docs.djangoproject.com/en/3.2/howto/deployment/)
-
-[Install documentation](docs/install.md)
````

docs/accountstats.md (+3 −1)

````diff
@@ -1,7 +1,9 @@
 # Accountstats
 The users can also see the aggregated use of the users in the same group. This also shows the current priority of this account in Slurm and a few months of history on how much computing resources were used.
 
-<a href="accountstats.png"><img src="accountstats.png" alt="Stats per account" width="100"/></a>
+## Screenshots
+### Account stats
+![Stats per account](accountstats.png)
 
 ## Requirements
````

docs/cloudstats.md (+7 −3)

````diff
@@ -1,9 +1,13 @@
 # Cloudstats
 The stats of the VM running on Openstack can be viewed. This is using the stats of libvirtd, no agent needs to be installed in the VM. There is an overall stats page available for staff. The page per project and VM is also available for the users.
 
-<a href="cloudstats.png"><img src="cloudstats.png" alt="Overall use" width="100"/></a>
-<a href="cloudstats_rpoject.png"><img src="cloudstats_project.png" alt="Use within a project" width="100"/></a>
-<a href="cloudstats_vm.png"><img src="cloudstats_vm.png" alt="Use within a VM" width="100"/></a>
+## Screenshots
+### Overall use
+![Overall use](cloudstats.png)
+### Use within a project
+![Use within a project](cloudstats_project.png)
+### Use within a VM
+![Use within a VM](cloudstats_vm.png)
 
 ## Requirements
````

docs/data.md (+4 −4)

````diff
@@ -1,5 +1,5 @@
 # Data sources
-Some features will not be available if the exporter required to gather the stats is not configured.
+The main requirement to monitor a Slurm cluster is to install slurm-job-exporter and open a read-only access to the Slurm MySQL database. Other data sources in this page can be installed to gather more data.
 
 ## slurm-job-exporter
 [slurm-job-exporter](https://github.com/guilbaults/slurm-job-exporter) is used to capture information from cgroups managed by Slurm on each compute node. This gathers CPU, memory, and GPU utilization.
@@ -47,12 +47,12 @@ groups:
     expr: sum(label_replace(deriv(slurm_job_process_usage_total{}[1m]) > 0, "bin", "$1", "exe", ".*/(.*)")) by (cluster, account, bin)
 ```
 
-## slurm-exporter
-[slurm-exporter](https://github.com/guilbaults/prometheus-slurm-exporter/tree/osc) is used to capture stats from Slurm like the priority of each user. This portal is using a fork, branch `osc` in the linked repository. This fork support GPU reporting and sshare stats.
-
 ## Access to the database of slurmacct
 This MySQL database is accessed by a read-only user. It does not need to be in the same database server where Django is storing its data.
 
+## slurm-exporter
+[slurm-exporter](https://github.com/guilbaults/prometheus-slurm-exporter/tree/osc) is used to capture stats from Slurm like the priority of each user. This portal is using a fork, branch `osc` in the linked repository. This fork support GPU reporting and sshare stats.
+
 ## lustre\_exporter and lustre\_exporter\_slurm
 Those 2 exporters are used to gather information about Lustre usage.
````
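The `expr:` context line in the data.md hunk only makes sense inside a complete Prometheus recording-rules file. A minimal sketch of such a file follows; the group name, interval, and `record:` name are assumptions for illustration, only the `expr` comes from the diff:

```yaml
# Hypothetical recording-rules file; group name, interval, and record
# name are illustrative. Only the expr line appears in this commit.
groups:
  - name: slurm-job-exporter
    interval: 1m
    rules:
      - record: slurm_job:process_usage:sum
        expr: sum(label_replace(deriv(slurm_job_process_usage_total{}[1m]) > 0, "bin", "$1", "exe", ".*/(.*)")) by (cluster, account, bin)
```

Prometheus evaluates the `expr` on each interval and stores the result under the `record:` name, which is what lets the portal query pre-aggregated stats cheaply.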

docs/development.md (new file, +29)

````markdown
A test and development environment using the local `uid` resolver and dummy allocations is provided to test the portal.

To use it, copy `example/local.py` to `userportal/local.py`. The other functions are documented in `common.py` if any other overrides are needed for your environment.

To quickly test and bypass authentication, add this line to `userportal/settings/99-local.py`. Other local configuration can be added in this file to override the default settings.

```
AUTHENTICATION_BACKENDS.insert(0, 'userportal.authentication.staffRemoteUserBackend')
```

This bypasses the authentication and will use the `REMOTE_USER` header or env variable to authenticate the user. This is useful to be able to try the portal without having to set up a full IDP environment. The REMOTE_USER method can be used when using some IDP such as Shibboleth. SAML2 based IDP is now the preferred authentication method for production.

Examine the default configuration in `userportal/settings/` and override any settings in `99-local.py` as needed.

Then you can launch the example server with:

```
[email protected] [email protected] python manage.py runserver
```

This will run the portal with the user `someuser` logged in as a staff member.

Automated Django tests are also available; they can be run with:

```
python manage.py test
```

This will test the various modules, including reading job data from the Slurm database and Prometheus. A temporary database for Django is created automatically for the tests. Slurm and Prometheus data are read directly from production data with a read-only account. A representative user, job and account need to be defined to be used in the tests; check the `90-tests.py` file for an example.
````

docs/index.md (new file, +11)

```markdown
# TrailblazingTurtle

# Introduction
TrailblazingTurtle is a web portal for HPC clusters. It is designed to be a single point of entry for users to access information about the cluster, their jobs, and the performance of the cluster. It is designed to be modular, so that it can be easily extended to support new features.

# Design
The Django portal will access various MySQL databases like the database of Slurm and Robinhood (if installed) to gather some information.

Time series are stored with Prometheus for better performance. Compatible alternatives to Prometheus like Thanos, VictoriaMetrics, and Grafana Mimir should work without any problems (Thanos is used in production). Recorder rules in Prometheus are used to pre-aggregate some stats for the portal.

![Architecture diagram](userportal.png)
```

docs/install.md (+9)

````diff
@@ -1,3 +1,12 @@
+# Installation
+
+Before installing in production, [a test environment should be set up to test the portal](development.md). This makes it easier to fully configure each module and modify as needed some functions like how the allocations are retrieved. Installing Prometheus and some exporters is also recommended to test the portal with real data.
+
+The portal can be installed directly on a Rocky8 Apache web server or with Nginx and Gunicorn. The portal can also be deployed as a container with Podman or Kubernetes. Some scripts used to deploy both Nginx and Django containers inside the same pod are provided in the `podman` directory.
+The various recommendations for any normal Django production deployment can be followed.
+
+[Deploying Django](https://docs.djangoproject.com/en/3.2/howto/deployment/)
+
 # Production without containers on Rocky Linux 8
 
 RPMs required for production
````

docs/jobstats.md (+5 −2)

````diff
@@ -1,8 +1,11 @@
 # Jobstats
 Each user can see their current uses on the cluster and a few hours in the past. The stats for each job are also available. Information about CPU, GPU, memory, filesystem, InfiniBand, power, etc. is also available per job. The submitted job script can also be collected from the Slurm server and then stored and displayed in the portal. Some automatic recommendations are also given to the user, based on the content of their job script and the stats of their job.
 
-<a href="user.png"><img src="user.png" alt="Stats per user" width="100"/></a>
-<a href="job.png"><img src="job.png" alt="Stats per job" width="100"/></a>
+## Screenshots
+### User stats
+![Stats per user](user.png)
+### Job stats
+![Stats per job](job.png)
 
 ## Requirements
 * Access to the database of Slurm
````

docs/nodes.md (+5 −2)

````diff
@@ -1,8 +1,11 @@
 # Nodes
 This main page present the list of nodes in the cluster with a small graph representing the cores, memory and localdisk used. Each node has a link to a detailed page with more information about the node similar to the jobstats page.
 
-<a href="nodes_list.png"><img src="nodes_list.png" alt="Nodes in the cluster with a small graph for each" width="100"/></a>
-<a href="nodes_details.png"><img src="nodes_details.png" alt="Detailed stats for a node" width="100"/></a>
+## Screenshots
+### Nodes list
+![Nodes in the cluster with a small trend graph for each](nodes_list.png)
+### Node details
+![Detailed stats for a node](nodes_details.png)
 
 ## Requirements
 * Access to the database of Slurm
````

docs/quotas.md (+5 −4)

````diff
@@ -1,11 +1,12 @@
 # Quotas
 Each user can see their current storage allocations and who within their group is using the group quota.
 
-<a href="quota.png"><img src="quota.png" alt="Quotas" width="100"/></a>
+## Screenshots
+### Quotas
+![Quotas](quota.png)
 
-Info about the HSM state (Tape) is also available.
-
-<a href="hsm.png"><img src="hsm.png" alt="HSM" width="100"/></a>
+### HSM
+![HSM](hsm.png)
 
 ## Requirements
 * Read-only access to the databases of Robinhood
````

docs/top.md (+12 −4)

````diff
@@ -5,10 +5,18 @@ These pages are only available to staff and are meant to visualize poor cluster
 * Jobs on large memory nodes (ranked by worst to best)
 * Top users on Lustre
 
-<a href="top_compute.png"><img src="top_compute.png" alt="Top compute user (CPU)" width="100"/></a>
-<a href="top_compute_gpu.png"><img src="top_compute_gpu.png" alt="Top compute user(GPU)" width="100"/></a>
-<a href="top_largemem.png"><img src="top_largemem.png" alt="Jobs on large memory nodes" width="100"/></a>
-<a href="top_lustre.png"><img src="top_lustre.png" alt="Top users on Lustre" width="100"/></a>
+## Screenshots
+### Top compute user (CPU)
+![Top compute user (CPU)](top_compute.png)
+
+### Top compute user (GPU)
+![Top compute user (GPU)](top_compute_gpu.png)
+
+### Jobs on large memory nodes
+![Jobs on large memory nodes](top_largemem.png)
+
+### Top users on Lustre
+![Top users on Lustre](top_lustre.png)
 
 ## Requirements
 * Access to the database of Slurm
````

docs/usersummary.md (+4 −2)

````diff
@@ -1,7 +1,9 @@
-# Usersummary
+# User Summary
 The usersummary page can be used for a quick diagnostic of a user to see their current quotas and last jobs.
 
-<a href="usersummary.png"><img src="usersummary.png" alt="Quotas and jobs of a user" width="100"/></a>
+## Screenshots
+### Quotas and jobs of a user
+![Quotas and jobs of a user](usersummary.png)
 
 ## Requirements
 * Access to the database of Slurm
````

mkdocs.yml (new file, +38)

```yaml
site_name: TrailblazingTurtle
repo_url: https://github.com/guilbaults/TrailblazingTurtle/
nav:
  - 'Home': index.md
  - 'Data collection': data.md
  - 'Development': development.md
  - 'Installation': install.md
  - 'Modules':
    - 'Job Stats': jobstats.md
    - 'Top': top.md
    - 'User Summary': usersummary.md
    - 'Account Stats': accountstats.md
    - 'Cloud Stats': cloudstats.md
    - 'Nodes': nodes.md
    - 'Quotas': quotas.md
    - 'Quotas GPFS': quotasgpfs.md
    - 'CF Access': cfaccess.md

theme:
  name: material
  # logo: img/logo.png
  features:
    # enable button to copy code blocks
    - content.code.copy
plugins:
  - search
markdown_extensions:
  # allow for arbitrary nesting of code and content blocks
  - pymdownx.superfences:
  # syntax highlighting in code blocks and inline code
  - pymdownx.highlight
  # support for (collapsible) admonitions (notes, tips, etc.)
  - admonition
  - pymdownx.details
  # icon + emoji
  # - pymdownx.emoji:
  #     emoji_index: !!python/name:material.extensions.emoji.twemoji
  #     emoji_generator: !!python/name:material.extensions.emoji.to_svg
```
