prune fails because of unmounted datasets #35

Closed
einsiedlerkrebs opened this issue Mar 22, 2023 · 4 comments

Comments

@einsiedlerkrebs
Contributor

einsiedlerkrebs commented Mar 22, 2023

I am experiencing an issue where pots are not cleaned up after a hard reboot of a nomad node, and as a result the jobs fail.

When the system comes back up, the ZFS datasets are not mounted in place, so a pot's configuration file cannot be found.
This makes the prune command fail, which in turn makes it impossible to run "prepare" in nomad. For this reason the node is not reboot safe.

To reproduce:

  • on a single-server nomad node with running services (via pot), run the reboot command

  • after the system is back up, observe that the desired services are not running

  • list the pot datasets with zfs list

  • mount each service-related dataset and its child datasets

  • run pot prune (see the command sketch after this list)

  • trigger a fresh service start on the nomad node (either by setting count to 0 and back to 1, or by removing the database)

  • the service should now be working again
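
The mount-and-prune part of this (steps three to five) looks roughly like the sketch below; zdata/pot/jails is the pot ZFS root on my system, and <pot-name> is a placeholder for an affected pot:

# list pot datasets and whether they are mounted (zdata/pot/jails is an example path)
zfs list -r -o name,mounted zdata/pot/jails

# mount a service dataset and everything below it (<pot-name> is a placeholder)
zfs list -rH -o name zdata/pot/jails/<pot-name> | xargs -L 1 zfs mount

# clean up the leftover pots
pot prune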

@einsiedlerkrebs changed the title from "prune fails because of unmounted filesystems" to "prune fails because of unmounted datasets" on Mar 22, 2023
@grembo
Contributor

grembo commented Mar 24, 2023

Thanks for opening this issue.

What does your system config look like? At a minimum:

cat /etc/rc.conf
zpool status

(after reboot - redact as necessary)

@einsiedlerkrebs
Contributor Author

nomad_enable="True"
nomad_user="root"
nomad_debug="YES"
nomad_env="PATH=/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/sbin:/bin"
nomad_dir="/var/tmp/nomad"
nomad_args="-config=/opt/hashicorp/nomad-agent.hcl"

zpool status shows both pools online.

My fix for the issue is:

#!/bin/sh
# After hard crash of a nomad node, remaining pots can't be pruned since their datasets are not mounted.
# This mounts the datasets and prunes pots.

zfs list -rH -o name zdata/pot/jails | xargs -L 1 zfs mount && logger -t pot_cleanup mounting all pot datasets || \
 logger -t pot_cleanup failed to mount all pot datasets
pot prune && logger -t pot_cleanup pruning pots || logger -t pot_cleanup could not prune all pots

This is supported by an rc script which runs it once before nomad; a sketch follows below.
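
A minimal rc.d sketch of that idea, assuming the mount-and-prune commands above are saved as /usr/local/sbin/pot_cleanup.sh (the script name, path, and service name are placeholders):

#!/bin/sh
# PROVIDE: pot_cleanup
# REQUIRE: zfs
# BEFORE: nomad

. /etc/rc.subr

name="pot_cleanup"
rcvar="pot_cleanup_enable"
start_cmd="pot_cleanup_start"
stop_cmd=":"

pot_cleanup_start()
{
        # mount leftover pot datasets and prune stale pots before nomad starts
        /usr/local/sbin/pot_cleanup.sh
}

load_rc_config $name
: ${pot_cleanup_enable:="NO"}
run_rc_command "$1"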

@grembo
Contributor

grembo commented Mar 28, 2023

@einsiedlerkrebs any reason you didn’t enable zfs in rc.conf? This can, e.g., be done using the service command:

# service zfs enable

It would take care of mounting zfs file systems on boot.
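
For reference, this is equivalent to setting the corresponding flag in /etc/rc.conf directly:

# in /etc/rc.conf
zfs_enable="YES"   # mount ZFS file systems at boot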

@einsiedlerkrebs
Contributor Author

Yes, indeed this solved the issue. Thanks.
