Lessons in DR
I just suffered a major blow to my home workflow.
A few months ago, I started to have a nagging feeling that I hadn't worked out what to do if my containers failed because of the container orchestration platform itself, rather than anything inside them. Murphy's Law has come to pass, and that exact scenario has occurred: Docker has failed because of an update.
My LXC containers are backed up daily with a retention of 10 daily, 4 weekly, and 4 monthly copies. That has saved me more times than I can count. I also recursively back up /etc on the hypervisor daily. Thinking I was prepared makes this situation all the more stupid.
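For the record, the retention side of this is just vzdump prune settings plus a cron job. A minimal sketch, assuming the backups run through vzdump; the destination paths are illustrative:

    # /etc/vzdump.conf on the PVE host -- 10 daily, 4 weekly, 4 monthly copies
    prune-backups: keep-daily=10,keep-weekly=4,keep-monthly=4

    # /etc/cron.d entry for the recursive /etc backup (destination is illustrative)
    0 2 * * * root tar czf /mnt/backup/pve-etc-$(date +\%F).tar.gz /etc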
Failure response for a broken update inside a container is easy to reason about: restore from a previous version that works and figure out how to get around the problem. Similarly, a non-storage hardware failure can usually be worked around without too much hassle by moving the disks to another system, and any disk that fails can be replaced.
However, I did not anticipate how badly off I'd be if my Docker containers stopped working because of an update to a PVE tool or a major change in the cgroup implementation. This is exactly what happened after a casual update/upgrade this morning.
My Docker containers are each a single docker-compose stack in a very light Alpine Linux LXC container. This lets the setup and its backups integrate into my existing backup scheme. An update to the PVE server caused the Docker instances running inside those LXC containers to lose networking, which effectively made them non-functional. I consequently spent 24 hours without CCTV camera service, photo curation, DNS (!!!), and monitoring (!!!).
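For context, each of those containers is minimal: an unprivileged Alpine LXC with nesting enabled so Docker can run inside it. A sketch of the relevant PVE config, with an illustrative container ID:

    # /etc/pve/lxc/201.conf (ID illustrative) -- the lines that matter for Docker-in-LXC
    ostype: alpine
    unprivileged: 1
    features: keyctl=1,nesting=1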
Interwoven into this fiasco was a misconfigured Telus endpoint device, which I actually had to reset to factory settings and put back into bridge mode. Another story.
The DNS servers (both Technitium) needed to be up right away, not only to block ads and other junk, but also because every internal service here talks to the others via DNS; I don't have IPs configured anywhere if I can help it. So I simply rebuilt them without Docker to get everything else going, but it sparked the notion that I should either be collecting config backups or managing the configs in Ansible.
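Collecting config backups could be as simple as a nightly pull of each server's config directory. A rough sketch; the Technitium paths, hostnames, and destination here are assumptions, not my actual layout:

    # /etc/cron.d entry: pull DNS state from each Technitium host nightly (paths/hosts assumed)
    0 3 * * * root rsync -a dns1:/etc/dns/config/ /mnt/backup/technitium/dns1/
    0 3 * * * root rsync -a dns2:/etc/dns/config/ /mnt/backup/technitium/dns2/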
I may also re-create the monitoring setup without Docker. That leaves me with Immich and Frigate, both of which are only offered as Docker solutions. They really are the best tools for those problems, so I need to fix Docker eventually.
The upside is that I at least had the forethought to keep the data separate from the containers themselves.
Docker Solution 1 - Revert the hosting environment
As the name suggests, I would have to work out which packages changed, revert them, and pin them at the previous versions. The significant downside is that I will eventually have to upgrade them anyway, for security and for compatibility with other PVE packages.
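The mechanics would look roughly like this on the PVE host; the package name and version below are placeholders, not the actual culprit:

    # find out what this morning's upgrade actually touched
    grep -B1 -A3 "Start-Date" /var/log/apt/history.log

    # roll a suspect package back and pin it there (name/version are placeholders)
    apt-get install pve-container=4.4-3
    apt-mark hold pve-container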
Docker Solution 2 - Import old images into new, working LXC
A smart, future-looking person would figure out how to actually get Docker running post-upgrade and take the opportunity to learn how to export and import Docker configs and images across platforms. The downside is the potential time sink of learning how to do that import/export properly.
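The commands themselves are small; the learning curve is everything around them. A sketch, with an illustrative image name:

    # on the broken container: export an image alongside its compose file
    docker save -o frigate.tar ghcr.io/blakeblackshear/frigate:stable

    # on the new, working container: import and bring the stack back up
    docker load -i frigate.tar
    docker compose up -d    # or docker-compose, depending on the version installed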
Docker Solution 3 - Fix broken old ones to work with new system
Another way is to get a new container running Docker successfully, compare it to an old container, and make whatever changes the old containers need. This has the potential to solve my issues in the shortest time, but it also has a major flaw: I might not learn anything about what actually changed, which just kicks the can down the road.
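Given the suspicion about cgroups, the comparison would start with the LXC configs and with which cgroup version each container actually sees. Container IDs here are illustrative:

    # on the PVE host: diff the fresh container's config against a broken one's
    diff /etc/pve/lxc/200.conf /etc/pve/lxc/201.conf

    # inside each container: cgroup2fs means cgroup v2, tmpfs means v1
    stat -fc %T /sys/fs/cgroup/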
Taking the Correct Approach
In this situation, my choice is not hampered by time constraints, because I can afford a few days without photo curation and CCTV. To get the best outcome not only now (getting services back up) but also in the future (planning for the next catastrophe), I need to use Solution 3. It's tempting to look at Solution 2 and import the old images into a new, working Docker LXC, but I can still do that after "rescuing" things back to a working normal.
I need to take a cue from this and try to mitigate future situations like it, which might include a state snapshot of installed packages before the next environment upgrade. In any case, moving forward should include a better understanding of what to do.
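That snapshot could be as simple as dumping the package list with versions before every upgrade, so "what changed?" has an immediate answer next time. Paths and dates below are illustrative:

    # before upgrading: capture the full package list with versions
    dpkg -l > /root/pkg-snapshot-$(date +%F).txt

    # afterwards: diff pre- and post-upgrade snapshots to see exactly what moved
    diff /root/pkg-snapshot-2025-05-01.txt /root/pkg-snapshot-2025-05-02.txt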