netupgrade/state.md

# Repository analysis

## Project overview
- Project name: `netupgrade`
- Primary language: Bash
- Main entrypoint: `bin/netupgrade`
- Project type: interactive CLI administration tool
- Main purpose: orchestrate remote upgrade and maintenance actions on multiple hosts over SSH
- Configuration model: Bash-based configuration files sourced from `~/.config/netupgrade/*.cfg`

## Repository structure
- `bin/netupgrade`: main executable script containing CLI parsing, node selection, remote execution, and logging
- `config/netupgrade/*.cfg`: sample configuration files defining host groups and action sequences
- `README.md`: installation and usage documentation
- `docs/`: project documentation

## Functional behavior
The tool:
1. Loads a Bash configuration file
2. Expects a `NODES` array populated with entries formatted like:
   `host;display-name;action1;action2;...`
3. Displays an interactive multi-select checklist using `whiptail`
4. Executes the selected actions on each selected host through SSH, using `root@host` by default or `SSH_USER@host` when configured
5. Writes execution logs to `~/netupgrade.log`
6. Opens the log with `$EDITOR` when available, otherwise `nano`, `vi`, or `less`; if none is available, it prints the log path, then optionally removes the log file

Supported action types currently include:
- `apt`
- `yum`
- `pkg`
- `pacman`
- `apk`
- `reboot`
- `cmd:<remote command>`
- `docker-stacks:<directory>`

## Architecture notes
- The project is intentionally lightweight and script-based
- Configuration is code-driven rather than declarative, since config files are sourced as shell files
- The entire execution flow currently lives in a single Bash script
- Remote operations are performed sequentially, not in parallel
- Logging is file-based and coupled directly to command execution

## Strengths
- Very small and easy to deploy
- Clear practical purpose for system administration workflows
- Flexible host/action configuration model
- Supports several Linux/BSD package managers
- Suitable for use from a bastion host or admin workstation

## Main issues identified

### 1. Documentation accuracy problems
- `README.md` and CLI help were updated to better match current behavior
- The previous typo in the configuration path (`netuprade`) has been fixed
- The unsupported `-b` option was removed from the displayed help
- The configuration format and supported actions are now documented in more detail

### 2. Shell robustness concerns
- Config files are sourced directly, which is flexible but implies arbitrary code execution
- The script still has some quoting-sensitive areas and does not use a stricter shell safety baseline
- The `NODES` parsing was hardened to split on `;` with `IFS`/`read -r -a`, which now preserves spaces in action values such as `cmd:...`

### 3. Remote execution correctness and safety
- SSH execution now goes through a dedicated `runSSH` helper and the SSH user is configurable via `SSH_USER`, defaulting to `root`
- `cmd:<...>` intentionally allows arbitrary remote command execution and is now executed through a remote shell, which improves support for shell operators but remains a powerful unsafe feature
- The `pacman` orphan-removal command was corrected so orphan detection happens on the remote host instead of locally
- The `docker-stacks` remote loop was rewritten to pass the stack root as an argument to a remote shell script, improving quoting and path handling

### 4. UX and dependency issues
- Required runtime dependencies are now checked at startup (`ssh`, `whiptail`, `cat`, `tee`, `rm`, `touch`)
- Log viewing no longer depends strictly on `nano`; the script now falls back to `$EDITOR`, then `nano`, `vi`, or `less`
- If no supported log viewer is available, execution continues and the log path is shown
- The workflow is highly interactive and not well suited for automation
- `rm -i` introduces an extra prompt even when the rest of the flow is meant to be streamlined

### 5. Error handling limitations
- Error propagation is inconsistent depending on the action type
- Cleanup commands often do not affect the final failure state
- The script continues through action sequences without a documented policy
- The advertised “break on error” behavior does not exist yet

### 6. Maintainability limitations
- Most logic is concentrated in one script
- There is duplication in package-manager handling
- No tests or validation tooling are present in the repository
- Some wording, typos, and naming inconsistencies reduce clarity

## Recommended direction

### Short term
- Tackle the next hardening work as small, reviewable commits instead of one broad patch
- Define and document an explicit error-handling policy: what is fatal, what is best-effort, and whether host action sequences continue after a failure
- Review the remaining quoting-sensitive areas, especially around remote shell command construction for `cmd:<...>` and `docker-stacks:<...>`
- Align `README.md`, CLI help, and runtime behavior on dependencies and interaction details, especially the actual meaning of `-f`, current log-file behavior, and whether `sed` is still required
- Review sample configuration files for naming or targeting inconsistencies that could lead to operator mistakes

### Medium term
- Make SSH user, log path, and editor configurable
- Improve non-interactive usage options with explicit batch-friendly flags such as a true `--all`, `--no-log-view`, `--keep-log`, and custom log-path support
- Standardize error handling and exit codes with a documented policy for best-effort cleanup steps versus fatal failures
- Add lightweight validation or warnings around sourced config files, for example around trust expectations or overly permissive file permissions
- Consider adopting a clearer shell option baseline such as an explicit global `pipefail` policy

### Long term
- Refactor the script into smaller functions with less duplication
- Add shell linting guidance and automation (for example ShellCheck, optionally shfmt)
- Consider a safer declarative configuration format if the project grows
- Add test coverage for parsing, command construction, and non-regression around SSH quoting behavior

## Recent changes
- `README.md` was expanded to document installation, requirements, usage, configuration format, and supported actions
- `bin/netupgrade` help output was aligned with actual CLI behavior and now documents `--help`
- A startup dependency check was added before loading configuration or opening the interactive selector
- Log viewer selection was made more flexible: `$EDITOR` is preferred, then `nano`, `vi`, or `less`
- `nano` is no longer a strict runtime dependency
- The unsupported `-b` option remains unimplemented and is no longer shown in help output
- `NODES` parsing was hardened to preserve spaces in action values by splitting on `;` with `IFS` and `read -r -a`
- SSH calls were centralized through a `runSSH` helper and `SSH_USER` is now configurable, defaulting to `root`
- The `pacman` orphan cleanup now runs entirely on the remote host instead of evaluating orphan detection locally
- The `pacman` orphan cleanup remote command now avoids nested `bash -lc` argument-passing issues by selecting between two simple remote `sh -c` commands, one with `--noconfirm` and one without
- The `docker-stacks` action uses a remote shell script sent over SSH stdin, with the stack directory exported as a remote environment assignment before `bash -s`, to keep path handling working after recent SSH command-construction changes
- Unknown actions and reboot SSH failures now propagate error status more consistently
- A focused code review identified the next recommended work items and suggested splitting them into separate commits rather than combining them in one larger hardening change
- `whiptail` checklist defaults are now passed explicitly as `ON`/`OFF`, and selected items are parsed through a dedicated helper instead of relying on raw shell word splitting
- The CLI help and README now clarify that `-f` preselects all nodes in the interactive checklist
- Log summary generation no longer uses `sed -i` interpolation; the script now writes a temporary file with the summary header plus the existing log content and replaces the original log atomically
- The `apk` action no longer passes `-y` to `apk upgrade`, because current Alpine `apk` does not accept that option there; `-y` remains a best-effort flag for other supported package managers
- The `apt` action now uses `apt-get autoremove --purge` and no longer runs `apt-get purge` without arguments, which makes the cleanup step more meaningful and avoids a misleading command in the log
- The `pacman` action was further hardened by simplifying orphan cleanup command construction, reducing quoting-related regressions while still skipping removal when no orphan packages are present

## Change guidance
- Preserve backward compatibility for existing config files where possible
- Prefer incremental hardening over a full rewrite
- Keep the tool simple and admin-friendly
- Split behavioral fixes into small logical commits when possible, for example: selection handling, log generation, package-manager cleanup semantics, error-policy changes, and non-interactive CLI improvements
- Be cautious with changes to remote command construction, as quoting changes can introduce regressions
- Treat documentation and behavior alignment as part of functional quality, not as a separate cleanup task
- Avoid introducing a global `set -euo pipefail` baseline in one step without first documenting and testing the expected failure semantics

## Suggested review focus for future changes
- Correctness of package-manager cleanup commands and confirmation flags
- Correctness of remote command execution
- Safe quoting and shell expansion behavior
- Compatibility of config format with existing user setups
- Error-handling policy consistency across action types
- Package-manager command correctness and cleanup-step behavior
- Usability in both interactive and semi-automated contexts
- Documentation consistency with actual runtime behavior and dependencies
- Review of sample config accuracy to avoid mislabelled hosts or risky operator confusion

## Additional recommendations from latest review
- Highest priority should go to defining an explicit execution and failure policy, because it currently affects operator trust more than missing features do
- The next highest priority should be protecting against regressions in SSH command construction by documenting manual test cases for commands with spaces, pipes, redirections, `&&`, `||`, and quoted arguments
- A small CLI usability pass would have strong value: `-f` currently only preselects nodes in `whiptail`, so a true non-interactive selection mode would improve automation without changing the overall project model
- The dependency list should be rechecked: `README.md` still mentions `sed`, while the current implementation no longer appears to require it after the log-summary rewrite
- The sample configuration set should be reviewed for consistency; for example, duplicate or mismatched display names attached to different IPs increase the risk of accidental operations on the wrong host
- Shell quality improvements should favor linting, targeted helpers, and incremental refactors before any broad strict-mode changes
- Future testing should focus first on parser behavior, command construction, and result reporting rather than trying to build a large end-to-end framework immediately