Article — 5 minutes read

From Proxmox to NetBox: full infrastructure automation with Foundry

Our self-built 'workflow engine' Foundry fully automates our infrastructure - across 8 systems, from virtualisation to IPAM.

Foundry was nominated for a LarAward during LaraFest 2024: an award for the most technically challenging Laravel-based project. During a 10-minute presentation, Roy and William explained Foundry, and why it's built with Laravel.

The presentation had to be cut short: Foundry is not a simple project. Here's another attempt at explaining full infrastructure automation with Foundry.

What is Foundry?

Foundry is a so-called 'workflow engine':

  • The user (either a person or a system) starts an action, such as 'create VM', with certain parameters. 
  • Within the created workflow, fixed tasks are run.
  • If the workflow fails, the user can retry/resume the workflow.

In this article, we're going to look at:

  • Foundry's custom-built workflow system.
  • The 8 systems that Foundry integrates with (including Proxmox and NetBox).
  • Scheduled tasks.
  • Automatic reusability (across GUI, CLI and API) with auto-registering tasks.
  • Dynamic GUI dropdowns with datasets.

Let's dive in.

This is the 'create VM' workflow:

Foundry workflow: create VM

Cyberfusion is not the only organisation with such complex infrastructure workflows. But few go this far in automating them. Why do we?

  • We strive for perfection. Many organisations attempt to streamline complex workflows using 'work instructions', but those instructions must be followed manually, which leaves room for human mistakes.
  • Our 'fully-managed infrastructure' service Cyberfusion Core requires an API to deploy infrastructure.
  • As we grow, our infrastructure - and its processes - will become even more complex. Now is the time to start working in the most structured way possible.

Foundry kills three birds with one stone:

  • Well-defined workflows: no human mistakes.
  • Full automation.
  • An internal API.

Workflows, tasks, actions, processes, steps: how does it work?

Foundry has 4 core concepts:

  • Workflows. For example: 'create VM example.com'. A workflow is created by an action, and consists of one or more tasks.
  • Actions. For example: 'create VM'. An action always leads to a specific workflow.
  • Tasks. For example: 'create VM in NetBox'. Workflows consist of one or more tasks.
  • Domains. For example: 'VM'. All actions are in a specific domain.
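
To make the relationship between these four concepts concrete, here is a minimal sketch. Foundry itself is written in PHP with Laravel; Python is used here only for brevity. The class layout, field names and the 'create VM in Proxmox' task are assumptions for illustration, not Foundry's actual code.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Task:
    description: str  # e.g. "create VM in NetBox"


@dataclass
class Workflow:
    title: str  # e.g. "create VM example.com"
    tasks: list[Task] = field(default_factory=list)


@dataclass(frozen=True)
class Action:
    domain: str  # every action lives in exactly one domain, e.g. "VM"
    name: str  # e.g. "create"
    task_descriptions: tuple[str, ...]

    def start(self, subject: str) -> Workflow:
        """An action always leads to one specific workflow."""
        tasks = [Task(description) for description in self.task_descriptions]
        return Workflow(f"{self.name} {self.domain} {subject}", tasks)


# "create VM in Proxmox" is a hypothetical second task.
create_vm = Action("VM", "create", ("create VM in NetBox", "create VM in Proxmox"))
workflow = create_vm.start("example.com")
```

The action is the reusable definition; the workflow is one concrete run of it for a given subject.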

Foundry has 63 actions. Such as:

  • Create VM
  • Create FHRP group (e.g. for VRRP)
  • Create Infscape client (backups)
  • Create Sensu user (real-time monitoring)
  • Create VLAN

As an example, let's look at the 'create VM' action, which is in the 'VM' domain:

'Create' action in 'VM' domain

The action receives a `VmCreateRequest` - inheriting from `FormRequest` to do basic validation.

After validation, the request returns a DTO - which contains all data to pass to the action.
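
The request-to-DTO step could look roughly like the sketch below. It is written in Python for brevity, whereas Foundry uses Laravel's `FormRequest`. The `VmCreateData` name, the `hostname` and `memory_mb` fields and the validation rules are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VmCreateData:
    """DTO passed to the action (hypothetical field names)."""

    hostname: str
    memory_mb: int


class VmCreateRequest:
    """Stand-in for a FormRequest: basic, self-contained validation only."""

    def __init__(self, payload: dict):
        self.payload = payload

    def errors(self) -> dict[str, str]:
        errors = {}
        hostname = self.payload.get("hostname")
        if not isinstance(hostname, str) or not hostname:
            errors["hostname"] = "must be a non-empty string"
        memory_mb = self.payload.get("memory_mb")
        if not isinstance(memory_mb, int) or memory_mb <= 0:
            errors["memory_mb"] = "must be a positive integer"
        return errors

    def to_dto(self) -> VmCreateData:
        errors = self.errors()
        if errors:
            raise ValueError(errors)
        return VmCreateData(self.payload["hostname"], self.payload["memory_mb"])


dto = VmCreateRequest({"hostname": "example.com", "memory_mb": 2048}).to_dto()
```

Because the action only ever sees the DTO, it does not need to know whether the data arrived via the API, CLI or GUI.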

"I heard you like automation, so I automated your automation": reusable actions for API, CLI and GUI

When we started building Foundry, we divided actions between interfaces: some should be available in the API, some in the CLI, some in the GUI. After all, some things you never do manually, and some things you never do automatically.

Quickly, we realised that it's more practical for all actions to be available via all three interfaces. But for over 60 actions, we don't want to repeat business logic thrice.

The solution: auto-registering actions. Based on every action's request, Foundry auto-generates an API endpoint, console command and a GUI form.
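
The idea of deriving all three interfaces from one definition can be sketched as follows (Python for brevity, not Foundry's PHP). The `Field` class, the `domain:action` command naming and the shape of the generated structures are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    label: str
    required: bool = True


@dataclass(frozen=True)
class Action:
    domain: str
    name: str
    fields: tuple[Field, ...]  # derived from the action's request


def api_body(action: Action) -> dict:
    """Request body description for the generated endpoint."""
    return {
        "properties": [f.name for f in action.fields],
        "required": [f.name for f in action.fields if f.required],
    }


def console_command(action: Action) -> dict:
    """Console command with one option per request field (naming is assumed)."""
    return {
        "command": f"{action.domain}:{action.name}",
        "options": [f"--{f.name}" for f in action.fields],
    }


def gui_form(action: Action) -> dict:
    """Form definition with one input per request field."""
    return {
        "title": f"{action.name} {action.domain}",
        "inputs": [(f.label, f.name) for f in action.fields],
    }


def register(action: Action) -> dict:
    """One definition in, three interfaces out."""
    return {
        "api": api_body(action),
        "cli": console_command(action),
        "gui": gui_form(action),
    }


create_vm = Action(
    "vm",
    "create",
    (Field("hostname", "Hostname"), Field("tenant", "Tenant", required=False)),
)
interfaces = register(create_vm)
```

Adding a field to the request then changes all three interfaces at once, without touching interface-specific code.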

First, the GUI:

GUI

GUI forms are fully auto-generated from actions and their requests.

A quick note on dropdowns.

Some dropdowns are not fixed. For example, values in the 'Tenant' dropdown are dynamically retrieved from NetBox.

Values in dynamic dropdowns are retrieved from a 'dataset': values are retrieved from any source, then possibly processed (for example, filtering values based on permissions).

An example dataset:

Dataset for dropdowns

Thanks to the 'magic' of Livewire, datasets are reloaded whenever an argument changes. Return values of datasets are cached for combinations of arguments.

Datasets may also depend on the value of other fields. For example, when creating a VM with a certain tenant, only IP addresses belonging to that same tenant should be shown.
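
A dataset with per-argument caching and a dependent field can be sketched like this (Python for brevity; Foundry uses PHP and Livewire). The `Dataset` class, the `source_calls` counter and the sample IP data are assumptions; the in-memory list stands in for a NetBox query.

```python
class Dataset:
    """Retrieves values from a source, optionally processes them, caches per arguments."""

    def __init__(self, source, process=None):
        self._source = source
        self._process = process or (lambda values: values)
        self._cache = {}
        self.source_calls = 0  # only here to make the caching visible

    def values(self, **arguments):
        # Arguments must be hashable; each combination is cached separately.
        key = tuple(sorted(arguments.items()))
        if key not in self._cache:
            self.source_calls += 1
            self._cache[key] = self._process(self._source(**arguments))
        return self._cache[key]


# Stand-in for data that would come from NetBox.
IP_ADDRESSES = [
    {"address": "192.0.2.10", "tenant": "acme"},
    {"address": "192.0.2.20", "tenant": "globex"},
]


def ip_addresses_for(tenant):
    return [ip["address"] for ip in IP_ADDRESSES if ip["tenant"] == tenant]


ip_dataset = Dataset(ip_addresses_for, process=sorted)

first = ip_dataset.values(tenant="acme")
again = ip_dataset.values(tenant="acme")  # served from the cache
other = ip_dataset.values(tenant="globex")  # new argument, source is queried again
```

Changing the 'Tenant' field changes the dataset's argument, which is what triggers a reload of the dependent dropdown.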

Second, the API:

API

Endpoints and the OpenAPI spec are dynamically generated.

Endpoints are generated according to the format: `/api/{domain}/{action}`. For example, the 'rollback snapshot' action endpoint is: `POST /api/vm/rollback-snapshot`.

There is one exception: endpoints for 'create' and 'delete' actions do not contain the action. For example, the 'create VM' action endpoint is: `POST /api/vm` (without a trailing `/create`).
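
The path rule, including the exception for 'create' and 'delete', fits in a few lines. This is a Python sketch, not Foundry's PHP code. The `slug` helper and the configurable `prefix` parameter are assumptions, with `/api` taken from the stated format. Only the path is generated, because the text does not specify HTTP methods for every action.

```python
def slug(name: str) -> str:
    """Turn 'rollback snapshot' into 'rollback-snapshot'."""
    return "-".join(name.lower().split())


def endpoint(domain: str, action: str, prefix: str = "/api") -> str:
    path = f"{prefix}/{slug(domain)}"
    # 'create' and 'delete' act on the domain itself, so the action is left out.
    if slug(action) not in {"create", "delete"}:
        path += f"/{slug(action)}"
    return path


rollback_path = endpoint("VM", "rollback snapshot")
create_path = endpoint("VM", "create")
```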

Finally, the CLI:

CLI

Anatomy of workflows

An action (such as 'create VM') creates a workflow (such as 'create VM example.com').

In its most basic form, a workflow looks like this:

Basic 'create VM' workflow

Earlier, we explained that `FormRequest`s (passed to the action) perform only basic validation.

For complex validation, dependencies come into play:

Dependencies of 'create VM' workflow

If any dependency fails, the workflow stops. Dependencies can also return values to use in other dependencies or tasks.
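
How dependencies stop a workflow and feed values to each other can be sketched as follows (Python for brevity, not Foundry's PHP). The `DependencyFailed` exception, the `resolve` function and the lookup tables, which stand in for NetBox queries, are assumptions.

```python
class DependencyFailed(Exception):
    """Raised by a dependency to stop the workflow before any task runs."""


def tenant_exists(data, resolved):
    tenants = {"acme": 7}  # stand-in for a NetBox lookup
    if data["tenant"] not in tenants:
        raise DependencyFailed(f"unknown tenant {data['tenant']}")
    return tenants[data["tenant"]]  # tenant ID, reusable by later dependencies


def ip_belongs_to_tenant(data, resolved):
    owners = {"192.0.2.10": 7}  # stand-in for a NetBox lookup
    # Reuses the tenant ID returned by the previous dependency.
    if owners.get(data["ip"]) != resolved["tenant"]:
        raise DependencyFailed(f"{data['ip']} does not belong to the tenant")
    return data["ip"]


DEPENDENCIES = [("tenant", tenant_exists), ("ip", ip_belongs_to_tenant)]


def resolve(dependencies, data):
    """Run dependencies in order; the first failure aborts everything."""
    resolved = {}
    for name, dependency in dependencies:
        resolved[name] = dependency(data, resolved)
    return resolved


resolved = resolve(DEPENDENCIES, {"tenant": "acme", "ip": "192.0.2.10"})
```

The data retrieved once (here the tenant ID) is passed along instead of being fetched again by every rule, which is the reuse problem that custom validation rules could not solve cleanly.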

Dependencies were added to Foundry later on. Initially, we used custom validation rules. As more actions were added, this became unmaintainable. Not only because of the sheer number of rules, but also because every action has slightly different requirements: often, data retrieved for one rule had to be reused for other rules.

Next, let's look at how tasks and workflows are tied together:

Sequence of 'create VM' workflow

All tasks in the workflow run in the specified sequence. The constants are human-readable task descriptions.
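
A workflow whose sequence is defined by descriptive constants might look like this sketch (Python for brevity, not Foundry's PHP). Only 'Create VM in NetBox' is a task named in this article; the other two constants and the `log` attribute are assumptions.

```python
class CreateVmWorkflow:
    # Constants double as human-readable task descriptions.
    RESERVE_IP = "Reserve IP address in NetBox"  # hypothetical task
    CREATE_IN_NETBOX = "Create VM in NetBox"
    CREATE_IN_PROXMOX = "Create VM in Proxmox"  # hypothetical task

    def __init__(self, hostname: str):
        self.hostname = hostname
        self.log: list[str] = []

    def sequence(self) -> list[str]:
        """Tasks run in exactly this order."""
        return [self.RESERVE_IP, self.CREATE_IN_NETBOX, self.CREATE_IN_PROXMOX]

    def run(self) -> list[str]:
        for description in self.sequence():
            # A real task would call the external system here.
            self.log.append(f"{description}: {self.hostname}")
        return self.log


log = CreateVmWorkflow("example.com").run()
```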

Finally, let's look at a task:

Task in 'create VM' workflow

Tasks can receive several arguments:

When starting the workflow, tasks are queued - using Laravel's built-in queueing facilities - with dependencies' return values.

From this point, the workflow runs asynchronously. The caller (be it via API, CLI or GUI) receives a workflow UUID - with which the status of the workflow can be queried.
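
The 'start now, query later' pattern can be sketched with an in-memory queue (Python for brevity; Foundry uses Laravel's queue). The status names 'queued' and 'finished', the `start_workflow` and `work` functions and the single-worker loop are assumptions.

```python
import uuid
from collections import deque

QUEUE = deque()  # stand-in for a real job queue
STATUS = {}  # workflow UUID -> status


def start_workflow(tasks, dependency_values):
    """Queue the workflow and return immediately with its UUID."""
    workflow_id = str(uuid.uuid4())
    STATUS[workflow_id] = "queued"
    QUEUE.append((workflow_id, tasks, dependency_values))
    return workflow_id


def work():
    """What a queue worker would do, outside the caller's request."""
    while QUEUE:
        workflow_id, tasks, dependency_values = QUEUE.popleft()
        for task in tasks:
            task(dependency_values)
        STATUS[workflow_id] = "finished"


created = []
workflow_id = start_workflow(
    [lambda values: created.append(values["hostname"])],
    {"hostname": "example.com"},
)
status_before = STATUS[workflow_id]  # caller polls by UUID
work()
status_after = STATUS[workflow_id]
```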

Tasks can have 4 statuses:

  • Successful, optionally with a 'carry' value for use by the next task.
  • Skipped, with a human-readable reason.
  • Failed, making the entire workflow fail (tasks later on in the sequence are skipped).
  • Terminated, which likewise makes the entire workflow fail.
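
The four statuses and the 'carry' value can be sketched as a small runner (Python for brevity, not Foundry's PHP). The `Result` class is an assumption, as is the choice to record tasks after a failure as skipped with a reason.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Optional


class Status(Enum):
    SUCCESSFUL = "successful"
    SKIPPED = "skipped"
    FAILED = "failed"
    TERMINATED = "terminated"


@dataclass
class Result:
    status: Status
    carry: Any = None  # value handed to the next task
    reason: Optional[str] = None  # human-readable explanation


def run(tasks):
    results, carry, halted = [], None, False
    for task in tasks:
        if halted:
            results.append(Result(Status.SKIPPED, reason="an earlier task failed"))
            continue
        try:
            result = task(carry)
        except Exception as exc:  # any exception fails the task
            result = Result(Status.FAILED, reason=str(exc))
        results.append(result)
        if result.status in (Status.FAILED, Status.TERMINATED):
            halted = True
        elif result.status is Status.SUCCESSFUL:
            carry = result.carry
    return results


def reserve_ip(carry):
    return Result(Status.SUCCESSFUL, carry="192.0.2.10")


def create_vm(carry):
    raise RuntimeError(f"hypervisor rejected {carry}")


def install_monitoring(carry):
    return Result(Status.SUCCESSFUL)


results = run([reserve_ip, create_vm, install_monitoring])
```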

Resilient workflows: retries

Foundry communicates with 8 systems (and counting), namely:

Due to the sheer number of systems that Foundry integrates with, some things inevitably go wrong sometimes.

Think of temporary service interruptions and validation errors. But integration errors happen too: some APIs are better documented than others.

So, what happens when a task fails?

In the GUI, engineers immediately see the exception that caused a task - and therefore the workflow - to fail:

Failed workflow in GUI

... and on which task:

Failed task in GUI

In the task report, the engineer sees the traceback:

Task exception in GUI

To pick up where the task failed, engineers can easily retry the workflow:

Workflow retry button in GUI

... after which a new 'attempt' is created:

Attempt 2 in GUI

Tasks that succeeded in the previous attempt aren't re-run (marked as 'Cached').

In this case, the workflow failed again. For auditing and debugging, full history is kept - even for previous attempts.
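
Retrying with cached results and a preserved history can be sketched as follows (Python for brevity, not Foundry's PHP). The `run_attempt` function, the status strings and the task names are assumptions. Unlike the screenshot above, the second attempt in this example succeeds.

```python
def run_attempt(tasks, previous=None):
    """Run one attempt; tasks that already succeeded are not re-run."""
    previous = previous or {}
    attempt, failed = {}, False
    for name, task in tasks:
        if failed:
            attempt[name] = "skipped"
        elif previous.get(name) in ("successful", "cached"):
            attempt[name] = "cached"
        else:
            try:
                task()
                attempt[name] = "successful"
            except Exception:
                attempt[name] = "failed"
                failed = True
    return attempt


calls = {"reserve IP": 0, "create VM": 0}


def reserve_ip():
    calls["reserve IP"] += 1


def create_vm():
    calls["create VM"] += 1
    if calls["create VM"] == 1:
        raise RuntimeError("temporary API outage")  # fails only the first time


def notify():
    pass


tasks = [("reserve IP", reserve_ip), ("create VM", create_vm), ("notify", notify)]

history = [run_attempt(tasks)]  # attempt 1
history.append(run_attempt(tasks, previous=history[-1]))  # attempt 2
```

Keeping every attempt in `history`, rather than overwriting the first one, is what makes auditing and debugging possible afterwards.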

Finally: scheduled tasks

Several actions must run periodically. For example, we check whether VMs in Proxmox pools are sufficiently spread across physical hypervisors - every day.

This 'virtualisation resource distribution' check doesn't consist of a single workflow, but of a workflow per Proxmox pool.

Therefore, the scheduled task returns an array of workflows:

Scheduled task
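
A scheduled task that fans out into one workflow per pool could be sketched like this (Python for brevity, not Foundry's PHP). The article does not define 'sufficiently spread', so the `max_share` threshold is an assumption, as are the pool data and the dictionary layout of a workflow.

```python
from collections import Counter

# Stand-in for Proxmox data: pool -> {VM: hypervisor}.
POOLS = {
    "pool-a": {"vm1": "hv1", "vm2": "hv2", "vm3": "hv3"},
    "pool-b": {"vm4": "hv1", "vm5": "hv1", "vm6": "hv2"},
}


def distribution_workflow(pool, placement, max_share=0.5):
    """Build one workflow that checks a single pool."""

    def check():
        if not placement:
            return  # an empty pool has nothing to spread
        busiest = max(Counter(placement.values()).values())
        # Assumed rule: no hypervisor may host more than max_share of the pool.
        if busiest / len(placement) > max_share:
            raise RuntimeError(f"{pool}: VMs are concentrated on one hypervisor")

    return {"title": f"check VM distribution in {pool}", "tasks": [check]}


def scheduled_task():
    """Runs daily and returns an array of workflows, one per pool."""
    return [distribution_workflow(pool, placement) for pool, placement in POOLS.items()]


workflows = scheduled_task()
```

One workflow per pool means a badly spread pool shows up as its own failed workflow, without hiding the result for the other pools.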

In the GUI, engineers see an overview of the past 7 runs per scheduled task:

Scheduled tasks overview in GUI

... to easily see which workflows failed, when, and why.

Next steps

Foundry has been serving us incredibly well for over a year, and we're expanding it with:

  • License management (all existing solutions are overkill).
  • More advanced Ansible integration (AWX/Tower/Semaphore don't meet our needs).
  • Support for more 802.1Q modes.

... and more. Want to help? Look at our openings.

Want to see more code? View the presentation from LaraFest (Dutch only).
