docs: clarify reschedule, migrate, and replacement terminology
Our vocabulary around scheduler behaviors outside of the `reschedule` and
`migrate` blocks leaves room for confusion around whether the reschedule tracker
should be propagated between allocations. There are effectively five different
behaviors we need to cover:

* restart: when the tasks of an allocation fail and we try to restart the tasks
  in place.

* reschedule: when the `restart` block runs out of attempts (or the allocation
  fails before tasks even start), and we need to move
  the allocation to another node to try again.

* migrate: when the user has asked to drain a node and we need to move the
  allocations. These are not failures, so we don't want to propagate the
  reschedule tracker.

* replacement: when a node is lost, we don't count that against the `reschedule`
  tracker for the allocations on the node (it's not the allocation's "fault",
  after all). We don't want to run the `migrate` machinery here either, as we
  can't contact the down node. To the scheduler, this is effectively the same as
  if we bumped the `group.count`.

* replacement for `disconnect.replace = true`: this is a replacement, but the
  replacement is intended to be temporary, so we propagate the reschedule tracker.
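
For reference, these five behaviors map onto the job specification as follows.
A minimal sketch (values are illustrative, not defaults):

```hcl
job "docs" {
  group "example" {
    # restart: try failed tasks again in place, on the same node
    restart {
      attempts = 2
      interval = "30m"
      mode     = "fail"
    }

    # reschedule: once restarts are exhausted (or the alloc fails before
    # tasks start), move the allocation to another node; the reschedule
    # tracker is propagated to the new allocation
    reschedule {
      attempts  = 3
      interval  = "1h"
      unlimited = false
    }

    # migrate: controls movement off a draining node; not a failure, so
    # the reschedule tracker is not propagated
    migrate {
      max_parallel = 1
    }

    # disconnect.replace: temporary replacement for allocations on a lost
    # node; the reschedule tracker is propagated
    disconnect {
      lost_after = "1h"
      replace    = true
    }
  }
}
```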

Add a section to the `reschedule`, `migrate`, and `disconnect` blocks explaining
when each item applies. Update the use of the word "reschedule" in several
places where "replacement" is correct, and vice-versa.

Fixes: #24918
tgross committed Jan 23, 2025
1 parent c1dc9ed commit 39f153d
Showing 8 changed files with 79 additions and 46 deletions.
2 changes: 1 addition & 1 deletion command/job_restart.go
@@ -132,7 +132,7 @@ Usage: nomad job restart [options] <job>
 groups are restarted.
 When rescheduling, the current allocations are stopped triggering the Nomad
-scheduler to create replacement allocations that may be placed in different
+scheduler to create new allocations that may be placed in different
 clients. The command waits until the new allocations have client status
 'ready' before proceeding with the remaining batches. Services health checks
 are not taken into account.
17 changes: 9 additions & 8 deletions scheduler/generic_sched.go
@@ -470,7 +470,8 @@ func (s *GenericScheduler) computeJobAllocs() error {
     return s.computePlacements(destructive, place, results.taskGroupAllocNameIndexes)
 }

-// downgradedJobForPlacement returns the job appropriate for non-canary placement replacement
+// downgradedJobForPlacement returns the previous stable version of the job for
+// downgrading a placement for non-canaries
 func (s *GenericScheduler) downgradedJobForPlacement(p placementResult) (string, *structs.Job, error) {
     ns, jobID := s.job.Namespace, s.job.ID
     tgName := p.TaskGroup().Name
@@ -588,8 +589,8 @@ func (s *GenericScheduler) computePlacements(destructive, place []placementResult
         }

         // Check if we should stop the previous allocation upon successful
-        // placement of its replacement. This allow atomic placements/stops. We
-        // stop the allocation before trying to find a replacement because this
+        // placement of the new alloc. This allows atomic placements/stops. We
+        // stop the allocation before trying to place the new alloc because this
         // frees the resources currently used by the previous allocation.
         stopPrevAlloc, stopPrevAllocDesc := missing.StopPreviousAlloc()
         prevAllocation := missing.PreviousAllocation()
@@ -740,7 +741,7 @@ func (s *GenericScheduler) computePlacements(destructive, place []placementResult
             // Track the fact that we didn't find a placement
             s.failedTGAllocs[tg.Name] = s.ctx.Metrics()

-            // If we weren't able to find a replacement for the allocation, back
+            // If we weren't able to find a placement for the allocation, back
             // out the fact that we asked to stop the allocation.
             if stopPrevAlloc {
                 s.plan.PopUpdate(prevAllocation)
@@ -802,10 +803,10 @@ func needsToSetNodes(a, b *structs.Job) bool {
         a.NodePool != b.NodePool
 }

-// propagateTaskState copies task handles from previous allocations to
-// replacement allocations when the previous allocation is being drained or was
-// lost. Remote task drivers rely on this to reconnect to remote tasks when the
-// allocation managing them changes due to a down or draining node.
+// propagateTaskState copies task handles from previous allocations to migrated
+// or replacement allocations when the previous allocation is being drained or
+// was lost. Remote task drivers rely on this to reconnect to remote tasks when
+// the allocation managing them changes due to a down or draining node.
 //
 // The previous allocation will be marked as lost after task state has been
 // propagated (when the plan is applied), so its ClientStatus is not yet marked
8 changes: 4 additions & 4 deletions website/content/docs/commands/job/restart.mdx
@@ -40,10 +40,10 @@ When both groups and tasks are defined only the tasks for the allocations of
 those groups are restarted.

 When rescheduling, the current allocations are stopped triggering the Nomad
-scheduler to create replacement allocations that may be placed in different
-clients. The command waits until the new allocations have client status `ready`
-before proceeding with the remaining batches. Services health checks are not
-taken into account.
+scheduler to create new allocations that may be placed in different clients. The
+command waits until the new allocations have client status `ready` before
+proceeding with the remaining batches. Services health checks are not taken into
+account.

 By default the command restarts all running tasks in-place with one allocation
 per batch.
8 changes: 4 additions & 4 deletions website/content/docs/configuration/server.mdx
@@ -438,17 +438,17 @@ Nomad Clients periodically heartbeat to Nomad Servers to confirm they are
 operating as expected. Nomad Clients which do not heartbeat in the specified
 amount of time are considered `down` and their allocations are marked as `lost`
 or `disconnected` (if [`disconnect.lost_after`][disconnect.lost_after] is set)
-and rescheduled.
+and replaced.

 The various heartbeat related parameters allow you to tune the following
 tradeoffs:

 - The longer the heartbeat period, the longer a `down` Client's workload will
-  take to be rescheduled.
+  take to be replaced.
 - The shorter the heartbeat period, the more likely transient network issues,
   leader elections, and other temporary issues could cause a perfectly
   functional Client and its workloads to be marked as `down` and the work
-  rescheduled.
+  replaced.

 While Nomad Clients can connect to any Server, all heartbeats are forwarded to
 the leader for processing. Since this heartbeat processing consumes resources,

@@ -510,7 +510,7 @@ system has for a delay in noticing crashed Clients. For example a
 `failover_heartbeat_ttl` of 30 minutes may give even the slowest clients in the
 largest clusters ample time to heartbeat after an election. However if the
 election was due to a datacenter-wide failure affecting Clients, it will be 30
-minutes before Nomad recognizes that they are `down` and reschedules their
+minutes before Nomad recognizes that they are `down` and replaces their
 work.

 [encryption]: /nomad/tutorials/transport-security/security-gossip-encryption 'Nomad Encryption Overview'
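
For illustration, these knobs live in the `server` block of the agent
configuration. A sketch with assumed values, not recommendations:

```hcl
server {
  enabled = true

  # How long a client may miss heartbeats before it is marked `down`
  # and its work is replaced.
  heartbeat_grace = "30s"

  # Lower bound for the heartbeat interval handed out to clients.
  min_heartbeat_ttl = "10s"

  # Caps how much heartbeat processing the leader performs.
  max_heartbeats_per_second = 50.0

  # Heartbeat TTL granted to all clients after a leader election.
  failover_heartbeat_ttl = "5m"
}
```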
32 changes: 22 additions & 10 deletions website/content/docs/job-specification/disconnect.mdx
@@ -14,7 +14,14 @@ description: |-
 The `disconnect` block describes the system's behavior in case of a network
 partition. By default, without a `disconnect` block, if an allocation is on a
 node that misses heartbeats, the allocation will be marked `lost` and will be
-rescheduled.
+replaced.
+
+Replacement happens when a node is lost. When a node is drained, Nomad
+[migrates][] the allocations instead and the `disconnect` block does not
+apply. When a Nomad agent fails to set up the allocation or the tasks of an
+allocation fail more than their [`restart`][] block allows, Nomad
+[reschedules][] the allocations and the `disconnect` block does not apply.
+

 ```hcl
 job "docs" {
@@ -51,11 +58,12 @@ same `disconnect` block.

 Refer to [the Lost After section][lost-after] for more details.

-- `replace` `(bool: false)` - Specifies if the disconnected allocation should
-  be replaced by a new one rescheduled on a different node. If false and the
-  node it is running on becomes disconnected or goes down, this allocation
-  won't be rescheduled and will be reported as `unknown` until the node reconnects,
-  or until the allocation is manually stopped:
+- `replace` `(bool: false)` - Specifies if the disconnected allocation should be
+  replaced by a new one rescheduled on a different node. The replacement
+  allocation is considered a reschedule and will obey the job's [`reschedule`][]
+  block. If false and the node it is running on becomes disconnected or goes
+  down, this allocation won't be replaced and will be reported as `unknown`
+  until the node reconnects, or until the allocation is manually stopped:

 ```plaintext
 `nomad alloc stop <alloc ID>`
Expand Down Expand Up @@ -84,7 +92,7 @@ same `disconnect` block.
- `keep_original`: Always keep the original allocation. Bear in mind
when choosing this option, it can have crashed while the client was
disconnected.
- `keep_replacement`: Always keep the allocation that was rescheduled
- `keep_replacement`: Always keep the allocation that was replaced
to replace the disconnected one.
- `best_score`: Keep the allocation running on the node with the best
score.
@@ -102,17 +110,17 @@ The following examples only show the `disconnect` blocks. Remember that the
 This example shows how `stop_on_client_after` interacts with
 other blocks. For the `first` group, after the default 10 second
 [`heartbeat_grace`] window expires and 90 more seconds passes, the
-server will reschedule the allocation. The client will wait 90 seconds
+server will replace the allocation. The client will wait 90 seconds
 before sending a stop signal (`SIGTERM`) to the `first-task`
 task. After 15 more seconds because of the task's `kill_timeout`, the
 client will send `SIGKILL`. The `second` group does not have
-`stop_on_client_after`, so the server will reschedule the
+`stop_on_client_after`, so the server will replace the
 allocation after the 10 second [`heartbeat_grace`] expires. It will
 not be stopped on the client, regardless of how long the client is out
 of touch.

 Note that if the server's clocks are not closely synchronized with
-each other, the server may reschedule the group before the client has
+each other, the server may replace the group before the client has
 stopped the allocation. Operators should ensure that clock drift
 between servers is as small as possible.

@@ -217,3 +225,7 @@ group "second" {
 [stop-after]: /nomad/docs/job-specification/disconnect#stop-after
 [lost-after]: /nomad/docs/job-specification/disconnect#lost-after
 [`reconcile`]: /nomad/docs/job-specification/disconnect#reconcile
+[migrates]: /nomad/docs/job-specification/migrate
+[`restart`]: /nomad/docs/job-specification/restart
+[reschedules]: /nomad/docs/job-specification/reschedule
+[`reschedule`]: /nomad/docs/job-specification/reschedule
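
Putting the terminology together, a sketch of a group that keeps allocations
`unknown` for up to an hour and runs a temporary replacement in the meantime
(values are illustrative):

```hcl
group "cache" {
  disconnect {
    # allocations on a client that misses heartbeats are marked
    # `unknown` instead of `lost` for up to an hour
    lost_after = "1h"

    # run a replacement; it counts as a reschedule and obeys the
    # job's `reschedule` block
    replace = true

    # on reconnect, keep the replacement and stop the original
    reconcile = "keep_replacement"
  }
}
```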
38 changes: 19 additions & 19 deletions website/content/docs/job-specification/group.mdx
@@ -48,9 +48,9 @@ job "docs" {
   ephemeral disk requirements of the group. Ephemeral disks can be marked as
   sticky and support live data migrations.

-- `disconnect` <code>([disconnect][]: nil)</code> - Specifies the disconnect 
-  strategy for the server and client for all tasks in this group in case of a 
-  network partition. The tasks can be left unconnected, stopped or replaced 
+- `disconnect` <code>([disconnect][]: nil)</code> - Specifies the disconnect
+  strategy for the server and client for all tasks in this group in case of a
+  network partition. The tasks can be left unconnected, stopped or replaced
   when the client disconnects. The policy for reconciliation in case the client
   regains connectivity is also specified here.

@@ -65,14 +65,14 @@ job "docs" {
   requirements and configuration, including static and dynamic port allocations,
   for the group.

-- `prevent_reschedule_on_lost` `(bool: false)` - Defines the reschedule behaviour
-  of an allocation when the node it is running on misses heartbeats.
-  When enabled, if the node it is running on becomes disconnected
-  or goes down, this allocations wont be rescheduled and will show up as `unknown`
-  until the node comes back up or it is manually restarted.
+- `prevent_reschedule_on_lost` `(bool: false)` - Defines the replacement
+  behaviour of an allocation when the node it is running on misses heartbeats.
+  When enabled, if the node it is running on becomes disconnected or goes down,
+  this allocation won't be replaced and will show up as `unknown` until the node
+  comes back up or it is manually restarted.

-  This behaviour will only modify the reschedule process on the server.
-  To modify the allocation behaviour on the client, see
+  This behaviour will only modify the replacement process on the server. To
+  modify the allocation behaviour on the client, see
   [`stop_after_client_disconnect`](#stop_after_client_disconnect).

   The `unknown` allocation has to be manually stopped to run it again.
@@ -84,7 +84,7 @@ job "docs" {
   Setting `max_client_disconnect` and `prevent_reschedule_on_lost = true` at the
   same time requires that [rescheduling is disabled entirely][`disable_rescheduling`].

-  This field was deprecated in favour of `replace` on the [`disconnect`] block, 
+  This field was deprecated in favour of `replace` on the [`disconnect`] block,
   see [example below][disconect_migration] for more details about migrating.

 - `reschedule` <code>([Reschedule][]: nil)</code> - Allows to specify a
@@ -299,18 +299,18 @@ issues with stateful tasks or tasks with long restart times.

 Instead, an operator may desire that these allocations reconnect without a
 restart. When `max_client_disconnect` or `disconnect.lost_after` is specified,
-the Nomad server will mark clients that fail to heartbeat as "disconnected" 
+the Nomad server will mark clients that fail to heartbeat as "disconnected"
 rather than "down", and will mark allocations on a disconnected client as
 "unknown" rather than "lost". These allocations may continue to run on the
 disconnected client. Replacement allocations will be scheduled according to the
-allocations' `disconnect.replace` settings. until the disconnected client
-reconnects. Once a disconnected client reconnects, Nomad will compare the "unknown"
-allocations with their replacements and will decide which ones to keep according
-to the `disconnect.replace` setting. If the `max_client_disconnect` or
-`disconnect.losta_after` duration expires before the client reconnects,
+allocations' `disconnect.replace` settings until the disconnected client
+reconnects. Once a disconnected client reconnects, Nomad will compare the "unknown"
+allocations with their replacements and will decide which ones to keep according
+to the `disconnect.replace` setting. If the `max_client_disconnect` or
+`disconnect.lost_after` duration expires before the client reconnects,
 the allocations will be marked "lost".
 Clients that contain "unknown" allocations will transition to "disconnected"
-rather than "down" until the last `max_client_disconnect` or `disconnect.lost_after` 
+rather than "down" until the last `max_client_disconnect` or `disconnect.lost_after`
 duration has expired.

 In the example code below, if both of these task groups were placed on the same
@@ -390,7 +390,7 @@ will remain as `unknown` and won't be rescheduled.
 #### Migration to `disconnect` block

 The new configuration fields in the disconnect block work exactly the same as the
-ones they are replacing: 
+ones they are replacing:
 * `stop_after_client_disconnect` is replaced by `stop_after`
 * `max_client_disconnect` is replaced by `lost_after`
 * `prevent_reschedule_on_lost` is replaced by `replace`
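
As an example of that mapping, a hypothetical group using the deprecated fields
converts like this sketch (note that a `prevent_reschedule_on_lost = true`
setting becomes `replace = false`, inverting the sense):

```hcl
# Before: deprecated group-level fields
group "web" {
  stop_after_client_disconnect = "2m"
  max_client_disconnect        = "1h"
}

# After: the equivalent disconnect block
group "web" {
  disconnect {
    stop_after = "2m"
    lost_after = "1h"
  }
}
```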
10 changes: 10 additions & 0 deletions website/content/docs/job-specification/migrate.mdx
@@ -22,6 +22,13 @@ If specified at the job level, the configuration will apply to all groups
 within the job. Only service jobs with a count greater than 1 support migrate
 blocks.

+Migrating happens when a Nomad node is drained. When a node is lost, Nomad
+[replaces][] the allocations instead and the `migrate` block does not
+apply. When the agent fails to set up the allocation or the tasks of an
+allocation fail more than their [`restart`][] block allows, Nomad [reschedules][]
+the allocations instead and the `migrate` block does not apply.
+

 ```hcl
 job "docs" {
   migrate {
@@ -78,3 +85,6 @@ on node draining.
 [count]: /nomad/docs/job-specification/group#count
 [drain]: /nomad/docs/commands/node/drain
 [deadline]: /nomad/docs/commands/node/drain#deadline
+[replaces]: /nomad/docs/job-specification/disconnect#replace
+[`restart`]: /nomad/docs/job-specification/restart
+[reschedules]: /nomad/docs/job-specification/reschedule
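
To make the contrast concrete, a sketch of a group that drains one allocation
at a time while handling failures through `reschedule` (values are illustrative):

```hcl
group "web" {
  count = 3

  # applies only when a node is drained
  migrate {
    max_parallel     = 1
    health_check     = "checks"
    min_healthy_time = "10s"
    healthy_deadline = "5m"
  }

  # applies only when the allocation itself fails
  reschedule {
    delay          = "30s"
    delay_function = "exponential"
    max_delay      = "1h"
    unlimited      = true
  }
}
```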
10 changes: 10 additions & 0 deletions website/content/docs/job-specification/reschedule.mdx
@@ -31,6 +31,12 @@ Nomad will attempt to schedule the allocation on another node if any of its
 task statuses become `failed`. The scheduler prefers to create a replacement
 allocation on a node that was not used by a previous allocation.

+Rescheduling happens when the Nomad agent fails to set up the allocation or the
+tasks of an allocation fail more than their [`restart`][] block allows. When a
+node is drained, Nomad [migrates][] the allocations instead and the `reschedule`
+block does not apply. When a node is lost, Nomad [replaces][] the allocations
+instead and the `reschedule` block does not apply.
+

 ```hcl
 job "docs" {
@@ -131,3 +137,7 @@ job "docs" {
 ```

 [`progress_deadline`]: /nomad/docs/job-specification/update#progress_deadline
+[`restart`]: /nomad/docs/job-specification/restart
+[migrates]: /nomad/docs/job-specification/migrate
+[replaces]: /nomad/docs/job-specification/disconnect#replace
+[reschedules]: /nomad/docs/job-specification/reschedule
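
As an illustration of the progression from in-place restarts to rescheduling,
a sketch with assumed values:

```hcl
group "worker" {
  # first, failed tasks are restarted in place on the same node
  restart {
    attempts = 2
    interval = "30m"
    delay    = "15s"
    mode     = "fail"
  }

  # once restarts are exhausted, the allocation is rescheduled to
  # another node with exponential backoff
  reschedule {
    delay          = "30s"
    delay_function = "exponential"
    max_delay      = "1h"
    unlimited      = true
  }
}
```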
