This project is mirrored from https://gitee.com/mirrors/nomad.git.
Pull mirroring failed.
Repository mirroring has been paused due to too many failed attempts. It can be resumed by a project maintainer.
- 08 Mar, 2022 1 commit
-
-
temp authored
-
- 07 Mar, 2022 1 commit
-
-
hc-github-team-nomad-core authored
This pull request was automerged via backport-assistant
-
- 01 Mar, 2022 1 commit
-
-
hc-github-team-nomad-core authored
This pull request was automerged via backport-assistant
-
- 28 Feb, 2022 1 commit
-
-
Tim Gross authored
-
- 18 Feb, 2022 1 commit
-
-
Ignacio Torres Masdeu authored
-
- 01 Feb, 2022 1 commit
-
-
Tim Gross authored
-
- 31 Jan, 2022 3 commits
-
-
Nomad Release Bot authored
-
Nomad Release Bot authored
-
Nomad Release bot authored
-
- 28 Jan, 2022 10 commits
-
-
Tim Gross authored
-
Tim Gross authored
-
Tim Gross authored
-
Tim Gross authored
When an allocation stops, the `csi_hook` makes an unpublish RPC to the servers to unpublish via the CSI RPCs: first to the node plugins and then the controller plugins. The controller RPCs must happen after the node RPCs so that the node has had a chance to unmount the volume before the controller tries to detach the associated device.

But the client has local access to the node plugins and can independently determine if it's safe to send unpublish RPCs to those plugins. This will allow the server to treat the node plugin as abandoned if a client is disconnected and `stop_on_client_disconnect` is set. This will let the server try to send unpublish RPCs to the controller plugins, under the assumption that the client will be trying to unmount the volume on its end first.

Note that the CSI `NodeUnpublishVolume`/`NodeUnstageVolume` RPCs can return ignorable errors in the case where the volume has already been unmounted from the node. Handle all other errors by retrying until we get success so as to give operators the opportunity to reschedule a failed node plugin (ex. in the case where they accidentally drained a node without `-ignore-system`). Fan-out the work for each volume into its own goroutine so that we can release a subset of volumes if only one is stuck.
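As a rough sketch of the fan-out described above (names like `unpublishVolume` and `errAlreadyUnmounted` are illustrative, not Nomad's actual csi_hook code):

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

var errAlreadyUnmounted = errors.New("volume already unmounted")

// unpublishVolume stands in for the NodeUnpublishVolume/NodeUnstageVolume calls.
func unpublishVolume(ctx context.Context, volID string) error {
	return nil
}

// unpublishAll releases each volume in its own goroutine so one stuck
// volume does not block releasing the others, retrying non-ignorable errors.
func unpublishAll(ctx context.Context, volIDs []string) {
	var wg sync.WaitGroup
	for _, id := range volIDs {
		wg.Add(1)
		go func(id string) {
			defer wg.Done()
			for {
				err := unpublishVolume(ctx, id)
				if err == nil || errors.Is(err, errAlreadyUnmounted) {
					return // success, or an ignorable "already unmounted" error
				}
				select {
				case <-ctx.Done():
					return
				case <-time.After(time.Second): // retry until success
				}
			}
		}(id)
	}
	wg.Wait()
}

func main() {
	unpublishAll(context.Background(), []string{"vol-a", "vol-b"})
	fmt.Println("all volumes released")
}
```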
-
Tim Gross authored
Small refactoring of the allocrunner hook for CSI to make it more testable, and a unit test that covers most of its logic.
-
Tim Gross authored
* csi: resolve invalid claim states on read

  It's currently possible for CSI volumes to be claimed by allocations that no longer exist. This changeset asserts a reasonable state at the state store level by registering these nil allocations as "past claims" on any read. This will cause any pass through the periodic GC or volumewatcher to trigger the unpublishing workflow for those claims.

* csi: make feasibility check errors more understandable

  When the feasibility checker finds we have no free write claims, it checks to see if any of those claims are for the job we're currently scheduling (so that earlier versions of a job can't block claims for new versions) and reports a conflict if the volume can't be scheduled so that the user can fix their claims. But when the checker hits a claim that has a GCd allocation, the state is recoverable by the server once claim reaping completes and no user intervention is required; the blocked eval should complete. Differentia...
-
Mahmood Ali authored
Glint pulled in an updated version of mitchellh/go-testing-interface which broke some existing tests because the update added a Parallel() method to testing.T. This switches to the standard library testing.TB which doesn't have a Parallel() method.
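For illustration, a helper written against the standard library's `testing.TB` looks like the sketch below; both `*testing.T` and `*testing.B` satisfy the interface, and nothing here depends on a `Parallel()` method:

```go
package example

import "testing"

// newTestFixture works for both tests and benchmarks because it only
// needs the methods defined on testing.TB (Helper, Cleanup, Fatalf, ...).
func newTestFixture(t testing.TB) string {
	t.Helper()
	fixture := "configured fixture"
	t.Cleanup(func() {
		// release resources when the test or benchmark ends
	})
	return fixture
}

func TestFixture(t *testing.T) {
	if got := newTestFixture(t); got == "" {
		t.Fatalf("expected a fixture, got empty string")
	}
}
```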
-
Tim Gross authored
The volumewatcher that runs on the leader needs to make RPC calls rather than writing to raft (as we do in the deploymentwatcher) because the unpublish workflow needs to make RPC calls to the clients. This requires that the volumewatcher has access to the leader's ACL token. But when leadership transitions, the new leader creates a new leader ACL token. This ACL token needs to be passed into the volumewatcher when we enable it, otherwise the volumewatcher can find itself with a stale token.
-
Tim Gross authored
When `volumewatcher.Watcher` starts on the leader, it starts a watch on every volume and triggers a reap of unused claims on any change to that volume. But if a reaping is in-flight during leadership transitions, it will fail and the event that triggered the reap will be dropped. Perform one reap of unused claims at the start of the watcher so that leadership transitions don't drop this event.
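A conceptual sketch of the change, with illustrative names rather than the real volumewatcher types: reap once on startup before entering the watch loop, so an event dropped during a leadership transition is still handled.

```go
package main

import (
	"context"
	"fmt"
)

// watchVolumes reaps unused claims once on startup (covering any reap
// that was in flight when leadership changed), then reaps again on
// every volume change event.
func watchVolumes(ctx context.Context, reapUnusedClaims func() error, changes <-chan string) {
	if err := reapUnusedClaims(); err != nil {
		fmt.Println("initial reap failed:", err)
	}
	for {
		select {
		case <-ctx.Done():
			return
		case vol := <-changes:
			fmt.Println("volume changed, reaping claims for:", vol)
			_ = reapUnusedClaims()
		}
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // stop immediately in this toy example
	watchVolumes(ctx, func() error { return nil }, make(chan string))
}
```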
-
James Rasell authored
volumewatcher: fix test data race.
-
- 18 Jan, 2022 5 commits
-
-
Nomad Release Bot authored
-
Nomad Release bot authored
-
Luiz Aoqui authored
-
Luiz Aoqui authored
-
Michael Schurter authored
Fix Node.Copy()
-
- 17 Jan, 2022 16 commits
-
-
Tim Gross authored
When we copy the system DNS to a task's `resolv.conf`, we should set the permissions as world-readable so that unprivileged users within the task can read it.
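The idea, sketched outside the actual client code (the helper name and destination path are hypothetical):

```go
package main

import (
	"os"
	"path/filepath"
)

// copyResolvConf copies the system DNS config into the task directory
// with 0644 permissions so unprivileged users inside the task can read it.
func copyResolvConf(taskDir string) error {
	data, err := os.ReadFile("/etc/resolv.conf")
	if err != nil {
		return err
	}
	dest := filepath.Join(taskDir, "resolv.conf")
	return os.WriteFile(dest, data, 0o644) // world-readable
}

func main() {
	if err := copyResolvConf(os.TempDir()); err != nil {
		panic(err)
	}
}
```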
-
Tim Gross authored
The size of `stat_t` fields is architecture dependent, which was reportedly causing a build failure on FreeBSD ARM7 32-bit systems. This changeset matches the behavior we have on Linux.
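Illustrative only (a generic Unix stat call rather than the FreeBSD code in question): the portability issue is of the kind below, where architecture-dependent field widths are converted explicitly before arithmetic.

```go
package main

import (
	"fmt"
	"syscall"
)

func main() {
	var st syscall.Stat_t
	if err := syscall.Stat("/tmp", &st); err != nil {
		panic(err)
	}
	// Blocks and Blksize have architecture-dependent widths on some
	// platforms, so convert explicitly to int64 before multiplying.
	allocated := int64(st.Blocks) * int64(st.Blksize)
	fmt.Println("allocated bytes:", allocated)
}
```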
-
Tim Gross authored
When the `volume deregister` or `volume detach` commands get an ID prefix that matches multiple volumes, show the full length of the volume IDs in the list of volumes shown, so that the user can select the correct one.
-
Tim Gross authored
The command line client sends a specific volume ID, but this isn't enforced at the API level and we were incorrectly using a prefix match for volume deregistration, resulting in cases where a volume with a shorter ID that's a prefix of another volume would be deregistered instead of the intended volume.
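A minimal sketch of the behavior after the fix, with hypothetical helper names: destructive operations require an exact ID match, so a short ID that happens to be a prefix of another volume cannot deregister the wrong one.

```go
package main

import (
	"errors"
	"fmt"
)

// deregisterVolume only acts on an exact ID match; prefix matching is
// left to read-only lookups where the user can confirm the result.
func deregisterVolume(id string, existing map[string]bool) error {
	if !existing[id] {
		return errors.New("no volume with exact ID " + id)
	}
	delete(existing, id)
	return nil
}

func main() {
	vols := map[string]bool{"data": true, "data-archive": true}
	// "data" removes only the volume literally named "data", never
	// "data-archive", even though "data" is a prefix of it.
	if err := deregisterVolume("data", vols); err != nil {
		fmt.Println(err)
	}
	fmt.Println("remaining volumes:", len(vols))
}
```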
-
grembo authored
Templates in nomad jobs make use of the vault token defined in the vault stanza when issuing credentials like client certificates. When using change_mode "noop" in the vault stanza, consul-template is not informed in case a vault token is re-issued (which can happen from time to time for various reasons, as described in https://www.nomadproject.io/docs/job-specification/vault). As a result, consul-template will keep using the old vault token to renew credentials and, once the token has expired, stop renewing credentials. The symptom of this problem is a vault_token file that is newer than the issued credential (e.g., TLS certificate) in a job's /secrets directory.

This change corrects this, so that h.updater.updatedVaultToken(token) is called, which will inform stakeholders about the new token and make sure the new token is used by consul-template.

Example job template fragment:

    vault {
      policies    = ["nomad-job-policy"]
      change_mode = "noop"
    }

    template {
      data = <<-EOH
      {{ with secret "pki_int/issue/nomad-job"
        "common_name=myjob.service.consul" "ttl=90m"
        "alt_names=localhost" "ip_sans=127.0.0.1"}}
      {{ .Data.certificate }}
      {{ .Data.private_key }}
      {{ .Data.issuing_ca }}
      {{ end }}
      EOH
      destination = "${NOMAD_SECRETS_DIR}/myjob.crt"
      change_mode = "noop"
    }

This fix does not alter the meaning of the three change modes of vault:

- "noop" - Take no action
- "restart" - Restart the job
- "signal" - Send a signal to the task

as the switch statement following line 232 contains the necessary logic. It is assumed that "take no action" was never meant to mean "don't tell consul-template about the new vault token".

Successfully tested in a staging cluster consisting of multiple nomad client nodes.
-
Tim Gross authored
The task runner prestart hooks take a `joincontext` so they have the option to exit early if either of two contexts are canceled: from killing the task or client shutdown. Some tasks exit without being shutdown from the server, so neither of the joined contexts ever gets canceled and we leak the `joincontext` (48 bytes) and its internal goroutine. This primarily impacts batch jobs and any task that fails or completes early such as non-sidecar prestart lifecycle tasks. Cancel the `joincontext` after the prestart call exits to fix the leak.
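A simplified sketch of the pattern using only the standard library (the real hook uses the `joincontext` package): cancel the joined context as soon as the prestart call returns, so its helper goroutine can exit instead of leaking.

```go
package main

import (
	"context"
	"fmt"
)

// runPrestart derives a context that is canceled when either killCtx or
// shutdownCtx is done. The leak fix: cancel the derived context as soon
// as prestart returns, instead of waiting for a parent to be canceled.
func runPrestart(killCtx, shutdownCtx context.Context, prestart func(context.Context) error) error {
	joined, cancel := context.WithCancel(killCtx)
	defer cancel() // releases the goroutine below even if neither parent fires

	go func() {
		select {
		case <-shutdownCtx.Done():
			cancel()
		case <-joined.Done():
		}
	}()

	return prestart(joined)
}

func main() {
	err := runPrestart(context.Background(), context.Background(),
		func(ctx context.Context) error { return nil })
	fmt.Println("prestart returned:", err)
}
```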
-
Luiz Aoqui authored
-
Michael Schurter authored
agent: validate reserved_ports are valid
-
Michael Schurter authored
deps: update go-getter to v1.5.11
-
Luiz Aoqui authored
-
Tim Gross authored
When a cluster doesn't have a leader, the `nomad operator debug` command can safely use stale queries to gracefully degrade the consistency of almost all its queries. The query parameter for these API calls was not being set by the command. Some `api` package queries do not include `QueryOptions` because they target a specific agent, but they can potentially be forwarded to other agents. If there is no leader, these forwarded queries will fail. Provide methods to call these APIs with `QueryOptions`.
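As an example of the pattern (a sketch, not the debug command's actual wiring), a caller of the `api` package can degrade to stale reads like this:

```go
package main

import (
	"fmt"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		panic(err)
	}

	// AllowStale lets any server answer from its local state, so the
	// query still succeeds while the cluster has no leader.
	q := &api.QueryOptions{AllowStale: true}

	nodes, _, err := client.Nodes().List(q)
	if err != nil {
		panic(err)
	}
	fmt.Println("nodes:", len(nodes))
}
```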
-
dependabot[bot] authored
* build(deps): bump github.com/hashicorp/cronexpr in /api

  Bumps [github.com/hashicorp/cronexpr](https://github.com/hashicorp/cronexpr) from 1.1.0 to 1.1.1.
  - [Release notes](https://github.com/hashicorp/cronexpr/releases)
  - [Commits](https://github.com/hashicorp/cronexpr/compare/v1.1.0...v1.1.1)

  ---
  updated-dependencies:
  - dependency-name: github.com/hashicorp/cronexpr
    dependency-type: direct:production
    update-type: version-update:semver-patch
  ...

  Signed-off-by: dependabot[bot] <support@github.com>

* go mod tidy

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Tim Gross <tim@0x74696d.com>
-
Luiz Aoqui authored
-
James Rasell authored
changelog: add entry for #11848
-
Tim Gross authored
-
Tim Gross authored
When the scheduler picks a node for each evaluation, the `LimitIterator` provides at most 2 eligible nodes for the `MaxScoreIterator` to choose from. This keeps scheduling fast while producing acceptable results because the results are binpacked. Jobs with a `spread` block (or node affinity) remove this limit in order to produce correct spread scoring. This means that every allocation within a job with a `spread` block is evaluated against _all_ eligible nodes.

Operators of large clusters have reported that jobs with `spread` blocks that are eligible on a large number of nodes can take longer than the nack timeout to evaluate (60s). Typical evaluations are processed in milliseconds.

In practice, it's not necessary to evaluate every eligible node for every allocation on large clusters, because the `RandomIterator` at the base of the scheduler stack produces enough variation in each pass that the likelihood of an uneven sprea...
-