- 11 May, 2022 1 commit
-
-
Michael Schurter authored
Whenever a node joins the cluster, either for the first time or after being `down`, we emit an evaluation for every system job to ensure all applicable system jobs are running on the node. This patch adds an optimization to skip creating evaluations for system jobs not in the current node's DC. While the scheduler performs the same feasibility check, skipping the creation of the evaluation altogether saves disk, network, and memory.
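A minimal sketch of the optimization in Go, with simplified types and hypothetical helper names (not the actual Nomad scheduler code): before creating an evaluation for each system job, check whether the joining node's datacenter appears in the job's `Datacenters` list.

```go
package sketch

// Simplified stand-ins for Nomad's real structs.
type Node struct{ ID, Datacenter string }

type Job struct {
	ID          string
	Datacenters []string
}

type Evaluation struct{ JobID, NodeID string }

// systemJobEvalsForNode creates evaluations only for system jobs whose
// Datacenters list includes the joining node's datacenter. The scheduler's
// feasibility check would reject the others anyway, so skipping eval
// creation entirely saves disk, network, and memory.
func systemJobEvalsForNode(node *Node, systemJobs []*Job) []*Evaluation {
	var evals []*Evaluation
	for _, job := range systemJobs {
		if !containsDC(job.Datacenters, node.Datacenter) {
			continue // not in this node's DC: no eval needed
		}
		evals = append(evals, &Evaluation{JobID: job.ID, NodeID: node.ID})
	}
	return evals
}

func containsDC(dcs []string, dc string) bool {
	for _, d := range dcs {
		if d == dc {
			return true
		}
	}
	return false
}
```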
-
- 26 Apr, 2022 4 commits
-
-
Michael Schurter authored
Fixes #10200

**The bug**

A user reported receiving the following error when an alloc was placed that needed to preempt existing allocs:

```
[ERROR] client.alloc_watcher: error querying previous alloc: alloc_id=28... previous_alloc=8e... error="rpc error: alloc lookup failed: index error: UUID must be 36 characters"
```

The previous alloc (8e) was already complete on the client. This is possible if an alloc stops *after* the scheduling decision was made to preempt it, but *before* the node running both allocations was able to pull and start the preemptor. While that is hopefully a narrow window of time, you can expect it to occur in systems with heavy high-throughput batch scheduling. However, the RPC error made no sense: `previous_alloc` in the logs was a valid 36-character UUID!

**The fix**

The fix is:

```
- prevAllocID: c.Alloc.PreviousAllocation,
+ prevAllocID: watchedAllocID,
```

The alloc watcher constructor used for preemption improperly referenced `Alloc.PreviousAllocation` instead of the passed-in `watchedAllocID`. When multiple allocs are preempted, a watcher is created for each, with `watchedAllocID` set properly by the caller. In this case `Alloc.PreviousAllocation` was `""`, which is where the `UUID must be 36 characters` error was coming from. Sadly, the log line correctly referenced `watchedAllocID`, which is why the error looked so nonsensical!

**The repro**

I was able to reproduce this with a dev agent with [preemption enabled](https://gist.github.com/schmichael/53f79cbd898afdfab76865ad8c7fc6a0#file-preempt-hcl) and [lowered limits](https://gist.github.com/schmichael/53f79cbd898afdfab76865ad8c7fc6a0#file-limits-hcl) for ease of repro. First I started a [low priority count 3 job](https://gist.github.com/schmichael/53f79cbd898afdfab76865ad8c7fc6a0#file-preempt-lo-nomad), then a [high priority job](https://gist.github.com/schmichael/53f79cbd898afdfab76865ad8c7fc6a0#file-preempt-hi-nomad) that evicts 2 low priority jobs. Everything worked as expected. However, if I force it to use the [remotePrevAlloc implementation](https://github.com/hashicorp/nomad/blob/v1.3.0-beta.1/client/allocwatcher/alloc_watcher.go#L147), it reproduces the bug because the watcher references `PreviousAllocation` instead of `watchedAllocID`.
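A hedged sketch of the bug pattern, with simplified, hypothetical types (the real constructor lives in `client/allocwatcher/alloc_watcher.go`): a constructor that takes an explicit ID parameter but reads a struct field instead silently breaks for every caller that passes a different ID.

```go
package sketch

// Simplified, hypothetical stand-ins for the real types.
type Allocation struct{ PreviousAllocation string }

type config struct{ Alloc *Allocation }

type watcher struct{ prevAllocID string }

// Buggy: ignores the watchedAllocID argument. For preempted allocs,
// Alloc.PreviousAllocation is "", which is what produced the
// "UUID must be 36 characters" RPC error.
func newWatcherBuggy(c config, watchedAllocID string) *watcher {
	return &watcher{prevAllocID: c.Alloc.PreviousAllocation}
}

// Fixed: uses the ID the caller actually asked to watch.
func newWatcherFixed(c config, watchedAllocID string) *watcher {
	return &watcher{prevAllocID: watchedAllocID}
}
```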
-
Tim Gross authored
Part of ongoing work to remove the old E2E framework code.
-
Tim Gross authored
We moved nightly E2E off the old provisioning process to one driven entirely by Terraform quite a while ago. We're in the slow process of removing the framework code test-by-test, and this chunk of code no longer has any callers.
-
Tim Gross authored
We enforce exactly one plugin supervisor loop by checking whether `running` is set and returning early. This works but is fairly subtle. It can briefly result in two goroutines where one quickly exits before doing any work. Clarify the intent by using `sync.Once`. The goroutine we've spawned only exits when the entire task runner is being torn down, and not when the task driver restarts the workload, so it should never be re-run.
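A minimal sketch of the pattern described, with a simplified supervisor struct (not the actual plugin supervisor code): `sync.Once` makes the run-exactly-once intent explicit and removes the brief window in which a second goroutine could be spawned only to exit early.

```go
package sketch

import "sync"

type supervisor struct {
	loopOnce sync.Once
}

// ensureSupervisorLoop starts the supervisor loop exactly once, no matter
// how many times it is called. Unlike a guarded `running` boolean, there
// is no window where a second goroutine is spawned only to exit early.
func (s *supervisor) ensureSupervisorLoop() {
	s.loopOnce.Do(func() {
		go s.supervisorLoop()
	})
}

func (s *supervisor) supervisorLoop() {
	// Runs until the entire task runner is torn down; task driver restarts
	// do not stop it, so it must never be re-run.
}
```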
-
- 25 Apr, 2022 3 commits
-
-
Michael Schurter authored
The existing `ParseHCL` func didn't allow setting `HCLv1=true`.
-
Tim Gross authored
-
Luiz Aoqui authored
-
- 22 Apr, 2022 19 commits
-
-
Michael Schurter authored
* docs: update json jobs docs

Did you know that Nomad has not 1 but 2 JSON formats for jobs? 2½ if you want to acknowledge that sometimes our JSON job representations have a `Job` top-level wrapper and sometimes do not. The 2½ formats are:

```
1. HCL JSON
2. Input API JSON (top-level Job field)
2.5. Output API JSON (lacks top-level Job field)
```

`#2` is what our docs consider our API JSON. `#2.5` seems to be an accident of history we can't fix without breaking API compatibility. `#1` is an even more interesting accident of history: the `jobspec2` package automatically detects if the input to Parse is JSON and switches to a JSON parser. This behavior is undocumented, the format is unspecified, and there is no official HashiCorp tooling to produce this JSON from HCL. The plot thickens when you discover popular third-party tools like hcl2json.com and https://github.com/tmccombs/hcl2json seem to produce JSON that `nomad run` accepts! Since...
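For illustration, minimal sketches of the two API shapes; the job fields shown are a hypothetical minimum, not a complete jobspec. Input API JSON (`#2`) wraps the job in a top-level `Job` field:

```json
{"Job": {"ID": "example", "Name": "example", "Type": "service"}}
```

Output API JSON (`#2.5`) is the same object without the wrapper:

```json
{"ID": "example", "Name": "example", "Type": "service"}
```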
-
Jai authored
* chore: remove commented-out code and skipped tests
* refact: triggeredBy requires filter expression not qp
* refact: use filter expression DSL instead of named params (see the example request below)
* fix: add type
* docs: add in-line reference to filter expression DSL
* fix: update filter copy for non-matches
* fix: correct conditional logic to render no-match copy
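As a hedged illustration of the switch from a dedicated query parameter to the filter expression DSL (endpoint and field names are assumed from context, not verified against the UI code):

```
GET /v1/evaluations?filter=TriggeredBy%20%3D%3D%20%22job-register%22
```

which URL-decodes to the filter expression `TriggeredBy == "job-register"`.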
-
Phil Renaud authored
-
Phil Renaud authored
-
Tim Gross authored
The task runner hook `Prestart` response object includes a `Done` field that's intended to tell the client not to run the hook again. The plugin supervisor creates mount points for the task during prestart and saves these mounts in the hook resources. But if a client restarts, the hook resources will not be populated. If the plugin task restarts at any time after the client restarts, it will fail to have the correct mounts and will crash-loop until restart attempts run out. Fix this by not returning `Done` in the response, just as we do for the `volume_mount_hook`.
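A simplified sketch of the fix; the hook and response names are modeled on the description above, not the exact Nomad task runner interfaces:

```go
package sketch

// PrestartResponse mirrors the shape described above.
type PrestartResponse struct {
	Done bool // true tells the client never to run this hook again
}

type pluginSupervisorHook struct{}

func (h *pluginSupervisorHook) createMountPoints() error {
	// ... create the task's mount points and save them in hook resources ...
	return nil
}

// Prestart recreates the mounts on every run. Leaving Done as false means
// the hook runs again after a client restart, repopulating the hook
// resources the restarted client no longer has.
func (h *pluginSupervisorHook) Prestart(resp *PrestartResponse) error {
	if err := h.createMountPoints(); err != nil {
		return err
	}
	resp.Done = false // idempotent: safe and necessary to re-run
	return nil
}
```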
-
James Rasell authored
* deps: update consul-template to v0.29.0
* changelog: add entry for #12747
-
Phil Renaud authored
-
Phil Renaud authored
* Unknown status for allocations accounted for
* Canary string removed
* Test cleanup
* Generate unknown in mirage
* aacidentally oovervoowled
* Update ui/app/components/allocation-status-bar.js
* Disconnected state on job status in client
* Renaming Disconnected to Unknown in the job-status-in-client
* Unknown accounted for on job rows filtering and tests fix
* Adding lostAllocs as a computed dependency
* Unknown client status within acceptance test
* Swatches updated and PR comments addressed
* Unknown and disconnected added to test fixtures

Co-authored-by: Derek Strickland <1111455+DerekStrickland@users.noreply.github.com>
-
Luiz Aoqui authored
After a more detailed analysis of this feature, the approach taken in PR #12449 was found to be not ideal due to poor UX (users are responsible for setting the entity alias they would like to use) and issues around jobs potentially masquerading as other Vault entities.
-
Seth Hoenig authored
services: enable setting arbitrary address value in service registrations
-
Seth Hoenig authored
-
Seth Hoenig authored
-
Seth Hoenig authored
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
-
Seth Hoenig authored
This PR introduces the `address` field in the `service` block so that Nomad or Consul services can be registered with a custom address to advertise. The address can be an IP address or domain name. If the `address` field is set, `service.address_mode` must be set to `auto`.
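A hedged sketch using the Go API client, assuming the new `Address` field is exposed on `api.Service` alongside the existing `AddressMode` (check the actual release before relying on this):

```go
package sketch

import "github.com/hashicorp/nomad/api"

// exampleService advertises a custom address instead of the one Nomad
// would otherwise derive from the task's network configuration.
func exampleService() *api.Service {
	return &api.Service{
		Name:        "db",
		PortLabel:   "db",
		Address:     "db.example.com", // IP address or domain name to advertise
		AddressMode: "auto",           // must be "auto" when Address is set
	}
}
```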
-
Tim Gross authored
We don't need the absolute path for any of the commands in this script so long as we `cd` into the source directory path. Doing this removes the need for the weird platform-specific tricks we had to do with realpath vs GNU realpath.
-
James Rasell authored
-
Tim Gross authored
-
Tim Gross authored
-
James Rasell authored
-
- 21 Apr, 2022 13 commits
-
-
Tim Gross authored
The E2E test runner currently assumes it is running from the root of the Nomad repository. Make this run independent of the working directory, for the convenience of both developers and the test runner.
-
Michael Schurter authored
* cli: add -json flag to support job commands

While the CLI has always supported running JSON jobs, its support has been via HCLv2's JSON parsing. I have no idea what format it expects the job to be in, but it's absolutely not the same format the API expects. So I ignored that and added a new `-json` flag to explicitly support *API*-style JSON jobspecs. The jobspecs can even have the wrapping `{"Job": {...}}` envelope or not!

* docs: fix example for `nomad job validate`

We haven't been able to validate inside driver config stanzas ever since the move to task driver plugins.
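For example, assuming the flag behaves as described for `job run` (file name hypothetical):

```
$ nomad job run -json example.json
```

where `example.json` can be either a bare job object or wrapped in a top-level `{"Job": {...}}` envelope.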
-
Tim Gross authored
The new `namespace apply` feature that allows passing a namespace specification file detects the difference between an empty namespace and a namespace specification by checking whether the file exists. In most cases the file will have an extension like `.hcl`, so there's little danger that a user will apply a spec file when they intended to apply a namespace name. But because directory names typically don't include an extension, you're much more likely to collide when trying to `namespace apply` by name only, and then you get a confusing error message of the form:

    Failed to read file: read $namespace: is a directory

Detect the case where the namespace name collides with a directory in the current working directory, and skip trying to load the directory.
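A minimal sketch of the detection described, simplified from the prose (the real CLI code differs):

```go
package sketch

import "os"

// looksLikeSpecFile reports whether arg should be treated as a namespace
// specification file rather than a namespace name. A directory that merely
// shares the namespace's name must not be read as a spec file.
func looksLikeSpecFile(arg string) bool {
	info, err := os.Stat(arg)
	if err != nil {
		return false // nothing on disk: treat arg as a namespace name
	}
	return !info.IsDir() // skip directories that collide with the name
}
```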
-
Phil Renaud authored
* Allocation page link fix
* fix added to task page and computed prop moved to allocation model
* Fallback query added to task group when specific volume isn't knowable
* Delog
* link text reflects alloc suffix
* Helper instead of in-template conditionals
* formatVolumeName unit test
* Removing unused helper import
-
Seth Hoenig authored
build: update golang to 1.17.9
-
Seth Hoenig authored
-
Seth Hoenig authored
build: update ec2 instance profiles
-
Seth Hoenig authored
using tools/ec2info
-
Seth Hoenig authored
-
Tim Gross authored
When shutting down an allocation that ends up needing to be force-killed, we're getting a spurious "OOM Killed (137)" message on the task termination event. We introduced this as part of cgroups v2 support because the Docker daemon isn't detecting the container status correctly. Although exit code 137 is the exit code we get for OOM-killed processes, that's only because an OOM kill is delivered as `SIGKILL`: a signal-terminated process exits with 128 plus the signal number, and `SIGKILL` is signal 9, so 128 + 9 = 137. Any SIGKILLed process gets that exit code, OOM-killed or not.
-
Tim Gross authored
The CSI plugin allocations take a while to be marked healthy, sometimes causing E2E test flakes during the setup phase of the tests. There's nothing CSI-specific about marking plugin allocs healthy, as the plugin supervisor hook does all the fingerprinting in the postrun hook (the prestart hook just makes a couple of empty directories). The timeouts we're seeing may be because of where we're pulling the images from; most of our jobs pull from a CDN-backed public registry, whereas these pull from ECR. Set a 1-minute timeout for these to make sure we have enough time to pull the image and start the task.
-
James Rasell authored
-
Tim Gross authored
-