Commits · adcbcc129b25ad9771fa8a9a3e4de9dff0c93ec3 · 小白蛋 / Nomad

This project is mirrored from https://gitee.com/mirrors/nomad.git.

18 Jun, 2021 5 commits

consul/connect: Validate uniqueness of Connect upstreams within task group · adcbcc12

Seth Hoenig authored 3 years ago

This PR adds validation during job submission that Connect proxy upstreams
within a task group are using different listener addresses. Otherwise, a
duplicate envoy listener will be created and not be able to bind.

Closes #7833
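
The validation described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the `Upstream` type, its field names, the `validateUniqueListeners` helper, and the `127.0.0.1` default bind address are all assumptions made for the example.

```go
package main

import (
	"fmt"
)

// Upstream is a simplified, hypothetical stand-in for a Connect upstream.
// The field names are assumptions for illustration only.
type Upstream struct {
	DestinationName  string
	LocalBindAddress string // empty means "use the default address"
	LocalBindPort    int
}

// validateUniqueListeners returns an error if two upstreams in the same
// task group would ask the proxy to bind the same address and port.
func validateUniqueListeners(upstreams []Upstream) error {
	seen := make(map[string]string) // listener -> destination that claimed it
	for _, up := range upstreams {
		addr := up.LocalBindAddress
		if addr == "" {
			addr = "127.0.0.1" // assumed default for this sketch
		}
		listener := fmt.Sprintf("%s:%d", addr, up.LocalBindPort)
		if prev, ok := seen[listener]; ok {
			return fmt.Errorf("listener %s is used by both %q and %q",
				listener, prev, up.DestinationName)
		}
		seen[listener] = up.DestinationName
	}
	return nil
}

func main() {
	ups := []Upstream{
		{DestinationName: "db", LocalBindPort: 9000},
		{DestinationName: "cache", LocalBindPort: 9000},
	}
	// The two upstreams share a listener, so an error is reported.
	fmt.Println(validateUniqueListeners(ups))
}
```

Rejecting the job at submission time surfaces the conflict to the operator immediately, rather than later when the proxy fails to bind.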

adcbcc12

Merge pull request #10784 from hashicorp/b-dlskf · 6dcada43
Seth Hoenig authored 3 years ago
```
e2e: fix a couple recent e2e bugs
```
6dcada43

e2e: use -detach mode when registering jobs with cli · 15d39f0d

Seth Hoenig authored 3 years ago

This PR changes the e2e helper to set the -detach option
when registering a job with the CLI instead of the API. This is
necessary for jobs which never become healthy: the deployment
never finishes for failing jobs, so the command never returns,
causing the test to time out after 10 minutes.

15d39f0d

consul: set task name only for group service checks · 57fdb814

Seth Hoenig authored 3 years ago

This PR fixes a bug introduced in a refactoring

https://github.com/hashicorp/nomad/pull/10764/files#diff-56b3c82fcbc857f8fb93a903f1610f6e6859b3610a4eddf92bad9ea27fdc85ec

where task-level service checks would inherit the task name
field, when they shouldn't.

Fixes #10781

57fdb814

tests: allocrunner CNI tests are Linux-only (#10783) · 2520d83e

Tim Gross authored 3 years ago

The `client/allocrunner` tests fail to compile on macOS because the
CNI test file depends on the CNI network configurator, which is in a
Linux-only file.

2520d83e

17 Jun, 2021 4 commits
- deps: bump go-getter to 1.5.4 (#10778) · 77f6ecbb
  Tim Gross authored 3 years ago
  
  77f6ecbb
- Merge pull request #10776 from hashicorp/b-cns-sysjob-ups · 2d8fc6b3
  Seth Hoenig authored 3 years ago
```
consul/connect: in-place update service definition when connect upstreams are modified
```
  2d8fc6b3
- docs: host_network does support Docker task port mapping (#10774) · ad3070a1
  Tim Gross authored 3 years ago
  
  ad3070a1
- changelog entry for #10756 · b0922e90
  Tim Gross authored 3 years ago
  
  b0922e90
16 Jun, 2021 4 commits

consul/connect: in-place update service definition when connect upstreams are modified · 7ba60b4e

Seth Hoenig authored 3 years ago

This PR fixes a bug where modifying the upstreams of a Connect sidecar proxy
would not result in Consul applying the changes, unless an additional change to
the job triggered a task replacement (thus replacing the service definition).

The fix is to check if upstreams have been modified between Nomad's view of the
sidecar service definition, and the service definition for the sidecar that is
actually registered in Consul.

Fixes #8754
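
The comparison at the heart of this fix might look something like the sketch below. The `Upstream` type and the `upstreamsDiffer` and `sortedCopy` helpers are hypothetical simplifications for illustration; the real service definitions carry more fields than shown here.

```go
package main

import (
	"fmt"
	"sort"
)

// Upstream is a simplified, hypothetical view of a sidecar upstream.
type Upstream struct {
	DestinationName string
	LocalBindPort   int
}

// sortedCopy returns a sorted copy so the comparison ignores ordering
// and does not mutate the caller's slice.
func sortedCopy(in []Upstream) []Upstream {
	out := append([]Upstream(nil), in...)
	sort.Slice(out, func(i, j int) bool {
		if out[i].DestinationName != out[j].DestinationName {
			return out[i].DestinationName < out[j].DestinationName
		}
		return out[i].LocalBindPort < out[j].LocalBindPort
	})
	return out
}

// upstreamsDiffer reports whether Nomad's view of the upstreams differs
// from what is registered in Consul, in which case the service
// definition would need an in-place update.
func upstreamsDiffer(nomadView, consulView []Upstream) bool {
	if len(nomadView) != len(consulView) {
		return true
	}
	a, b := sortedCopy(nomadView), sortedCopy(consulView)
	for i := range a {
		if a[i] != b[i] {
			return true
		}
	}
	return false
}

func main() {
	nomadView := []Upstream{{"db", 9000}, {"cache", 9001}}
	consulView := []Upstream{{"db", 9000}}
	fmt.Println("needs update:", upstreamsDiffer(nomadView, consulView))
}
```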

7ba60b4e

docker: generate /etc/hosts file for bridge network mode (#10766) · 2a640f0b

Tim Gross authored 3 years ago

When `network.mode = "bridge"`, we create a pause container in Docker with no
networking so that we have a process to hold the network namespace we create
in Nomad. The default `/etc/hosts` file of that pause container is then used
for all the Docker tasks that share that network namespace. Some applications
rely on this file being populated.

This changeset generates a `/etc/hosts` file and bind-mounts it to the
container when Nomad owns the network, so that the container's hostname has an
IP in the file as expected. The hosts file will include the entries added by
the Docker driver's `extra_hosts` field.

In this changeset, only the Docker task driver will take advantage of this
option, as the `exec`/`java` drivers currently copy the host's `/etc/hosts`
file and this can't be changed without breaking backwards compatibility. But
the fields are available in the task driver protobuf for community task
drivers to use if they'd like.

2a640f0b

build(deps): bump postcss from 7.0.35 to 7.0.36 in /website (#10772) · 3b5bca63

dependabot[bot] authored 3 years ago

Bumps [postcss](https://github.com/postcss/postcss) from 7.0.35 to 7.0.36.
- [Release notes](https://github.com/postcss/postcss/releases)
- [Changelog](https://github.com/postcss/postcss/blob/main/CHANGELOG.md)
- [Commits](https://github.com/postcss/postcss/compare/7.0.35...7.0.36)

---
updated-dependencies:
- dependency-name: postcss
  dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

3b5bca63

build(deps): bump ws from 7.3.1 to 7.4.6 in /scripts/screenshots/src (#10671) · 0983b073

dependabot[bot] authored 3 years ago

Bumps [ws](https://github.com/websockets/ws) from 7.3.1 to 7.4.6.
- [Release notes](https://github.com/websockets/ws/releases)
- [Commits](https://github.com/websockets/ws/compare/7.3.1...7.4.6)
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

0983b073

15 Jun, 2021 9 commits

Merge pull request #10765 from hashicorp/b-java-fp-version · e16a3516
Seth Hoenig authored 3 years ago
```
client/fingerprint/java: improve java version string regex matching
```
e16a3516

client/fingerprint/java: improve java version string regex matching · 674183c3

Seth Hoenig authored 3 years ago

This PR improves the regular expression used for matching the java
version string, which varies a lot depending on the java vendor and
version.

These are the example strings we now test for:

java version "1.7.0_80"
openjdk version "11.0.1" 2018-10-16
openjdk version "11.0.1" 2018-10-16
java version "1.6.0_36"
openjdk version "1.8.0_192"
openjdk 11.0.11 2021-04-20 LTS

The last one is a new test added on behalf of #6081, which is
still broken on today's CentOS 7 default JDK package.

openjdk 11.0.11 2021-04-20 LTS
OpenJDK Runtime Environment 18.9 (build 11.0.11+9-LTS)
OpenJDK 64-Bit Server VM 18.9 (build 11.0.11+9-LTS, mixed mode, sharing)

==> Evaluation "21c6caf7" finished with status "complete" but failed to place all allocations:
    Task Group "example" (failed to place 1 allocation):
      * Constraint "${driver.java.version} >= 11.0.0": 1 nodes excluded by filter
    Evaluation "2b737d48" waiting for additional capacity to place r...
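
For illustration, a pattern that accepts all of the example strings above could be written as follows. This is not the regular expression from the PR; the pattern and the `parseJavaVersion` helper are assumptions sketched only to show the shape of the problem (an optional `version` keyword and optional quotes).

```go
package main

import (
	"fmt"
	"regexp"
)

// versionRe is an illustrative pattern, not the one used by the PR.
// It allows an optional "version" keyword and optional quotes around
// the version number.
var versionRe = regexp.MustCompile(
	`^(?:java|openjdk)(?: version)? "?([0-9]+(?:\.[0-9]+)*(?:_[0-9]+)?)"?`)

// parseJavaVersion extracts the version from the first line of
// `java -version` output. The helper name is hypothetical.
func parseJavaVersion(firstLine string) (string, bool) {
	m := versionRe.FindStringSubmatch(firstLine)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	lines := []string{
		`java version "1.7.0_80"`,
		`openjdk version "11.0.1" 2018-10-16`,
		`openjdk 11.0.11 2021-04-20 LTS`,
	}
	for _, line := range lines {
		v, ok := parseJavaVersion(line)
		fmt.Println(v, ok)
	}
}
```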

674183c3

Merge pull request #10764 from hashicorp/b-passfail-lost · 52bf1977
Seth Hoenig authored 3 years ago
```
consul: make failures_before_critical and success_before_passing work with group services
```
52bf1977

docs: add bugfix note to 1.0.8 · 0ef0b2ef
Seth Hoenig authored 3 years ago

0ef0b2ef

consul: make failures_before_critical and success_before_passing work with group services · b4a631c1

Seth Hoenig authored 3 years ago

This PR fixes some job submission plumbing to make sure the Consul check parameters
- failures_before_critical
- success_before_passing

work with group-level services. They already work with task-level services.

b4a631c1

Merge pull request #10762 from hashicorp/docs-update-cl-2 · ab9b589b
Seth Hoenig authored 3 years ago
```
docs: update changelog
```
ab9b589b

docs: update changelog · d7530f04
Seth Hoenig authored 3 years ago

d7530f04

Merge pull request #10758 from hashicorp/b-fix-test-datarace-plugins · c3b15b87
James Rasell authored 3 years ago
```
plugins: fix test data race.
```
c3b15b87

plugins: fix test data race. · ff4cd338
James Rasell authored 3 years ago

ff4cd338

14 Jun, 2021 7 commits

cli: check deployment exists before monitoring (#10757) · ca010f9f

Isabel Suchanek authored 3 years ago


System and batch jobs don't create deployments, which means Nomad tries
to monitor a non-existent deployment when it runs a job and outputs an
error message. This adds a check to make sure a deployment exists before
monitoring. Also fixes some formatting.

Co-authored-by: Tim Gross <tgross@hashicorp.com>

ca010f9f

deployment watcher: Reuse allocsCh if allocIndex remains the same (#10756) · 8052ae1d

Mahmood Ali authored 3 years ago

Fix deployment watchers to avoid creating unnecessary deployment watcher goroutines and blocking queries. `deploymentWatcher.getAllocsCh` creates a new goroutine that makes a blocking query to fetch updates of deployment allocs.

## Background

When operators submit a new or updated service job, Nomad creates a new deployment by default. The deployment object controls how fast to place the allocations through [`max_parallel`](https://www.nomadproject.io/docs/job-specification/update#max_parallel) and health check configuration.

The `scheduler` and `deploymentwatcher` packages collaborate to implement deployment logic: the scheduler only places the canaries and `max_parallel` allocations for a new deployment; the `deploymentwatcher` monitors alloc progress and then enqueues a new evaluation whenever the scheduler should reprocess a job and place the next `max_parallel` round of allocations.

The `deploymentwatcher` package makes blocking queries against the state store to fetch all deployments and the relevant allocs for each running deployment. If `deploymentwatcher` fails or is hindered from fetching the state, the deployments fail to make progress.

`Deploymentwatcher` logic only runs on the leader.

## Why unnecessary deployment watchers can halt cluster progress

Previously, `getAllocsCh` was called on every iteration of the for-loop in the `deploymentWatcher.watch()` function. However, the for-loop may iterate many times before the allocs get updated. In fact, whenever a deployment is created, updated, or deleted, *all* `deploymentWatcher`s get notified through `w.deploymentUpdateCh`. The number of `getAllocsCh` goroutines and blocking queries spikes and grows quadratically with respect to the number of running deployments. The growth leads to two adverse outcomes:

1. it spikes CPU and memory usage, potentially leading to OOM or very slow processing
2. it activates the [query rate limiter](https://github.com/hashicorp/nomad/blob/abaa9c5c5bd09af774fda30d76d5767b06128df4/nomad/deploymentwatcher/deployment_watcher.go#L896-L898), so later the watcher fails to get updates and consequently fails to make progress towards placing new allocations for the deployment!

So the cluster fails to catch up and fails to make progress in almost all deployments. The cluster recovers after a leader transition: the deposed leader stops all watchers and frees up goroutines and blocking queries; the new leader recreates the watchers without the quadratic growth and remains under the rate limiter, at least until a spike of newly created deployments triggers the condition again.
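
The reuse described in the commit title can be modeled with a toy loop like the one below. The `watcher` type, its `queriesStarted` counter, and the `run` method are hypothetical; the real `getAllocsCh` spawns a goroutine that makes a blocking query, which this sketch only counts.

```go
package main

import "fmt"

// watcher is a toy model of a deployment watcher. All names besides
// getAllocsCh are hypothetical.
type watcher struct {
	queriesStarted int
}

// getAllocsCh stands in for the expensive call. Here it only counts
// invocations and returns a channel that is never written to.
func (w *watcher) getAllocsCh(index uint64) <-chan uint64 {
	w.queriesStarted++
	return make(chan uint64, 1)
}

// run simulates loop wake-ups. Each element is the alloc index seen on
// that iteration; most wake-ups are caused by unrelated deployments and
// leave the index unchanged.
func (w *watcher) run(indexes []uint64) {
	var allocsCh <-chan uint64
	var lastIndex uint64
	for _, idx := range indexes {
		// Only start a new blocking query when the index moved;
		// otherwise keep waiting on the existing channel.
		if allocsCh == nil || idx != lastIndex {
			allocsCh = w.getAllocsCh(idx)
			lastIndex = idx
		}
	}
}

func main() {
	w := &watcher{}
	w.run([]uint64{1, 1, 1, 2, 2, 3})
	fmt.Println("queries started:", w.queriesStarted)
}
```

With this guard, the number of queries tracks the number of index changes rather than the number of loop iterations.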

### Relevant Code References

Path for deployment monitoring:
* [`Watcher.watchDeployments`](https://github.com/hashicorp/nomad/blob/abaa9c5c5bd09af774fda30d76d5767b06128df4/nomad/deploymentwatcher/deployments_watcher.go#L164-L192) loops waiting for deployment updates.
* On every deployment update, [`w.getDeploys`](https://github.com/hashicorp/nomad/blob/abaa9c5c5bd09af774fda30d76d5767b06128df4/nomad/deploymentwatcher/deployments_watcher.go#L194-L229) returns all deployments in the system
* `watchDeployments` calls `w.add(d)` on every active deployment
* which in turn [updates the existing watcher if one is found](https://github.com/hashicorp/nomad/blob/abaa9c5c5bd09af774fda30d76d5767b06128df4/nomad/deploymentwatcher/deployments_watcher.go#L251-L255).
* The deployment watcher [updates its local deployment field and triggers the `deploymentUpdateCh` channel](https://github.com/hashicorp/nomad/blob/abaa9c5c5bd09af774fda30d76d5767b06128df4/nomad/deploymentwatcher/deployment_watcher.go#L136-L147).
* The [deployment watcher `deploymentUpdateCh` selector is activated](https://github.com/hashicorp/nomad/blob/abaa9c5c5bd09af774fda30d76d5767b06128df4/nomad/deploymentwatcher/deployment_watcher.go#L455-L489). Most of the time the selector clause is a no-op, because the flow was triggered by an update to another deployment.
* The `watch` for-loop iterates again, and in the previous code we create yet another goroutine and blocking call that risks being rate limited.

Co-authored-by: Tim Gross <tgross@hashicorp.com>

8052ae1d

Merge pull request #10754 from hashicorp/b-client-connect-constraint · 3a5cbc47
Seth Hoenig authored 3 years ago
```
consul/connect: remove unnecessary connect constraint on clients
```
3a5cbc47

Merge pull request #10752 from hashicorp/b-fix-test-datarace-volumewatcher · 7019bc2b
James Rasell authored 3 years ago
```
volumewatcher: fix test data race.
```
7019bc2b

quotas: evaluate quota feasibility last in scheduler (#10753) · 2b63a093

Tim Gross authored 3 years ago

The `QuotaIterator` is used as the source of nodes passed into feasibility
checking for constraints. Every node that passes the quota check counts the
allocation resources against the quota, and as a result we count nodes which
will later be filtered out by constraints. Therefore, for jobs with
constraints, nodes that are feasibility checked but fail have still been
counted against quotas. This failure mode is order dependent; if all the
unfiltered nodes happen to be quota checked first, everything works as expected.

This changeset moves the `QuotaIterator` to happen last among all feasibility
checkers (but before ranking). The `QuotaIterator` will never receive filtered
nodes so it will calculate quotas correctly.
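
The effect of the ordering can be shown with a toy model. This sketch is a simplification and does not reflect the scheduler's actual iterator types: the `node` struct and the `quotaCharges` function are hypothetical, and "charging" is reduced to incrementing a counter.

```go
package main

import "fmt"

// node is a hypothetical, minimal node description.
type node struct {
	name  string
	linux bool
}

// quotaCharges counts how many nodes get charged against the quota.
// quotaFirst=true models the old ordering (quota check feeds the
// constraint checks); quotaFirst=false models the new ordering (quota
// check runs after all feasibility checks).
func quotaCharges(nodes []node, feasible func(node) bool, quotaFirst bool) int {
	charged := 0
	for _, n := range nodes {
		if quotaFirst {
			charged++ // charged before we know the node is feasible
			if !feasible(n) {
				continue
			}
		} else {
			if !feasible(n) {
				continue // filtered nodes never reach the quota check
			}
			charged++
		}
	}
	return charged
}

func main() {
	nodes := []node{{"a", false}, {"b", false}, {"c", true}}
	onlyLinux := func(n node) bool { return n.linux }
	fmt.Println("quota first:", quotaCharges(nodes, onlyLinux, true))
	fmt.Println("quota last: ", quotaCharges(nodes, onlyLinux, false))
}
```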

2b63a093

consul/connect: remove unnecessary connect constraint on clients · 0d13ef0c

Seth Hoenig authored 3 years ago

PR https://github.com/hashicorp/nomad/pull/10702 added 2 new constraints
for connect jobs - one for Consul gRPC listener, and one for Connect being
enabled on Clients. Connect does not need to be enabled on clients, only
on Consul servers. Remove the extra constraint.

Discuss:
https://discuss.hashicorp.com/t/nomad-1-1-1-and-consul-connect-enabled-on-consul-clients/25295

0d13ef0c

volumewatcher: fix test data race. · b6505c23
James Rasell authored 3 years ago

b6505c23

11 Jun, 2021 11 commits
- Merge pull request #10750 from hashicorp/br.quote-image · ff2e2c11
  Brandon Romano authored 3 years ago
```
Fix headshot image 404
```
  ff2e2c11
- Fix headshot image 404 · 399dd84a
  Brandon Romano authored 3 years ago
  
  399dd84a
- fix agent-info help message formatting (#10747) · 5cfc104a
  Luiz Aoqui authored 3 years ago
  
  5cfc104a
- Merge pull request #10745 from hashicorp/b-fix-test-datarace-deploymentwatcher · 88e456d9
  James Rasell authored 4 years ago
```
deploymentwatcher: fix test data race.
```
  88e456d9
- Merge pull request #10744 from hashicorp/b-remove-duplicate-imports · a7d055a5
  James Rasell authored 4 years ago
```
chore: remove duplicate import statements
```
  a7d055a5
- Merge pull request #10742 from hashicorp/deflake-tests-20210608 · c467de33
  Mahmood Ali authored 4 years ago
```
Deflaking Test 2021 June edition
```
  c467de33
- Merge pull request #10739 from hashicorp/f-remove-unused-types-pkg · 9c926d7a
  James Rasell authored 4 years ago
```
core: remove unused types pkg and PeriodicCallback type.
```
  9c926d7a
- deploymentwatcher: fix test data race. · 9a25e89b
  James Rasell authored 4 years ago
  
  9a25e89b
- tests: remove duplicate import statements. · a4156c3e
  James Rasell authored 4 years ago
  
  a4156c3e
- jobspec2: remove duplicate import statements. · d1db1414
  James Rasell authored 4 years ago
  
  d1db1414
- drivers: remove duplicate import statements. · b3fe60a7
  James Rasell authored 4 years ago
  
  b3fe60a7