Commits · jrasell/gh-13120-sso-http-api · 小白蛋 / Nomad

This project is mirrored from https://gitee.com/mirrors/nomad.git. Pull mirroring failed 2 years ago.

22 Nov, 2022 1 commit
- pr: fix-up based on feedback from @pkazmierczak · 5fa30d51
  James Rasell authored 2 years ago
  
  5fa30d51
21 Nov, 2022 5 commits
- api: add ACL auth-method client. · b703fc3e
  James Rasell authored 2 years ago
  
  b703fc3e
- agent: add ACL auth-method HTTP endpoints for CRUD actions. · f5e65852
  James Rasell authored 2 years ago
  
  f5e65852
- core: remove custom auth-method TTLs and use ACL token TTLs. · 898014f9
  James Rasell authored 2 years ago
  
  898014f9
- acl: sso auth method RPC endpoints (#15221) · b7ddd5bf
  Piotr Kazmierczak authored 2 years ago
```
This PR implements RPC endpoints for SSO auth methods.

This PR is part of the SSO work captured under ☂️ ticket #13120.
```
  b7ddd5bf
- acl: sso auth method event stream (#15280) · fee85dac
  Piotr Kazmierczak authored 2 years ago
```
This PR implements SSO auth method support in the event stream.

This PR is part of the SSO work captured under ☂️ ticket #13120.
```
  fee85dac
19 Nov, 2022 1 commit
- [ui] Show Consul Connect upstreams / on update info in sidebar (#15324) · 4703f55d
  Phil Renaud authored 2 years ago
```
* Added consul connect icon and sidebar info

* Show icon to the right of name
```
  4703f55d
18 Nov, 2022 5 commits
- e2e: jammy image needs latest java lts (#15323) · 78593daa
  Seth Hoenig authored 2 years ago
  
  78593daa
- api: ensure ACL role upsert decode error returns a 400 status code. (#15253) · faabc2b2
  James Rasell authored 2 years ago
  
  faabc2b2
- api: ensure all request body decode errors return a 400 status code. (#15252) · c495cd99
  James Rasell authored 2 years ago
  
  c495cd99
- docs: add cpu-allocated and memory-allocated (#15299) · 329807bd
  Luiz Aoqui authored 2 years ago
```
Document the Autoscaler Nomad APM parameters `cpu-allocated` and
`memory-allocated` that were implemented in
https://github.com/hashicorp/nomad-autoscaler/pull/324 and
https://github.com/hashicorp/nomad-autoscaler/pull/334
```
  329807bd
- make eval cancelation really async with `Eval.Ack` (#15298) · 991e9a27
  Tim Gross authored 2 years ago
```
Ensure we never block in the `Eval.Ack`
```
  991e9a27
17 Nov, 2022 10 commits

scheduler: log stack in case of panic (#15303) · 6a3cf74f
Luiz Aoqui authored 2 years ago

6a3cf74f

Add mount propagation to protobuf definition of mounts (#15096) · 5ce42fe8

stswidwinski authored 2 years ago


* Add mount propagation to protobuf definition of mounts

* Fix formatting

* Add mount propagation to the simple roundtrip test.

* changelog: add entry for #15096
Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>

5ce42fe8

make eval cancelation async with `Eval.Ack` (#15294) · eb1507c8

Tim Gross authored 2 years ago

In #14621 we added an eval cancelation reaper goroutine with a channel that
allowed us to wake it up. But we forgot to actually send on this channel from
`Eval.Ack` and are still committing the cancelations synchronously. Fix this by
sending on the buffered channel to wake up the reaper instead.

eb1507c8
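
The wake-up pattern described in this commit message can be sketched in Go as a non-blocking send on a buffered channel. This is only an illustration: the `reaper` type and its `ack` and `reapOnce` methods are hypothetical names, not taken from the Nomad codebase, and the real reaper writes its batches to raft rather than counting them.

```go
package main

import (
	"fmt"
	"sync"
)

// reaper is a hypothetical stand-in for the cancelation reaper; none of these
// names come from the Nomad codebase.
type reaper struct {
	wake    chan struct{} // capacity 1: at most one wake-up is ever queued
	mu      sync.Mutex
	pending []string
}

func newReaper() *reaper {
	return &reaper{wake: make(chan struct{}, 1)}
}

// ack queues IDs for cancelation and nudges the reaper. The select with a
// default case means the caller never blocks, even if nobody is receiving.
func (r *reaper) ack(ids ...string) {
	r.mu.Lock()
	r.pending = append(r.pending, ids...)
	r.mu.Unlock()

	select {
	case r.wake <- struct{}{}:
	default: // a wake-up is already queued; the reaper will see these IDs too
	}
}

// reapOnce takes everything queued so far as one batch and returns its size.
func (r *reaper) reapOnce() int {
	r.mu.Lock()
	defer r.mu.Unlock()
	n := len(r.pending)
	r.pending = nil
	return n
}

func main() {
	r := newReaper()
	r.ack("eval-1")
	r.ack("eval-2", "eval-3") // returns immediately; the buffer is already full

	<-r.wake // a reaper goroutine would receive this in its loop
	fmt.Println("batch size:", r.reapOnce())
}
```

Because the channel holds at most one pending signal, several acknowledgements that arrive before the reaper runs collapse into a single wake-up and a single batch.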

autopilot: include only servers from the same region (#15290) · 6eb1f99f

Tim Gross authored 2 years ago

When we migrated to the updated autopilot library in Nomad 1.4.0, the interface
for finding servers changed. Previously autopilot would get the serf members and
call `IsServer` on each of them, leaving it up to the implementor to filter out
clients (and in Nomad's case, other regions). But in the "new" autopilot
library, the equivalent interface is `KnownServers` for which we did not filter
by region. This causes spurious attempts for the cross-region stats fetching,
which results in TLS errors and a lot of log noise.

Filter the member set by region to fix the regression.

6eb1f99f
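
As a rough illustration of the fix described above, the sketch below filters a member list down to servers in one region. The `member` type and `serversInRegion` function are hypothetical simplifications; the actual autopilot interface and serf member types are not reproduced here.

```go
package main

import "fmt"

// member is a hypothetical, simplified view of a gossip member.
type member struct {
	Name   string
	Region string
	Server bool
}

// serversInRegion keeps only server members whose region matches, so that
// clients and servers from other regions are never handed to autopilot.
func serversInRegion(members []member, region string) []member {
	var out []member
	for _, m := range members {
		if m.Server && m.Region == region {
			out = append(out, m)
		}
	}
	return out
}

func main() {
	members := []member{
		{Name: "srv-east-1", Region: "east", Server: true},
		{Name: "srv-west-1", Region: "west", Server: true},
		{Name: "client-east-1", Region: "east", Server: false},
	}
	for _, m := range serversInRegion(members, "east") {
		fmt.Println(m.Name)
	}
}
```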

remove deprecated `AllocUpdateRequestType` raft entry (#15285) · 21c2d159

Tim Gross authored 2 years ago

After Deployments were added in Nomad 0.6.0, the `AllocUpdateRequestType` raft
log entry was no longer in use. Mark this as deprecated, remove the associated
dead code, and remove references to the metrics it emits from the docs. We'll
leave the entry itself just in case we encounter old raft logs that we need to
be able to safely load.

21c2d159

e2e: disable systemd stub dns in jammy image (#15286) · 7c254ccd
Seth Hoenig authored 2 years ago

7c254ccd

Fix goroutine leakage (#15180) · d16a2c94

stswidwinski authored 2 years ago


* Fix goroutine leakage

* cl: add cl entry
Co-authored-by: Seth Hoenig <shoenig@duck.com>

d16a2c94

ci: use hashicorp/setup-golang for setting up go compiler, cache (#15271) · 732adae9

Seth Hoenig authored 2 years ago

This PR changes test-core to make use of

https://github.com/hashicorp/setup-golang

to consolidate the setting up of the Go compiler and the Go modules cache
used for the CI job.

Fixes: #14905

732adae9

keyring: update handle to state inside replication loop (#15227) · f54a50bb

Tim Gross authored 2 years ago

* keyring: update handle to state inside replication loop

When keyring replication starts, we take a handle to the state store. But
whenever a snapshot is restored, this handle is invalidated and no longer points
to a state store that is receiving new keys. This leaks a bunch of memory too!

In addition to operator-initiated restores, when fresh servers are added to
existing clusters with large-enough state, the keyring replication can get
started quickly enough that it's running before the snapshot from the existing
cluster has been restored.

Fix this by updating the handle to the state store on each pass.

f54a50bb

fix create snapshot request docs (#15242) · 322c6b3d
Ayrat Badykov authored 2 years ago

322c6b3d

16 Nov, 2022 3 commits

eval broker: shed all but one blocked eval per job after ack (#14621) · 1c4307b8

Tim Gross authored 2 years ago

When an evaluation is acknowledged by a scheduler, the resulting plan is
guaranteed to cover up to the `waitIndex` set by the worker based on the most
recent evaluation for that job in the state store. At that point, we no longer
need to retain blocked evaluations in the broker that are older than that index.

Move all but the highest priority / highest `ModifyIndex` blocked eval into a
canceled set. When the `Eval.Ack` RPC returns from the eval broker it will
signal a reap of a batch of cancelable evals to write to raft. This paces the
cancelations limited by how frequently the schedulers are acknowledging evals;
this should reduce the risk of cancelations overwhelming raft relative to
scheduler progress. In order to avoid straggling batches when the cluster is
quiet, we also include a periodic sweep through the cancelable list.

1c4307b8

e2e: swap bionic image for jammy (#15220) · 0e3606af
Seth Hoenig authored 2 years ago

0e3606af

test: ensure leader is still valid in reelection test (#15267) · 460f19b6

Tim Gross authored 2 years ago

The `TestLeader_Reelection` test waits for a leader to be elected and then makes
some other assertions. But it implicitly assumes that there's no failure of
leadership before shutting down the leader, which can lead to a panic in the
tests. Assert there's still a leader before the shutdown.

460f19b6

15 Nov, 2022 4 commits
- feat: add tooltip to storage volumes (#15245) · 3743e913
  Jai authored 2 years ago
```
* feat: add tooltip to storage volumes

* chore: move Tooltip into td to preserve style

* styling: add overflow-x to section (#15246)

* styling: add overflow-x to section

* refact: use media query with display block
```
  3743e913
- refact: remove unused API (#15244) · 22f9c554
  Jai authored 2 years ago
  
  22f9c554
- agent: ensure all HTTP Server methods are pointer receivers. (#15250) · a3f30182
  James Rasell authored 2 years ago
  
  a3f30182
- Fix variable create API example in docs (#15248) · b55ab631
  Nikita Beletskii authored 2 years ago
  
  b55ab631
14 Nov, 2022 3 commits

eval delete: move batching of deletes into RPC handler and state (#15117) · 65b3d01a

Tim Gross authored 2 years ago

During unusual outage recovery scenarios on large clusters, a backlog of
millions of evaluations can appear. In these cases, the `eval delete` command can
put excessive load on the cluster by listing large sets of evals to extract the
IDs and then sending large batches of IDs. Although the command's batch size
was carefully tuned, we still need to JSON deserialize, re-serialize to
MessagePack, send the log entries through raft, and get the FSM applied.

To improve performance of this recovery case, move the batching process into the
RPC handler and the state store. The design here is a little weird, so let's
look at the failed options first:

* A naive solution here would be to just send the filter as the raft request and
  let the FSM apply delete the whole set in a single operation. Benchmarking with
  1M evals on a 3 node cluster demonstrated this can block the FSM apply for
  several minutes, which puts the cluster at risk if there's a leadership
  failover (the barrier write can't be made while this apply is in-flight).

* A less naive but still bad solution would be to have the RPC handler filter
  and paginate, and then hand a list of IDs to the existing raft log
  entry. Benchmarks showed this blocked the FSM apply for 20-30s at a time and
  took roughly an hour to complete.

Instead, we're filtering and paginating in the RPC handler to find a page token,
and then passing both the filter and page token in the raft log. The FSM apply
recreates the paginator using the filter and page token to get roughly the same
page of evaluations, which it then deletes. The pagination process is fairly
cheap (only about 5% of the total FSM apply time), so counter-intuitively this
rework ends up being much faster. A benchmark of 1M evaluations showed this
blocked the FSM apply for 20-30ms at a time (typical for normal operations) and
completes in less than 4 minutes.

Note that, as with the existing design, this delete is not consistent: a new
evaluation inserted "behind" the cursor of the pagination will fail to be
deleted.

65b3d01a

Fix wrong reference to `vault` (#15228) · 1217a96e
Douglas Jose authored 2 years ago

1217a96e
Fix broken URL to nvidia device plugin (#15234) · 263ed6f9
Kyle Root authored 2 years ago

263ed6f9

11 Nov, 2022 3 commits

[bug] Return a spec on reconnect (#15214) · 9ad90290

Charlie Voiselle authored 2 years ago

client: fixed a bug where non-`docker` tasks with network isolation would leak network namespaces and iptables rules if the client was restarted while they were running

9ad90290

client: avoid unconsumed channel in timer construction (#15215) · 5f3f5215

Seth Hoenig authored 2 years ago

* client: avoid unconsumed channel in timer construction

This PR fixes a bug introduced in #11983 where a Timer initialized with 0
duration causes an immediate tick, even if Reset is called before reading the
channel. The fix is to avoid doing that, instead creating a Timer with a non-zero
initial wait time, and then immediately calling Stop.

* pr: remove redundant stop

5f3f5215

exec: allow running commands from host volume (#14851) · 11a5f790

Tim Gross authored 2 years ago

The exec driver and other drivers derived from the shared executor check the
path of the command before handing off to libcontainer to ensure that the
command doesn't escape the sandbox. But we don't check any host volume mounts,
which should be safe to use as a source for executables if we're letting the
user mount them to the container in the first place.

Check the mount config to verify the executable lives in the mount's host path,
but then return an absolute path within the mount's task path so that we can hand
that off to libcontainer to run.

Includes a good bit of refactoring here because the anchoring of the final task
path has different code paths for inside the task dir vs inside a mount. But
I've fleshed out the test coverage of this a good bit to ensure we haven't
created any regressions in the process.

11a5f790
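
The check described above (confirm the executable lives under the mount's host path, then return the matching absolute path under the mount's task path) might look roughly like the sketch below. The `mount` type and `taskPathFor` function are hypothetical, and the sketch is purely lexical: it does not resolve symlinks or reproduce the executor's actual lookup logic.

```go
package main

import (
	"errors"
	"fmt"
	"path/filepath"
	"strings"
)

// mount is a hypothetical pairing of a host directory and the path where it
// appears inside the task.
type mount struct {
	HostPath string
	TaskPath string
}

var errOutsideMount = errors.New("executable is outside the mount")

// taskPathFor verifies that hostExe is inside m.HostPath and, if so, returns
// the equivalent absolute path under m.TaskPath.
func taskPathFor(m mount, hostExe string) (string, error) {
	rel, err := filepath.Rel(m.HostPath, hostExe)
	if err != nil {
		return "", err
	}
	// A relative path that climbs upward means hostExe escaped the mount.
	if rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return "", errOutsideMount
	}
	return filepath.Join(m.TaskPath, rel), nil
}

func main() {
	m := mount{HostPath: "/srv/tools", TaskPath: "/opt/tools"}

	p, err := taskPathFor(m, "/srv/tools/bin/run")
	fmt.Println(p, err)

	_, err = taskPathFor(m, "/srv/tools/../secret/run")
	fmt.Println(err)
}
```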

10 Nov, 2022 5 commits

docs: clarify how to access task meta values in templates (#15212) · 106dce9c

Seth Hoenig authored 2 years ago

This PR updates template and meta docs pages to give examples of accessing
meta values in templates. To do so one must use the environment variable form
of the meta key name, which isn't obvious and wasn't yet documented.

106dce9c

ci: notify on backport-assistant errors (#15203) · a2fed26f
Luiz Aoqui authored 2 years ago

a2fed26f

ci: re-enable tests on main (#15204) · e20af3cf

Luiz Aoqui authored 2 years ago

Now that the tests are grouped more tightly we don't use as many runners
as before, so we can re-enable these without clogging the queue.

e20af3cf

acl: sso auth method schema and store functions (#15191) · 02253e6f

Piotr Kazmierczak authored 2 years ago

This PR implements ACLAuthMethod type, acl_auth_methods table schema and crud state store methods. It also updates nomadSnapshot.Persist and nomadSnapshot.Restore methods in order for them to work with the new table, and adds two new Raft messages: ACLAuthMethodsUpsertRequestType and ACLAuthMethodsDeleteRequestType

This PR is part of the SSO work captured under ☂️ ticket #13120.

02253e6f

template: protect use of template manager with a lock (#15192) · 00c8cd37

Seth Hoenig authored 2 years ago

This PR protects access to `templateHook.templateManager` with its lock. So
far we have not been able to reproduce the panic - but it seems either Poststart
is running without a Prestart being run first (should be impossible), or the
Update hook is running concurrently with Poststart, nil-ing out the templateManager
in a race with Poststart.

Fixes #15189

00c8cd37
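
The race described in this commit message, and the lock-based guard, can be illustrated with a small sketch. The `hook` and `manager` types below are hypothetical stand-ins for `templateHook` and its template manager, and what each method does under the lock is an assumption made for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// manager is a hypothetical stand-in for the template manager.
type manager struct{ stopped bool }

func (m *manager) Stop() { m.stopped = true }

// hook is a hypothetical stand-in for the template hook. Every access to the
// manager field goes through mu.
type hook struct {
	mu      sync.Mutex
	manager *manager
}

// Prestart installs a fresh manager.
func (h *hook) Prestart() {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.manager = &manager{}
}

// Update stops and drops the current manager, if any.
func (h *hook) Update() {
	h.mu.Lock()
	defer h.mu.Unlock()
	if h.manager != nil {
		h.manager.Stop()
	}
	h.manager = nil
}

// Poststart reports whether a manager was available. The nil check under the
// lock avoids dereferencing a manager that Update cleared concurrently.
func (h *hook) Poststart() bool {
	h.mu.Lock()
	defer h.mu.Unlock()
	return h.manager != nil
}

func main() {
	h := &hook{}
	h.Prestart()

	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); h.Update() }()
	go func() { defer wg.Done(); h.Poststart() }()
	wg.Wait()

	fmt.Println("manager present after Update:", h.Poststart())
}
```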