This project is mirrored from https://gitee.com/mirrors/nomad.git. Pull mirroring failed.
Repository mirroring has been paused due to too many failed attempts. It can be resumed by a project maintainer.
  1. 28 Jan, 2022 6 commits
    • Jai Bhagat's avatar
      feat: add meta evaluations · 79d1c11e
      Jai Bhagat authored
      To support pagination on evaluations queries.
      79d1c11e
    • Jai Bhagat's avatar
      3d848654
    • Jai Bhagat's avatar
      chore: run prettier on gutter-menu · 2ef93947
      Jai Bhagat authored
      2ef93947
    • Jai Bhagat's avatar
      feat: add evaluations view with table · 0b70c1a4
      Jai Bhagat authored
      0b70c1a4
    • Tim Gross's avatar
      CSI: node unmount from the client before unpublish RPC (#11892) · 8364eda1
      Tim Gross authored
      When an allocation stops, the `csi_hook` makes an unpublish RPC to the
      servers to unpublish via the CSI RPCs: first to the node plugins and
      then the controller plugins. The controller RPCs must happen after the
      node RPCs so that the node has had a chance to unmount the volume
      before the controller tries to detach the associated device.
      
      But the client has local access to the node plugins and can
      independently determine if it's safe to send unpublish RPCs to those
      plugins. This will allow the server to treat the node plugin as
      abandoned if a client is disconnected and `stop_on_client_disconnect`
      is set. This will let the server try to send unpublish RPCs to the
      controller plugins, under the assumption that the client will be
      trying to unmount the volume on its end first.
      
      Note that the CSI `NodeUnpublishVolume`/`NodeUnstageVolume` RPCs can 
      return ignorable errors in the case where the volume has already been
      unmounted from the node. Handle all other errors by retrying until we
      get success so as to give operators the opportunity to reschedule a
      failed node plugin (e.g., in the case where they accidentally drained a
      node without `-ignore-system`). Fan out the work for each volume into
      its own goroutine so that we can release a subset of volumes if only
      one is stuck.
      8364eda1
  2. 27 Jan, 2022 8 commits
    • Jai's avatar
      Merge pull request #11942 from hashicorp/f-ui/test-tooling · f2fef6ff
      Jai authored
      ui: test tooling
      f2fef6ff
    • Seth Hoenig's avatar
      Merge pull request #11951 from hashicorp/b-cgroups-broken-part1-oss · 2b93ae67
      Seth Hoenig authored
      client: change test to not poke cgroupv2 edge case
      2b93ae67
    • Tim Gross's avatar
      CSI: move terminal alloc handling into denormalization (#11931) · 2e357163
      Tim Gross authored
      * The volume claim GC method and volumewatcher both have logic
      collecting terminal allocations that duplicates most of the logic
      that's now in the state store's `CSIVolumeDenormalize` method. Copy
      this logic into the state store so that all code paths have the same
      view of the past claims.
      * Remove logic in the volume claim GC that now lives in the state
      store's `CSIVolumeDenormalize` method.
      * Remove logic in the volumewatcher that now lives in the state
      store's `CSIVolumeDenormalize` method.
      * Remove logic in the node unpublish RPC that now lives in the state
      store's `CSIVolumeDenormalize` method.
      2e357163
    • Tim Gross's avatar
      csi: ensure that PastClaims are populated with correct mode (#11932) · b588a7bd
      Tim Gross authored
      In the client's `(*csiHook) Postrun()` method, we make an unpublish
      RPC that includes a claim in the `CSIVolumeClaimStateUnpublishing`
      state and using the mode from the client. But then in the
      `(*CSIVolume) Unpublish` RPC handler, we query the volume from the
      state store (because we only get an ID from the client). And when we
      make the client RPC for the node unpublish step, we use the _current
      volume's_ view of the mode. If the volume's mode has been changed
      before the old allocations can have their claims released, then we end
      up making a CSI RPC that will never succeed.
      
      Why does this code path get the mode from the volume and not the
      claim? Because the claim written by the GC job in `(*CoreScheduler)
      csiVolumeClaimGC` doesn't have a mode. Instead it just writes a claim
      in the unpublishing state to ensure the volumewatcher detects a "past
      claim" change and reaps all the claims on the volumes.
      
      Fix this by ensuring that `CSIVolumeDenormalize` creates past
      claims for all nil allocations with a correct access mode set.
      b588a7bd
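The idea of registering missing allocations as past claims that carry a mode can be illustrated with a small sketch. The types below are simplified, hypothetical stand-ins (not Nomad's actual structs), and reducing the access mode to a read/write flag is an assumption made to keep the example short.

```go
package main

// claimMode is a hypothetical, simplified stand-in for a claim's access mode.
type claimMode string

const (
	modeRead  claimMode = "read"
	modeWrite claimMode = "write"
)

// claim is a hypothetical, simplified volume claim.
type claim struct {
	AllocID string
	Mode    claimMode
}

// volume is a hypothetical, simplified volume record. The alloc maps go from
// allocation ID to whether that allocation still exists in state.
type volume struct {
	ReadAllocs  map[string]bool
	WriteAllocs map[string]bool
	PastClaims  map[string]*claim
}

// denormalizePastClaims records a past claim for every allocation that no
// longer exists, keeping the mode of the claim set it was found in.
func denormalizePastClaims(v *volume) {
	if v.PastClaims == nil {
		v.PastClaims = make(map[string]*claim)
	}
	record := func(allocs map[string]bool, mode claimMode) {
		for id, exists := range allocs {
			if exists {
				continue // live allocation: its claim is still valid
			}
			if _, ok := v.PastClaims[id]; ok {
				continue // already tracked as a past claim
			}
			v.PastClaims[id] = &claim{AllocID: id, Mode: mode}
		}
	}
	record(v.ReadAllocs, modeRead)
	record(v.WriteAllocs, modeWrite)
}
```

Because the mode is stored on the past claim itself in this sketch, a later unpublish step would not need to consult the volume's current mode.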
    • Tim Gross's avatar
      CSI: resolve invalid claim states (#11890) · d0624fc0
      Tim Gross authored
      * csi: resolve invalid claim states on read
      
      It's currently possible for CSI volumes to be claimed by allocations
      that no longer exist. This changeset asserts a reasonable state at
      the state store level by registering these nil allocations as "past
      claims" on any read. This will cause any pass through the periodic GC
      or volumewatcher to trigger the unpublishing workflow for those claims.
      
      * csi: make feasibility check errors more understandable
      
      When the feasibility checker finds we have no free write claims, it
      checks to see if any of those claims are for the job we're currently
      scheduling (so that earlier versions of a job can't block claims for
      new versions) and reports a conflict if the volume can't be scheduled
      so that the user can fix their claims. But when the checker hits a
      claim that has a GCd allocation, the state is recoverable by the
      server once claim reaping completes and no user intervention is
      required; the blocked eval should complete. Differentiate the
      scheduler error produced by these two conditions.
      d0624fc0
    • Seth Hoenig's avatar
      client: change test to not poke cgroupv2 edge case · 87d54b8c
      Seth Hoenig authored
      This PR tweaks the TestCpusetManager_AddAlloc unit test to not break
      when being run on a machine using cgroupsv2. The behavior of writing
      an empty cpuset.cpu changes in cgroupv2, where such a group now inherits
      the value of its parent group, rather than remaining empty.
      
      The test in question was written such that a task would consume all available
      cores shared on an alloc, causing the empty set to be written to the shared
      group, which works fine on cgroupsv1 but breaks on cgroupsv2. By adjusting
      the test to consume only 1 core instead of all cores, it no longer triggers
      that edge case.
      
      The actual fix for the new cgroupsv2 behavior will be in #11933.
      87d54b8c
    • Jai Bhagat's avatar
      7f5e0b82
    • James Rasell's avatar
      Merge pull request #11940 from hashicorp/b-docs-add-client-reserved-cores · 402e36bb
      James Rasell authored
      docs: add `cores` to client reserved config block.
      402e36bb
  3. 26 Jan, 2022 15 commits
  4. 25 Jan, 2022 4 commits
    • Seth Hoenig's avatar
      Merge pull request #11920 from hashicorp/dependabot/go_modules/github.com/rs/cors-1.8.2 · 94b744c5
      Seth Hoenig authored
      build(deps): bump github.com/rs/cors from 1.8.0 to 1.8.2
      94b744c5
    • Seth Hoenig's avatar
      connect: fix bug where sidecar_task.resources was ignored with hcl1 · 15442b35
      Seth Hoenig authored
      The HCL1 parser did not respect connect.sidecar_task.resources if the
      connect.sidecar_service block was not set (an optimization that no longer
      makes sense with connect gateways).
      
      Fixes #10899
      15442b35
    • Tim Gross's avatar
      fix integer bounds checks (#11815) · 358a4681
      Tim Gross authored
      * driver: fix integer conversion error
      
      The shared executor incorrectly parsed the user's group into int32 and
      then cast to uint32 without bounds checking. This is harmless because
      an out-of-bounds gid will throw an error later, but it triggers
      security and code quality scans. Parse directly to uint32 so that we
      get correct error handling.
      
      * helper: fix integer conversion error
      
      The autopilot flags helper incorrectly parses a uint64 to a uint, which
      has a machine-specific size. Although we don't have 32-bit builds, this
      sets off security and code quality scans. Parse to the machine-sized
      uint.
      
      * driver: restrict bounds of port map
      
      The plugin server doesn't constrain the maximum integer for port
      maps. This could result in a user-visible misconfiguration, but it
      also triggers security and code quality scans. Restrict the bounds
      before casting to int32 and return an error.
      
      * cpuset: restrict upper bounds of cpuset values
      
      Our cpuset configuration expects values in the range of uint16 to
      match the expectations set by the kernel, but we don't constrain the
      values before downcasting. An underflow could lead to allocations
      failing on the client rather than being caught earlier. This also makes
      security and code quality scanners happy.
      
      * http: fix integer downcast for per_page parameter
      
      The parser for the `per_page` query parameter downcasts to int32
      without bounds checking. This could result in underflow and
      nonsensical paging, but there are no server-side consequences for
      this. Fixing this will silence some security and code quality scanners
      though.
      358a4681
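The general pattern behind these fixes, parsing directly into the target width or checking bounds before a downcast, might look like the sketch below. The function names `parseGID`, `parsePerPage`, and `cpuToUint16` are hypothetical and are not taken from the commit; only the standard library `strconv` and `math` calls are real.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
)

// parseGID parses straight into the 32-bit unsigned range, so an
// out-of-range or negative value is reported by strconv instead of being
// silently wrapped by a later cast.
func parseGID(s string) (uint32, error) {
	n, err := strconv.ParseUint(s, 10, 32)
	if err != nil {
		return 0, fmt.Errorf("invalid gid %q: %w", s, err)
	}
	return uint32(n), nil
}

// parsePerPage parses with a 32-bit size limit and rejects negative values
// before converting to int32.
func parsePerPage(s string) (int32, error) {
	n, err := strconv.ParseInt(s, 10, 32)
	if err != nil {
		return 0, fmt.Errorf("invalid per_page %q: %w", s, err)
	}
	if n < 0 {
		return 0, fmt.Errorf("per_page must not be negative, got %d", n)
	}
	return int32(n), nil
}

// cpuToUint16 checks the bounds explicitly before downcasting to uint16.
func cpuToUint16(cpu int) (uint16, error) {
	if cpu < 0 || cpu > math.MaxUint16 {
		return 0, fmt.Errorf("cpu id %d out of range [0, %d]", cpu, math.MaxUint16)
	}
	return uint16(cpu), nil
}
```

Passing the bit size to `strconv.ParseUint` or `strconv.ParseInt` makes the standard library return a range error, which keeps the later conversion safe without a separate check.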
    • James Rasell's avatar
      Merge pull request #11907 from hashicorp/f-state-store-nomad-file · 34231188
      James Rasell authored
      state: move restore functionality into its own file.
      34231188
  5. 24 Jan, 2022 7 commits