• Tim Gross's avatar
    scheduler: fix quadratic performance with spread blocks (#11712) · 2d4e5b8f
    Tim Gross authored
    When the scheduler picks a node for each evaluation, the
    `LimitIterator` provides at most 2 eligible nodes for the
    `MaxScoreIterator` to choose from. This keeps scheduling fast while
    producing acceptable results because the results are binpacked.
    
    Jobs with a `spread` block (or node affinity) remove this limit in
    order to produce correct spread scoring. This means that every
    allocation within a job with a `spread` block is evaluated against
    _all_ eligible nodes. Operators of large clusters have reported that
    jobs with `spread` blocks that are eligible on a large number of nodes
    can take longer than the nack timeout to evaluate (60s). Typical
    evaluations are processed in milliseconds.
    
    In practice, it's not necessary to evaluate every eligible node for
    every allocation on large clusters, because the `RandomIterator` at
    the base of the scheduler stack produces enough variation in each pass
    that the likelihood of an uneven spread is negligible. Note that
    feasibility is checked before the limit, so this only impacts the
    number of _eligible_ nodes available for scoring, not the total number
    of nodes.
    
    This changeset sets the iterator limit for "large" `spread` block and
    node affinity jobs to be equal to the number of desired
    allocations. This brings an example problematic job evaluation down
    from ~3min to ~10s. The included tests ensure that we have acceptable
    spread results across a variety of large cluster topologies.
    2d4e5b8f