    scheduler: recover from panic (#12009) · f8111692
    Tim Gross authored
    If processing a specific evaluation causes the scheduler (and
    therefore the entire server) to panic, that evaluation will never
    get a chance to be nack'd and cleared from the state store. It will
    get dequeued by another scheduler, causing that server to panic, and
    so forth until all servers are in a panic loop. This prevents the
    operator from intervening to remove the evaluation or update the
    state.
    
    Recover the goroutine from the top-level `Process` methods for each
    scheduler so that this condition can be detected without panicking the
    server process. The scheduler goroutine will still be recovered in a
    loop until the eval can be removed or nack'd, but that is much
    better than taking downtime.
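
    A minimal sketch of the recover pattern described above, not the
    actual change in this commit. The `Evaluation` and `Scheduler` types,
    the inner `process` method, and the "poison" eval ID are hypothetical
    stand-ins used only for illustration; only the idea of recovering in
    the top-level `Process` method comes from the message itself.

```go
package main

import "fmt"

// Evaluation is a hypothetical stand-in for the scheduler's evaluation type.
type Evaluation struct {
	ID string
}

// Scheduler is a hypothetical scheduler whose inner processing may panic.
type Scheduler struct{}

// process does the real work. For illustration it panics on one specific
// evaluation, simulating a bug triggered by that evaluation's contents.
func (s *Scheduler) process(eval *Evaluation) error {
	if eval.ID == "poison" {
		panic("unexpected state while processing evaluation")
	}
	return nil
}

// Process is the top-level entry point. The deferred function recovers any
// panic raised below it and converts it into an ordinary error, so the
// caller sees a failed evaluation instead of a crashed process. The named
// return value lets the deferred function set the error.
func (s *Scheduler) Process(eval *Evaluation) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("scheduler panicked processing eval %q: %v", eval.ID, r)
		}
	}()
	return s.process(eval)
}

func main() {
	s := &Scheduler{}
	for _, id := range []string{"ok", "poison", "ok-again"} {
		// The process keeps running after the "poison" evaluation;
		// the caller can log the error and nack the evaluation.
		err := s.Process(&Evaluation{ID: id})
		fmt.Printf("eval %s: err=%v\n", id, err)
	}
}
```

    In this sketch the caller receives an error for the bad evaluation
    and can keep serving other work, which is the property the commit
    message relies on to give the operator time to intervene.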