    client: defensive against getting stale alloc updates · 2e1978eb
    Mahmood Ali authored
    When fetching node alloc assignments, be defensive against a stale read before
    killing the node's local allocs.
    
    The bug occurs when both the client and the servers are restarting: when the
    client requests the allocations for its node, it might get stale data because
    the server hasn't finished applying all of the restored raft transactions to
    its state store.
    
    Consequently, the client would kill and destroy the alloc locally, only to
    fetch it again moments later once the server's store is up to date.
    
    The bug can be reproduced quite reliably with a single-node setup (configured
    with persistence).  I suspect it's too much of an edge case to occur in a
    production cluster with multiple servers, but we may need to examine leader
    failover scenarios more closely.
    
    In this commit, we only remove and destroy allocs if the removal index is more
    recent than the alloc index. This seems like a cheap resiliency fix, as it is
    the same index comparison we already use for detecting alloc updates.
    
    A more proper fix would be to ensure that a Nomad server only serves RPC
    calls once its state store is fully restored, or is up to date in leadership
    transition cases.