    oom, PM: make OOM detection in the freezer path raceless · c32b3cbe
    Michal Hocko authored
    Commit 5695be14 ("OOM, PM: OOM killed task shouldn't escape PM
    suspend") has left a race window when OOM killer manages to
    note_oom_kill after freeze_processes checks the counter.  The race
    window is quite small and really unlikely, and a partial solution was
    deemed sufficient at the time of submission.
    
    Tejun wasn't happy about this partial solution though and insisted on a
    full solution.  That requires the full OOM and freezer's task freezing
    exclusion, though.  This is done by this patch which introduces oom_sem
    RW lock and turns oom_killer_disable() into a full OOM barrier.
    
    oom_killer_disabled check is moved from the allocation path to the OOM
    level and we take oom_sem for reading for both the check and the whole
    OOM invocation.
    
    oom_killer_disable() takes oom_sem for writing so it waits for all
    currently running OOM killer invocations.  Then it disables all further
    OOMs by setting oom_killer_disabled and checks for any OOM victims.
    Victims are counted via mark_tsk_oom_victim and unmark_oom_victim.  The
    last victim wakes up all waiters enqueued by oom_killer_disable().
    Therefore this function acts as the full OOM barrier.
    
    The page fault path is covered now as well although it was assumed to be
    safe before.  As per Tejun, "We used to have freezing points deep in file
    system code which may be reacheable from page fault." so it would be
    better and more robust to not rely on freezing points here.  Same applies
    to the memcg OOM killer.
    
    out_of_memory tells the caller whether the OOM was allowed to trigger and
    the callers are supposed to handle the situation.  The page allocation
    path simply fails the allocation, the same as before.  The page fault path will
    retry the fault (more on that later) and Sysrq OOM trigger will simply
    complain to the log.
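
    The caller-side handling described above could look like the following
    userspace sketch.  It assumes a boolean return value from a simplified
    out_of_memory_model(); the enum, the *_model functions, callers_demo()
    and the log text are illustrative only and do not reproduce the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* What each modelled caller ends up doing. */
enum oom_outcome { OOM_RAN, ALLOC_FAILED, FAULT_RETRIED, LOGGED_ONLY };

static bool oom_killer_disabled;

/* Tells the caller whether the OOM killer was allowed to trigger. */
static bool out_of_memory_model(void)
{
	if (oom_killer_disabled)
		return false;
	/* Victim selection and the actual kill would go here. */
	return true;
}

/* Page allocation path: the allocation simply fails. */
static enum oom_outcome page_alloc_path_model(void)
{
	return out_of_memory_model() ? OOM_RAN : ALLOC_FAILED;
}

/* Page fault path: the fault is retried. */
static enum oom_outcome page_fault_path_model(void)
{
	return out_of_memory_model() ? OOM_RAN : FAULT_RETRIED;
}

/* SysRq trigger: only complains to the log (stderr in this model). */
static enum oom_outcome sysrq_path_model(void)
{
	if (out_of_memory_model())
		return OOM_RAN;
	fprintf(stderr, "OOM request ignored: killer is disabled\n");
	return LOGGED_ONLY;
}

/* Exercises all three callers with the killer enabled, then disabled. */
static int callers_demo(void)
{
	oom_killer_disabled = false;
	assert(page_alloc_path_model() == OOM_RAN);
	assert(page_fault_path_model() == OOM_RAN);
	assert(sysrq_path_model() == OOM_RAN);

	oom_killer_disabled = true;
	assert(page_alloc_path_model() == ALLOC_FAILED);
	assert(page_fault_path_model() == FAULT_RETRIED);
	assert(sysrq_path_model() == LOGGED_ONLY);
	return 0;
}
```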
    
    Normally there wouldn't be any unfrozen user tasks after
    try_to_freeze_tasks, so oom_killer_disable() will not block.  But if
    there was an OOM killer racing with try_to_freeze_tasks and the OOM
    victim hasn't finished yet, then we have to wait for it.  This should
    complete in a finite time, though, because
    
    	- the victim cannot loop in the page fault handler (it would die
    	  on the way out from the exception)
    	- it cannot loop in the page allocator because all further
    	  allocations would fail and __GFP_NOFAIL allocations are not
    	  acceptable at this stage
    	- it shouldn't be blocked on any locks held by frozen tasks
    	  (try_to_freeze expects lockless context) and kernel threads and
    	  work queues are not frozen yet

    Signed-off-by: Michal Hocko <mhocko@suse.cz>
    Suggested-by: Tejun Heo <tj@kernel.org>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Oleg Nesterov <oleg@redhat.com>
    Cc: Cong Wang <xiyou.wangcong@gmail.com>
    Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>