From: avie@next.com (Avadis Tevanian)
Subject: Re: Why does NS require so much Memory?
Date: 6 Jun 1994 04:38:53 GMT
Organization: NeXT, Inc.
Message-ID: <2su98t$4ip@rosie.next.com>
References: <1994Jun5.221433.24748@sifon.cc.mcgill.ca>

In article <1994Jun5.221433.24748@sifon.cc.mcgill.ca> samurai@cs.mcgill.ca (Darcy BROCKBANK) writes:
> Oh well... can someone more informed than me *please* take up
> this discussion, because I don't have enough knowledge on this
> to come to the correct conclusion.

Here are the facts on how swapfiles work.

For every page in the swapfile, the kernel maintains status telling whether that page is in use or not. When a swapfile is enabled (mach_swapon), it is truncated to lowat and each page is flagged as free. When the pageout daemon requests a page to be swapped out, the pager locates the first free page in the swapfile (actually, there is an algorithm to determine which swapfile is used if more than one is enabled, but I will omit this from the discussion). The first free page is defined as the lowest numbered page. As more and more memory is consumed by processes, higher and higher numbered pages are used. When all pages in the swapfile are in use, an additional page out causes the swapfile to be extended in size. This continues until hiwat is reached. If hiwat is reached, or if the file system is out of space, the page will be left in memory (unless there is another swapfile enabled that can be used). If the system stays in this state, it will eventually be full of dirty pages which cannot be paged out. When this happens, the system comes to a grinding halt as it is forced to work with fewer and fewer pages of memory (memory is filled with dirty pages that cannot be paged out).

Now, it gets interesting when we consider what happens when memory is freed. In particular, when a process exits or calls vm_deallocate, the VM system attempts to free any memory that was associated with the appropriate regions of virtual memory.
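The allocation policy described above (lowest free page first, grow toward hiwat, give up when full) can be sketched in a few lines of C. This is purely illustrative -- the names (swapfile_t, swap_alloc_page) and the flat bitmap are my simplification here, not the actual kernel code:

```c
#include <stddef.h>
#include <stdbool.h>

#define SWAP_FULL ((size_t)-1)   /* no page available anywhere */

typedef struct {
    bool   *in_use;   /* one flag per swapfile page            */
    size_t  npages;   /* current size of the file, in pages    */
    size_t  lowat;    /* size the file is truncated to at swapon */
    size_t  hiwat;    /* hard ceiling on growth                */
} swapfile_t;

/* Return the lowest-numbered free page, extending the file toward
 * hiwat when every existing page is in use.  Returns SWAP_FULL when
 * the file is at hiwat and nothing is free -- the caller must then
 * leave the dirty page in memory (or try another swapfile). */
size_t swap_alloc_page(swapfile_t *sf)
{
    for (size_t i = 0; i < sf->npages; i++) {
        if (!sf->in_use[i]) {          /* first (lowest) free page */
            sf->in_use[i] = true;
            return i;
        }
    }
    if (sf->npages < sf->hiwat) {      /* extend the file by one page */
        size_t page = sf->npages++;
        sf->in_use[page] = true;
        return page;
    }
    return SWAP_FULL;                  /* at hiwat: page stays in RAM */
}
```

Note how nothing in the loop ever skips a freed low-numbered page: any hole left behind by a freed page is the very next thing handed out.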
When memory is shared, it simply makes a note that there is one fewer reference to the shared (or copy-on-written) memory, and no further action is taken. If this is the last reference to the memory, any corresponding physical pages are freed from main memory and any corresponding pages in the swapfile are tagged as free. A subsequent allocation of a page in the swapfile will most definitely reuse this page! When a page is freed, if it is the highest page in the swapfile, the swapfile is truncated all the way down to the highest page still in use (but never below lowat). In practice, this happens rarely. The basic problem is that if a long-running process uses a very high numbered page (e.g., if the Windowserver allocates a high numbered page), the swapfile will not get truncated until that process exits --- which could be a very long time. When this happens due to a core process (e.g., the nmserver), which cannot be restarted unless the system is rebooted, your swapfile will remain large. Still, there can be lots of free pages in the swapfile, and rest assured they will be reused!

So why don't we compact the swapfile to handle these pages that get allocated at high page numbers? Good question. We've considered doing it many times. However, it has always been considered a quite risky change (how many of YOU have debugged a virtual memory system before?) and would need to be done very carefully to ensure correctness and adequate performance. As an example, it would not be acceptable to just start a compaction and have the system lock up while the kernel does several megabytes of I/O for the compaction. The relative merits of making this improvement have never outweighed the costs in risk and the opportunity costs of not working on other parts of the system. I'm not saying we'll never do it, I'm just saying we haven't done it yet for some carefully considered reasons.

Having said all of this, why do so many people seem to have problems with their swapfiles?
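Before getting to that question, the freeing-and-truncation behavior described above can be sketched the same way. Again, the names and structure here are my invention for illustration, not the real kernel interfaces; the point is that truncation only ever happens from the top of the file down:

```c
#include <stddef.h>
#include <stdbool.h>

typedef struct {
    bool   *in_use;   /* one flag per swapfile page           */
    size_t  npages;   /* current size of the file, in pages   */
    size_t  lowat;    /* floor: the file never shrinks below this */
} swapfile_t;

/* Tag a swapfile page as free.  Only when the freed page leaves the
 * top of the file unused does the file shrink -- down to the highest
 * page still in use, but never below lowat.  Free pages sitting
 * below an in-use page stay in the file (and will be reused). */
void swap_free_page(swapfile_t *sf, size_t page)
{
    sf->in_use[page] = false;
    while (sf->npages > sf->lowat && !sf->in_use[sf->npages - 1])
        sf->npages--;   /* truncate from the top only */
}
```

One long-running process holding a single high-numbered page pins the loop's stopping point near the top, which is exactly the "swapfile stays large until that process exits" effect described above.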
Here are some possible explanations:

1) Not everyone realizes just how much memory their apps use. As has been mentioned before, the Windowserver keeps backing store for all the windows (on or off screen). On 16-bit color systems this can be quite large; on 24-bit systems it's downright huge! Simple images on the screen can translate into megabytes of storage. Mathematica sessions are notorious for consuming tens or even hundreds of megabytes of VM.

2) Programs occasionally have memory leaks. We work hard to be sure that the software we release does not have leaks. There's a reason we developed MallocDebug! I think we do pretty well, but I'm sure there are some bugs. For example, the Windowserver, with its printer heritage, has long had problems with correctly managing its memory. On the printers they just "reset" the memory heap for each new job --- we can't do that. If/when the Windowserver leaks we get a double whammy, since not only do we leak a small amount of memory, but the Windowserver is a long-running process and tends to hog those high numbered pages. I think NEXTSTEP ISVs generally do a good job too, but it only takes one or two apps leaking memory to cause problems.

3) As many of you know, Mach has a quite advanced virtual memory scheme, which NEXTSTEP makes excellent use of. Features like copy-on-write and pageable read/write sharing can create complex relationships between memory and how it is mapped into one or more processes. There is one known optimization the kernel does (specifically, the coalescing of adjacent memory regions when backing store has not yet been allocated --- for those of you who are Mach VM literate) which sometimes causes the freeing of some memory to be delayed until a process has exited. The situations where this happens are fairly rare, and worst case the memory is freed when the process exits, but it wouldn't surprise me if this is the cause of isolated problems.

I personally think the Mach swapfile solution is quite good.
I'm obviously biased, though. Sure, there are a few things I think could be improved, but that's true of any piece of software. Overall I think we've made some reasonable trade-offs. I also think swapfile management is fairly bug-free. We know we can improve the situation in (3) above (but it is difficult).

Certainly, if anyone has any other possible reasons for swapfile growth, especially with concrete examples of programs, let us know so we can investigate! I'd be more than happy to read suggestions others have on improving how swapfiles work. I can't guarantee we'll implement them, but you never know!

I hope this sheds a little light on the whole swapfile discussion. Somehow I think it will still continue on --- but hopefully it can be grounded with a few more facts now.

	Avie