TALK BY RIK VAN RIEL

Talks of 27/11/2000

Log of the talk. Lines recording people joining and leaving the channel during the talk have been removed.

TALKING IN #LINUX TALKING IN #QC

[19:06] * riel changes server
[19:06] (Fernand0) The talk will be here, in #linux channel, Mr Riel suggested us to make
[19:06] (Fernand0) another channel (#qc -> questions channel) to write questions during the talk.
[19:06] (Fernand0) Should you have any questions, comments, etc, just write them in #qc
[19:06] (Fernand0) and Mr. Riel will reply.
[19:13] (Fernand0) The talk will be here, in #linux channel, Mr Riel suggested us to make
[19:13] (Fernand0) another channel (#qc -> questions channel) to write questions during the talk.
[19:13] (Fernand0) Should you have any questions, comments, etc, just write them in #qc
[19:13] (Fernand0) and Mr. Riel will reply.
[19:14] (Fernand0) Hi,
[19:14] (Fernand0) we are very pleased to present you today Rik van Riel.
[19:14] (Fernand0) He is a kernel hacker working on memory management.
[19:14] (riel) before we begin, I think everybody should start their web browser and load http://www.surriel.com/lectures/mmtour.html
[19:14] (Fernand0) Currently he is working at conectiva S.A. in Brazil. As all of you know,
[19:14] (Fernand0) it is a big Linux company from South America.
[19:14] (Fernand0) Apart from kernel hacking, he also runs the Linux-MM website and the
[19:14] (Fernand0) #kernelnewbies IRC channel on openprojects.net
[19:14] (Fernand0) You can find more about him at www.surriel.com (there you can find, among
[19:14] (Fernand0) other things, the slides of this talk at
[19:14] (Fernand0) http://www.surriel.com/lectures/mmtour.html)
[19:15] (Fernand0) He will talk here about memory management, but other interests of his
[19:15] (Fernand0) are: high availability, filesystems and various other things ...
[19:15] (Fernand0) The talk will be here, in #linux channel, Mr Riel suggested us to make
[19:15] (Fernand0) another channel (#qc -> questions channel) to write questions during the talk.
[19:15] (Fernand0) Should you have any questions, comments, etc, just write them in #qc
[19:15] (Fernand0) and Mr. Riel will reply.
[19:15] (Fernand0) Thank you to Mr. Riel for coming here and also to all of you
[19:15] (Fernand0) The title of his talk is:
[19:15] (Fernand0) Too little, too slow; memory management
[19:15] (Fernand0) Mr. Riel ...
[19:16] (riel) I guess it's time to begin .............
[19:16] (riel) ok, welcome everybody
[19:16] (riel) today I will be giving a talk about Linux memory management
[19:16] (riel) the slides are at http://www.surriel.com/lectures/mmtour.html
[19:17] (riel) we will begin with some of the slides introducing memory management and explaining why we need memory management
[19:19] (riel) if you have any questions about my talk, you can ask them in #qc
[19:19] (riel) in #qc, you can also discuss with each other the things I talk about
[19:19] (riel) this channel (#linux) is meant to be completely silent ... except for me of course ;)
[19:19] (riel) ...... (page 1) .....
[19:19] (riel) let me begin by telling a little bit about what I am doing at the moment
[19:19] (riel) Conectiva is paying me to work on improving the Linux kernel full-time
[19:19] (riel) this means that I am working for Linus Torvalds and Alan Cox, but Conectiva is paying me ;) [thanks Conectiva :)]
[19:19] (riel) now I'll move on to the real talk ... (page 2)
[19:19] (riel) [for the new people ... http://www.surriel.com/lectures/mmtour.html for the slides]
[19:20] (riel) ok, I will begin by explaining about memory management
[19:20] (riel) most of the introduction I will skip
[19:21] (riel) but I will tell a few things about the memory hierarchy and about page faults and page replacement
[19:21] (riel) lets start with the picture on (page 3)
[19:21] (riel) this picture represents the "memory hierarchy"
[19:22] (riel) every computer has more kinds of memory
[19:22] (riel) very fast and very small memory
[19:22] (riel) and very big but very slow memory
[19:22] (riel) fast memory can be the registers or L1 cpu cache
[19:22] (riel) slow memory can be L2 cache or RAM
[19:23] (riel) and then you have hard disk, which is REALLY REALLY extremely slow ;)
[19:25] (riel) when you see this picture, some people will ask themselves the question "but why doesn't my machine have only fast memory?"
[19:25] (riel) or "why don't we just run everything from fast memory?"
[19:25] (riel) the reason for this is that it is impossible to make very big fast memory
[19:25] (riel) and even if it was possible, it would simply be too expensive
[19:25] (riel) and you cannot run everything from fast memory because programs are simply too big
[19:25] (riel) now we've talked about "fast" and "slow" memory ... (page 4) tells us about different kinds of speeds
[19:26] (riel) you have "latency" and "throughput"

A netsplit interrupted the channel here...

[19:30] (riel) ok, good to have everybody back
[19:30] (Fugas) How do you administrate the cache memory?
[19:30] (riel) if you look at (page 6) you can see how ridiculously slow some memory things are
[19:31] (riel) Fugas: questions in #qc please
[19:31] (riel) Fugas: the cache memory is managed in hardware, it is invisible for the Operating System
[19:32] (riel) Fugas: only RAM and disk management is done in software
[19:32] (riel) ok, lets go to (page 7)
[19:36] (riel) "latency" == "if I ask for something, how long do I have to wait until I get the answer"

[19:37] (riel) "throughput" == "how much data can I get per minute"
[19:44] (riel) I think we do not have time to look at the L1 and L2 cache things
[19:44] (riel) so lets move on to the RAM management
[19:44] (riel) on (page 14)
[19:44] (riel) RAM is the slowest electronic memory in a computer
[19:44] (riel) it is often 100 times slower than the CPU core (in latency)
[19:44] (riel) this is very very slow
[19:44] (riel) but when you see that the disk is 100000 times slower than RAM (in latency), suddenly memory looks fast again ... ;)
[19:44] (riel) this enormous difference in speed makes it very important that you have the data in memory that you need
[19:45] (riel) if you do not have the data you need in RAM, you need to wait VERY LONG (often more than 5 million CPU cycles) before your data is there and your program can continue
[19:45] (riel) on the other hand, everybody knows that you NEVER have enough memory ;)
[19:45] (riel) so the system has to choose which pages to keep in memory (or which pages to read from disk) and which pages to throw away (swap out)
[19:45] (riel) ok, lets try this again ;)
[19:46] (riel) the ping timeout probably lost my last 3 minutes of the talk
[19:46] (riel) but as we all know, no computer ever has enough memory ... ;)
[19:37] (Alma) riel, can I manipulate the memory?
[19:38] (MIKE_ITL) What are the comparisons between memory costs?
[19:39] (riel) Alma: in what way? ;)
[19:40] (debUgo-) SRAM (the kind of memory used in L1/L2 cache) would cost +10x normal SDRAM
[19:40] (Arthur) for the best management
[19:40] (Alma) for the interaction in the internet

[19:47] (riel) and the speed difference is REALLY big ... this means that the system has to choose very carefully what data it keeps in RAM and what data it throws away (swaps out)
[19:47] (riel) lets move on to page 18
[19:48] (riel) ok, if a page of a process is NOT in memory (but the process wants it) then the CPU will give an error and abort the program
[19:49] (riel) then the Operating System (OS) gets the job of fixing this error and letting the program continue
[19:49] (riel) this trap is called a "PAGE FAULT"
[19:50] (riel) the OS fixes the job by getting a free page, putting the right data in that page and giving the page to that program
[19:50] (riel) after that the process continues just like nothing happened
[19:50] (riel) the ONLY big problem is that such a page fault easily takes 5 _million_ CPU cycles
[19:51] (riel) so you want to make sure you have as few page faults as possible
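
The cost of page faults can be put into rough numbers. The following is a minimal Python sketch, not something shown in the talk: the 5 million cycle fault cost is the figure quoted above, the 100 cycle RAM access is an assumption derived from "100 times slower than the CPU core", and the function name is made up for illustration.

```python
# Only the 5 million cycle figure is quoted in the talk; the RAM cost is an
# assumption based on "RAM is often 100 times slower than the CPU core".
PAGE_FAULT_CYCLES = 5_000_000
RAM_ACCESS_CYCLES = 100


def average_access_cycles(fault_rate: float) -> float:
    """Average cost of one memory access when a fraction of accesses page-fault."""
    return (1.0 - fault_rate) * RAM_ACCESS_CYCLES + fault_rate * PAGE_FAULT_CYCLES


if __name__ == "__main__":
    for rate in (0.0, 1e-6, 1e-4, 1e-2):
        cycles = average_access_cycles(rate)
        print(f"fault rate {rate:g}: {cycles:,.1f} cycles per access")
```

Under these assumptions, a single fault per 10,000 accesses already makes the average access about six times as expensive as a plain RAM access.
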
[19:51] (riel) the other problem is that you only have a "little bit" of memory in your machine
[19:51] (riel) and you run out of free memory very fast
[19:51] (riel) at that point, the OS needs to choose which data it keeps in memory and which data it swaps out
[19:52] (riel) ..... lets move to (page 19) of http://www.surriel.com/lectures/mmtour.html ....
[19:52] (riel) the "perfect" thing to do is to throw away (swap out) that data which will not be needed again for the longest time
[19:52] (riel) that way you have the longest time between page faults and the minimum number of page faults per minute ... so the best system performance
[19:53] (riel) the only problem with this method is that you need to look into the future to do this
[19:53] (riel) and that isn't really possible ... ;)))
[19:53] (riel) so we have to come up with other ideas that approximate this idea
[19:53] (riel) ......
[19:53] (riel) one idea is LRU ... we swap out the page which has not been used for the longest time
[19:47] (Alma) how can I manipulate the memory to work better on the internet?
[19:47] (riel) Alma: you cannot make the internet faster with more memory ;)
[19:48] (Martha) which OS, in your opinion, manages memory best, and why??
[19:48] (riel) Alma: maybe somebody else can explain that to you
[19:49] (riel) Martha: I will answer this question after the talk, ok?
[19:49] (movement) page 18 is referring specifically to "major" page faults right
[19:50] (riel) movement: minor page faults are handled the same way, only the disk read is missing
[19:52] (erikm) Alma: adding more memory to a computer won't speed up the internet. It will speed up your local computer because you will get fewer page faults (as riel just explained).
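
The "perfect" policy and LRU described in the talk can be compared on a small reference string. This is a sketch under stated assumptions: the reference string, the frame count and the function names are invented for illustration, and the "optimal" function cheats by reading the future of the list, which a real system cannot do.

```python
from collections import OrderedDict


def lru_faults(refs, frames):
    """Count page faults when the least recently used page is evicted."""
    memory = OrderedDict()  # keys ordered from oldest use to newest use
    faults = 0
    for page in refs:
        if page in memory:
            memory.move_to_end(page)  # just used, so it becomes the newest
        else:
            faults += 1
            if len(memory) >= frames:
                memory.popitem(last=False)  # evict the oldest page
            memory[page] = True
    return faults


def optimal_faults(refs, frames):
    """Count page faults when we look into the future of the reference list."""
    memory = set()
    faults = 0
    for i, page in enumerate(refs):
        if page in memory:
            continue
        faults += 1
        if len(memory) >= frames:
            future = refs[i + 1:]

            def next_use(p):
                # Pages never used again sort after every page that is.
                return future.index(p) if p in future else len(future)

            memory.remove(max(memory, key=next_use))
        memory.add(page)
    return faults


if __name__ == "__main__":
    refs = [1, 2, 3, 1, 2, 4, 1, 2, 3, 4]
    print("LRU faults:    ", lru_faults(refs, 3))
    print("optimal faults:", optimal_faults(refs, 3))
```

On this particular string LRU comes close to the future-reading policy, which is the sense in which it "approximates" the perfect choice.
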

[19:54] (riel) the idea is: "if a page has not been used for 30 minutes, I can be pretty sure I will not use it again in the next 5 seconds"
[19:54] (riel) which really makes a lot of sense in most situations
[19:54] (riel) unfortunately, there are a few (very common) cases where LRU does the exact wrong thing
[19:55] (riel) take for example a system where somebody is burning a CD
[19:55] (riel) to burn a CD at 8-speed, you will be "streaming" your data at 1.2MB per second
[19:56] (riel) at that speed, it will take just 30 seconds on your 64MB workstation before your mail reader is "older" than the old data from the CD write program
[19:56] (riel) and your system will swap out the mail reader
[19:56] (riel) which is the exact WRONG thing to do
[19:57] (riel) because most likely you will use your mail reader before you will burn the same CD image again
[19:57] (riel) LFU would avoid this situation
[19:57] (riel) LFU swaps out the page which has been used least often
[19:58] (riel) so it would see that the mail reader is being used all the time (pages used 400 times in the last 2 minutes) while the CD image has only been used one time (read from disk, burn to CD and forget about it)
[19:58] (riel) and LFU would nicely throw out the CD image data that has been used
[19:58] (riel) and keep the mail reader in memory
[19:59] (riel) in this situation LFU is almost perfect
[19:59] (riel) ... now we take another example ;)
[19:59] (riel) if we look at GCC, you will see that it consists of 3 parts
[19:59] (riel) a preprocessor (cpp), a compiler (cc1) and an assembler (as)
[20:00] (riel) suppose you only have memory for one of these at a time
[20:00] (riel) cpp was running just fine and used its memory 400 times in the last minute
[20:00] (riel) now it is cc1's turn to do work
[20:01] (riel) but cc1 does not fit in memory at the same time as cpp
[20:01] (riel) and LFU will swap out parts of cc1 because cpp used its memory a lot ...
[20:01] (riel) and does the exact wrong thing ... cpp _stopped doing work_ a second ago
[20:01] (riel) and cc1 is now the important process to keep in memory
[19:54] (rcastro) riel: I am not an LRU expert, but I read that LRU is not the best option in all scenarios. Is that true?
[19:55] (riel) rcastro: one moment ;)
[19:55] (rcastro) riel: yeah, I saw it :-)
[19:55] (Rob) riel: How do you measure when a page has last been "used"?
[19:55] (EfrenCA) what is the best performance between Least Recently Used and Least Frequently Used??
[19:56] (riel) Rob: one moment ;)
[19:56] (riel) EfrenCA: one moment...
[19:57] (rcastro) Rob: there's an aging process
[19:57] (rcastro) Rob: in every page frame
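
The two examples from the talk (the CD burn that hurts LRU, and the cpp to cc1 hand-over that hurts LFU) can be replayed in a toy simulation. This is only a sketch: the page names, the number of frames and the number of repetitions are invented, and `count_faults` is a hypothetical helper, not kernel code.

```python
def count_faults(refs, frames, policy):
    """Replay refs and count page faults.

    policy "lru" evicts the page with the oldest last use;
    policy "lfu" evicts the page with the fewest uses (oldest use breaks ties).
    """
    last_use, use_count, memory = {}, {}, set()
    faults = 0
    for now, page in enumerate(refs):
        if page not in memory:
            faults += 1
            if len(memory) >= frames:
                if policy == "lru":
                    victim = min(memory, key=lambda p: last_use[p])
                else:
                    victim = min(memory, key=lambda p: (use_count[p], last_use[p]))
                memory.remove(victim)
            memory.add(page)
        last_use[page] = now
        use_count[page] = use_count.get(page, 0) + 1
    return faults


# Scenario 1: a busy mail reader, then a CD image streamed once, then mail again.
mail = ["mail0", "mail1"]
cd_burn = mail * 5 + [f"cd{i}" for i in range(10)] + mail

# Scenario 2: cpp works for a long time, then stops and cc1 takes over.
cpp = ["cpp0", "cpp1", "cpp2"]
cc1 = ["cc1_0", "cc1_1", "cc1_2"]
compile_run = cpp * 50 + cc1 * 5

if __name__ == "__main__":
    for name, refs, frames in (("CD burn", cd_burn, 4), ("compile", compile_run, 3)):
        for policy in ("lru", "lfu"):
            print(name, policy, count_faults(refs, frames, policy))
```

In this toy setup LFU wins the CD burn case because the mail reader pages are never evicted, while LRU wins the compile case because LFU keeps protecting the cpp pages that are no longer in use.
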

[20:03] (riel) ... this means that both LRU and LFU are good for some situations, but really bad for other situations
[20:05] (riel) I got a question if LRU or LFU is better ... the answer is none of them ;)
[20:05] (riel) what we really want is something that has the good parts of both LRU and LFU but not the bad parts
[20:05] (riel) luckily we have such a solution ... page aging (on page 20)
[20:05] (riel) page aging is really simple
[20:05] (riel) the system scans over all of memory, and each page has an "age" (just points)
[20:05] (riel) if the page has been used since we scanned the page last, we increase the page age
[20:05] (riel) (Rob) riel: How do you measure when a page has last been "used"?
[20:05] (riel) ... umm yes ... I almost forgot about that part ;))
[20:05] (riel) --- when a page is being used, the CPU sets a special bit, the "accessed bit" on the page (or the page table)
[20:05] (riel) --- and we only have to look at this bit to see if the page was used
[20:06] (riel) --- and after we look at the bit, we set it to 0 so it will change if it is being used again after we scan
[20:06] (riel) so back to page aging now
[20:07] (riel) if the page was used since we last scanned it, we make the page age bigger
[20:07] (riel) if the page was not used, we make the page age smaller
[20:07] (riel) and when the page age reaches 0, the page is a candidate for swapout ... we remove the data and use the memory for something else
[20:07] (riel) now there are different ways of making the page age bigger and smaller
[20:07] (riel) for making it bigger, we just add a magic number to the page age ... page->age += 3
[20:07] (riel) for making it smaller, we can do multiple things
[20:02] (laz) is there going to be a log of this available somewhere afterwards ?
[20:02] (Fernand0) yes laz
[20:02] (Fernand0) we'll put on the web
[20:02] (bruder) I'll put it too in .BR (first in English, later translated to Portuguese). (http://pontobr.org)
[20:04] (erikm) riel: why is this a problem? after cpp stops, all its pages are no longer used
[20:05] (debUgo-) erikm: but you are using LFU, not LRU
[20:06] (riel) erikm: they _were_ used ... which is what LFU looks at
[20:06] (riel) erikm: some of them may still be in memory

[20:08] (riel) if we subtract a magic number (page->age -= 1), we will be close to LFU
[20:09] (riel) if we divide the page age by 2 (page->age /= 2), we will be close to LRU
[20:09] (riel) to be honest, I have absolutely no idea which of the two would work best
[20:09] (riel) or if we want system administrators to select this themselves, depending on what the system is doing
[20:10] (riel) page aging is used by Linux 2.0, FreeBSD and Linux 2.4
[20:10] (riel) somebody thought it would be a good idea to remove page aging in Linux 2.2, but it turned out not to work very well ... ;)
[20:10] (riel) ... so we put it back for Linux 2.4 ;))
[20:10] (riel) and another question: (HoraPe) riel: what is linux using?
[20:10] (riel) HoraPe: Linux 2.0 uses the "page->age -= 1" strategy
[20:10] (riel) HoraPe: and Linux 2.4 uses the "page->age /= 2" strategy
[20:11] (riel) maybe the first strategy is better, maybe the second strategy is better
[20:12] (riel) if we have any volunteers who want to test this, talk to me after the lecture ;))
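
The page aging scheme and its two decay strategies can be sketched in a few lines. This is a simplified model, not the kernel's code: `page->age += 3`, `page->age -= 1` and `page->age /= 2` come from the talk, while the `Page` class, the starting age and the upper bound are assumptions made for the example.

```python
AGE_ADVANCE = 3  # from the talk: page->age += 3 when the page was used
AGE_MAX = 64     # upper bound on the age; an assumption for this sketch


class Page:
    def __init__(self, name):
        self.name = name
        self.age = AGE_ADVANCE  # assumption: new pages start with some age
        self.accessed = False   # stands in for the hardware "accessed bit"


def linear_decay(age):
    """page->age -= 1: history fades slowly, closer to LFU."""
    return max(age - 1, 0)


def halving_decay(age):
    """page->age /= 2: history fades quickly, closer to LRU."""
    return age // 2


def scan(pages, decay):
    """One aging pass; returns the pages that became swapout candidates."""
    candidates = []
    for page in pages:
        if page.accessed:
            page.age = min(page.age + AGE_ADVANCE, AGE_MAX)
        else:
            page.age = decay(page.age)
        page.accessed = False  # clear the bit so the next scan sees new use
        if page.age == 0:
            candidates.append(page)
    return candidates


def scans_until_evictable(busy_scans, decay):
    """Use a page during busy_scans scans, then count idle scans until age 0."""
    page = Page("demo")
    for _ in range(busy_scans):
        page.accessed = True
        scan([page], decay)
    idle = 0
    while page.age > 0:
        scan([page], decay)
        idle += 1
    return idle


if __name__ == "__main__":
    print("linear decay: ", scans_until_evictable(10, linear_decay))
    print("halving decay:", scans_until_evictable(10, halving_decay))
```

A page that was busy for ten scans survives far longer under the subtracting strategy than under the halving one, which is why the first behaves more like LFU and the second more like LRU.
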
[20:13] (riel) I will now go on to (page 21) and talk about drop-behind
[20:13] (riel) most of you have probably heard about read-ahead
[20:13] (riel) where the system tries to read in data from a file *before* the program which uses the file needs it
[20:14] (riel) this sounds difficult, but if the program is just reading the file from beginning to end it is quite simple ...
[20:14] (riel) one problem is that this linear read will quickly fill up all memory if it is very fast
[20:14] (riel) and you do not want that, because you also have other things you want to do with your memory
[20:15] (riel) the solution is to put all the pages _behind_ where the program has been reading on the list of pages we will swap out next (the inactive list)
[20:08] (bruder) riel: You are describing generic MM, 2.2, 2.4 or 2.5? :
[20:09] (erikm) riel: yes, but they are no longer in use by a process, so they are free to use for others again (iow, page->count--)
[20:09] (HoraPe) riel: what is linux using? if(random()) page->age -= 1 else page->age /= 2;
[20:10] (HoraPe) ;-)
[20:10] (riel) bruder: still generic MM
[20:10] (riel) bruder: but I will move on to Linux now
[20:11] (movement) erikm: doesn't that assume that an unused page is always selected over an old used one ?
[20:12] (laz) riel: this stuff about the sysadmin selecting aging policy... is that currently implemented? I thought the idea was to give the best algo for most common cases ?
[20:12] (rcastro) riel: how was LRU implemented in version 2.2 then?
[20:12] (erikm) movement: yes, sounds like a sane solution to me
[20:12] (laz) riel: ah, n/m... you sorta answered it :)
[20:12] (movement) erikm: I don't want my sleeping daemon hogging pages from a repeatedly occurring cpp ...
[20:12] (riel) laz: they're both about the same
[20:13] (riel) laz: I don't think it will be really needed to choose
[20:13] (erikm) movement: but it all depends on what the page was used for. if it is still in the page cache, I suppose it will be preferred over the free pages.

[20:18] (riel) so in front of where the program is now, you read in all the data the program needs next (very friendly for the program)
[20:19] (riel) and in exchange for that, you remove the data the program will probably not need any more
[20:19] (riel) of course you can make mistakes here, but if you get it right 90% of the time it is still good for performance ... you do not need to be perfect
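
Read-ahead combined with drop-behind can be modelled with two containers. This is a minimal sketch assuming a file that is read strictly front to back; the window size of 4 pages and the function name are invented, and the "inactive" deque only stands in for the inactive list mentioned in the talk.

```python
from collections import deque


def sequential_read(file_pages, readahead=4):
    """Read a file front to back with read-ahead and drop-behind.

    Returns the largest number of pages kept active at once and the order
    in which pages were handed to the inactive list.
    """
    active, inactive = set(), deque()
    peak = 0
    for pos in range(file_pages):
        # Read-ahead: bring in the pages the reader will want next.
        active.update(range(pos, min(pos + readahead, file_pages)))
        # Drop-behind: the page just passed is unlikely to be needed again,
        # so it goes on the list of pages to be reclaimed first.
        if pos > 0:
            active.discard(pos - 1)
            inactive.append(pos - 1)
        peak = max(peak, len(active))
    return peak, list(inactive)


if __name__ == "__main__":
    peak, dropped = sequential_read(100)
    print("peak active pages:", peak)
    print("pages dropped behind:", len(dropped))
```

However large the file is, the reader never holds more than the read-ahead window among the pages the system tries to keep.
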
[20:19] (riel) ... from the part about hard disks, I will skip almost everything
[20:19] (riel) ... only (page 23) I will discuss today
[20:20] (riel) as you probably know, hard disks are really strange devices
[20:20] (riel) they consist of a bunch of metal (or glass) plates with a magnetic coating on them which spin around at ridiculously high speeds
[20:20] (riel) and there is a read-write arm which can seek across the disk at very low speeds
[20:20] (riel) the consequences of this design are that hard disks have a high throughput .. 20 MB/second is quite normal today
[20:20] (riel) this is fast enough to keep a modern CPU busy
[20:20] (riel) on the other hand, if you need some piece of data, your CPU will have to wait for 5 _million_ CPU cycles
[20:20] (riel) so hard disks are MUCH too slow if you're not reading the disk from beginning to end
[20:21] (riel) this means that so called "linear reads" are very fast
[20:21] (riel) while "random access" is extremely slow
[20:21] (riel) you should not be surprised if 90% of the data is in linear reads, but hard disks spend 95% of their time doing random disk IO
[20:21] (riel) because the linear IO is so fast the disk can do it in almost no time ;)
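
A back-of-the-envelope calculation shows how a small share of random IO can dominate disk time. Only the 20 MB/second throughput is taken from the talk; the 10 ms seek time, the 4 KB request size and the 90/10 split of a 100 MB workload are assumptions chosen for illustration.

```python
THROUGHPUT = 20 * 1024 * 1024  # bytes per second, the 20 MB/second from the talk
SEEK_TIME = 0.010              # seconds per random access; an assumed figure
BLOCK = 4096                   # bytes fetched per random access; also assumed
MB = 1024 * 1024


def io_time(linear_bytes, random_bytes):
    """Return (linear_seconds, random_seconds) for a mixed workload."""
    linear = linear_bytes / THROUGHPUT
    requests = random_bytes / BLOCK
    random_part = requests * (SEEK_TIME + BLOCK / THROUGHPUT)
    return linear, random_part


if __name__ == "__main__":
    linear, random_part = io_time(90 * MB, 10 * MB)
    share = random_part / (linear + random_part)
    print(f"linear: {linear:.1f}s  random: {random_part:.1f}s  random share: {share:.0%}")
```

With these assumed figures the 10% of the data that is read randomly accounts for well over 80% of the time the disk is busy.
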
[20:22] (riel) the normal optimisation for this is "IO clustering", where the OS reads (or writes) as much data in one place of the disk as possible
[20:23] (riel) the "as possible" can not be too large, however ...
[20:23] (riel) if you have "only" 64 MB RAM in your machine, you probably do not want to do readahead in 2MB pieces
[20:23] (riel) because that way you will throw useful data out of memory, which you will need to read in again later, etc...
[20:24] (riel) so it is good to read in a small part of data, but it is also good to read in very big parts of data ... and the OS will have to decide on some good value all by itself
[20:16] (rcastro) riel: (sorry for repeating it) how was LRU implemented in version 2.2 then?

[20:19] (bruder) riel: And in the case of random access (not linear)?
[20:19] (riel) rcastro: NRU
[20:20] (riel) rcastro: if the page was not used since the last time we scanned it, we swap it out
[20:20] (riel) rcastro: no less old or more old ... just old or not old
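
The "just old or not old" rule can be written down directly. This is a toy model of the idea only: real kernels free pages as needed rather than all at once, and the dictionary of accessed bits and the function name are assumptions made for the sketch.

```python
def nru_scan(pages):
    """One not-recently-used pass over pages (dict: page name -> accessed bit).

    Pages untouched since the previous scan are swapped out; the others get
    their bit cleared so the next scan can tell whether they were used again.
    """
    swapped_out = [name for name, accessed in pages.items() if not accessed]
    for name in swapped_out:
        del pages[name]
    for name in pages:
        pages[name] = False
    return swapped_out


if __name__ == "__main__":
    pages = {"a": True, "b": False, "c": True}
    print(nru_scan(pages))  # "b" was not used since the last scan
    pages["a"] = True       # only "a" is touched before the next scan
    print(nru_scan(pages))
    busy = {"x": True, "y": True}
    print(nru_scan(busy))   # everything was used: the scan finds nothing
```

The last call shows the weakness of a single bit: when every page was used since the previous scan, the scan cannot rank the pages at all.
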

[20:25] (riel) Linux has some auto-tuning readahead code for this situation (in mm/filemap.c::generic_file_readahead(), for the interested) but that code still needs some work to make it better
[20:25] (riel) and of course, another way to make "disk accesses" fast is to make sure you do not access the disk
[20:25] (riel) you can do this if the data you need is already (or still) in memory
[20:26] (riel) Linux uses all "extra" memory as a disk cache in the hope that it can avoid disk reads
[20:26] (riel) and most other good operating systems do the same (FreeBSD for example)
[20:27] (riel) ... now I will go on with Linux memory management
[20:27] (riel) ... on (page 28) and further
[20:28] (riel) I will explain how memory management in Linux 2.2 chooses which pages to swap out, what is wrong with that and how we fix the situation in Linux 2.4
[20:28] (riel) and also the things that are still wrong in Linux 2.4 and need to be fixed later ;)
[20:29] (riel) ok, another question: (rcastro) riel: to keep data in memory, have you ever thought about compressed data in memory?
[20:29] (riel) --- this is a good idea in some circumstances
[20:30] (riel) --- research has shown that compressed cache means that some systems can do with less disk IO and are faster
[20:30] (riel) --- on the other hand, for some other systems it makes the system slower because of the overhead of compression
[20:25] (EfrenCA) how can we use "linear reads" if the data are temporary??
[20:28] (rcastro) riel: to keep data in memory, have you ever thought about compressed data in memory? (I am working with that, and that's why I ask you that! :-)
[20:29] (debUgo-) riel: AFAIK, linux don't have a pure elevator algorithm. Would a good elevator algorithm improve VM performance notably?

[20:31] (riel) --- it really depends on what you do with your system if the "compressed cache" trick is worth it or not
[20:31] (riel) --- and it would be interesting to see as an option on Linux since it is really useful for some special systems
[20:31] (riel) --- for example, systems which do not have swap
[20:32] (riel) ... ok, lets move on to (page 31)
[20:32] (riel) Linux 2.2 swapout code is really simple
[20:32] (riel) (at least, that's the idea)
[20:32] (riel) the main function is do_try_to_free_pages()
[20:33] (riel) this function calls shrink_mmap(), swap_out() and a few other - less important - functions
[20:33] (riel) shrink_mmap() simply scans all of memory and will throw away (swap out) all cache pages which were not used since the last time we scanned
[20:34] (riel) and swap_out() scans the memory of all programs and swaps out every program page which was not used since the last time we scanned it
[20:34] (riel) ... (page 32)
[20:34] (riel) this is a really simple system which works well if the system load is not too high
[20:35] (riel) but as soon as the load gets higher, it can completely break down for a few reasons
[20:35] (riel) if, for example, the load on the system is very variable, we get problems
[20:36] (riel) if you have enough memory for 30 minutes and all memory has been used in those 30 minutes, then after 30 minutes _every_ page has been used since the last time we scanned (30 minutes ago)
[20:36] (riel) and then something happens in the system (Netscape gets started)
[20:37] (riel) but the OS has no idea which page to swap out, since all pages were used in the last 30 minutes, when we scanned last
[20:37] (riel) in that situation, the OS usually swaps out the 'wrong' pages
[20:37] (riel) and those wrong pages are needed again 5 milliseconds later
[20:37] (riel) which makes the OS swap out *other* wrong pages again, until everything settles down
[20:39] (riel) so every time the system load increases, you have a period where the system is really slow and has to adjust to the load ...
[20:39] (riel) another problem is that (in shrink_mmap) we scan and swap out pages from the same function
[20:39] (riel) this breaks down when we have a very high load on the system and a lot of the pages we want to swap out need to be written to disk first
[20:39] (riel) shrink_mmap() will scan every page in memory and start disk IO for the pages that need to be written to disk
[20:39] (riel) after that it will start scanning at the beginning again
[20:31] (Rob) riel: Any idea what the determining factor is re: compression performance?
[20:32] (erikm) Rob: it depends on the compression speed vs. the disk speed.
[20:32] (riel) debUgo-: we'll talk about that after the talk, ok?
[20:33] (debUgo-) riel: ok

[20:40] (riel) and no page it sees has been used since the last time we scanned it, since kswapd was the only thing running
[20:40] (riel) at that point the system -again- starts swapping out the wrong pages
[20:41] (riel) a question: (movement) is this the do_try_to_free_pages() printk we hear so much about on lkml ?
[20:41] (riel) --- this printk is called when do_try_to_free_pages() cannot find pages to swap out
[20:41] (riel) --- not when do_try_to_free_pages() swaps the wrong pages by accident
[20:42] (riel) --- so these things are not the same
[20:42] (riel) ... lets move on to (page 33) and see how we fix these problems in Linux 2.4
[20:43] (riel) the two big changes for Linux 2.4 are page aging and the separation of page aging and page writeback to disk
[20:44] (riel) page aging means we are more precise in choosing which page we swap out, so we will have a better chance of having the pages we need in memory
[20:44] (riel) and the system will perform better when memory is getting full
[20:45] (riel) the separation of page aging and page flushing means that we will not swap out the wrong page just because the right page still needs to be written to disk and we cannot use it for something else yet
[20:46] (riel) ... on (page 35) I will explain about the memory queues we have in Linux 2.4
[20:46] (riel) we have 3 "types" of pages in Linux 2.4
[20:46] (riel) active pages, inactive_dirty pages and inactive_clean pages
[20:46] (riel) we do page aging on the active pages
[20:47] (riel) and the inactive_dirty and inactive_clean pages are simply sitting there waiting to be used for something else
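
The three kinds of pages can be modelled as three queues. This is a simplified sketch and not the Linux 2.4 code: the list names come from the talk, but the classes, the aging constants and the rule that a flushed dirty page moves to the clean list are assumptions used to show how aging and writeback can be kept separate.

```python
from collections import deque


class Page:
    def __init__(self, name, dirty=False):
        self.name = name
        self.dirty = dirty      # must be written to disk before reuse
        self.age = 3            # assumed starting age
        self.accessed = False   # stands in for the hardware accessed bit


class PageLists:
    def __init__(self):
        self.active = []
        self.inactive_dirty = deque()
        self.inactive_clean = deque()

    def age_active(self):
        """Age the active pages; pages reaching age 0 move to an inactive list."""
        still_active = []
        for page in self.active:
            page.age = page.age + 3 if page.accessed else page.age // 2
            page.accessed = False
            if page.age > 0:
                still_active.append(page)
            elif page.dirty:
                self.inactive_dirty.append(page)
            else:
                self.inactive_clean.append(page)
        self.active = still_active

    def flush(self):
        """Write back dirty inactive pages; afterwards they count as clean."""
        while self.inactive_dirty:
            page = self.inactive_dirty.popleft()
            page.dirty = False  # stands in for the actual disk write
            self.inactive_clean.append(page)

    def reclaim(self):
        """Hand out a page for reuse; only clean inactive pages qualify."""
        return self.inactive_clean.popleft() if self.inactive_clean else None


if __name__ == "__main__":
    lists = PageLists()
    a, b, c = Page("a"), Page("b", dirty=True), Page("c")
    lists.active = [a, b, c]
    for _ in range(2):
        c.accessed = True  # only "c" stays in use
        lists.age_active()
    print([p.name for p in lists.active])
    print(lists.reclaim().name, lists.reclaim())
    lists.flush()
    print(lists.reclaim().name)
```

Because reclaiming only ever looks at the clean list, a dirty page that is waiting for its disk write never forces the system to pick a different, still useful page instead.
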
[20:47] (riel) ... now we go back to (page 34) [sorry]
[20:40] (movement) is this the do_try_to_free_pages() printk we hear so much about on lkml ?
[20:41] (rcastro) riel: is this problem related to the kswap lock_kernel()? was that kept in 2.4?
[20:42] (riel) rcastro: nope
[20:42] (riel) rcastro: just think about how many programs you can run in 1 millisecond ;)

[20:52] (riel) so when the system gets a burst of activity again after 30 minutes, the system knows exactly which pages to swapout and which pages to keep in memory
[20:52] (riel) this fixes the biggest problems we have with Linux 2.2 VM
[20:53] (riel) ... because we have little time left, I will now go to the Out Of Memory (OOM) killer on (page 43)
[20:50] (riel) rcastro: we keep data around as long as possible, on the inactive_clean list
[20:51] (bruder) riel: when all inactive_* are empty and we need more memory, pages in active list are swapped out?
[20:52] (riel) movement: hysterical raisins
[20:52] (riel) umm, historical reasons

[20:54] (riel) which will be the last part of the lecture, after this you can ask questions ;)
[20:54] (riel) ok, the OOM killer
[20:54] (riel) when memory *and* swap are full, there is not much you can do
[20:54] (riel) in fact, you can either sit there and wait until a program goes away, or you can kill a program and hope the system goes on running
[20:54] (riel) in Linux, the system always kills a process
[20:55] (riel) Linux 2.2 kills the process which is currently doing an allocation, which is very bad if it happens to be syslog or init
[20:55] (riel) Linux 2.4 tries to be smart and select a "good" process to kill
[20:56] (riel) for this, it looks at the size of the process (so killing 1 process gets us all the memory back we need)
[20:56] (riel) but also at whether it is a root process or whether the process has direct hardware access (it is very bad to kill these programs)
[20:56] (riel) and at the amount of time the process has been running and the CPU time it has used
[20:57] (riel) because it is better to kill a 5-second old Netscape than to kill your mathematical calculation which has been running for 3 weeks
[20:57] (riel) even if the Netscape is smaller ...
[20:57] (riel) killing the big calculation will mean the computer loses a lot of work, which is bad
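
The selection criteria listed in the talk (size, root, hardware access, run time, CPU time) can be combined into a single score. This sketch only follows the general shape of such a heuristic: the `Process` fields, the square-root damping and the divide-by-4 factors are assumptions for illustration, not a quotation of the kernel's formula, and the example processes are invented.

```python
import math
from dataclasses import dataclass


@dataclass
class Process:
    name: str
    size_mb: float               # memory the process is using
    cpu_seconds: float           # CPU time consumed so far
    run_seconds: float           # wall-clock time since it was started
    is_root: bool = False
    raw_hardware: bool = False   # has direct hardware access


def badness(p: Process) -> float:
    """Higher score means a better candidate to kill. All weights are assumed."""
    score = float(p.size_mb)                                # big frees the most
    score /= math.sqrt(max(p.cpu_seconds, 1.0))             # spare heavy workers
    score /= math.sqrt(math.sqrt(max(p.run_seconds, 1.0)))  # spare old processes
    if p.is_root:
        score /= 4   # root processes are more likely to be important
    if p.raw_hardware:
        score /= 4   # killing these may leave hardware in a bad state
    return score


def select_victim(processes):
    return max(processes, key=badness)


THREE_WEEKS = 3 * 7 * 24 * 3600

processes = [
    Process("netscape", size_mb=40, cpu_seconds=2, run_seconds=5),
    Process("calculation", size_mb=200, cpu_seconds=THREE_WEEKS, run_seconds=THREE_WEEKS),
    Process("syslogd", size_mb=1, cpu_seconds=10, run_seconds=1_000_000, is_root=True),
    Process("xserver", size_mb=30, cpu_seconds=600, run_seconds=86_400,
            is_root=True, raw_hardware=True),
]

if __name__ == "__main__":
    for p in processes:
        print(f"{p.name:12s} {badness(p):.4f}")
    print("victim:", select_victim(processes).name)
```

With these made-up numbers the 5-second old Netscape is chosen even though the three-week calculation is five times larger, which matches the behaviour described in the talk.
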
[20:58] (riel) for Linux 2.5, I guess some people will also want to have the OOM killer look at which _user_ is doing bad things to the system
[20:58] (riel) but those are things to do in the future
[20:58] (riel) ... on (page 44) you can find some URLs with interesting information
[20:59] (riel) ... thank you for your time, if you have any questions, feel free to join the discussion on #qc
[20:59] (riel) ... this is the end of my talk, but I will be in #qc for a bit more time
[20:59] (Fernand0) clap clap clap clap clap clap clpa clpar clap
[20:59] (Fernand0) clap clap clap clap clap clap clpa clpar clap
[20:59] (Fernand0) clap clap clap clap clap clap clpa clpar clap
(mjc) plas plas plas plas plas plas plas plas plas plas
[21:01] (riel) btw, for people interested in Linux kernel hacking we have a special IRC channel
[21:02] (riel) on irc.openprojects.net #kernelnewbies
[21:02] (riel) see http://kernelnewbies.org/ for the #kernelnewbies website
[21:02] (riel) most of the time you can find me (and other kernel hackers) on that channel
[21:03] (riel) if you have some in-depth questions or find something interesting when reading my slides, you can always go there
[21:04] (Fernand0) well, my friends
[21:05] (Fernand0) feel free to continue discussing at #qc
[21:05] (Fernand0) many thanks to Rik van Riel and to all of you for coming here
[20:54] (movement) :)
[20:55] (riel) bruder: pages in the active list are aged and moved to the inactive_ lists
[20:56] (bruder) so, only inactive_ pages are swapped out.
[20:58] (riel) bruder: yes
[20:59] (rcastro) riel: so is there a chance that the kernel chooses my sysklogd, for instance?
[20:59] (Fernand0) clap clap clap clap clap clap clpa clpar clap
[20:59] (Fernand0) clap clap clap clap clap clap clpa clpar clap
[20:59] (Fernand0) clap clap clap clap clap clap clpa clpar clap
[20:59] (riel) rcastro: in 2.2, definitely
[20:59] (movement) riel: large page sizes you mention in 2.5
[21:00] (movement) riel: aren't these very problematic with a sort of VM "false sharing"
[21:00] (riel) rcastro: when the kernel cannot free memory, it spits out a message and syslogd tries to allocate memory
[21:00] (riel) rcastro: so in 2.2, syslogd is often the first process to die ;(
[21:00] (rcastro) riel: I noticed that
[21:00] (movement) riel: if I use only a little of a large page, I might be keeping a big pile of junk in memory
[21:00] (movement) and forcing useful pages to swap out...
[21:00] (riel) movement: indeed
[21:01] (riel) movement: you don't want them to be too large
[21:01] (movement) so it is a balance between algorithmic overhead and what I mentioned ?
[21:01] (erikm) movement: that's a known trick to kill Unix systems: the malloc() bomb
[21:01] (riel) movement: exactly, you want big pages so the administration is simple, but you cannot have them too big
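The balance discussed above can be made concrete with a little arithmetic. The sketch below assumes a process that touches 5000 bytes and a machine with 1 GiB of RAM; both figures, the page sizes, and the helper names `resident_and_wasted` and `pages_to_track` are illustrative assumptions, not numbers from the talk.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def resident_and_wasted(used_bytes, page_size):
    """Return (bytes kept in RAM, bytes of that which are unused)."""
    pages = -(-used_bytes // page_size)   # ceiling division
    resident = pages * page_size
    return resident, resident - used_bytes


def pages_to_track(ram_bytes, page_size):
    """Number of pages the VM has to administer for a given RAM size."""
    return ram_bytes // page_size


for size in (4 * KIB, 64 * KIB, 4 * MIB):
    resident, wasted = resident_and_wasted(5000, size)
    print(size, resident, wasted, pages_to_track(GIB, size))
```

Larger pages shrink the number of pages to administer, but the unused part of each partly used page grows with the page size.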
[21:01] (Rob) riel: Why kill a process rather than let the allocation fail?
[21:01] (movement) erikm: I don't follow the connection ... vm resource limits would surely deal with that
[21:01] (rcastro) riel: I'd like to talk a little bit more about compressed caching. One guy (Kaplan) did a PhD thesis in which he says that if a dynamic compressed caching system were implemented, it would be worth it. There would be a decrease of as much as 80% in paging activity. What do you think?
[21:02] (rcastro) riel: do you see any clear disadvantages in an implementation?
[21:02] (debUgo-) riel: IIRC, someone posted a patch that implements an OOM-plugin system, what do you think about that?
[21:02] (erikm) movement: yes, if you have resource limits. most Linux boxes come without.
[21:02] (movement) erikm: I know. I'm not talking about that though ...
[21:03] (riel) Rob: letting the allocation fail usually means the process dies ...
[21:03] (riel) rcastro: Kaplan is probably right
[21:03] (riel) rcastro: the main problem is that nobody has implemented it yet in Linux
[21:03] (riel) rcastro: and the implementation may well be very difficult
[21:04] (mike) most people recommend using a swap space twice the size of the RAM, what is this based on...?
[21:04] (Fernand0) well, my friends
[21:04] (Fernand0) feel free to continue discussing here
[21:04] (rcastro) riel: have you thought about implementing it?
[21:04] (Fernand0) many thanks to Rik van Riel and to all of you for coming here
[21:05] (Rob) Thanks, Fernand0!
[21:05] (riel) rcastro: I haven't looked at it at all
[21:05] (riel) rcastro: but I suspect it is very difficult to implement it in such a way that performance is good
[21:05] (riel) rcastro: I am afraid that the "simple" implementation will only make performance worse
[21:05] (rcastro) riel: I started an implementation in 2.2, but I have a cache with no compression between RAM and swap (that doesn't work very well ;-)
[21:05] (rcastro) riel: I see
[21:05] (riel) rcastro: that works just fine
[21:05] (riel) rcastro: compressed swap makes very little sense
[21:05] (riel) rcastro: since the disk spends most of its time _seeking_
[21:06] (riel) rcastro: and not reading or writing data
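A rough calculation shows why halving the amount of data written matters so little when seeking dominates. The figures below (10 ms per seek, 20 MB/s transfer rate, 4 KiB pages, 2:1 compression) are assumptions chosen for the example, and `io_time_ms` is a hypothetical helper.

```python
def io_time_ms(nbytes, seek_ms=10.0, mb_per_s=20.0):
    """Time for one disk I/O: one seek plus the data transfer."""
    transfer_ms = nbytes / (mb_per_s * 1e6) * 1e3
    return seek_ms + transfer_ms


plain = io_time_ms(4096)        # one uncompressed 4 KiB page
compressed = io_time_ms(2048)   # the same page compressed 2:1
saving = (plain - compressed) / plain

print(f"{plain:.4f} ms vs {compressed:.4f} ms, saving {saving:.1%}")
```

Under these assumptions the transfer is a small fraction of each I/O, so compressing the page saves only about one percent of the time per request.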
[21:06] (debUgo-) seek times haven't improved over the last 4 years :o/
[21:08] (bruder) riel: Based on that problem (seek time, not read/write time), we could think of a file system with swap space distributed all over (I know, a crazy idea)!
[21:08] (rcastro) riel: do you think it could be implemented by the usual VM guys in the future, or have you never talked about it?
[21:08] (debUgo-) riel: AFAIK, Linux doesn't have a pure elevator algorithm. Would a good elevator algorithm improve VM performance notably?
[21:08] (bruder) :)
[21:08] (riel) rcastro: we haven't planned implementing this
[21:08] (riel) rcastro: the "usual guys" are extremely busy
[21:08] (riel) rcastro: this is a very special project and we hope someone else will do it
[21:09] (Rob) riel: I might be interested in something like that
[21:09] (riel) debUgo-: I don't think so
[21:09] (riel) debUgo-: a pure elevator is not fair
[21:09] (rcastro) riel: ok. :-) thank you for your lecture and attention... sorry for "usual guys", sometimes I do not express myself well in English.
[21:09] (riel) debUgo-: with a 80GB disk, a request on the "far away" side of the elevator might have to wait a VERY LONG time
[21:10] (riel) rcastro: I understood it perfectly, so it was good ;)
[21:10] (erikm) Rob: you're thinking about PDAs?
[21:10] (Rob) erikm: yes
[21:10] (riel) bruder: remember that you have to read the data back in too ... ;)
[21:10] * erikm guessed so
[21:11] (debUgo-) well... i suppose that there must be some kind of 'request aging' to deal with that
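The fairness problem and the 'request aging' idea can be sketched with a toy scheduler. This is a hypothetical illustration, not Linux's elevator: it serves the pending request nearest to the disk head (a simple stand-in for a seek-optimizing scheduler), and the assumed `max_passes` parameter forces a request to be served once it has been passed over that many times.

```python
def service_order(requests, head=0, max_passes=None):
    """Return the order in which sector requests are served.

    requests:   sector numbers in arrival order
    max_passes: if set, a request passed over this many times is served
                next (earliest arrival first); None disables aging
    """
    pending = [[sector, 0] for sector in requests]  # [sector, times passed over]
    order = []
    while pending:
        starved = [r for r in pending
                   if max_passes is not None and r[1] >= max_passes]
        if starved:
            choice = starved[0]
        else:
            choice = min(pending, key=lambda r: abs(r[0] - head))
        pending.remove(choice)
        for r in pending:
            r[1] += 1          # everyone still waiting was passed over once
        head = choice[0]
        order.append(head)
    return order


queue = [900, 10, 20, 30, 40]   # the request for sector 900 arrived first
print(service_order(queue))                 # no aging
print(service_order(queue, max_passes=2))   # with request aging
```

Without aging the far-away request is served last even though it arrived first; with aging it is served after being passed over twice, at the cost of one long seek.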
[21:11] (riel) ok everybody, I have to go now (portuguese lessons)
[21:11] (Rob) riel: thank you very much
[21:11] (riel) if you want to talk with me, go to irc.openprojects.net #kernelnewbies
[21:11] (riel) I will be there every working day most of the time
[21:11] (erikm) ok, thanks riel
[21:12] (debUgo-) gotta go
[21:12] (riel) and if you look for more information ... http://kernelnewbies.org/
[21:12] * riel runs off to his pt_BR lesson
[21:13] (movement) I'm planning to clean up the logs of these two channels and put them on http://kernelnewbies.org/ soon
[21:13] (Rob) erikm: are you doing diskless stuff with the LART?
[21:13] (movement) if that's ok with you riel (and link to your slides of course)
[21:13] (oroz) movement: the logs of these channels will also be uploaded to the congress web page ;)
[21:14] (erikm) Rob: sometimes. I currently have two systems on my desk: with and without disk
[21:15] (gwm) He's really run off to his pt_BR lesson... [aka disappeared ;)]
[21:15] (Rob) erikm: what kind of storage is the diskless system using?
[21:15] (erikm) Rob: currently the plain old ramdisk loaded from flash at boot.
[21:16] (erikm) Rob: but maybe I'll try cramfs soon (compressed RAM filesystem)
[21:17] (Rob) erikm: ah. I'm currently trying to see about mapping read-only pages directly from flash (romfs)
[21:17] (Rob) erikm: We have cramfs now, but it eats too much RAM.
[21:17] (erikm) Rob: that should be possible.
[21:18] (erikm) Rob: (the romfs, I mean)
[21:19] (erikm) OK, if somebody missed something, movement has a log: http://www.movement.uklinux.net/rieltalk
[21:19] (erikm) and: http://www.movement.uklinux.net/rielquestions
[21:20] (Rob) excellent
[21:24] (erikm) Rob: can't you use the compression from cramfs and put it into romfs? then you have the advantages of both romfs and cramfs
[21:26] (bruder) for people around Brazil and AL, I have a log (lecture and #qc) at http://pontobr.org/noticia.php3?nid=1511
[21:26] (Rob) erikm: not sure what you mean...?
[21:29] (Rob) erikm: the compressed cramfs pages are stored in (read-only) flash, but they have to be decompressed into RAM
[21:29] (erikm) Rob: well, you said that cramfs takes too much memory and you're using romfs instead. but romfs doesn't do compression.
[21:30] (erikm) Rob: ah, OK, so you already use ramfs as cromfs/
[21:30] (Rob) erikm: right... I want to store some things uncompressed in romfs so I don't have to map pages from RAM for them
[21:32] * erikm looks into the romfs source
[21:32] (Rob) currently romfs doesn't support what I want to do though.
[21:33] (erikm) hmm. maybe it needs to be rewritten?
[21:33] * Rob smiles.
[21:34] * Rob waves.
[21:34] (Beltzak) quit
[21:34] (Beltzak) iy xDDD
[21:34] (Pal) bye Beltzak
[21:36] (Rob) erikm: I'm not sure I understand how to map pages the right way to do this...
[21:36] (erikm) Rob: look at ramfs. it's the most simple filesystem. it's very clean.
[21:36] (Rob) since these are not pages that can be reclaimed/reused
[21:37] (MJesu) riel :)) telefonica ?
[21:37] (erikm) Rob: you hack linux as well?
[21:38] (Rob) erikm: I am just starting to get into the kernel
[21:39] (erikm) Rob: it might be nice to join #kernelnewbies. there are some filesystem hackers as well over there, like Daniel Phillips (Tux2 fs)
[21:40] *** riel has quit IRC (Ping timeout for riel[brutus.conectiva.com.br] )
[21:40] (Rob) erikm: thanks...
[21:41] *** riel (riel@brutus.conectiva.com.br) has joined #qc
[21:42] * erikm looks if philips is around on #kernelnewbies
[21:43] (erikm) Rob: yes, he is, though not active for the last couple of minutes
[21:43] (math) I guess I missed all the fun
[21:43] (erikm) math: here is a URL with a log: http://www.movement.uklinux.net/rieltalk
[21:44] (erikm) math: and: http://www.movement.uklinux.net/rielquestions
[21:44] (math) erikm: thanks
[21:54] *** riel has quit IRC (Ping timeout for riel[brutus.conectiva.com.br] )

Contact: