[NOTE: Some updates to the data and tables below; thanks to all who have pointed out issues. Please do send me email if you have questions or concerns, preferably after reading through this entire post.]
Last week, I spent a few hours csh’ing, awk’ing, python’ing, and even sed’ing my way to a “Top 50” list for systems researchers, showing just who has been filling SOSP and OSDI with their thoughts over the years. It got a little bit of traction, generating thousands of pageviews in the first hours (one has to admire the built-in analytics of blogger), inspiring a SIGMOD-based version, an NSDI top list, a fun news post on the Washington CS web site, and email discussions around the world (or so I hear).
Butler Lampson once wrote “Don’t generalize; generalizations are generally wrong.” Here, I will eschew Lampson’s advice and extend the Top 50 in Systems (narrowly defined) to a much wider look at Systems broadly defined, including fields on the practical side of Computer Science such as computer architecture, databases, networking, programming languages, and security. The combined look at the most prolific researchers in these areas leads to a broad ranking, which I call ZIPS, short for Z’s Index of Prolific Scholars (the alternate name, Z’s Index of Top-publishing Scholars, was rejected by my wife and collaborator, Andrea, as being unsound acronym-ically, if you know what I mean). Why Z instead of R, you ask? Who knows; I guess Z has always seemed like the coolest letter in my name.
Method
I’ll now describe my method, followed by the results and a little discussion as to the value of these sorts of endeavors. Be patient!
Most of the data was obtained by downloading XML files from DBLP; some simple wget scripts, python-based HTML parsing, and a few other scripting adventures get you most of the way. For conferences without good DBLP sourcing (e.g., NSDI, FAST), I relied upon other sources of information: for NSDI, for example, I was given the most recent years of data by the amazing David Andersen of CMU; for FAST, I had been collecting the data myself for some time.
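For the curious, here is a minimal sketch of the counting step, assuming a locally downloaded DBLP XML file whose records look like DBLP’s usual inproceedings entries (with author, pages, and year children); the file name is hypothetical, and the real scripts were considerably messier.

```python
# count_authors.py: a minimal sketch of per-venue paper counting, assuming
# a locally downloaded DBLP XML file (e.g., fetched via wget) whose records
# look like <inproceedings><author>...</author><pages>121-134</pages>
# <year>2011</year></inproceedings>.
# (Real DBLP dumps reference a DTD for character entities; handling that,
# plus the page-length filtering and name normalization described next,
# is elided here.)
from collections import Counter
import xml.etree.ElementTree as ET

def count_papers(xml_path):
    """Return a Counter mapping author name -> number of papers in this file."""
    counts = Counter()
    tree = ET.parse(xml_path)
    for paper in tree.getroot().iter("inproceedings"):
        for author in paper.findall("author"):
            if author.text:
                counts[author.text] += 1
    return counts

if __name__ == "__main__":
    # "sosp.xml" is a hypothetical file name, for illustration only.
    for name, n in count_papers("sosp.xml").most_common(10):
        print(f"{name}: {n}")
```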
Unfortunately, DBLP is not a perfect data source. For example, some conference listings include many short entries (not real papers). Thus, I (somewhat arbitrarily) filtered out any papers that are four pages or fewer. If this removed your 4-page super-impactful paper from the results, my apologies! I also on occasion noticed that some authors show up under slightly different names in DBLP, and tried to rectify that where possible (e.g., sometimes the prolific security scholar Dawn Song is just “Dawn Song”, and sometimes she is “Dawn Xiaodong Song”, which, without care, leads to Berkeley having two leading scholars in security); it is almost certain that I have missed some of these cases. Finally, some page counts were missing; I simply filled those in as needed (fortunately not too many).
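To make the cleanup concrete, here is a small sketch of the two fixes just described: dropping entries of four pages or fewer (parsed from DBLP-style page ranges like “121-134”, with a guessed default when the field is missing) and folding author aliases into one canonical name. The ALIASES table is a hypothetical stand-in for the actual mapping, not the full list.

```python
# Cleanup sketch: page-length filtering and author-name normalization.
# ALIASES is a hypothetical stand-in for the real, hand-built mapping.
ALIASES = {
    "Dawn Xiaodong Song": "Dawn Song",
    "Alexander Aiken": "Alex Aiken",
}

def page_length(pages, default=10):
    """Parse a DBLP-style page field like '121-134'; guess a default if missing."""
    if not pages or "-" not in pages:
        return default          # missing page counts are simply filled in
    try:
        lo, hi = pages.split("-")[:2]
        return int(hi) - int(lo) + 1
    except ValueError:
        return default          # odd formats (e.g., '3:1-3:14') fall back too

def keep(pages_field):
    """Drop short entries (4 pages or fewer) that are not real papers."""
    return page_length(pages_field) > 4

def canonical(name):
    """Fold known aliases into a single canonical author name."""
    return ALIASES.get(name, name)
```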
I also had to decide on a set of conferences to include in ZIPS (Systems-ish Edition), which I did based on my own thoughts as to which conferences are “important” in a given area, as well as data availability. Undoubtedly this is the most biased portion of ZIPS, but such is life.
Thus, the list: for architecture, MICRO, ISCA, and ASPLOS (though this last is perhaps broader than just architecture); for databases, SIGMOD and VLDB; for systems, SOSP and OSDI; for storage, FAST; for networking, NSDI, SIGCOMM, IMC, and MOBICOM; for programming languages, compilers, and verification, PLDI, POPL, and CAV; for security, CCS and IEEE S&P; and finally, don’t forget SIGMETRICS, where a number of these sorts of folks publish (though mostly networking these days). There certainly could (and should?) be other areas included, such as sensor networks, etc.; your (constructive) thoughts on this would be most appreciated.
As before, I combine SOSP and OSDI into one “super conference”, which in some ways undercounts each venue but (I think) gives a better sense of the community than splitting the two would. I also had to combine the proceedings of PVLDB (the journal) with VLDB (the former conference) to get a sensible ranking there. I would also love to include more USENIX conferences, but for some reason DBLP does not do a very good job with these data sets; thus, some important conferences (e.g., USENIX Security) are missing.
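The “super conference” merge is the simplest part of the pipeline; a sketch (building on the per-venue Counter objects above) might look like the following, with file names again hypothetical.

```python
# "Super conference" sketch: per-venue author counts (Counter objects,
# as built above) are simply added together into one combined ranking.
from collections import Counter

def merge(*venue_counts):
    total = Counter()
    for counts in venue_counts:
        total.update(counts)
    return total

# e.g., sosp_osdi = merge(count_papers("sosp.xml"), count_papers("osdi.xml"))
# and likewise for VLDB + PVLDB.
```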
After obtaining all of this data and munging it a bit, I generated a “Top 50” for each of the above conferences. The “Top 50” list, as discussed previously, simply ranks researchers based on how many papers they have published at a given venue. To be first on the SOSP/OSDI list, for example, you need to publish at least 24 papers at those venues; sadly, 24 will only get you into a tie with Frans Kaashoek, so you had better get to work. Note that I am not (necessarily) commenting on the perceived value of said lists, though I do discuss how such lists may be useful below.
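The “Top 50” itself is then just a count-and-sort; the sketch below cuts the list at 50 while keeping everyone tied at the 50th-place count (that tie handling is my reading of how the lists behave, given the PLDI tie at 50th mentioned in the updates below).

```python
# Top-50 sketch: rank authors by paper count at a venue, keeping everyone
# who ties with the 50th-place count (so ties at the cutoff are included).
def top_n(counts, n=50):
    ranked = counts.most_common()          # (name, papers), descending
    if len(ranked) <= n:
        return ranked
    cutoff = ranked[n - 1][1]              # paper count of the 50th entry
    return [(name, c) for name, c in ranked if c >= cutoff]
```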
I then associated each person on said list with their current affiliation, which was a bit of a pain and likely error-prone, as it is not always obvious from web search and other sources whether a person is still at a given locale. Thus, given the way I approached this, even if a person published a lot of papers while at a given institution, those papers no longer count for that institution once the person has left. This is bad for Wisconsin, for example, in Databases, as Dave DeWitt, Mike Carey, and Raghu Ramakrishnan no longer add to our Database total (though Wisconsin still does pretty well in Databases, as it turns out, which is no surprise).
I also filtered each conference’s “Top 50” list so as to remove students, as I wanted the list to represent the more permanent people (i.e., faculty) at a given institution; sorry students! However, note that students do show up, immediately, when they are hired by some university; for example, Shyam Gollakota would not have been on these lists as a graduate student at MIT; he now counts for Washington in the SIGCOMM column, already climbing this list so early in his career.
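A rough sketch of the affiliation and student-filtering steps follows; AFFILIATION and STUDENTS are hypothetical hand-maintained tables standing in for the manually gathered (and, as noted, error-prone) data, and tie-aware ranking is simplified away.

```python
# Affiliation sketch: map each ranked scholar to a current institution,
# dropping students and anyone whose affiliation could not be determined.
# AFFILIATION and STUDENTS are hypothetical hand-maintained tables.
from collections import defaultdict

AFFILIATION = {"David G. Andersen": "CMU", "Shyam Gollakota": "Washington"}
STUDENTS = set()    # names to exclude at the time of the snapshot

def by_institution(top_list):
    """Group a venue's Top-50 entries by current institution."""
    rows = defaultdict(list)
    for rank, (name, n_papers) in enumerate(top_list, start=1):
        # (rank here ignores ties, for brevity)
        if name in STUDENTS or name not in AFFILIATION:
            continue
        rows[AFFILIATION[name]].append((rank, name, n_papers))
    return rows
```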
I also pruned some lists when the absolute number of publications to get into the “Top 50” was “too low” (deemed somewhat arbitrarily by me), making some “Top” lists shorter than 50 (e.g., FAST).
Finally, I have filtered these results to only include Universities; thus, no companies or research labs are included. This removes prolific publishing entities such as Microsoft, IBM, ATT, Bell Labs, and many others from these lists; I will shortly publish the industry version of this information if people are interested.
Results
And now, the results. I give you, in all its glory, ZIPS (Systems-ish Edition). Click the following for [PDF] or [JPG] versions.
Each column shows the scholars at a specific venue (e.g., “ASPLOS”), selected from the “Top 50” list of those who have published the most at that venue, and each row of the table shows which of these “Top 50” scholars are affiliated with a given institution (e.g., “Berkeley”), in the following form:
- RankAtVenue. LastName (PapersPublishedAtThatVenue)
Thus, when you see “7. D.Andersen (9)” under the NSDI column, it means that Dave Andersen is 7th on the NSDI list, having published 9 papers at that venue to get there. You see Andersen in the CMU row because he is a faculty member there.
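For reference, here is a tiny formatting sketch that produces cells in this style (the first-initial abbreviation is simply how the table reads; the exact script is not shown here):

```python
# Cell-formatting sketch: "7. D.Andersen (9)" means rank 7 at the venue,
# with 9 papers published there.
def format_cell(rank, name, n_papers):
    parts = name.split()
    short = f"{parts[0][0]}.{parts[-1]}" if len(parts) > 1 else name
    return f"{rank}. {short} ({n_papers})"

# format_cell(7, "David Andersen", 9) -> "7. D.Andersen (9)"
```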
[UPDATE (05/23: 09:00am): A few more affiliations have been fixed.]
[UPDATE (05/23 09:00am): The following paragraph has been updated to match the data above.]
The list is ordered by the total number of scholars in the “Top 50” from each school, as seen in the next-to-rightmost column. In this ordering, Stanford is #1 with 22, Wisconsin and CMU are tied for #2 with 20, and so forth as per the table. I cut off the full list at the Top 20 (5 people or more), as there are large numbers of schools with 4 or 3 scholars; however, I could make this entire data set available if people would like.
Also included (in the rightmost column) is the number of “Top 10” scholars at each institution. In this regard, Wisconsin is #1(!) with 13, followed by the usual suspects.
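Putting the rows in order is then a simple sort; here is a sketch, assuming each institution’s row holds its per-venue lists of (rank, name, papers) entries as built above.

```python
# Row-ordering sketch: institutions are sorted by the total number of
# Top-50 scholars across venues; the rightmost column counts how many of
# those entries are ranked 10th or better at their venue.
def order_rows(table):
    """table: {institution: {venue: [(rank, name, papers), ...]}}"""
    summary = []
    for inst, venues in table.items():
        total = sum(len(entries) for entries in venues.values())
        top10 = sum(1 for entries in venues.values()
                      for (rank, _, _) in entries if rank <= 10)
        summary.append((inst, total, top10))
    summary.sort(key=lambda row: row[1], reverse=True)
    return summary
```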
You can of course look over the ZIPS table in more detail to draw whatever kinds of conclusions you like. It is fun to look at a given row of the ZIPS table to better understand who is actively publishing at a given institution; it is also informative to scan down a column in ZIPS to see who is publishing at a given venue.
I also include here each of the “Top 50” lists for your inspection. Note that filtering by page count (as described above) leads to a slightly different set of results than others have seen (e.g., my SIGMOD count is a little different than Stephen Tu’s).
MICRO - ISCA - ASPLOS - FAST - SOSP+OSDI - NSDI - SIGCOMM - SIGMETRICS - IMC - MOBICOM - SIGMOD - VLDB+PVLDB - PLDI - POPL - CAV - CCS - SP
Most of these lists only go through 2012; thus, if you recently had great success at XXX-13, it is not (yet) reflected in the lists. Exceptions: FAST, NSDI (both updated through 2013).
[UPDATE (05/21 10:14am): Due to a numbering convention on the CCS site, CCS years 2004 and 2006 were being counted incorrectly; this is now fixed. Thanks to Vinod Ganapathy for the pointer.]
[UPDATE (05/22 07:45am): A fair amount of name disambiguation has been performed (mostly by script-aided manual inspection, but also thanks to T. Anderson, S. Katti, and others for pointing out problems). Also, numerous people pointed out that some folks had passed away (sadly), so these have been removed (as this list is meant to represent current researchers). Others pointed out some long-retired folks, so those too have been removed. All data above should reflect the latest, which is slightly different from yesterday’s (but mostly similar). One small change was that disambiguation in PLDI moved a large-ish group up to a tie for 50th(!), thus adding a few names to the list; the most humorous disambiguation was that there used to be two top PLDI researchers, Alex Aiken and Alexander Aiken, both quite prolific!]
Discussion
Since the publication of the Systems “Top 50” list, I have heard a fair amount of discussion as to the importance of such lists. I will now wade (carefully) into this discussion.
What this type of list might be useful for:
- I think this information is particularly useful for potential graduate students who are interested in a given area and are weighing their options as to which graduate school to join. If you want to be a computer architect, for example, it might be good to see if a particular department actually has professors who actively publish in that area in the top venues.
What this type of list is not (necessarily) useful for:
- I don’t think this type of list is particularly useful for judging whether a particular person or institution is having “research impact” (though it may be correlated). Weighting by citation count could help here (something in my future plans); however, even doing that does not assess whether an idea has transitioned to industry or become widely used in deployment, which is perhaps a better/different metric (though one that is harder to quantify).
As the Washington news site hilariously wrote: “We reject these sorts of beauty contests unless we fare well, in which case we trumpet them as authoritative.” Perhaps this is the best summary of ZIPS; with that in mind, I personally trumpet ZIPS as highly authoritative, especially in the area of storage systems. Enjoy!
[UPDATE (05/21 9:20am): I should probably have listed a number of other caveats, though I think they are likely obvious. First, “Top 50” lists by their design are a bit unfair to older, more established areas. For example, it will take years for someone in databases to make their way into the Top 50 of SIGMOD; for younger conferences such as NSDI and FAST, there is a significantly less steep slope to climb (though time will fix that problem). Second, clearly some conferences are harder to publish in than others, though it is hard to compare across fields; I do know that any SOSP/OSDI paper feels like a significant accomplishment, given the difficulty of the review cycle (often including 6 or 9, and in one case 12, reviews!). Third, Top 50 rankings are more unfair to people who are ranked (repeatedly) just below the cutoff (e.g., 51st) at a number of top conferences; perhaps some kind of proportional metric would make more sense. Finally, it is just bean counting; please only take it as seriously as the phrase indicates.]