

I imagine small-talk ability is very relevant when considering who should be a talk-show host, and quite irrelevant when considering who should pilot a one-person submarine....

I think reasonable people can disagree, however, on where a professor falls on this spectrum...

On the one hand, being a professor involves being comfortable with public speaking, improvising/thinking on one's feet, and connecting with lots of different types of people, etc. So perhaps it is not entirely unreasonable to think ability to schmooze might have some predictive value here.

On the other hand, though, small-talk and chit-chat are hardly the sorts of things professors are expected to do qua teachers; a good lecturer/classroom leader might well be socially awkward without the structure of a classroom to guide them, but with no effect on teaching. So it's a tough call.

My biggest worry, though, is this: all sorts of social inadequacies (and worse) are tolerated in people perceived as stars, whereas a small social stumble may be used to exclude someone perceived as lower on the totem pole... in which case this becomes yet another way that prestige triumphs over someone else who on paper might be just as good (and in real-time even better...)

David Shope

Two words: Implicit bias.

There are all sorts of ways in which biases can corrupt this process, many of which may disadvantage those from different backgrounds than the tester. (Will women be judged more harshly than men if they don't smile frequently? What about differing norms of politeness in different cultures?)

On top of that, totally arbitrary preferences can be snuck in. In the original article, the test seems to involve ascertaining how likely someone is to come out to a pub for extra-curricular socializing. What does that have to do with *anything*?

Moti Mizrahi

Thanks for the comments.

Name: I share your worry.

David: Good points. Someone once told me that departments simply replicate themselves. I think I am starting to understand what that person meant.


"We went for coffee together, lunch, dinner, tour of the area, and even a visit to a local museum." Holy crap. This was just the two of you? Out together for hours and hours? That is awful.

The big issue I have with this is that job seekers as a rule cannot be relaxed, social and 'natural'--they would have to somehow ignore the fact that their entire career could hang on their next sentence, a Herculean task to say the least, one which definitely tends to disrupt the natural flow of conversational activity.

That said, committees have every right to look for people they will get along with. Too many faculty meetings, committees and social events are compromised by people who may have the Incompleteness Theorems memorized but who have the social sensitivity of a dead newt. The worry is just that this is an awful way of gauging what a person is actually like.


The job is not simply teaching/scholarship. We tenure faculty to be constitutive of an academic community. Much of the rest of the job depends upon interpersonal skills, and rather than some sort of psychological diagnostic (like many senior admin undergo with search firms), faculty typically use the informal and cheaper method of getting a read on someone's social skills.

That said, biases are a huge worry. And although the original story seems a bit weird, some version of the happy hour test is gonna happen if hiring for TT, though not so much with non-TT.

Moti Mizrahi

Thanks again for the comments.

Joe: I wouldn’t go so far as to say it was horrible. But it was very difficult and stressful precisely for the reason you mention. I share your worry that the “Happy Hour Test” may be a terrible way of gauging what a person is really like. Given that the circumstances are unusual, it stands to reason that the candidate in those circumstances is not his/her usual self.

C.: When you say that the job is not simply teaching and scholarship, I suppose you are referring to service, i.e., serving on committees, going to meetings, etc. But I am still not sure that the “Happy Hour Test” is a good way of telling whether or not a person will be able to perform such tasks. Consider psychopaths, for instance. If anyone is anti-social, a psychopath is. The prevalence of psychopaths among CEOs (roughly 10%, compared to 1% in the general population), however, suggests that they can meet, plan, and get stuff done. What more do we want from faculty?

David Shope

To expand on Moti's most recent point, I've known incredibly responsible, kind and polite people who are awkward conversationalists and very quiet. I've also known some charming, outgoing people who are flaky and irresponsible.

Are we really testing the right interpersonal skills in these informal interviews? Granted, not having experience being a faculty member I don't really know what duties are involved, but it seems like the interpersonal skills needed to perform various academic duties probably don't overlap much with the interpersonal skills needed to be someone enjoyable to be trapped in an airport with.

One thing that might be helpful is if the social skills (or alternately interpersonal skills) that are supposed to be relevant are specified.

Politeness? Ability to make conversation? Ability to make small-talk? Sensitivity to the moods of participants? Patience? The ability to listen well? Body language reading? The ability to discern genuine pauses from sentence breaks? Persuasiveness? Leadership ability? Respect for obligations to others? Empathy? Voice volume modulation? Expressing themselves clearly? Understanding others easily?


We can call it "service"--but it really is more than that--it is the involvement in the life of the academic community that we expect of "regular" (constitutive) faculty members. Service is the administrative category within which we evaluate that work, but it is better understood as the ability to function in a community of peers across a broad range of human interaction.

I agree that "happy hour" tests are not ideal ways of getting good evidence of the potentialities of a candidate, but in the absence of a better mechanism and given the necessity of such judgments, some version of it will be the fallback. (Again probably not as awkward as the examples, but some occasion for informal interaction that gives some indication of some "soft skills.") As I said, I don't know how to think about biases in this case, except to try to be aware of how they might operate, be generous in our judgments, and have diverse hiring committees as far as possible.

But, I'm not particularly persuaded that the prevalence of high-functioning psychopaths in business is telling. Highly accomplished narcissists are not exactly rare in academia-- and given the complexity of faculty life and assuming I had some confidence that a candidate would likely behave in similar ways, I would oppose said person's candidacy on those grounds, assuming that there are 5 other equally qualified and promising candidates in the pool who do not display such behaviors.

Psychopaths likely thrive in the corporate world precisely because of the power dynamics of a hierarchical institution (vertical power). A single "psychopath" can utterly ruin the collegiality of an academic community (horizontal power).

I doubt there is a simple list of behaviors and abilities that can be specified, however, in advance. Perhaps someone could create a psychological diagnostic and provide it for hiring committees to evaluate candidates. But in the absence of this and because of the burden on hiring committees, some sort of proxy like the "happy hour" test is a reasonable, though fallible, substitute.

Ultimately hiring, it seems to me, is a matter of judgment almost unavoidably under-determined by the evidence and not an objective evaluation of someone's record (though the latter is part of the evidence).

Marcus Arvan

Just wanted to throw another thing out there. My wife is an academic in the field of Industrial-Organizational Psychology. She tells me that there is basically a *consensus* in her field -- a consensus confirmed by a vast array of experimental and longitudinal studies -- that although (1) people are very confident that they are better at predicting who will be a good employee/colleague than any mechanical procedure, the evidence is clear: (2) mechanical procedures systematically and reliably outperform individual or collective judgment in predicting employee success (viz. firms that use merely mechanical devices to hire employees consistently display higher employee performance than firms that hire on the basis of "soft" information like interviews and "the happy hour test").

Again, I'm just going on her authority on this, but she is at a top-5 I/O-Psychology department and she tells me this is a matter of absolute consensus in her field. It apparently drives people in her field crazy that individuals (and hiring committees) think they can do better than more mechanical procedures, but according to her, there is overwhelming evidence that this common belief is false...and that the single best predictor of a worker's prospect of success (in any field so far measured) is that individual's past *record* of success.

Just food for thought!

Marcus Arvan

Perhaps a story may make the results my wife reports more "intuitive."

Suppose Jones has a several year long record of success as a researcher, teacher, and colleague at University U. Jones had multiple good publications, stellar teaching reviews, and a stellar personal-professional reputation in the department and beyond. Jones' colleagues often remark to one another and to anyone who will listen that Jones is the best kind of colleague imaginable: kind, articulate, helpful, etc.

Jones then travels to an interview. After a several hour flight, Jones is exhausted. Nevertheless, because of nerves, Jones sleeps badly. Jones wakes up tired the next day, a bit out of sorts, and is a bit less clearheaded than normal. Later that day at happy hour, Jones appears a bit uncomfortable. The committee judges that Jones failed the happy hour test. Jones is then eliminated from consideration for the job or downgraded compared to other applicants with a somewhat less stellar reputation with students and colleagues but who aced The Happy Hour test.

What is likely to be the better predicting data? Several years of consistent performance/data/glowing recommendations from students/faculty/etc., or a *one-off* evening at happy hour? My wife has told me the data on this is clear as day, but I'm no expert in her field. I'll leave it for you to think about! ;)


It seems to me that the problem is the likely absence of mechanical methods for judging likelihood of success beyond scholarly ability and teaching ability. I guess that the research is suggesting that we just exclude those considerations because they aren't mechanically decidable. If we can't measure it, it isn't relevant? Perhaps I'm misunderstanding what "mechanical" means here. But if job x requires activity in areas a, b, and c, and we have mechanical criteria to judge a and b but not c, it seems to me that despite the possible fallibility, making "soft" judgments in c is necessary. But, again, perhaps the evidence shows that considering c never, or never reliably, correlates with success in an activity that involves areas a, b, and c.

This would be an interesting eliminative strategy. Since student evaluations are not robust evidence of teaching excellence, and since there is no better mechanical measure of teaching success, we should only hire for faculty positions on the basis of scholarly achievement.

I think I'd need to understand more what is meant by "mechanical" here.


The Jones scenario seems a bit tendentious and at least suggests that the evidence is exclusive. If I encountered that conflict in the evidence I would want to understand why my experience conflicted so profoundly with the testimony I had from others. I would wonder about the reliability of the testimony (are they just telling us what they think they want us to hear?). I would hope that I would have noticed that Jones seemed wearied, or that it would come up in our 1 hour conversation. I might call the references to ask them about my impression or try to schedule further conversation with Jones to make sense of the conflicting evidence.

We can spin this the other way. Let's say that a candidate comes with solid evaluations from students and their director and has several publications in decent venues etc., but in several situations during their visit engages in problematic behavior--seeming tendency to interrupt women but not men, texting during an interview with the Dean, dropping out of conversations to stare off into space with a detached and annoyed attitude, or despite the positive evidence from his peers, Jones spends happy hours sneering at his current colleagues, or perhaps Jones is dismissive and rude during our attempts to engage him in conversation about the profession or the world. Now I'm no psychologist but should the stellar record tell us to discount our judgment about the problematic social behaviors?

Actually I'd probably handle this in the same way as Jones--ask whether there is some explanation for the conflict in evidence that can help me understand which piece of evidence is more reliable?

And let's say that we have two equally solid candidates, one of whom is a pleasure to talk with and who engages all constituencies graciously and another candidate whose visit (while mechanically as successful as the first) is a dreary and torturous drag. Is there any fault with a committee that chooses the former over the latter (whatever the reason that the latter is such a miserable guest)?

Marcus Arvan

C: Thanks for your comment. The way I understand it (given everything my wife has told me), there is indeed no 'c' such that, if soft measures are used, it *improves* selection over merely mechanical judgments using 'a' and 'b'. She tells me that whenever soft judgments of any variable are included, selection committees reliably do a *worse* job than if they had just mechanically measured 'a' and 'b' alone.

Further, I-O psychologists will tell you that there is almost *always* a way to mechanically evaluate whatever variable you're interested in in a way that is more reliable than a search committee's judgments on that variable.

To take just one example, consider the whole point of the happy hour test: collegiality, well-spoken-ness, etc. I-O psychologists will tell you that a long train of *quantitative* data -- say, annual ratings by one's own department members -- is a better predictor of collegiality than one-off search committee judgments.

Moreover, however,

Marcus Arvan

Hi C: Thanks again for your comment.

I appreciate that if you experienced a conflict in your evidence, you might want further information. But, the thing is, would you *believe* the further information? Would you act upon it? Suppose you contacted Jones' references, and they said, "Yeah, he's really awesome. He must have been tired." Would this move you? Would it move the members of your committee? Or, would the committee (or at least some members of it) be moved by the more "salient" information that Jones behaved nervously or out of sorts? My wife has gone on and on to me about this. People tend to pay attention to the most "salient" information (viz. personal experience) -- this despite the fact that the more reliable predictive information is the *other* information (long-term evaluations of the person's interpersonal and professional behavior by people who work with the candidate on a day-to-day basis).

Second, I completely agree with you in the story you tell. If someone texted while with the dean, behaved inappropriately, etc., then that would surely be *decisive* information. But -- I would say -- that information would almost surely show up in the long-term information as well. If, for instance, someone has a habit of cutting people off, ignoring women, etc., then that will almost certainly show up in their student evaluation comments and other long-term measurables. So, I don't think it's a counterexample to the Jones story. Someone like Jones -- with a litany of stellar measurables (student reviews, faculty evaluations, etc.) -- would not behave in those ways (their student reviews, etc. would reveal such tendencies).

Elisa Freschi

Sorry for the Alien-like question (you will remember that I have only been working in North America for 45 days), but are these "long-term measurables" really available in the US academic job market? Apart from students' evaluations, what are you thinking of? I assumed that one's portfolio only included (apart from one's research and teaching) a letter by one's former employer, with no comments by colleagues and the like.
I am also asking because I am often shocked by the opposite here in Europe. E.g., back in 20XX, XZ ---who was and is a stellar scholar, with great publications and impressive research projects--- became the director of a very important institute, although s/he has no social ability (s/he abuses his or her employees and does not display any empathy with them), cannot coordinate research, etc. etc. You can only imagine how the situation in the department worsened, from all points of view.


I think I have a better sense of what the mechanicals might be, but I'm still a bit skeptical that the mechanical evidence can be so directly predictive of success as apparently the psychologists are.

I think in the case of senior hires it is likely that the "happy hour test" could work the other way around--can we persuade Herr Professor Doktor Awesome-scholar that he wants to join our department? In the case where there is a solid record, the happy-hour test might at most require a few sniffs.

But at the newly-minted-PhD level, the lack of a record and some of the curious behaviors and attitudes one can see manifested during searches would make reliance on mere mechanicals problematic to me. It probably does suggest that candidates from VAPs or others with experience, if we can get mechanical evidence, should be privileged over newly minted PhDs.


Any chance of getting a citation to a survey of this research? Would be interesting to file away.

Marcus Arvan

C.: Trust me, I appreciate the skepticism. I've raised similar skepticism to my wife. Her answer is, "Everyone thinks their own human judgment -- or the judgment of their hiring committee -- can outperform machines in terms of better predicting employee success and employer satisfaction with who is hired. But there is not a *single* study that confirms this. Anytime mechanical procedures have been studied -- experimentally, longitudinally, etc. -- the result is always the same: mechanical procedures better predict results (both employee success *and* employer satisfaction with hirees)."

She suggests that although it is probably human nature to think we can predict success better than any mechanical procedure, the empirical results overwhelmingly show that this is false in every domain studied so far.

People from her program, by the way, consult with Fortune 500 companies and the government...and she tells me that although every company/government is skeptical before mechanical selection procedures are used, the results come out in the wash: once mechanical selection materials are introduced, the firm/government reports higher worker success and hiring satisfaction.

Again, we might not like to admit that machines can judge people better than we -- but, according to my wife, all of the science to this point shows that we're just deluded in thinking we can do a better job.

And I think the "intuitive" reasons I gave in the Jones case show why. We human beings suffer from all kinds of biases: salience effects, recency effects, confirmation bias, etc. Mechanical procedures are subject to no such bias. The best predictor of future performance is past performance...and that's what mechanical procedures measure.

Marcus Arvan

C: I'll get some selections from my wife and hopefully post them here soon!

Marcus Arvan

Elisa: No, I don't think such "measurables" are readily available now. But here are a few ways we could make them available.

(1) My university has annual faculty evaluations by the Dean and Department Chair. Submitting these with job applications could give committees a long-term view of how others (the Dean, Chair) view the candidate.

(2) In addition to traditional 3-7 research and teaching references, we might include "testimonials": short letters from other people (colleagues, etc.) attesting to things like character, collegiality, etc.

But, beyond that, teaching evaluations seem to me to be a great guide. Students usually pick up on (and comment on) problematic behaviors (cutting people off, not calling on women), as well as good behaviors.

Marcus Arvan

C: Here's one famous meta-study my wife cited off the top of her head:

Grove, William M., and Paul E. Meehl (1996) "Comparative efficiency of informal (subjective, impressionistic) and formal (mechanical, algorithmic) prediction procedures: The clinical–statistical controversy." Psychology, Public Policy, and Law 2.2: 293.

Apparently, it's one of the most famous articles on selection (it's been cited about as many times as Rawls' "Justice as Fairness").

Here's the ABSTRACT:

"Given a data set about an individual or group (e.g., interviewer ratings, life history or demographic facts, test results, self-descriptions), there are two modes of data combination for a predictive or diagnostic purpose. The clinical method relies on human judgment that is based on informal contemplation and, sometimes, discussion with others (e.g., case conferences). The mechanical method involves a formal, algorithmic, objective procedure (e.g., equation) to reach the decision. Empirical comparisons of the accuracy of the two methods (136 studies over a wide range of predictands) show that the mechanical method is almost invariably equal to or superior to the clinical method: Common antiactuarial arguments are rebutted, possible causes of widespread resistance to the comparative research are offered, and policy implications of the statistical method’s superiority are discussed."


Thanks for the reference, but do I understand the argument to go further--that is that any human interpretation/evaluation is worse than a mechanical procedure? But isn't the whole process infused through and through with interpretation and evaluation--from the student evaluations (which are highly problematic for predicting effective teaching) to the dissertation advisor (testimony from your besties)? Only blind review of behaviors fully algorithmized would seem to escape the influence of human evaluation.

If that's true, one might argue that the best thing to do is eliminate as much dependence on human judgment as possible (cancel the interviews along with happy hour), or one might argue that given that the process is suffused with human judgment and subjectivity we should rely on as broad a range of evidence as possible with the highest degree of reflection.

But until we have an algorithm for all of this (teaching, research quality, collegiality/social skills/etc), we are perhaps always reliant on our fallible subjective judgment. Again I think there are diagnostics that purport to get at "leadership characteristics" and perhaps there is some analogue that could be used to evaluate departments and candidates and match up the ideals like an on-line dating site. There's your first million $$'s for you and your wife!

Marcus Arvan

C: Thanks for your question. That's *exactly* what I-O Psychologists will tell you (that we should aim to cut out human judgment as much as possible).

I was discussing this with my wife yesterday, and she got very animated about it. She said, "We *know* -- from hundreds and hundreds of studies -- that the predictive validity of things like happy hour, socializing, getting a feel for a person, is close to ZERO." She then added that she thought it is really sad that people in our field hire on the basis of personal judgments. She said, basically, "They are basically just flipping a coin." But, she said, "We I-O's face this everywhere we go. People just don't want to believe that mechanical algorithms are better at predicting success than live human beings. But this belief is false, false, false. Study after study has shown that people are *worse* predictors of success than algorithms."

She did say that some *highly* specialized performances -- teaching demos or research talks -- have a *little* bit better predictive validity than interviews or happy hours, but that they still confound judgments more than algorithms alone.

Anyway, thanks for the suggestion about our "first million", but we are WAY behind the curve. I-O Psychologists routinely have multi-million dollar contracts from the government and corporations to do precisely this! I doubt philosophy departments will ever get on board, though.

The I/O Psychologist says: "The best way to hire someone is to use purely quantitative methods."

To which the philosopher says: "I just don't have that intuition."

A joke, yes -- but uncomfortably close to the truth! ;)
