The JWST Cycle 1 TAC was great, or randomize middle-ranked proposals, you cowards*

Cool Webb logo by the phenomenal Antonio Holguin, used with permission. Lazy overlay by me.

The JWST Cycle 1 Program was released today. The science is extraordinary, exciting, and broad. It befits the launch of a new Great Observatory, and the enormous discovery space that awaits.

I was honored to serve on a panel for the Cycle 1 Time Allocation Committee (TAC). We met (remotely, of course) the week of Feb. 22. A few thoughts:

* I don’t actually think anyone is a coward. ❤️

1. The TAC was exquisitely, honestly, and fairly run by a bunch of stone-cold pros.

Our friends at STScI are the best in the business. From the Science Policies Group to the Instrument Teams to our fabulous Panel Support Scientist and Leveler. Brilliant scientists and brilliant humans, trying their best to maximize the science return of the observatory while ensuring an equitable and fair process. I have nothing but glowing praise**. Seriously. Thank you, friends.

Our panel members were fabulous. FABULOUS. From the Chair on down, everyone took the process extremely seriously, and did a ton of work (for free!) to ensure the integrity of the process. It was obvious that everyone compiled pages of notes and binged a ton of literature for every proposal they were primary referee on. Dual Anonymity was sacred within the panel, as it should’ve been. We focused entirely on the science, as was our charge.

People inevitably and understandably complain when they read their TAC comments, particularly on unsuccessful proposals. I have many thoughts on the efficacy and use of such comments, but let me just say: every member on our panel really did respect every proposal before us. More on that below.

** This is true in general for me, including my past experience on HST, Chandra, ALMA, ESO, Gemini, and NASA review panels. I’ve genuinely never doubted the integrity of the process.

2. Dual Anonymity Works.

It just works. ¯\_(ツ)_/¯. ***

Of course, it cannot fully rid us of the fractal biases that creep into all levels of a review. But it goes a long way. NASA should be commended for implementing this across SMD, and STScI should be proud of how they executed it — they’re one of the early catalysts behind the change.


*** I got a few, totally fair complaints about this statement, which definitely looks like I’m making a scientific claim on a single point of (nice!) “anec-data”. I wasn’t trying to, but … that’s fair. We need many more cycles of data.

3. The fully remote panel? Totally fine.

One (of very few) positive outcomes of our garbage-bag pandemic year is that it forced a phase change in our thinking w.r.t. remote work. One year in, I think the astronomy community has demonstrated a deft and effective pivot, and shown that remote proposal reviews don’t compromise on integrity (scientific or otherwise).

The only thing I missed was grabbing dinner each night with old friends. The Colonnade is nice, but I don’t need to go back. Let’s save the money, carbon footprint, and time. I was once told that each in-person ALMA proposal review meeting cost more than one million euros (🥴🥴🥴). The in-person Chandra and HST TACs cost several hundred thousand dollars each year. Let’s disperse that money among the community via GO/AR, hire an FTE or two, etc. etc. Let’s not spend it on plane tickets and overpriced hotel food — doing so doesn’t deliver better science.

One major personal caveat: for two of the review days, we had no daycare for our two daughters. Our oldest got lots of Cocomelon those days, which, if you don’t know, is the show they play on repeat in hell.

Anyway, that’s a whole other can of worms beyond the control of STScI.

4. Effectively all of the proposals were excellent and worthy of time.

There were no bad proposals before our panel. None. I don’t know about other panels, but there were none in ours. Well done, community. Seriously.

Yes, some were more immediately compelling than others. Some didn’t convincingly describe what we’d learn / why Webb was needed / etc. etc. But they were all absolutely worth doing. I hope everyone whose proposal was before our panel resubmits in future cycles. Seriously. The discovery space for JWST is vast, and the community has a load of great ideas for how to explore it. I wish we could do it all with this Cycle. But let’s do it all over the course of the mission. Seriously, keep trying. Even if you were fifth quintile / triaged. Keep trying. This is because:

5. Even the best TACs have very low instrumental resolution.

Above, I’m pretty effusive (and sincere) in my praise of the JWST Cycle 1 process as executed by STScI. From my (admittedly limited) view, I don’t think the JWST TAC could’ve been obviously better. This is therefore absolutely not a criticism of STScI, or any one institution. With that said:

I remain utterly convinced that, at best, a review panel can reliably place three or four resolution elements across a proposal pool.

They’re probably pretty good at identifying the hyper exciting, must-do-now proposals (at least I hope so, lest we be totally useless). They’re (hopefully) decent at identifying those proposals that aren’t particularly well motivated / have important technical problems / etc. etc.

But the middle? The majority of proposals? All of which are really, really good and absolutely worth doing?

We’re a glorified random number generator. I’m sorry, but we are.

There is surely a gradient in “quality” (however defined) across these middle-ranked proposals, but I don’t think the TAC can reproducibly or consistently resolve it. I’m also convinced that it’s in this vast middle ground where both biases and raw stochasticity imprint their largest signal in the final ranking.

It comes down to how convincingly and strongly the primary referee argued for the proposal. It comes down to when your proposal was discussed. It comes down to the weather. It comes down to whether Panel Member X graded your proposal a 2.2 vs. a 2.3, tipping your average across the cutoff threshold.

I just looked at the (obviously private) final ranked list for our panel. From the five proposals above the cutoff line to the five proposals below it, the normalized mean score ranged from 2.25 to 2.72. The difference between the two proposals on either side of the cutoff line is 0.10.

This is inevitable, of course — you have to put a cutoff line somewhere! — but it’s also borderline arbitrary. Take 100 identical universes and run the same panel 100 times. Look me in the eye and tell me the ranked list for >50% of proposals is even close to the same each time.
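
If you want a feel for just how arbitrary, here’s a toy version of that 100-universes experiment. Every number in it is invented for illustration (the size of the pool, the grade scale, the per-run scatter); it is emphatically not JWST data. The punchline: for proposals sitting near the cutoff, which side of the line they land on is close to a coin flip.

    # Toy "100 identical universes" experiment. All numbers are made up.
    import numpy as np

    rng = np.random.default_rng(42)

    n_proposals = 80     # proposals before one hypothetical panel
    n_accepted  = 15     # slots above the cutoff line
    n_universes = 100    # independent re-runs of the "same" panel
    grade_noise = 0.25   # per-run scatter in the final grade (1 = best, 5 = worst)

    # Assume some underlying "true quality", tightly bunched in the middle:
    # exactly the regime the panel struggles to resolve.
    true_quality = rng.normal(loc=2.5, scale=0.3, size=n_proposals)

    accepted_counts = np.zeros(n_proposals)
    for _ in range(n_universes):
        observed = true_quality + rng.normal(0.0, grade_noise, size=n_proposals)
        ranked = np.argsort(observed)              # lower grade = better
        accepted_counts[ranked[:n_accepted]] += 1

    p_accept = accepted_counts / n_universes       # fraction of universes accepted
    order = np.argsort(true_quality)

    # The ten proposals straddling the cutoff in *true* quality:
    for i in order[n_accepted - 5 : n_accepted + 5]:
        print(f"true grade {true_quality[i]:.2f} -> above the line "
              f"in {100 * p_accept[i]:.0f}% of universes")

With a per-run scatter comparable to the spread in underlying quality, the proposals near the line should come out as rough coin flips. That is the whole point.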

6. The Bandwagon Effect is an unavoidable engine of stochasticity in a panel, and “well written” counts for much more than it should.

No matter how hard-working or well-intentioned the panel, the Bandwagon Effect always, always reigns. It goes like this:

  • There are vastly too many proposals before the panel for every member to be thoroughly familiar with each one. Remember, pre-grades are often due weeks before a panel actually meets. That pre-grade deadline is (probably) the last time a panel member actually reads a proposal they are not primary or secondary on. It’s easy to almost completely forget critical details of those proposals, unless you take thorough notes (which many / most do!).

  • For exactly that reason, the Primary Referee is the obvious authority on any given proposal. They’ve done the most thorough (re-)reading of it. They’ve done the most ancillary research.

  • A proposal’s chances therefore often live or die based on how the Primary Referee presents it to the panel. The secondaries play a big role too, of course. But if a Primary Referee trashes a proposal in their verbal presentation to the panel, and if avatars don’t rise with shining swords to defend it, the proposal will die. The ~4-5 people who — let’s be honest — can’t really remember that proposal will be strongly influenced by the opinions of the Primary.

So now what happens if your Primary Referee likes your proposal, but just doesn’t do a very good job of presenting it? What if it’s 5:30pm on day three, and your Referee is just tired? The proposal comes across as … meh. Capital F “Fine”. “Good”. Panel members agree in silence. Nobody says much. The chair says “any other comments?” a few times … nothing. Scores drift toward the middle. The central limit theorem of proposal fatigue. We regret to inform you.

The point is, panels of human beings cannot be expected to be unbiased arbiters of quality or promise. They’re for sure a lot better at it than some algorithm, but they’re guaranteed to be way less consistent than that algorithm.

There are two impossible plots I’d love to see: scores as a function of reviewer fatigue, and scores as a function of depth/breadth of reviewer discussion. You can’t make them, because the data isn’t / can’t be collected. But I bet that both effects imprint real signal on the final ranked list, signal that is almost entirely decoupled from proposal quality.
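
For what it’s worth, you can at least cartoon what that first plot might look like. Everything below is invented (how strongly fatigue acts, the grade scale, the size of the pool); it’s a sketch of the hypothesis, not evidence for it.

    # Cartoon of the "impossible plot": grades drifting toward the mean as the
    # panel gets tired. All parameters are invented; this is a toy, not data.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 60
    true_quality = rng.normal(2.5, 0.4, n)   # 1 = best, 5 = worst
    order = rng.permutation(n)               # discussion order, random w.r.t. quality
    fatigue = np.linspace(0.0, 1.0, n)       # 0 = fresh, 1 = 5:30pm on day three

    # The (invented) hypothesis: tired panels regress proposals toward "fine"
    # (2.5) and grade a bit more noisily.
    final = np.empty(n)
    final[order] = ((1 - 0.5 * fatigue) * true_quality[order]
                    + 0.5 * fatigue * 2.5
                    + rng.normal(0.0, 0.1 + 0.2 * fatigue))

    # The imprint: distance from "fine" shrinks with fatigue, independent of
    # how good the proposal actually is (expect a clearly negative correlation).
    print("corr(|final - 2.5|, fatigue) =",
          np.corrcoef(np.abs(final[order] - 2.5), fatigue)[0, 1])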

Now, some of these above effects are random, rather than systematic. I think “access to good luck” (and therefore bad luck) is evenly / fairly distributed across the proposal pool. My dumb proposal might be lucky enough to find a Primary who just returned from, I dunno … a great lunch? They give it a banger of a presentation before the panel, not because they liked my proposal, but because they liked the Pad Thai. Proposal discussion order is whatever the panel decides, and so is effectively random. The panel itself is random! Ever re-submitted your previously fourth quintile proposal with zero edits, only to watch it get time next cycle? Equitable access to luck.

But, random or not, these are snakes in the bingo cage. They inject unwanted entropy into a process designed to minimize it. A Great Observatory’s voyage of discovery shouldn’t be guided by a compass whose arrow is randomly jiggled by how bad the coffee was at the review panel, should it?

Also, bikeshedding is real. 🙃

And, much more dangerously, because this is systematic: “well-written” proposals just straight-up “come across better” or sound more coherent, leading to a “grammar contest” effect that badly disadvantages non-native speakers with brilliant ideas. Yes, it is incumbent on the proposer to clearly articulate their proposed experiment. But there’s a very fine line to navigate here, and we’re never more than a few steps away from that requirement becoming … an English contest.

From the dawn of time, human beings have (understandably!) loved stories. And a great story, i.e. a “well written” proposal, is not necessarily a great science idea. Conversely, an incandescently brilliant proposal can be encoded in “badly written” words. It’s hard. But we need to correct for it. We’re not supposed to be running a poetry contest, and yet poets do really well at astronomy reviews.

7. So? Let’s try an actual random number generator.

If a review panel is doomed to be little more than an over-caffeinated, biased random number generator, then why not just use an unbiased random number generator? **Thanos blushes**

I am hardly the first i̶d̶i̶o̶t̶ genius to suggest this, of course.

I’ve heard this argument from hyper senior people (including several on the ESO council!) for more than a decade. But I’m serious:

  1. Divide your proposals into topical bins, as usual, to ensure scientific breadth.

  2. Convene (remote!) panel(s) for each topical bin. Each member reads their assigned proposals and only assigns a pre-grade, just as before. Grade distributions per member are normalized prior to merging, just as before.

  3. A small subset of the lowest-ranked proposals is triaged, so long as their mean grades have low dispersion.

  4. A small-ish subset of the highest-ranked, low-dispersion proposals is assigned a tentative “guaranteed to get time” status.

  5. The panel meets once (for a few hours) via Zoom (or whatever), to discuss the triage list and take time to carefully consider potential resurrections. Perhaps more importantly, the panel also takes time to discuss their top-ranked proposals, and decide if indeed they want those proposals to absolutely reach the telescope.

  6. For the rest? For the majority?

    np.random.shuffle(proposals)

    😏

Let’s try it. And we need to try it on a major facility, i.e. an 8-meter or a Great Observatory. HST Cycle 32? Chandra Cycle 25? ESO Period 112?
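
To be concrete about steps 1 through 6, here’s a minimal sketch in code. The thresholds (the dispersion cut, the triage and auto-accept fractions, the 1-to-5 grade scale) and the structure of the proposal records are placeholders I made up; this is a cartoon of the scheme, not a proposed policy for any observatory.

    # Cartoon of the scheme above. Every threshold and field name is a placeholder.
    import numpy as np

    def select(proposals, total_time, rng=None):
        """proposals: list of dicts with 'mean_grade' (normalized, panel-merged,
        1 = best on a 1-5 scale), 'dispersion' (scatter among panelists), and
        'time' (requested hours). Returns the accepted programs."""
        rng = rng or np.random.default_rng()
        ranked = sorted(proposals, key=lambda p: p["mean_grade"])
        n = len(ranked)

        # Steps 3-4: triage the clearly-weakest, lock in the clearly-strongest,
        # but only where the panelists actually agreed (low dispersion).
        must_do = [p for p in ranked[: n // 10] if p["dispersion"] < 0.3]
        triaged = [p for p in ranked[-(n // 5):] if p["dispersion"] < 0.3]

        # Step 5: the short, remote panel meeting argues over exactly those two
        # lists (resurrections from triage, demotions from the must-do pile).

        # Step 6: everyone else goes into the lottery, drawn until time runs out.
        middle = [p for p in ranked if p not in must_do and p not in triaged]

        accepted = list(must_do)
        used = sum(p["time"] for p in must_do)
        for i in rng.permutation(len(middle)):   # the actual lottery
            p = middle[i]
            if used + p["time"] <= total_time:
                accepted.append(p)
                used += p["time"]
        return accepted

The dispersion cut is doing a lot of quiet work in that sketch: it’s what keeps a single enthusiastic (or grumpy) referee from single-handedly triaging, or locking in, a proposal.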

There are so many issues to work out first, of course:

  • A proposer would need only to make their proposal “good enough” to get into the randomized pool, i.e. above the triage line. This misaligns incentives in obvious and subtle ways: proposers may ask for way more time than they need / unnecessary modes / poorly selected and overly large samples, etc. It absolutely could compromise both the science and efficiency of the observatory. This would need to be carefully considered and mitigated, including with thorough technical reviews.

  • Spamming the TAC would be a totally viable strategy. That sucks.

  • There are probably great arguments for adding a quasi-complex weighting scheme to your randomizer algorithm, to address e.g. target conflicts, sample the time request distribution, ensure scheduling and scientific efficiency, etc. (a toy sketch of one such weighting follows this list). The “randomized” selection may in fact require so much finessing and human curation that … we inevitably end up with a panel of humans, ranking proposals. Time is a flat circle. We shall not cease from exploration.

  • Randomizing proposal selection might literally be against federal law in the US. Minor detail. ¯\_(ツ)_/¯
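
And since I poked at the weighting-scheme bullet above, here’s roughly what a gentler-than-uniform lottery might look like, reusing the same hypothetical proposal records as the earlier sketch. The exponents are pure invention; the only point is that you can tilt the odds without reconstructing a ranked list.

    # One possible "quasi-complex" weighting: a draw without replacement whose
    # odds gently favor the panel grade and gently penalize huge time requests.
    # The exponents (1.5 and 0.5) are made up.
    import numpy as np

    def weighted_lottery(middle, total_time, rng=None):
        rng = rng or np.random.default_rng()
        grades = np.array([p["mean_grade"] for p in middle])   # 1 = best, 5 = worst
        hours = np.array([p["time"] for p in middle])
        weights = (1.0 / grades) ** 1.5 * (1.0 / hours) ** 0.5

        accepted, used = [], 0.0
        pool = list(range(len(middle)))
        while pool:
            w = weights[pool] / weights[pool].sum()
            pick = int(rng.choice(pool, p=w))
            pool.remove(pick)
            if used + hours[pick] <= total_time:
                accepted.append(middle[pick])
                used += hours[pick]
        return accepted

Whether that kind of finessing buys you anything, or just quietly reconstructs a ranked list with extra steps, is exactly the worry above.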

But if you want to nearly eliminate the overt and subtle biases that infect reviews? If you want to (sometimes) enable high-risk, high-reward proposals to have a chance to nab that spectacular result?

If you want to enable incremental science that would inevitably be rejected from an oversubscribed panel because it’s “not exciting enough” or “just one galaxy”? The “incremental” science that is the engine of progress in our voyage of discovery?

np.random.shuffle(proposals)

Let’s try it.

8. I think the above basically applies to all review panels, including Grad applications / Fellowships / etc.

🧐



For context: I’ve written the above, with pride and pure-hearted sincerity, amid the flaming wreckage of our nine rejected JWST proposals (🥴). I stand by every word. I’d write the same thing had we had a 9/9 success rate.

Don’t @ me: The TAC membership is (appropriately) secret until the results are released, i.e. today. All of our names will be revealed by STScI in a newsletter article soon. I have explicit permission from STScI to post this today. I have no agenda other than to write my dumb thoughts down somewhere.

Feedback: Your thoughts are almost certainly smarter than mine. You can comment privately by emailing me at grant.tremblay@cfa.harvard.edu, reply publicly to this Twitter post, or scream into the boundless void (highly recommended).

Further reading: Nando Patat, a dear friend, personal hero, and glorious god of proposals at ESO, served on an important Time Allocation Working Group a few years ago (along with other great people who you’ll know). Their report raises a ton of really interesting points and is definitely worth a read. Some colleagues in the health sciences in New Zealand have actually already tried this randomization experiment, albeit with a pretty small sample. It’s an interesting read nonetheless. Thanks to Pradip Gatkine and Sarah Gallagher for pointing this out to me!

Finally: A lot of people are really confused by the post title. In principle it’s a Blood Meridian, or Evening Redness in the West / The Hobbit, or There and Back Again type of double title thing, but I get that it might be confusing. Too late now.

More finally: I’m fully aware that the randomization idea may be, um, bad. Wouldn’t be my first. What were you looking for, conviction? ¯\_(ツ)_/¯


Updates: 31 March, 9:32am EST: Typos and minor restructuring. Added two bullet points to section 7, and made a new, galaxy-brain Section 8.

31 March, 10:45pm EST: adding more caveats 🥺. Refusing to delete emojis.

2 April, 10:30am EST: adding a few cop-out weasel words about DAPR and the limited quantitative data available for such a new program. Also added two paragraphs about “equitable access to luck” in Section 6.

1 June, 2:00pm EST: STScI has released their newsletter about the Cycle 1 TAC, along with the names of all panelists. You can read it here.

Grant Tremblay

Dr. Grant Tremblay is an Astrophysicist at the Center for Astrophysics | Harvard & Smithsonian