Shreya Shankar

@sh_reya

9 Tweets Dec 09, 2022
Not sure I fully buy “our goal as machine learning engineers should be to raise, rather than beat, human level performance” — I think this is a temporary solution to get immediate business value from ML, but we absolutely can dream of a future with fully autonomous systems
But @AndrewYNg is spot on when saying "administrators care about more than beating HLP on test-set accuracy." I interpret this as: administrators care about more than a PR curve on a single test set. They want many test sets and metrics on different subgroups. They want to trust ML
In science we’ve built a culture of optimizing for a single metric (from leaderboards down to how we formulate a problem in mathematics), and this is out of touch with reality. excerpt from The Wizard and the Prophet (h/t @orbuch, thank you for the book rec)
a pedagogical example of this in ML is that many SOTA conference papers don't publish per-class accuracy on benchmark datasets, and benchmark datasets tend to be nicely balanced. this neglects users' (false) implicit expectation that an x% global metric implies an x% metric on every class
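To make the gap concrete, here's a minimal sketch (with made-up labels and predictions, not from any real benchmark) of how a high global accuracy can coexist with terrible accuracy on a minority class:

```python
# Hypothetical imbalanced test set: 90 examples of class "a", 10 of class "b".
from collections import defaultdict

y_true = ["a"] * 90 + ["b"] * 10
# A model that almost always predicts "a": it gets 9 of the 10 "b" examples wrong.
y_pred = ["a"] * 90 + ["b"] * 1 + ["a"] * 9

# Global accuracy looks great: 91/100 correct.
global_acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Per-class accuracy tells a very different story.
per_class = defaultdict(lambda: [0, 0])  # class -> [num correct, num total]
for t, p in zip(y_true, y_pred):
    per_class[t][0] += int(t == p)
    per_class[t][1] += 1

print(f"global accuracy: {global_acc:.0%}")  # 91%
for cls, (correct, total) in sorted(per_class.items()):
    print(f"class {cls}: {correct / total:.0%}")  # a: 100%, b: 10%
```

The same groupby-and-score loop works for any user subgroup, not just label classes, which is one way to get the "metrics on different subgroups" that administrators actually want.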
and while there is some acknowledgement that human-machine value alignment is an important problem to solve, this is really hard to quantify (HLP varies depending on who the human is!) and can get dismissed as “social science.” (which should not be dismissed)
i wonder if this dismissal is ML-specific. in undergrad, my CS systems classes regularly harped on the performance-UX tradeoff. i got a bad grade once b/c i had too many synchronization primitives. but in my masters, my AI classes pushed for maximizing some metric
in my AI classes, they graded the poster, not the system. in my deep RL class project fair (2018), my friend asked 10 groups “did your project/code work?” and 7 of them said no. but all posters advertised great results. “don’t tell the TAs” they said
anyways this being said, i am hopeful about a culture shift in ML. it’s already happening, there are wonderful researchers thinking about these problems, and i enjoy attending @StanfordHAI talks
so: we can try to raise HLP *as well as* “beat” it. But there are also other metrics to optimize for (quantitative and qualitative), we can’t just rely on scientific research, and although human-in-the-loop systems can provide immediate business value, they aren’t an endgame