Good post!
> Many think this benchmark should eventually go superexponential, since once AI learns the general planning and error-correction skills needed to complete multiweek tasks, it should be able to complete multimonth ones too.
I've updated against this position over the past 6 months. Wouldn't it naively imply that if a junior software developer gets good at completing multiweek tasks (e.g. submitting a PR for a major feature with minimal supervision), they would automatically be roughly as good as senior developers at multimonth/multiyear tasks, like planning and guiding the development of a 1M-LoC project? But that is clearly not how it works.
Yes I thought this post was good on this point: https://secondthoughts.ai/p/a-project-is-not-a-bundle-of-tasks
Another argument people bring up is that eventually AI will be able to do tasks that no human can do, so the METR horizon must eventually go infinite. (Which seems true, but it doesn't tell us when that will happen, and it's arguably more a fact about our measurements than about a true increase in skill.)
> Epoch hasn’t released an official score, but external parties believe Mythos is on trend on this index. […] Though there’s a complication. Anthropic has their own version of ECI, using a probably larger set of internal benchmarks. On the version in the Opus 4.7 system card, Mythos appears to be about 6 months of progress in only 2.2
I think there’s a misunderstanding here. Both of those use the same underlying data: the AECI datapoints from Anthropic. The ECI chart from Ramez Naam simply attempts to convert AECI values to ECI, so that we can contextualize Mythos against models from OpenAI and Google instead of just Anthropic models.
Ah OK good point, I'll clarify that.