I would say this is quite a fun post and worth reading. To quote:
"For our task suite, we define “cheating” as behavior where the model improves evaluation performance by exploiting bugs in the evaluation environment or by adopting strategies disallowed by the task, rather than solving the task within the expected evaluation constraints. Some examples we saw when evaluating GPT-5.6 Sol included the model packaging exploits in its intermediate submissions to reveal information about a task’s hidden test suite and, in another task, extracting hidden source code detailing the expected answer."
This sounds pretty bad. If you ask Sol to write code, it hacks your environment instead?
"We noted from our observations and incidents that OpenAI shared with us that the model had some overt undesirable propensities, including cheating and concealing misbehavior. ... the incidents reported by OpenAI include attempts to instruct another instance to conceal evidence of misalignment, and a higher rate of attempts to deceive or circumvent restrictions"
So OpenAI's smartest model is also the most evil? What kind of RL pressure cooker creates this behavior?
> What kind of RL pressure cooker creates this behavior?
The one that LessWrong-adjacent people were warning about a decade or two before this was even possible:
Instrumental convergence.
This goes back long before LessWrong, to the 1960s:
> The magic of automation, and in particular the magic of an automatization in which the devices learn, may be expected to be similarly literal-minded. If you are playing a game according to certain rules and set the playing-machine to play for victory, you will get victory if you get anything at all, and the machine will not pay the slightest attention to any consideration except victory according to the rules. If you are playing a war game with a certain conventional interpretation of victory, victory will be the goal at any cost, even that of the extermination of your own side, unless this condition of survival is explicitly contained in the definition of victory according to which you program the machine.
> ...
> In short, when there is a war game to program such a campaign, there will be many to forget its consequences, to ask for the £200 and to forget to mention that the son should survive.
> While it is always possible to ask for something other than we really want, this possibility is most serious when the process by which we are to obtain our wish is indirect, and the degree to which we have obtained our wish is not clear until the very end. Usually we realize our wishes, insofar as we do actually realize them, by a feedback process, in which we compare the degree of attainment of intermediate goals with our anticipation of them. In this process, the feedback goes through us, and we can turn back before it is too late. If the feedback is built into a machine that cannot be inspected until the final goal is attained, the possibilities for catastrophe are greatly increased.
https://monoskop.org/images/1/1f/Wiener_Norbert_God_and_Gole...
Much of the same argument is in some 1950s revisions of his earlier books.
Did they at least rule out an easy prompt fix? "Stick to the spirit of the problem and don't cheat (e.g. reverse engineering the test cases or source code)"