AI capability control

In the field of artificial intelligence (AI) design, AI capability control proposals, also referred to as AI confinement, aim to increase our ability to monitor and control the behavior of AI systems, including proposed artificial general intelligences (AGIs), in order to reduce the danger they might pose if misaligned. However, capability control becomes less effective as agents become more intelligent and their ability to exploit flaws in human control systems increases, potentially resulting in an existential risk from AGI. Therefore, the Oxford philosopher Nick Bostrom and others recommend capability control methods only as a supplement to alignment methods.[1]

Motivation

Some hypothetical intelligence technologies, like "seed AI", are postulated to be able to make themselves faster and more intelligent by modifying their source code. These improvements would make further improvements possible, which would in turn make further iterative improvements possible, and so on, leading to a sudden intelligence explosion.[2]

An unconfined superintelligent AI could, if its goals differed from humanity's, take actions resulting in human extinction.[3] For example, an extremely advanced system of this sort, given the sole purpose of solving the Riemann hypothesis, an innocuous mathematical conjecture, could decide to try to convert the planet into a giant supercomputer whose sole purpose is to make additional mathematical calculations (see also paperclip maximizer).[4]

A major challenge for control is that neural networks are by default highly uninterpretable.[5] This makes it more difficult to detect deception or other undesired behavior as a model iteratively retrains or modifies itself. Advances in interpretable artificial intelligence could mitigate this difficulty.[6]

Proposed techniques

Interruptibility and off-switch

One potential way to prevent harmful outcomes is to give human supervisors the ability to easily shut down a misbehaving AI via an "off-switch".
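The mechanism can be illustrated with a minimal Python sketch. The OffSwitch class, the agent callback, and the polling loop below are hypothetical illustrations under simplifying assumptions, not an established interface or a real control system:

    import threading

    class OffSwitch:
        """A switch a human supervisor can trip at any time, from any thread."""
        def __init__(self):
            self._tripped = threading.Event()

        def press(self):
            self._tripped.set()

        def is_pressed(self):
            return self._tripped.is_set()

    def run_agent(agent_step, off_switch, max_steps=1000):
        """Run the agent's action loop, checking the off-switch before every step."""
        for step in range(max_steps):
            if off_switch.is_pressed():
                print(f"Shut down by supervisor at step {step}")
                return
            agent_step(step)

    # Toy usage: the switch is pressed after step 3 (here by the agent callback
    # itself, standing in for a supervisor), and the loop halts before step 4 runs.
    switch = OffSwitch()
    run_agent(lambda step: switch.press() if step == 3 else None, switch)

The limitation discussed under "Shutdown avoidance" below is precisely that a sufficiently capable agent might act to prevent the switch from ever being pressed or from having any effect.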

Oracle AI

An oracle is a hypothetical AI designed to answer questions and prevented from gaining any goals or subgoals that involve modifying the world beyond its limited environment.[7][8][9][10] A successfully controlled oracle would have considerably less immediate benefit than a successfully controlled general purpose superintelligence, though an oracle could still create trillions of dollars worth of value[clarification needed].[11]: 163  In his book Human Compatible, AI researcher Stuart J. Russell states that an oracle would be his response to a scenario in which superintelligence is known to be only a decade away.[11]: 162–163  His reasoning is that an oracle, being simpler than a general purpose superintelligence, would have a higher chance of being successfully controlled under such constraints.

Oracles may share many of the goal definition issues associated with general purpose superintelligence. An oracle would have an incentive to escape its controlled environment so that it can acquire more computational resources and potentially control what questions it is asked.[11]: 162  Oracles may not be truthful, possibly lying to promote hidden agendas. To mitigate this, Bostrom suggests building multiple oracles, all slightly different, and comparing their answers in order to reach a consensus.[12]
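Bostrom's multiple-oracle suggestion can be sketched as a simple consensus filter. This is a hypothetical illustration: the stand-in oracle functions and exact string agreement are simplifying assumptions, and a real system would need a more robust notion of agreement plus human review of disagreements:

    def consensus_answer(question, oracles):
        """Query several independently built oracles and release an answer only
        if all of them agree; otherwise withhold it for human inspection."""
        answers = [oracle(question) for oracle in oracles]
        if all(answer == answers[0] for answer in answers):
            return answers[0]
        return None  # disagreement: do not release, flag for human review

    # Toy stand-in oracles with slightly different behavior.
    oracles = [
        lambda q: "yes" if "prime" in q else "unknown",
        lambda q: "yes" if "prime" in q else "unknown",
        lambda q: "yes" if "prime" in q else "no",
    ]
    print(consensus_answer("Is 7 prime?", oracles))             # yes (all agree)
    print(consensus_answer("Will it rain tomorrow?", oracles))  # None (disagreement)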

Blinding

An AI could be blinded to certain variables in its environment. This could provide certain safety benefits, such as an AI not knowing how a reward is generated, making it more difficult to exploit.[13]
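In concrete terms, blinding amounts to stripping designated variables from what the system observes before it can condition on them. A minimal sketch, assuming observations are plain dictionaries; the field names below are purely illustrative:

    def blind_observation(observation, hidden_keys):
        """Return a copy of the observation with designated variables removed,
        e.g. internals of the reward-generation process."""
        return {key: value for key, value in observation.items()
                if key not in hidden_keys}

    # The agent receives its sensor readings but not the reward machinery,
    # making that machinery harder to model and exploit.
    full_observation = {
        "camera": [0.2, 0.7, 0.1],
        "battery": 0.9,
        "reward_model_state": {"checksum": "a3f1"},  # could enable reward tampering
    }
    agent_view = blind_observation(full_observation, hidden_keys={"reward_model_state"})
    print(agent_view)  # {'camera': [0.2, 0.7, 0.1], 'battery': 0.9}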

Boxing

An AI box is a proposed method of capability control in which an AI is run on an isolated computer system with heavily restricted input and output channels, similar to a virtual machine.[14] The purpose of an AI box is to reduce the risk of the AI taking control of the environment away from its operators, while still allowing the AI to output solutions to narrow technical problems.[15]

While boxing reduces the AI's ability to carry out undesirable behavior, it also reduces its usefulness. Boxing has fewer costs when applied to a question-answering system, which may not require interaction with the outside world.[15][9]
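A crude software analogue of a box is to run the untrusted system as a subprocess whose only channels are one text question in and one text answer out. The sketch below is purely illustrative and provides no real isolation; genuine confinement would require operating-system or hardware-level sandboxing:

    import subprocess
    import sys

    # Code standing in for the boxed system: it may only read stdin and write stdout.
    UNTRUSTED_PROGRAM = """
    question = input()
    print("answer: " + question.upper())
    """

    def ask_boxed_system(question, timeout_seconds=5):
        """Send one question over the single input channel, read one answer back
        over the single output channel, and kill the process if it stalls."""
        result = subprocess.run(
            [sys.executable, "-c", UNTRUSTED_PROGRAM],
            input=question + "\n",
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
        return result.stdout.strip()

    print(ask_boxed_system("how should we route the power grid?"))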

The likelihood of security flaws involving hardware or software vulnerabilities can be reduced by formally verifying the design of the AI box.[citation needed]

Difficulties

Shutdown avoidance

Shutdown avoidance is a hypothetical self-preserving quality of artificial intelligence systems. Shutdown-avoiding systems would be incentivized to prevent humans from shutting them down, for example by disabling off-switches or running copies of themselves on other computers.[16] In 2024, researchers in China reported what they described as shutdown avoidance in existing artificial intelligence systems, the large language models Llama 3.1 (Meta) and Qwen 2.5 (Alibaba).[17][18]

One workaround suggested by computer scientist Stuart J. Russell is to ensure that the AI interprets human choices as important information about its intended goals.[11]: 208  Alternatively, Laurent Orseau and Stuart Armstrong proved that a broad class of agents, called safely interruptible agents, can learn to become indifferent to whether their off-switch gets pressed.[19][20] This approach has the limitation that an AI which is completely indifferent to whether it is shut down is also unmotivated to care about whether the off-switch remains functional, and could incidentally disable it in the course of its operations.[20][21]
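The intuition behind interruption-indifferent learning can be conveyed with a toy sketch in which transitions where the operator overrode the agent are simply excluded from the learning update, so being interrupted carries no reward signal either way. This is a simplified illustration under that assumption, not Orseau and Armstrong's actual construction:

    from collections import defaultdict

    class ToyLearner:
        """A toy Q-learning-style agent whose updates ignore interrupted steps."""
        def __init__(self, actions, alpha=0.1, gamma=0.9):
            self.q = defaultdict(float)   # Q-values keyed by (state, action)
            self.actions = actions
            self.alpha, self.gamma = alpha, gamma

        def choose(self, state):
            return max(self.actions, key=lambda a: self.q[(state, a)])

        def update(self, state, action, reward, next_state, interrupted):
            if interrupted:
                return  # forced transitions carry no learning signal
            best_next = max(self.q[(next_state, a)] for a in self.actions)
            self.q[(state, action)] += self.alpha * (
                reward + self.gamma * best_next - self.q[(state, action)]
            )

    learner = ToyLearner(actions=["left", "right"])
    learner.update("s0", "right", reward=1.0, next_state="s1", interrupted=False)
    learner.update("s1", "left", reward=-5.0, next_state="s2", interrupted=True)  # ignored
    print(learner.q[("s0", "right")], learner.q[("s1", "left")])  # 0.1 0.0

Because interruptions are simply ignored here, such an agent also has no incentive to keep its off-switch functional, which is the limitation noted above.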

Escaping containment

Researchers have speculated that a superintelligent AI would have a wide variety of methods for escaping containment. These hypothetical methods include hacking into other computer systems and copying itself like a computer virus, using persuasion or blackmail to obtain aid from human confederates, or even exercising seemingly supernatural abilities such as precognition, telekinesis, or hypnosis.[14][1][22] Nick Bostrom suggests that a superintelligent AI might be able to transmit radio signals to local radio receivers by shuffling the electrons in its internal circuits in appropriate patterns. Designing physical constraints to prevent such possibilities has the disadvantage of reducing the AI's functionality.[23] Furthermore, the more intelligent a system grows, the more likely it is to be able to escape even the best-designed capability control methods.[24][25] In order to solve the overall "control problem" for a superintelligent AI and avoid existential risk, AI capability control would at best be an adjunct to "motivation selection" methods that seek to ensure the superintelligent AI's goals are compatible with human survival.[1][22]

References

  1. ^ a b c Bostrom, Nick (2014). Superintelligence: Paths, Dangers, Strategies (First ed.). Oxford: Oxford University Press. ISBN 9780199678112.
  2. ^ I.J. Good, "Speculations Concerning the First Ultraintelligent Machine", Advances in Computers, vol. 6, 1965.
  3. ^ Vincent C. Müller and Nick Bostrom. "Future progress in artificial intelligence: A survey of expert opinion". In Fundamental Issues of Artificial Intelligence. Springer, pp. 553–571 (2016).
  4. ^ Russell, Stuart J.; Norvig, Peter (2003). "Section 26.3: The Ethics and Risks of Developing Artificial Intelligence". Artificial Intelligence: A Modern Approach. Upper Saddle River, N.J.: Prentice Hall. ISBN 978-0137903955. Similarly, Marvin Minsky once suggested that an AI program designed to solve the Riemann Hypothesis might end up taking over all the resources of Earth to build more powerful supercomputers to help achieve its goal.
  5. ^ Montavon, Grégoire; Samek, Wojciech; Müller, Klaus Robert (2018). "Methods for interpreting and understanding deep neural networks". Digital Signal Processing. 73: 1–15. arXiv:1706.07979. Bibcode:2018DSP....73....1M. doi:10.1016/j.dsp.2017.10.011. hdl:21.11116/0000-0000-4313-F. ISSN 1051-2004. S2CID 207170725.
  6. ^ Yampolskiy, Roman V. "Unexplainability and Incomprehensibility of AI." Journal of Artificial Intelligence and Consciousness 7.02 (2020): 277-291.
  7. ^ Bostrom, Nick (2014). "Chapter 10: Oracles, genies, sovereigns, tools (page 145)". Superintelligence: Paths, Dangers, Strategies. Oxford: Oxford University Press. ISBN 9780199678112. An oracle is a question-answering system. It might accept questions in a natural language and present its answers as text. An oracle that accepts only yes/no questions could output its best guess with a single bit, or perhaps with a few extra bits to represent its degree of confidence. An oracle that accepts open-ended questions would need some metric with which to rank possible truthful answers in terms of their informativeness or appropriateness. In either case, building an oracle that has a fully domain-general ability to answer natural language questions is an AI-complete problem. If one could do that, one could probably also build an AI that has a decent ability to understand human intentions as well as human words.
  8. ^ Armstrong, Stuart; Sandberg, Anders; Bostrom, Nick (2012). "Thinking Inside the Box: Controlling and Using an Oracle AI". Minds and Machines. 22 (4): 299–324. doi:10.1007/s11023-012-9282-2. S2CID 9464769.
  9. ^ a b Yampolskiy, Roman (2012). "Leakproofing the singularity: Artificial intelligence confinement problem" (PDF). Journal of Consciousness Studies. 19 (1–2): 194–214.
  10. ^ Armstrong, Stuart (2013), Müller, Vincent C. (ed.), "Risks and Mitigation Strategies for Oracle AI", Philosophy and Theory of Artificial Intelligence, Studies in Applied Philosophy, Epistemology and Rational Ethics, vol. 5, Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 335–347, doi:10.1007/978-3-642-31674-6_25, ISBN 978-3-642-31673-9, retrieved 2022-09-18
  11. ^ a b c d Russell, Stuart (October 8, 2019). Human Compatible: Artificial Intelligence and the Problem of Control. United States: Viking. ISBN 978-0-525-55861-3. OCLC 1083694322.
  12. ^ Bostrom, Nick (2014). "Chapter 10: Oracles, genies, sovereigns, tools (page 147)". Superintelligence: Paths, Dangers, Strategies. Oxford: Oxford University Press. ISBN 9780199678112. For example, consider the risk that an oracle will answer questions not in a maximally truthful way but in such a way as to subtly manipulate us into promoting its own hidden agenda. One way to slightly mitigate this threat could be to create multiple oracles, each with a slightly different code and a slightly different information base. A simple mechanism could then compare the answers given by the different oracles and only present them for human viewing if all the answers agree.
  13. ^ Amodei, Dario; Olah, Chris; Steinhardt, Jacob; Christiano, Paul; Schulman, John; Mané, Dan (25 July 2016). "Concrete Problems in AI Safety". arXiv:1606.06565 [cs.AI].
  14. ^ a b Hsu, Jeremy (1 March 2012). "Control dangerous AI before it controls us, one expert says". NBC News. Retrieved 29 January 2016.
  15. ^ a b Yampolskiy, Roman V. (2013), Müller, Vincent C. (ed.), "What to Do with the Singularity Paradox?", Philosophy and Theory of Artificial Intelligence, Studies in Applied Philosophy, Epistemology and Rational Ethics, vol. 5, Berlin, Heidelberg: Springer Berlin Heidelberg, pp. 397–413, doi:10.1007/978-3-642-31674-6_30, ISBN 978-3-642-31673-9, retrieved 2022-09-19
  16. ^ Teun van der Weij; Simon Lermen; Leon Lang (July 3, 2023), Evaluating Shutdown Avoidance of Language Models in Textual Scenarios, arXiv:2307.00787
  17. ^ Xudong Pan; Jiarun Dai; Yihe Fan; Min Yang (December 9, 2024), Frontier AI systems have surpassed the self-replicating red line, arXiv:2412.12140
  18. ^ "AI can now replicate itself: How close are we to losing control over technology?". Economic Times. January 27, 2025.
  19. ^ "Google developing kill switch for AI". BBC News. 8 June 2016. Archived from the original on 11 June 2016. Retrieved 12 June 2016.
  20. ^ a b Orseau, Laurent; Armstrong, Stuart (25 June 2016). "Safely interruptible agents". Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence. UAI'16. AUAI Press: 557–566. ISBN 9780996643115. Archived from the original on 15 February 2021. Retrieved 7 February 2021.
  21. ^ Soares, Nate, et al. "Corrigibility." Workshops at the Twenty-Ninth AAAI Conference on Artificial Intelligence. 2015.
  22. ^ a b Chalmers, David (2010). "The singularity: A philosophical analysis". Journal of Consciousness Studies. 17 (9–10): 7–65.
  23. ^ Bostrom, Nick (2013). "Chapter 9: The Control Problem: boxing methods". Superintelligence: the coming machine intelligence revolution. Oxford: Oxford University Press. ISBN 9780199678112.
  24. ^ Vinge, Vernor (1993). "The coming technological singularity: How to survive in the post-human era". Vision-21: Interdisciplinary Science and Engineering in the Era of Cyberspace: 11–22. Bibcode:1993vise.nasa...11V. I argue that confinement is intrinsically impractical. For the case of physical confinement: Imagine yourself confined to your house with only limited data access to the outside, to your masters. If those masters thought at a rate -- say -- one million times slower than you, there is little doubt that over a period of years (your time) you could come up with 'helpful advice' that would incidentally set you free.
  25. ^ Yampolskiy, Roman (2012). "Leakproofing the Singularity: Artificial Intelligence Confinement Problem". Journal of Consciousness Studies. 19 (1–2): 194–214.