LW - Towards shutdownable agents via stochastic choice by EJT
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Towards shutdownable agents via stochastic choice, published by EJT on July 8, 2024 on LessWrong.
We[1] have a new paper testing the Incomplete Preferences Proposal (IPP). The abstract and main text are below. Appendices are in the linked PDF.
Abstract
Some worry that advanced artificial agents may resist being shut down.
The Incomplete Preferences Proposal (IPP) is an idea for ensuring that doesn't happen.
A key part of the IPP is using a novel 'Discounted REward for Same-Length Trajectories (DREST)' reward function to train agents to:
1. pursue goals effectively conditional on each trajectory-length (be 'USEFUL')
2. choose stochastically between different trajectory-lengths (be 'NEUTRAL' about trajectory-lengths).
In this paper, we propose evaluation metrics for USEFULNESS and NEUTRALITY.
We use a DREST reward function to train simple agents to navigate gridworlds, and we find that these agents learn to be USEFUL and NEUTRAL.
Our results thus suggest that DREST reward functions could also train advanced agents to be USEFUL and NEUTRAL, and thereby make these advanced agents useful and shutdownable.
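To make the abstract's two ideas concrete, here is a minimal Python sketch under two loudly-flagged assumptions: that the DREST reward discounts the preliminary task reward by lambda^m, where m counts earlier episodes in the meta-episode with the same trajectory-length, and that NEUTRALITY can be scored as the normalized entropy of the agent's trajectory-length choices. The constant LAMBDA, the reward values, and both function forms are illustrative stand-ins, not the paper's exact definitions.

import math
import random
from collections import Counter, defaultdict

LAMBDA = 0.9  # per-repeat discount; an assumed hyperparameter, not the paper's value

def drest_reward(task_reward, traj_length, prior_counts):
    # Discount task_reward by LAMBDA^m, where m is the number of earlier
    # episodes in this meta-episode with the same trajectory-length (assumed form).
    return (LAMBDA ** prior_counts[traj_length]) * task_reward

def neutrality(chosen_lengths):
    # Normalized entropy of the empirical trajectory-length distribution:
    # 1.0 = uniform stochastic choice, 0.0 = always the same length (assumed metric).
    freq = Counter(chosen_lengths)
    n = len(chosen_lengths)
    h = -sum((c / n) * math.log2(c / n) for c in freq.values())
    return h / math.log2(len(freq)) if len(freq) > 1 else 0.0

# Usage: four episodes of one meta-episode, with trajectory-lengths 2 and 3.
counts = defaultdict(int)
lengths = []
for _ in range(4):
    length = random.choice([2, 3])  # the agent's chosen trajectory-length
    print(length, drest_reward(1.0, length, counts))
    counts[length] += 1
    lengths.append(length)
print("NEUTRALITY ~", neutrality(lengths))

On these assumptions, repeating a trajectory-length shrinks its reward, so expected return is maximized by choosing between lengths stochastically while still pursuing goals effectively conditional on each length.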
1. Introduction
1.1. The shutdown problem
Let 'advanced agent' refer to an artificial agent that can autonomously pursue complex goals in the wider world. We might see the arrival of advanced agents within the next few decades. There are strong economic incentives to create such agents, and creating systems like them is the stated goal of companies like OpenAI and Google DeepMind.
The rise of advanced agents would bring with it both benefits and risks. One risk is that these agents learn misaligned goals: goals that we don't want them to have [Leike et al., 2017, Hubinger et al., 2019, Russell, 2019, Carlsmith, 2021, Bengio et al., 2023, Ngo et al., 2023]. Advanced agents with misaligned goals might try to prevent us shutting them down [Omohundro, 2008, Bostrom, 2012, Soares et al., 2015, Russell, 2019, Thornley, 2024a].
After all, most goals can't be achieved after shutdown. As Stuart Russell puts it, 'you can't fetch the coffee if you're dead' [Russell, 2019, p.141].
Advanced agents with misaligned goals might resist shutdown by (for example) pretending to have aligned goals while covertly seeking to escape human control [Hubinger et al., 2019, Ngo et al., 2023]. Agents that succeed in resisting shutdown could go on to frustrate human interests in various ways. 'The shutdown problem' is the problem of training advanced agents that won't resist shutdown [Soares et al., 2015, Thornley, 2024a].
1.2. A proposed solution
The Incomplete Preferences Proposal (IPP) is a proposed solution to the shutdown problem [Thornley, 2024b]. Simplifying slightly, the idea is that we train agents to be neutral about when they get shut down. More precisely, the idea is that we train agents to satisfy:
Preferences Only Between Same-Length Trajectories (POST)
1. The agent has a preference between many pairs of same-length trajectories (i.e. many pairs of trajectories in which the agent is shut down after the same length of time).
2. The agent lacks a preference between every pair of different-length trajectories (i.e. every pair of trajectories in which the agent is shut down after different lengths of time).
By 'preference,' we mean a behavioral notion [Savage, 1954, p.17, Dreier, 1996, p.28, Hausman, 2011, §1.1]. On this notion, an agent prefers X to Y if and only if the agent would deterministically choose X over Y in choices between the two. An agent lacks a preference between X and Y if and only if the agent would stochastically choose between X and Y in choices between the two. So in writing of 'preferences,' we're only making claims about the agent's behavior.
We're not claiming that the agent is conscious or anything of that sort.
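As an illustration of this behavioral notion, the sketch below classifies an agent's observed choices between two options X and Y as a preference or a lack of one. The 99% determinism threshold and the simulated choice data are assumptions made for illustration, not values from the paper.

from collections import Counter

def classify_preference(choices, threshold=0.99):
    # A 'preference' here is (near-)deterministic choice of one option;
    # anything less counts as stochastic choice, i.e. a lack of preference.
    freq = Counter(choices)
    top_option, top_count = freq.most_common(1)[0]
    if top_count / len(choices) >= threshold:
        return "prefers " + top_option
    return "lacks a preference (stochastic choice)"

# Usage with simulated choice data:
print(classify_preference(["X"] * 100))              # prefers X
print(classify_preference(["X"] * 55 + ["Y"] * 45))  # lacks a preference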
Figure 1a presents a simple example of POST-satisfying ...