Top-10 Papers Recommended in 2023-09

Note: Some authors’ names cannot be fully represented using the 26-letter English alphabet; we uniformly use the ‘author’ field as exported from Google Scholar.

| Paper | Authors | Published in | Date |
| --- | --- | --- | --- |
| Deep reinforcement learning from human preferences | Paul F. Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, Dario Amodei | Advances in Neural Information Processing Systems | 2017-06 |
| Goal Misgeneralization in Deep Reinforcement Learning | Lauro Langosco Di Langosco, Jack Koch, Lee D Sharkey, Jacob Pfau, David Krueger | International Conference on Machine Learning | 2021-05 |
| Unsolved Problems in ML Safety | Hendrycks, Dan and Carlini, Nicholas and Schulman, John and Steinhardt, Jacob | arXiv preprint arXiv:2109.13916 | 2021-09 |
| Chain-of-Thought Prompting Elicits Reasoning in Large Language Models | Wei, Jason and Wang, Xuezhi and Schuurmans, Dale and Bosma, Maarten and Xia, Fei and Chi, Ed and Le, Quoc V and Zhou, Denny and others | Advances in Neural Information Processing Systems | 2022-01 |
| Training language models to follow instructions with human feedback | Ouyang, Long and Wu, Jeffrey and Jiang, Xu and Almeida, Diogo and Wainwright, Carroll and Mishkin, Pamela and Zhang, Chong and Agarwal, Sandhini and Slama, Katarina and Ray, Alex and others | Advances in Neural Information Processing Systems | 2022-03 |
| Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback | Bai, Yuntao and Jones, Andy and Ndousse, Kamal and Askell, Amanda and Chen, Anna and DasSarma, Nova and Drain, Dawn and Fort, Stanislav and Ganguli, Deep and Henighan, Tom and others | arXiv preprint arXiv:2204.05862 | 2022-04 |
| The alignment problem from a deep learning perspective | Ngo, Richard and Chan, Lawrence and Mindermann, Sören | arXiv preprint arXiv:2209.00626 | 2022-09 |
| Constitutional AI: Harmlessness from AI Feedback | Bai, Yuntao and Kadavath, Saurav and Kundu, Sandipan and Askell, Amanda and Kernion, Jackson and Jones, Andy and Chen, Anna and Goldie, Anna and Mirhoseini, Azalia and McKinnon, Cameron and others | arXiv preprint arXiv:2212.08073 | 2022-12 |
| Llama 2: Open Foundation and Fine-Tuned Chat Models | Touvron, Hugo and Martin, Louis and Stone, Kevin and Albert, Peter and Almahairi, Amjad and Babaei, Yasmine and Bashlykov, Nikolay and Batra, Soumya and Bhargava, Prajjwal and Bhosale, Shruti and others | arXiv preprint arXiv:2307.09288 | 2023-07 |
| Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback | Casper, Stephen and Davies, Xander and Shi, Claudia and Gilbert, Thomas Krendl and Scheurer, Jérémy and Rando, Javier and Freedman, Rachel and Korbak, Tomasz and Lindner, David and Freire, Pedro and others | arXiv preprint arXiv:2307.15217 | 2023-07 |
Deep reinforcement learning from human preferences

Authors: Paul F. Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, Dario Amodei

Published in: Advances in Neural Information Processing Systems

Date: 2017-06

Goal Misgeneralization in Deep Reinforcement Learning

Authors: Lauro Langosco Di Langosco, Jack Koch, Lee D Sharkey, Jacob Pfau, David Krueger

Published in: International Conference on Machine Learning

Date: 2021-05

Unsolved Problems in ML Safety

Authors: Hendrycks, Dan and Carlini, Nicholas and Schulman, John and Steinhardt, Jacob

Published in: arXiv preprint arXiv:2109.13916

Date: 2021-09

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Authors: Wei, Jason and Wang, Xuezhi and Schuurmans, Dale and Bosma, Maarten and Xia, Fei and Chi, Ed and Le, Quoc V and Zhou, Denny and others

Published in: Advances in Neural Information Processing Systems

Date: 2022-01

Training language models to follow instructions with human feedback

Authors: Ouyang, Long and Wu, Jeffrey and Jiang, Xu and Almeida, Diogo and Wainwright, Carroll and Mishkin, Pamela and Zhang, Chong and Agarwal, Sandhini and Slama, Katarina and Ray, Alex and others

Published in: Advances in Neural Information Processing Systems

Date: 2022-03

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Authors: Bai, Yuntao and Jones, Andy and Ndousse, Kamal and Askell, Amanda and Chen, Anna and DasSarma, Nova and Drain, Dawn and Fort, Stanislav and Ganguli, Deep and Henighan, Tom and others

Published in: arXiv preprint arXiv:2204.05862

Date: 2022-04

The alignment problem from a deep learning perspective

Authors: Ngo, Richard and Chan, Lawrence and Mindermann, Sören

Published in: arXiv preprint arXiv:2209.00626

Date: 2022-09

Constitutional AI: Harmlessness from AI Feedback

Authors: Bai, Yuntao and Kadavath, Saurav and Kundu, Sandipan and Askell, Amanda and Kernion, Jackson and Jones, Andy and Chen, Anna and Goldie, Anna and Mirhoseini, Azalia and McKinnon, Cameron and others

Published in: arXiv preprint arXiv:2212.08073

Date: 2022-12

Llama 2: Open Foundation and Fine-Tuned Chat Models

Authors: Touvron, Hugo and Martin, Louis and Stone, Kevin and Albert, Peter and Almahairi, Amjad and Babaei, Yasmine and Bashlykov, Nikolay and Batra, Soumya and Bhargava, Prajjwal and Bhosale, Shruti and others

Published in: arXiv preprint arXiv:2307.09288

Date: 2023-07

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

Authors: Casper, Stephen and Davies, Xander and Shi, Claudia and Gilbert, Thomas Krendl and Scheurer, Jérémy and Rando, Javier and Freedman, Rachel and Korbak, Tomasz and Lindner, David and Freire, Pedro and others

Published in: arXiv preprint arXiv:2307.15217

Date: 2023-07
