Top-10 Papers Recommended in 2023-09

Note: Some authors’ names cannot be fully represented using the 26-letter English alphabet; we uniformly use the ‘author’ field as exported from Google Scholar.

| Paper | Authors | Published in | Date |
| --- | --- | --- | --- |
| Deep reinforcement learning from human preferences | Paul F. Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, Dario Amodei | Advances in Neural Information Processing Systems | 2017-06 |
| Goal Misgeneralization in Deep Reinforcement Learning | Lauro Langosco Di Langosco, Jack Koch, Lee D Sharkey, Jacob Pfau, David Krueger | International Conference on Machine Learning | 2021-05 |
| Unsolved Problems in ML Safety | Hendrycks, Dan and Carlini, Nicholas and Schulman, John and Steinhardt, Jacob | arXiv preprint arXiv:2109.13916 | 2021-09 |
| Chain-of-Thought Prompting Elicits Reasoning in Large Language Models | Wei, Jason and Wang, Xuezhi and Schuurmans, Dale and Bosma, Maarten and Xia, Fei and Chi, Ed and Le, Quoc V and Zhou, Denny and others | Advances in Neural Information Processing Systems | 2022-01 |
| Training language models to follow instructions with human feedback | Ouyang, Long and Wu, Jeffrey and Jiang, Xu and Almeida, Diogo and Wainwright, Carroll and Mishkin, Pamela and Zhang, Chong and Agarwal, Sandhini and Slama, Katarina and Ray, Alex and others | Advances in Neural Information Processing Systems | 2022-03 |
| Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback | Bai, Yuntao and Jones, Andy and Ndousse, Kamal and Askell, Amanda and Chen, Anna and DasSarma, Nova and Drain, Dawn and Fort, Stanislav and Ganguli, Deep and Henighan, Tom and others | arXiv preprint arXiv:2204.05862 | 2022-04 |
| The alignment problem from a deep learning perspective | Ngo, Richard and Chan, Lawrence and Mindermann, Sören | arXiv preprint arXiv:2209.00626 | 2022-09 |
| Constitutional AI: Harmlessness from AI Feedback | Bai, Yuntao and Kadavath, Saurav and Kundu, Sandipan and Askell, Amanda and Kernion, Jackson and Jones, Andy and Chen, Anna and Goldie, Anna and Mirhoseini, Azalia and McKinnon, Cameron and others | arXiv preprint arXiv:2212.08073 | 2022-12 |
| Llama 2: Open Foundation and Fine-Tuned Chat Models | Touvron, Hugo and Martin, Louis and Stone, Kevin and Albert, Peter and Almahairi, Amjad and Babaei, Yasmine and Bashlykov, Nikolay and Batra, Soumya and Bhargava, Prajjwal and Bhosale, Shruti and others | arXiv preprint arXiv:2307.09288 | 2023-07 |
| Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback | Casper, Stephen and Davies, Xander and Shi, Claudia and Gilbert, Thomas Krendl and Scheurer, Jérémy and Rando, Javier and Freedman, Rachel and Korbak, Tomasz and Lindner, David and Freire, Pedro and others | arXiv preprint arXiv:2307.15217 | 2023-07 |
Deep reinforcement learning from human preferences

Authors: Paul F. Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, Dario Amodei

Published in: Advances in Neural Information Processing Systems

Date: 2017-06

Goal Misgeneralization in Deep Reinforcement Learning

Authors: Lauro Langosco Di Langosco, Jack Koch, Lee D Sharkey, Jacob Pfau, David Krueger

Published in: International Conference on Machine Learning

Date: 2021-05

Unsolved Problems in ML Safety

Authors: Hendrycks, Dan and Carlini, Nicholas and Schulman, John and Steinhardt, Jacob

Published in: arXiv preprint arXiv:2109.13916

Date: 2021-09

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Authors: Wei, Jason and Wang, Xuezhi and Schuurmans, Dale and Bosma, Maarten and Xia, Fei and Chi, Ed and Le, Quoc V and Zhou, Denny and others

Published in: Advances in Neural Information Processing Systems

Date: 2022-01

Training language models to follow instructions with human feedback

Authors: Ouyang, Long and Wu, Jeffrey and Jiang, Xu and Almeida, Diogo and Wainwright, Carroll and Mishkin, Pamela and Zhang, Chong and Agarwal, Sandhini and Slama, Katarina and Ray, Alex and others

Published in: Advances in Neural Information Processing Systems

Date: 2022-03

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Authors: Bai, Yuntao and Jones, Andy and Ndousse, Kamal and Askell, Amanda and Chen, Anna and DasSarma, Nova and Drain, Dawn and Fort, Stanislav and Ganguli, Deep and Henighan, Tom and others

Published in: arXiv preprint arXiv:2204.05862

Date: 2022-04

The alignment problem from a deep learning perspective

Authors: Ngo, Richard and Chan, Lawrence and Mindermann, Sören

Published in: arXiv preprint arXiv:2209.00626

Date: 2022-09

Constitutional AI: Harmlessness from AI Feedback

Authors: Bai, Yuntao and Kadavath, Saurav and Kundu, Sandipan and Askell, Amanda and Kernion, Jackson and Jones, Andy and Chen, Anna and Goldie, Anna and Mirhoseini, Azalia and McKinnon, Cameron and others

Published in: arXiv preprint arXiv:2212.08073

Date: 2022-12

Llama 2: Open Foundation and Fine-Tuned Chat Models

Authors: Touvron, Hugo and Martin, Louis and Stone, Kevin and Albert, Peter and Almahairi, Amjad and Babaei, Yasmine and Bashlykov, Nikolay and Batra, Soumya and Bhargava, Prajjwal and Bhosale, Shruti and others

Published in: arXiv preprint arXiv:2307.09288

Date: 2023-07

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

Authors: Casper, Stephen and Davies, Xander and Shi, Claudia and Gilbert, Thomas Krendl and Scheurer, Jérémy and Rando, Javier and Freedman, Rachel and Korbak, Tomasz and Lindner, David and Freire, Pedro and others

Published in: arXiv preprint arXiv:2307.15217

Date: 2023-07
