The reliability of replications: a study in computational reproductions


Authors

Nate Breznau, Eike Mark Rinke, Alexander Wuttke, et al. (full author list in the citation below)

Affiliations

University of Bremen

University of Leeds

Ludwig-Maximilians-Universität München

DOI

10.1098/rsos.241038

Abstract

This study investigates researcher variability in computational reproduction, an activity for which it is least expected. Eighty-five independent teams attempted numerical replication of results from an original study of policy preferences and immigration. Reproduction teams were randomly grouped into a ‘transparent group’ receiving original study and code or ‘opaque group’ receiving only a method and results description and no code. The transparent group mostly verified original results (95.7% same sign and p-value cutoff), while the opaque group had less success (89.3%). Second-decimal place exact numerical reproductions were less common (76.9 and 48.1%). Qualitative investigation of the workflows revealed many causes of error, including mistakes and procedural variations. When curating mistakes, we still find that only the transparent group was reliably successful. Our findings imply a need for transparency, but also more. Institutional checks and less subjective difficulty for researchers ‘doing reproduction’ would help, implying a need for better training. We also urge increased awareness of complexity in the research process and in ‘push button’ replications.
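For readers unfamiliar with the two success criteria named in the abstract, the following minimal Python sketch illustrates how a single reproduction attempt might be classified: the looser criterion (same sign and same side of a p-value cutoff) versus the stricter one (numerical agreement to the second decimal place). The function names, the alpha = 0.05 cutoff, and the example coefficients are illustrative assumptions for this sketch, not the study's actual code or data.

import math

def same_sign_and_significance(beta_orig, p_orig, beta_rep, p_rep, alpha=0.05):
    """Looser criterion: the reproduced estimate has the same sign as the
    original and falls on the same side of the p-value cutoff (assumed 0.05)."""
    same_sign = math.copysign(1, beta_orig) == math.copysign(1, beta_rep)
    same_cutoff = (p_orig < alpha) == (p_rep < alpha)
    return same_sign and same_cutoff

def exact_to_second_decimal(beta_orig, beta_rep):
    """Stricter criterion: the reproduced estimate matches the original
    when both are rounded to two decimal places."""
    return round(beta_orig, 2) == round(beta_rep, 2)

# Hypothetical example values, not taken from the original study:
print(same_sign_and_significance(-0.14, 0.003, -0.17, 0.011))  # True
print(exact_to_second_decimal(-0.14, -0.17))                   # False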

Cite

@article{10.1098/rsos.241038,
  title = {The Reliability of Replications: A Study in Computational Reproductions},
  author = {Breznau, Nate and Rinke, Eike Mark and Wuttke, Alexander and Adem, Muna and Adriaans, Jule and Akdeniz, Esra and {Alvarez-Benjumea}, Amalia and Andersen, Henrik K. and Auer, Daniel and Azevedo, Flavio and Bahnsen, Oke and Bai, Ling and Balzer, Dave and Bauer, Paul C. and Bauer, Gerrit and Baumann, Markus and Baute, Sharon and Benoit, Verena and Bernauer, Julian and Berning, Carl and Berthold, Anna and Bethke, Felix S. and Biegert, Thomas and Blinzler, Katharina and Blumenberg, Johannes N. and Bobzien, Licia and Bohman, Andrea and Bol, Thijs and Bostic, Amie and Brzozowska, Zuzanna and Burgdorf, Katharina and Burger, Kaspar and Busch, Kathrin and Castillo, Juan-Carlos and Chan, Nathan and Christmann, Pablo and Connelly, Roxanne and Czymara, Christian S. and Damian, Elena and {de Rooij}, Eline A. and Ecker, Alejandro and Edelmann, Achim and Eder, Christina and Eger, Maureen A. and Ellerbrock, Simon and Forke, Anna and Forster, Andrea and Freire, Danilo and Gaasendam, Chris and Gavras, Konstantin and Gayle, Vernon and Gessler, Theresa and Gnambs, Timo and Godefroidt, Am{\'e}lie and Gr{\"o}mping, Max and Gro{\ss}, Martin and Gruber, Stefan and Gummer, Tobias and Hadjar, Andreas and Halbherr, Verena and Heisig, Jan Paul and Hellmeier, Sebastian and Heyne, Stefanie and Hirsch, Magdalena and Hjerm, Mikael and Hochman, Oshrat and H{\"o}ffler, Jan H. and H{\"o}vermann, Andreas and Hunger, Sophia and Hunkler, Christian and {Huth-St{\"o}ckle}, Nora and Ign{\'a}cz, Zs{\'o}fia S. and Israel, Sabine and Jacobs, Laura and Jacobsen, Jannes and Jaeger, Bastian and Jungkunz, Sebastian and Jungmann, Nils and Kanjana, Jennifer and Kauff, Mathias and Khan, Salman and Khatua, Sayak and Kleinert, Manuel and Klinger, Julia and Kolb, Jan-Philipp and Ko{\l}czy{\'n}ska, Marta and Kuk, John and Kuni{\ss}en, Katharina and Kurti Sinatra, Dafina and Langenkamp, Alexander and Lee, Robin C. and Lersch, Philipp M. and Liu, David and L{\"o}bel, Lea-Maria and Lutscher, Philipp and Mader, Matthias and Madia, Joan E. and Malancu, Natalia and Maldonado, Luis and Marahrens, Helge and Martin, Nicole and Martinez, Paul and Mayerl, Jochen and Mayorga, Oscar J. and McDonnell, Robert and McManus, Patricia and McWagner, Kyle and Meeusen, Cecil and Meierrieks, Daniel and Mellon, Jonathan and Merhout, Friedolin and Merk, Samuel and Meyer, Daniel and Micheli, Leticia and Mijs, Jonathan and Moya, Crist{\'o}bal and Neunhoeffer, Marcel and N{\"u}st, Daniel and Nyg{\aa}rd, Olav and Ochsenfeld, Fabian and Otte, Gunnar and Pechenkina, Anna and Pickup, Mark and Prosser, Christopher and Raes, Louis and Ralston, Kevin and Ramos, Miguel and Reichert, Frank and Roets, Arne and Rogers, Jonathan and Ropers, Guido and Samuel, Robin and Sand, Gregor and Sanhueza Petrarca, Constanza and Schachter, Ariela and Schaeffer, Merlin and Schieferdecker, David and Schlueter, Elmar and Schmidt, Katja and Schmidt, Regine and {Schmidt-Catran}, Alexander and Schmiedeberg, Claudia and Schneider, J{\"u}rgen and Schoonvelde, Martijn and {Schulte-Cloos}, Julia and Schumann, Sandy and Schunck, Reinhard and Seuring, Julian and Silber, Henning and Sleegers, Willem and Sonntag, Nico and Staudt, Alexander and Steiber, Nadia and Steiner, Nils D. and Sternberg, Sebastian and Stiers, Dieter and Stojmenovska, Dragana and Storz, Nora and Striessnig, Erich and Stroppe, Anne-Kathrin and Suchow, Jordan W. 
and Teltemann, Janna and Tibajev, Andrey and Tung, Brian and Vagni, Giacomo and Van Assche, Jasper and {van der Linden}, Meta and {van der Noll}, Jolanda and Van Hootegem, Arno and Vogtenhuber, Stefan and Voicu, Bogdan and Wagemans, Fieke and Wehl, Nadja and Werner, Hannah and Wiernik, Brenton M. and Winter, Fabian and Wolf, Christof and Wu, Cary and Yamada, Yuki and Zakula, Bj{\"o}rn and Zhang, Nan and Ziller, Conrad and Zins, Stefan and {\.Z}{\'o}{\l}tak, Tomasz and Nguyen, Hung H.V.},
  year = 2025,
  month = mar,
  journal = {Royal Society Open Science},
  volume = {12},
  number = {3},
  eprint = {https://royalsocietypublishing.org/rsos/article-pdf/doi/10.1098/rsos.241038/886835/rsos.241038.pdf},
  pages = {241038},
  issn = {2054-5703},
  doi = {10.1098/rsos.241038},
  abstract = {This study investigates researcher variability in computational reproduction, an activity for which it is least expected. Eighty-five independent teams attempted numerical replication of results from an original study of policy preferences and immigration. Reproduction teams were randomly grouped into a `transparent group' receiving original study and code or `opaque group' receiving only a method and results description and no code. The transparent group mostly verified original results (95.7\% same sign and p-value cutoff), while the opaque group had less success (89.3\%). Second-decimal place exact numerical reproductions were less common (76.9 and 48.1\%). Qualitative investigation of the workflows revealed many causes of error, including mistakes and procedural variations. When curating mistakes, we still find that only the transparent group was reliably successful. Our findings imply a need for transparency, but also more. Institutional checks and less subjective difficulty for researchers `doing reproduction' would help, implying a need for better training. We also urge increased awareness of complexity in the research process and in `push button' replications.}
}