How large language models "think" and can we trust them: a case study of testing ChatGPT on tasks in an introductory statistics course
PDF (Croatian)

Keywords

large language models
ChatGPT
statistics
testing
Croatian language

How to Cite

Dobša, J. (2023). How large language models "think" and can we trust them: a case study of testing ChatGPT on tasks in an introductory statistics course. Polytechnica, 7(2), 18-25. https://doi.org/10.36978/cte.7.2.2

Abstract

The article aims to identify cases in which large language models show behavior similar to human thinking and cases in which they "think" differently, and to point out the opportunities, risks, and limits of applying artificial intelligence in teaching, in the context of testing ChatGPT on student tasks in the field of statistics. The possibilities and limitations of large language models are analyzed, as well as ways to overcome existing biases and shortcomings in this rapidly growing field. In the paper, ChatGPT, a chatbot based on the large language model GPT-4, is tested as part of an introductory statistics course taught to second-year computer science students. The tests were conducted by manually entering 170 statistics quiz questions into the ChatGPT browser interface. The questions are divided into three categories: theoretical questions that require reproduction of knowledge, theoretical questions that test understanding of the field, and exercises. The quiz questions were asked in Croatian, and the answers given in Croatian were analyzed. The accuracy of students and of ChatGPT in solving the quiz questions was compared by question category with the Wilcoxon rank-sum test. The results show that ChatGPT performs statistically significantly better than students in the two categories of theoretical questions (reproduction of knowledge and understanding), while students are more successful in solving the exercises, although that difference in accuracy is not statistically significant at the 0.01 level.
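The comparison described in the abstract can be sketched in a few lines of Python. This is a minimal illustration with hypothetical accuracy data (the values below are not the study's data), using `scipy.stats.ranksums` for the Wilcoxon rank-sum test and the study's 0.01 significance threshold:

```python
# Illustration with hypothetical data: comparing per-question accuracy of
# students vs. ChatGPT using the Wilcoxon rank-sum test.
from scipy.stats import ranksums

# Hypothetical per-question accuracy (proportion of correct answers)
student_acc = [0.62, 0.71, 0.55, 0.80, 0.47, 0.66]
chatgpt_acc = [0.90, 0.85, 0.78, 0.95, 0.70, 0.88]

# Two-sided rank-sum test: are the two samples drawn from the same distribution?
stat, p_value = ranksums(chatgpt_acc, student_acc)
print(f"W = {stat:.3f}, p = {p_value:.4f}")

# A difference is reported as significant only at the 0.01 level used in the study.
print("significant at 0.01" if p_value < 0.01 else "not significant at 0.01")
```

With these toy numbers the test statistic is positive (ChatGPT's ranks are higher) but the two-sided p-value falls above 0.01, so the difference would not be declared significant at the study's threshold.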

References

Alfertshofer, M., Hoch, C. C., Funk, P. F., Hollmann, K., Wollenberg, B., Knoedler, S., Knoedler, L. (2023). Sailing the Seven Seas: A Multinational Comparison of ChatGPT's Performance on Medical Licensing Examinations. Annals of Biomedical Engineering. doi: https://doi.org/10.1007/s10439-023-03338-3

Bengio, Y., Ducharme, R., Vincent, P., Jauvin, C. (2003). A neural probabilistic language model. Journal of Machine Learning Research, 3, 1137-1155.

Devlin, J., Chang, M.-W., Lee, K., Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Burstein, J., Doran, C., Solorio, T. (Eds.), Proceedings of NAACL-HLT 2019 (pp. 4171-4186). Association for Computational Linguistics. doi: https://doi.org/10.18653/v1/N19-1423

Lin, S., Hilton, J., Evans, O. (2021). TruthfulQA: Measuring how models mimic human falsehoods. Retrieved from https://arxiv.org/abs/2109.07958

Ljubešić, N., Lauc, D. (2021). BERTić – The Transformer Language Model for Bosnian, Croatian, Montenegrin and Serbian. Retrieved from https://arxiv.org/abs/2104.09243v1

Martin, L., Muller, B., Suarez, P. J. O., Dupont, Y., Romary, L., Villemonte de la Clergerie, E., Seddah, D., Sagot, B. (2019). CamemBERT: a tasty French language model. Retrieved from https://arxiv.org/abs/1911.03894

Mikolov, T., Chen, K., Corrado, G., Dean, J. (2013). Efficient estimation of word representations in vector space. Retrieved from https://arxiv.org/abs/1301.3781

Ordak, M. (2023). ChatGPT's skills in statistical analysis using the example of allergology: Do we have reason for concern? Healthcare, 11(18), 2554. doi: https://doi.org/10.3390/healthcare11182554

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., Lowe, R. (2022). Training language models to follow instructions with human feedback. Retrieved from https://arxiv.org/abs/2203.02155

Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L. (2018). Deep contextualized word representations. Retrieved from https://arxiv.org/abs/1802.05365

Radford, A., Narasimhan, K. (2018). Improving language understanding by generative pre-training. Retrieved from https://api.semanticscholar.org/CorpusID:49313245

Roivainen, E. (2023). AI's IQ. Scientific American, 329(1), 7.

Salton, G., McGill, M. J. (1983). Introduction to Modern Information Retrieval. McGraw-Hill, Inc.

Sareen, K. (2023). Assessing the ethical capabilities of ChatGPT in healthcare: A study on its proficiency in situational judgement test. Innovations in Education and Teaching International. doi: https://doi.org/10.1080/14703297.2023.2258114

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O. (2017). Proximal policy optimization algorithms. Retrieved from https://arxiv.org/abs/1707.06347

Sheng, E., Chang, K.-W., Natarajan, P., Peng, N. (2019). The woman worked as a babysitter: On biases in language generation. Retrieved from https://arxiv.org/abs/1909.01326

Šebalj, D. (2022). Analiza tekstnih dokumenata na hrvatskom jeziku korištenjem metoda dubokog učenja [Analysis of text documents in the Croatian language using deep learning methods], master's thesis, Sveučilište u Zagrebu, Fakultet organizacije i informatike. Retrieved from https://gpml.foi.hr/laboratory/data/uploads/domagoj-sebalj-diplomski-rad.pdf

Tamkin, A., Brundage, M., Clark, J., Ganguli, D. (2021). Understanding the capabilities, limitations, and social impact of large language models. Retrieved from https://arxiv.org/abs/2102.02503

Ulčar, M., Robnik-Šikonja, M. (2020). FinEst and CroSloEngual BERT: less is more in multilingual models. Retrieved from https://arxiv.org/abs/2006.07890

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., Polosukhin, I. (2017). Attention is all you need. Retrieved from https://arxiv.org/abs/1706.03762

Vazquez-Cano, E., Ramirez-Hurtado, J. M., Saez-Lopez, J. M., Lopez-Meneses, E. (2023). ChatGPT: The brightest student in the class. Thinking Skills and Creativity, 49, 101380. doi: https://doi.org/10.1016/j.tsc.2023.101380

Virtanen, A., Kanerva, J., Ilo, R., Luoma, J., Luotolahti, J., Salakoski, T., Ginter, F., Pyysalo, S. (2019). Multilingual is not enough: BERT for Finnish. Retrieved from https://arxiv.org/abs/1912.07076

Walsh, T. (2022). Everyone's having a field day with ChatGPT – but nobody knows how it actually works. The Conversation. Retrieved from https://theconversation.com/everyones-having-a-field-day-with-chatgpt-but-nobody-knows-how-it-actually-works-196378

Wang, A., Russakovsky, O. (2021). Directional bias amplification. Retrieved from https://arxiv.org/abs/2304.04874

Weidinger, L., Mellor, J., Rauh, M., Griffin, C., Uesato, J., Huang, P.-S., Cheng, M., Glaese, M., Balle, B., Kasirzadeh, A., Kenton, Z., Brown, S., Hawkins, W., Stepleton, T., Biles, C., Birhane, A., Haas, J., Rimell, L., Hendricks, L. A., Isaac, W., Legassick, S., Irving, G., Gabriel, I. (2021). Ethical and social risks of harm from language models. Retrieved from https://arxiv.org/abs/2112.04359

Creative Commons License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Copyright (c) 2023