Abstract
The article aims to identify cases in which large language models exhibit behavior similar to human thinking and cases in which they "think" differently, and to point out opportunities, risks, and limits of applying artificial intelligence in teaching, in the context of testing the ChatGPT model on student tasks in the field of statistics. The possibilities and limitations of large language models are analyzed, as well as ways to overcome existing biases and shortcomings in this rapidly growing field. In the paper, a chatbot based on the GPT-4 large language model (ChatGPT) is tested as part of the introductory statistics course taught to second-year computer science students. The tests were conducted by manually entering 170 statistics quiz questions into the ChatGPT browser interface. The questions are divided into three categories: theoretical questions that test reproduction of knowledge, theoretical questions that test understanding of the field, and exercises. The quiz questions were asked in Croatian, and the answers given in Croatian were analyzed. The accuracy of students and ChatGPT in solving the quiz questions was compared by question category with the Wilcoxon rank sum test. The results show that ChatGPT performs statistically significantly better than students at the 0.01 significance level in the categories of theoretical questions requiring reproduction of knowledge and understanding, while students are more successful in solving the exercises, although that difference in accuracy is not statistically significant (p < 0.01).
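The per-category comparison described above can be sketched in code. The following is a minimal, self-contained illustration of the Wilcoxon rank sum test using the normal approximation with average ranks for ties; the accuracy values below are invented for demonstration and are not the study's data (in practice one would call `scipy.stats.ranksums`):

```python
import math

def rank_sum_test(x, y):
    """Wilcoxon rank sum test via the normal approximation.
    Ties receive average ranks. Returns (z statistic, two-sided p-value)."""
    n1, n2 = len(x), len(y)
    pooled = sorted(x + y)
    # Map each distinct value to the average of its 1-based ranks.
    ranks = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        ranks[pooled[i]] = (i + 1 + j) / 2  # positions i+1..j, averaged
        i = j
    w = sum(ranks[v] for v in x)            # rank sum of the first sample
    mu = n1 * (n1 + n2 + 1) / 2             # mean of W under H0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))    # two-sided p-value
    return z, p

# Hypothetical per-question accuracies for one question category.
student_acc = [0.40, 0.50, 0.60, 0.45, 0.55, 0.35, 0.65, 0.50]
chatgpt_acc = [0.80, 0.95, 0.70, 1.00, 0.85, 0.90, 0.75, 0.88]

z, p = rank_sum_test(chatgpt_acc, student_acc)
print(f"z = {z:.3f}, p = {p:.4f}, significant at 0.01: {p < 0.01}")
```

A positive z here indicates that the first sample (ChatGPT) tends to rank higher; the study's reported conclusions correspond to comparing such a p-value against the 0.01 significance level.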
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.