Article
Authors
Álvarez-Martínez, Francisco Javier; Esteban, Luis; Frungillo, Lucas; Butassi, Estefania; Zambon, Alessandro; Herranz-López, María; Aranda, Mario; Pollastro, Federica; Tixier, Anne Sylvie; Garcia-Perez, Jose V.; Arráez-Román, David; Ross, Andrew; Mena, Pedro; Edrada-Ebel, Ru Angelie; Lyng, James; Micol, Vicente; Borrás-Rocher, Fernando; Barrajón-Catalán, Enrique
Date
2025
Publisher and Place of Publication
Frontiers Media S.A.
Journal
Frontiers in Artificial Intelligence, vol. 8
Abstract
Information supplied by the agent in SIGEVA
Introduction: This study investigates the efficacy of large language models (LLMs) for generating accurate scientific responses through a comparative evaluation of five prominent free models: Claude 3.5 Sonnet, Gemini, ChatGPT 4o, Mistral Large 2, and Llama 3.1 70B.

Methods: Sixteen expert scientific reviewers assessed these models in terms of depth, accuracy, relevance, and clarity.

Results: Claude 3.5 Sonnet emerged as the highest-scoring model, followed by Gemini, with notable variability among the other models. Additionally, retrieval-augmented generation (RAG) techniques were applied to improve LLM performance, and prompts were refined to improve answers. The results indicate that although LLMs such as Claude 3.5 Sonnet have potential for scientific tasks, other models may require more development or additional prompt engineering to reach comparable accuracy. Reviewers' perceptions of artificial intelligence (AI) utility and trustworthiness showed a positive shift after evaluation. However, ethical concerns, particularly with respect to transparency and disclosure, remained consistent.

Discussion: The study highlights the need for structured frameworks for evaluating LLMs and ethical considerations essential for responsible AI integration in scientific research. These findings should be interpreted with caution, as the limited sample size and domain-specific focus of the exam questions restrict the generalizability of the results.
Keywords
LARGE LANGUAGE MODELS; RETRIEVAL-AUGMENTED GENERATION; SCIENTIFIC EVALUATION; PROMPT ENGINEERING; ARTIFICIAL INTELLIGENCE