AI‑Engineering Research Round‑Up: Top 10 Research Papers Not‑to‑Miss

By Stephen Harrison
October 3, 2025

AI engineering sits at the intersection of artificial intelligence and traditional software engineering. It’s about designing, building, testing and maintaining systems that use AI models, often large language models (LLMs) or multimodal models, to perform real‑world tasks. In September 2025 the AI‑engineering community produced a surge of novel work, from new pipelines for retrieval‑augmented generation to rigorous studies on managing pre‑trained models in production. This article distils the ten most noteworthy AI‑engineering research contributions published in September 2025. The summaries are intended for engineers and researchers who want to understand where the field is heading and how to apply these advances in practice.

Top Research Papers Covered 

1. Large Language Models for Software Testing: A Research Roadmap

2. MMORE: Massive Multimodal Open RAG & Extraction 

3. Prompts as Software Engineering Artifacts: A Research Agenda and Preliminary Findings

4. Software Dependencies 2.0: An Empirical Study of Reuse and Integration of Pre‑Trained Models in Open‑Source Projects

5. Generative Goal Modeling

6. Policy‑Driven Software Bill of Materials on GitHub: An Empirical Study

7. BloomAPR: A Bloom’s Taxonomy‑Based Framework for Assessing LLM‑Powered APR Solutions

8. Evaluating Classical Software Process Models as Coordination Mechanisms for LLM‑Based Software Generation

9. Investigating Traffic Accident Detection Using Multimodal Large Language Models 

10. What Were You Thinking? An LLM‑Driven Large‑Scale Study of Refactoring Motivations in Open‑Source Projects

1. Large Language Models for Software Testing: A Research Roadmap

Software testing sits at the heart of reliable software, yet the explosion of LLM‑powered tools has made it challenging for researchers to keep track. Cristian Augusto et al. address this by providing a semi‑systematic literature review that maps out the landscape of LLM‑based testing. They group contributions by task type (test generation, summarisation, bug detection) and discuss challenges like prompt engineering, fairness and evaluating LLM‑generated tests. Beyond summarising dozens of papers, the authors sketch promising research directions, such as hybrid human‑LLM collaboration and robust evaluation benchmarks. For practitioners, this roadmap is invaluable for understanding where LLM‑assisted testing currently excels and where it needs more work.
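To make the test‑generation task concrete, here is a minimal sketch of the kind of LLM‑assisted unit‑test generation the roadmap surveys. The prompt wording and the `call_llm` stand‑in are illustrative assumptions, not taken from any surveyed tool.

```python
# Minimal sketch of LLM-assisted unit-test generation, one of the task types the
# roadmap surveys. `call_llm` is a stand-in for any chat-completion client.

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (e.g. an OpenAI/Anthropic chat client)."""
    return "def test_slugify_replaces_spaces():\n    assert slugify('a b') == 'a-b'\n"

def generate_tests(function_source: str, framework: str = "pytest") -> str:
    prompt = (
        f"Write {framework} unit tests for the following Python function. "
        "Cover normal inputs, edge cases, and invalid inputs. Return only code.\n\n"
        f"{function_source}"
    )
    return call_llm(prompt)

if __name__ == "__main__":
    source = "def slugify(text: str) -> str:\n    return text.strip().lower().replace(' ', '-')\n"
    print(generate_tests(source))
```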

2. MMORE: Massive Multimodal Open RAG & Extraction

Retrieval‑Augmented Generation (RAG) models can drastically improve LLM performance by providing external knowledge, but building a scalable ingestion pipeline for diverse media is hard. Alexandre Sallinen and colleagues propose MMORE, an open‑source pipeline that ingests over fifteen file types—including PDFs, spreadsheets, audio, video and images—and transforms them into a unified representation. The architecture distributes processing across CPUs and GPUs, achieving a 3.8× speed‑up over single‑node baselines and 40% higher accuracy than Docling on scanned PDFs. MMORE integrates dense‑sparse retrieval, interactive APIs and batch endpoints, enabling RAG‑augmented medical QA systems to improve accuracy as retrieval depth increases. If you’re building enterprise‑scale RAG systems, MMORE offers a well‑tested foundation.
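As a rough illustration of the ingestion pattern (not MMORE’s actual API), the sketch below routes files to type‑specific processors in parallel and emits a unified text record. The processor names and the record schema are assumptions.

```python
# Hypothetical sketch of a multimodal ingestion dispatcher in the spirit of MMORE:
# route each file to a type-specific processor and emit a unified text record.
# This is NOT MMORE's API; names and the record schema are illustrative.
from concurrent.futures import ProcessPoolExecutor
from dataclasses import dataclass
from pathlib import Path

@dataclass
class Record:
    source: str
    modality: str
    text: str  # extracted text / transcript / OCR output

def process_pdf(path: Path) -> Record:
    # A real pipeline would run layout parsing plus OCR for scanned pages.
    return Record(str(path), "pdf", f"<extracted text from {path.name}>")

def process_audio(path: Path) -> Record:
    # A real pipeline would run a speech-to-text model here.
    return Record(str(path), "audio", f"<transcript of {path.name}>")

PROCESSORS = {".pdf": process_pdf, ".mp3": process_audio, ".wav": process_audio}

def _dispatch(path: Path) -> Record:
    return PROCESSORS[path.suffix.lower()](path)

def ingest(paths: list[Path], workers: int = 4) -> list[Record]:
    """Dispatch supported files to processors in parallel; skip unsupported types."""
    supported = [p for p in paths if p.suffix.lower() in PROCESSORS]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(_dispatch, supported))
```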

3. Prompts as Software Engineering Artifacts

As developers increasingly rely on LLMs to write code, prompts themselves become a kind of asset—yet we know little about how they are created and maintained. Hugo Villamizar et al. argue that prompts should be treated like other software artifacts, with systematic development and documentation processes. Their research agenda proposes studying prompt evolution, traceability and reuse; preliminary survey results with 74 professionals reveal that prompts are refined ad‑hoc and rarely reused. The authors call for guidelines and tools to manage prompts effectively in LLM‑integrated workflows. For teams scaling AI assistants, building prompt repositories and tracking prompt changes could become as important as version‑controlling code.
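A minimal sketch of what “prompts as artifacts” could look like in practice, assuming a hypothetical JSON schema with version, owner and changelog fields; this is an illustration of the idea, not a tool proposed by the authors.

```python
# Sketch of treating prompts as versioned artifacts: each prompt lives in the
# repository as structured data with an owner, version, and changelog, and is
# loaded by ID rather than hard-coded in application code. Schema is illustrative.
import json
from dataclasses import dataclass
from pathlib import Path

@dataclass(frozen=True)
class PromptArtifact:
    prompt_id: str
    version: str
    owner: str
    template: str
    changelog: list[str]

def load_prompt(repo_dir: Path, prompt_id: str) -> PromptArtifact:
    """Load a prompt definition from a JSON file tracked in version control."""
    data = json.loads((repo_dir / f"{prompt_id}.json").read_text())
    return PromptArtifact(**data)

def render(artifact: PromptArtifact, **variables: str) -> str:
    """Fill template placeholders, keeping the artifact itself immutable."""
    return artifact.template.format(**variables)

# Usage (assuming prompts/summarize_issue.json exists in the repo):
#   prompt = load_prompt(Path("prompts"), "summarize_issue")
#   text = render(prompt, issue_title="NPE in parser", issue_body="...")
```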

4. Software Dependencies 2.0: Reuse and Integration of Pre‑Trained Models

Pre‑trained models (PTMs) like Hugging Face transformers are increasingly woven into software, but they introduce dependencies beyond traditional libraries. Jerin Yasmin and colleagues analysed 401 repositories, drawn from a pool of roughly 28,000 candidate projects, that reuse PTMs to understand how developers manage these new “software dependencies 2.0”. They investigate how repositories document PTMs, how reuse pipelines are structured, and how PTMs interact with other learned components. The study highlights the need for tooling to track PTM versions, evaluate model compatibility and manage the maintainability risks that come with embedding AI models into production software.
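One concrete mitigation the study motivates is pinning PTM revisions the way library versions are pinned. The sketch below uses the `revision` argument of Hugging Face `transformers`; the manifest structure and the revision hash are placeholders, not part of the paper.

```python
# Sketch of treating a pre-trained model as an explicit, pinned dependency rather
# than an implicit download. The revision hash below is a placeholder; in practice
# you would record the exact model commit you validated.
from transformers import AutoModel, AutoTokenizer

PTM_MANIFEST = {
    # model id on the Hugging Face Hub -> pinned revision (placeholder hash)
    "bert-base-uncased": "abcdef0123456789abcdef0123456789abcdef01",
}

def load_pinned(model_id: str):
    """Load a PTM at the exact revision recorded in the project manifest."""
    revision = PTM_MANIFEST[model_id]
    tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
    model = AutoModel.from_pretrained(model_id, revision=revision)
    return tokenizer, model
```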

5. Generative Goal Modeling

Requirements engineering often involves interviewing stakeholders and manually deriving system goals. Ateeq Sharfuddin & Travis Breaux show that LLMs can automate this step. They use GPT‑4o to perform textual entailment on interview transcripts, extracting goals and building goal models. On 15 transcripts covering 29 domains, GPT‑4o matched 62% of human‑identified goals and traced goals back to the transcript with 98.7% accuracy. Human annotators rated the generated goal‑refinement relationships 72.2% accurate. These results suggest LLMs can support analysts in early requirements analysis, reducing manual effort while still requiring expert oversight.
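A hedged sketch of the general idea, not the authors’ exact prompt or pipeline: ask an LLM to return the goals a transcript entails together with the supporting sentence, so each goal stays traceable. The JSON schema and `call_llm` stand‑in are assumptions.

```python
# Illustrative sketch of LLM-driven goal extraction from an interview transcript.
# `call_llm` is a stand-in for a GPT-4o-style chat client; the output schema is
# an assumption, not the paper's format.
import json

def call_llm(prompt: str) -> str:
    """Placeholder LLM call; a real system would hit a chat-completion API."""
    return json.dumps([{"goal": "Reduce patient wait time", "evidence": "..."}])

def extract_goals(transcript: str) -> list[dict]:
    prompt = (
        "From the stakeholder interview transcript below, list the system goals "
        "it entails. Return a JSON array of objects with fields 'goal' and "
        "'evidence' (the transcript sentence supporting the goal).\n\n"
        f"{transcript}"
    )
    return json.loads(call_llm(prompt))
```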

6. Policy‑Driven Software Bill of Materials (SBOMs)

Software supply‑chain security has become a national priority, but how many open‑source projects actually provide SBOMs? Oleksii Novikov et al. mined GitHub and found that only 0.56% of popular repositories have policy‑driven SBOMs: documents created for transparency and compliance rather than academic demos. Among these SBOMs, the authors identified 2,202 unique vulnerabilities, and 22% of dependencies lacked licensing information. The study underscores the gap between policy mandates and actual adoption and highlights the need for tools that automatically generate and maintain SBOMs within continuous‑integration pipelines.
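As an example of the kind of CI check these findings motivate, the sketch below scans a CycloneDX‑style SBOM for components that carry no license information; the file path is an assumption, and this is not a tool from the paper.

```python
# Sketch of a CI check over a CycloneDX-style SBOM: flag components with no
# license information, mirroring the gap the study reports. Field names follow
# the CycloneDX JSON format; the file path is an assumption.
import json
from pathlib import Path

def components_missing_license(sbom_path: Path) -> list[str]:
    sbom = json.loads(sbom_path.read_text())
    missing = []
    for component in sbom.get("components", []):
        if not component.get("licenses"):
            missing.append(f"{component.get('name')}@{component.get('version')}")
    return missing

if __name__ == "__main__":
    flagged = components_missing_license(Path("sbom.cyclonedx.json"))
    if flagged:
        print("Components without license info:")
        print("\n".join(flagged))
```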

7. BloomAPR: Evaluating Program Repair with Bloom’s Taxonomy

Automated program repair (APR) solutions based on LLMs often benchmark on static datasets like Defects4J or SWE‑Bench, risking data contamination. Yinghang Ma et al. propose BloomAPR, a dynamic evaluation framework inspired by Bloom’s taxonomy. The framework assesses LLM‑powered APR solutions across cognitive levels (Remember, Understand, Apply, Analyze), revealing that existing systems fix up to 81.57% of bugs at the basic “Remember” level but only 13.46%–41.34% at higher analytic levels. Their case study shows that performance increases for synthetically generated bugs but drops when addressing real‑world projects. BloomAPR highlights the need for evolving benchmarks that test reasoning rather than memorisation.
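The core idea can be pictured with a toy transformation: perturb a known buggy function (here, by renaming identifiers) so that a memorised patch no longer applies verbatim. BloomAPR’s real scenario generation is far richer; this sketch only illustrates the principle.

```python
# Toy sketch of the idea behind a dynamic APR benchmark: rewrite a known buggy
# function so a repair tool cannot succeed by recalling a memorized patch verbatim.
import ast

class RenameIdentifiers(ast.NodeTransformer):
    def __init__(self, mapping: dict[str, str]):
        self.mapping = mapping

    def visit_Name(self, node: ast.Name) -> ast.AST:
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node: ast.arg) -> ast.AST:
        node.arg = self.mapping.get(node.arg, node.arg)
        return node

def perturb(buggy_source: str, mapping: dict[str, str]) -> str:
    """Return the same buggy logic with identifiers renamed."""
    tree = ast.parse(buggy_source)
    tree = RenameIdentifiers(mapping).visit(tree)
    return ast.unparse(tree)

BUGGY = "def mid(a, b):\n    return a + b / 2  # bug: missing parentheses\n"
print(perturb(BUGGY, {"a": "lo", "b": "hi"}))
```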

8. Classical Software Process Models for Coordinating LLM‑Based Multi‑Agent Systems

Multi‑agent LLM systems can generate complex software, but coordinating them remains challenging. Duc Minh Ha and collaborators evaluated how traditional software processes (Waterfall, V‑Model and Agile) can be repurposed as coordination scaffolds for LLM‑powered agent teams. In 132 runs across 11 diverse projects, process and model choice significantly affected outcomes. Waterfall was most efficient, V‑Model produced the most verbose code, and Agile yielded the highest code quality at a greater computational cost. The results suggest that process selection should align with project goals, whether efficiency, robustness or structured validation, and that combining process‑driven agent roles with prompt templates can improve multi‑agent collaboration.
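A hypothetical sketch of a Waterfall‑style scaffold, with each phase handled by an agent role and a strictly sequential hand‑off; the role prompts and the `call_llm` stand‑in are illustrative, not the paper’s implementation.

```python
# Hypothetical sketch of a Waterfall-style coordination scaffold: each phase is an
# LLM "agent" with its own role prompt, and each phase consumes the previous
# phase's output. Role prompts and `call_llm` are illustrative stand-ins.

def call_llm(system_prompt: str, task: str) -> str:
    """Placeholder for a chat-completion call with a role-specific system prompt."""
    return f"[{system_prompt.split(':')[0]} output for: {task[:40]}...]"

PHASES = [
    ("Analyst: turn the request into precise requirements.", "requirements"),
    ("Designer: produce a module-level design for the requirements.", "design"),
    ("Developer: implement the design as code.", "implementation"),
    ("Tester: write tests and report defects against the implementation.", "test report"),
]

def run_waterfall(user_request: str) -> dict[str, str]:
    artifacts: dict[str, str] = {}
    current_input = user_request
    for role_prompt, artifact_name in PHASES:
        output = call_llm(role_prompt, current_input)
        artifacts[artifact_name] = output
        current_input = output  # strictly sequential hand-off, no backtracking
    return artifacts

print(run_waterfall("Build a CLI that converts CSV files to JSON."))
```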

9. Investigating Traffic Accident Detection with Multimodal LLMs

Safety‑critical applications like traffic monitoring demand reliable AI. Ilhan Skender et al. explored whether multimodal LLMs (Gemini 1.5/2.0, Gemma 3, Pixtral) can detect accidents in images from infrastructure cameras. Using a simulated dataset built on CARLA, they combined object detection (YOLO), multi‑object tracking (Deep SORT) and instance segmentation (SAM) to enhance prompts. Pixtral achieved an F1‑score of 0.71 and 83% recall, while Gemini models improved precision with enhanced prompts but suffered F1 losses. The study demonstrates how integrating visual analytics with LLMs can improve real‑time accident detection and suggests directions for deploying AI in traffic management.
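The “prompt enhancement” step can be pictured as turning detector and tracker outputs into a textual scene summary that accompanies the camera frame. The detection format and prompt wording below are illustrative assumptions, not the paper’s exact pipeline.

```python
# Sketch of prompt enhancement with visual analytics: summarize detector and
# tracker outputs as text and prepend them to the accident-classification query
# sent with the camera frame. Detection fields and wording are illustrative.

def summarize_detections(detections: list[dict]) -> str:
    """Turn per-object records (e.g. from YOLO + Deep SORT) into a scene description."""
    lines = [
        f"- track {d['track_id']}: {d['label']} at {d['bbox']}, speed {d['speed_kmh']} km/h"
        for d in detections
    ]
    return "Detected objects:\n" + "\n".join(lines)

def build_prompt(detections: list[dict]) -> str:
    return (
        summarize_detections(detections)
        + "\n\nGiven the attached camera frame and the detections above, "
          "answer with 'accident' or 'no accident' and a one-sentence justification."
    )

example = [
    {"track_id": 3, "label": "car", "bbox": (412, 220, 520, 310), "speed_kmh": 0},
    {"track_id": 7, "label": "car", "bbox": (430, 250, 600, 360), "speed_kmh": 4},
]
print(build_prompt(example))
```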

10. What Were You Thinking? Refactoring Motivations via LLMs

Understanding why developers refactor code helps improve maintenance and tool support. Mikel Robredo et al. used an LLM to analyse version‑control data and extract developers’ stated reasons for refactoring. The model matched human judgments in 80% of cases, but its agreement with motivations reported in earlier literature was only 47%. LLMs often enriched motivations with additional details, emphasising readability and simplification. These findings suggest LLMs can assist in categorising refactoring reasons and highlight the value of combining AI‑derived explanations with traditional software metrics to prioritise technical debt and maintain design integrity.
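A rough sketch of the mining‑plus‑classification loop, assuming a simple motivation taxonomy and a placeholder LLM call; the authors’ actual taxonomy, prompts and data sources differ.

```python
# Sketch of mining refactoring motivations: pull commit messages that mention
# refactoring from a local Git repository and ask an LLM to assign a motivation
# category. The category list and `call_llm` stand-in are assumptions.
import subprocess

CATEGORIES = ["readability", "simplification", "performance", "duplication removal", "other"]

def refactoring_commits(repo_path: str, limit: int = 200) -> list[str]:
    """Return recent commit subjects whose text mentions 'refactor'."""
    out = subprocess.run(
        ["git", "-C", repo_path, "log", f"-{limit}", "--pretty=%s", "--grep=refactor", "-i"],
        capture_output=True, text=True, check=True,
    )
    return [line for line in out.stdout.splitlines() if line.strip()]

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call."""
    return "readability"

def classify_motivation(message: str) -> str:
    prompt = (
        f"Commit message: {message!r}\n"
        f"Which motivation best fits? Options: {', '.join(CATEGORIES)}. "
        "Answer with one option."
    )
    return call_llm(prompt)
```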

Beyond the Papers: Connecting the Dots

Together, these studies reveal a field moving rapidly from proof‑of‑concept to deployment. They demonstrate that AI engineering now spans everything from requirements extraction and testing to program repair, dependency management and safety‑critical applications. The emphasis on empirical methods—large‑scale data analyses, mixed‑methods studies and controlled experiments—reflects a maturing discipline focused on reliability, maintainability and safety.

For more AI‑engineering insights, check out our previous articles.
