Mining software repositories

Within software engineering, the mining software repositories [1] (MSR) field [2] analyzes the rich data available in software repositories, such as version control repositories, mailing list archives, and bug and issue tracking systems, to uncover interesting and actionable information about software systems, projects, and software engineering.

Definition

Herzig and Zeller define "mining software archives" as a process to "obtain lots of initial evidence" by extracting data from software repositories. They further define "data sources" as product-based artifacts such as source code, requirements artifacts, and version archives, and they claim that these sources are unbiased, but noisy and incomplete. [3]

Techniques

Coupled change analysis

The idea behind coupled change analysis is that developers frequently change code entities (e.g., files) together, whether to fix defects or to introduce new features. These couplings between entities are often not made explicit in the code or in other documents; developers who are new to a project, in particular, do not know which entities need to be changed together. Coupled change analysis aims to extract such couplings from a project's version control system: from the commits and the timing of changes, one can identify which entities frequently change together. This information can then be presented to a developer who is about to change one of the entities, to support them in their further changes. [4]
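As an illustration, the following Python sketch counts how often pairs of files are modified in the same commit using plain git log output. It is a minimal sketch, assuming git is on the PATH and the script runs inside a working copy; it is not the approach of the cited work.

    # Count co-changing file pairs from the commit history.
    import subprocess
    from collections import Counter
    from itertools import combinations

    def co_change_counts(repo_path="."):
        # "@@commit" marks the start of each commit; --name-only lists
        # the files that commit touched, one per line.
        log = subprocess.run(
            ["git", "-C", repo_path, "log", "--name-only",
             "--pretty=format:@@commit"],
            capture_output=True, text=True, check=True,
        ).stdout
        pairs = Counter()
        for block in log.split("@@commit"):
            files = sorted({line.strip() for line in block.splitlines()
                            if line.strip()})
            for a, b in combinations(files, 2):
                pairs[(a, b)] += 1
        return pairs

    # Print the ten most frequently coupled file pairs.
    for (a, b), n in co_change_counts().most_common(10):
        print(f"{n:4d}  {a}  <->  {b}")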

Commit analysis

There are many different kinds of commits in version control systems, e.g. bug-fix commits, new-feature commits, and documentation commits. To make data-driven decisions based on past commits, one needs to select subsets of commits that meet a given criterion; this can be done based on the commit message. [5]
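For example, a minimal keyword-based classifier over commit messages might look as follows. The categories and keyword patterns are illustrative assumptions, not the classification scheme of the cited work.

    # Classify commits by matching keywords in the commit message.
    import re

    RULES = [
        ("bug fix", re.compile(r"\b(fix(es|ed)?|bug|defect|patch)\b", re.I)),
        ("feature", re.compile(r"\b(add(s|ed)?|implement(s|ed)?|new)\b", re.I)),
        ("docs",    re.compile(r"\b(doc(s|umentation)?|readme|comment)\b", re.I)),
    ]

    def classify(message):
        # First matching rule wins; unmatched messages fall through.
        for label, pattern in RULES:
            if pattern.search(message):
                return label
        return "other"

    print(classify("Fix null dereference in parser"))  # bug fix
    print(classify("Add incremental build support"))   # feature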

Documentation generation

It is possible to generate useful documentation by mining software repositories. For instance, Jadeite computes usage statistics to help newcomers quickly identify commonly used classes. [6]
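The underlying popularity statistic can be approximated very simply. The sketch below is not Jadeite's implementation; it merely counts how often each class is imported across a corpus of Java source files, and the corpus path is a placeholder.

    # Count class imports across a corpus of Java files.
    import re
    from collections import Counter
    from pathlib import Path

    IMPORT = re.compile(r"^\s*import\s+([\w.]+)\s*;", re.M)

    def class_usage(corpus_dir):
        counts = Counter()
        for src in Path(corpus_dir).rglob("*.java"):
            counts.update(IMPORT.findall(src.read_text(errors="ignore")))
        return counts

    # The five most commonly imported classes in the corpus.
    for cls, n in class_usage("./corpus").most_common(5):
        print(n, cls)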

Data & tools

The primary mining data comes from version control systems. Early mining experiments were performed on CVS repositories. [7] Researchers then extensively analyzed SVN repositories. [8] Today, Git repositories are dominant. [9] Depending on the nature of the data required (size, domain, processing), one can download data directly from one of these sources. However, data governance and data collection for the sake of building large language models have changed the rules of the game by integrating web crawlers that obtain data from multiple sources and domains.
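As a starting point, structured commit records can be extracted from a Git repository with plain git log. This is a minimal sketch, assuming git is on the PATH; the field selection and record layout are our own choices.

    # Extract commit metadata as dictionaries, one per commit.
    import subprocess

    SEP = "\x1f"  # ASCII unit separator, unlikely to appear in commit metadata

    def commits(repo_path="."):
        # %H = hash, %an = author name, %aI = ISO 8601 author date, %s = subject
        fmt = SEP.join(["%H", "%an", "%aI", "%s"])
        out = subprocess.run(
            ["git", "-C", repo_path, "log", f"--pretty=format:{fmt}"],
            capture_output=True, text=True, check=True,
        ).stdout
        for line in out.splitlines():
            sha, author, date, subject = line.split(SEP, 3)
            yield {"sha": sha, "author": author, "date": date, "subject": subject}

    for c in commits():
        print(c["date"], c["sha"][:8], c["subject"])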

See also

Code refactoring
Version control
Code review
Service-oriented architecture
Continuous integration
Software visualization
Concept drift
Application discovery and understanding
Software regression
Search-based software engineering
DevOps
Application programming interface
Software analytics
Software map
Software diagnosis
Software intelligence
Fan-out (software)
Automatic bug fixing
Ahmed E. Hassan
Code ownership

References

  1. Hassan, Ahmed E. (2008). "The road ahead for mining software repositories". 2008 Frontiers of Software Maintenance. IEEE. pp. 48–57.
  2. The Working Conference on Mining Software Repositories, the main software engineering conference in the area.
  3. Herzig, Kim; Zeller, Andreas (2011). "Mining your own evidence". Making Software. Sebastopol, CA: O'Reilly. pp. 517–529.
  4. Gall, H.; Hajek, K.; Jazayeri, M. (1998). "Detection of logical coupling based on product release history". Proceedings of the International Conference on Software Maintenance. pp. 190–198. CiteSeerX 10.1.1.199.7754. doi:10.1109/icsm.1998.738508. ISBN 978-0-8186-8779-2.
  5. Hindle, Abram; German, Daniel M.; Godfrey, Michael W.; Holt, Richard C. (2009). "Automatic classification of large changes into maintenance categories". 2009 IEEE 17th International Conference on Program Comprehension. pp. 30–39. doi:10.1109/ICPC.2009.5090025. ISBN 978-1-4244-3998-0.
  6. Stylos, Jeffrey; Faulring, Andrew; Yang, Zizhuang; Myers, Brad A. (2009). "Improving API documentation using API usage information". 2009 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC). pp. 119–126. doi:10.1109/VLHCC.2009.5295283. ISBN 978-1-4244-4876-0.
  7. Canfora, G.; Cerulo, L. (2005). "Impact Analysis by Mining Software and Change Request Repositories". 11th IEEE International Software Metrics Symposium (METRICS'05). p. 29. doi:10.1109/METRICS.2005.28. ISBN 978-0-7695-2371-2.
  8. d'Ambros, Marco; Gall, Harald; Lanza, Michele; Pinzger, Martin (2008). "Analysing Software Repositories to Understand Software Evolution". Software Evolution. pp. 37–67. doi:10.1007/978-3-540-76440-3_3. ISBN 978-3-540-76439-7.
  9. Kalliamvakou, Eirini; Gousios, Georgios; Blincoe, Kelly; Singer, Leif; German, Daniel M.; Damian, Daniela (2014). "The promises and perils of mining GitHub". Proceedings of the 11th Working Conference on Mining Software Repositories (MSR 2014). pp. 92–101. doi:10.1145/2597073.2597074. ISBN 978-1-4503-2863-0.