Big data in software engineering: A systematic literature review

Purpose of Study: We investigate big data studies that use batch and/or streaming data generated during the software development life cycle. All phases of application development are in scope, including but not limited to elicitation, requirements analysis, design, software implementation, version control management, unit / functional / regression / automated / performance / stress testing, release management, application log monitoring, application usage monitoring, user complaint management, security and compliance management, and software problem management. Methods: We use the systematic literature review methodology established in Software Engineering research to find and analyse the related studies published from January 2010 to October 2015. We synthesize the quantitative and qualitative outputs of the selected papers and report the results. Findings and Results: Overall, studies in this area are scarce. Some areas, such as Software Quality, Development, Project Management and Human Computer Interaction, have received relatively more attention, whereas others, such as Deployment, Requirements Engineering, Release Management and Mobile Applications, have received relatively less. Conclusions & Recommendations: More studies are required to identify the use cases, data attributes, measurements and platform requirements, especially in the areas identified as under-studied. A holistic big data perspective is needed to support software engineering ecosystems in large and complex enterprises.


Introduction
Knowledge discovery from big data has great potential for businesses, scientific studies, governments and other organisations. It presents many new opportunities and research avenues [1]. Big Data also enables synergistic inter-disciplinary studies [2]. The 5Vs (Volume, Velocity, Variety, Veracity and Value) of big data [3] now apply to the data generated within the software engineering ecosystem, that is, throughout the software development life cycle from elicitation to deployment and field monitoring until the software is phased out. For example, source code is a basic artefact in the software engineering domain, and Google declared that it had 2 billion lines of code [4]. This quantity gives an idea of how much software related big data a large enterprise sits on. Another example is the artefacts, changes and process data produced during the life cycle of software projects conducted in a large scale enterprise. In a recent case study, the total number of projects (small and mid-size) finished yearly in a large telecommunications company is given as 5350 [5]. Basic use cases that generate data in the context of these projects, for instance people assignments or timesheet data, may yield an order of magnitude more records than there are projects. Code changes may add two orders of magnitude. Log, transaction, usage or incident data may exceed the number of projects by three, four or more orders of magnitude. Therefore, software engineering practitioners have already entered the era of big data. We observe this phenomenon in the organisation of leading technology companies such as Microsoft and Google as well. Microsoft has a research team conducting empirical software engineering research, and Google employs at least 100 engineers to develop its tools using data mining techniques [6].
In this study, we conducted a systematic literature review (SLR) covering the intersection of Big Data and the concepts surrounding the software engineering discipline. The rest of the paper is organized as follows: the research method is detailed in the second section; the third section discusses the results obtained from the extracted data and is followed by the conclusions.

Research Method
We conducted the SLR following Kitchenham and Charters' de facto review guideline for software engineering [7]. This methodology has been used in more than 1,600 studies (by Google Scholar's citation count) in the last eight years. The idea of employing systematic literature reviews originates from evidence-based medicine; Kitchenham and Charters customised the method for the Software Engineering domain [7]. The method has three main stages: planning, conducting and documenting the review. The review steps are as follows: defining the research questions, designing the search, conducting the search, selecting the studies, assessing their quality and synthesizing the extracted data.

Research Questions
In this SLR we seek answers to the following questions:

Research Question 1:
In which software engineering areas do Big Data and Software Engineering interact, and to what extent? By Big Data we mean related keywords such as data mining, analytics, streaming data, complex event processing, knowledge discovery, operational intelligence etc. This research question aims to find the areas (requirements engineering, performance testing etc.) that benefit from Big Data research. The state of practice for the Software Engineering practitioner community and research opportunities for researchers will also be identified.

Research Question 2:
Which software engineering artefacts are used for Big Data processing? What are the most frequently used artefacts? We want to discover the types of data used in Software Engineering Big Data research and whether there is a lack of holistic data usage or not.

Search Strategy
Having defined the research questions in the previous section, we designed a search string based on them. To cover all relevant studies, keywords and terms regarding Software Engineering and Big Data were consolidated to define the search string. Alternative terms are connected using the OR Boolean operator to obtain wide coverage. The search statement has two main segments. The first segment consists of two sub-segments: the union of the basic Big Data related terms and the union of the Software Engineering keywords. The intersection (AND) of these two sub-segments yields the first set of results for Big Data in Software Engineering research. The second segment addresses interdisciplinary terms. The two segments are then combined with the OR Boolean operator to obtain the union of their results. As a result, we generated the following search string: [("Data Mining" OR "Big Data" OR "Streaming data" OR "complex event processing" OR "CEP" OR "Statistical Methods" OR "Anomaly Detection" OR "Knowledge Discovery") AND ("Software Engineering" OR "SE" OR "SD" OR "Software Development" OR "Software Implementation" OR "SDLC" OR "Software Development Life Cycle" OR "Requirements Engineering" OR "Software Design" OR "Software Architecture" OR "DevOps" OR "Continuous Delivery" OR "Continuous Integration" OR "Project Management" OR "Application Monitoring" OR "Software Measurement" OR "Software Size" OR "Software Metric" OR "Release Management" OR "Change Management" OR "Version Control" OR "Usability" OR "Software Usage" OR "Appplication Usage Monitoring" OR "HCI" OR "Human Computer Interaction" OR "Software Testing" OR "Test Automation" OR "Automated Test" OR "Unit Test" OR "Performance Test" OR "Stress Test" OR "Software Quality" OR "Incident Management" OR "Complaint Management" OR "Software Defect Prediction" OR "Software Log Mining" OR "Software Fault Detection" OR "Software Security" OR "Software Fraud detection" OR "Transaction mining" OR "Software Integration" OR "Static Code Analysis" OR "Application Development Life Cycle Management" OR "ADLM")] OR ("Operational Intelligence" OR "Operational Analytics" OR "Software Analytics" OR "Software Archaeology" OR "Digital Archaeology")
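The two-segment structure of the search string can be sketched in Python as follows. This is a minimal illustration that uses abbreviated term lists; the function names `or_group` and `build_search_string` are hypothetical and are not part of the original study.

```python
def or_group(terms):
    """Join quoted terms with the OR operator and wrap them in parentheses."""
    return "(" + " OR ".join(f'"{term}"' for term in terms) + ")"


def build_search_string(big_data_terms, se_terms, interdisciplinary_terms):
    """Combine the two segments: [(Big Data) AND (SE)] OR (interdisciplinary)."""
    # First segment: intersection of the two sub-segments.
    first_segment = f"[{or_group(big_data_terms)} AND {or_group(se_terms)}]"
    # Second segment: interdisciplinary terms, searched on their own.
    second_segment = or_group(interdisciplinary_terms)
    return f"{first_segment} OR {second_segment}"


# Abbreviated term lists, taken from the full search string above.
big_data_terms = ["Data Mining", "Big Data"]
se_terms = ["Software Engineering", "Software Testing"]
interdisciplinary_terms = ["Software Analytics", "Operational Intelligence"]

search_string = build_search_string(big_data_terms, se_terms, interdisciplinary_terms)
print(search_string)
```

Keeping the term lists separate from the assembly logic makes it straightforward to add or remove keywords without touching the Boolean structure.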

Literature Resources
We used Google Scholar as the primary resource for three reasons. First, Google Scholar's coverage of studies published in English is very high (87%) [8]. Second, the subject of the study is interdisciplinary and Google Scholar is a convenient platform for finding the related research. Third, other electronic databases have an important disadvantage: the search string needs to be adapted to the specific requirements of each database, which can be very time consuming for researchers. Google Scholar has some important limitations as well [9]. It limits the search string to 256 characters; a longer string is silently truncated without warning [9]. To overcome this limitation we split the original search string into 17 shorter subqueries.
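One possible way to split a long Boolean query into subqueries that respect a character limit is sketched below. The paper does not state how its 17 subqueries were constructed, so the greedy packing strategy, the abbreviated term lists and the function names are assumptions made for illustration; the number of subqueries produced here will not match the study.

```python
def or_join(terms):
    """Quote each term and join the terms with the OR operator."""
    return " OR ".join(f'"{term}"' for term in terms)


def pack_terms(terms, budget):
    """Greedily group terms so that each OR-joined group fits in `budget` characters.

    A single term longer than the budget still forms its own group.
    """
    groups, current = [], []
    for term in terms:
        candidate = current + [term]
        if current and len(or_join(candidate)) > budget:
            groups.append(current)
            current = [term]
        else:
            current = candidate
    if current:
        groups.append(current)
    return groups


def split_into_subqueries(left_terms, right_terms, limit=256):
    """Build '(left) AND (right)' subqueries, each at most `limit` characters long."""
    overhead = len("() AND ()")
    budget = (limit - overhead) // 2  # characters available to each side
    return [
        f"({or_join(left)}) AND ({or_join(right)})"
        for left in pack_terms(left_terms, budget)
        for right in pack_terms(right_terms, budget)
    ]


# Abbreviated term lists (assumed subset of the full search string).
big_data_terms = ["Data Mining", "Big Data", "Streaming data",
                  "complex event processing", "CEP", "Statistical Methods",
                  "Anomaly Detection", "Knowledge Discovery"]
se_terms = ["Software Engineering", "Software Development",
            "Requirements Engineering", "Software Testing", "Software Quality",
            "Release Management", "Version Control", "Project Management"]

queries = split_into_subqueries(big_data_terms, se_terms)
for query in queries:
    print(len(query), query)
```

Because every group on one side is paired with every group on the other, the union of the subquery results covers the same term combinations as the single long query.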
Our search covers the time frame from January 2010 to November 2015. We aimed to cover relevant papers from the recent past. We also added a filter on the content search: we used the "allintitle" operator to restrict keyword matching to paper titles, with the aim of increasing relevancy.

Study Selection Process
We obtained 326 studies by executing our 17 search strings. In the first filtration phase, we quickly scanned the abstracts of all the resulting papers and eliminated papers based on inclusion and exclusion criteria. After the first filtration, 112 papers remained for the second phase. In the second phase, the full content of the remaining 112 papers was scanned and assessed according to the quality criteria given in the next section. The 32 papers with the highest quality assessment scores were selected. These papers are listed in sequence in the reference section.
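The selection funnel can be checked with a few lines of arithmetic. The sketch below uses only the counts reported in this section; the variable names are illustrative.

```python
retrieved = 326    # studies returned by the 17 search strings
after_first = 112  # papers remaining after the first filtration
selected = 32      # final primary studies after quality assessment

eliminated = retrieved - after_first
eliminated_share = eliminated / retrieved        # share removed in the first filtration
second_phase_share = selected / after_first      # share of second-phase papers selected
overall_share = selected / retrieved             # share of all retrieved studies selected

print(f"Eliminated in first filtration: {eliminated} ({eliminated_share:.1%})")
print(f"Selected from second phase: {second_phase_share:.1%}")
print(f"Selected overall: {overall_share:.1%}")
```

The first share rounds to the figure of about 66% reported in the conclusions.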

Study Quality Assessment
We specified the following quality assessment criteria to determine the final output of the survey, namely the 32 papers cited in section 2.4. Each criterion is worth up to 5 points, so the maximum possible score is 20 and the minimum is 0.
Criterion 1: Study contribution is clearly described.
Criterion 2: Artefacts and methods used in the study are clearly described.
Criterion 3: Empirical validation is performed.
Criterion 4: The results and applications are described and discussed thoroughly.
Each candidate paper was scored against these criteria. The highest score was 17 and the lowest score was 5. All primary studies that scored above 12 points were selected.
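The scoring scheme can be expressed as a small Python sketch. It assumes integer scores per criterion and reads "above 12" as strictly greater than 12; neither detail is stated explicitly in the text, and the class and field names are hypothetical.

```python
from dataclasses import dataclass

MAX_POINTS_PER_CRITERION = 5
SELECTION_THRESHOLD = 12  # assumption: papers scoring strictly above this are selected


@dataclass
class QualityAssessment:
    contribution: int          # Criterion 1: contribution clearly described
    artefacts_methods: int     # Criterion 2: artefacts and methods clearly described
    empirical_validation: int  # Criterion 3: empirical validation performed
    results_discussion: int    # Criterion 4: results and applications discussed

    def total(self):
        """Sum the four criterion scores after checking each is within 0..5."""
        scores = (self.contribution, self.artefacts_methods,
                  self.empirical_validation, self.results_discussion)
        for score in scores:
            if not 0 <= score <= MAX_POINTS_PER_CRITERION:
                raise ValueError(f"score out of range: {score}")
        return sum(scores)

    def is_selected(self):
        """Return True when the overall score exceeds the selection threshold."""
        return self.total() > SELECTION_THRESHOLD


# Hypothetical scores, for illustration only.
example = QualityAssessment(contribution=4, artefacts_methods=4,
                            empirical_validation=3, results_discussion=3)
print(example.total(), example.is_selected())
```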

Data Extraction and Data Synthesis
To obtain the data needed to answer our research questions and to compile some additional statistics, we extracted the following data from the papers: Title, Quality Criteria 1 Score, Quality Criteria 2 Score, Quality Criteria 3 Score, Quality Criteria 4 Score, Overall Quality Score, Year of Publication, Type, Country, SE Sub Domain, Artefacts, Objective, Data Processing Algorithms, Batch/Streaming, Tool/Technology. Next, the extracted data was synthesized using the graphics and tables presented in the following section.
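An extraction form of this kind could be represented as a simple record type and tallied by attribute, as sketched below. The record covers only a subset of the extracted fields, and the three example records are hypothetical placeholders, not data from the reviewed papers.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ExtractedPaper:
    title: str
    year: int
    paper_type: str        # e.g. conference or journal
    se_sub_domain: str
    artefacts: list = field(default_factory=list)
    processing_mode: str = ""  # "Batch" or "Streaming"


def count_by(papers, attribute):
    """Count the papers for each distinct value of the given attribute."""
    return Counter(getattr(paper, attribute) for paper in papers)


# Hypothetical placeholder records, for illustration only.
papers = [
    ExtractedPaper("Placeholder A", 2012, "Conference", "Software Quality",
                   ["source code"], "Batch"),
    ExtractedPaper("Placeholder B", 2014, "Journal", "Software Quality",
                   ["bug reports"], "Batch"),
    ExtractedPaper("Placeholder C", 2015, "Conference", "Project Management",
                   ["operational data"], "Streaming"),
]

print(count_by(papers, "se_sub_domain"))
print(count_by(papers, "year"))
```

Tallies of this kind are the raw material for per-area and per-year charts.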

Data Results
In this section, we discuss the answers to the two research questions defined in section 2.1. We also present some additional statistics extracted from the papers that remained after the first filtration and after the second filtration (the selected final primary studies).

Research Question 1
The question was "In which software engineering areas do Big Data and Software Engineering interact, and to what extent?" To answer this question, we classified the papers during the data extraction phase based on the software engineering phases or keywords they focus on. In Figure 2

Research Question 2
The question was "Which software engineering artefacts are used for Big Data processing? What are the most frequently used artefacts?" To answer this question, we classified the papers based on the software engineering artefacts they use. Figure 3 shows the number of studies for each artefact. Source code and source code changes, bug related data and operational data are the most used artefacts in both paper sets. The usage of all other artefact types is not significant. The average number of artefacts per paper is 1.16 in the paper set after the first filtration and 1.06 in the second paper set. This implies that the majority of the papers address their problems using a single artefact. This finding is also consistent with Figure 7. That is, the majority of the papers lack a holistic perspective. More studies are required to correlate several software engineering artefacts to support high level decision making.
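The average number of artefacts per paper is the total number of artefact usages divided by the number of papers. The sketch below illustrates the calculation with a hypothetical distribution for a 32-paper set: 30 papers using one artefact and 2 papers using two. This distribution is an assumption chosen because it is consistent with the reported average of 1.06; it is not the data extracted in the review.

```python
def average_artefacts_per_paper(artefacts_by_paper):
    """Return the mean number of artefact types used per paper."""
    total_usages = sum(len(artefacts) for artefacts in artefacts_by_paper)
    return total_usages / len(artefacts_by_paper)


# Hypothetical distribution: 30 single-artefact papers, 2 two-artefact papers.
artefacts_by_paper = (
    [["source code"]] * 30
    + [["source code", "bug related data"]] * 2
)

average = average_artefacts_per_paper(artefacts_by_paper)
print(f"{len(artefacts_by_paper)} papers, average artefacts per paper: {average}")
```

An average this close to 1 can only arise when nearly every paper relies on a single artefact type, which is the basis for the observation that most studies lack a holistic perspective.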

Additional Statistics
The trend in the number of papers over the last six years is shown in Figure 1. For the first paper set, Figure 4 presents the paper type distribution. Conference and journal papers make up the majority of the publications, with slightly more conference papers than journal papers. In Figure 5 and Figure 6, paper numbers by

Conclusions
In this study, we investigated the current state of research in Big Data and Software Engineering using the systematic literature review methodology. We selected the primary studies from 326 relevant studies published in the last six years (2010-2015). In the first filtration, we eliminated about 66% of the extracted studies using inclusion and exclusion criteria. The 32 primary studies with the highest quality assessment scores were selected out of the remaining 112 papers.
Primary studies in the literature are scarce. However, some areas have been studied relatively more. Software Quality, Development, Project Management, Human Computer Interaction, Software Evolution and Software Visualisation are the most active research topics in software engineering big data studies. Source code and source code changes, bug related data and operational data are the most used artefacts in the studies. Deployment, Requirements Engineering, Release Management and Mobile Applications are the areas with fewer studies. Studies lack a holistic perspective in terms of the artefacts used. More studies are required to correlate several software engineering artefacts to support efficient decision making in large and complex enterprises.