-
Notifications
You must be signed in to change notification settings - Fork 13
/
Copy path02-testing-analytics.Rmd
333 lines (295 loc) · 22.7 KB
/
02-testing-analytics.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
# Testing Analytics
## Motivation
Testing is an important aspect in software engineering, as it forms the first line of
defense against the introduction of software faults Pinto et al. [@pinto2012understanding].
However, in
practice it seems that not all developers test actively. In this chapter we will survey on the use
of testing and the tools that make this possible. We will also look into the future development of
tools that is done or required in order to improve testing practices in real-world applications.
Testing is not the holy grail for completely removing all bugs from a program but it can decrease
the chances
for a user to encounter a bug. We believe that extra research is needed to ease the life of
developers by making testing more efficient, easier to maintain and more effective. Therefore, we
wanted to write a survey on the testing behavior, current practices and future developments of
testing.
In order to perform our survey, we formulated three Research Questions (RQs):
* **RQ1** How do developers currently test?
* **RQ2** What state of the art technologies are being used?
* **RQ3** What future developments can be expected?
In this chapter we will first elaborate on the research protocol that was used in order to find
papers
and extract information for the survey. Second, the actual findings for each of the research
questions will be explained.
## Research Protocol
For this chapter, we applied Kitchenham's survey method [@kitchenham2004procedures] was applied. For this method, a
protocol has to be
specified. This protocol is defined for the research questions given above. Below the inclusion
and exclusion criteria are given, which helped finding papers. After these criteria,
the actual search for papers is described. The papers that were found are listed and after they
are tested against the criteria that are given. The data that is extracted from these papers are
listed afterward. Some papers that were left out will be listed and the reasons for leaving them out
will be given to make clear why some papers do not meet the desired requirements.
Each of the papers was tested using our inclusion and exclusion criteria. These criteria
were introduced to make sure the papers have the information required to answer the RQs while also
being relevant with respect to their quality and age. Below a list of inclusion and exclusion
criteria is given. In general, for all criteria, the exclusion criteria take precedence over
inclusion criteria.
The following inclusion and exclusion criteria were used:
* Papers published before 2008 are excluded from the research, unless a reference/citation is used
for an unchanged concept.
* Papers referring to less than 15 other papers, excluding self-references, are excluded from the
research.
* Selected papers should have an abstract, introduction and conclusion section.
* Papers stating the developers’ testing behavior are included.
* Papers stating the developers’ problems related to testing are included.
* Papers stating the technologies related to testing analytics, which developers use, are included.
* Papers writing about the expected advantage of current findings in testing analytics are
included.
* Papers with recommendations for future development in the software testing field are included.
The papers used in this chapter were found by using a given initial seed of papers (query defined
below as 'Initial Paper Seed'). From this initial seed of papers we used the keywords used by those
papers to construct queries. Additionally, the references (‘referenced by’) and the citations
(‘cited in’) of the papers were used to find papers. The query row of the tables describing the
references, as found below, indicates how a paper was found. For queries the default search sites
were Scopus ^[https://www.scopus.com/], Google Scholar ^[https://scholar.google.com/] and Springer
^[https://www.springer.com].
The keywords used to construct queries in order to find papers were: software, test*, analytics,
test-suite, evolution, software
development, computer science, software engineering, risk-driven, survey software testing
The table below describes for each paper, which query resulted in which paper being found.
Each of the papers is categorized with a corresponding research question. In the
table below, the categories per paper were added based on their general topic.
These broad topics will be assigned to a corresponding research question. Categorizations are based
on the bullet points extracted from each paper. These bullet points can be found in the appendix of
this chapter in section _'Extracted paper information'_.
|Category |Reference | Query| Relevant to |
|---------------------|------------------|-----------------------------------------------------------|----|
|Co-evolution | Greiler et al. @greiler2013 | In ‘cited by’ of “Understanding myths and realities of test-suite evolution” on Scopus | RQ2, RQ3 |
|Co-evolution | Hurdugaci and Zaidman @hurdugaci2012 | Keywords: Maintain developer tests, ‘cited by’ in “Studying the co-evolution of production and test code in open source and industrial developer test processes through repository mining” on IEEE | RQ2 |
|Co-evolution | Marsavina et al. @marsavina2014 | Google Scholar keywords: Maintain developer tests, in ‘cited by’ of “Aiding Software Developers to Maintain Developer Tests” on IEEE| RQ1 |
|Co-evolution | Zaidman et al. @zaidman2011studying | Initial Paper Seed| RQ1 |
|Production evolution | Eick et al. @eick2001 | Referenced by: @leung2015testing | Discarded |
|Production evolution | Leung and Lui @leung2015testing | Initial Paper Seed | RQ3 |
|Risk-driven testing | Atifi et al. @atifi2017 | In ‘cited by’ of “Risk-driven software testing and reliability” | RQ2, RQ3 |
|Risk-driven testing | Hemmati and Sharifi @hemmati2018 | In ‘cited by’ of “Test case analytics: Mining test case traces to improve risk-driven testing” | RQ3 |
|Risk-driven testing | Noor and Hemmati @noor2015test | Initial Paper Seed | RQ2, RQ3 |
|Risk-driven testing | Schneidewind @schneidewind2007 | Scopus query: risk-driven testing | RQ3 |
|Risk-driven testing | Vernotte et al. @vernotte2015 | Scopus query: “risk-driven” AND testing | RQ2, RQ3 |
|Test evolution |Bevan et al. @bevan2005 | Referenced by: @pinto2012understanding | Discarded |
|Test evolution |Mirzaaghaei et al. @supportingtestsuite | Google Scholar query: test-suite evolution | RQ2, RQ3 |
|Test evolution | Pinto et al. @pinto2012understanding | Initial Paper Seed | RQ1 |
|Test evolution |Pinto et al. @pinto2013 | Referenced by: @pinto2012understanding | RQ1 |
|Test generation | Bowring and Hegler @bowring2014obsidian| Springer: Reverse search on “Automatically generating maintainable regression unit tests for programs” | RQ2, RQ3 |
|Test generation | Dulz @dulz2013model | Scopus query: “software development” AND Computer Science AND Software Engineering | RQ2 |
|Test generation | Robinson et al. @robinson2011 | Referenced by @supportingtestsuite | RQ2 |
|Test generation | Shamshiri et al. @shamshiri2018automatically | Google Scholar query: Automatically generating unit tests | RQ3 |
|Testing practices | Beller et al. @beller2015 | In ‘cited by’ of “Understanding myths and realities of test-suite evolution”. | RQ1 |
|Testing practices | Beller et al. @beller2017developer | Initial Paper Seed | RQ1 |
|Testing practices | Garousi and Zhi @GAROUSI20131354 | Google Scholar query: Survey software testing | RQ1 |
|Testing practices | Moiz @moiz2017uncertainty | Springer query: software testing | RQ3 |
## Results
In this section the research questions will be answered. To answer them, information from
the relevant papers are aggregated. The answers to each research questions are summarized in the
conclusion.
### (RQ1) How do developers currently test?
To answer RQ1, "How do developers currently test?", we first outline general test practices, then
discuss the co-evolution of test and production code and finally, look into the use of Test Driven
Development among developers.
#### How do we test?
For the quality of code, test coverage is a popular metric. Alternatives are, for example,
acceptance tests, the number of defects in the last week, or defects per Line of Code (LOC)
@GAROUSI20131354. However, code coverage might not be the best indicator for the extensiveness of
testing. For example, according to Beller et al. @Beller:2015:DT:2819009.2819101 a code coverage of
75% can possibly be
reached with only spending less than a tenth of the total development time on testing. Another
concern of using test coverage as a metric is the concept of treating the metric @bouwers2012a
, where developers try to uplift the value of code coverage by hitting many lines with only a few
test cases. Marsavina et al. @marsavina2014 observed that test cases were rarely
updated when changes related to attributes or methods in the production code were made. Possible
explanations for this
are that these changes were not significant or the tests were too simple and were likely to pass.
This also fits with the findings of Romano et al. @ROMANO201764, who
claim that "[d]evelopers write
quick-and-dirty production code to pass the tests, do not update their tests often, and ignore
refactoring."
Besides older tests rarely being updated for changed code, even new tests do not necessarily have
the purpose of validating new production code lines. Pinto et al.
@pinto2012understanding observed
that a significant number of new tests that are added were not necessarily
added to cover new code, but rather to exercise the changed parts of the code after the program is
modified. This finding fits with the observation of Marsavina et al. @marsavina2014,
who found that test cases are
created or deleted in order to address the modified branches whenever numerous
condition related changes are conducted in the production code base.
Older production code lines, therefore, may stay
untouched by any test cases. Lines not covered by the test (such as any traditional code coverage
tool would find them) should be
indicated and signaled to the developer. Therefore, developers should be aware of the fact that
they did not cover some lines of their production code with any tests. It seems to be a deliberate
action by most developers to not cover older production code lines. These lines might be
'too hard' to test, other lines may be easier to test, or developers do not seem to see the
relevance of testing these uncovered lines of code. However, the most commonly used coverage
metrics are branch coverage and conditional coverage @GAROUSI20131354. As both branch coverage
and conditional coverage require multiple different conditions for if-statements, it may possibly
be that the absolute number of missed lines of production code by tests is very low but rather the
number of missed conditions is higher.
#### Co-evolution
In a case study conducted by Zaidman et al. @zaidman2011studying, there was no evidence found for
an increased
activity of testing before a release.
However, the study detected periods of increased test writing activity.
These increased activities of writing test cases were found to be after
longer periods of writing production code @zaidman2011studying. With a longer timespan of not
writing tests, it can be concluded for these cases that the production code and test code do not
gracefully co-evolve @zaidman2011studying @marsavina2014.
#### Test-Driven Development (TDD)
We found different definitions for TDD across multiple studies. According to Zaidman et al.
@zaidman2011studying, evidence of TDD was found where test code was committed alongside production
code, meaning that the methodology of TDD is used when production code was written before the
respective test code. This is in contrast with the originally proposed constraint
by Beck @beck2003test,
where a line of production code should only be written after a failing automated test was written
in advance.
The confusion about the definition of TDD is also supported by the finding of
Beller et al. @beller2015, where programmers who claim they practice TDD neither
follow it strictly nor
practice it for all of their modification. A survey by Garousi and Zhi @GAROUSI20131354
on 196 respondents
(amongst them managers and developers) indicated that a ratio of 3:1 use Test-last development
and Test-driven development respectively. This ratio is in contrast with the numbers found by
Beller et al. @beller2015; only 1.7% of the observed developers seemed to follow the strict
TDD definition, where
most of these developers only practice this strict definition in less than 20% of their time.
However, it is important to mention that the survey done by Garousi and Zhi @GAROUSI20131354 only
surveyed the subjects,
which might allow the confusion for the definition of TDD to play a major role in the results.
### (RQ2) What state of the art technologies are being used?
We will cover two research fields regarding testing analytics: test evolution and generation, and
risk-driven testing.
#### Test Evolution and Generation
Pinto et al. @pinto2012understanding found the investigation of automated test repairing is not a
promising research avenue, as these techniques -- despite their name -- would require manual guidance,
which could end up
being similar to traditional refactoring tools. Nonetheless, more research is performed in this
field since then.
An approach for automatically repairing and generating test cases during software evolution is
proposed by Mirzaaghaei et al. @supportingtestsuite. This approach uses information available in
existing test cases,
defines a set of heuristics to repair test cases invalidated by changes in the software, and
generate new test cases for evolved software. This properly repairs 90% of the compilation errors
addressed and covers the same amount of instructions. The results show that the approach can
effectively maintain evolving test suites and perform well compared to competing approaches.
While fully automated test suite generation can not replace human testing entirely yet, Bowring and
Hegler @bowring2014obsidian introduced a
tool that generates the templates for tests, which guarantees compilation, supports exception
handling and finds a suitable location for the test. Developers still need to fix the test oracles
themselves, but the template is there. The technique looks at the context in order to decide what
template to use. Robinson et al. @robinson2011 created a regression unit tests generation tool.
It is a suite of
techniques for enhancing an existing unit test generation system. The authors performed experiments using
an industrial system. The generated tests from these experiments achieved good coverage and mutation
kill score, were readable by the product developers and required few edits as the system under test
evolved. Dulz @dulz2013model found that by directly adjusting specific probability values in the
usage
profile of a Markov chain usage model, it is relatively easy to generate abstract test suites for
different user classes and test purposes in an automated approach. By using proper tools, such as the
TestUS Testplayer, even less experienced test engineers will be able to efficiently generate
abstract test cases and graphically assess quality characteristics of different test suites.
Hurdugaci and Zaidman @hurdugaci2012 introduce TestNForce, a tool to help
developers identify
unit tests that need to be altered and executed after code change (Visual Studio only).
#### Risk-driven Testing
The paper by Vernotte et al. @vernotte2015 introduces and reports on an original tool-supported,
risk-driven security
testing process called Pattern-driven and Model-based Vulnerability Testing. This fully automated
testing process, relying on risk-driven strategies and Model-Based Testing (MBT) techniques, aims
to improve the capability of detecting various Web application vulnerabilities, in particular
SQL injections, Cross-Site Scripting, and Cross-Site Request Forgery. An empirical evaluation
shows that this novel process is appropriate for automatically generating and executing
risk-driven vulnerability test cases and is promising to be deployed for large-scale Web
applications.
A new risk measure is defined by Noor and Hemmati @noor2015test, which assigns a risk factor to a
test case if it
is similar to a failing test case from history. The new risk measure is by far more effective in
identifying failing test cases compared to the traditional risk measure. Using this method for
identifying test cases with a high risk factor, risky test cases can for example be run in the
background while developing code, to find faults earlier. Furthermore, prioritizing risky tests
while running the entire test-suite could make the suite detect failing tests earlier and the
developer can start fixing the faulty code right away.
### (RQ3) What Future Developments Can Be Expected?
This section will elaborate on which future developments can be expected in the field of software
analytics.
#### Co-Evolution and Test Generation
For understanding how test- and production code co-evolve and how tests can be generated to support
developers, several studies have been conducted
[@pinto2012understanding; @marsavina2014; @zaidman2011studying]. Additionally, a tool has been made
in order to analyze and, consequently,
better understand test-suite evolution @pinto2013. For the time being the practical
implications of
this subtopic have mainly been sought in the repairing and generation of tests.
According to Pinto et al. @pinto2012understanding test repairs occur often enough to justify the
development and research for automated repair techniques. Mirzaaghaei et al.
@supportingtestsuite argue that evolving test cases is an expensive and time-consuming activity,
for which automated approaches reduce the pressure on developers. Shamshiri et al.
@shamshiri2018automatically argue that automated generation of unit tests does not end up
generating realistic tests and that the effectiveness of developers writing manual tests is equal
to developers using automatically generated tests. Therefore, they call for the use of more
realistic tests. This suggests that automated test generation is still a topic of future interest,
which
will likely be researched in order to find a way to generate realistic tests.
#### Risk-driven Testing
Risk-driven testing is an area of recent attention. Researchers have been looking for
methods that can either detect potential risks within the same project @noor2015test
@hemmati2018
@vernotte2015 or that can detect risks based on models carried over from one project to another
@leung2015testing @atifi2017. These techniques have been implementing history based prediction
approaches.
In the future, we can expect more interest and research into risk-driven testing as allocating
testing activities effectively will remain important due to testing efforts and developer time
being expensive. This area will likely stay in its research phase for the next couple of years as
effective measures for risk prediction are still being researched. This goes for measures within
the same project and cross-project prediction. Given that the currently researched techniques regard
history based implementations, it is likely that these techniques will be subject to further
research
later on.
#### Testing Practices
Research of several papers @GAROUSI20131354 @beller2017developer @beller2015 has indicated
that testing of any
form is not as widely practiced as the status quo suggests. How the current state of the practice
will change depends on various developments within the field. Tools will be created to assist the
developer in writing quality code and tests, such as TestEvoHound as suggested by M. Greiler.
@greiler2013. As
automated test generation becomes more effective this may reduce the need for developers to spend a
lot of time on writing and maintaining tests. With the development of risk-driven testing,
developers may also be able to focus on the parts that are likely to be the most important to
address, which could lead to better time allocation. The status quo for how much time is to be
expected to be spent on testing may also change, given automated test repair and generation
techniques become effective and accessible.
## Conclusion
In this chapter, three different research questions about software testing analytics were answered.
(RQ1) How do developers currently test? (RQ2) What state of the art technologies are being used? (RQ3)
What future developments can be expected?
Regarding the current testing practices of developers (RQ1), we found that developers do not seem to
update their tests very often and when they do, it is because of a changed condition in production
code lines. Furthermore, older uncovered production code lines are not likely to be covered in the
end. Developers, thus, seem to ignore indications of their code coverage tools or do not seem to use
any code coverage tool at all. Furthermore, developers do not seem to put a lot of effort into
making sure the co-evolution of their production- and test code is done gracefully. They do, on the
other hand, make sure their test code compiles when production code classes have been removed.
However, testing is mostly done in longer periods of increased testing. The methodology of TDD
also seems to be a confusing term for developers, as there is not enough clear guidance in the
implementation of it. The actual ratio of TLD and TDD is, therefore, unknown but can be guessed with
great certainty to be much lower for TDD than for TDL.
The current state of the art in testing analytics (RQ2) consists of research in co-evolution and
generation of tests, and risk-driven testing. Approaches are proposed for automatically repairing
and generating test cases during software evolution. While fully automated test suite generation is
not there yet, a tool is introduced that generates the templates for tests, which guarantees
compilation, supports exception handling and finds a suitable location for the test. In the field of
risk-driven testing, new risk measures are defined which make prioritizing certain high-risk tests
able while running the entire test-suite, which could make the suite detect failing tests earlier.
For future developments (RQ3), further research can be expected on the front of automated test generation.
Even with some discussion regarding the effectiveness of test generation, the field currently agrees
that conducting research in order to find, especially, realistic ways of generating tests is
worthwhile. We also found that risk-driven testing has been given more attention in the form of
research recently. This subtopic is still in its research phase. It can be expected that research on
the front of history based risk prediction methods will continue.