An experimental study shows that developers often misperceive which software testing techniques are most effective for them. While techniques differ in actual defect detection effectiveness, participants' perceptions of which technique served them best do not reliably match reality.

Study Finds Software Testers Often Misjudge Which Techniques Work Best

Abstract

1 Introduction

2 Original Study: Research Questions and Methodology

3 Original Study: Validity Threats

4 Original Study: Results

5 Replicated Study: Research Questions and Methodology

6 Replicated Study: Validity Threats

7 Replicated Study: Results

8 Discussion

9 Related Work

10 Conclusions and References


4 Original Study: Results

Of the 32 students participating in the experiment, nine did not complete the questionnaire and were removed from the analysis. Table 9 shows the balance of the experiment before and after the questionnaire was submitted. We can see that G6 is the most affected group, with four missing participants.

Appendix B shows the analysis of the experiment. The results show that program and technique are statistically significant (and therefore influence effectiveness), while group and the technique-by-program interaction are not. Among the techniques, EP shows the highest effectiveness, followed by BT and then CR. These results are interesting, as all techniques are able to detect all defects. Additionally, more defects are found in ntree than in cmdline and nametbl, where the same number of defects is found.

Note that ntree is the program used on the first day, has the highest Halstead metrics, and is neither the smallest program nor the one with the lowest complexity. These results suggest that:

– There is no maturation effect. The program with the highest effectiveness is the one used on the first day.

– There is no interaction-with-selection effect. Group is not significant.

– Mortality does not affect the experimental results. The analysis technique used (linear mixed-effects models) is robust to lack of balance; a minimal sketch of such a model appears after this list.

– Order of training could be affecting results. The highest effectiveness is obtained with the last technique taught, while the lowest is obtained with the first. This suggests that techniques taught last are more effective than techniques taught first, perhaps because participants remember the most recently taught techniques better.

– Results cannot be generalised to other subject types.
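Concretely, the effectiveness analysis can be sketched as a linear mixed-effects model in statsmodels. This is a minimal illustration, assuming a long-format dataset with hypothetical column names (subject, group, technique, program, effectiveness); the paper's exact model specification may differ.

```python
# Minimal sketch of a linear mixed-effects model for effectiveness.
# Assumes one row per subject/technique application; column names are
# hypothetical, not taken from the paper.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("experiment_long.csv")  # columns: subject, group, technique, program, effectiveness

# Fixed effects for technique, program, their interaction, and group;
# a per-subject random intercept handles the repeated measures, which
# also makes the model tolerant of the unbalanced groups noted above.
model = smf.mixedlm(
    "effectiveness ~ C(technique) * C(program) + C(group)",
    data=df,
    groups=df["subject"],
)
result = model.fit(reml=True)
print(result.summary())
```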

4.1 RQ1.1: Participants’ Perceptions

Table 10 shows the percentage of participants who perceive each technique to be the most effective. We cannot reject the null hypothesis that the frequency distribution of the responses to the questionnaire item (Using which technique did you detect most defects?) follows a uniform distribution (χ²(2, N=23)=2.696, p=0.260). This means that the number of participants perceiving a particular technique as the most effective cannot be considered different across the three techniques. Our data do not support the conclusion that some techniques are more frequently perceived as being the most effective than others.
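As an illustration, this uniformity test can be run with scipy. The counts below are hypothetical (chosen only to sum to N=23 across the three techniques); the actual frequencies are those behind Table 10.

```python
# Chi-square goodness-of-fit test of perceived-best-technique counts
# against a uniform distribution. Counts are hypothetical, not Table 10's.
from scipy.stats import chisquare

observed = [10, 9, 4]  # illustrative counts for EP, BT, CR, summing to N=23
stat, p = chisquare(observed)  # expected frequencies default to uniform
print(f"chi2(2, N=23) = {stat:.3f}, p = {p:.3f}")
```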

4.2 RQ1.2: Comparing Perceptions with Reality

Table 11 shows the value of kappa along with its 95% confidence interval (CI), overall and for each technique separately. We find that all kappa values for the questionnaire item (Using which technique did you detect most defects?) are consistent with lack of agreement (κ<0.4, poor). Although the upper bounds of the 95% CIs show agreement, 0 belongs to every 95% CI, meaning that agreement by chance cannot be ruled out. Therefore, our data do not support the conclusion that participants correctly perceive the most effective technique for them.

It is worth noting that agreement is higher for the code review technique (the upper bound of the 95% CI in this case shows excellent agreement). This could be attributed to participants being able to remember the actual number of defects identified during code reading, whereas for the testing techniques they only wrote the test cases. Moreover, participants do not know the number of defects injected in each program.
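For intuition, here is a minimal sketch of the agreement computation, assuming two label vectors (perceived versus actually most effective technique per participant) and a simple percentile bootstrap for the CI; the labels are hypothetical, and the paper's CI procedure may differ.

```python
# Cohen's kappa between perceived and actual most effective technique,
# with a bootstrap 95% CI. Labels are hypothetical, not the paper's data.
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
perceived = np.array(["EP", "BT", "CR", "EP", "BT"] * 4 + ["CR", "EP", "BT"])  # N=23
actual    = np.array(["BT", "BT", "EP", "EP", "CR"] * 4 + ["CR", "BT", "EP"])

kappa = cohen_kappa_score(perceived, actual)

# Percentile bootstrap over participants.
n = len(perceived)
boot = [
    cohen_kappa_score(perceived[idx], actual[idx])
    for idx in (rng.integers(0, n, n) for _ in range(2000))
]
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"kappa = {kappa:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```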

As lack of agreement cannot be ruled out, we examine whether the perceptions are biased. The results of the Stuart-Maxwell test show that the null hypothesis of marginal homogeneity cannot be rejected (χ²(2, N=23)=1.125, p=0.570). This means that we cannot conclude that perceptions and reality are distributed differently. Taking into account the results reported in Section 4.1, this suggests that, in reality, no technique can be considered the most effective a different number of times.

Additionally, the results of the McNemar-Bowker test show that the null hypothesis of symmetry cannot be rejected (χ²(3, N=23)=1.286, p=0.733). This means that we cannot conclude that there is directionality when participants' perceptions are wrong. Together, these results suggest that participants are not more mistaken about one technique than about the others: techniques are not differently subject to misperceptions.
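Both tests operate on the square table of perceived versus actual most effective technique. A minimal sketch with statsmodels, using hypothetical cell counts (the real table follows from the paper's data):

```python
# Stuart-Maxwell (marginal homogeneity) and McNemar-Bowker (symmetry)
# tests on a perceived-vs-actual contingency table. Counts are hypothetical.
import numpy as np
from statsmodels.stats.contingency_tables import SquareTable

# Rows: perceived most effective (EP, BT, CR); columns: actually most effective.
table = np.array([
    [4, 3, 2],
    [2, 3, 3],
    [1, 2, 3],
])  # sums to N=23

st = SquareTable(table, shift_zeros=False)
print(st.homogeneity(method="stuart_maxwell"))  # marginal homogeneity test
print(st.symmetry(method="bowker"))             # symmetry test
```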

4.3 RQ1.3: Comparing the Effectiveness of Techniques

We now check whether the misperceptions could be due to participants detecting the same number of defects with all three techniques, which would make it impossible for them to make the right choice. Table 12 shows the value and 95% CI of Krippendorff's α, overall and for each pair of techniques, for all participants and for every design group (participants who applied the same technique to the same program) separately, and Table 13 shows the value and 95% CI of Krippendorff's α, overall and for each program/session.

For the values computed over all participants, we can rule out agreement, as the upper bounds of the 95% CIs are consistent with lack of agreement (α<0.4), except for EP-BT and nametbl-ntree, for which the upper bounds are consistent with fair to good agreement. However, even in these two cases, 0 belongs to the 95% CIs, meaning that agreement by chance cannot be ruled out.

This means that participants do not obtain effectiveness values so similar across the different techniques (or across the different programs) that it would be difficult for them to discriminate among techniques or programs.

Furthermore, the α values are negative, which indicates disagreement. This is good for the study: it means that participants should be able to discriminate among techniques, and the lack of agreement in perceptions cannot be attributed to it being impossible to discriminate among techniques. As regards the results per group, although the α values are negative, the 95% CIs are too wide to be reliable (due to the small sample size). Note that in most cases they range from disagreement at the lower bound (α<-0.4) to agreement at the upper bound (α>0.4).
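For reference, Krippendorff's α can be computed with the krippendorff Python package, treating the three techniques as "observers" rating each participant's effectiveness. This is a sketch with hypothetical values, assuming the package's reliability_data layout (observers in rows, units in columns):

```python
# Krippendorff's alpha over effectiveness scores. Rows are the three
# techniques acting as "observers"; columns are participants.
# Values are hypothetical percentages, not the paper's data.
import numpy as np
import krippendorff

reliability_data = np.array([
    [70, 85, 60, 90, 75],  # EP effectiveness per participant
    [55, 80, 40, 65, 50],  # BT
    [30, 45, 70, 35, 60],  # CR
])

alpha = krippendorff.alpha(
    reliability_data=reliability_data,
    level_of_measurement="ratio",
)
print(f"Krippendorff alpha = {alpha:.3f}")  # negative values indicate disagreement
```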

4.4 RQ1.4: Cost of Mismatch

Table 14 and Figure 1 show the cost of mismatch. The EP technique has fewer mismatches than the other two, and its mean and median mismatch costs are smaller. The BT technique, on the other hand, has more mismatches and a higher dispersion. The results of the Kruskal-Wallis test reveal that we cannot reject the null hypothesis that the techniques have the same mismatch cost (H(2)=0.685, p=0.710). This means that we cannot claim a difference in mismatch cost between the techniques. The estimated mean mismatch cost is 31 percentage points (pp); the median is 26pp.
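A minimal sketch of this comparison with scipy, assuming one list of mismatch costs (in percentage points) per perceived-best technique; the values are hypothetical:

```python
# Kruskal-Wallis H test comparing mismatch cost across techniques.
# Cost lists are hypothetical percentage-point values, not the paper's data.
from scipy.stats import kruskal

cost_ep = [10, 25, 30]         # participants who perceived EP as most effective
cost_bt = [5, 40, 60, 35, 20]  # BT
cost_cr = [15, 30, 45, 25]     # CR

h, p = kruskal(cost_ep, cost_bt, cost_cr)
print(f"H(2) = {h:.3f}, p = {p:.3f}")
```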

These results suggest that the mismatch cost is not negligible (31pp) and is not related to the technique perceived as most effective. However, note that the existence of very high mismatches and the small number of data points could be affecting these results.

4.5 RQ1.5: Expected Loss of Effectiveness

Table 15 shows the average loss of effectiveness that should be expected in a project, where typically different testers participate and there would therefore be both matches and mismatches. Again, the results of the Kruskal-Wallis test reveal that we cannot reject the null hypothesis that the techniques have the same expected reduction in effectiveness for a project (H(2)=1.510, p=0.470). This means we cannot claim a difference in project effectiveness loss between techniques. The mean expected loss in effectiveness in a project is estimated at 15pp.

These results suggest that the expected loss in effectiveness in a project is not negligible (15pp) and is not related to the technique perceived as most effective. However, we must note again that the existence of very high mismatches for BT and the small number of data points could be affecting these results.

4.6 Findings of the Original Study

Our findings are:

– Participants should not base their decisions on their own perceptions, as their perceptions are not reliable and have an associated cost.

– We have not been able to find a bias towards one or more particular techniques that might explain the misperceptions.

– Participants should have been able to identify the differences in effectiveness among the techniques.

– Misperceptions cannot be put down to experience. The possible drivers of these misperceptions require further research. Note that these findings cannot be generalised to developers other than those with the same profile as the participants in this study.

:::info Authors:

  1. Sira Vegas
  2. Patricia Riofrío
  3. Esperanza Marcos
  4. Natalia Juristo

:::

:::info This paper is available on arXiv under the CC BY-NC-ND 4.0 license.

:::
