Evaluating Features for Machine Learning Detection of Order- and Non-Order-Dependent Flaky Tests (ICST 2022 - Research Papers)

Who

Owain Parry, Gregory Kapfhammer, Michael Hilton, Phil McMinn

Track

ICST 2022 Research Papers

When

Tue 5 Apr 2022, 16:15 - 16:30 (GMT+02:00), at Margaret Hamilton, session ICST AI I. Chair(s): Raihana Ferdous

Abstract

Flaky tests are test cases that can pass or fail without code changes. They cause major problems in software development such as wasting the time of developers and obstructing continuous integration. The research community has presented automated techniques for detecting flaky tests, though many involve repeated test executions and significant instrumentation and therefore may be both intrusive and expensive. While this motivates researchers to evaluate machine learning models for detecting flaky tests, research on the features used to encode a test case is limited. Without further study on this topic, machine learning models cannot perform to their full potential in this domain. Previous studies also exclude a specific, yet prevalent and problematic, category of flaky tests: order-dependent (OD) flaky tests. Because of this, previous research only addresses a subset of the problem of flaky test detection. This paper presents a new feature set for encoding test cases. We compared our new feature set to a previously established feature set when evaluating the detection performance of 54 pipelines of data preprocessing, data balancing, and machine learning models for detecting both non-order-dependent (NOD) and OD flaky tests. As our data set, we used the test suites of 26 Python projects, consisting of over 67,000 test cases. This paper’s empirical study reveals a number of findings, including (1) a 13% increase in overall F1 score when detecting NOD flaky tests using our new feature set; (2) a 17% increase in overall F1 score when detecting OD flaky tests using our new feature set; and (3) the most impactful metrics of our new feature set for detecting both types of flaky test.
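The abstract reports its results as F1 scores. As a reminder of what that metric measures, here is a minimal sketch that computes F1 for a binary flaky / not-flaky classifier. The function name and the example labels are illustrative assumptions and are not taken from the paper's data or code.

```python
def f1_score(y_true, y_pred):
    """Return the F1 score for binary labels (1 = flaky, 0 = not flaky)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # flaky, predicted flaky
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # not flaky, predicted flaky
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # flaky, predicted not flaky
    if tp == 0:
        # No true positives: treat F1 as 0 to avoid dividing by zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for eight test cases (made up for illustration).
example_true = [1, 1, 1, 0, 0, 0, 0, 0]
example_pred = [1, 1, 0, 1, 0, 0, 0, 0]
example_f1 = f1_score(example_true, example_pred)
```

Because flaky tests are typically a small minority of a test suite, F1 is a more informative summary than plain accuracy: a classifier that labels every test as not flaky would score high accuracy but an F1 of zero.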

Owain Parry

The University of Sheffield

United Kingdom

Gregory Kapfhammer

Allegheny College

United States

Michael Hilton

Carnegie Mellon University

United States

Phil McMinn

University of Sheffield