ManyBugs and IntroClass Benchmarks for Automated Repair of C Programs

ManyBugs and IntroClass are sets of C programs that contain defects and are associated with test suites. They are intended to support reproducible and comparative studies of research techniques in automatic patch generation for defect repair. They are released under a BSD license.

Publications

We would be most appreciative if work that uses either of these datasets cites the following:
The ManyBugs and IntroClass Benchmarks for Automated Repair of C Programs, by Claire Le Goues, Neal Holtschulte, Edward K. Smith, Yuriy Brun, Premkumar Devanbu, Stephanie Forrest, and Westley Weimer. IEEE Transactions on Software Engineering (TSE), vol. 41, no. 12, December 2015, pp. 1236-1256, DOI: 10.1109/TSE.2015.2454513.

Downloads

The overview README replicates much of this information in flat text and also includes the license terms for this dataset.

The GenProg v2.0 README describes how to build and run repair, compiled from the GenProg source code, which is available on the GenProg website; this README is the same as the one that appears with the code. AE and TRPAutoRepair run from the same executable with different flags. Every GenProg run prints the values of all of its command-line arguments, which simplifies reproduction.

ManyBugs

README explains the ManyBugs scenarios and results in detail. Note that a newer, Docker-based method for running ManyBugs scenarios now exists; see https://github.com/squaresLab/BugZoo. Also see below for links to alternative harnesses, though note that our results were obtained with the harnesses we link here.

ManyBugs/ contains the ManyBugs programs and defects and the baseline results of running GenProg 2.0, TRPAutoRepair, and AE on them. The results and benchmark scenarios are in different directories:

results/ is further subdivided into AE, TRPAutoRepair, and GenProg 2.0 results. The results consist of packages that contain the output debug files of the associated run; each debug file contains the command-line arguments that resulted in the run. These packages also contain repaired files, if applicable.
scenarios/ contains 185 tarballs of the defect scenarios, one per defect per program. The naming scheme indicates the subject program and the revision pair corresponding to the defect in question. Each tarball contains a test.sh script through which every "positive" and "negative" test can be executed individually.
bug-data.csv contains categorization information about the defects in the set.
The AutomatedRepairApplicabilityData repository contains data on characteristics of the ManyBugs defect scenarios, such as defect priority, size of developer-written patch, time it took developer(s) to fix the defect, etc., as well as scripts for deriving these data.
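As a rough illustration of how the per-scenario test.sh script might be driven from a harness, the following Python sketch runs a list of tests one at a time inside an unpacked scenario directory. It rests on assumptions that are not stated on this page: that test.sh accepts a test identifier as its first argument (the identifiers `p1` and `n1` used below are hypothetical), and that an exit status of 0 means the test passed. The ManyBugs README is the authority on the actual interface.

```python
import subprocess


def run_tests(scenario_dir, test_ids, timeout=300):
    """Run each test through test.sh and map test id -> passed (bool).

    Assumptions (not confirmed by this page): test.sh takes the test id
    as its first argument, and exit status 0 means the test passed.
    """
    results = {}
    for test_id in test_ids:
        try:
            proc = subprocess.run(
                # Invoked through sh here; switch to bash if the script needs it.
                ["sh", "test.sh", test_id],
                cwd=scenario_dir,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout,
            )
            results[test_id] = proc.returncode == 0
        except subprocess.TimeoutExpired:
            # A test that exceeds the time limit is counted as failing.
            results[test_id] = False
    return results
```

For example, `run_tests("path/to/unpacked/scenario", ["p1", "n1"])` would return a dictionary recording which of the two tests passed, under the assumptions above.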

IntroClass

README explains the IntroClass scenarios and results in detail.

IntroClass/ contains the IntroClass programs and defects, the baseline results of running GenProg 2.0 and AE on those defects, and scripts to help you use the scenarios. IntroClass contains a directory for each of the IntroClass programs; each program directory contains support for running your own experiments as well as the results from our baseline experiments.

IntroClass.tar.gz is a downloadable single-file archive of the IntroClass benchmarks.
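Before unpacking the archive, it can be useful to see which program directories it contains. The sketch below lists the directories found at a given depth of a tar archive without extracting anything. The function name and the assumption that program directories sit at depth 2 (one level below a single top-level directory) are illustrative only; check the IntroClass README for the actual layout.

```python
import tarfile


def directories_at_depth(archive_path, depth):
    """Return the sorted directory paths found at the given depth of a tar archive.

    A path counts as a directory if the archive marks it as one or if
    some member lies beneath it. Nothing is extracted.
    """
    found = set()
    with tarfile.open(archive_path, "r:*") as archive:
        for member in archive.getmembers():
            # Drop empty and "." components such as a leading "./".
            parts = [p for p in member.name.split("/") if p not in ("", ".")]
            if len(parts) > depth or (len(parts) == depth and member.isdir()):
                found.add("/".join(parts[:depth]))
    return sorted(found)
```

Calling `directories_at_depth("IntroClass.tar.gz", 2)` would then list one entry per program directory, assuming the layout described above.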

Virtual Machines

VirtualMachines/ and its subdirectories contain the ISOs for the two versions of Fedora that appear on the virtual machines. The baseline ManyBugs results were obtained on a machine running Fedora 13. We also provide a Fedora 20 image that is suitable for running many of the scenarios.

README describes how to import and use the virtual machine images, either on AWS or using VirtualBox, including information on how to access the former from the EC2 cloud.
genprog_icse2012_aws contains a raw data dump of the Fedora 13 Amazon Machine Image.
genprog_icse2012_virtualbox/ contains a VirtualBox image matching the Fedora 13 Amazon Machine Image.

Errata and updates

We update the scenarios and associated datasets from time to time. As they are brought to our attention, we hope to link to test suites and defects contributed by others and to add more categorization information to the datasets. We welcome feedback and errata from users who find problems or have questions about these benchmarks and the associated information, as we hope these artifacts can evolve.

With respect to the ManyBugs scenarios, we rely heavily on the default input/output test case behavior as defined by the original developers of the underlying projects, because we lack the context and domain knowledge required to perfectly and programmatically capture original developer intent. Although we have performed several sanity checks as described in the article, there is evidence that some of these test suites (like many test suites in practice) are inadequate for making strong claims about correctness; libtiff's test suite is particularly weak. We are particularly grateful to Qi et al. for studying some of these issues in depth. In their replication package for Kali, they provide alternative harnesses for executing some of these scenarios, with additional correctness checks based on their intuitions and insights regarding desired program behavior.

In 2017, we published a note on the creation and use of the ManyBugs benchmark, particularly addressing some of the issues outlined by Qi et al., including results using other harnesses for executing the scenarios: Clarifications on the Construction and Use of the ManyBugs Benchmark, by Claire Le Goues, Yuriy Brun, Stephanie Forrest, and Westley Weimer. IEEE Transactions on Software Engineering (TSE), vol. 43, no. 11, November 2017, pp. 1089-1090, DOI: 10.1109/TSE.2017.2755651.

A newer, Docker-based method for running ManyBugs scenarios is now available; see https://github.com/squaresLab/BugZoo.