ManyBugs and IntroClass Benchmarks for Automated Repair of C Programs
ManyBugs and IntroClass are sets of C programs that contain defects and are associated with test suites. They are intended to support reproducible and comparative studies of research techniques in automatic patch generation for defect repair. They are released under a BSD license.
We would be most appreciative if work that uses either of these datasets cites the following:
The ManyBugs and IntroClass Benchmarks for Automated Repair of C Programs, by Claire Le Goues, Neal Holtschulte, Edward K. Smith, Yuriy Brun, Premkumar Devanbu, Stephanie Forrest, and Westley Weimer. IEEE Transactions on Software Engineering (TSE), vol. 41, no. 12, December 2015, pp. 1236-1256, DOI: 10.1109/TSE.2015.2454513.
The overview README replicates much of this information in flat text and also includes the license terms for this dataset.
The GenProg v2.0 README describes how to build and run the repair tool from the GenProg source code, which is available on the GenProg website; this README is the same as the one that appears with the code. AE and TRPAutoRepair run from the same executable with different flags. Every GenProg run enumerates the values of all command-line arguments, which simplifies reproduction.
ManyBugs/ contains the ManyBugs programs and defects and the baseline results of running GenProg 2.0, TRPAutoRepair, and AE on them. The results and benchmark scenarios are in different directories:
- results/ is further subdivided into AE, TRPAutoRepair, and GenProg 2.0 results. The results consist of packages that contain the output debug files of the associated run; each debug file contains the command-line arguments that resulted in the run. These packages also contain repaired files, if applicable.
- scenarios/ contains 185 tarballs of the defect scenarios, one per defect per program. The naming scheme indicates the subject program and the revision pair corresponding to the defect in question. Each tarball contains a test.sh file that has all "positive" and "negative" tests individually executable.
- bug-data.csv contains categorization information about the defects in the set.
- The AutomatedRepairApplicabilityData repository contains data on characteristics of the ManyBugs defect scenarios, such as defect priority, size of developer-written patch, time it took developer(s) to fix the defect, etc., as well as scripts for deriving these data.
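The naming scheme for the scenarios/ tarballs can be unpacked programmatically. The following is a minimal sketch, not part of the distributed scripts. It assumes names of the form `<program>-bug-<optional label>-<revision>-<revision>.tar.gz`; that layout, the `Scenario` type, and the `parse_scenario_name` function are illustrative assumptions, so check them against the actual file names before relying on them.

```python
from typing import NamedTuple


class Scenario(NamedTuple):
    """Pieces of a scenario tarball name (hypothetical helper type)."""
    program: str      # subject program, e.g. the part before "-bug-"
    label: str        # anything between "-bug-" and the revision pair, may be empty
    first_rev: str    # first revision identifier in the pair
    second_rev: str   # second revision identifier in the pair


def parse_scenario_name(filename: str) -> Scenario:
    """Split a tarball name into program and revision pair.

    Assumed layout: <program>-bug-[<label>-]<rev>-<rev>.tar.gz
    """
    stem = filename
    for suffix in (".tar.gz", ".tgz"):
        if stem.endswith(suffix):
            stem = stem[: -len(suffix)]
            break

    program, sep, rest = stem.partition("-bug-")
    if not sep or not program:
        raise ValueError(f"no '-bug-' marker in {filename!r}")

    # The revision pair is taken to be the last two hyphen-separated fields.
    parts = rest.rsplit("-", 2)
    if len(parts) < 2:
        raise ValueError(f"no revision pair in {filename!r}")

    label = parts[0] if len(parts) == 3 else ""
    return Scenario(program, label, parts[-2], parts[-1])
```

Calling `parse_scenario_name` on each file in scenarios/ and grouping by `program` gives a quick per-program defect count that can be cross-checked against bug-data.csv.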
README explains the IntroClass scenarios and results in detail.
IntroClass/ contains the IntroClass programs and defects, the baseline results of running GenProg 2.0 and AE on those defects, and scripts to help you use the scenarios. IntroClass contains a directory for each of the IntroClass programs; each program directory contains support for running your own experiments as well as the results from our baseline experiments.
IntroClass.tar.gz is a downloadable single-file archive of the IntroClass benchmarks.
VirtualMachines/ and its subdirectories contain the ISOs for the two versions of Fedora that appear on the virtual machines. The baseline ManyBugs results were obtained on a machine running Fedora 13. We also provide a Fedora 20 image that is suitable for running many of the scenarios.
- README describes how to import and use the virtual machine images, either on AWS or using VirtualBox, including information on how to access the former from the EC2 cloud.
- genprog_icse2012_aws contains a raw data dump of the Fedora 13 Amazon Machine Image.
- genprog_icse2012_virtualbox/ contains a VirtualBox image matching the Fedora 13 Amazon Machine Image.
Errata and updates
We update the scenarios and associated datasets from time to time. We hope to link to the test suites and defects of others, and to add more categorization information to the datasets as it is brought to our attention. We welcome feedback and errata from users who find problems or have questions about these benchmarks and associated information, as we hope these artifacts can evolve.
Note that the ManyBugs scenarios rely heavily on the default input/output test case behavior as defined by the original developers, because we lack the context and domain knowledge required to capture original developer intent perfectly and programmatically. Although we have performed several sanity checks as described in the article, there is evidence that some of these test suites (like many test suites!) are inadequate for making strong claims about correctness (libtiff's test suite is particularly weak). We are particularly grateful to Qi et al. for studying some of these issues in depth. Their replication package for Kali provides alternative harnesses for executing some of these scenarios, with additional correctness checks based on their intuitions and insights regarding desired program behavior.