IEEE Transactions on Software Engineering

Computation Tree Logic Guided Program Repair
Yu Liu, Yahui Song, Martin Mirchev, Abhik Roychoudhury
Keywords: Maintenance engineering; Benchmark testing; Symbols; Semantics; Logic; Computer bugs; Codes; Source coding; Engines; Visualization; Computation Tree Logic; Benchmark; Semantic; Negation; Temporal Logic; Source Code; Fixed Point; Set Of Rules; Flow Control; Disjunction; Regular Expressions; Boolean Variable; Model Checking; Function Calls; Ranking Function; Abstract States; Conditional Statements; Set Of Facts; Control Flow Graph; Execution Path; Terminal Analysis; Logical Constraints; Temporal Operators; High Stratum; Program Logic; Single Trace; Program analysis and automated repair; datalog; loop summarisation
Abstract: Temporal logics like Computation Tree Logic (CTL) have been widely used as expressive formalisms to capture rich behavioural specifications. CTL can express properties such as reachability, termination, invariants and responsiveness, which are difficult to test. This paper suggests a mechanism for the automated repair of infinite-state programs guided by CTL properties. Our produced patches avoid the overfitting issue that occurs in test-suite-guided repair, where the repaired code may not pass tests outside the given test suite. To realise this vision, we propose a novel find-and-fix framework based on Datalog, a widely used domain-specific language for program analysis, which readily supports nested fixed-point semantics of CTL via stratified negation. Specifically, our framework encodes the program and CTL properties into Datalog facts and rules and performs the repair by modifying the facts to pass the analysis rules. In the framework, to achieve both analysis and repair results, we adapt existing techniques – including loop summarisation and Symbolic Execution of Datalog (SEDL) – with key modifications. Our approach achieves analysis accuracy of 56.6% on a CTL verification benchmark and 88.5% on a termination/responsiveness benchmark, surpassing the best baseline performances of 27.7% and 76.9%, respectively. Our approach repairs all detected bugs, which is not achieved by existing tools.
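As a rough illustration of the fact-and-rule encoding described in the abstract, the Python sketch below checks the CTL property AF(p) over a toy control-flow graph by computing a least fixed point over edge facts. The fact names, the graph, and the repair comment are hypothetical simplifications, not the paper's actual Datalog rules.

# Toy sketch (assumed encoding, not the paper's Datalog): check AF(p) over
# control-flow facts with a least fixed point, mirroring how stratified Datalog
# rules evaluate the nested fixed-point semantics of CTL.
flow = {("entry", "loop"), ("loop", "loop"), ("loop", "exit")}  # flow(u, v) facts
p_holds = {"exit"}                                              # p(s) facts

def af(p_states, edges):
    # AF(p) holds at s if p holds at s, or s has successors and all of them satisfy AF(p).
    states = {s for e in edges for s in e} | set(p_states)
    result, changed = set(p_states), True
    while changed:
        changed = False
        for s in states:
            succs = {v for (u, v) in edges if u == s}
            if s not in result and succs and succs <= result:
                result.add(s)
                changed = True
    return result

print("AF(p) at entry:", "entry" in af(p_holds, flow))  # False: the loop may never exit
# Repair, in this toy setting, amounts to modifying facts (for example, replacing the
# self-loop with a guarded exit edge) until the query "entry in AF(p)" succeeds.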
Facilitating Wise Decision-Making for Bounty Backers in Open Source Software Communities
Xin Tan, Bo Hou, Xianjun Ni, Yuxia Zhang, Jing Jiang, Minghui Zhou, Li Zhang
Keywords: Software development management; Decision making; Computer bugs; Technological innovation; Sustainable development; Surveys; Open source software; Ecosystems; Computer science; Bridges; Open-source Software; Open Source Software Communities; Actual Results; Resolution Rate; High Resolution Rate; Root Mean Square Error; Model Performance; General Linear Model; Decision Tree; Specific Tasks; Intrinsic Motivation; Financial Incentives; Quantitative Metrics; Types Of Issues; Cryptocurrencies; Single Issue; Optimism Bias; Open Source Software Projects; Email Survey; Blockchain; Maximize Resource Utilization; Bitcoin; Crowdfunding; Number Of Stars; Degree Of Time; Code Changes; General Motivation; Open source software; bounty issues; incentive mechanism; sustainable development
Abstract: Bounty programs have become a pivotal incentive mechanism in open-source software (OSS) communities, attracting contributors by offering monetary rewards for task completion. Despite their long-standing implementation, the optimal utilization of this mechanism from the perspective of backers (individuals or entities funding bounties) remains insufficiently understood, hindering its refinement and broader adoption. To bridge this gap, we conduct a mixed-methods study analyzing 10,561 bounty issues from Gitcoin, their linked GitHub development data, and surveys from 46 bounty backers. We investigate three core decision-making dimensions: (1) why backers use bounties and the actual outcomes, (2) what issues backers prioritize, and (3) how bounty amounts are set. Our findings reveal that backers primarily seek to enhance developer engagement, project visibility, and task efficiency. However, the actual outcomes often diverge from expectations: although bounty issues have a higher resolution rate (+12%) than non-bounty issues, they also introduce systemic challenges, such as delayed resolutions (+33 days) and difficulties in engaging new developers. Notably, backers tend to prioritize feature-related, intermediate-complexity tasks with short completion timelines, while showing relatively less interest in overly simplistic or highly specialized work. Reward allocation follows a nuanced approach: lower bounties target beginner-friendly tasks, while higher rewards are reserved for advanced skills or multi-week commitments. However, backers often lack systematic methods to calibrate rewards, leading to frequent bounty adjustments. To enable data-driven decision-making, we propose a bounty recommendation predictor that uses empirical factors to predict appropriate bounty amounts. By synthesizing these insights, our study offers OSS communities actionable strategies to refine bounty programs, balancing short-term productivity with long-term ecosystem sustainability.
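To make the proposed bounty recommendation predictor concrete, here is a minimal regression sketch; the feature set, model choice, and all numbers are illustrative assumptions and are not taken from the paper.

# Hypothetical sketch of a bounty-amount predictor trained on empirical factors
# (task type, estimated complexity, expected duration, project popularity).
from sklearn.ensemble import RandomForestRegressor

# Placeholder training data: [is_feature, complexity(1-3), est_days, project_stars]
X = [[1, 2, 7, 1200], [0, 1, 2, 300], [1, 3, 21, 5000], [0, 2, 10, 800]]
y = [400.0, 50.0, 1500.0, 250.0]  # observed bounty amounts in USD (made-up values)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)

new_issue = [[1, 2, 14, 2000]]  # a feature task of intermediate complexity
print("suggested bounty:", round(model.predict(new_issue)[0], 2))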
Spatial Semantic Fuzzing for LiDAR-Based Autonomous Driving Perception Systems
An Guo, Zhiwei Su, Xinyu Gao, Chunrong Fang, Senrong Wang, Haoxiang Tian, Wu Wen, Lei Ma, Zhenyu Chen
Keywords: Autonomous vehicles; Testing; Three-dimensional displays; Point cloud compression; Semantics; Laser radar; Object detection; Fuzzing; Feature extraction; Sensors; Autonomous Vehicles; Perceptual System; Object Detection; Point Cloud; Metamorphic; Spatial Coverage; Labeled Data; Point Cloud Data; 3D Detection; Diverse Test; 3D Object Detection; Perception Module; Bounding Box; Objective Data; Light Detection And Ranging; Testing Criteria; Number Of Behaviors; Road Segments; Series Of Transformations; System Scenario; Scene Graph; 3D Bounding Box; Coverage Criteria; Original Point Cloud; Transformation Operations; Traditional Software; Controller Area Network; Point Cloud Generation; Spurious Detection; Number Of Graphs; Software testing; fuzz testing; autonomous driving system; light detection and ranging
Abstract: Autonomous driving systems (ADSs) have the potential to enhance safety through advanced perception and reaction capabilities, reduce emissions by alleviating congestion, and contribute to various improvements in quality of life. Despite significant advancements in ADSs, several real-world accidents resulting in fatalities have occurred due to failures in the autonomous driving perception modules. As a critical component of autonomous vehicles, LiDAR-based perception systems are marked by high complexity and low interpretability, necessitating the development of effective testing methods for these systems. Current testing methods largely depend on manual data collection and labeling, which restricts their ability to detect a diverse range of erroneous behaviors. This process is not only time-consuming and labor-intensive, but it may also result in the recurrent discovery of similar erroneous behaviors during testing, hindering a comprehensive assessment of the systems. In this paper, we propose and implement a fuzzing framework for LiDAR-based autonomous driving perception systems, named LDFuzz, grounded in metamorphic testing theory. This framework offers the first uniform solution for the automated generation of tests with oracle information. To enhance testing efficiency and increase the number of tests that identify erroneous behaviors, we incorporate spatial and semantic coverage based on the characteristics of point cloud data to guide the generation process. We evaluate the performance of LDFuzz through experiments conducted on four LiDAR-based autonomous driving perception systems designed for the 3D object detection task. The experimental results demonstrate that the tests produced by LDFuzz can effectively detect an average of 7.5% more erroneous behaviors within LiDAR-based perception systems than the best baseline. Furthermore, the findings indicate that LDFuzz significantly enhances the diversity of failed tests.
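The sketch below illustrates the kind of metamorphic relation such a fuzzer can rely on: inserting an object's points into a scene at a known pose yields an expected 3D bounding box that the perception system should still detect. The point-cloud format, the axis-aligned IoU oracle, and the threshold are assumptions for illustration, not LDFuzz's actual transformation operators.

import numpy as np

def iou(box_a, box_b):
    # Axis-aligned 3D IoU between boxes given as [xmin, ymin, zmin, xmax, ymax, zmax].
    lo = np.maximum(box_a[:3], box_b[:3])
    hi = np.minimum(box_a[3:], box_b[3:])
    inter = np.prod(np.clip(hi - lo, 0, None))
    vol_a = np.prod(box_a[3:] - box_a[:3])
    vol_b = np.prod(box_b[3:] - box_b[:3])
    union = vol_a + vol_b - inter
    return float(inter / union) if union > 0 else 0.0

def insert_object(scene_points, object_points, offset):
    # Metamorphic transformation: translate an object's point cloud and splice it
    # into the scene; the expected label is the translated bounding box.
    moved = np.asarray(object_points, float) + np.asarray(offset, float)
    new_scene = np.vstack([scene_points, moved])
    expected_box = np.concatenate([moved.min(axis=0), moved.max(axis=0)])
    return new_scene, expected_box

def violates_relation(detected_boxes, expected_box, iou_threshold=0.5):
    # Oracle: the system under test should report some box overlapping the inserted
    # object; detected_boxes would come from running the perception system on new_scene.
    return not any(iou(np.asarray(b, float), expected_box) >= iou_threshold
                   for b in detected_boxes)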
CoSQA+: Enhancing Code Search Evaluation With a Multi-Choice Benchmark and Test-Driven Agents
Jing Gong, Yanghui Wu, Linxi Liang, Yanlin Wang, Jiachi Chen, Mingwei Liu, Zibin Zheng
Keywords: Codes; Benchmark testing; Python; Annotations; Accuracy; Programming; Semantics; Programming profession; Surveys; Scalability; Code Search; Pairing; Natural Language; Superior Quality; Language Model; Multiple Codes; Functional Verification; Expert Annotations; Semantic Search; Years Of Experience; Input Parameters; Testing Program; Search Queries; Mean Average Precision; Python Code; Current Students; Relevant Codes; Annotation Methods; Code Examples; Multiple Annotations; Code Snippets; Stack Overflow; Final Arbiter; Annotation Approach; Original Pair; Primary Metrics; Natural Language Descriptions; Krippendorff’s Alpha; Edge Cases; Benchmark Evaluation; Software engineering; information search and retrieval; human-computer interaction
Abstract: Semantic code search, retrieving code that matches a given natural language query, is an important task to improve productivity in software engineering. Existing code search datasets face limitations: they rely on human annotators who assess code primarily through semantic understanding rather than functional verification, leading to potential inaccuracies and scalability issues. Additionally, current evaluation metrics often overlook the multi-choice nature of code search. This paper introduces CoSQA+, pairing high-quality queries from CoSQA with multiple suitable codes. We develop an automated pipeline featuring multiple model-based candidate selections and a novel test-driven agent annotation system. Compared with a single Large Language Model (LLM) annotator and with Python expert annotators, both working without test-based verification, our agents leverage test-based verification and achieve the highest accuracy of 93.9%. Through extensive experiments, CoSQA+ has demonstrated superior quality over CoSQA. Models trained on CoSQA+ exhibit improved performance. We publicly release both CoSQA+_all, which contains 412,080 agent-annotated pairs, and CoSQA+_verified, which contains 1,000 human-verified pairs.
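A minimal sketch of the test-driven verification idea follows: a candidate code snippet is accepted for a query only if an accompanying unit test (generated by an agent in the actual pipeline) passes when executed against it. The query, snippet, test, and helper name are illustrative assumptions, not CoSQA+'s pipeline code.

# Hypothetical sketch: accept a (query, code) pair only if a generated test passes.
# A real pipeline would run this in a sandboxed environment rather than via exec().
def passes_test(code_snippet, test_snippet):
    namespace = {}
    try:
        exec(code_snippet, namespace)   # define the candidate function
        exec(test_snippet, namespace)   # run assertions against it
        return True
    except Exception:
        return False

query = "reverse a string"
candidate = "def reverse_string(s):\n    return s[::-1]\n"
test = "assert reverse_string('abc') == 'cba'\nassert reverse_string('') == ''\n"
print(query, "->", passes_test(candidate, test))  # True means the pair is annotated as a match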
Synthetic Malware at Scale: Malicious Code Generation With Code Transplanting
Guangzhan Wang, Diwei Chen, Xiaodong Gu, Yuting Chen, Beijun Shen
Keywords: Codes; Malware; Prototypes; Training; Feature extraction; Detectors; Reactive power; Machine learning; Source coding; Generators; Malware; Code Generation; Machine Learning; Training Data; Data Augmentation; Random Locations; Complete Coding; Benign Samples; Intrusion Detection System; Training Data Augmentation; Code Fragments; Training Set; False Positive Rate; Source Code; Bilingual; Internal Validity; Programming Language; Detection Performance; False Negative Rate; Language Model; Malicious Behavior; Malware Detection; Adversarial Examples; Semantic Coherence; Tokenized; Large Corpus; Functional Signatures; Variable Names; Conditional Statements; Malicious code generation; malicious code detection; code transplanting; program generation
Abstract: Malicious code detection is one of the most essential tasks in safeguarding against security breaches, data compromise, and related threats. While machine learning has emerged as a predominant method for pattern detection, the training process is intricate due to the severe scarcity of malicious code samples. Consequently, machine learning detectors often encounter malicious patterns in limited and isolated scenarios, hindering their ability to generalize effectively across diverse threat landscapes. In this paper, we introduce MalCoder, a novel method for synthesizing malicious code samples. MalCoder enlarges the quantity and diversity of malicious instances by transplanting a set of malicious prototypes into a vast pool of benign code, thereby crafting a diverse array of malicious instances tailored to various application scenarios. For each malware prototype, MalCoder treats it as an incomplete code fragment and crafts its preceding and subsequent contexts through right-to-left and left-to-right code completion respectively. By leveraging GPTs with various sampling strategies, we can instantiate a large number of code samples bearing the malware prototype. Subsequently, MalCoder masks the original prototypes within the transplanted samples and fine-tunes an LLM code generator to reconstruct the original prototype. This process enables the model to seamlessly transplant malicious code fragments into benign code. During inference, MalCoder can automatically insert malicious fragments into benign samples at random positions, transforming benign code into malicious code. We apply MalCoder to a large pool of benign code in CodeSearchNet and craft over 50,000 malicious samples stemming from 39 malicious prototypes. Both qualitative and quantitative analyses show that the generated samples maintain key characteristics of malicious code while blending seamlessly with benign code, which helps in creating realistic and varied training data. Additionally, by using the generated samples as augmented training data, we observe a substantial improvement in malicious code detection capabilities. Specifically, the F1-score increases significantly compared to using only the original prototype samples.
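The inference-time behaviour described above can be pictured with the naive splice below: a malicious prototype fragment is inserted into benign code at a random line boundary. In MalCoder the fragment is regenerated in context by a fine-tuned LLM rather than copied verbatim; this sketch and its sample strings are assumptions for illustration only.

import random

def transplant(benign_code: str, prototype: str, seed: int = 0) -> str:
    # Naive transplantation: choose a random line boundary in the benign sample and
    # splice the prototype in (the real system lets an LLM rewrite the fragment so
    # it blends with the surrounding context instead of pasting it verbatim).
    random.seed(seed)
    lines = benign_code.splitlines()
    pos = random.randint(0, len(lines))
    return "\n".join(lines[:pos] + prototype.splitlines() + lines[pos:])

benign = "def load_config(path):\n    with open(path) as f:\n        return f.read()\n"
prototype = "# prototype fragment (placeholder)\nexfiltrate(data)  # hypothetical malicious call\n"
print(transplant(benign, prototype))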
SemanticLog: Towards Effective and Efficient Large-Scale Semantic Log Parsing
Chenbo Zhang, Wenying Xu, Jinbu Liu, Lu Zhang, Guiyang Liu, Jihong Guan, Qi Zhou, Shuigeng Zhou
Keywords: Semantics; Privacy; Feature extraction; Data processing; Accuracy; Anomaly detection; Data privacy; Chatbots; Protection; Large language models; Log Parsing; Data Privacy; Data Logger; Privacy Issues; Privacy Protection; Language Model; Rich Knowledge; Cloud System; Network Latency; Log Parameters; Training Time; Semantic Information; Anomaly Detection; Leaf Node; Powerful Capability; Tokenized; Substring; Template Matching; Caching; Semantic Labels; Semantic System; Semantic Understanding; Text Generation; Accuracy In Group; Categorical Parameters; Order Of Objects; Semantic Space; Public Benchmark; Source Training; Slower Processing Speed; Semantic log parsing; log analysis; large language models
Abstract: Logs of large-scale cloud systems record diverse system events, ranging from routine statuses to critical errors. As the fundamental step of automated log analysis, log parsing transforms unstructured logs into structured data for easier management and analysis. However, existing syntax-based and deep learning-based parsers struggle with complex real-world logs. Recent parsers based on large language models (LLMs) achieve higher accuracy, but they typically rely on online APIs (e.g., ChatGPT), raising privacy concerns and suffering from network latency. Moreover, with the rise of artificial intelligence for IT operations (AIOps), traditional parsers that focus on syntax-level templates fail to capture the semantics of dynamic log parameters, limiting their usefulness for downstream tasks. These challenges highlight the need for semantic log parsing that goes beyond template extraction to understand parameter semantics. This paper presents SemanticLog, an effective and efficient semantic log parser powered by open-source LLMs. SemanticLog adapts the structure of LLMs to the log parsing task, leveraging their rich knowledge while safeguarding log data privacy. It first extracts informative feature representations from log data, then refines them through fine-grained semantic perception to enable accurate template and parameter extraction together with semantic category prediction. To boost scalability, SemanticLog introduces the EffiParsing tree for faster inference on large-scale logs. Extensive experiments on the LogHub-2.0 dataset show that SemanticLog significantly outperforms the state-of-the-art log parsers in terms of accuracy. Moreover, it also surpasses existing LLM-based parsers in efficiency while showcasing advanced semantic parsing capability. Notably, SemanticLog employs much smaller open-source LLMs compared to existing LLM-based parsers (mainly based on ChatGPT), while offering stronger log data privacy protection.
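To make "semantic parsing beyond templates" concrete, the sketch below splits a raw log line into a template and parameters and attaches a coarse semantic category to each parameter. The regexes and category names are assumptions for illustration; SemanticLog predicts parameter semantics with an LLM rather than fixed patterns, and its output schema may differ.

import re

# Hypothetical parameter categories keyed by regex.
PATTERNS = [
    ("ip_address", r"\b\d{1,3}(?:\.\d{1,3}){3}\b"),
    ("duration_ms", r"\b\d+(?=\s*ms\b)"),
    ("number", r"\b\d+\b"),
]

def parse(line):
    # Replace each matched parameter with a wildcard and record its value and category.
    template, params = line, []
    for category, pattern in PATTERNS:
        for match in re.findall(pattern, template):
            params.append({"value": match, "category": category})
            template = template.replace(match, "<*>", 1)
    return template, params

print(parse("Connected to 10.12.0.5 in 32 ms after 3 retries"))
# -> ('Connected to <*> in <*> ms after <*> retries', [typed parameters])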
DockerFill: Automatically Completing Dockerfile Code With Syntax-Aware Multi-Task Learning
Yiwen Wu, Yang Zhang, Tao Wang, Bo Ding, Huaimin Wang
Keywords: Codes; Surveys; Transformers; Writing; Training; Multitasking; Syntactics; Manuals; Containers; Accuracy; Multi-task Learning; Contextual Information; Software Development; Exact Match; Language Model; Containerized; Transformer Architecture; Docker Image; Masked Language Model; Pre-training Tasks; Similarity Measure; Natural Language; Programming Language; Computational Overhead; Distance Metrics; Model Code; Complete Coding; Company Workers; Vocabulary Size; Tokenized; Pre-training Stage; Pre-trained Language Models; Code Generation; File Path; Shared Layers; Syntax Errors; Fine-tuning Stage; Training Corpus; Edit Distance; Dockerfile; transformer; code completion
Abstract: As a kind of infrastructure-as-code, a Dockerfile specifies the structure and functionality of a built Docker image and thus plays an important role in the containerized software development process. Nowadays, developers need to spend extra time and effort configuring their Dockerfiles in addition to their regular coding work, which requires knowledge and skills orthogonal to those entailed in other software-related experiences. Poorly written Dockerfile code often introduces errors and maintenance costs. However, little automated support is available for assisting developers in configuring Dockerfiles. In this study, we first conduct an online survey to investigate Docker developers’ perceptions of Dockerfile writing, highlighting the needs and potential benefits of Dockerfile auto-completion techniques. Then, we introduce DockerFill, a pre-trained-model-based approach that provides completion suggestions for Dockerfile-specific code. DockerFill leverages a multi-layer Transformer architecture with syntax-aware multi-task learning, which includes contextual file information and three pre-training tasks, i.e., masked language modeling, syntax type identification, and masked identifier prediction. To evaluate DockerFill’s effectiveness, we collect a dataset of 6,350 high-quality real-world Dockerfiles. Our empirical results show that DockerFill provides up to 52.38% accuracy for token-level completion and 19.69% exact match for line-level completion, outperforming the baselines by 7.32%-37.67% and 1.97%-19.69%, respectively. Also, DockerFill obtains significantly higher human evaluation scores compared to the baselines.
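The two reported evaluation metrics can be reproduced with a few lines: token-level accuracy compares predicted and reference tokens position by position, and line-level exact match counts whole-line agreements. The whitespace tokenisation and the sample Dockerfile lines are simplifying assumptions; DockerFill's own tokenizer and test set differ.

def token_accuracy(predicted_lines, reference_lines):
    # Token-level completion accuracy: fraction of reference tokens predicted correctly,
    # using naive whitespace tokenisation (an assumption for this sketch).
    correct = total = 0
    for pred, ref in zip(predicted_lines, reference_lines):
        p_toks, r_toks = pred.split(), ref.split()
        total += len(r_toks)
        correct += sum(p == r for p, r in zip(p_toks, r_toks))
    return correct / total if total else 0.0

def exact_match(predicted_lines, reference_lines):
    # Line-level exact match: a completion counts only if the whole line is identical.
    pairs = list(zip(predicted_lines, reference_lines))
    return sum(p.strip() == r.strip() for p, r in pairs) / len(pairs) if pairs else 0.0

preds = ["RUN apt-get update && apt-get install -y curl", "COPY . /app"]
refs  = ["RUN apt-get update && apt-get install -y git",  "COPY . /app"]
print(token_accuracy(preds, refs), exact_match(preds, refs))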
Low-Cost Testing for Path Coverage of MPI Programs Using Surrogate-Assisted Changeable Multi-Objective Optimization
Baicai Sun, Lina Gong, Yinan Guo, Dunwei Gong, Gaige Wang
Keywords: Testing; Optimization; Costs; Optimization models; Receivers; Synchronization; Information science; Genetic algorithms; Codes; Benchmark testing; Multi-objective Optimization; Message Passing Interface; Coverage Path; Message Passing Interface Program; Sample Set; Alternative Models; Optimization Algorithm; Optimal Model; Efficient Generation; Intelligence Algorithms; Program Execution; Multi-objective Model; Program Coverage; Multi-objective Optimization Model; Intelligent Optimization Algorithms; Test Case Generation; Target Path; Gaussian Kernel; Value Function; Radial Basis Function Network; Input Program; Generation Cost; Hidden Layer Nodes; Ant Colony Optimization; Communication Domain; Group Of Indicators; Cost Of Testing; Set Of Cases; Software Testing; Path coverage of MPI programs; test case generation; changeable multi-objective optimization; surrogate model
Abstract: A target path of Message Passing Interface (MPI) programs typically consists of several target sub-paths. When solving for a test case that covers the target path using an intelligent optimization algorithm, we often find hard-to-cover target sub-paths, which limit the testing efficiency for the entire target path. Therefore, this paper proposes an approach of low-cost testing for path coverage of MPI programs using surrogate-assisted changeable multi-objective optimization, which is used to further improve the effectiveness and efficiency of test case generation. The proposed approach first establishes a changeable multi-objective optimization model, which is used to guide the generation of test cases. While solving the changeable multi-objective optimization model with an intelligent optimization algorithm, we then identify each hard-to-cover target sub-path and form a corresponding sample set. Finally, we manage the surrogate model corresponding to each hard-to-cover target sub-path based on the formed sample set, and select superior evolutionary individuals to actually execute the MPI program under test, thus reducing the cost and number of program executions. The proposed approach has been applied to path coverage testing of several benchmark MPI programs, and compared with several state-of-the-art approaches. The experimental results show that the proposed approach significantly improves the effectiveness and efficiency of generating test cases.
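The surrogate-filtering step can be sketched as follows: a Gaussian-kernel interpolator trained on already-executed test cases predicts the coverage objective of new candidates, and only the most promising candidates are actually run against the MPI program. The kernel width, the objective values, and the selection rule are assumptions for illustration, standing in for the paper's radial basis function network surrogate.

import numpy as np

def surrogate_predict(candidates, samples, sample_objs, gamma=0.1):
    # Gaussian-kernel interpolation over previously executed test cases.
    preds = []
    for x in candidates:
        w = np.exp(-gamma * np.sum((samples - x) ** 2, axis=1))
        preds.append(float(np.dot(w, sample_objs) / (w.sum() + 1e-12)))
    return np.array(preds)

def select_for_real_execution(candidates, samples, sample_objs, k=2):
    # Execute the expensive MPI program only for the k candidates the surrogate
    # predicts to be closest to covering the hard-to-cover target sub-path
    # (lower objective value = closer to coverage).
    preds = surrogate_predict(np.asarray(candidates, float),
                              np.asarray(samples, float),
                              np.asarray(sample_objs, float))
    return [candidates[i] for i in np.argsort(preds)[:k]]

samples = [[0.0, 1.0], [2.0, 3.0], [5.0, 5.0]]   # already-executed test inputs
objs = [0.9, 0.4, 0.1]                            # their measured coverage objectives
print(select_for_real_execution([[1.0, 1.0], [4.5, 5.0], [9.0, 9.0]], samples, objs))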
Do Automated Fixes Truly Mitigate Smart Contract Exploits?
Sofia Bobadilla, Monica Jin, Martin Monperrus
Keywords: Smart contracts; Maintenance engineering; Codes; Source coding; Blockchains; Prevention and mitigation; Manuals; Static analysis; Systematic literature review; Formal verification; Smart Contracts; Source Code; Tool For Detection; Access Control; Types Of Datasets; Original Function; Manual Analysis; Static Analysis; Binary Code; Denial Of Service; Effective Repair; Repair Strategies; Formal Verification; Vulnerable Category; Reproducible Science; Reproducible Tool; Types Of Vulnerabilities; Syntax Errors; Contraction Function; Template-based Approach; Blockchain; Execution Environment
Abstract: Automated Program Repair (APR) for smart contract security promises to automatically mitigate smart contract vulnerabilities responsible for billions in financial losses. However, the true effectiveness of this research in addressing smart contract exploits remains largely unexplored. This paper bridges this critical gap by introducing a novel and systematic experimental framework for evaluating exploit mitigation of program repair tools for smart contracts. We qualitatively and quantitatively analyze 20 state-of-the-art APR tools using a dataset of 143 vulnerable smart contracts, for which we manually craft 91 executable exploits. We are the first to define and measure the “exploit mitigation rate”, giving researchers and practitioners a concrete measure of effectiveness. Our findings reveal substantial disparities in the state of the art, with an exploit mitigation rate ranging from a low of 29% to a high of 74%. Our study identifies systemic limitations, such as inconsistent functionality preservation, that must be addressed in future research on program repair for smart contracts.
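The metric at the heart of the study can be sketched as follows: over the crafted exploits, count how many no longer succeed against the patched contract. Whether functionality preservation is folded into the rate or reported separately is a detail of the paper; it is included here for illustration, and the record fields are assumptions about how such results might be tabulated, not the paper's schema.

def exploit_mitigation_rate(results):
    # Each record describes one (exploit, patched contract) pair:
    #   exploit_succeeds_on_patched -- does the exploit still work after repair?
    #   functionality_preserved     -- does the patched contract still pass its functional tests?
    mitigated = sum(
        1 for r in results
        if not r["exploit_succeeds_on_patched"] and r["functionality_preserved"]
    )
    return mitigated / len(results) if results else 0.0

results = [
    {"exploit_succeeds_on_patched": False, "functionality_preserved": True},   # mitigated
    {"exploit_succeeds_on_patched": True,  "functionality_preserved": True},   # not mitigated
    {"exploit_succeeds_on_patched": False, "functionality_preserved": False},  # patch breaks behaviour
]
print(f"exploit mitigation rate: {exploit_mitigation_rate(results):.0%}")  # 33%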
The Power of Small LLMs: A Multi-Agent for Code Generation via Dynamic Precaution Tuning
Junfeng Zhang, Jinzhi Liao, Jiuyang Tang, Xiang Zhao
Keywords: Codes; Multi-agent systems; Tuning; Collaboration; Software development management; Large language models; Data privacy; Computational modeling; Computational efficiency; Vocabulary; Code Generation; Dynamic Tuning; Large Language Models; High Cost; Data Privacy; Language Model; Multi-agent Systems; Iterative Refinement; Convolutional Neural Network; Natural Language; Educational Settings; Software Development; Teacher Model; Multiple Phases; Specific Agents; User Requirements; Definitive Role; Usage Scenarios; Error Message; Error Feedback; Test Case Generation; Natural Language Descriptions; Coding Task; Fine-tuning Strategy; Error Prevention; API Calls; Correction Code; Code Review; Error Propagation; Functional Signatures; Code generation; large language models; small LLMs; multi-agent system; dynamic precaution tuning
Abstract: The emergence of large language models (LLMs) has greatly advanced automated code generation, with multi-agent systems comprising multiple LLMs gaining attention for their collaborative potential. However, most multi-agent systems still rely on large LLMs, leading to high computational costs and data privacy risks. Small LLMs provide a resource-efficient and privacy-preserving alternative; however, directly substituting them for large LLMs in conventional multi-agent frameworks leads to considerable performance degradation. This paper first identifies two fundamental challenges in achieving effective collaboration among small LLMs: the difficulty in accurately interpreting complex role prompts and the fragility of inter-agent coordination. To overcome these, we propose MASDP, a multi-agent system for code generation with dynamic precaution tuning, inspired by the whistleblowing mechanism. MASDP introduces a fine-tuned Reminder agent that proactively anticipates potential errors, shifts the burden of code optimization from the Programmer to itself, and iteratively refines precautions based on execution feedback, thereby enhancing cooperation and reliability. Extensive experiments show that MASDP, built entirely on small LLMs, outperforms state-of-the-art baselines, including GPT-4, while substantially reducing computational overhead and safeguarding data privacy.
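The collaboration pattern described above can be sketched as a small loop: a Reminder agent turns the task and prior failures into precautions, a Programmer agent generates code under those precautions, and execution feedback drives the next iteration. The call_llm helper, the prompts, and the stopping rule are hypothetical placeholders, not MASDP's actual agents or prompts.

def generate_with_precautions(task, run_tests, call_llm, max_rounds=3):
    # call_llm(role, prompt) stands in for querying a small LLM with a role prompt;
    # run_tests(code) returns (passed, error_feedback) from executing the candidate code.
    precautions, feedback, code = "", "", ""
    for _ in range(max_rounds):
        # Reminder agent: anticipate likely errors from the task and the latest feedback.
        precautions = call_llm("reminder",
                               f"Task: {task}\nPrevious errors: {feedback}\nList precautions.")
        # Programmer agent: generate code while following the precautions.
        code = call_llm("programmer",
                        f"Task: {task}\nFollow these precautions:\n{precautions}")
        passed, feedback = run_tests(code)
        if passed:
            return code
    return code  # best effort after max_rounds of precaution refinement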