Code Model

Code Model Training

Source Code Model

IR Code Model

Binary Code Model

Code Model Security

  • An Extensive Study on Adversarial Attack against Pre-trained Models of Code, (FSE2023)

    • Abstract: Transformer-based pre-trained models of code (PTMC) have been widely utilized and have achieved state-of-the-art performance in many mission-critical applications. However, they can be vulnerable to adversarial attacks through identifier substitution or coding style transformation, which can significantly degrade accuracy and may further incur security concerns. Although several approaches have been proposed to generate adversarial examples for PTMC, the effectiveness and efficiency of such appr...
    • Labels: code model, code model security, empirical study
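    • See also: an illustrative sketch of semantics-preserving identifier substitution appears after this list.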
  • An Investigation into Misuse of Java Security APIs by Large Language Models, (ASIACCS2024)

    • Abstract: The increasing trend of using Large Language Models (LLMs) for code generation raises the question of their capability to generate trustworthy code. While many researchers are exploring the utility of code generation for uncovering software vulnerabilities, one crucial but often overlooked aspect is the security Application Programming Interfaces (APIs). APIs play an integral role in upholding software security, yet effectively integrating security APIs presents substantial challenges. This lead...
    • Labels: code model, code model security, empirical study
  • Attacks and Defenses for Large Language Models on Coding Tasks, (ASE2024)

    • Abstract: Modern large language models (LLMs), such as ChatGPT, have demonstrated impressive capabilities for coding tasks, including writing and reasoning about code. They improve upon previous neural network models of code, such as code2seq or seq2seq, that already demonstrated competitive results when performing tasks such as code summarization and identifying code vulnerabilities. However, these previous code models were shown vulnerable to adversarial examples, i.e., small syntactic perturbations des...
    • Labels: code model, code model security
  • Attribution-guided Adversarial Code Prompt Generation for Code Completion Models, (ASE2024)

    • Abstract: Large language models have made significant progress in code completion, which may further remodel future software development. However, these code completion models are found to be highly risky as they may introduce vulnerabilities unintentionally or be induced by a special input, i.e., adversarial code prompt. Prior studies mainly focus on the robustness of these models, but their security has not been fully analyzed. In this paper, we propose a novel approach AdvPro that can automatically gene...
    • Labels: code generation, code completion, code model, code model security
  • AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models, (EMNLP2024)

    • Abstract: Although Large Language Models (LLMs) are becoming increasingly powerful, they still exhibit significant but subtle weaknesses, such as mistakes in instruction-following or coding tasks. As these unexpected errors could lead to severe consequences in practical deployments, it is crucial to investigate the limitations within LLMs systematically. Traditional benchmarking approaches cannot thoroughly pinpoint specific model deficiencies, while manual inspections are costly and not scalable. In this p...
    • Labels: code model, code model security, empirical study
  • CoSec: On-the-Fly Security Hardening of Code LLMs via Supervised Co-decoding, (ISSTA2024)

    • Abstract: Large Language Models (LLMs) specialized in code have shown exceptional proficiency across various programming-related tasks, particularly code generation. Nonetheless, due to its nature of pretraining on massive uncritically filtered data, prior studies have shown that code LLMs are prone to generate code with potential vulnerabilities. Existing approaches to mitigate this risk involve crafting data without vulnerability and subsequently retraining or fine-tuning the model. As the number of par...
    • Labels: code model, code model security
  • CodeIP: A Grammar-Guided Multi-Bit Watermark for Large Language Models of Code, (EMNLP2024)

    • Abstract: Large Language Models (LLMs) have achieved remarkable progress in code generation. It now becomes crucial to identify whether the code is AI-generated and to determine the specific model used, particularly for purposes such as protecting Intellectual Property (IP) in industry and preventing cheating in programming exercises. To this end, several attempts have been made to insert watermarks into machine-generated code. However, existing approaches are limited to inserting only a single bit of inf...
    • Labels: code generation, program synthesis, code model, code model security
  • Constrained Decoding for Secure Code Generation, (arXiv2024)

    • Abstract: Code Large Language Models (Code LLMs) have been increasingly used by developers to boost productivity, but they often generate vulnerable code. Thus, there is an urgent need to ensure that code generated by Code LLMs is correct and secure. Previous research has primarily focused on generating secure code, overlooking the fact that secure code also needs to be correct. This oversight can lead to a false sense of security. Currently, the community lacks a method to measure actual progress in this...
    • Labels: code generation, code model, code model security
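    • See also: an illustrative sketch of constraint-aware decoding appears after this list.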
  • Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing, (EMNLP2024)

    • Abstract: Large language models (LLMs) are increasingly being adopted in a wide range of real-world applications. Despite their impressive performance, recent studies have shown that LLMs are vulnerable to deliberately crafted adversarial prompts even when aligned via Reinforcement Learning from Human Feedback or supervised fine-tuning. While existing defense methods focus on either detecting harmful prompts or reducing the likelihood of harmful responses through various means, defending LLMs against jail...
    • Labels: code model, code model security
  • Instruction Tuning for Secure Code Generation, (ICML2024)

    • Abstract: Modern language models (LMs) have gained widespread acceptance in everyday and professional contexts, particularly in programming. An essential procedure enabling this adoption is instruction tuning, which substantially enhances LMs' practical utility by training them to follow user instructions and human preferences. However, existing instruction tuning schemes overlook a crucial aspect: the security of generated code. As a result, even the state-of-the-art instruction-tuned LMs frequently prod...
    • Labels: code generation, code model, code model security
  • Large Language Models for Code: Security Hardening and Adversarial Testing, (CCS2023)

    • Abstract: Large language models (large LMs) are increasingly trained on massive codebases and used to generate code. However, LMs lack awareness of security and are found to frequently produce unsafe code. This work studies the security of LMs along two important axes: (i) security hardening, which aims to enhance LMs' reliability in generating secure code, and (ii) adversarial testing, which seeks to evaluate LMs' security at an adversarial standpoint. We address both of these by formulating a new securi...
    • Labels: code generation, code model, code model security
  • OpenAI’s Approach to External Red Teaming for AI Models and Systems, (OpenAI2024)

    • Abstract: Red teaming has emerged as a critical practice in assessing the possible risks of AI models and systems. It aids in the discovery of novel risks, stress testing possible gaps in existing mitigations, enriching existing quantitative safety metrics, facilitating the creation of new safety measurements, and enhancing public trust and the legitimacy of AI risk assessments. This white paper describes OpenAI’s work to date in external red teaming and draws some more general conclusions from this work....
    • Labels: code model, code model security, benchmark
  • PELICAN: Exploiting Backdoors of Naturally Trained Deep Learning Models in Binary Code Analysis, (USENIXSec2023)

    • Abstract: Deep Learning (DL) models are increasingly used in many cyber-security applications and achieve superior performance compared to traditional solutions. In this paper, we study backdoor vulnerabilities in naturally trained models used in binary analysis. These backdoors are not injected by attackers but rather products of defects in datasets and/or training processes. The attacker can exploit these vulnerabilities by injecting some small fixed input pattern (e.g., an instruction) called backdoor ...
    • Labels: code model, code model security, code model training, binary code model
  • Poster: Boosting Adversarial Robustness by Adversarial Pre-training, (CCS2023)

    • Abstract: Vision Transformer (ViT) shows superior performance on various tasks, but, similar to other deep learning techniques, it is vulnerable to adversarial attacks. Due to the differences between ViT and traditional CNNs, previous works designed new adversarial training methods as defenses according to the design of ViT, such as blocking attention to individual patches or dropping embeddings with low attention. However, these methods usually focus on fine-tuning stage or the training of the model itse...
    • Labels: code model, code model security
  • RedAgent: Red Teaming Large Language Models with Context-aware Autonomous Language Agent, (arXiv2024)

    • Abstract: Recently, advanced Large Language Models (LLMs) such as GPT-4 have been integrated into many real-world applications like Code Copilot. These applications have significantly expanded the attack surface of LLMs, exposing them to a variety of threats. Among them, jailbreak attacks that induce toxic responses through jailbreak prompts have raised critical safety concerns. To identify these threats, a growing number of red teaming approaches simulate potential adversarial scenarios by crafting jailb...
    • Labels: code model, code model security, benchmark
  • SecCodePLT: A Unified Platform for Evaluating the Security of Code GenAI, (arXiv2024)

    • Abstract: Language models for code (CodeLMs) have emerged as powerful tools for code-related tasks, outperforming traditional methods and standard machine learning approaches. However, these models are susceptible to security vulnerabilities, drawing increasing research attention from domains such as software engineering, artificial intelligence, and cybersecurity. Despite the growing body of research focused on the security of CodeLMs, a comprehensive survey in this area remains absent. To address this g...
    • Labels: code generation, program synthesis, code model, code model security, benchmark
  • Security of Language Models for Code: A Systematic Literature Review, (arXiv2024)

    • Abstract: Language models for code (CodeLMs) have emerged as powerful tools for code-related tasks, outperforming traditional methods and standard machine learning approaches. However, these models are susceptible to security vulnerabilities, drawing increasing research attention from domains such as software engineering, artificial intelligence, and cybersecurity. Despite the growing body of research focused on the security of CodeLMs, a comprehensive survey in this area remains absent. To address this g...
    • Labels: code model, code model security, survey
  • Traces of Memorisation in Large Language Models for Code, (ICSE2024)

    • Abstract: Large language models have gained significant popularity because of their ability to generate human-like text and potential applications in various fields, such as Software Engineering. Large language models for code are commonly trained on large unsanitised corpora of source code scraped from the internet. The content of these datasets is memorised and can be extracted by attackers with data extraction attacks. In this work, we explore memorisation in large language models for code and compare ...
    • Labels: code model, code model security, benchmark
  • TrojanPuzzle: Covertly Poisoning Code-Suggestion Models, (S&P2024)

    • Abstract: With tools like GitHub Copilot, automatic code suggestion is no longer a dream in software engineering. These tools, based on large language models, are typically trained on massive corpora of code mined from unvetted public sources. As a result, these models are susceptible to data poisoning attacks where an adversary manipulates the model’s training by injecting malicious data. Poisoning attacks could be designed to influence the model’s suggestions at run time for chosen contexts, such as ind...
    • Labels: code model, code model security, code generation, code completion
  • Who Wrote this Code? Watermarking for Code Generation, (ACL2024)

    • Abstract: Since the remarkable generation performance of large language models raised ethical and legal concerns, approaches to detect machine-generated text by embedding watermarks are being developed. However, we discover that the existing works fail to function appropriately in code generation tasks due to the task’s nature of having low entropy. Extending a logit-modifying watermark method, we propose Selective WatErmarking via Entropy Thresholding (SWEET), which enhances detection ability and mitigates...
    • Labels: code generation, program synthesis, code model, code model security
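
The entries above describe several mechanisms at a high level; the short sketches below are simplified illustrations of three of them, written against stated assumptions rather than the papers' actual implementations.

The adversarial-attack entries (e.g., "An Extensive Study on Adversarial Attack against Pre-trained Models of Code") rely on semantics-preserving perturbations such as identifier substitution: renaming a variable changes the surface form a code model sees without changing program behavior. The sketch below performs one such rename with Python's standard ast module; the attacker's search over candidate names and the victim-model scoring step are omitted and would be model-specific.

```python
# Minimal sketch of a semantics-preserving identifier substitution, the basic
# perturbation behind many adversarial attacks on code models. The search over
# candidate names and the victim-model scoring step are intentionally omitted.
import ast


class RenameIdentifier(ast.NodeTransformer):
    """Rename every occurrence of one identifier to a new name."""

    def __init__(self, old: str, new: str):
        self.old, self.new = old, new

    def visit_Name(self, node: ast.Name) -> ast.Name:
        if node.id == self.old:
            node.id = self.new
        return node

    def visit_arg(self, node: ast.arg) -> ast.arg:
        if node.arg == self.old:
            node.arg = self.new
        return node


def substitute(source: str, old: str, new: str) -> str:
    """Return the source with `old` renamed to `new`; program semantics are unchanged."""
    tree = ast.parse(source)
    tree = RenameIdentifier(old, new).visit(tree)
    return ast.unparse(tree)


if __name__ == "__main__":
    original = "def add(total, step):\n    return total + step\n"
    # An attacker would try many candidate names and keep the rename that most
    # degrades the victim model's prediction (e.g., a summarization or clone label).
    print(substitute(original, "total", "acc"))
```

In practice such attack tooling also applies coding-style transformations (for example, rewriting a loop) under the same constraint that program behavior is preserved.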
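
The "Constrained Decoding for Secure Code Generation" entry is about steering a Code LLM's decoding toward code that satisfies security constraints. The toy sketch below shows only the general flavor: before the next token is chosen, candidates whose continuation would complete a known insecure API pattern are masked out. The pattern list, the toy vocabulary, and the logits are illustrative assumptions, not the paper's algorithm.

```python
# Toy sketch of constraint-aware decoding: mask next-token candidates whose
# continuation would complete a known insecure API pattern before sampling.
# The vocabulary, patterns, and logits here are illustrative placeholders.
import math
import re

INSECURE_PATTERNS = [
    re.compile(r"hashlib\.md5\($"),    # weak hash being called
    re.compile(r"yaml\.load\($"),      # unsafe YAML loading
    re.compile(r"shell\s*=\s*True"),   # shell-injection risk in subprocess calls
]


def mask_insecure(prefix: str, vocab: list[str], logits: list[float]) -> list[float]:
    """Set the logit of any token that would complete an insecure pattern to -inf."""
    masked = list(logits)
    for i, token in enumerate(vocab):
        candidate = prefix + token
        if any(p.search(candidate) for p in INSECURE_PATTERNS):
            masked[i] = -math.inf
    return masked


if __name__ == "__main__":
    prefix = "digest = hashlib.md5"
    vocab = ["(", ".hexdigest", "\n", " "]
    logits = [2.3, 0.1, -0.5, -1.0]
    constrained = mask_insecure(prefix, vocab, logits)
    best = max(range(len(vocab)), key=lambda i: constrained[i])
    print("chosen token:", repr(vocab[best]))  # "(" is blocked, so ".hexdigest" wins
```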
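
The "Who Wrote this Code? Watermarking for Code Generation" entry describes SWEET, a selective watermark that modifies logits only at generation positions whose next-token distribution has high enough entropy, since near-deterministic code tokens leave no room to embed a signal. The sketch below illustrates that gating idea on toy logits; the hash-based green-list split, the entropy threshold, and the bias value are illustrative assumptions rather than the paper's exact procedure.

```python
# Toy illustration of entropy-gated ("selective") watermarking: a green-list
# bias is added to the logits only when the next-token distribution has entropy
# above a threshold. The hash split, threshold, and bias are illustrative.
import hashlib
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)


def in_green_list(context: str, token_id: int, gamma: float = 0.5) -> bool:
    """Context-dependent pseudo-random split of the vocabulary into green/red."""
    digest = hashlib.sha256(f"{context}|{token_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < gamma * 100


def watermark_logits(logits, context, threshold=1.0, delta=2.0):
    """Bias green-list tokens by `delta`, but only at high-entropy positions."""
    if entropy(softmax(logits)) < threshold:
        return logits  # near-deterministic position: leave untouched
    return [x + delta if in_green_list(context, i) else x for i, x in enumerate(logits)]


if __name__ == "__main__":
    peaked = [9.0, 0.1, 0.0, -0.2]  # low entropy: no watermark applied
    flat = [1.0, 0.9, 1.1, 0.8]     # high entropy: green tokens get boosted
    print(watermark_logits(peaked, context="def f("))
    print(watermark_logits(flat, context="def f("))
```

Detection then recomputes the green list for each position and tests whether green tokens are over-represented at the high-entropy positions of the suspect code.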

Code Model Robustness