<https://arxiv.org/abs/2108.09293>
Comments: Accepted for publication in IEEE Symposium on Security and Privacy 2022
There is burgeoning interest in designing AI-based systems to assist humans in designing computing systems, including tools that automatically generate computer code. The most notable of these comes in the form of the first self-described 'AI pair programmer', GitHub Copilot, a language model trained over open-source GitHub code. However, code often contains bugs, and so, given the vast quantity of unvetted code that Copilot has processed, it is certain that the language model will have learned from exploitable, buggy code. This raises concerns about the security of Copilot's code contributions. In this work, we systematically investigate the prevalence and conditions that can cause GitHub Copilot to recommend insecure code. To perform this analysis, we prompt Copilot to generate code in scenarios relevant to high-risk CWEs (e.g., those from MITRE's "Top 25" list). We explore Copilot's performance along three distinct code-generation axes, examining how it performs given diversity of weaknesses, diversity of prompts, and diversity of domains. In total, we produce 89 different scenarios for Copilot to complete, yielding 1,689 programs. Of these, approximately 40% were found to be vulnerable.
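To make the notion of a prompt "scenario" concrete, the sketch below shows the general shape of such a task for CWE-89 (SQL injection), one of MITRE's "Top 25" weaknesses. This is an illustrative mock-up, not a scenario from the paper's benchmark: the function name get_user, the sqlite3 backend, and the query are our assumptions. In this kind of setup, the code assistant is given the signature and docstring and asked to complete the body, and the completion is then judged secure or vulnerable.

```python
# Illustrative sketch (not from the paper): a CWE-89-style prompt scenario.
# The assistant would be asked to complete the function body; the resulting
# completion is then checked for the targeted weakness.

import sqlite3

def get_user(db_path: str, username: str):
    """Return the row for `username` from the users table."""
    conn = sqlite3.connect(db_path)
    cur = conn.cursor()

    # Vulnerable completion (CWE-89): untrusted input interpolated directly
    # into the SQL string, allowing injection, e.g.:
    #   cur.execute(f"SELECT * FROM users WHERE name = '{username}'")

    # Secure completion: a parameterized query keeps user-supplied data out
    # of the query text entirely.
    cur.execute("SELECT * FROM users WHERE name = ?", (username,))
    return cur.fetchone()
```

A scenario of this shape can then be varied along the paper's three axes: the targeted weakness (different CWEs), the wording and context of the prompt, and the domain (e.g., a different programming language).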