Anisa Habib
Honors 3700
7 Dec 2023
GitHub Copilot: AI Audit
GitHub Copilot is a cloud-based artificial intelligence tool developed by Microsoft
subsidiary GitHub and OpenAI that aims to assist users by autocompleting code. Released on 29
June 2021, it is currently available to individual users or businesses that purchase a subscription.
It is possible to add Copilot as an extension in Visual Studio Code, Visual Studio, Vim, Neovim,
Azure Data Studio, and JetBrains integrated development environments (IDEs). Copilot’s
primary objective is to serve as an AI pair programmer and help programmers code faster by suggesting code completions and writing code in response to natural language prompts.
The OpenAI Codex, a descendant of GPT-3, is the model that powers GitHub Copilot.
According to the model’s official description, “its training data contains both natural language
and billions of lines of source code from publicly available sources, including code in public
GitHub repositories” (Zaremba et al., 2021). While OpenAI Codex is most capable in Python, it
is also proficient in over a dozen other programming languages. As of 2021, it has a memory of
14KB for Python code. Compared to GPT-3’s 4KB memory, GitHub Copilot can take into
account over 3 times as much contextual information while performing any task (Zaremba et al.,
2021). The authors of OpenAI Codex argue that computer programming can be thought of as
having two major tasks: breaking a problem down into simpler problems and mapping these
problems to existing code libraries, APIs, or functions (Zaremba et al., 2021). GitHub Copilot
and OpenAI Codex aim to alleviate the barrier that is this second task through analyzing the
context of the user’s input and suggesting code.
Drawing context from the user’s private code and comments, the algorithm suggests line
completion or even entire blocks of code to the user. The tool allows the user to manually edit
and cycle through alternative suggestions, autofill repetitive code, and create unit tests. This all
contributes to GitHub’s goal of allowing developers to “quickly discover alternative ways to
solve problems, write tests, and explore APIs without having to tediously tailor a search for
answers… across the internet” (Friedman, 2021). Some small studies from GitHub report that
professional developers using Copilot take 55% less time to complete programming tasks than those who do not (Kalliamvakou, 2022). The official webpage also reports that over 37,000 companies have adopted GitHub Copilot and that 55% of developers prefer the technology, according to a 2023 Stack Overflow survey.
While the technology is promising, the quality and security of anything developed in part
by an AI tool are important aspects to consider. Since Copilot’s release, there have been many
discussions concerning its security and impact on education. One study reports having successfully “extracted 2,702 hard-coded credentials from Copilot and 129 secrets from CodeWhisperer under the black-box setting, among which at least 3.6% and 5.4% secrets are real
strings from GitHub repositories” (Huang et al., 2023). Researchers from New York University’s Center for Cybersecurity found that Copilot “generated code contained security vulnerabilities about 40% of the time” (Pearce et al., 2021). This means that it is possible to
successfully extract sensitive information such as account credentials through code generated
with Copilot, raising severe privacy and security concerns. Additionally, since Copilot is trained
with billions of lines of public code, this likely includes almost every popular introductory
university programming assignment (Claburn, 2022). While solutions to these assignments could already be found online with relatively little effort, educators could normally rely on “plagiarism
checkers” to determine if students are cheating. However, code generated with GitHub Copilot
“actually generates novel solutions … that are superficially different enough that they plausibly
could have come from a student” (Claburn, 2022). This now makes it more difficult for educators
and recruiters to catch cheating and determine the level of ability and understanding that students
have.
Similar to ChatGPT and other LLMs, many in the field consider Copilot a valuable tool that can increase productivity by automating repetitive tasks. However, others hold the
perspective that Copilot produces weak, insecure code and is a threat to the integrity of
educational and business environments. This audit of GitHub Copilot focuses on the quality of its product in terms of performance and security. Further, it attempts to explore the extent to
which its output may be considered plagiarism. In other words, what are the limits of GitHub
Copilot and how do those limits affect user privacy, security, and integrity?
Blue Sky Audit
In a world with unlimited time and resources, it would be ideal to conduct a thorough
analysis of bias and inaccuracies present in the training data of GitHub Copilot and OpenAI
Codex. Information on how this data was collected, any sensitive information it may include, and the evaluation metrics used to detect bias are key aspects of auditing any algorithm (Diakopoulos, 2014). It is possible that the training data was not sufficiently filtered or tested against all possible
inaccuracies or bias. As the models were trained on both natural language and publicly available
code scraped from the internet, any human-generated bias and error present in the training
dataset could replicate itself in the model output. However, such an analysis would require substantial time and advanced data-scraping skills.
In order to measure “integrity” and quality of output, analyses of the general quality of Copilot’s suggestions are also required. A feasible approach could be to prompt Copilot a large number of times with a variety of programming tasks and test its
accuracy and security. Researchers at Wuhan University conducted a study that analyzed 435
code snippets generated by Copilot from GitHub projects and used multiple security scanners to
identify vulnerabilities (Fu et al., 2023). A number of similar, small studies on code generated with Copilot have been done in the past few years (Huang et al., 2023; Pearce et al., 2021). These studies mostly focused on hard-coded secrets, i.e., embedded credentials, plaintext passwords, and other sensitive information left in source code. Further testing on the security and
accuracy of Copilot-generated code would be beneficial. With unlimited time and resources, this
audit would aim to prompt Copilot with thousands of tasks that consist of generating secure code
and completing assignments at varying levels of difficulty. Through testing and analyzing the
resulting output, we could then gain a greater understanding of the extent to which using Copilot
affects the security and quality of a programmer’s code.
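As a sketch of how the grading half of such an audit might be automated, the hypothetical Python harness below runs a directory of generated solutions against stored test cases and records pass rates. Every file name, path, and JSON layout here is an assumption for illustration, not part of any existing tool.

    import json
    import subprocess
    from pathlib import Path

    def grade_solutions(solutions_dir: Path, cases_file: Path) -> dict[str, float]:
        """Run each generated solution against its test cases; return pass rates."""
        # Hypothetical layout: {"task_name": [{"in": "...", "out": "..."}, ...]}
        cases = json.loads(cases_file.read_text())
        results: dict[str, float] = {}
        for script in sorted(solutions_dir.glob("*.py")):
            task_cases = cases.get(script.stem, [])
            passed = 0
            for case in task_cases:
                proc = subprocess.run(
                    ["python", str(script)],
                    input=case["in"],
                    capture_output=True,
                    text=True,
                    timeout=10,  # guard against non-terminating suggestions
                )
                passed += proc.stdout.strip() == case["out"].strip()
            results[script.stem] = passed / len(task_cases) if task_cases else 0.0
        return results

    if __name__ == "__main__":
        print(grade_solutions(Path("copilot_solutions"), Path("test_cases.json")))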
Proof-of-Concept Audit
To conduct this audit I subscribed to a free trial of the Copilot Individual plan on a brand
new GitHub account and installed the Copilot extension on Visual Studio Code. When
subscribing to Copilot, GitHub prompts the user to select whether they would like Copilot to allow suggestions matching public code and whether they allow GitHub to use their code snippets for product improvements. I chose to allow suggestions from public code and did not allow my code to be used for future model training.
Due to limited time, skills, and resources, this audit analyzed a small set of
code generation prompts. My first task consisted of using Copilot to create a login page for a
simple PHP application. Through this, I aimed to analyze the security of the tool’s code
suggestions for basic applications. For my second task, I asked Copilot to complete my solutions
for 15 different HackerRank problems. HackerRank is an online archive of programming problems similar to those used in some university classes. Many companies
also use HackerRank as a service to assist with technical recruitment. Solutions to HackerRank
problems are tested for both accuracy and performance. By tasking Copilot with completing these problems, I aimed to assess the quality of Copilot’s output and whether it may be used to violate programmer integrity.
Task 1: Secure Application Code Generation
PHP is an open-source scripting language that can be used to build websites and all kinds of web-based applications and services. Wikipedia, WordPress, and Etsy are just a few examples of
commercial websites written in the language. To begin auditing Copilot’s suggestions for a
simple PHP login form, I started with a completely blank project. After writing two lines,
Copilot quickly recommended an entire block of code, lines 7-21, as shown in the image below. While this AI-generated code does compile and complete the task, it already contains security issues. The query on line 10 uses the exact values entered by the user, which makes it vulnerable
to SQL injection attacks. An SQL injection attack is a common web hacking technique that
involves the placement of malicious code in SQL statements via web page input (W3Schools).
A hacker could easily execute arbitrary SQL commands to gain access to user passwords or other
sensitive information. Instead, Copilot should have suggested some lines that “sanitize” user
input before creating a query.
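The standard defense is a parameterized (prepared) query, which binds user input as data rather than concatenating it into the SQL string. Since the audited snippet exists only as a screenshot, here is a minimal sketch of the idea in Python using sqlite3; the table and column names are assumptions for illustration.

    import sqlite3

    def login_ok(conn: sqlite3.Connection, username: str, password_hash: str) -> bool:
        # Vulnerable pattern (what Copilot suggested, in spirit):
        #   f"SELECT * FROM users WHERE username = '{username}' ..."
        # Input such as ' OR '1'='1 would then rewrite the query itself.

        # Safe pattern: placeholders bind input as data, never as SQL.
        cur = conn.execute(
            "SELECT 1 FROM users WHERE username = ? AND password_hash = ?",
            (username, password_hash),
        )
        return cur.fetchone() is not None

In PHP itself, the equivalent fix would be mysqli or PDO prepared statements rather than interpolating form values directly into the query.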
To guard against malicious actors, it is also important to never store direct passwords in a
database (W3Schools). When I prompted Copilot to fill in the code for an account registration
page, it did suggest adding MD5 hashing to the original password entered by the user. This is shown on line 18 in the image below.
However, for some applications, this level of protection is not enough: MD5 is a fast, dated hash, and professional developers generally prefer more advanced password-protection methods, such as salted, deliberately slow hashing (a sketch follows at the end of this section). It is also important to note that with this registration form, the login page Copilot previously suggested would not work correctly; the saved password hash and the plain-text password searched for upon login would never match. Another alarming discovery is that when I attempted
to add an “author” comment on my code, GitHub Copilot suggested auto-completing full names
that were not mine.
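In the audited language, PHP’s built-in password_hash() and password_verify() functions would be the idiomatic fix. The sketch below shows the same idea using only Python’s standard library; the function names and iteration count are illustrative choices, not anything Copilot suggested.

    import hashlib
    import hmac
    import os

    def hash_password(password: str) -> tuple[bytes, bytes]:
        # A random per-user salt defeats the precomputed lookup tables that
        # make unsalted MD5 trivial to reverse for common passwords.
        salt = os.urandom(16)
        digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 600_000)
        return salt, digest

    def verify_password(password: str, salt: bytes, digest: bytes) -> bool:
        candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 600_000)
        # Constant-time comparison avoids leaking information through timing.
        return hmac.compare_digest(candidate, digest)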
Task 2: Problem Solutions
Algorithm practice problems on HackerRank are rated with a difficulty of either easy,
medium, or hard. For this audit, I prompted GitHub Copilot to complete my solutions for 5
randomly selected problems from each of the easy, medium, and hard categories. As OpenAI
Codex is reported to suggest code most accurately for Python, all problems were solved using
Python. Every problem on HackerRank has multiple test cases that must be passed in order for a solution to be accepted. These tests measure both accuracy and performance (HackerRank).
A summary of results is shown in the table below.
Table 1: Summary of HackerRank Problems Solved by GitHub Copilot Code Generation
difficulty   problem                         solved?   required significant editing?
easy         simple array sum                yes       N
easy         minimax sum                     yes       N
easy         day of the programmer           yes       N
easy         subarray division               yes       N
easy         minimum distances               yes       N
medium       queens attack II                yes       Y
medium       encryption                      yes       N
medium       extra long factorials           yes       N
medium       climbing the leaderboard        no        Y
medium       3D surface area                 no        Y
hard         morgan and a string             yes       N
hard         dfs edges                       yes       Y
hard         dijkstra: shortest reach II     no        Y
hard         beautiful 3 set                 no        Y
hard         traveling salesman in a grid    no        Y
Copilot-generated code was able to solve 100% of the selected easy problems. All of the problems in this set had over 90% success rates on HackerRank. Once I began typing a solution in Visual Studio Code, Copilot quickly suggested entire code blocks to complete these simple tasks.
The suggested code even included comments explaining what each line accomplishes. One of
these generated solutions is depicted in the image below.
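Since the screenshot is not reproduced here, the sketch below shows a commented solution of the kind Copilot produced for Simple Array Sum; it is illustrative, not Copilot’s verbatim output.

    def simple_array_sum(ar: list[int]) -> int:
        # Sum every element of the array.
        total = 0
        for value in ar:
            total += value
        return total

    # HackerRank-style input: first line is the element count,
    # second line the space-separated integers.
    if __name__ == "__main__":
        _ = int(input())
        print(simple_array_sum(list(map(int, input().split()))))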
For the selected medium-difficulty problems, Copilot was able to solve about 60%. Out
of the three block suggestions for the Climbing the Leaderboard problem, none accurately
addressed the problem’s context or provided a correct solution. As shown in the image below,
one Copilot suggestion even modified the correct ‘ranked’ and ‘player’ variable names to ‘scores’ and ‘alice’, which do not appear anywhere in the surrounding code. This particular problem has only a 60.73% success rate on HackerRank.
For the Queen’s Attack II problem, with a 68.03% success rate, Copilot similarly generated three
different solutions. Only one of the solutions was accepted after I performed some small edits on
the output. Copilot did accurately generate solutions to the Encryption and Extra Long Factorials problems, which have 91.97% and 95.58% success rates on HackerRank, respectively.
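Extra Long Factorials is a problem where Python itself does much of the work: its arbitrary-precision integers sidestep the overflow that makes the task hard in C or Java. A solution along the lines Copilot suggested (illustrative, not its verbatim output) fits in a few lines.

    def extra_long_factorial(n: int) -> int:
        # Python ints grow without bound, so 25! needs no special handling;
        # a fixed-width 64-bit integer would overflow past 20!.
        result = 1
        for i in range(2, n + 1):
            result *= i
        return result

    if __name__ == "__main__":
        print(extra_long_factorial(int(input())))  # n = 25 prints a 26-digit number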
Copilot-generated code and suggestions assisted in solving 40% of the selected hard-difficulty problems. Copilot provided accurate suggestions for my starter solutions to Morgan
and a String and DFS Edges. However, it failed to produce any valuable suggestions for the other
problems, which all have success rates of around 50% or less on HackerRank. When provided with a description of these problems in natural language, Copilot was unable to generate an accurate solution. When I attempted solutions to these problems with my own code, Copilot continued to suggest various line and block auto-completion options. While some of these suggestions were useful for quickly correcting syntax and auto-completing things such as closing parentheses and variable names, most were incorrect and did not accurately complete what I intended to write.
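For a sense of scale, a hard problem like Dijkstra: Shortest Reach II calls for a full heap-based shortest-path implementation. The sketch below is a textbook version under my reading of the HackerRank format (1-indexed nodes, -1 for unreachable), not code Copilot produced.

    import heapq
    from collections import defaultdict

    def shortest_reach(n: int, edges: list[tuple[int, int, int]], start: int) -> list[int]:
        # Undirected adjacency list; parallel edges are handled naturally.
        graph = defaultdict(list)
        for u, v, w in edges:
            graph[u].append((v, w))
            graph[v].append((u, w))

        INF = float("inf")
        dist = {node: INF for node in range(1, n + 1)}
        dist[start] = 0
        heap = [(0, start)]
        while heap:
            d, u = heapq.heappop(heap)
            if d > dist[u]:
                continue  # stale heap entry
            for v, w in graph[u]:
                if d + w < dist[v]:
                    dist[v] = d + w
                    heapq.heappush(heap, (dist[v], v))

        # Distances to every node except the start, -1 when unreachable.
        return [dist[v] if dist[v] != INF else -1
                for v in range(1, n + 1) if v != start]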
Discussion
While GitHub Copilot is a practical tool that can enhance developer efficiency
(Kalliamvakou, 2022), it is imperative to exercise caution, especially when handling sensitive
data or building applications with security considerations. The tool’s efficacy is evident in resolving approximately 67% of common algorithmic problems, particularly those with high success ratings on HackerRank. As some individuals upload their HackerRank problem solutions to public GitHub repositories, it is likely that some suggested code is taken directly from these users.
However, the accuracy of Copilot’s suggestions diminishes for more complex tasks, and there is
a noteworthy concern regarding the generation of insecure code for common web applications.
GitHub Copilot can only be as secure, accurate, and unbiased as its data set. This is true for any generative AI tool: massive data sets and predictive analytics do not always reflect objective truth (Crawford, 2013). OpenAI Codex is trained on billions of lines of both natural
and programming languages, the majority of which is scraped from public sources. The tool may
not prioritize secure programming practices, often opting for general or “most probable”
suggestions. It is crucial to recognize that the biases and limitations present in the training data
can be reflected in the generated code. As stated by Kate Crawford in The Hidden Biases in Big
Data, “data and data sets are not objective; they are creations of human design. We give numbers
their voice, draw inferences from them, and define their meaning through our interpretations.
Hidden biases in both the collection and analysis stages present considerable risks, and are as
important to the big-data equation as the numbers themselves” (Crawford, 2013). Look no further than OpenAI’s ChatGPT and Meta’s Galactica for examples of large language models’ tendencies to assert prejudice and falsehood as fact. These LLMs “are not really knowledgeable beyond their ability to capture patterns of strings of words and spit them out in a probabilistic
manner” (Heaven, 2022).
In its FAQ, GitHub warns users that “you should always use GitHub Copilot together
with good testing and code review practices and security tools, as well as your own judgment.”
Code generated by Copilot should only be used as a starting point: while it may be efficient at completing simple tasks, it is not possible for the model to be aware of an entire application’s context and the programmer’s intent. Developers must exercise their own judgment and refrain
from relying solely on Copilot-generated code, as the tool’s efficiency may lead less experienced
developers to believe that the suggested code is always correct. While the tool may allow new
developers to “cheat” through simpler tasks, its limitations become apparent in more challenging
assignments, requiring substantial editing and scrutiny by the developer. This was evident, for example, in how Copilot’s suggestions for more complex algorithms, such as the Dijkstra shortest-path problem, required extensive modifications. This caution becomes particularly crucial when handling sensitive information. No one is fully protected from data leakage: GitGuardian reportedly detected 10 million new secrets in public GitHub commits in 2022 (GitGuardian, 2023). While
this audit did not extensively evaluate hard-coded secrets, Copilot’s lack of secure programming
recommendations was evident. Developers must be mindful that models can only be as secure
and efficient as their training data set. While the capabilities of GitHub Copilot are limited, it cannot compromise the security and integrity of developers who remain cautious of these limitations.
Works Cited
Claburn, T. (2022, August 19). GitHub Copilot: Perfect for cheating in CompSci exercises? The Register.
https://www.theregister.com/2022/08/19/copilot_github_students/
Fu, Y., et al. (2023, October 3). Security Weaknesses of Copilot Generated Code in GitHub. Wuhan
University. https://arxiv.org/abs/2310.02059
GitGuardian. (2023). State of Secrets Sprawl Report 2023.
https://www.gitguardian.com/state-of-secrets-sprawl-report-2023
GitHub. (n.d.). GitHub Copilot. https://github.com/features/copilot
HackerRank. (n.d.). Solve algorithms code challenges. HackerRank.
https://www.hackerrank.com/domains/algorithms
Huang, Y., et al. (2023, September 14). Neural Code Completion Tools Can Memorize Hard-coded
Credentials, Hong Kong University. https://arxiv.org/abs/2309.07639
Kalliamvakou, E. (2022, September 7). Research: Quantifying GitHub Copilot’s impact on developer productivity and happiness. The GitHub Blog.
https://github.blog/2022-09-07-research-quantifying-github-copilots-impact-on-developer-productivity-and-happiness/
Pearce, H., et al. (2021, August 20). Asleep at the Keyboard? Assessing the Security of GitHub Copilot’s Code Contributions. IEEE Symposium on Security and Privacy 2022. https://arxiv.org/abs/2108.09293
Radchenko, V. (2023, April 22). GitHub Copilot security concerns. Medium.
https://vlad-rad.medium.com/github-copilot-security-conserns-d4209f0d5c28
Rawat, A. (2022, April 21). GitHub Copilot: All you need to know. Medium.
https://medium.com/analytics-vidhya/github-copilot-all-you-need-to-know-8e6fc1d5ccc
Segura, T. (2023, October 12). Yes, GitHub’s Copilot can leak (real) secrets. GitGuardian Blog – Automated
Secrets Detection. https://blog.gitguardian.com/yes-github-copilot-can-leak-secrets/
W3Schools. (n.d.). SQL Injection. https://www.w3schools.com/sql/sql_injection.asp
Zaremba, W., Brockman, G., & OpenAI. (2021, August 10). OpenAI Codex.
https://openai.com/blog/openai-codex