By Dwayne McDaniel, Developer and Security Advocate, GitGuardian
The modern world of DevOps means relying on our code connecting to outside services and components imported at run time. All of this access is predicated on secrets, the credentials such as API keys and passwords granting any needed access. Ideally, these secrets should be stored safely in vaults, secret management platforms, or `.env` files located safely outside of version control.
Unfortunately, all too often, secrets end up in places they shouldn’t, such as in the code as plaintext or in a `.env` file shipped with the project and visible to anyone who has access. This continues to be a growing problem, as evidenced by the millions of secrets GitGuardian reported in our annual report.
Furthermore, secrets sprawl is not limited to code produced in-house. It is also a serious problem for the third-party software we incorporate into our ecosystems. Unlike our custom code, which is usually meant to run within our data centers or cloud providers, third-party code, such as PyPI packages, is most often intended to be freely distributed as open-source software. Any credentials it includes could be seen by hundreds or potentially even millions of developers before the issue is discovered.
*How widespread is secrets sprawl in PyPI?
At GitGuardian, we worked with security researcher Tom Forbes to scan every PyPI project for embedded secrets. PyPI, the Python Package Index, serves the Python community as the official third-party package management platform. We analyzed over 450,000 projects containing over 9.4 million files across 5 million released versions. This is what we found:
- Total unique secrets found: 3,938
- Unique secrets found to be valid: 768
- Total occurrences of secrets across all releases: 56,866
- Projects containing at least one unique secret: 2,922
- Individual types of secrets detected: 151
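As a quick back-of-the-envelope check on the figures above, the short sketch below derives two ratios from them: how often, on average, each unique secret recurs across releases, and what share of unique secrets were confirmed valid. The variable names are ours, chosen for illustration.

```python
# Figures taken from the list above.
unique_secrets = 3_938
valid_secrets = 768
total_occurrences = 56_866

# Average number of times each unique secret appears across all releases.
occurrences_per_secret = total_occurrences / unique_secrets

# Share of unique secrets that were confirmed valid.
valid_share = valid_secrets / unique_secrets

print(f"{occurrences_per_secret:.1f} occurrences per unique secret")  # 14.4
print(f"{valid_share:.1%} of unique secrets confirmed valid")         # 19.5%
```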
Caption: Distinct secrets by detector over time
*The files containing the most secrets
Given that the research was on Python code, it should not be a surprise that files with the `.py` extension were the number one source of hardcoded credentials. Next most common were configuration and documentation files, such as `.json` and `.yml` files. We also found valid secrets in some unexpected places, such as 209 README files and test folders with 675 unique secrets.
Most common types of files other than .py containing a hardcoded secret in PyPI packages
*Emergent trends
While everything from Redis credentials to Azure keys was found among the releases, a few notable trends became apparent in our analysis:
- Google API key leaks have grown steadily over time, including a very large spike that occurred in 2020.
- Valid Telegram bot tokens have been leaked with increasing frequency, notably doubling in the first part of 2021 and spiking again in early 2023.
- A significant spike in leaked database credentials started in 2022 and continued through the end of the research window.
*Same secret, different releases
One thing that might stand out from these findings is the unbalanced ratio of unique secrets found vs total found across all releases. This is evidence that once a developer adds and publishes a secret, it is likely going to stay in the code across multiple releases. This is due, in part, to the fact that publishing tools lack sensible defaults for ignoring files. PyPI lacks safeguards for what you exclude from a distribution.
For example, Python does not honor `.gitignore` settings when a package is built. While `.gitignore` is great for keeping files out of your git history, that is the whole of its job. There are solutions, such as `setuptools-git`, that help guard against accidental inclusion. This works for local configuration files, like `.cookiecutterrc` and `.pypirc` files. For reference, we found 43 `.pypirc` files containing PyPI publishing credentials.
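One way to catch such files before publishing is to list the contents of the built archive and flag suspicious names. The following is a minimal sketch using only the standard library; the `RISKY_PATTERNS` deny-list and both function names are hypothetical and should be adapted to your project.

```python
import fnmatch
import tarfile
import zipfile
from pathlib import PurePosixPath

# Hypothetical deny-list of file names that rarely belong in a distribution.
RISKY_PATTERNS = (".env", ".env.*", ".pypirc", ".cookiecutterrc", "*.pem", "id_rsa")


def risky_members(names):
    """Return the archive member paths whose file name matches a risky pattern."""
    hits = []
    for name in names:
        base = PurePosixPath(name).name
        if any(fnmatch.fnmatchcase(base, pattern) for pattern in RISKY_PATTERNS):
            hits.append(name)
    return hits


def archive_members(path):
    """List the member paths of a wheel (zip) or an sdist (tar archive)."""
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as archive:
            return archive.namelist()
    with tarfile.open(path) as archive:
        return archive.getnames()


# Example usage against a built archive (the path is a placeholder):
# print(risky_members(archive_members("dist/yourpackage-1.0.tar.gz")))
```

A check like this only looks at file names, so it complements a real secrets scan rather than replacing one.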
*Yanked files are still accessible
When a developer releases something they didn’t intend to, their instinct might be to yank it back out of the project. Unfortunately, the yanking mechanism in PyPI does not actually remove the file from the server; it only marks the file to be ignored by an installer by default. If a user specifies the yanked version, it will still be used. The file is still downloadable, likely forever. Files are only completely removed from PyPI if they have known malicious code.
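To see this for yourself, you can look at a project's metadata and list the files that are marked as yanked yet still have a download URL. The sketch below assumes the response shape of PyPI's JSON API at the time of writing (a `releases` mapping of version to file entries, each with `filename`, `url`, and `yanked` fields); the function names are ours.

```python
import json
import urllib.request


def fetch_project_json(project):
    """Download a project's metadata from PyPI's JSON API (requires network access)."""
    url = f"https://pypi.org/pypi/{project}/json"
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def yanked_files(project_json):
    """Return (version, filename, url) for every file marked as yanked.

    Assumes each file entry carries 'filename', 'url', and 'yanked' keys.
    """
    found = []
    for version, files in project_json.get("releases", {}).items():
        for file_info in files:
            if file_info.get("yanked"):
                found.append((version, file_info["filename"], file_info["url"]))
    return found


# Example usage (the project name is a placeholder):
# for version, filename, url in yanked_files(fetch_project_json("your-project")):
#     print(version, filename, url)
```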
*Valid secrets granting unintended access
Here is a partial list of the most common types of valid secrets we found, which could give anyone access to the associated systems.
- Auth0 Keys
- Azure Active Directory API Keys
- Database credentials for providers such as MongoDB, MySQL, and PostgreSQL.
- Dropbox Keys.
- Coinbase Credentials
- GitHub OAuth App Keys.
- SSH Credentials
While it is tempting to focus on the larger numbers of total occurrences found, the secrets found to be valid pose the most immediate and critical threat. The researchers used ggshield, the GitGuardian CLI, which looks for over 400 types of secrets using both specific detectors and generic patterns, and which has a built-in validation process. Not all secrets can be checked for validity, but at the time the research was conducted in October 2023, over 190 specific types of credentials could be validated.
It is important to note that just because a credential cannot be validated does not mean it should be considered invalid. Some systems, such as HashiCorp Vault, Kubernetes clusters, Okta, or Splunk, do not yet offer a non-intrusive way to test if a credential is valid. Rather, you should think of these findings as divided into ‘valid’ and ‘yet to be validated.’
Work safely
Here are some tips on how to avoid accidentally including secrets in your PyPI packages, or in any other project.
*Avoid plaintext credentials in code
If you never add a secret to your code, then there is no way for it to end up in your PyPI package. Easier said than done, we admit, but this is a skill just as valuable as avoiding infinite loops or stack overflows in your code. There are multiple tools that make it easy to programmatically call read-only values from files outside of version control, such as `python-dotenv`.
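As a minimal sketch of that approach, the snippet below reads a secret from the environment instead of the source code. It assumes `python-dotenv` may or may not be installed and falls back to plain environment variables if it is not; `get_secret` and the variable name `EXAMPLE_API_KEY` are hypothetical.

```python
import os

try:
    # python-dotenv is a third-party package. If it is installed, load_dotenv()
    # copies KEY=value pairs from a local .env file into os.environ.
    from dotenv import load_dotenv

    load_dotenv()
except ImportError:
    pass  # Fall back to variables already set in the environment.


def get_secret(name):
    """Read a secret from the environment and fail loudly if it is missing."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required secret: {name}")
    return value


# Example usage (hypothetical variable name):
# api_key = get_secret("EXAMPLE_API_KEY")
```

Failing early on a missing value makes it less tempting to paste a fallback credential into the code as a default.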
While a well-managed `.env` file is a practical solution, you can stay even safer by leveraging Cloud Secrets Managers, such as Azure Key Vault or AWS Secrets Manager. These secrets managers can be used to create and use secrets across cloud infrastructure, come standard with most modern cloud providers, and are very well documented.
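For illustration, here is a small sketch of reading a value from AWS Secrets Manager. It assumes you pass in a `boto3` Secrets Manager client, that the secret is stored as a string rather than binary, and that the secret name shown is a placeholder; `fetch_secret` is our own helper, not part of any library.

```python
def fetch_secret(client, secret_id):
    """Fetch a string secret using a Secrets Manager client.

    `client` is expected to behave like boto3.client("secretsmanager").
    Binary secrets are not handled in this sketch.
    """
    response = client.get_secret_value(SecretId=secret_id)
    return response["SecretString"]


# Example usage (requires boto3 and configured AWS credentials):
# import boto3
# client = boto3.client("secretsmanager")
# db_password = fetch_secret(client, "example/database/password")
```

Passing the client in as an argument keeps the helper easy to test without touching a real cloud account.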
*Scan before you release
Removing a secret from an uncommitted file is easy and very inexpensive. Removing that same secret from shared code is practically impossible and a time drain. We always want to ‘shift left’ and test early and often, especially when secrets are involved. Performing a secrets scan before you release, or before you even make a commit, is the most cost-effective way to ensure a secret does not get leaked.
There are multiple tools that will let you automate the scanning process, such as ggshield, which you can use in a pre-commit Git hook. Aside from just finding the secret, any good scanner will also provide information such as type, number of occurrences, and whether the secret is valid.
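To give a feel for what pattern-based detection involves, here is a deliberately tiny sketch. It is not how ggshield works internally; the two patterns and the `scan_text` function are illustrative assumptions, and a real scanner covers far more secret types and adds validity checks.

```python
import re

# Illustrative patterns only; a real scanner ships hundreds of detectors.
PATTERNS = {
    "AWS access key ID": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "Private key header": re.compile(
        r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"
    ),
}


def scan_text(text):
    """Return (line_number, label) for every line matching a known pattern."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((line_number, label))
    return findings


# The key below is the placeholder value used in AWS documentation.
sample = 'name = "demo"\naws_key = "AKIAIOSFODNN7EXAMPLE"\n'
print(scan_text(sample))  # [(2, 'AWS access key ID')]
```

Running a check like this from a pre-commit hook means a finding blocks the commit, which is the cheapest point at which to fix it.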
*PyPI secrets sprawl is solvable
Unique secrets added over time
The research ultimately reveals the disturbing trend that the number of secrets being added to PyPI is growing steadily over time. In the last year alone, the research shows over 1,000 unique secrets have been added via new projects and commits on PyPI. While this might sound discouraging, this is a challenge we believe can be addressed through raising awareness, education and ever-improving developer tooling. We hope the findings of this report help you with raising the issue within your organizations and projects.
The Python community continues to innovate and work to make all developers’ lives better. Donating useful code back to the community is something we hope to see more people do, but we want to see it done safely. GitGuardian can help you work safely and keep your projects free of secrets. The GitGuardian Secrets Detection platform is free for open source contributions and teams with 25 or fewer developers. We want to make sure your shared code contains only the intended logic and not your valid secrets.
> Hear directly from Tom Forbes about his PyPI research in his appearance on The Security Repo Podcast.
EMBED: https://www.youtube.com/watch?v=AhH0aGFPoO4
About the Author
Dwayne McDaniel, GitGuardian Developer and Security Advocate, has been working as a Developer Relations professional since 2016 and has been involved in the wider tech community since 2005. He loves sharing his knowledge.
Dwayne can be reached online at https://www.linkedin.com/in/dwaynemcdaniel/ and @McDwayne and at our company website http://www.gitguardian.com/