AverageDevs
DevOps

Secrets Management in Modern Infrastructure Using Vault or SSM

Real-world patterns, operational tradeoffs, and failure scenarios for managing secrets at scale with HashiCorp Vault and AWS SSM Parameter Store.

Secrets management is one of those problems that looks simple until you run a production system at scale. You start with environment variables hardcoded in a deployment script. Then someone commits an API key to GitHub. Then you discover your database password has not been rotated in eighteen months because doing so would require coordinated downtime across six services. Then an auditor asks how you know which human accessed which credential on which date, and you realize you do not have an answer.

This article examines secrets management not as a configuration problem but as an operational discipline. We will focus on HashiCorp Vault and AWS Systems Manager (SSM) Parameter Store because they represent two dominant approaches: Vault as a purpose-built security boundary, SSM as a convenience layer inside AWS. The goal is not to pick a winner but to clarify when each tool fits, what breaks in practice, and how to build a system that survives rotation, audits, and incidents.

Why Secrets Management Fails in Practice

Most secrets failures are not technical. They are organizational and operational. You introduce Vault to centralize secrets, but developers still hardcode database URIs in CI because Vault adds latency to their feedback loop. You enforce rotation policies, but the on-call engineer disables them at 3 AM because the orchestration tooling cannot handle credential changes mid-flight. You build perfect IAM policies, but someone screenshots a Slack thread containing production credentials because troubleshooting over shared screens is faster than provisioning temporary access.

The primary failure mode is friction. If your secrets system requires three Jira tickets, two Slack approvals, and a deployment to update a single API token, engineers will route around it. They will paste secrets into .env files, share them in wikis, or stuff them into base64-encoded ConfigMaps. Your security posture degrades not because the tooling is bad but because the operational cost is too high.

The secondary failure mode is invisibility. Static secrets look fine until they leak. If you store a PostgreSQL password in SSM and rotate it every 90 days, you have improved compliance but not security. If an attacker compromises an EC2 instance, they can use its instance-profile credentials from the metadata service (even with IMDSv2 enforced, code running on the box can read them) to pull that password from SSM, exfiltrate your database, and leave. The rotation schedule does not matter because the breach window was minutes, not months.

Dynamic secrets address this by issuing short-lived credentials tied to a specific workload. Vault can generate a PostgreSQL role with a two-hour TTL, scoped to a single schema. If that workload is compromised, the blast radius is limited by time and privilege. But dynamic secrets require infrastructure buy-in. Your database must support programmatic user creation. Your application must handle credential refresh mid-lifecycle. Your team must accept that secrets are no longer static configuration but runtime state. For organizations migrating from twelve-factor apps built on static environment variables, that shift is non-trivial.

The Blast Radius Problem and Secret Sprawl

As systems grow, secrets proliferate. A monolith might have five database credentials and three API keys. A microservices architecture with thirty services, six data stores, and integrations to Stripe, Twilio, Auth0, and Datadog might have two hundred secrets. If those secrets are stored in SSM with broad IAM policies, every service can theoretically access every secret. The blast radius of a single compromised instance is the entire secret footprint.

Vault mitigates this with policy-based access control. Each service gets a token scoped to a specific set of paths. The payments service can read secret/data/stripe but not secret/data/postgres/analytics. The analytics service gets read-only database credentials for the analytics cluster but cannot touch production payment data. This segmentation is not automatic. It requires deliberate policy design and enforcement at the orchestration layer. If your deployment tooling provisions Vault tokens with root policies because debugging is easier that way, you have added complexity without reducing risk.
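Sketched in Vault's policy language, the segmentation above might look like this (the paths match the examples in this section; the KV v2 `data/` prefix is assumed):

```hcl
# Policy attached to the payments service's token:
# it can read the Stripe credentials and nothing else.
path "secret/data/stripe" {
  capabilities = ["read"]
}

# Vault denies by default, but an explicit deny protects against
# surprises when multiple policies are stacked on one token.
path "secret/data/postgres/*" {
  capabilities = ["deny"]
}
```

The analytics service would get a separate policy granting only its own paths; neither token can read the other's secrets even if both reach the same Vault cluster.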

SSM Parameter Store does not enforce path-based policies by default. You can structure parameters hierarchically and write IAM policies to restrict access, but those policies are external to SSM. If an engineer accidentally grants ssm:GetParameter with a wildcard resource, the hierarchy collapses. Vault policies are enforced inside Vault. SSM relies on IAM, which is flexible and powerful but also the source of most AWS misconfigurations.
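For comparison, the SSM side of that segmentation lives entirely in IAM. A sketch of a least-privilege policy scoped to one service's slice of the hierarchy (the account ID and parameter path are hypothetical):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ssm:GetParameter",
        "ssm:GetParameters",
        "ssm:GetParametersByPath"
      ],
      "Resource": "arn:aws:ssm:us-east-1:123456789012:parameter/payments/*"
    }
  ]
}
```

The hierarchy only holds if every role's `Resource` stays this narrow; a single `parameter/*` wildcard in one policy collapses the segmentation for that principal.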

The operational lesson here is that tooling alone does not solve secret sprawl. You need conventions: naming schemes, path hierarchies, and automated policy generation. Without those, Vault becomes a complicated key-value store, and SSM becomes a dumping ground.

Static Secrets vs Dynamic Secrets: When Each Is Realistic

Dynamic secrets are the gold standard, but they are not always practical. Consider a third-party SaaS API like Stripe or SendGrid. You cannot dynamically generate Stripe API keys. Stripe issues them manually through their dashboard, and you rotate them by hand. Storing that key in Vault or SSM does not change its lifecycle. You still have a long-lived credential. The value is centralization and audit logging, not dynamic issuance.

Dynamic secrets shine for infrastructure you control. Vault can generate database credentials, AWS IAM credentials, SSH certificates, and PKI certificates on demand. A Lambda function requests a PostgreSQL credential with a five-minute TTL. Vault connects to the database, runs CREATE USER, and returns the credentials. The function runs its query and exits. Vault revokes the credential after five minutes. If the function is compromised, the attacker has a credential that expires before they can monetize it.
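A minimal sketch of that flow against Vault's HTTP API (the role name and refresh margin are illustrative; production code would also handle lease renewal and non-200 responses):

```python
import json
import time
import urllib.request


def fetch_db_credentials(vault_addr: str, token: str, role: str) -> dict:
    """Request short-lived PostgreSQL credentials from Vault's database
    secrets engine (GET /v1/database/creds/<role>)."""
    req = urllib.request.Request(
        f"{vault_addr}/v1/database/creds/{role}",
        headers={"X-Vault-Token": token},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
    return {
        "username": body["data"]["username"],
        "password": body["data"]["password"],
        "issued_at": time.time(),
        "lease_duration": body["lease_duration"],  # seconds, e.g. 300
    }


def needs_refresh(creds: dict, margin: float = 30.0) -> bool:
    """True when the lease is within `margin` seconds of expiry, so the
    caller can fetch a fresh credential before Vault revokes this one."""
    age = time.time() - creds["issued_at"]
    return age >= creds["lease_duration"] - margin
```

The caller checks `needs_refresh` before reusing a cached credential; with a five-minute lease and a 30-second margin, a new credential is requested at most once per invocation.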

This model requires tight integration. Your Lambda must handle credential expiry. If the function runs for six minutes, it fails when the credential is revoked. You need retry logic, connection pooling that respects credential rotation, and monitoring that alerts on auth failures. For a team running stateless HTTP services with sub-second response times, dynamic credentials are feasible. For batch jobs that run for hours or legacy apps that cache database connections, they are not.

The pragmatic approach is hybrid. Use dynamic secrets for high-churn, high-risk workloads like production databases and admin access. Use static secrets with automated rotation for third-party APIs and legacy systems. Store both in Vault or SSM, but do not pretend they have the same security properties. A rotated static secret is better than an un-rotated one, but it is not equivalent to a credential with a five-minute lifespan.

IAM as a Control Plane, Not an Afterthought

If you are running on AWS, IAM is unavoidable. Even if you use Vault, your Vault cluster runs on EC2 or ECS, and those instances need IAM roles. The bootstrapping problem is real: how do you give a service access to Vault without hardcoding a Vault token, which itself is a secret?

The answer is AWS IAM authentication. Vault can validate that an EC2 instance or ECS task has a specific IAM role, then issue a Vault token scoped to that role's policies. The service authenticates to Vault using its instance metadata credentials, which AWS manages automatically. No secrets in environment variables. No tokens in CI pipelines. The trust boundary is established by AWS infrastructure.

This pattern works, but it shifts the problem to IAM policy design. If your IAM roles are over-permissioned, Vault inherits that weakness. If every service runs with the same IAM role because managing granular roles is tedious, Vault cannot differentiate between them. The security model degrades to perimeter defense: once inside AWS, everything can access everything.

For SSM, IAM is the only control plane. Access is governed entirely by IAM policies. This is operationally simpler than Vault but less expressive. IAM policies are attached to roles, users, and resource tags. They do not support dynamic policy generation or first-class time-bounded grants. If you need temporary elevated access, you manually assume a role, retrieve the secrets, and rely on the session credentials expiring; there is no lease you can revoke early. Vault supports token TTLs and renewable leases, which model temporary access more cleanly.

The tradeoff is operational overhead. Vault requires running and securing a Vault cluster. SSM requires designing IAM policies carefully and auditing them regularly. Both demand discipline. The difference is where you place that discipline.

Vault as a Security Boundary vs SSM as an Operational Convenience

Vault is purpose-built for secrets. It encrypts data at rest and in transit, enforces access policies, logs every read and write, and supports plugins for dynamic secret generation. It is also a complex distributed system with quorum requirements, unsealing ceremonies, and failure modes that require deep understanding of Raft consensus.

SSM Parameter Store is a regional AWS service. It stores encrypted parameters backed by KMS, supports versioning, and integrates with CloudFormation and ECS. It is not a distributed system you manage. It has an SLA, but you do not provision nodes, tune performance, or debug replication lag. The tradeoff is flexibility. SSM does what AWS designed it to do, and if your use case falls outside that design, you are stuck.

Vault gives you control. You can run it in multiple regions, federate policies across clusters, and integrate with non-AWS systems like on-premise databases or Kubernetes clusters in GCP. You can write custom secret engines for proprietary systems. You own the availability and performance. If Vault is slow, you scale it. If it goes down, you fix it. For security-sensitive organizations or multi-cloud environments, that control justifies the operational cost.

SSM is convenient. You call aws ssm get-parameter, and AWS handles the rest. If you are all-in on AWS, tightly integrated with ECS and Lambda, and do not need advanced features like dynamic secrets or cross-region federation, SSM is sufficient. The failure case is not catastrophic misconfiguration but gradual feature creep. You start hitting limits: parameter size, throughput, or policy expressiveness. By then, migrating to Vault is expensive.

A practical pattern is to use both. Store application secrets in Vault. Store AWS-specific configuration like AMI IDs, VPC settings, and deployment flags in SSM. Let each tool do what it does best. The cost is integration complexity, but if your infrastructure spans multiple clouds or hybrid environments, that complexity is unavoidable.

Latency, Availability, and Dependency Risks

Secrets systems add a dependency to your critical path. Every service boot, credential refresh, and rotation event queries your secrets backend. If that backend is slow or unavailable, your services do not start, or they crash mid-operation.

Vault runs on infrastructure you control. If you run it in a single availability zone and that zone fails, your entire secrets plane goes offline. Services cannot retrieve new credentials. Existing credentials eventually expire. Your application degrades to read-only or stops entirely. The mitigation is multi-region Vault with replication, which introduces consistency challenges. Vault supports performance replication and disaster recovery replication. Performance replication is eventually consistent, which matters if you rotate a secret in us-east-1 and immediately try to read it in eu-west-1. DR replication is for failover, not active-active operation.

SSM is regional and highly available by design. If us-east-1 SSM fails, AWS has bigger problems. But SSM has rate limits. If you deploy fifty services simultaneously and each one fetches twenty parameters on boot, you hit throttling. The error messages are opaque. Services fail to start, and your deployment rolls back. The fix is client-side caching and exponential backoff, but that logic must be built into every service or centralized into a sidecar.
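The client-side mitigation can be sketched as batching plus full-jitter backoff (the batch size of 10 matches the GetParameters per-call limit; the backoff parameters are illustrative, and the client is passed in so the logic is testable without AWS):

```python
import random
import time


def chunks(names, size=10):
    """GetParameters accepts at most 10 names per call, so split the
    full parameter list into batches of that size."""
    return [names[i:i + size] for i in range(0, len(names), size)]


def backoff_delay(attempt, base=0.2, cap=5.0):
    """Full-jitter exponential backoff: a random delay between 0 and
    min(cap, base * 2**attempt) seconds."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def fetch_all(client, names, max_attempts=5):
    """Fetch parameters in batches, retrying throttled batches.
    `client` is anything with a boto3-style get_parameters method."""
    results = {}
    for batch in chunks(names):
        for attempt in range(max_attempts):
            try:
                resp = client.get_parameters(Names=batch, WithDecryption=True)
                for p in resp["Parameters"]:
                    results[p["Name"]] = p["Value"]
                break
            except Exception:  # real code would catch only throttling errors
                time.sleep(backoff_delay(attempt))
        else:
            raise RuntimeError(f"gave up fetching batch {batch}")
    return results
```

Centralizing this in a shared library or sidecar keeps fifty services from each reimplementing it slightly differently.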

The operational lesson is that secrets infrastructure is critical infrastructure. You need monitoring, alerting, and runbooks. If Vault is down, what is your fallback? Do you have cached credentials? Can you manually inject secrets for emergency deploys? If SSM is throttling, can you batch parameter fetches or stagger service restarts?

A realistic pattern is to cache secrets locally with a TTL. A service fetches credentials from Vault or SSM on boot, stores them in memory, and refreshes them periodically. If the secrets backend is unreachable during refresh, the service continues with cached credentials until they expire. This reduces dependency on real-time availability but introduces risk. If you revoke a credential because of a suspected breach, cached copies remain valid until their TTL expires. You balance operational resilience against security responsiveness.
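That cache might be sketched as follows, with a pluggable fetch function and clock for testability (the TTL values are illustrative):

```python
import time


class SecretCache:
    """Cache a secret in memory; refresh after `refresh_after` seconds,
    but keep serving the stale copy until `max_age` if the backend
    (Vault or SSM) is unreachable during a refresh attempt."""

    def __init__(self, fetch, refresh_after=300, max_age=3600, clock=time.time):
        self._fetch = fetch              # callable returning the secret value
        self._refresh_after = refresh_after
        self._max_age = max_age
        self._clock = clock
        self._value = None
        self._fetched_at = None

    def get(self):
        now = self._clock()
        stale = (self._fetched_at is None
                 or now - self._fetched_at >= self._refresh_after)
        if stale:
            try:
                self._value = self._fetch()
                self._fetched_at = now
            except Exception:
                # Backend unreachable: serve the cached value unless it
                # has aged past max_age, at which point fail loudly.
                if (self._fetched_at is None
                        or now - self._fetched_at >= self._max_age):
                    raise
        return self._value
```

The `max_age` ceiling is the security-responsiveness knob: it bounds how long a revoked credential can keep working from cache.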

Rotation Strategies That Do Not Break Production

Credential rotation is one of those practices everyone agrees is good until they try to implement it. The theory is simple: periodically change secrets to limit exposure windows. The practice is brutal. Database passwords cannot be changed atomically across a fleet of stateful services. API tokens cannot be rotated mid-request. TLS certificates cannot be replaced without downtime unless you have carefully orchestrated zero-downtime reloads.

The problem is consistency. If you rotate a PostgreSQL password in Vault, every service using that password must fetch the new version and reconnect. If one service misses the update because it was restarting during rotation, it crashes when its cached connection expires. If you use dynamic secrets with short TTLs, you avoid this problem because credentials expire naturally. But if you use static secrets with manual or scheduled rotation, you need orchestration.

Dual-write is a common pattern. When rotating a secret, you create the new version but keep the old version valid for a grace period. Services gradually migrate to the new secret. Once telemetry confirms zero usage of the old secret, you delete it. This works for API tokens and database passwords but requires that the downstream system supports multiple active credentials. Not all systems do.
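The dual-write sequence can be sketched as a small orchestration step; the `backend` and `telemetry` interfaces here are hypothetical stand-ins for your secrets store and metrics system:

```python
def dual_write_rotate(backend, telemetry, secret_name, new_value):
    """Dual-write rotation sketch:
      1. Publish the new version alongside the old one (old stays valid).
      2. Wait until telemetry reports zero reads of the old version.
      3. Only then revoke the old version."""
    old_version = backend.current_version(secret_name)
    new_version = backend.add_version(secret_name, new_value)
    if telemetry.reads_of(secret_name, old_version) > 0:
        # Still in the grace period; a scheduler re-invokes this later.
        return ("grace-period", old_version, new_version)
    backend.revoke_version(secret_name, old_version)
    return ("rotated", old_version, new_version)
```

The telemetry check is the part most teams skip, and it is the part that prevents deleting a credential a straggler service is still using.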

Another pattern is blue-green rotation. You run two sets of credentials: blue and green. All services use blue. You rotate green and deploy updated services that use green. Once all services are on green, you rotate blue and flip back. This adds operational complexity but eliminates the consistency problem. The downside is that you are managing twice as many credentials.

For high-churn systems, dynamic secrets with short TTLs eliminate rotation entirely. The credential lifespan is the rotation interval. A five-minute credential does not need rotation because it never lives long enough to matter. But you pay the cost in complexity: applications must handle expired credentials gracefully, retry logic must distinguish between transient auth failures and real errors, and monitoring must differentiate between expected expiry and breach indicators.
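The retry distinction might be sketched like this; the error-classification predicate is an assumption, since what counts as an authentication failure is driver-specific:

```python
def with_credential_refresh(run_query, refresh_credentials, is_auth_error):
    """Run a query; if it fails with an authentication error (an expired
    dynamic credential), refresh once and retry. Any other error, or a
    second auth failure, propagates -- repeated auth failures may signal
    revocation or a breach, not routine expiry, and deserve an alert."""
    try:
        return run_query()
    except Exception as exc:
        if not is_auth_error(exc):
            raise
        refresh_credentials()
        return run_query()
```

Limiting the refresh to a single retry is deliberate: retrying auth failures in a loop hides exactly the signal your monitoring needs.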

The unglamorous truth is that rotation works best when systems are designed for it from the start. Retrofitting rotation onto legacy applications that cache connections for hours or days is painful. You add retry logic, implement connection pooling that respects credential updates, and test failure modes you never anticipated. For teams operating legacy infrastructure, rotation becomes a long-term architectural goal rather than a quick operational win.

Bootstrapping Trust and the Chicken-and-Egg Problem

The hardest question in secrets management is: how do you give a service its first secret? If secrets are in Vault, the service needs a Vault token to retrieve them. But the Vault token is itself a secret. If you bake it into a container image, you have hardcoded a credential. If you pass it via an environment variable, it appears in process listings and logs. If you retrieve it from another secrets manager, you have moved the problem one layer up.

The practical answer on AWS is IAM authentication. A service authenticates to Vault using its IAM instance profile. Vault validates the instance identity document, checks that the instance has the expected IAM role, and issues a token. The instance never stores a Vault token. It authenticates on demand using credentials managed by AWS.

This pattern works, but it ties your security model to AWS IAM. If your IAM roles are too broad, an attacker who compromises one service can impersonate others. If you are running in GCP or on-premise, you need a different bootstrap mechanism: Kubernetes service accounts, signed JWTs, or AppRole workflows where a trusted orchestrator provisions secrets during deployment.

For SSM, the bootstrap is simpler. IAM instance profiles grant access to SSM parameters directly. The service calls aws ssm get-parameter with its instance credentials. No intermediate token. No unsealing ceremony. The tradeoff is that IAM becomes your entire security boundary. If IAM is misconfigured, your secrets are exposed.

Another pattern is trust-on-first-use (TOFU). The service generates a one-time token during deployment, uses it to authenticate to Vault, and receives a renewable token. The one-time token is immediately revoked. This works for long-lived services but not for ephemeral workloads like Lambda functions, which may not have time to renew tokens between invocations.

The deeper challenge is not technical but cultural. Teams must accept that secrets are runtime state, not compile-time configuration. Secrets cannot be checked into repos or baked into images. They must be fetched dynamically from a trusted source. For organizations accustomed to twelve-factor apps where configuration is environment variables loaded at boot, this shift requires rethinking deployment pipelines, local development workflows, and testing strategies.

Secrets in CI and Deployment Pipelines

CI and CD pipelines are a common leak vector. A deployment script needs database credentials to run migrations. A build step needs an npm token to fetch private packages. A test suite needs API keys for integration tests. All of those secrets must be available in the CI environment, which is often less secure than production.

The typical pattern is to store secrets in the CI system's secret manager: GitHub Actions secrets, GitLab CI variables, or Jenkins credentials. These systems encrypt secrets at rest and inject them as environment variables at runtime. The problem is visibility. Once a secret is in the CI environment, any script in that job can access it. A malicious dependency in your npm install step can exfiltrate secrets. A typo in a shell script can log them to stdout, which is then stored in CI artifacts forever.

Vault and SSM improve this by limiting secret scope. Instead of storing database credentials in GitHub Actions, you store a Vault token or AWS credentials that can only access specific secrets for the duration of the job. The CI job authenticates to Vault or AWS, fetches secrets, uses them, and exits. The token expires. If the job logs are compromised, the secrets are already invalid.
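In GitHub Actions, that pattern might look like the following step, using the hashicorp/vault-action integration with OIDC-based JWT auth (the Vault URL, role, and secret path are hypothetical; verify the exact input names against the action's documentation):

```yaml
# Hypothetical job step: authenticate to Vault with the repository's
# OIDC identity and pull one scoped secret for the duration of the job.
- name: Fetch deploy credentials
  uses: hashicorp/vault-action@v3
  with:
    url: https://vault.example.internal
    method: jwt
    role: ci-deploy
    secrets: |
      secret/data/ci deploy_token | DEPLOY_TOKEN
```

No long-lived token is stored in the CI system at all; the Vault token issued for the job expires with it.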

For deployment pipelines, the challenge is propagation. Your pipeline deploys a service to ECS. That service needs database credentials. Do you fetch them in the pipeline and inject them as environment variables? Or do you configure the service to fetch them from Vault or SSM on boot? The former is simpler but less secure. The latter is more secure but requires that the service has network access to Vault or SSM at startup, which fails if your network policies are too restrictive.
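One AWS-native middle ground is letting ECS resolve SSM parameters at task start, so the pipeline never handles the values. A task definition fragment (the account ID, image, and parameter path are hypothetical; the task execution role needs ssm:GetParameters on that path):

```json
{
  "containerDefinitions": [
    {
      "name": "api",
      "image": "example/api:latest",
      "secrets": [
        {
          "name": "DB_PASSWORD",
          "valueFrom": "arn:aws:ssm:us-east-1:123456789012:parameter/api/db_password"
        }
      ]
    }
  ]
}
```

The secret reaches the container as an environment variable without ever appearing in the task definition, the pipeline, or its logs.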

A realistic pattern is to separate deployment-time secrets from runtime secrets. Deployment-time secrets, like SSH keys for Git repos or credentials for artifact registries, live in CI. Runtime secrets, like database passwords and API keys, are fetched by the application from Vault or SSM after deployment. This minimizes the blast radius of a CI compromise and keeps long-lived credentials out of pipeline logs.

For local development, engineers need access to secrets without copying production credentials to their laptops. Vault namespaces or SSM parameter paths can provide development-specific credentials that work against staging databases. Engineers authenticate with their corporate SSO, fetch dev secrets from Vault, and run services locally. The secrets are scoped to their identity and expire after a session. This is better than a shared .env file on Dropbox.

Observability, Audits, and Incident Response

When an auditor asks who accessed which secret when, you need an answer. Vault logs every request: who, what, when, and whether it succeeded. SSM logs parameter access to CloudTrail. Both give you the data. The challenge is using it.

Vault audit logs are JSON. You ship them to a SIEM or log aggregator like Splunk or Elasticsearch. You build dashboards showing access patterns, policy violations, and anomalies. If a service suddenly requests secrets outside its normal path scope, you alert. If a Vault token is used from an unexpected IP, you revoke it. This requires operational maturity. Small teams do not have SIEMs. They export logs to S3 and query them with Athena when audits happen.

CloudTrail logs every SSM API call, but CloudTrail is noisy. A single ECS task fetching parameters on boot generates dozens of events. You need filtering and aggregation to make it useful. AWS Config can track parameter changes, but it does not track who read what. You combine CloudTrail and Config to reconstruct access history, which is tedious.
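A sketch of that aggregation, over a simplified subset of the CloudTrail record shape (real events carry many more fields):

```python
from collections import Counter


def ssm_read_counts(records):
    """Aggregate CloudTrail records into (principal, parameter) read
    counts, keeping only SSM parameter reads."""
    counts = Counter()
    for r in records:
        if r.get("eventSource") != "ssm.amazonaws.com":
            continue
        if r.get("eventName") not in ("GetParameter", "GetParameters"):
            continue
        principal = r["userIdentity"]["arn"]
        params = r["requestParameters"]
        # GetParameter carries a single "name"; GetParameters a "names" list.
        names = params.get("names") or [params.get("name")]
        for name in names:
            counts[(principal, name)] += 1
    return counts
```

Run against a day of CloudTrail output, this turns thousands of noisy events into a short table of who read what, which is the shape auditors actually ask for.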

For incident response, speed matters. If you detect a breach, you need to revoke credentials immediately. Vault supports this. You identify the compromised token or role, revoke it, and rotate the affected secrets. Services lose access, which is disruptive, but the breach is contained. With SSM, revocation is manual. You rotate the parameter value, update IAM policies to block access, and redeploy services. The time window is minutes to hours, depending on automation.

Another incident scenario is secret leakage. A developer commits an API key to GitHub. GitHub scans the commit and alerts you. With Vault, you log into the UI, search for the secret, see which services accessed it, rotate it, and update the services. With SSM, you search CloudTrail for GetParameter calls, identify the affected services, rotate the parameter, and trigger deployments. The workflow is similar, but Vault centralizes the operation. SSM scatters it across IAM, CloudTrail, and deployment tooling.

The operational takeaway is that observability is not optional. If you cannot answer "who accessed secret X in the last 30 days," your secrets system is incomplete. Logging, alerting, and incident runbooks are as important as the secrets tooling itself.

When Not to Use Vault or SSM and What Breaks at Scale

Vault and SSM solve specific problems. They do not solve everything. If your infrastructure is small and stable, a single .env file per environment might be sufficient. If your team is two engineers and your secrets change once a quarter, the operational cost of running Vault outweighs the benefit. Pragmatism matters.

Vault breaks at scale when operational load exceeds team capacity. A three-node Vault cluster requires monitoring, patching, unsealing, and disaster recovery. If your team is already stretched thin, adding Vault creates a new operational burden. You need people who understand Raft consensus, TLS certificate rotation, and Vault's policy syntax. If you do not have that expertise, Vault becomes a black box that breaks at 3 AM.

SSM breaks when you hit its limits. Standard parameters are capped at 4 KB (advanced parameters raise that to 8 KB, at additional cost). Throughput is rate-limited. Policies are IAM-based, which becomes complex at scale. If you have thousands of parameters, navigating the SSM console is painful. If you need cross-account access, you need resource-based policies or AWS Organizations integration. These are solvable problems, but they require AWS expertise.

Another failure mode is over-engineering. Teams hear "secrets management" and immediately deploy Vault in a multi-region HA configuration with automated unsealing and policy-as-code. Then they discover they have five secrets, all of which are third-party API keys that rotate manually. The complexity is waste. Start simple. Use SSM or even encrypted S3 objects. Graduate to Vault when the pain of the simpler solution exceeds the cost of running Vault.

For hybrid or multi-cloud environments, neither tool is sufficient alone. Vault can span AWS, GCP, and on-premise, but it requires careful design. SSM is AWS-only. If you are standardizing on Kubernetes across clouds, consider tools like External Secrets Operator, which syncs secrets from Vault or SSM into Kubernetes Secrets. The tradeoff is yet another dependency and another failure mode.

Tying It Together: Reliability, Audits, and Developer Velocity

Secrets management sits at the intersection of security, operations, and developer experience. Done well, it improves all three. Done poorly, it slows teams, leaks credentials, and fails audits.

Reliability improves when secrets are centralized and rotated automatically. Breaches are contained because credentials are short-lived or scoped. Incidents are faster because audit logs pinpoint access. But reliability depends on the secrets system being available. If Vault is down, services do not start. If SSM is throttling, deployments fail. You cannot treat secrets infrastructure as optional.

Audits improve when every secret access is logged, every rotation is automated, and every policy is versioned. Auditors want evidence that controls exist and are enforced. Vault and SSM both provide that evidence, but you must export logs, build dashboards, and demonstrate that anomalies trigger alerts. A secrets system without observability is a checkbox, not a control.

Developer velocity improves when fetching secrets is fast and predictable. If engineers wait hours for secret access or deploy changes to rotate credentials, they will route around the system. If secrets tooling is flaky, developers hoard credentials in .env files as backups. Velocity requires that the secrets workflow is invisible when it works and recoverable when it fails.

The tradeoff is that all of this requires investment. Running Vault well requires engineering time. Designing granular IAM policies for SSM requires discipline. Building automation for rotation, monitoring, and incident response requires focus. Teams that skip this investment end up with secrets sprawl, compliance violations, and breaches.

Three Actionable Takeaways

  1. Audit your secret access patterns this sprint. Export Vault logs or CloudTrail events, identify which services access which secrets, and map whether those access patterns align with intended least-privilege policies. Revoke over-permissioned roles and document the expected access patterns. This takes one engineer one day and immediately reduces blast radius.

  2. Implement dual-write rotation for one high-risk static secret. Pick your production database password or a critical API key. Create automation that generates a new credential, updates the secrets backend with both old and new versions, gradually migrates services to the new version, and deletes the old version after a grace period. Test this workflow in staging, then run it in production. This proves your rotation process works before audits demand it.

  3. Add secrets fetch latency to your boot-time SLO. Measure how long it takes services to retrieve credentials from Vault or SSM during startup. If it exceeds 500ms, cache secrets locally with a TTL or move to batch fetching. If your secrets backend becomes a boot-time bottleneck, new deploys will be slow, and incident recovery will be painful. Fix it before it becomes critical path.