[RFD] TPM Enrollment and secure secret delivery #40

Open
alexlovelltroy opened this issue Jun 21, 2024 · 7 comments
Labels: rfd (Request for Discussion), WIP (Work In Progress)

@alexlovelltroy
Member

alexlovelltroy commented Jun 21, 2024

Attestation Background

Attestation is a method for verifying the integrity of a computer’s software, hardware, and firmware using a Trusted Platform Module (TPM). The TPM records cryptographic measurements of the system's state, including its software and firmware configuration, and reports them in signed statements called "quotes." Each quote is signed with a key whose private half never leaves the TPM, and a Public Key Infrastructure (PKI) of certificates lets other parties trust the corresponding public key.

During remote attestation, this signed report is sent to a remote verifier, which uses PKI to authenticate and validate it. The report includes a nonce—a unique, random number generated for each request—to prevent replay attacks and ensure the report's freshness. The verifier checks the report against expected values, leveraging PKI to confirm the authenticity of the quotes and the system’s integrity. Successful validation allows the verifier to grant access or permissions, affirming the system's trustworthiness.
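The nonce handling described above can be sketched roughly as follows. This is a minimal illustration, not a TPM implementation: an HMAC over a shared key stands in for the TPM's asymmetric signature so the example runs with the Python standard library alone, and the names `Verifier`, `new_challenge`, `check_report`, and `sign_report` are hypothetical.

```python
import hashlib
import hmac
import secrets


def sign_report(key: bytes, nonce: bytes, state_digest: bytes) -> bytes:
    """Stand-in for the TPM signing a report.

    ASSUMPTION: HMAC-SHA256 with a shared key replaces the real asymmetric
    signature made inside the TPM. Nonce and digest are both 32 bytes here,
    so simple concatenation is unambiguous.
    """
    return hmac.new(key, nonce + state_digest, hashlib.sha256).digest()


class Verifier:
    """Toy remote verifier that issues single-use nonces."""

    def __init__(self, key: bytes, expected_state: bytes):
        self._key = key
        self._expected_state = expected_state  # known-good measurement digest
        self._outstanding = set()              # nonces issued but not yet used

    def new_challenge(self) -> bytes:
        nonce = secrets.token_bytes(32)
        self._outstanding.add(nonce)
        return nonce

    def check_report(self, nonce: bytes, state_digest: bytes, signature: bytes) -> bool:
        # Reject unknown or already-consumed nonces: this is the replay check.
        if nonce not in self._outstanding:
            return False
        self._outstanding.remove(nonce)
        # The signature must cover both the nonce and the reported state.
        expected_sig = sign_report(self._key, nonce, state_digest)
        if not hmac.compare_digest(signature, expected_sig):
            return False
        # Finally compare the reported state with the expected value.
        return hmac.compare_digest(state_digest, self._expected_state)
```

Because each nonce is removed once it has been checked, presenting the same signed report a second time fails even though its signature is still valid.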

Beyond integrity verification, the same PKI framework used in attestation can facilitate secure communications between nodes. The TPM can encrypt information so that only nodes with the corresponding TPM can decrypt and read it, ensuring data security. Additionally, PKI allows nodes to prove that a message originated from a TPM-equipped system by signing the message with the TPM's private key. Recipients use PKI to verify this signature, confirming both the message’s origin and its authenticity. This combined use of PKI and TPM strengthens security by enabling both secure connections and reliable verification of communications.

Bootstrapping Attestation

Bootstrapping remote attestation requires establishing trust in TPMs themselves. This foundational trust is essential for the effective functioning of the attestation process.

Initial trust is established through key provisioning and certification. When a TPM is initialized, it generates a primary endorsement key (EK) and additional keys for various functions. The EK is used to obtain a certificate, known as the Endorsement Certificate (EKCert), from a trusted Certificate Authority (CA). The EKCert binds the TPM’s public key to its identity. This endorsement, signed by a trusted CA, establishes a basis of trust for the TPM’s operations.

With initial trust established, the remote attestation process can proceed. The remote verifier sends a request to the TPM, including a nonce to ensure the report’s freshness. The TPM generates a signed quote, which includes the nonce and a measurement of the system’s state. This quote serves as proof of the system's integrity. The verifier uses the TPM’s EKCert to validate the TPM’s public key and the authenticity of the quote. Successful validation confirms the TPM’s trustworthiness and the system’s integrity.
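The "measurement of the system's state" carried in a quote is accumulated in the TPM's Platform Configuration Registers (PCRs) with an extend operation: the new register value is the hash of the old value concatenated with the digest of the new event. A minimal sketch, assuming a SHA-256 PCR that resets to all zeros (the helper names and event strings are illustrative only):

```python
import hashlib

PCR_SIZE = 32  # bytes in a SHA-256 PCR bank


def pcr_extend(pcr: bytes, event_data: bytes) -> bytes:
    """Return the PCR value after extending it with one measured event."""
    event_digest = hashlib.sha256(event_data).digest()
    return hashlib.sha256(pcr + event_digest).digest()


def replay_event_log(events) -> bytes:
    """Recompute the expected PCR value from an ordered list of events.

    ASSUMPTION: the PCR starts at all zeros, which holds for the commonly
    used boot-time PCRs but not for every register.
    """
    pcr = bytes(PCR_SIZE)
    for event in events:
        pcr = pcr_extend(pcr, event)
    return pcr
```

Since every extend folds in the previous value, the final digest depends on both the content and the order of the measured events, which is what lets a verifier compare one value against an expected boot sequence.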

Ongoing trust management involves periodic attestation checks to ensure system integrity, key rotation to maintain security, and mechanisms for certificate revocation if a TPM is compromised. These practices help maintain the robustness and reliability of the attestation framework.

Infrastructure Challenges for Remote Attestation

Managing the original Endorsement Keys (EKs) securely is a critical challenge, especially when integrating new computers as new racks are delivered or nodes are swapped. The integrity of the attestation process depends on the secure handling of these keys from generation to deployment. Either in the factory or on delivery, each TPM generates a unique EK whose public part must be securely transmitted to a trusted Certificate Authority (CA) for certification. Protecting this material in transit is essential so that it cannot be tampered with or substituted.

Ongoing management of EKs also includes secure handling of key rotations and updates. Procedures must be in place to address the replacement of TPMs and the updating or revocation of EKs to maintain system integrity and trustworthiness.

OpenCHAMI Attestation and Enrollment Service

This RFD proposes a process for managing enrollment keys and supporting remote attestation.

Extend OpenCHAMI to use TPMs for identity

In today's system, each node is primarily identified by the xname, which denotes a location in the system. This idea of location as primary identifier is inherited from CSM through our use of the CSM service SMD as the primary inventory interface in OpenCHAMI. Location, while unique across the system, isn't stable. It is possible and even somewhat common to remove a blade from one chassis and install it in another. Tracking errors per blade as it is moved from one part of the system to another is possible in CSM, but not trivial.

The TPM contains several pieces of data that can be used for identity and are both unique and stable. The specification for TPM 2.0, which is linked below in the references, describes two IDs that are practical for our use:

The 802.1AR standard defines two Device Identity (DevID) types, depending on the CA signing the issued certificates and expectations for certificate lifetimes.
The initially installed identity is defined as an IDevID/IAK (“I” for initial) and is installed by the product OEM. The IDevID credential is intended to be usable for the life of the product. The IDevID/IAK is expected to be created at device manufacturing time.
On the other hand, owner-created and signed identities are named LDevID & LAK (“L” for local). LDevID/LAK credentials may “overlay” IDevID/IAK credentials, thereby replacing IDevID with LDevID in operation. Alternatively, IDevID/IAK and LDevID/LAK may be used for different purposes. LDevID/LAK certificates are not expected to be long-lived certificates. LDevID/LAK credentials are expected to be removed when a device is “zeroized” or is at its end of life.

One approach would be to include a TPM identifier as an additional piece of data stored by SMD and provide functions for interacting with the unique and stable identities in addition to the xnames.

A second approach would be to create a new service for externally managing these identities and provide integrations with SMD and other microservices.

I recommend the first approach as identity and inventory are intrinsically linked. Keeping them separate introduces race conditions and other potential consistency problems.
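To make the first approach concrete, here is a rough sketch of an inventory that keys each record on a stable device identity and treats the xname as a mutable location. It is not SMD's data model or API: `NodeRecord`, `Inventory`, the `tpm-...` identifiers, and the example xnames are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class NodeRecord:
    device_id: str                # stable ID, e.g. derived from an IDevID (assumption)
    xname: Optional[str] = None   # current location; None while unplaced
    previous_xnames: List[str] = field(default_factory=list)


class Inventory:
    """Toy inventory: identity is stable, location changes over time."""

    def __init__(self) -> None:
        self._by_device: Dict[str, NodeRecord] = {}
        self._device_at: Dict[str, str] = {}  # xname -> device_id

    def place(self, device_id: str, xname: str) -> NodeRecord:
        record = self._by_device.setdefault(device_id, NodeRecord(device_id))
        if record.xname == xname:
            return record
        # If a different device currently occupies the slot, it has been removed.
        occupant_id = self._device_at.get(xname)
        if occupant_id is not None:
            self._unplace(self._by_device[occupant_id])
        # Record the location this device is leaving, if any.
        if record.xname is not None:
            self._unplace(record)
        record.xname = xname
        self._device_at[xname] = device_id
        return record

    def _unplace(self, record: NodeRecord) -> None:
        self._device_at.pop(record.xname, None)
        record.previous_xnames.append(record.xname)
        record.xname = None

    def by_device(self, device_id: str) -> Optional[NodeRecord]:
        return self._by_device.get(device_id)

    def by_xname(self, xname: str) -> Optional[NodeRecord]:
        device_id = self._device_at.get(xname)
        return self._by_device.get(device_id) if device_id is not None else None
```

Because both indexes are updated in the same call, a lookup by location and a lookup by identity cannot disagree, which is the consistency argument for keeping identity and inventory in one service.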

Extend OpenCHAMI to boot a dedicated discovery image for collecting TPM keys/IDs

The remote attestation process requires establishing a collection of valid Public Keys/Certificates that identify the TPMs and which can be compared with responses in the remote attestation process. We have considered several options for a process that works with OpenCHAMI.

  1. A fully manual process that involves sysadmins booting the node, logging in as root, retrieving the appropriate certificates and IDevIDs from the TPM, and adding them to SMD.
  2. A fully automatic process in the boot cycle that sends appropriate certificates and DevIDs to an unauthenticated internal service with network protections.
  3. A dedicated system image that allows Ansible to connect and retrieve certificates and DevIDs.

We believe the first option to be the most secure, but it is also the most labor intensive. We did not pursue it because it is impractical.

We believe the second option creates an opportunity for a rogue device to register itself as a fake node and could provide an avenue for future attacks. If we can adjust network settings or provide other protections, it might be workable. We chose not to pursue it at this time.

The third option appears to provide us with the security and manageability we need and allows us to build on tooling we already have. We are pursuing this option, but are ensuring that other options remain available for sites that do not wish to use Ansible in this way.
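Whichever collection option a site picks, the result is a set of known-good certificates that later attestation responses are compared against. A minimal sketch of that comparison, assuming certificates are handled as DER bytes and matched by SHA-256 fingerprint (`EnrollmentStore` and its methods are hypothetical; a real verifier would also validate the certificate chain):

```python
import hashlib
from typing import Dict, Optional


def fingerprint(cert_der: bytes) -> str:
    """SHA-256 fingerprint of a DER-encoded certificate."""
    return hashlib.sha256(cert_der).hexdigest()


class EnrollmentStore:
    """Toy store of certificates collected by the trusted enrollment path."""

    def __init__(self) -> None:
        self._node_by_fingerprint: Dict[str, str] = {}

    def enroll(self, node_id: str, cert_der: bytes) -> str:
        # ASSUMPTION: only the trusted collection step (for example the
        # discovery image driven by Ansible) is allowed to call this.
        fp = fingerprint(cert_der)
        self._node_by_fingerprint[fp] = node_id
        return fp

    def lookup(self, cert_der: bytes) -> Optional[str]:
        """Return the enrolled node for this certificate, or None if unknown."""
        return self._node_by_fingerprint.get(fingerprint(cert_der))
```

A certificate that was never collected through the enrollment path returns `None`, so a device that simply announces itself at attestation time is not recognized.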

References

  1. Trusted Computing Group Specification for TPM 2.0 Keys for Device Identity and Attestation
  2. Google's go-based attestation tooling
@alexlovelltroy alexlovelltroy converted this from a draft issue Jun 21, 2024
@alexlovelltroy alexlovelltroy added the rfd Request for Discussion label Jun 21, 2024
@alexlovelltroy alexlovelltroy added the WIP Work In Progress label Aug 4, 2024
@alexlovelltroy
Member Author

Further work here could include using the attestation process to set up a WireGuard tunnel for cloud-init and removing the need for authentication in cloud-init itself.

@alexlovelltroy
Member Author

https://github.com/keylime/keylime may provide much of the functionality needed for the actual enrollment and management of certs/keys.

@davidallendj

davidallendj commented Aug 19, 2024

https://github.com/keylime/keylime May provide much of the functionality needed for the actual enrollment and management of certs/keys

Should we be looking at the Rust version instead, since the Python version is deprecated?

@alexlovelltroy
Member Author

I think both repos are relevant.

From the README:

Keylime consists of three main components; The Verifier, Registrar and the Agent.

The Verifier continuously verifies the integrity state of the machine that the agent is running on.

The Registrar is a database of all agents registered with Keylime and hosts the public keys of the TPM vendors.

The Agent is deployed to the remote machine that is to be measured or provisioned with secrets stored within an encrypted payload released once trust is established.

I think only the agent has been rewritten in Rust and moved to the other repo. I don't see evidence that they are moving the Registrar or Verifier to Rust at this point.

@alexlovelltroy
Member Author

Relevant to this discussion:

https://datatracker.ietf.org/doc/html/rfc9334

@alexlovelltroy
Member Author

Possible alternative to Keylime. Doesn't look ready for primetime yet.

https://github.com/veraison

@dev-zero

(All views expressed are my own. If at all, they originate from my role as an OpenCUBE developer.)
In the text itself, the part about permitting replacement of TPMs seems inconsistent with using an embedded IDevID as a stable alternative ID to the xname. Furthermore, a TPM is just one secure enclave; CPUs, GPUs, NICs, and DPUs may provide their own in the future. And with current architectures PCIe devices are privileged, so an alternative ID might instead be a hash over device IDs, to ensure a machine is re-enrolled if crucial components change.
On the other hand whether someone will actually want to implement that is a different question.
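For illustration, a hash over device IDs along the lines suggested above could look like the following. This is only a sketch: which components contribute an ID and how those IDs are obtained is left open, and `composite_id` and the sample ID strings used with it are hypothetical.

```python
import hashlib
from typing import Iterable


def composite_id(device_ids: Iterable[str]) -> str:
    """Derive one machine ID from the IDs of its components.

    IDs are de-duplicated and sorted so enumeration order does not matter,
    and each one is length-prefixed so different ID sets cannot collide by
    concatenation.
    """
    h = hashlib.sha256()
    for device_id in sorted(set(device_ids)):
        encoded = device_id.encode("utf-8")
        h.update(len(encoded).to_bytes(4, "big"))
        h.update(encoded)
    return h.hexdigest()
```

Swapping any one component changes the resulting ID, which is the property that would trigger re-enrollment.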
Nonetheless, I would suggest splitting this RFD into two parts:

  • rollout of TPM enrolment and ongoing remote attestation
  • new unique ID

In general I think that tracking faults per device is yet another story and this data should be tied to subcomponents instead if possible.

I recommend the first approach as identity and inventory are intrinsically linked. Keeping them separate introduces race conditions and other potential consistency problems.

I agree that additional IDs must be maintained within SMD. Keeping it separate makes it unmaintainable, quickly.

Further work here could include using the attestation process to set up a wireguard tunnel for cloud-init and removing the need for authentication in cloud-init itself.

How the actual transport happens is secondary to this proposal I think. I would prefer HTTP mTLS over a full separate protocol, though.
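As a rough idea of what the mTLS preference means on the server side, the sketch below builds a TLS context with Python's standard `ssl` module that refuses clients without a certificate. The function name and the file-path parameters are hypothetical, and how node certificates would be issued (for example, after a successful attestation) is outside this sketch.

```python
import ssl
from typing import Optional


def make_mtls_server_context(
    ca_file: Optional[str] = None,
    cert_file: Optional[str] = None,
    key_file: Optional[str] = None,
) -> ssl.SSLContext:
    """Build a server-side TLS context that requires client certificates."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    # CERT_REQUIRED is what turns plain TLS into mutual TLS: the handshake
    # fails unless the client presents a certificate the server can verify.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if ca_file is not None:
        # CA that signed the node certificates (path is an assumption).
        ctx.load_verify_locations(cafile=ca_file)
    if cert_file is not None:
        # The server's own certificate and private key.
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx
```

The returned context can be handed to any server that accepts an `ssl.SSLContext`; without a CA file loaded, no client certificate will verify, so every connection is rejected.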

We believe the second option creates an opportunity for a rogue device to register itself as a fake node and could provide an avenue for future attacks. If we can adjust network settings or provide other protections, it might be workable. We chose not to pursue it at this time.

I am missing how the third option improves on the MITM situation. But then I am fine with either.

What I am missing here is a disclaimer that this proposal is geared towards managed nodes and already assumes that the environment where OpenCHAMI runs is to be considered secure.
