NetHSM Software Update 4.0: Scalable Clustering for High-Availability HSM
We are pleased to announce the largest software update to date for NetHSM: In addition to IPv6 support, NetHSM software 4.0 introduces native clustering. This allows multiple NetHSM devices to be interconnected into a redundant cluster that distributes and maintains the consistency of key data and remains operational even if individual devices fail. The cluster scales to any size in terms of both the number of devices and throughput—without technical limitations or proprietary restrictions to a fixed number of devices. Virtually any load can be easily distributed across the entire cluster. This makes NetHSM’s clustering the most powerful among all HSMs we are aware of. This software update is available to all NetHSM customers at no additional cost. As always, it is open source and Made in Germany for full sovereignty.
Advantages Over Proprietary HSM Clustering
Anyone familiar with high availability for hardware security modules (HSMs) knows the typical alternatives: proprietary clustering software from HSM vendors, manual key exports using PKCS#11, or external databases as a synchronization backend.
NetHSM clustering differs from these in several ways:
Scalable Without Architectural Limits
Many HSM vendors limit their cluster solutions to two nodes in active/passive mode. This protects against the failure of a single device but scales neither throughput nor fault tolerance. The NetHSM cluster, on the other hand, grows with your requirements: more nodes allow for more parallel cryptographic operations and higher fault tolerance. The number of nodes is not limited by the architecture, and clusters with dozens or hundreds of NetHSMs can be implemented.
No Artificial License Limit
Proprietary HSMs often offer clustering as an expensive add-on that comes with proprietary management software and poor interoperability and is frequently limited to two nodes. With NetHSM, clustering is included without additional license costs or artificial limits.
No Key Export for Synchronization
Traditional replication approaches often require keys to be exported and re-imported onto other nodes. This means that the key exists outside the hardware, if only briefly. In the NetHSM cluster, etcd replicates only the encrypted keys. Decryption takes place exclusively on the NetHSM hardware, thereby ensuring a high level of security.
Verifiable Consistency Guarantees
The Raft algorithm used here (more details below) is formally verified and well documented. The consistency guarantees are not marketing promises, but properties of a published algorithm that security researchers can analyze independently. This is a significant difference from proprietary synchronization protocols, which are rarely publicly specified.
Proven Implementation
While the integration in NetHSM is new, the underlying cluster implementation, etcd, has been used successfully for many years in Kubernetes and other systems at very large scale. The implementation can therefore be expected to be stable and mature.
Automatic Resynchronization
Nodes that were temporarily disconnected from the network synchronize their state when they reconnect, without manual intervention. This simplifies maintenance windows, reduces administrative overhead, and significantly increases availability, even in large deployments with many nodes.
etcd and Raft: Proven Consistency Algorithms for Security-Critical Data
At the heart of NetHSM clustering is etcd, a distributed key-value store originally developed at CoreOS and best known as the coordination backend of Kubernetes, where it has been in production use for years. The choice of etcd is no coincidence: it is one of the few systems that combine strong consistency guarantees, horizontal scalability, and a formally verified consensus algorithm.
etcd relies on the Raft consensus algorithm. Raft elects a single leader node per term, and that leader coordinates all write operations. A write operation, such as creating a new key, is only considered successful once a majority of the nodes (the so-called quorum) has confirmed it. This means that a minority of nodes cut off from the rest can never commit changes on its own. The result: no split-brain and no silent data corruption due to diverging states.
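The majority rule can be illustrated with a minimal Python sketch. This is not NetHSM or etcd code, and the function names `has_majority` and `split` are made up for this illustration. It only shows why a network partition cannot leave two parts of a cluster accepting writes at the same time.

```python
def has_majority(cluster_size: int, reachable: int) -> bool:
    """True if `reachable` members form a majority of the whole cluster."""
    return reachable > cluster_size // 2


def split(cluster_size: int, side_a: int) -> tuple[bool, bool]:
    """Partition the cluster into two sides and report which side may commit writes."""
    side_b = cluster_size - side_a
    return has_majority(cluster_size, side_a), has_majority(cluster_size, side_b)


if __name__ == "__main__":
    # A 5-node cluster split 3/2: only the 3-node side keeps accepting writes.
    print(split(5, 3))  # (True, False)
    # A 4-node cluster split 2/2: neither side has a majority, so nothing diverges.
    print(split(4, 2))  # (False, False)
```

Because two disjoint groups cannot both contain more than half of the members, at most one side of any partition can ever confirm a write.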
For a NetHSM cluster with N nodes, at least ⌊N/2⌋+1 nodes (N/2 rounded down, plus one) must be active and reachable for the cluster to function. Fault tolerance grows with the cluster size, with every two additional nodes allowing one more simultaneous failure: a 3-node cluster tolerates the failure of one node, a 5-node cluster tolerates the simultaneous failure of two nodes, a 7-node cluster tolerates three, and so on. An even node count adds no fault tolerance over the next smaller odd count. Those with higher availability requirements simply add more nodes. No new protocol, no new license, no changes to the existing configuration.
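The quorum arithmetic is easy to check. The following Python sketch uses two hypothetical helper names, `quorum` and `tolerated_failures`, and simply evaluates the majority rule for small clusters.

```python
def quorum(nodes: int) -> int:
    """Smallest majority of a cluster with `nodes` voting members."""
    if nodes < 1:
        raise ValueError("a cluster needs at least one node")
    return nodes // 2 + 1


def tolerated_failures(nodes: int) -> int:
    """How many members may fail while a majority is still reachable."""
    return nodes - quorum(nodes)


if __name__ == "__main__":
    for n in range(1, 8):
        print(f"{n} nodes: quorum {quorum(n)}, "
              f"tolerates {tolerated_failures(n)} failure(s)")
```

Running it shows that four nodes tolerate one failure, the same as three nodes, which is why odd cluster sizes are the usual choice.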
When an isolated node rejoins the network after an outage, it automatically synchronizes with the current cluster state—no manual intervention is required.
What makes this special is that the data stored in etcd—keys, users, configurations—is consistently encrypted with the shared domain key of the NetHSM instances. Even a complete dump of the etcd store would not grant an attacker access to the plaintext data without the hardware-bound keys.
Clustering was already planned during the initial development of NetHSM. Thus, the internal database of a single NetHSM already consisted of an isolated etcd instance. Software version 4.0 is based on this architecture and now enables the connection of multiple NetHSMs and their etcd instances. Clustering is therefore not an afterthought but a realization of the performance potential already inherent in the modern NetHSM architecture.
Witness Enables Quorum Without Additional Hardware
A practical challenge in clustering is that a majority quorum works best with an odd number of nodes. A 2-node cluster is therefore problematic: if one node fails, the remaining node no longer has a quorum. To continue operating, manual intervention may be required, either at the time of the failure (to switch to the remaining node) or at the time of repair, when the cluster resumes normal operation and any diverging data must be reconciled.
NetHSM offers another solution to this problem with the Witness concept. A Witness is a simple etcd instance that can run on any hardware—a VM, a Raspberry Pi, a container. It participates in the Raft consensus and contributes to the quorum, but does not perform any HSM operations itself. With a Witness, a stable 3-node cluster can be set up using two NetHSM devices and a regular server instance—without requiring the budget for a third NetHSM device. The same principle applies when scaling: If you want to switch from four to five nodes, you can initially deploy a cost-effective Witness as the fifth node.
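The effect of a Witness on the quorum can be modelled in a few lines of Python. This is a simplified sketch, not NetHSM code: the `Member` class and the functions `has_quorum` and `can_serve` are hypothetical, and the rule that at least one NetHSM must be up to answer requests is an assumption derived from the statement that a Witness performs no HSM operations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    name: str
    is_hsm: bool      # False for a Witness, which only votes
    up: bool = True


def has_quorum(members: list[Member]) -> bool:
    """True if the members that are up form a majority of all members."""
    alive = sum(1 for m in members if m.up)
    return alive >= len(members) // 2 + 1


def can_serve(members: list[Member]) -> bool:
    """Quorum is reached and at least one NetHSM is up to perform HSM operations."""
    return has_quorum(members) and any(m.up and m.is_hsm for m in members)


if __name__ == "__main__":
    two_nodes = [Member("hsm-1", True), Member("hsm-2", True, up=False)]
    with_witness = two_nodes + [Member("witness", False)]
    print(can_serve(two_nodes))     # False: 1 of 2 is not a majority
    print(can_serve(with_witness))  # True: 2 of 3 is a majority
```

In this model the Witness never serves requests itself. Its only contribution is the additional vote that lets the surviving NetHSM keep its majority.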
The Witness receives all cluster data, including the encrypted key data. Since all sensitive values are encrypted with the domain key, which is available only inside the NetHSM devices, this does not pose a security risk: the Witness is a full-fledged cluster member at the etcd level but has no access to critical plaintext data.
What the Cluster Synchronizes and What It Doesn't
An important design principle: Not everything is shared. NetHSM 4.0 makes a clear distinction between globally shared data and device-specific data.
All HSM user data is synchronized across the cluster: cryptographic keys, user management, namespaces, as well as the backup passphrase and the cluster CA. This means that a key created on one device is immediately available on all other devices—without manual replication or explicit export. Any additional device joining the cluster automatically receives the complete dataset.
In contrast, device configuration such as TLS certificates, network settings, and logging remains device-specific, as does, crucially, the device key itself. The device key never leaves the hardware. Additionally, each device has an individual unlock passphrase.
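The split between shared and device-specific data can be summarized as a small lookup table. The Python sketch below is only an illustration of the classification described in this section. The item names and the `replicated_items` helper are hypothetical and do not correspond to actual NetHSM configuration keys.

```python
from enum import Enum


class Scope(Enum):
    CLUSTER = "replicated to every cluster member"
    DEVICE = "stays on the individual device"


# Classification taken from the text; the item names are illustrative only.
DATA_SCOPE = {
    "cryptographic_keys": Scope.CLUSTER,
    "users": Scope.CLUSTER,
    "namespaces": Scope.CLUSTER,
    "backup_passphrase": Scope.CLUSTER,
    "cluster_ca": Scope.CLUSTER,
    "tls_certificate": Scope.DEVICE,
    "network_configuration": Scope.DEVICE,
    "logging_configuration": Scope.DEVICE,
    "device_key": Scope.DEVICE,
    "unlock_passphrase": Scope.DEVICE,
}


def replicated_items() -> list[str]:
    """Items a newly joined device would receive from the cluster."""
    return sorted(k for k, s in DATA_SCOPE.items() if s is Scope.CLUSTER)


if __name__ == "__main__":
    for item in replicated_items():
        print(item)
```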
Further details can be found in the technical documentation.
