Abstract
The medium-density parity-check (MDPC) code-based cryptosystem remains a finalist in the post-quantum cryptography standardization process. Row-layered Min-Sum decoding achieves an efficient trade-off between performance and complexity. Previous studies designed parallel row-layered MDPC decoders by enforcing constraints on the parity-check matrix. Our preliminary work introduced two schemes that reduce the latency without tightening these constraints, since stricter constraints may raise security concerns. The first scheme processes multiple identity blocks of the parity-check matrix simultaneously, and the second processes one larger block of variable width at a time. However, their speedup is limited by data access conflicts. In this paper, optimizations of the computation scheduling of each scheme are proposed to further reduce the latency. For the first scheme, out-of-order processing of identity blocks is proposed to reduce memory access conflicts, enabling more blocks to be processed simultaneously. For the second scheme, out-of-order processing of block segments is explored to update more messages simultaneously without causing conflicts in layered decoding. Efficient hardware architectures are also designed for both proposed scheduling optimizations. For an example code, the two proposed optimizations achieve 35% and 9% speedup, respectively, over the best prior designs with negligible area overhead. The out-of-order processing is more effective on the first design.
1 Introduction
Current cryptographic standards face significant threats from imminent quantum computing attacks. The National Institute of Standards and Technology has selected the medium-density parity-check (MDPC) code-based cryptosystem as a finalist in its latest post-quantum cryptography standardization efforts [1,2,3]. MDPC codes have parity-check matrices composed of two large random circulant matrices, which serve as the secret keys. These matrices exhibit structural differences from those of the low-density parity-check (LDPC) codes traditionally used in error correction, rendering most optimization techniques for LDPC decoders unsuitable for MDPC decoders.
Prior research on MDPC decoder design primarily focuses on the basic bit-flipping algorithm and its derivatives [4,5,6,7,8,9,10,11,12]. In contrast, the Min-sum algorithm [13] offers significant improvements in decoding failure rate (DFR) when appropriate scalars are applied to the messages [14,15,16]. Lower DFR improves resilience against reaction-based attacks, which exploit decoding failures to recover the secret parity-check matrix [17, 18].
Parallel decoders are needed to reduce the decryption latency. In [14], parallelism is achieved by processing multiple entries from different row segments of the parity-check matrix concurrently. However, the random distribution of the nonzero entries in the parity-check matrix of MDPC codes limits the achievable speedup to a very small factor and leads to large memories, which are major contributors to the decoder complexity. To enable efficient parallel processing, [15] added a constraint to the parity-check matrix construction that the minimum distance between any pair of adjacent nonzero entries in a column needs to be at least p. The matrix is partitioned into \(p \times p\) identity blocks, enabling p times speedup by processing one block per clock cycle. Row-layered scheduling [19] is also adopted in the design of [15] to reduce the memory requirement. However, generating random secret parity-check matrices with a large p, such as 64, requires a very long time, and the number of usable secret keys is limited. Furthermore, increasing p results in a higher DFR, making the system more vulnerable to reaction attacks [15].
In the preliminary work [16], two schemes are introduced to achieve faster row-layered decoding without increasing p. The first scheme processes multiple identity blocks within the same row layer simultaneously. Memory access conflicts occur when the messages for those blocks are stored at different addresses of the same memory bank, in which case extra clock cycles are spent to read the messages. While increasing the number of memory banks reduces the likelihood of such conflicts, it makes each memory bank shallower and hence leads to a larger silicon area in the implementation. The second scheme allows the number of rows in a layer to be larger than p. Each layer is dynamically partitioned into blocks of variable width, so that each block starts with a nonzero entry in its top-left corner, which simplifies both the on-the-fly address generation during decoding and the computation units. Additionally, a hybrid decoding schedule is designed to segment those blocks with multiple nonzero entries in a column so that DFR degradation is avoided. However, each segment is processed in a separate clock cycle in [16], and the effective speedup factor is much less than the block size.
This paper proposes computation scheduling optimizations for each of the schemes in [16], enabling further latency reductions. For the first scheme, out-of-order processing of identity blocks is exploited without dividing the memory into more banks. Identity blocks whose messages can be accessed from the memories without conflicts are processed simultaneously as much as possible, not necessarily in the order in which they appear in the parity-check matrix. In the second scheme, the top segments of different blocks do not have more than one nonzero entry in the same column. Those segments without memory access conflicts can also be processed simultaneously in an out-of-order manner to reduce the latency. Efficient hardware implementation architectures are developed for the two proposed schemes in this paper. The identity blocks or top segments that wrap around the edges of the circulant matrices may cause new memory access conflicts in the next layer. This issue is addressed in the proposed hardware architectures by dynamically modifying the processing order of the blocks/segments. For an example MDPC code, the proposed scheduling optimizations lead to 35% and 9% speedup, respectively, compared to the two designs in [16] with negligible area overhead. The out-of-order processing leads to more significant speedup on the first design.
This paper is structured as follows. Background information is provided in Section 2. Section 3 introduces the two preliminary parallel MDPC decoders. Section 4 details the proposed computation scheduling optimizations and develops efficient hardware implementation architectures. Complexity analyses and conclusions are presented in Sections 5 and 6, respectively.
2 Background
MDPC codes are specified by their parity-check matrices, denoted as \(\textbf{H}\). The private key of the latest MDPC code-based post-quantum cryptosystem considered for the standards [1, 2] is composed of two randomly generated vectors with length r and Hamming weight w. The \(\textbf{H}\) matrix is structured as \([\mathbf {H_0} \vert \mathbf {H_1}]\). \(\mathbf {H_0}\) and \(\mathbf {H_1}\) are circulant matrices whose first columns equal the two random vectors in the secret key. If the second vector results in \(\mathbf {H_{1}}\) being non-invertible, it is regenerated to enable the construction of a systematic generator matrix \(\textbf{G}\), expressed as \([\textbf{I}\vert [\textbf{H}_{1}^{-1}\textbf{H}_0]^T]\). The values of the code parameters r and w are determined by the intended security level, with r ranging from 4801 to 40973, and w between 45 and 137 [2, 3].
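To make the circulant structure concrete, the following minimal Python sketch builds a toy \([\mathbf {H_0} \vert \mathbf {H_1}]\) from two random weight-w vectors. The parameter values and function names are illustrative only, far below the standardized sizes.

```python
import numpy as np

def circulant(first_col):
    """Circulant matrix whose column k is the first column cyclically shifted down by k."""
    r = len(first_col)
    return np.stack([np.roll(first_col, k) for k in range(r)], axis=1)

# Toy secret key: two length-r, weight-w binary vectors (illustrative sizes only)
r, w = 17, 3
rng = np.random.default_rng(1)
h0 = np.zeros(r, dtype=np.uint8)
h0[rng.choice(r, size=w, replace=False)] = 1
h1 = np.zeros(r, dtype=np.uint8)
h1[rng.choice(r, size=w, replace=False)] = 1
H = np.hstack([circulant(h0), circulant(h1)])   # parity-check matrix [H0 | H1]
```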
The core step of the encryption process is MDPC encoding. The 2r-bit ciphertext \(\textbf{x}\) corresponding to an r-bit plaintext message \(\textbf{m}\) is generated as
$$\begin{aligned} \textbf{x} = \textbf{m}\textbf{G} + \textbf{e}. \end{aligned}$$
Here \(\textbf{e}\) is a randomly generated error vector with Hamming weight t, which is decided according to the target security level. \(\textbf{c}=\textbf{m}\textbf{G}\) is a codeword of the MDPC code. During the decryption, MDPC decoding is applied on \(\textbf{x}\). If \(\textbf{e}\) is a correctable error pattern, the decoding will generate \(\textbf{c}\). The plaintext corresponds to the first r bits of \(\textbf{c}\) because the generator matrix \(\textbf{G}\) is systematic.
Algorithm 1: Scaled Min-Sum decoding algorithm [13].
A Tanner graph provides a visual representation of \(\textbf{H}\), where columns of \(\textbf{H}\) are depicted as variable nodes, rows are drawn as check nodes, and nonzero entries are represented by edges connecting the corresponding nodes. The pseudocode of the scaled Min-Sum decoding algorithm [13] is listed in Algorithm 1. This algorithm operates by iteratively computing reliability information exchanged between the variable nodes and check nodes connected in the Tanner graph to decide the bits of the received vector. At the beginning of the decoding, the reliability of each received bit \(x_j\) (\(0 \le j < n\)), represented as \(\gamma _j\), is set to constant \(+C\) or \(-C\) depending on whether \(x_j\) is ‘0’ or ‘1’, respectively. The optimal value of C can be derived from simulations. The variable-to-check (v2c) message from variable node j to check node i is denoted as \(u_{i,j}\). Similarly, the check-to-variable (c2v) message from check node i to variable node j is denoted by \(v_{i,j}\). In Algorithm 1, k indicates the iteration number and \(S_c(j)\) (\(S_v(i)\)) denotes the neighboring check (variable) nodes of variable (check) node j (i). The sum of the c2v messages is multiplied by a scalar \(\alpha \) when computing \(u_{i,j}\) and \(\tilde{\gamma _j}\) to improve the DFR. A vector \(\textbf{x}\) is a codeword iff \(\textbf{x}\textbf{H}^T=0\). If the decoder is not able to find any valid codeword after \(I_{\text {max}}\) iterations, the decoding fails. To counteract timing side-channel attacks, the decoding process can run for \(I_\text {max}\) full iterations, regardless of whether it converges earlier.
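As a behavioral reference for Algorithm 1 (not the hardware architecture of the later sections), the following Python sketch implements one possible flooding scaled Min-Sum decoder; the sparse row-support representation and all names are our own choices.

```python
import numpy as np

def scaled_min_sum(x, row_support, alpha, C, I_max):
    """Behavioral sketch of Algorithm 1. row_support[i] lists the column
    indices of the nonzero entries in row i of H, i.e., the neighbors S_v(i)."""
    gamma = np.where(np.asarray(x) == 0, C, -C).astype(float)  # +C for '0', -C for '1'
    rows = [np.asarray(s) for s in row_support]
    v2c = [gamma[s].copy() for s in rows]                      # initial u_{i,j}

    for _ in range(I_max):              # constant-time variant: always I_max iterations
        c2v = []
        for u in v2c:                   # check node update (Lines 8-14)
            mag = np.abs(u)
            sgn = np.where(u < 0, -1.0, 1.0)
            order = np.argsort(mag)
            min1, min2, idx = mag[order[0]], mag[order[1]], order[0]
            s = np.prod(sgn)                        # overall sign of the check
            v = np.full(len(u), min1)
            v[idx] = min2                           # exclude the entry's own magnitude
            c2v.append(s * sgn * v)                 # exclude the entry's own sign
        gamma_tilde = gamma.copy()                  # a posteriori information
        for cols, v in zip(rows, c2v):
            gamma_tilde[cols] += alpha * v
        v2c = [gamma_tilde[cols] - alpha * v for cols, v in zip(rows, c2v)]
        x_hat = (gamma_tilde < 0).astype(int)       # hard decision
        # a syndrome check (x_hat H^T = 0) would detect convergence here
    return x_hat
```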
The row-layered decoding [19] divides the \(\textbf{H}\) matrix into L-row blocks, each of which is referred to as a layer. It has the advantages of faster convergence and reduced memory size. To simplify the notations, the c2v and v2c messages for a single nonzero entry of \(\textbf{H}\) are denoted as \(v^{(k,f)}\) and \(u^{(k,f)}\), respectively, when decoding layer f during iteration k. In row-layered scheduling, the v2c messages and the a posteriori information should be computed utilizing the most updated c2v messages. The row-layered scheme [19] assumes that each column within a layer contains at most one nonzero entry. In this case, the formula for computing the a posteriori information using the most updated c2v messages becomes
$$\begin{aligned} \tilde{\gamma }_j \leftarrow \tilde{\gamma }_j - \alpha v^{(k-1,f)} + \alpha v^{(k,f)}. \end{aligned}$$
(1)
By comparing Lines 16 and 18 of Algorithm 1, the updating formula for the v2c messages is derived as
$$\begin{aligned} u^{(k,f)} = \tilde{\gamma }_j - \alpha v^{(k-1,f)}. \end{aligned}$$
(2)
These two updates replace Lines 18 and 16 of Algorithm 1, respectively, to implement row-layered decoding.
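In code form, the two layered updates for one nonzero entry amount to the following minimal sketch (variable names are ours):

```python
def layered_update(gamma_tilde, v_old, v_new, alpha):
    """Row-layered updates for one nonzero entry of layer f in iteration k:
    Eq. 2 forms the v2c message from the latest a posteriori value, and
    Eq. 1 folds the freshly computed c2v message back in."""
    u = gamma_tilde - alpha * v_old        # Eq. 2: u^(k,f)
    return u, u + alpha * v_new            # (u^(k,f), updated a posteriori value)
```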
Figure 1: \(\textbf{H}\) matrix of a toy MDPC code with \((r,w)=(17, 3)\) and constraint \(p=4\), divided into identity blocks [15].
Parallel decoders process one portion of the \(\textbf{H}\) matrix in each clock cycle. The efficiency of parallel MDPC decoding is highly dependent on the partition of \(\textbf{H}\). Straightforwardly dividing the \(\textbf{H}\) matrix of an MDPC code into row and column blocks in a grid manner results in limited speedup and incurs high hardware complexity due to the random locations of the nonzero entries. To improve the parallel processing efficiency, [15] introduces a constraint that the minimum distance between any two adjacent nonzero entries in the two r-bit random vectors for \(\textbf{H}\) generation needs to be at least p. Accordingly, the \(\textbf{H}\) matrix can be divided into identity blocks of dimension \(p\times p\), as illustrated in Fig. 1 for an example case of \(p = 4\) for a toy MDPC code. A speedup factor of p over a serial decoder is achieved by processing one identity block per clock cycle. Since there is a single nonzero entry in each row and each column of an identity matrix, the check and variable processing can be implemented by simple hardware.
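The constraint of [15] is straightforward to check on a candidate key vector; a small sketch, with the function name ours:

```python
def satisfies_block_constraint(key_vec, p):
    """True if every two circularly adjacent nonzero positions of the
    length-r key vector are at least p apart, so that each circulant
    submatrix of H can be divided into p x p identity blocks."""
    r = len(key_vec)
    pos = [j for j, bit in enumerate(key_vec) if bit]
    return all((pos[(k + 1) % len(pos)] - pos[k]) % r >= p
               for k in range(len(pos)))
```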
3 Prior Designs for Increasing Parallelism
To further reduce the decoding latency, higher parallelism is essential. In [15], higher parallelism requires a larger constraint p. However, large p, such as 64, makes finding a compliant \(\textbf{H}\) difficult, decreases the number of valid secret keys, may raise the DFR, and thus compromises the security. To address this issue, two parallel decoders have been proposed in the preliminary work [16] to achieve higher decoding speed without further increasing p.
3.1 Multi-Identity-Block Parallel Processing
The first decoding scheme in [16] maintains the same \(\textbf{H}\) matrix division as in [15], but processes \(m_1>1\) identity blocks at the same time, enhancing the speed without increasing p. This design does not degrade the DFR of row-layered decoding since each column has at most one nonzero entry within the same layer.
Figure 2: Top-level architecture of the multi-identity-block parallel decoder [16].
Figure 2 depicts the hardware architecture of the multi-identity-block parallel row-layered decoder in [16] with \(m_1=2\). The check node unit (CNU) computes c2v messages from input v2c messages according to Lines 8-14 of Algorithm 1. Implementation architectures for the CNU can be found in [20]. The CNU consists of two parts. CNU A processes \(m_1\) v2c messages at a time, calculating min1, min2, idx, and s according to Lines 8-11 of Algorithm 1. These values for the \(L=p\) rows are stored at one RAM M address as compressed c2v messages. CNU B derives the actual c2v messages from their compressed form per Lines 13 and 14 of Algorithm 1. Since L rows of \(\textbf{H}\) are processed in parallel, L CNU As and two groups of \(m_1L\) CNU Bs, denoted as CNU B\(_0\) and CNU B\(_1\) in Fig. 2, are utilized. The two groups of CNU Bs recover \(v^{(k-1,f)}\) and \(v^{(k,f)}\) in Eq. 1, respectively. RAM I stores the starting column indices of the blocks in the first layer. The column indices for the next layer are computed by adding L, modulo r, to those of the current layer. RAM S records the sign bits of the v2c messages, and RAM T buffers \(u^{(k,l)}\) before it is added with \(\alpha v^{(k,l)}\).
In Fig. 2, RAM U records the a posteriori information during the decoding, with the information of p columns stored at one RAM U address. In the matrix division scheme shown in Fig. 1, an identity block in \(\textbf{H}\) can begin at any column, potentially spanning two adjacent RAM U addresses. In the case that one single identity block is processed in each clock cycle, RAM U can be divided into two memory banks storing even and odd addresses as proposed in [15]. Then the information for one identity block can be extracted in a single clock cycle from up to one address per RAM U bank by using the shifter in Fig. 2. The messages for \(m_1 > 1\) identity blocks span more addresses of RAM U, and those for the blocks to be processed in the same clock cycle may be located in different addresses of the same bank, causing memory access conflicts and hence requiring extra clock cycles to read the messages. To reduce the memory access conflicts, RAM U is divided into more banks in [16].
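The message extraction for one identity block can be sketched as follows, modeling RAM U as a simple list of p-message words and ignoring the bank partitioning details (the list-based model and names are our assumptions):

```python
def fetch_block_messages(ram_u, i, p):
    """Gather the p a posteriori values of the identity block starting at
    column i. Each RAM U address holds the messages of p consecutive
    columns, so a block spans at most two adjacent addresses; slicing the
    concatenated words models the shifter in Fig. 2. Edge wrap-around and
    the even/odd bank split are omitted in this model."""
    addr, offset = divmod(i, p)
    word = list(ram_u[addr])
    if offset:                          # block straddles two addresses
        word += list(ram_u[addr + 1])
    return word[offset: offset + p]
```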
Table 1 lists the speedup factors over the serial decoder achieved by the multi-identity-block decoder design in [16], averaged over 10 randomly generated MDPC codes with \((r, w) = (4801, 45)\) and \(p=32\). For these example codes, if RAM U is divided into \(b > 4\) banks, the depth of each bank drops below 60. In many CMOS processes, shallow memory banks require more silicon area per bit. To avoid a very large silicon area for RAM U, it is limited to \(b = 4\) banks in [16], as shown in Fig. 2. Each bank stores p messages per address, allowing the information for up to b identity blocks of size \(p \times p\) to be read per clock cycle. Thus, increasing \(m_1\) beyond the number of RAM U banks does not yield further speedup. For the same \(m_1\), increasing the number of RAM U banks helps to reduce the message access conflicts. For the same number of RAM U banks, further increasing \(m_1\) leads to less additional speedup since it is more difficult to avoid memory access conflicts.
Figure 3: Variable-width dynamic division for an MDPC code with \((r, w) = (17, 3)\) and \(p=4\) [16].
3.2 Variable-Width Large-Block Parallel Processing
An alternative parallel decoder was proposed in [16] that partitions the \(\textbf{H}\) matrix into layers of \(L =m_2 p\) rows. Each row layer is divided into blocks of height L but variable width, as shown in Fig. 3 for a toy MDPC code with parameters \((r, w) = (17, 3)\), \(p=4\), and \(L=8\). In a row layer, each block begins at a column with a nonzero entry in the first row and ends before the next column with a nonzero entry in the first row or after including L consecutive columns, whichever happens earlier. For example, the first block in the first layer of \(L=8\) rows in Fig. 3 starts from column 0 and ends at column 6 because column 7 has a ‘1’ in the first row, while the second block starts from column 7 and ends at column 14 because \(L=8\) columns have been included. A block may include at most \(m_2= L/p \) diagonal lines, each of which begins at the first column of the block. The starting columns of the blocks in the current layer are incremented by L, modulo r, to compute those of the next layer. The relative starting rows of the diagonals in a block do not change from layer to layer.
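A sketch of the variable-width division, assuming the starting columns of one circulant submatrix (those with a nonzero entry in the layer's first row) are given in increasing order:

```python
def variable_width_blocks(starts, L, r):
    """Return (start_column, width) pairs for one layer: each block begins
    at a column with a '1' in the layer's first row and extends to the next
    such column or for L columns, whichever comes first."""
    blocks = []
    for k, s in enumerate(starts):
        gap = (starts[(k + 1) % len(starts)] - s) % r
        if gap == 0:                    # single diagonal in the whole submatrix
            gap = r
        blocks.append((s, min(gap, L)))
    return blocks

# First layer of the toy code in Fig. 3: starts at columns 0 and 7 with L = 8
# yields blocks (0, 7) and (7, 8), matching the figure.
```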
In this division scheme, when more than one nonzero entry exists in the same column of the same layer, the v2c messages for the second and later entries are not calculated using the most recent c2v messages from all check nodes. This increases the DFR [16]. A hybrid processing scheme is designed to eliminate this performance degradation. Blocks with more than one nonzero entry in a column are horizontally divided into \(m_2\) segments of equal height, each containing at most one nonzero entry in each column, as shown by the dashed blue lines in Fig. 3. One clock cycle is spent on processing each segment. The later segments of divided blocks are processed after all single-diagonal blocks and previous segments in the same layer in order to ensure the incorporation of the most updated c2v messages. According to this processing order, the clock cycles in which the segments and blocks in the first layer of \(\textbf{H}\) are processed are labeled in Fig. 3. Since the computation of each message follows the formula in Eq. 2, this scheme does not have DFR degradation compared to the decoder in [15].
For different combinations of p and \(L=m_2p\), the speedup factors achieved by the variable-width-large-block decoder in [16] compared to the serial decoder averaged over 10 randomly generated MDPC codes with \((r, w) = (4801, 45)\) are shown in Table 2. Larger L and p yield higher speedup, but \(L>64\) should be avoided to prevent shallow memories. While increasing p offers limited speed improvements, the largest p that does not lead to any security concerns should be used.
The implementation architecture for this variable-width large-block decoder with hybrid processing closely resembles that in Fig. 2, with three minor differences: i) the starting column index of the first nonzero diagonal and the relative starting rows of the second and later diagonals in a block are stored in a single RAM I address; ii) the second and later segments of divided blocks may include up to \(m_2\) nonzero entries per row, but each of them has only p rows. Hence, p CNUs handle \(m_2\) inputs each, while each of the others takes care of one input; iii) \(m_2\) shifters and reverse shifters, instead of one of each, are required to align the messages for at most \(m_2\) diagonals in each block.
4 Proposed Computation Scheduling Optimizations
The achievable speedup of both of the schemes introduced in Section 3 is far below \(m_1p\) or \(m_2p\) because of data access conflicts. In this section, out-of-order processing is proposed to mitigate the data access conflicts for both of the designs to achieve further speedup.
4.1 Optimization on Multi-Identity-Block Parallel Processing
The multi-identity-block parallel decoder from Section 3.1 processes multiple identity blocks in parallel. However, memory access conflicts occur when the processing of these blocks needs more than one message from the same memory bank. To mitigate this, an out-of-order processing approach is proposed in this subsection.
Instead of processing the blocks in a layer sequentially, the proposed scheme reorders the blocks to reduce the frequency of memory access conflicts. To achieve this goal, the identity blocks in the \(\textbf{H}\) matrix are categorized into groups based on the RAM U banks they access. For an example MDPC code with \((r, w)=(4801,45)\) and \(p=32\), \(\lceil \log _2(4801\times 2)\rceil =14\) bits are needed to represent the addresses of the columns in \(\textbf{H}\) as \(i= i_{13}i_{12}\cdots i_{0}\). Assume that RAM U is divided into b banks, and the messages for column i are stored in address \(\lfloor \lfloor i/p\rfloor /b\rfloor \) of bank \(\lfloor i/p\rfloor \bmod b\). As shown in Fig. 1, when \(i \le r-p\), an identity block starting from column i needs to access bank \(\lfloor i/p\rfloor \bmod b\) alone or together with bank \((\lfloor i/p\rfloor +1)\bmod b\). For the case of \(p=32\) and \(b=4\), \(\lfloor i/p\rfloor \bmod b= i_6i_5\). As a result, \(i_6i_5\) determines the RAM U banks to access, as listed in Table 3. The identity blocks with \(i_6i_5\)=‘00’ and those with \(i_6i_5\)=‘10’ do not access the same memory banks and can be processed simultaneously without conflicts, as can the blocks with \(i_6i_5\)=‘01’ and ‘11’. To simplify the notations, the identity blocks whose first columns have \(i_6i_5\)=‘00’, ‘01’, ‘10’, and ‘11’ are put into groups 0, 1, 2, and 3, respectively, as shown in Table 3. As also illustrated in Fig. 1, when \(r-p< i < r\), the block starting from column i spans columns i to \(r-1\) and 0 to \(i+(p-1)-r\). As a result, this block needs to access bank \(\lfloor i/p\rfloor \bmod b\) and bank 0 of RAM U, and does not necessarily follow Table 3. Each row layer has at most one such block, and a separate clock cycle is spent on processing it to avoid memory access conflicts.
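In code, the group of a block and the bank(s) it touches reduce to a few index manipulations; a minimal sketch for \(p=32\) and \(b=4\), with helper names ours:

```python
def group_of(i, p=32, b=4):
    """Group of the identity block starting at column i: the index of the
    first RAM U bank it accesses (bits i6 i5 for p = 32, b = 4)."""
    return (i // p) % b

def banks_accessed(i, p=32, b=4):
    """RAM U bank(s) holding the block's p messages (assuming i <= r - p).
    Only blocks starting at a multiple of p touch a single bank."""
    first = (i // p) % b
    return {first} if i % p == 0 else {first, (first + 1) % b}
```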
For the case of \(m_1=2\), to reduce memory access conflicts, two identity blocks, one each from groups 0 and 2, are processed simultaneously in each clock cycle. If the cardinalities of groups 0 and 2 are not the same, the blocks in the longer group are processed one by one afterwards. The same processing is repeated for groups 1 and 3. Due to the cyclic shift structure of the \(\textbf{H}\) matrix, an identity block starting from column i in a layer is shifted to an identity block starting from column \((i+p)\bmod r\) in the next layer. If the message for a column is stored in memory bank j in the current layer, then the message for the shifted column in the next layer is stored in memory bank \((j+1)\bmod b\), except when \(i+p\ge r\), which corresponds to identity blocks wrapping around the edges of the submatrices of \(\textbf{H}\). Hence, when \(i+p<r\), the blocks that do not have memory access conflicts are all shifted to the next group and they can still be processed simultaneously. If \(i+p\ge r\), the corresponding shifted block in the next layer belongs to group 0.
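Building on group_of above, the pairing and the layer-to-layer movement can be sketched as follows; for simplicity, only start columns within one circulant submatrix are modeled:

```python
def schedule_identity_blocks(start_cols, p=32, b=4, r=4801):
    """One layer's schedule for m1 = 2: blocks from groups 0 and 2 are
    paired, then groups 1 and 3; leftovers of a longer group and the (at
    most one) wrap-around block each get their own cycle."""
    groups = [[] for _ in range(b)]
    wrap = []
    for i in start_cols:
        (wrap if r - p < i < r else groups[group_of(i, p, b)]).append(i)
    cycles = []
    for ga, gb in ((0, 2), (1, 3)):
        a, c = groups[ga], groups[gb]
        k = min(len(a), len(c))
        cycles += list(zip(a, c))                 # two blocks per cycle
        cycles += [(i,) for i in a[k:] + c[k:]]   # leftovers, one per cycle
    return cycles + [(i,) for i in wrap]          # wrap-around block alone

def shift_to_next_layer(i, group, p=32, r=4801, b=4):
    """Start column and group in the next layer: the column advances by p
    and the group by one, except wrap-around blocks, which land in group 0."""
    return (i + p) % r, 0 if i + p >= r else (group + 1) % b
```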
The proposed grouping scheme may be modified to further reduce the latency. An identity block that accesses only one RAM U bank can be processed simultaneously with more identity blocks without memory access conflicts. For example, from Table 3, a block in group 0 accessing only RAM U0 can also be processed in parallel with a block in groups 1 or 3 that accesses only RAM U1, both RAM U1 and U2, or only RAM U3. On the other hand, a block in group 0 accessing both RAM U0 and U1 cannot be processed simultaneously with any block in group 1 or 3. Taking this into account, the single-bank-access blocks can be placed at the end of their respective groups. In this case, when groups 0 and 2 have different sizes, the remaining blocks in the larger group are more likely to be single-bank-access blocks, and they can be processed simultaneously with the remaining blocks in groups 1 or 3 as much as possible. The single-bank-access blocks can also be categorized into separate groups to facilitate the combination of these blocks for parallel processing. However, none of these additional optimizations leads to noticeable speedup for codes with moderate or larger p. This is because single-bank-access blocks are those starting from columns whose indices are integer multiples of p. Given the random construction of \(\textbf{H}\), this happens with a probability of 1/p.
The overall implementation architecture of the multi-identity-block parallel MDPC decoder with the proposed optimization is very similar to that shown in Fig. 2. The only differences are that RAM I is divided into the same number of banks as the groups and their accesses are modified to implement the proposed scheme as shown in Fig. 4. The indices of the starting columns of the identity blocks in each group are stored in a separate RAM I bank. This allows the blocks belonging to different groups to be processed simultaneously according to the proposed scheduling scheme. The indices read from RAM I are used to generate the addresses for accessing the RAM U banks and extracting the messages for the identity blocks from RAM U outputs [15].
As mentioned previously, if a block starts from column i with \(i+p < r\) and it belongs to group j, then the corresponding shifted block in the next layer belongs to group \(j+1\) mod the number of groups. In the example shown in Table 3, the blocks in groups 0 and 2 are shifted to groups 1 and 3, respectively, and vice versa. Hence, the blocks in groups 0 and 2, as well as those in groups 1 and 3, can still be processed simultaneously without conflicts in the next layer. Considering this, instead of moving the entries among the RAM I banks, an entry i can be updated as \((i+p)\bmod r\) and written back in place in the next clock cycle to the same RAM I bank, as shown in Fig. 4. This helps to eliminate the extra rows needed to write the updated entries. The only exception is the case that \(i+p\ge r\). As mentioned previously, the corresponding shifted block belongs to group 0. Since the RAM I bank whose index equals \(-k\) mod the number of groups stores the blocks of group 0 during the processing of layer k, the shifted block needs to be written to that RAM I bank. Such an update is also taken care of by the architecture shown in Fig. 4, as explained in the following.
Since each row of \(\textbf{H}\) has 2w nonzero entries, the depth of each RAM I bank is set to 2w to take care of the worst case. The initial indices for each group are stored sequentially in each bank from address 0 to end1 as shown in Fig. 5, if there are a total of \(end1+1\) valid entries. If a block wraps around the edge of a submatrix of \(\textbf{H}\), the end1 of the corresponding RAM I bank is reduced by one to mark that the corresponding entry becomes invalid in the processing of the next layer. On the other hand, the new entry resulting from the shifting is written to address \(end2-1\) in the RAM I bank recording group 0 for the next layer. Initially, end2 is set to 2w, which means there is no valid entry in the bottom part of RAM I. It is decreased by one every time a new entry is written in. In the processing of a layer, the entries in a RAM I bank are accessed from address 0 to end1 and then from end2 to \(2w-1\), so that every valid entry is processed.
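A minimal software model of one RAM I bank's pointer bookkeeping (Fig. 5) is sketched below; the class and method names are ours, and the model assumes the entry invalidated by a wrap-around sits at the end of the top region:

```python
class RamIBank:
    """One RAM I bank of depth 2w: a top region (addresses 0..end1) holds
    the current valid entries, and a bottom region (end2..2w-1) collects
    blocks shifted in from wrap-arounds."""
    def __init__(self, indices, w):
        self.mem = list(indices) + [None] * (2 * w - len(indices))
        self.end1 = len(indices) - 1
        self.end2 = 2 * w                  # bottom region initially empty

    def entries(self):
        """Valid entries processed in one layer: top region, then bottom."""
        return self.mem[:self.end1 + 1] + self.mem[self.end2:]

    def invalidate_top(self):
        """A block wrapped around the submatrix edge: its entry leaves this bank."""
        self.end1 -= 1

    def accept_shifted(self, i):
        """The shifted wrap-around block joins the bank holding group 0."""
        self.end2 -= 1
        self.mem[self.end2] = i
```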
In Fig. 4, the output of the comparator in the middle is ‘1’ when \(i+p\ge r\), which means that the corresponding block wraps around the edge. In this case end1 is reduced by one. The starting column of the shifted block in the next layer is available at the output of the multiplexer in the top middle, and it is passed to the input of the OR gate. Each RAM I bank has a tag to keep track of whether it is storing the information for group 0 in the processing of a layer. The tag for bank j is initialized to j and is incremented by one, modulo the number of identity-block groups, when the decoder moves to the next layer. The starting column of the block shifted from the wrap-around block in the previous layer is written into the RAM I bank storing group 0 information through the two multiplexers at the bottom of Fig. 4.
As shown in Fig. 1, the last layer of \(\textbf{H}\) has \((r-\lfloor r/p \rfloor \cdot p)<p\) rows due to r being a prime number. After processing the last layer, every block starting at column i is shifted to a block starting at column \((i + (r-\lfloor r/p \rfloor \cdot p)) \mod r\) in the first layer of the next iteration. Updating all these entries in RAM I is very difficult. Instead, another set of RAM I banks is used in our design to record the initial indices of the blocks. At the beginning of the next iteration, the initial indices are used, and they are also copied to the other set of RAM I banks to be utilized at the beginning of the iteration after that.
For other values of \(m_1\) and numbers of RAM U banks, similar scheduling schemes can be easily designed by identifying groups of identity blocks that do not have memory access conflicts. The implementation architectures shown in Figs. 2 and 4 can be accordingly extended. A larger number of groups allows the processing of more identity blocks in parallel without conflicts and thus helps to further reduce the latency. On the other hand, a large number of groups results in shallow memories, which increase the silicon area. These factors must be carefully balanced in the design of the scheduling scheme.
Table 4 shows the speedup factors achieved by the first proposed decoder with computation scheduling optimization compared to a serial decoder for example MDPC codes with \((r, w) = (4801, 45)\) and constraint \(p = 32\). The proposed out-of-order processing avoids many memory access conflicts. Compared to the speedup factors listed in Table 1 for the preliminary decoder of Section 3.1 [16], the proposed optimization achieves significant additional speedup. For the same \(m_1\), the additional speedup increases with the number of RAM U banks since there is more flexibility in locating the groups of blocks without memory access conflicts. On the other hand, when \(m_1\) increases, it becomes more difficult to find a larger number of blocks without memory access conflicts. Accordingly, the additional speedup achieved by the proposed design becomes less significant for larger \(m_1\). When \(m_1=2\) and RAM U is divided into 4 banks, the first proposed decoder achieves 59.6 times speedup over the serial decoder, which is a \(59.6/44.3-1=35\%\) additional improvement compared to the preliminary design of Section 3.1 [16].
4.2 Optimization on Variable-Width Large-Block Parallel Processing
In this subsection, the out-of-order processing is applied to the variable-width large-block parallel decoder in Section 3.2 [16] to further reduce the latency.
When the \(\textbf{H}\) matrix is divided into blocks whose height is \(m_2p\) with \(m_2>1\), some blocks have more than one nonzero diagonal as shown in Fig. 3. As mentioned previously, to avoid error-correcting performance loss, at most one nonzero entry can be processed for each column at a time in row-layered decoding. As a result, the blocks with more than one diagonal are divided horizontally into \(m_2\) segments, and those segments are processed one after another. To further reduce the latency, segments without memory access conflicts from different blocks can be processed simultaneously. In Fig. 3, each block starts from a nonzero entry at its top-left corner. As a result, it is guaranteed that the top segment of each block has one single diagonal. Since the p CNUs in the variable-width large-block decoder are able to handle \(m_2\) v2c messages in each clock cycle, up to \(m_2\) top segments can be processed in parallel in an out-of-order manner.
To process \(m_2\) segments in parallel, the corresponding messages need to be accessed from the RAM U banks simultaneously. The grouping scheme developed for the out-of-order multi-block parallel decoder can also be extended to the variable-width large-block design to identify the segments that can be processed simultaneously. When the blocks are larger, RAM U is divided into a smaller number of banks of wider width. In the case that RAM U is divided into two banks, the blocks with more than one diagonal are divided into three groups according to the RAM U bank(s) that their top segments need to access: group 0 accessing only RAM U0, group 1 accessing only RAM U1, and group 2 accessing both RAM U0 and U1. Different from the multi-block design, only p messages need to be accessed for each top segment, which is only a portion of the RAM U width of \(m_2p\). Hence, the messages for a top segment are located in one single address of RAM U with a higher probability. It should also be noted that, to reduce the memory requirement, the locations of the second through \(m_2\)-th diagonals in a block with multiple diagonals are stored using their relative offsets inside the block. Hence, the information for all diagonals in a multi-diagonal block is stored at the same address of RAM I.
The information about the nonzero entries in the blocks whose top segments belong to groups 0, 1, and 2 is stored in three separate banks: RAM I0, I1, and I2, respectively, while that of the blocks with a single nonzero diagonal is stored in RAM I3. In the decoding process, the single-diagonal blocks are processed first by reading the information from RAM I3 one by one. Then one pair of entries is read from RAM I0 and I1 to process two top segments in each clock cycle. If group 0 is longer than group 1, or vice versa, more clock cycles are spent to process the remaining entries in the longer group one by one. After that, the top segments for the blocks stored in RAM I2 are processed one at a time. The later segments of the blocks with more than one diagonal are processed by following the same procedure. This is achieved by going through the entries in RAM I0, I1, and I2, one in each clock cycle, \(m_2-1\) times. According to this processing order, the clock cycles in which the blocks and segments are processed are labeled in Fig. 6 for an MDPC code with \((r, w) = (17, 3)\), \(p=4\), and \(m_2=2\).
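Putting this order together, one layer's schedule can be sketched as follows, where each list element holds the (block, segment) operations issued in one clock cycle (the representation is ours):

```python
def schedule_segments(single_diag, g0, g1, g2, m2):
    """Clock-cycle schedule of one layer, as labeled in Fig. 6.
    single_diag: blocks with one diagonal (RAM I3); g0/g1/g2: multi-diagonal
    blocks grouped by the RAM U bank(s) their top segments access."""
    cycles = [[(blk, 0)] for blk in single_diag]          # single-diagonal blocks
    k = min(len(g0), len(g1))
    cycles += [[(a, 0), (b, 0)] for a, b in zip(g0, g1)]  # paired top segments
    cycles += [[(blk, 0)] for blk in g0[k:] + g1[k:]]     # longer group's leftovers
    cycles += [[(blk, 0)] for blk in g2]                  # top segments using both banks
    for seg in range(1, m2):                              # later segments, one per cycle
        cycles += [[(blk, seg)] for blk in g0 + g1 + g2]
    return cycles
```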
The locations of the blocks for the next layer are generated on the fly and the contents of RAM I banks are updated accordingly in a similar way as for the multi-block parallel decoder in Section 4.1. The updated indices of the blocks from RAM I0, I1, or I2 that wrap around the edges of the submatrices of \(\textbf{H}\) may be written into other RAM I banks. Similar to the decoder in Section 4.1, another set of RAM I0, I1, and I2 is also used to store the initial indices of the multi-diagonal blocks. However, the updated entries from RAM I3 are always written back to itself since they are not included in the grouping for simultaneously processing the top segments. Hence, an additional copy of RAM I3 is not required.
The proposed out-of-order scheduling for variable-width large-block processing can also be implemented by an architecture similar to that in Fig. 2 with the RAM I component replaced by the architecture in Fig. 4. Since a block can have up to \(m_2\) diagonals, \(m_2\) shifters and reverse shifters are needed in Fig. 2. Up to \(m_2\) top segments can be processed simultaneously, each including one nonzero entry per row. The other segments may include at most \(m_2\) nonzero entries per row, but each of them only contains p rows. Hence, p CNUs are designed to take care of \(m_2\) inputs each, while each of the others handles one input.
For various combinations of p and \(m_2\), Table 5 lists the speedup factors achieved by the proposed variable-width large-block decoding with out-of-order processing over serial decoding, averaged over 10 randomly generated MDPC codes with \((r, w) = (4801, 45)\). The results listed in this table are based on the parallel designs using \(b=2\) RAM U banks. Compared to Table 2, the speedup brought by the out-of-order processing is modest. For the case of \(p=32\) and \(m_2=2\), it is \(50.2/46.2-1=9\%\). The blocks with more than one diagonal account for less than half of all the nonzero blocks due to the sparsity of the \(\textbf{H}\) matrix. Even if the top segments of those blocks can be combined pairwise and processed simultaneously, at most one clock cycle can be saved out of every four clock cycles for the case of \(m_2=2\). Hence, the achievable speedup is less than \(0.5/4=12.5\%\).
5 Hardware Complexity Analyses and Comparisons
This section analyzes the hardware complexities of the two proposed decoders with out-of-order processing and compares them with those of the best prior designs.
In MDPC decoders, memories contribute the majority of the overall silicon area. Table 6 summarizes the numbers and sizes of the RAM banks in the decoders for MDPC codes with parameters (r, w) and constraint p. To avoid dividing RAM U into very shallow memory banks, \(b=4\) and 2 are adopted for the proposed multi-block and variable-width large-block decoders, respectively. In Table 6, they are referred to as the first and second proposed designs, respectively. For the second proposed design, \(m_2\) should not exceed b and hence is set to 2. To compare designs with similar parallelism, \(m_1\) for the first proposed design is also set to 2. In this case, the depths of the RAM U banks are the same, while those in the second proposed design have twice the width. The two proposed designs utilize RAM M, S, and U with different depths, widths, and/or numbers of banks compared to those of the decoder in [15]. However, the total sizes of these memories in terms of the number of bits stored are very similar. The number of RAM I banks is decided according to the number of RAM U banks and the grouping schemes, as explained in the previous section. The depth of each RAM I bank is 2w to take care of the worst case. Compared to the first and second proposed designs in this paper, the decoders in the preliminary work [16] have the same memory requirements, except that they only require one RAM I bank of 2w rows. Since RAM I is much smaller than RAM M, U, and S, the larger number of RAM I banks needed in the proposed designs only leads to a negligible increase in the overall memory requirement.
The sizes of memories utilized in the decoders are listed in Table 7 for MDPC codes with \((r, w)=(4801,45)\) and \(p=32\). The magnitudes of v2c (c2v) messages and a posteriori information are represented by \(q=4\) bits and \(a=10\) bits, respectively. Compared to the design in [15] and the two preliminary designs in [16], the RAM I of the two proposed designs is larger. However, RAM I is much smaller compared to the other memories. Overall, the memory requirements of both proposed decoders are very similar to those of the preliminary decoders in [16].
In the two proposed decoders, the architecture in Fig. 4 replaces the RAM I components in Fig. 2. The comparators, multiplexers, adders, and registers are used to generate the addresses and select the data to be written back to RAM I. For each decoder, the complexities of the logic components in Fig. 4 and the computing units in Fig. 2, including multiplexers, adders, CNUs, and shifters, are analyzed. Their complexities in terms of the number of NAND2 gates are listed in Table 7. Using the assumption that the silicon area required for one memory bit is equivalent to that of 1.5 NAND2 gates [11], the overall silicon area of each decoder represented by the number of NAND2 gates is estimated and listed in Table 7. It can be observed that the total area of each proposed decoder is almost the same as that of its corresponding preliminary design. This is because the memories dominate the overall area requirement. Both of the proposed decoders result in less than 10% silicon area overhead compared to the design in [15].
As in the decoders of [15] and [16], the critical paths of the proposed decoders are determined by the adder in the a posteriori information computation. Since the same number of bits is used to represent the information, the two proposed decoders achieve the same critical path as the decoders in [15] and [16]. Hence, the speedup achieved by each decoder is calculated using the number of clock cycles needed, as listed in Table 7. The two proposed decoders achieve 35% and 9% additional speedup compared to the first and second preliminary decoders in [16], respectively. The proposed out-of-order processing is more effective on the first decoder, which processes multiple identity blocks in each clock cycle.
The examples given above are for MDPC codes with \((r,w)=(4801,45)\), which are the shortest codes considered for the standard. For longer codes with the same constraint p, RAM U can be divided into a larger number of banks with reasonable depth. This enables more flexibility in the grouping schemes for out-of-order scheduling, thereby further increasing the achievable speedup.
6 Conclusions
This paper proposes computation scheduling optimizations for two row-layered MDPC decoders to further reduce the decoding latency. In the first proposed design, identity blocks in the same layer are processed in an out-of-order manner to mitigate memory access conflicts, enabling more blocks to be processed in parallel. For the second proposed design, out-of-order scheduling is utilized to process more top segments of blocks in the same layer simultaneously. Efficient hardware architectures are also developed for both decoders. The proposed designs achieve significant speedup over prior work with negligible area overhead. Future work will extend the proposed computation scheduling optimizations to other MDPC decoders for post-quantum cryptography.
Data Availability
Not applicable.
References
NIST. (2024). Post-quantum cryptography. NIST. Accessed: Sep. 2024. [Online]. Available: https://csrc.nist.gov/projects/post-quantum-cryptography
Aragon, N., et al. (2024). BIKE: Bit flipping key encapsulation, Round 4 Submission. NIST. Accessed: Sep. 2024. [Online]. Available: https://bikesuite.org
Misoczki, R., Tillich, J.-P., Sendrier, N., & Barreto, P. S. L. M. (2013). MDPC-McEliece: New McEliece variants from moderate density parity-check codes. Proceedings of IEEE International Symposium on Information Theory (pp. 2069-2073).
Bartz, H., & Liva, G. (2019). On decoding schemes for the MDPC-McEliece cryptosystem. Proceedings of International ITG Conference on Systems, Communications and Coding (pp. 1-6)
Paiva, T. B., & Terada, R. (2022). Faster constant-time decoder for MDPC codes and applications to BIKE KEM. IACR Transactions on Cryptographic Hardware and Embedded Systems, 2022(4), 110–134.
Santini, P., Battaglioni, M., Baldi, M., & Chiaraluce, F. (2020). Analysis of the error correction capability of LDPC and MDPC codes under parallel bit-flipping decoding and application to cryptography. IEEE Transactions on Communications, 68(8), 4648–4660.
Arpin, S., et al. (2022). A study of error floor behavior in QC-MDPC codes. Proceedings of International Conference on Post-Quantum Cryptography (pp. 89-103)
Maurich, I.V., & Güneysu, T. (2014). Lightweight code-based cryptography: QC-MDPC McEliece encryption on reconfigurable devices. Proceedings of IEEE Design, Automation & Test in Europe Conference & Exhibition (pp. 1-6).
Maurich, I. V., Oder, T., & Güneysu, T. (2015). Implementing QC-MDPC McEliece encryption. ACM Transactions on Embedded Computing Systems, 14(3), 1–27.
Hu, J., & Cheung, R. (2017). Area-time efficient computation of Niederreiter encryption on QC-MDPC codes for embedded hardware. IEEE Transactions on Computers, 66(8), 1313–1325.
Xie, Z., & Zhang, X. (2023). Sparsity-aware medium-density parity-check decoder for McEliece cryptosystems. IEEE Transactions on Circuits and Systems-II, 70(9), 3343–3347.
Cai, J., & Zhang, X. (2025). Efficient layered new bit-flipping QC-MDPC decoder for BIKE post-quantum cryptography. Proceedings of IEEE International Symposium on Circuits and Systems (pp. 1-5).
Chen, J., & Fossorier, M. (2002). Density evolution for two improved BP-based decoding algorithms of LDPC codes. IEEE Communications Letters, 6(5), 208–210.
Cai, J., & Zhang, X. (2023). Low-complexity parallel Min-sum medium-density parity-check decoder for McEliece cryptosystem. IEEE Transactions on Circuits and Systems-I, 70(12), 5328–5338.
Cai, J., & Zhang, X. (2024). Highly efficient parallel row-layered Min-Sum MDPC decoder for McEliece cryptosystem. arXiv preprint arXiv:2407.12695.
Cai, J., & Zhang, X. (2024). Low-latency parallel row-layered Min-sum MDPC decoder for McEliece cryptosystem. Proceedings of IEEE Workshop on Signal Processing Systems (pp. 147-152).
Guo, Q., Johansson, T., & Stankovski, P. (2016). A key recovery attack on MDPC with CCA security using decoding errors. Proceedings of ASIACRYPT (pp. 789-815).
Santini, P., Battaglioni, M., Chiaraluce, F., & Baldi, M. (2019). Analysis of reaction and timing attacks against cryptosystems based on sparse parity-check codes. Proceedings of Code-Based Cryptography: 7th International Workshop (pp. 115-136).
Mansour, M., & Shanbhag, N. (2006). A 640-Mb/s 2048-bit programmable LDPC decoder chip. IEEE Journal of Solid-State Circuits, 41(3), 684–698.
Zhang, X. (2015). VLSI Architectures for Modern Error Correcting Codes. CRC Press.
Zhang, X., & Tai, Y. (2014). High-speed multi-block-row layered decoding for Quasi-cyclic LDPC codes. Proceedings of IEEE Global Conference on Signal and Information Processing (pp. 11-14).
Sun, Y., Wang, G., & Cavallaro, J. (2011). Multi-layer parallel decoding algorithm and VLSI architecture for quasi-cyclic LDPC codes. Proceedings of IEEE International Symposium on Circuits and Systems (pp. 1776-1779).
Funding
This material is based upon work supported by the National Science Foundation under Award No. 2052641.
Author information
Contributions
Not applicable.
Ethics declarations
Conflicts of Interest
The authors have no conflict of interest to declare for this material.
Ethics Approval
This study does not involve any research with human participants or animals conducted by the authors.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.