Abstract
Homomorphic encryption allows computations to be performed directly on ciphertexts, serving as a key enabler for privacy-preserving cloud computing. The computations over ciphertexts involve large integer modular multiplications. In addition, the overall complexity of ciphertext multiplication can be further reduced by utilizing the quotients of integer divisions. An efficient quotient computation architecture was developed in our previous work for the case that the divisor has three nonzero bits. This paper first generalizes our prior design to accommodate divisors with more nonzero bits. To further reduce the latency, a new non-iterative quotient computation method is developed by simplifying the Barrett reduction. Rigorous mathematical proofs are provided for both proposed schemes, and they are also adapted for modular reduction. Furthermore, this paper develops efficient hardware implementation architectures for the proposed algorithms, and optimizations are carried out to further reduce the complexity. Compared to the best prior integer division (modular reduction) schemes, our proposed designs can achieve around 60% (57%) smaller area and 70% (71%) shorter latency when the modulus has 64 bits.
1 Introduction
Homomorphic encryption (HE) is a cryptography scheme that allows computations to be performed on encrypted data without requiring decryption. It can be utilized to preserve the privacy of machine learning as a service over the cloud [1]. Popular HE schemes, such as CKKS [2, 3], B/FV [4, 5], and BGV [6], carry out computations over high-degree polynomials whose coefficients are very long integers. High-degree polynomial modular multiplication can be substantially simplified by the number theoretic transform (NTT) [7,8,9]. The modular reduction can be incorporated into a decomposed multiplication process [10, 11] to further reduce the complexity. Using the residue number system (RNS) [3, 12, 13], the computations over long coefficients of the polynomials are broken down into those over shorter integers. After these techniques are adopted, simplifying the integer modular reduction is one of the keys to further reducing the complexity of HE schemes. Moreover, the overall complexity of ciphertext multiplication is reduced in [14, 15] by combining and reformulating the related computations, which require quotient computations from integer divisions.
The quotient can be computed using a lookup table-based approach [16]. However, the size of the lookup table grows exponentially with the divisor’s bit length. The quotients are derived utilizing the shifted inverse of the divisor in [17, 18]. For dividing a 2w-bit integer by a w-bit divisor, the algorithm proposed in [17] requires a \(2w\times 2w\) multiplier for quotient computation and introduces errors. The correctly rounded approach (CRA) in [18] computes the quotient precisely as \(\alpha \times \lambda + \beta \), where \(\lambda \) is the 2w-bit dividend, \(\alpha \) represents the highest bits of the shifted inverse of the divisor, and \(\beta \) is a 2w-bit round-off integer. Although the width of \(\alpha \) is reduced to \(\le 2w\), a \(2w \times 2w\) multiplier is still required in the worst case. While the Barrett reduction [19] was originally proposed for modular reduction, it can be easily adapted for integer division. Despite the simplifications in [20], the approach requires two \(w\times w\) multipliers, which impose a large area requirement. The area is reduced in [21] for a constant divisor, q. However, it is limited to the case where \(q=2^w-2^{l_1}\pm \cdots +1\), with \(l_1<w/2\).
Modular multiplication can be implemented using the Montgomery algorithm [22]. Although it integrates multiplication with modular reduction, transformations to and from the Montgomery domain are required at the beginning and end, respectively. Hence, it has a long latency [23]. Modular reduction can be simplified by Barrett reduction [19, 20]. Although the architectures based on Barrett reduction [24] offer lower latency, they require a larger area as previously mentioned. The designs in [25,26,27] integrate modular reduction into decomposed Karatsuba multiplication. For constant divisors in specific formats, their complexities are lower than those of the architectures based on Barrett reductions.
In the preliminary version of this work [28], an iterative procedure is developed to compute the quotient for the case where \(q=2^w-2^l\pm 1\). It only needs shifters and adders and leads to a shorter critical path and a lower area compared to the quotient computation architectures in [17,18,19]. However, the number of primes with only 3 nonzero bits is limited, and those suitable for the NTT transform are even more scarce.
This paper first generalizes the iterative integer divider proposed in [28] to cases where the divisor q contains more nonzero bits. Formulas are developed and rigorous mathematical proofs are provided for the number of iterations needed. The design is also adapted for modular reduction. As the number of nonzero bits in q increases, each iteration requires more adders and shifters to implement. Besides, depending on the differences between the exponents of the nonzero bits, it may need a large number of iterations. To address these issues, a second new design is proposed in this paper to compute the quotient and remainder by simplifying the Barrett reduction. Through rigorous mathematical derivations, it is discovered that the involved multiplications have a small number of partial products to add. Efficient hardware architectures are developed for both proposed designs, and additional reformulations and optimizations are introduced to reduce the hardware complexity further. Compared to the best prior quotient computation architecture [19] (modular reduction architecture [27]), our second proposed design can achieve 60% (57%) smaller area and 70% (71%) shorter latency when \(w=64\) and q has 4 nonzero bits. The improvements are more significant for larger w or primes with fewer nonzero bits.
The organization of this paper is as follows. Section 2 introduces previous research on integer division and modular reduction. Section 3 presents our preliminary work for quotient computation through an iterative process. The iterative process is generalized and adapted to carry out the modular reduction in Section 4. Section 5 proposes simplified Barrett reduction for quotient computation and modular reduction. Hardware implementation architectures and complexity comparisons are provided in Section 6. Section 7 concludes the paper.
2 Existing Integer Division and Modular Reduction Algorithms
Suppose the dividend \(\lambda \) is written as \(bq+r\), where b is the quotient and r is the remainder of the division. In [17], for the case when \(\lambda \) has 2w bits, the quotient is approximated by multiplying \(\lambda \) with a pre-computed 2w-bit constant \(J = \lfloor 2^{3w-1}/q\rfloor + 1\) and extracting the higher w bits of the product. This approach results in either \(\lfloor \lambda /q\rfloor \) or \(\lfloor \lambda /q\rfloor +1\) and requires a (\(2w\times 2w\))-bit multiplier, which occupies a large area. In [18], the quotient b is computed as \(\alpha \times \lambda +\beta \), where \(\alpha \) and \(\beta \) are determined by an algorithm that minimizes the number of nonzero bits in \(\alpha \) for a given q. \(\alpha \) and \(\beta \) have at most 2w bits from the algorithm.
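As an illustration of this estimate (our own sketch, not code from [17]; the extraction of the higher product bits is modeled here as a right shift by \(3w-1\) positions), the following check confirms that the result is always \(\lfloor \lambda /q\rfloor \) or \(\lfloor \lambda /q\rfloor +1\):

```python
# Model of the shifted-inverse quotient estimate described above ([17]):
# J = floor(2^(3w-1)/q) + 1; the estimate floor(lam*J / 2^(3w-1)) is b or b+1.
import random

w = 16
q = (1 << w) - (1 << 10) + 1                 # an example w-bit divisor
J = (1 << (3 * w - 1)) // q + 1              # pre-computed constant
for _ in range(10000):
    lam = random.randrange(1 << (2 * w))     # random 2w-bit dividend
    est = (lam * J) >> (3 * w - 1)
    assert est - lam // q in (0, 1)          # estimate equals b or b+1
```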
The Barrett reduction algorithm [19] is primarily designed for modular reduction rather than quotient computation. When \(\lambda \) has 2w bits, it computes \(r^*=\lambda - \lfloor \lambda T / 2^{2w}\rfloor q\), where T is precomputed as \(\lfloor 2^{2w} / q \rfloor \). The result \(r^*\) lies within the range [0, 2q). By comparing \(r^*\) with q, the final remainder is determined as either \(r^*\) or \(r^*-q\). This computation is simplified in [20], which first calculates an approximate quotient, \(\tilde{b}\), in Eq. 1 from the product of \(\lfloor \lambda /2^{w-1}\rfloor \) and T.
It has been derived that \(\tilde{b} \in \{b-2, b-1, b\}\). Subsequently, the remainder r is first approximated in Eq. 2 as \(\tilde{r}\), which satisfies
\(\tilde{r} \in \{r, r+q, r+2q\}\) and r can be determined by comparing \(\tilde{r}\) with q and 2q and carrying out the corresponding subtraction. In [21], it has been shown that when the modulus q is represented in a signed binary form of minimal Hamming weight as \(q=2^{w}\pm 2^{l_1}\pm 2^{l_2}\pm \cdots \pm 2^{l_m}+1\) with \(l_m<\cdots<l_1<w/2\), \(T=\lfloor 2^{2w} / q \rfloor \) can be expressed as \(2^{w}\mp 2^{l_1}\mp 2^{l_2}\mp \cdots \mp 2^{l_m}-1\). For these cases, the multipliers of T and q are replaced by adders.
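For reference, the original Barrett reduction recalled above can be modeled in a few lines of Python (our own sketch; not the hardware datapath):

```python
# Plain Barrett reduction: r* = lam - floor(lam*T / 2^(2w)) * q falls in
# [0, 2q), so a single conditional subtraction yields the remainder.
def barrett_remainder(lam, q, w):
    T = (1 << (2 * w)) // q                  # pre-computed T = floor(2^(2w)/q)
    b_est = (lam * T) >> (2 * w)             # approximate quotient
    r_star = lam - b_est * q
    assert 0 <= r_star < 2 * q
    return r_star - q if r_star >= q else r_star

w, q = 16, (1 << 16) - (1 << 12) + 1
for lam in (0, 123456789, (1 << 32) - 1):
    assert barrett_remainder(lam, q, w) == lam % q
```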
Modular multiplication can be carried out utilizing the Montgomery algorithm [22, 23]. However, it requires conversions to and from the Montgomery domain, which lead to a high complexity. The designs in [25,26,27] propose to incorporate modular reduction into the decomposed multiplication using the Karatsuba algorithm and consider the case that q is in the format of \(q=2^w-\delta \), where [25, 26] assume \(\log _2{\delta }\le w/2\) and [27] considers \(\log _2{\delta }\le 2w/3\). Each operand for the multiplication is written as \(a = a_1 2^{w/2}+a_0\), and modular reduction is applied to the products of the segments of the two operands. Although this approach lowers the hardware complexity due to the shorter operands of the multiplications, it incurs a longer latency than designs based on Barrett reduction.
3 Iterative Integer Division when \(q = 2^w - 2^l \pm 1\)
Algorithm 1 was proposed in our preliminary work [28] to simplify the quotient computation when the divisor only contains three nonzero bits, i.e., \(q = 2^w - 2^l \pm 1\) \((0<l<w)\). Assume that \(\lambda = bq + r\) is a 2w-bit number. It is defined that \(c = \left\lfloor \lambda / 2^w \right\rfloor \) and \(f(x) = x + v(x)\), where \(v(x) = \left\lfloor (-1) x (2^l \mp 1) / 2^w \right\rfloor \). Then the integer b satisfying \(f(b)=c\) is the quotient of \(\lambda /q\). It has been proved in [28] that the iterative process in Algorithm 1 yields \(b^*\) such that \(f(b^*) = c\), and \(b^*\in \{b, b\pm 1\}\). The actual value of b can be determined from \(b^*\) according to Lines 4-6 in Algorithm 1. Mathematical proofs have been provided to demonstrate that the number of iterations required is at most t, where t is the unique integer satisfying \((t - 1)w / t < l \le tw / (t + 1)\).
Algorithm for calculating \(b=\lfloor \lambda /q\rfloor \) [28].
Hardware architectures were also proposed in [28] to implement Algorithm 1. Each iteration of the algorithm is mapped to separate hardware units so that the overall computation is fully pipelinable. Each iteration requires two w-bit adders, two w-bit Barrel shifters, one w-bit multiplexer, and w registers. Simplified architectures have also been developed to implement Lines 4-6 of Algorithm 1, leveraging the fact that q has only three nonzero bits and \(r^*\) consists of at most \(w+1\) bits.
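Although the pseudo-code listing is not reproduced here, the procedure can be summarized by the following Python sketch (ours; names are illustrative, and the final correction follows the description of Lines 4-6):

```python
# Software model of the iterative quotient computation for q = 2^w - 2^l +/- 1
# (illustration only; the hardware uses shifters, adders, and a decision unit).
def iterative_divide(lam, w, l, plus_one=True):
    q = (1 << w) - (1 << l) + (1 if plus_one else -1)
    delta = (1 << w) - q                     # q = 2^w - delta
    c = lam >> w                             # c = floor(lam / 2^w)
    v = lambda x: (-x * delta) >> w          # v(x) = floor(-x*delta / 2^w)
    b = c                                    # b_0 = c
    while b + v(b) != c:                     # Lines 2-3: iterate until f(b*) = c
        b = c - v(b)
    r = lam - b * q                          # Lines 4-6: correct b* if needed
    if r < 0:
        b, r = b - 1, r + q
    elif r >= q:
        b, r = b + 1, r - q
    return b, r

w, l = 16, 10
lam = 0xFEDC_BA98                            # a 2w-bit dividend
q = (1 << w) - (1 << l) + 1
assert iterative_divide(lam, w, l) == (lam // q, lam % q)
```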
In the polynomial ring of HE schemes, modulo reduction by \(x^N+1\) follows multiplications of polynomials of length N. The negatively-wrapped NTT [8] is essential for significantly reducing the complexity of these modular polynomial multiplications. It requires that the modulus q of the coefficients be a prime and \(q=1\mod 2N\). Table 1 lists the numbers of primes with three nonzero bits satisfying these requirements for different bit lengths w and the ring dimension \(N = 2^{14}\), which are commonly used in many HE applications [29]. Since the primes in the format of \(2^w-2^l-1\) equal \(-1 \mod 2N\), they cannot be used for the negatively-wrapped NTT. When l falls into one of the four ranges listed in the table, Algorithm 1 needs 1, 2, 3, and \(\ge 4\) iterations, respectively. It can be observed that the usable primes are very limited, and the total number of usable primes reduces for larger N.
The HE schemes operate on coefficients with thousands of bits. To shorten the involved multiplications, the RNS scheme is necessary and the modulus needs to be decomposed into multiple factors as \(Q=q_0q_1\cdots \), where each \(q_i\) is a prime equaling \(1 \mod 2N\). Also, all \(q_i\) factors should have l within the same range relative to w, ensuring the numbers of iterations needed by Algorithm 1 and hence the latencies for quotient computations across all factors are the same. From Table 1, there are not sufficient primes that can be used as decomposed moduli for RNS representation of HE schemes incorporating the NTT.
Table 2 presents the numbers of primes with four non-zero bits that can be used in NTT for the ring dimension \(N=2^{14}\). Comparing this table and Table 1, many more primes are available if four nonzero bits are allowed in the binary representation of q.
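Such primes can be enumerated with a simple search. The sketch below (ours; it assumes the sympy package is available for primality testing and is not code from the paper) illustrates the idea for four nonzero bits; the counts reported in Tables 1 and 2 come from the paper itself.

```python
# Enumerate w-bit primes with four nonzero bits, q = 2^w - 2^l1 +/- 2^l2 + 1,
# satisfying the NTT requirement q = 1 (mod 2N).  Illustration only.
from sympy import isprime

def ntt_friendly_primes(w, N):
    found = []
    for l1 in range(2, w - 1):
        for l2 in range(1, l1):
            for sign in (1, -1):
                q = (1 << w) - (1 << l1) + sign * (1 << l2) + 1
                if q % (2 * N) == 1 and isprime(q):
                    found.append(q)
    return found

print(len(ntt_friendly_primes(w=32, N=2**14)), "usable primes for w=32")
```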
4 Generalized Iterative Integer Division (GID) and Modular Reduction (GMR)
This section proposes a generalized iterative integer division (GID) algorithm that allows the modulus, q, to have more than three nonzero bits. By adopting the GID, a broader range of primes can be utilized to achieve efficient HE implementations. Theorems with rigorous mathematical proofs are provided to establish the correctness of the algorithm and determine the upper bound on the number of iterations required, using primes with four nonzero bits as an example.
This paper proposes employing the same procedure as in Algorithm 1 for quotient computation in the case that the divisor q takes the general form
$$q = 2^{w}-2^{l_1}\pm 2^{l_2}\pm \cdots \pm 2^{l_m}+1, \qquad (3)$$
where \(l_1>l_2>\cdots>l_m>0\). In this case, f(x) is redefined as \(x+v(x)\), where
$$v(x) = \left\lfloor \frac{(-1)\,x\,\delta }{2^w} \right\rfloor $$
and \(\delta = 2^{l_1}\mp 2^{l_2}\mp \cdots \mp 2^{l_m}-1\).
The same analyses and mathematical derivations in [28] can be applied to prove that Algorithm 1 terminates after a finite number of iterations in the case where q follows the general format in Eq. 3.
For general q, the \(b^*\) generated by the iterative process also equals b, \(b-1\), or \(b+1\). However, the proof of this fact and the analysis of the number of iterations needed by Algorithm 1 in [28] rely on the assumption that \(q=2^w-2^l+1\). In the following, \(b^*\in \{b, b\pm 1\}\) is re-proved and the number of required iterations is re-analyzed for the general case of q.
For general q and \(\lambda = bq+r\), it can be derived that
Since the \(b^*\) computed by Algorithm 1 satisfies \(f(b^*) = c\), then \(b^*\in \{x:f(x)=c\}\). Let us first show that \(f(b-1)\) may be equal to c. Plugging \(b-1\) into Eq. 4 and utilizing the fact that the remainder r is non-negative, it can be derived that
Hence, \(f(b-1)\) may equal c. Similarly, it can be shown that f(b) and \(f(b+1)\) may also equal c. Therefore, b and \(b\pm 1\) are the possible values of \(b^*\).
Next, it is proved that \(f(x)\ne c\) for \(x\le b-2\). First, consider \(x=b-2\). Following Eq. 6,
If \(l_1=w-1\), then q can actually be represented using \(w-1\) bits, which conflicts with our assumption. Hence, \(l_1<w-1\), implying that \(2\delta <2^w\). Accordingly, \(f(b-2)\le c-1\). As proved in [28], \(f(\cdot )\) is a non-decreasing function. Therefore, \(f(x)\le c-1\) for any \(x<b-2\). It can also be derived that
In summary, \(\{b, b \pm 1\}\) are the only possible solutions of \(f(x) = c\), and hence \(b^*\in \{b, b\pm 1\}\).
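As an informal sanity check of this conclusion (not part of the proof), the Python snippet below (ours) runs the generalized iteration for divisors with four nonzero bits and confirms that the fixed point \(b^*\) it reaches is within one of the true quotient:

```python
# Run the generalized iteration for q = 2^w - 2^l1 +/- 2^l2 + 1 and check that
# its fixed point b* is within one of the true quotient (illustration only).
import random

def run_gid(lam, q, w):
    delta = (1 << w) - q                     # delta = 2^l1 -/+ 2^l2 - 1
    v = lambda x: (-x * delta) >> w          # v(x) = floor(-x*delta / 2^w)
    c = lam >> w
    b_star = c
    while b_star + v(b_star) != c:           # iterate until f(b*) = c
        b_star = c - v(b_star)
    return b_star

w, l1, l2 = 16, 10, 4
for sign in (1, -1):
    q = (1 << w) - (1 << l1) + sign * (1 << l2) + 1
    for _ in range(2000):
        lam = random.randrange(1 << (2 * w))
        assert abs(run_gid(lam, q, w) - lam // q) <= 1
```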
It has been proved in [28] that increasing the value of the input \(\lambda \) leads to a non-decreasing number of iterations in Algorithm 1. Hence, the number of iterations needed in Algorithm 1 is upper bounded by that of the case when \(\lambda =2^{2w}-1\), which is the maximum value of the 2w-bit \(\lambda \). The number of iterations needed for \(\lambda =2^{2w}-1\) is derived through the following three theorems for the example case that q has four nonzero bits.
Theorem 1
Assume that \(q=2^w-2^{l_1}+2^{l_2}+1\). When \(\lambda = 2^{2w}-1\), the \(b_i\) computed in Lines 2-3 of Algorithm 1 in iteration i satisfies the following inequalities:
Proof
When \(\lambda =2^{2w}-1\), \(b_0=c=2^w-1\). From Line 3 of Algorithm 1, it can be derived that
Therefore, Eq. 8 holds for \(i=1\). Since \(b_i=c-v(b_{i-1})\), it follows that \(b_i \!\!=\!\! 2^w + \left\lfloor \frac{b_{i-1}(2^{l_1}-2^{l_2}-1)}{2^w}\right\rfloor \). Suppose that Eq. 8 also holds for iteration \(i-1\). It can be derived
It can be proved that the floor function in the second to the last formula above is zero. For conciseness, the proof has been omitted in this paper. Accordingly, Eq. 8 also holds for i, and the proof of Eq. 8 is completed. Similarly, Eq. 9 can also be proved by induction.\(\square \)
Theorem 2
Given an integer \(a<2^w\), there is a unique integer s such that \(2^{(s-1)w}< a^s\) and \(a^{s+1}\le 2^{sw}\).
Proof
Since \(a<2^w\), \(\log _2 a<w\). Hence \(\log _2{a}/(w-\log _2{a}) < w/(w-\log _2{a})\). Because the difference between the left- and right-hand sides of this inequality is 1, there exists a unique integer s such that \(\log _2{a}/(w-\log _2{a})\le s < w/(w-\log _2{a})\). \(\log _2{a}/(w-\log _2{a})\le s\) translates to \((w-\log _2{a})s\ge \log _2{a}\) and hence \(sw\ge (s+1)\log _2{a}\). Taking both sides as powers of 2, it can be derived that \(2^{sw}\ge a^{s+1}\). Similarly, it can also be proved that \(2^{(s-1)w}< a^s\).\(\square \)
Theorem 3
Let \(t_1\) be the unique integer such that \(2^{(t_1-1)w}<(2^{l_1}-2^{l_2})^{t_1}\) and \((2^{l_1}-2^{l_2})^{t_1+1}\le 2^{t_1w}\). When \(\lambda =2^{2w}-1\), the number of iterations needed in Algorithm 1 is at most \(t_1+1\).
Proof
First, let us show that \(f(b_i)\ne c\) for \(i<t_1\). For \(x\in \mathbb N\), \(f(x) = x+v(x) = \lfloor xq/2^w\rfloor \). Use \(\Delta \) to represent the sum term in Eq. 8 to simplify the notations. Since f(x) is a non-decreasing function, plugging in Eq. 8, it follows that:
Reorganizing and canceling out common terms in the above formula, it can be derived that
Since \(l_1\le w-2\), it follows that
For integers i and \(t_1\), \(i<t_1\) implies that \(t_1-i-1\ge 0\). Also from the assumptions, \(2^{(t_1-1)w}<(2^{l_1}-2^{l_2})^{t_1}\), it can be derived that
Plugging in Eqs. 11 and 12 into Eq. 10 leads to
Since \(b_i\) increases in each iteration and \(f(\cdot )\) is non-decreasing, Algorithm 1 does not terminate within the first \(t_1-1\) iterations. If \(f(b_{t_1})=c\), the algorithm terminates in iteration \(t_1\). If \(f(b_{t_1})\ne c\), applying Eqs. 8 and 9, along with derivations similar to those in the preceding proofs, yields \(c\le f(b_{t_1+1}) < c+1\). Since \(f(b_{t_1+1})\) is an integer, it follows that \(f(b_{t_1+1}) = c\), and thus Algorithm 1 terminates in iteration \(t_1+1\).\(\square \)
Similarly, it can be shown that when q is of the form \(2^w-2^{l_1}-2^{l_2}+1\), Algorithm 1 needs at most \(t_1+1\) iterations. Moreover, the algorithm can be further generalized for divisors in the format of \(q = 2^{w}-2^{l_1}\pm 2^{l_2}\pm \cdots \pm 2^{l_m}+1\) with more than four nonzero bits.
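The bound is straightforward to evaluate numerically. The following snippet (ours, for illustration; not part of the proofs) computes \(t_1\) for a given \(w\), \(l_1\), \(l_2\) and confirms that the iteration for \(\lambda = 2^{2w}-1\) finishes within \(t_1+1\) steps:

```python
# Evaluate the bound of Theorem 3 for q = 2^w - 2^l1 + 2^l2 + 1 and compare it
# with the iterations actually used for lam = 2^(2w) - 1 (illustration only).
def t1_bound(w, l1, l2):
    a = (1 << l1) - (1 << l2)
    t = 1
    while not ((a ** t > (1 << ((t - 1) * w))) and (a ** (t + 1) <= (1 << (t * w)))):
        t += 1
    return t

def iterations_used(w, l1, l2):
    q = (1 << w) - (1 << l1) + (1 << l2) + 1
    delta = (1 << w) - q
    lam = (1 << (2 * w)) - 1                 # worst-case 2w-bit dividend
    c = lam >> w
    b, it = c, 0
    while b + ((-b * delta) >> w) != c:      # f(b) != c
        b = c - ((-b * delta) >> w)
        it += 1
    return it

w, l1, l2 = 32, 20, 6
assert iterations_used(w, l1, l2) <= t1_bound(w, l1, l2) + 1
```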
The proposed GID algorithm can be easily modified to realize a generalized modular reduction (GMR) algorithm, as shown in Algorithm 2. This algorithm follows the same procedure to compute \(b^*\) and \(r^*\) as in Algorithm 1. From Algorithm 1, the quotient, b, may equal \(b^*\), \(b^*-1\), or \(b^*+1\). Hence, the remainder, r, is \(r^*\), \(r^*+q\), and \(r^*-q\) for the corresponding cases, respectively. Therefore, Lines 4-6 are modified accordingly as in Algorithm 2 to compute the remainder r.
In [28], hardware units are replicated and connected serially to implement the iterations of Algorithm 1. This architecture can also be applied to implement the GID. However, the latency increases linearly with the number of iterations in the algorithm. To reduce the latency, a new design is proposed in the next section based on simplifying the Barrett reduction.
5 Simplified Integer Division (SID) and Modular Reduction (SMR)
The quotient can be computed using Eq. 1 according to the Barrett reduction algorithm [19, 20]. The majority of the area is attributed to the multiplication with \(T=\lfloor 2^{2w}/q\rfloor \), which is implemented by a general \(w\times w\)-bit multiplier in prior work. This section proposes a simplified integer division (SID) scheme by deriving an alternative formula for T multiplication. The proposed scheme requires a simpler multiplier when q has consecutive zero bits after the most significant bit or a limited number of nonzero bits, which cover a broad range of usable q. Additionally, a simplified modular reduction (SMR) technique is also developed by carrying out the computations in Eq. 2 after Eq. 1 is calculated by using the SID.
Theorem 4
Write q in the format of \(2^w-m+1\), i.e., \(m=2^w+1-q\). From Theorem 2, there exists a unique integer s such that \(2^{(s-1)w}\le m^s\) and \(m^{s+1}<2^{sw}\). Define
$$T^* = 2^w + m-1+\left\lfloor \sum _{j=1}^{s-1}2^{-jw}m^{j+1} \right\rfloor . \qquad (14)$$
Then \(T=\lfloor 2^{2w}/q\rfloor \) equals \(T^*\), \(T^*-1\), or \(T^*+1\).
Proof
Plugging Eq. 14 into \(2^{2w}-qT^*\), applying the formula of the sum of a geometric sequence, and reorganizing the terms, it can be derived that
From the inequality \(\lfloor x\rfloor \le x\),
The inequality in the third row of the above formulas is due to the fact that \(2^{(s-1)w}\le m^s\). Apparently, the right side of the last formula in Eq. 15 is a negative number. Next, it will be shown that the magnitude of this negative number is less than q. After multiplying out and combining terms, as well as replacing q by \(2^w-m+1\), this is equivalent to proving that
Since \(m\le 2^{w-2}\), \(2^{2w}-2^w(3m-2)>2^{2w}-2^{w}(4\times 2^{w-2})=0\). Hence Eq. 16 is proved. Using a similar process, it can be proved that \(2^{2w}\!-\!qT^*<2q\). Overall, \(-q<2^{2w}-qT^*<2q\).
When \((2^{2w}-qT^*)\in (-q,0)\), \(\lfloor 2^{2w}/q-T^*\rfloor =-1\). Since \(T^*\) is an integer, \(T=\lfloor 2^{2w}/q\rfloor =T^*-1\). When \((2^{2w}-qT^*)\in [0,2q)\), \(\lfloor 2^{2w}/q-T^*\rfloor \in \{0,1\}\). Accordingly, \(T=T^*\) or \(T^*+1\) in this case.\(\square \)
For a given q, both \(T^*\) and whether T equals \(T^*\), \(T^*+1\), or \(T^*-1\) can be pre-determined. Hence, the multiplication with T can be implemented by multiplying the constant \(T^*\), as shown in Eq. 14. The \(m-1+\left\lfloor \sum _{j=1}^{s-1}2^{-jw}m^{j+1} \right\rfloor \) part of Eq. 14 is less than or equal to
The inequality in the second row of the above formula follows from the fact that \(m/2^{w-1}<1\). Therefore, \(T^*\) can be expressed by a ‘1’ in the w-th bit followed by an integer that does not exceed 2m. Table 3 lists the number of primes with \(w=32\) and 48 that can be utilized in the NTT transform for the ring dimension \(N=2^{14}\) and m in various ranges. It can be observed that there are a sufficient number of primes to use for moderate or larger w even if m is restricted to at most 3w/4 bits. In this case, the multiplication of \(T^*\) can be realized by adding up \(3w/4+2\) partial products. Compared to prior approaches that require summing w partial products for T multiplication, the proposed design significantly reduces computational complexity.
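The snippet below (ours) evaluates \(T^*\) following our reading of Eq. 14 and verifies that T differs from it by at most one, as Theorem 4 states; the helper names are illustrative only.

```python
# Evaluate T* for q = 2^w - m + 1 and check that T = floor(2^(2w)/q) differs
# from it by at most one, as Theorem 4 states (illustration only).
def t_star(w, m):
    s = 1                                    # unique s from Theorem 2
    while not ((m ** s >= (1 << ((s - 1) * w))) and (m ** (s + 1) < (1 << (s * w)))):
        s += 1
    acc = sum(m ** (j + 1) * (1 << ((s - 1 - j) * w)) for j in range(1, s))
    return (1 << w) + m - 1 + (acc >> ((s - 1) * w))

w = 32
for m in (2**20, 2**24 - 2**5, 3 * 2**21 + 7):
    q = (1 << w) - m + 1
    T = (1 << (2 * w)) // q
    assert abs(T - t_star(w, m)) <= 1
```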
When q has a few nonzero bits, Eq. 14 can be simplified and the exact number of nonzero bits in \(T^*\) can be derived. In the case where \(q = 2^w-2^l+1\), let s be the integer such that \(2^{(s-1)w}\le 2^{ls}\) and \(2^{(s+1)l}<2^{sw}\). Then Eq. 14 is simplified to
$$T^* = 2^w + 2^l - 1 + \sum _{j=1}^{s-1}2^{(j+1)l-jw}. \qquad (17)$$
Apparently, \(T^*\) has \(s+2\) nonzero bits. When q has more nonzero bits, the formula for the number of nonzero bits in \(T^*\) becomes more complicated.
6 Hardware Architectures and Complexity Analyses
This section presents efficient hardware architectures for implementing the proposed GID and SID algorithms for primes in the general form of \(q=2^w-m+1\). Simplified architectures are also developed for the SID algorithm when q has three or four nonzero bits. For each design, the necessary modifications to implement the GMR and SMR algorithms are also detailed. Finally, complexity analyses and comparisons with previous designs are provided.
6.1 Hardware Architectures for GID and GMR Algorithms
As mentioned previously, the GID algorithm follows the same procedure as Algorithm 1, except that the 3-nonzero-bit input q is replaced by q with a general form. The block diagram for implementing the GID algorithm is shown in Fig. 1. It comprises two main components: the iteration unit (ItrU) and the decision unit for quotient computation (DUQ). One ItrU is employed for each iteration of Algorithm 1 so that the divider is pipelinable. The DUQ implements Lines 4-6 of Algorithm 1. In Fig. 1, assuming that \(\lambda \) has 2w bits, \(\lambda [2w-1:w]\) represents the highest w bits and it equals c. As highlighted in [28], the decisions in Lines 4-6 of Algorithm 1 depend on whether \(\lambda -b^*q\) is larger than zero or not. If it is larger than zero, only whether it is smaller or larger than q needs to be known. This information can be derived using only \(\lambda [w+1:0]\). Hence, only these bits are sent to the DUQ block in Fig. 1.
The ItrU computes Line 3 of Algorithm 1. For an example case of \(q=2^w-2^{l_1}\pm 2^{l_2}+1\), it is equivalent to \(b_{i+1} = c-v(b_i)\), where \(v(b_i) = \left\lfloor (-1)b_i(2^{l_1}\mp 2^{l_2}-1)/2^w\right\rfloor \). Since \(b_i<2^w\) and \(2^{l_1}\mp 2^{l_2}-1\) is an odd number, the numerator \(b_i(2^{l_1}\mp 2^{l_2}-1)\) does not have \(2^w\) as a factor. Accordingly, Line 3 simplifies to
$$b_{i+1} = c+\left\lfloor \frac{b_i2^{l_1}\mp b_i2^{l_2}-b_i}{2^w}\right\rfloor +1. \qquad (18)$$
The computations in the above formula can be implemented by adding up the numbers, as shown in Fig. 2. The two bars in the middle of the part (a) of this figure represent \(b_i2^{l_2}\) and \(b_i2^{l_1}\) for the case that \(q=2^w-2^{l_1}-2^{l_2}+1\). The \(-b_i\) inside the floor function of Eq. 18 is implemented by flipping every bit in the 2w-bit extension of \(b_i\), which is represented by the top bar in Fig. 2(a), and replacing the least significant bit (LSB) of \(b_i2^{l_1}\) by ‘1’. After the floor function, the lowest w bits of \(b_i(2^{l_1}\mp 2^{l_2}-1)\) are discarded. Hence, the last ‘1’ in Eq. 18 is added at the w-th bit position in Fig. 2(a). It can be pre-added to all the ‘1’ bits in the first bar, which leads to all ‘0’ in the sum with a carry-out. However, since it is known that the result \(b_{i+1}\) is positive and has w bits, the carry-out can be discarded and no further calculation is needed. In the case where \(q=2^w-2^{l_1}+2^{l_2}+1\), the 2’s complement of \(b_i2^{l_2}\) must also be carried out before the summation. This can be done by flipping every bit in the 2w-bit extended \(b_i2^{l_2}\), which is shown by the third bar in Fig. 2(b), and adding another ‘1’ to the LSB of \(b_i2^{l_1}\), which makes its last two bits equal to ‘10’.
Figure 2(c) illustrates the proposed hardware architecture for the ItrU. The two shifters align \(b_i\) with the other numbers as shown in Figs. 2(a) and (b) to generate \(b_i2^{l_1}\) and \(b_i2^{l_2}\). These shifts can be implemented using the shifter architecture in [14] by padding w bits of ‘0’s to the input. The sign of the \(2^{l_2}\) term in q is denoted by \(\text {sign}_2\). When it is negative, the output bits from the second shifter are flipped. As previously explained, all the ‘1’ bits in the top bar of Fig. 2 can be pre-added with the ‘1’ in the \((w+1)\)-th bit position. The carry-out from the 2w-th bit can be ignored, resulting in all ‘0’ bits. Hence, c in the last bar can be combined with the flipped \(b_i\) bits in the first bar to form a single 2w-bit number. This number and the outputs from the two shifters are sent to a carry-save adder (CSA) to compute the sum. Since \(b_{i+1}\) equals the highest w bits of this sum, the hardware for merging the lowest w bits of the last pair of sum and carry vectors in the CSA can be simplified accordingly. The highest w bits of carry and sum vectors are merged using a carry-select adder to reduce the critical path. The adjustment of the last two bits of \(b_i2^{l_1}\) to ‘01’ or ‘10’ as shown in Figs. 2(a) and (b) can be handled by simple hardware units and hence is not explicitly shown in Fig. 2(c).
The value of \(b^*\) computed from the last iteration of Algorithm 1 falls within the set \(\{b, b \pm 1\}\). Consequently, \(r^* = \lambda - b^*q\) equals r or \(r \pm q\). Therefore, only the lowest \(w+2\) bits of \(r^*\) must be calculated. In the case where \(q=2^w-2^{l_1}- 2^{l_2}+1\), \(\lambda - b^*q=\lambda - b^*(2^w+1)+b^*2^{l_1}+b^*2^{l_2}\). Hence, \(r^*\) can be calculated by adding up the lower \(w+2\) bits of the four numbers shown in Fig. 3(a). Similarly, \(b^*2^{l_1}\) and \(b^*2^{l_2}\) are represented by the third and fourth bars, respectively. To generate \(-b^*(2^w+1)\), bit-flipped \(b^*\) is put into the higher and lower bits as shown in the second bar, and ‘1’ is introduced to the LSB of \(b^*2^{l_1}\). If \(q=2^w-2^{l_1}+2^{l_2}+1\), to compute \(\lambda -b^*q\), \(b^*2^{l_2}\) is extended and bit flipped as shown by the third bar in Fig. 3(b). The ‘1’ needed for the 2’s complement is added to \(b^*2^{l_1}\) so that its two LSBs become ‘10’. According to Figs. 3(a) and (b), the DUQ is implemented by the architecture in Fig. 3(c). Once \(r^*\) is computed at the output of the CSA, its lowest \(w+1\) bits are compared with q. The comparator generates ‘1’ when \(r^*>q\) and ‘0’ otherwise. Then b equals \(b^*\), \(b^*-1\), and \(b^*+1\) when the \((w+2)\)-th bit of \(r^*\) (its sign bit) and the comparator output are ‘00’, ‘1x’, and ‘01’, respectively.
When q has more than four nonzero bits, the number of iterations in Algorithm 1, and hence the number of required ItrUs can be determined. For a q with num nonzero bits, each ItrU requires \(num-2\) shifters and a CSA with \(num-1\) inputs. A single DUQ is needed regardless of the format of q. However, it necessitates \(num-1\) shifters and a CSA handling \(num+1\) inputs.
The GID design can be easily adapted for GMR by replacing the DUQ in Fig. 1 with a decision unit for remainder computation (DUR), whose hardware implementation architecture is illustrated in Fig. 3(d). It employs the same blocks as shown in Fig. 3(c) to compute the \((w+2)\)-bit \(r^*\). Then, q is added to and subtracted from \(r^*\). The remainder, r, equals \(r^*\), \(r^*-q\), and \(r^*+q\) when the \((w+2)\)-th bit of \(r^*\) (its sign bit) and the sign bit of \(r^*-q\) are ‘01’, ‘00’, and ‘1x’, respectively.
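The decision logic of the DUQ and DUR can be modeled in software as follows (our own sketch; it mimics the sign-bit and comparison based selection described above rather than the exact gate-level structure):

```python
# Model of the DUQ/DUR decision: only the lowest w+2 bits of r* = lam - b*q
# are formed; the sign bit and a comparison with q pick the final outputs.
def decide(lam, b_star, q, w):
    mask = (1 << (w + 2)) - 1
    r_star = (lam - b_star * q) & mask       # (w+2)-bit two's-complement value
    if (r_star >> (w + 1)) & 1:              # sign bit set: r* < 0
        return b_star - 1, r_star - (1 << (w + 2)) + q
    if r_star >= q:                          # r* >= q
        return b_star + 1, r_star - q
    return b_star, r_star                    # 0 <= r* < q

w, q = 16, (1 << 16) - (1 << 10) + 1
lam = 0xBEEF_1234
b = lam // q
for b_star in (b - 1, b, b + 1):             # b* may be off by one either way
    assert decide(lam, b_star, q, w) == (b, lam % q)
```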
6.2 Hardware Architectures for SID and SMR Algorithms
This subsection first presents the hardware architectures of SID and SMR for primes of the general form \(q=2^w-m+1\). Subsequently, our architecture is further refined for the case where q has a few nonzero bits, in which case \(T^*\) has a very limited number of nonzero bits.
6.2.1 Primes in the Form \(q = 2^w-m+1\)
The hardware architecture for implementing the proposed SID algorithm is depicted in Fig. 4. The components in the first dashed block multiply \(\lfloor \lambda /2^{w-1}\rfloor \) with T using the \(T^*\) formula from Eq. 14. As discussed in Section 5, if q is written in the format of \(q=2^w-m+1\), T can be expressed as \(T=2^w+n\), where \(n\le 2m\). \(q=2^w-m+1\) with \(\log _2 m \le 3w/4\) covers a wide range of primes. For this case, the multiplier width is reduced to \(3w/4\times w\). The product \(\lfloor \lambda /2^{w-1}\rfloor n\) is added to \(\lfloor \lambda /2^{w-1}\rfloor 2^w\), which is implemented via wire shifting without any logic gate. From Eq. 1, only the highest w bits need to be computed using the adder in this block to obtain \(\tilde{b}\), allowing for further simplifications in the design.
The quotient b equals \(\tilde{b}, \tilde{b}+1, \tilde{b}+2\) when \(\tilde{r}\) in Eq. 2 is \(r, r+q, r+2q\), respectively. The second dashed block in Fig. 4 calculates \(-(\tilde{b}q \mod (2^{w+1}))= (-\tilde{b}2^w - \tilde{b} + \tilde{b}m)\mod (2^{w+1})\) for Eq. 2. Since \(\tilde{b}\) is a w-bit number, \(\tilde{b}2^w + \tilde{b}\) can be derived by concatenating \(\tilde{b}\) to itself. The 2’s complement of this sum is then derived by flipping every bit and adding ‘1’ to the LSB. On the other hand, due to the modular reduction by \(2^{w+1}\) at the end, only the \(w+1\) LSBs need to be retained in \(\tilde{b}2^w + \tilde{b}\) and the product \(\tilde{b}m\). Accordingly, the multiplication by m in this dashed block can also be simplified.
The third dashed block in Fig. 4 completes the rest of Eq. 2 for computing \(\tilde{r}\), and compares it with q and 2q. In this block, \(\lambda \mod 2^{w+1}\) corresponds to the \(w+1\) LSBs of \(\lambda \), denoted by \(\lambda [w:0]\). Depending on the comparison results, \(\tilde{b}\), \(\tilde{b}+1\), or \(\tilde{b}+2\) is selected as the result b. This architecture can also be easily modified to compute the remainder r. In this case, \(\tilde{r}\) is subtracted by q and 2q, and the sign bits of the differences select either \(\tilde{r}\), \(\tilde{r}-q\), or \(\tilde{r}-2q\) as the remainder r.
6.2.2 Primes with a Few Nonzero Bits
When q has only a few nonzero bits, such as three or four, the number of nonzero bits in T is also limited. In this case, the multiplications with T and q can be efficiently implemented by adding a small number of shifted operands. Besides, the addition of the lower w bits of the shifted operands in the multiplication with T can be omitted to further simplify the multiplier; the potential carries from these bits are compensated for in the decision of b.
Figure 5 illustrates the hardware architectures for implementing the SID when \(q = 2^w-2^l+1\) and \(l\le 3w/4\). In this case \(s \le 3\). The shift-add unit for T (SAU-T), shift-add unit for q (SAU-q), and extended decision unit for quotient computation (eDUQ) implement the counterpart functions of the first, second, and third blocks in Fig. 4, respectively. From Eq. 17, \(T^*\), and hence T, contains up to 5 nonzero bits when q has three nonzero bits and \(s\le 3\). In this case, the multiplication with T can be simplified from a general multiplication to the shift-add operation as shown in Fig. 5(b). Assume that \(T = 2^w\pm \text {EN}_{1}\cdot 2^{u_1}\pm \text {EN}_2\cdot 2^{u_2}\pm \text {EN}_3\cdot 2^{u_3}\pm \text {EN}_4\cdot 2^{u_4}\), where \(\text {EN}_i\) is either 0 or 1 to support the cases when T has less than 5 nonzero bits. Denote the sign of each term by \(\text {sign}_i\). Each shifter in Fig. 5(b) is implemented by the architecture in [14], which shifts the sign-extended input according to the offset \(u_i\) and sign \(\text {sign}_i\). From Eq. 1, only the highest w bits of \(\lfloor \lambda /2^{w-1}\rfloor T\) contribute to \(\tilde{b}\). The lowest w bits of the five partial products for the multiplication of T generate a carry-out of at most 4 to the highest w bits. To simplify the shifters and the following CSA, only the highest w bits are retained from the shifters and added up in the CSA to derive \(\hat{b}\). It can be easily derived that \(\hat{b}\in \{b, b-1, \cdots , b-6\}\) due to the ignored carry-out from the lowest w bits. The actual b is decided in the eDUQ block.
Since \(\hat{b}\in \{b,b-1,\cdots , b-6\}\), the value of \(\hat{r} = \lambda -\hat{b}q\) lies in the set \(\{r, r+q, \cdots , r+6q\}\). Hence, two additional bits are needed to represent \(\hat{r}\) compared to \(\tilde{r}\). Therefore, \(\hat{r}\) should be computed as
$$\hat{r} = \left(\lambda - \hat{b}q\right) \bmod 2^{w+3}.$$
Similarly, only the lowest \(w+3\) bits need to be retained in the above calculations. In \(-\hat{b}q=-(2^w\hat{b}+\hat{b})+2^l\hat{b}\), the lowest \(w+3\) bits of \(2^w\hat{b}+\hat{b}\) are formed by concatenating the lowest 3 bits of \(\hat{b}\) to the left of \(\hat{b}\). Accordingly, \(\hat{r}\) is computed as shown in Fig. 5(c).
Comparing \(\hat{r}\) with 6 different values using 6 comparators to decide b leads to a large area. Alternatively, the original Barrett reduction [19] formula can be applied to the \((w+3)\)-bit input \(\hat{r}\) to derive the approximate quotient \(\hat{b}_1\) as
Subsequently, \(\hat{r}-\hat{b}_1q\) is either r or \(r+q\), which implies that b equals either \(\hat{b}+\hat{b}_1\) or \(\hat{b}+\hat{b}_1+1\). Accordingly, the eDUQ can be implemented as in Fig. 5(d). The three multiplexers and CSA calculate \(\hat{r}-\hat{b}_1q=\hat{r}-\hat{r}[w+2:w]q\).
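The two-stage correction can be modeled as follows (our sketch; the SAU-T output \(\hat{b}\) is emulated with a random offset from the true quotient rather than computed from the shift-add datapath):

```python
# Model of the eDUQ correction: keep only w+3 bits of r^ = lam - b^*q,
# take b1 = r^[w+2:w], and correct once more by comparing with q.
import random

def eduq(lam, b_hat, q, w):
    r_hat = (lam - b_hat * q) & ((1 << (w + 3)) - 1)   # the w+3 LSBs suffice
    b1 = r_hat >> w                                    # r_hat[w+2:w]
    r = r_hat - b1 * q                                 # equals r or r + q
    if r >= q:
        return b_hat + b1 + 1, r - q
    return b_hat + b1, r

w, l = 16, 10
q = (1 << w) - (1 << l) + 1
for _ in range(2000):
    lam = random.randrange(1 << (2 * w))
    b = lam // q
    b_hat = max(b - random.randint(0, 6), 0)           # emulated SAU-T output
    assert eduq(lam, b_hat, q, w) == (b, lam % q)
```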
With minor modifications to part (d), the architectures in Fig. 5 can be adapted to implement SMR. To compute the remainder, the output of the CSA in Fig. 5(d) is subtracted by q. The sign bit of the difference selects either the CSA output or the subtractor output as the remainder.
The architectures in Fig. 5 can be extended to implement the quotient computation for q with more than three nonzero bits. Assume that q and T have x and y nonzero bits, respectively. In the SAU-T architecture of Fig. 5(b), \(y-1\) shifters are required and the CSA has y w-bit inputs. Since the addition of the lowest w bits in the y partial products is skipped in the proposed design in Fig. 5(b), the \(\mod 2^{w+1}\) in Eq. 2 must be adjusted to \(\mod 2^{w+1+\lceil \log _2 y\rceil }\). As a result, the \(1+\lceil \log _2 y\rceil \) LSBs of \(\hat{b}\) are concatenated to the left of \(\hat{b}\) in Fig. 5(c), and the lowest \(w+1+\lceil \log _2 y\rceil \) bits of \(\lambda \) are input to the CSA. Besides, the CSA has an additional \(x-2\) inputs generated by \(x-2\) shifters within this block. For the eDUQ block, the only differences are that \(\lceil \log _2 y\rceil \) multiplexers are utilized and the number of inputs at the CSA is increased accordingly.
The number of nonzero bits in T increases quickly with the number of nonzero bits in q. The numbers of shifters in Fig. 5(b) and (c) increase linearly with the numbers of nonzero bits in T and q, respectively. When there are too many nonzero bits, the architecture for general q in Fig. 4 becomes more efficient than the design in Fig. 5.
6.3 Complexity Analyses & Comparisons
This section analyzes and compares the complexities of our proposed architectures with the best prior efforts.
The complexities of the proposed GID and SID integer division architectures are listed in Table 4. For the GID design, primes with four nonzero bits are considered. When \(\log _2{\left( 2^{l_1}\!\mp \!2^{l_2}\right) }\le \frac{3w}{4}\), which covers many q, 4 iterations are needed in Algorithm 1. In this case, 4 copies of the ItrU units are employed in the architecture shown in Fig. 1. To estimate the complexity of the design, it is assumed that the CSAs use rows of full adders (FAs) to compute the sum and carry vectors from the inputs [26]. Then, a carry-ripple adder merges the final pair of sum and carry vectors. Under these assumptions, the complexity of the CSA is broken down in terms of the number of FAs in Table 4. Additionally, if the lowest bits of the CSA outputs are discarded and hence only the carry-out to the highest bits needs to be computed, such as in the case of the CSA in Fig. 2(c), the corresponding FAs can be replaced with half adders (HAs). The detailed architectures of the shifters used in the ItrU and DUQ blocks can be found in [14]. Each shifter consists of rows of multiplexers. In the case that some of the shifter inputs are pre-set to ‘0’ or ‘1’, such as the shifters in Fig. 2(c), the corresponding multiplexers are simplified.
The critical path of the GID design lies in the ItrU block, as shown in Fig. 2(c). It consists of a shifter with \(\log _2 w\) rows of multiplexers, an XOR gate, and a CSA that has three 2w-bit inputs but retains only the highest w bits in the output.
For the proposed SID architecture with \(q=2^w-m+1\) shown in Fig. 4, general multipliers are used. It can be assumed that a \(w\times w\)-bit multiplier is implemented by a \((w-1)\times w\) array of FAs. In the case where m is limited to 3w/4 bits, which still covers a broad range of q, n has at most \(3w/4+1\) bits. Accordingly, the number of rows of FAs in the multipliers for m and n is reduced to around 3w/4. Besides, only the lowest \(w+1\) bits of the product from the multiplication with m are needed. Therefore, around half of the array of FAs for this multiplier is eliminated. The adder in the first dashed block of Fig. 4 only needs to compute the highest w bits of the sum. Hence, it can be implemented using w HAs and w FAs in a carry-ripple architecture.
In the SID architecture for \(q=2^w-2^l+1\) with \(l<3w/4\) in Fig. 5, one operand to the last adder in part (d) has only three bits, and the adder is simplified accordingly. To estimate the logic complexity, each AND gate in part (b) of this figure is assumed to take half of the area of an XOR to implement. The complexity of the SID for \(q=2^w-2^{l_1}\pm 2^{l_2}+1\), with \(\log _2{\left( 2^{l_1}\mp 2^{l_2}\right) }\le 3w/4\), is also listed in Table 4. It is an extension of the architecture shown in Fig. 5. In this case, T has at most 13 nonzero bits, and the number of shifters, CSA inputs, and multiplexers are decided accordingly as discussed in the previous section.
For the purpose of comparison, the complexities of the CRA design with general q [18], the divider using the Barrett reduction architecture in [24], and the preliminary version of this work [28] are also listed in Table 4. For the general case of \(q=2^w-m+1\), our proposed SID design requires a smaller area than the architectures from [18] and [24], mainly because of the simpler multiplication with T when m is limited to 3w/4 bits, which still covers a broad range of q. When q has three nonzero bits, the proposed SID design requires an area similar to that of the divider in [28]. When q has four nonzero bits, the proposed SID has a CSA with a large number of inputs, which may result in a larger area than the GID design. Nevertheless, the SID design requires only 1 instead of 5 clock cycles, and would be more efficient in terms of area-delay product (ADP). Additionally, our SID architectures and the designs in [18, 24] can be pipelined to achieve a higher clock frequency.
To further evaluate and compare the complexities of the divider architectures, the designs listed in Table 4 with \(w=48\) and 64 are implemented in Verilog and synthesized using the GlobalFoundries 22FDX process with GENUS 17.22. The post-synthesis results are listed in Table 5. Since the CRA design [18] is evidently much larger, it is not synthesized further. The timing constraints (TCs) listed in this table are the tightest without causing negative slack for each design. For the area requirement, TC, and latency, the percentage numbers in parentheses indicate normalized values with respect to those of the design from [24]. Besides, the normalized ADPs are also listed.
For the case of \(w=64\), it can be observed from Table 5 that our SID design for general \(q=2^w-m+1\) and \(\log _2{m}\le 3w/4\) achieves 14% area saving and 16% shorter latency compared to the architecture in [24]. This is mainly due to the smaller multipliers required in the SID design. For designs whose q has only three nonzero bits, the SID design offers a slightly smaller area and only 58% (1400/2400) of the latency of the iterative divider in our preliminary work [28], because only 1 clock cycle instead of 4 is needed. When q has 4 non-zero bits, while the SID design exhibits a larger area than the GID design, its latency is much shorter since it requires 1 instead of 5 clock cycles. Overall, its ADP is \(1-12.0\%/17\%=31\%\) smaller.
When w is reduced from 64 to 48, the areas of the design from [24] and the SID design with general \(q=2^w-m+1\) reduce more significantly than those of the others. This is because the numbers of FAs required by these two designs are proportional to \(w^2\) instead of w, as shown in Table 4. Moreover, the critical paths, and hence the minimum achievable TCs without negative slack, reduce with w in all designs.
The hardware complexities of our proposed GMR and SMR architectures as well as the previous modular reduction designs are listed in Table 6. Similar to the divider designs, the proposed SMR for \(q=2^w-m+1\) and \(\log _2{m}\le 3w/4\) requires a much smaller number of FAs in comparison to the extended decompose and reduce multiplication (e-DARM) [27] and Barrett reduction [24], mainly because of the smaller multipliers. For the designs whose q has three nonzero bits, the proposed SMR design has a similar area as the design in [28]. When q has 4 nonzero bits, the number of nonzero bits of T is larger in our SMR architecture, resulting in larger area. Despite that, its latency remains much shorter because it needs only 1 instead of 5 clock cycles.
The modular reduction designs listed in Table 6 for \(w=64\) are synthesized using the GlobalFoundries 22FDX process and the post-synthesis results are shown in Table 7. Similarly, the tightest TCs without causing negative slack are reported. From this table, our SMR design for the general case of \(q=2^w-m+1\) and \(\log _2{m}\le 3w/4\) achieves 7% area reduction and 18% shorter latency compared to the eDARM [27]. When q has only three nonzero bits, the proposed SMR design reduces the area and latency by 1-3624/3972=9% and 1-1400/2400=42%, respectively, compared to the architecture in [28] adopted for modular reduction. For q with four nonzero bits, although the SMR design requires larger area than the GMR design, its latency is much shorter because it only needs 1 clock cycle instead of 5. Overall, the ADP is reduced by 1-12.5%/18.5%=32%.
In summary, for q in general form or with 3 nonzero bits, the proposed SID and SMR architectures require less area and much shorter latency than other designs. For q with 4 nonzero bits, although the proposed designs have larger area, they achieve much higher efficiency in terms of ADP. The improvements that can be achieved by the proposed designs increase with larger w.
7 Conclusion
This paper first extends our prior iterative process for quotient computation to a GID design that can handle primes with more than three nonzero bits. Then, the SID design based on simplifying the Barrett reduction is proposed. Furthermore, algorithmic reformulations have been developed to reduce the complexity of the SID architecture when the prime has a small number of nonzero bits. The proposed algorithms are also adapted for efficient modular reductions. Efficient hardware architectures have been developed to implement the proposed algorithms. Compared to prior efforts, the proposed designs achieve significant reductions in area and/or latency. Future work aims to further optimize the divider when the dividend has more bits.
References
Dowlin, N., Gilad-Bachrach, R., Laine, K., Lauter, K., Naehrig, M., & Wernsing, J. (2016). Cryptonets: Applying neural networks to encrypted data with high throughput and accuracy. In Proceedings of international conference on machine learning (Vol. 48, pp. 201–210).
Cheon, J. H., Kim, A., Kim, M., & Song, Y. (2017). Homomorphic encryption for arithmetic of approximate numbers. In Proceedings of international conference on the theory and application of cryptology and information security (pp. 409–437). Switzerland: Cham.
Cheon, J. H., Han, K., Kim, A., Kim, M., & Song, Y. (2019). A full RNS variant of approximate homomorphic encryption. In Proceedings of selected areas in cryptography international conference (pp. 347–368). Berlin, Heidelberg.
Brakerski, Z., & Vaikuntanathan, V. (2011). Efficient fully homomorphic encryption from (Standard) LWE. In IEEE 52nd annual symposium on foundations of computer science (pp. 97–106).
Fan, J., & Vercauteren, F. (2012). Somewhat practical fully homomorphic encryption. Cryptology ePrint Archive.
Brakerski, Z., Gentry, C., & Vaikuntanathan, V. (2012). (Leveled) fully homomorphic encryption without bootstrapping. In Proceedings of the 3rd innovations in theoretical computer science conference (pp. 309–325). New York, NY, USA.
Kim, S., Kim, J., Kim, M. J., Jung, W., Kim, J., Rhu, M., & Ahn, J. H. (2022). BTS: An accelerator for bootstrappable fully homomorphic encryption. In Proceedings of annual international symposium on computer architecture (pp. 711–725).
Liu, S.-H., Kuo, C.-Y., Mo, Y.-N., & Su, T. (2024). An area-efficient, conflict-free, and configurable architecture for accelerating NTT/INTT. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 32(3), 519–529.
Tan, W., Chiu, S. W., Wang, A., Lao, Y., & Parhi, K. K. (2024). PaReNTT: Low-latency parallel residue number system and NTT-based long polynomial modular multiplication for homomorphic encryption. IEEE Transactions on Information Forensics and Security, 19, 1646–1659.
Huai, Z., Parhi, K. K., & Zhang, X. (2021). Efficient architecture for long integer modular multiplication over Solinas prime. In Proceedings of IEEE Workshop on Signal Processing Systems (pp. 146–151).
Zhang, X., Huai, Z., & Parhi, K. K. (2022). Polynomial multiplication architecture with integrated modular reduction for R-LWE cryptosystems. Journal of Signal Processing Systems, 94(8), 799–809.
Bajard, J.-C., Eynard, J., Hasan, M. A., & Zucca, V. (2017). A full RNS variant of FV like somewhat homomorphic encryption schemes. In R. Avanzi, & H. Heys (Eds.), Selected areas in cryptography (pp. 423–442).
Halevi, S., Polyakov, Y., & Shoup, V. (2019). An improved RNS variant of the BFV homomorphic encryption scheme. In M. Matsui (Ed.), Topics in cryptology (pp. 83–105). Springer: Cham.
Akherati, S., & Zhang, X. (2024). Low-complexity ciphertext multiplication for CKKS homomorphic encryption. IEEE Transactions on Circuits and Systems II, 71(3), 1396–1400.
Akherati, S., & Zhang, X. (2024). Improved ciphertext multiplication for RNS-CKKS homomorphic encryption. In IEEE workshop on signal processing systems (pp. 136–140).
de Dinechin, F., & Didier, L.-S. (2012). Table-based division by small integer constants. In Proceedings of international conference on reconfigurable computing: Architectures, tools and applications (pp. 53–63). Berlin, Heidelberg.
Cavagnino, D., & Werbrouck, A. E. (2008). Efficient algorithms for integer division by constants using multiplication. The Computer Journal, 51(4), 470–480.
Drane, T., Cheung, W.-c., & Constantinides, G. (2012). Correctly rounded constant integer division via multiply-add. In IEEE international symposium on circuits and systems (pp. 1243–1246).
Barrett, P. (1987). Implementing the Rivest Shamir and Adleman public key encryption algorithm on a standard digital signal processor. In Advances in cryptology (pp. 311–323). Berlin, Heidelberg: Springer.
Hankerson, D., Menezes, A. J., & Vanstone, S. (2010). Guide to elliptic curve cryptography (1st ed.). Springer.
Kim, S., Lee, K., Cho, W., Cheon, J. H., & Rutenbar, R. A. (2019). FPGA-based accelerators of fully pipelined modular multipliers for homomorphic encryption. In International conference on ReConFigurable computing and FPGAs (pp. 1–8).
Montgomery, P. L. (1985). Modular multiplication without trial division. Mathematics of Computation, 44, 519–521.
Ding, J., & Li, S. (2020). A low-latency and low-cost Montgomery modular multiplier based on NLP multiplication. IEEE Transactions on Circuits and Systems II, 67(7), 1319–1323.
Satriawan, A., Mareta, R., & Lee, H. (2024). Integer modular multiplication with Barrett reduction and its variants for homomorphic encryption applications: A comprehensive review and an empirical study. IEEE Access, 12, 147283–147300.
Zhang, X., & Parhi, K. K. (2021). Reduced complexity modular polynomial multiplication for R-LWE cryptosystems. In IEEE international conference on acoustics, speech and signal processing (pp. 7853–7857).
Huai, Z., Zhou, J., & Zhang, X. (2022). Efficient hardware implementation architectures for long integer modular multiplication over general Solinas prime. Journal of Signal Processing Systems, 94(10), 1067–1082.
Hu, X., Tian, J., Li, M., & Wang, Z. (2023). AC-PM: An area-efficient and configurable polynomial multiplier for Lattice based cryptography. IEEE Transactions on Circuits and Systems I, 70(2), 719–732.
Akherati, S., Cai, J., & Zhang, X. (2024). Low-complexity integer divider architecture for homomorphic encryption. In IEEE workshop on signal processing systems (pp. 131–135).
Kim, D., & Guyot, C. (2023). Optimized privacy-preserving CNN inference with fully homomorphic encryption. IEEE Transactions on Information Forensics and Security, 18, 2175–2187.
Acknowledgements
This material is based upon work supported by the National Science Foundation under Award No. 2428806 and by a Catalyst grant from the Ohio State University Translational Data Analytics Institute.