1 Introduction

Homomorphic encryption is a cryptography scheme that allows computations to be performed on encrypted data without requiring decryption. It can be utilized to preserve the privacy of machine learning as a service over the cloud [1]. Popular HE schemes, such as CKKS [2, 3], B/FV [4, 5], and BGV [6], carry out computations over high-degree polynomials whose coefficients are very long integers. High-degree polynomial modular multiplication can be substantially simplified by the number theoretic transform (NTT) [7,8,9]. The modular reduction can be incorporated into a decomposed multiplication process [10, 11] to further reduce the complexity. Using the residue number system (RNS) [3, 12, 13], the computations over long coefficients of the polynomials are broken down into those over shorter integers. After these techniques are adopted, simplifying the integer modular reduction is one of the keys to further reducing the complexity of HE schemes. Besides, the overall complexity of ciphertext multiplication is reduced in [14, 15] by combining and reformulating the related computations, which require quotient computations from integer divisions.

The quotient can be computed using a lookup table-based approach [16]. However, the size of the lookup table grows exponentially with the divisor’s bit length. The quotients are derived utilizing the shifted inverse of the divisor in [17, 18]. For dividing a 2w-bit integer by a w-bit divisor, the algorithm proposed in [17] requires a \(2w\times 2w\) multiplier for quotient computation and introduces errors. The correctly rounded approach (CRA) in [18] computes the quotient precisely as \(\alpha \times \lambda + \beta \), where \(\lambda \) is the 2w-bit dividend, \(\alpha \) represents the highest bits of the shifted inverse of the divisor, and \(\beta \) is a 2w-bit round-off integer. Although the width of \(\alpha \) is reduced to \(\le 2w\), a \(2w \times 2w\) multiplier is still required in the worst case. While the Barrett reduction [19] was originally proposed for modular reduction, it can be easily adapted for integer division. Despite the simplifications in [20], the approach requires two \(w\times w\) multipliers, which impose a large area requirement. The area is reduced in [21] for a constant divisor, q. However, it is limited to the case where \(q=2^w-2^{l_1}\pm \cdots +1\), with \(l_1<w/2\).

Modular multiplication can be implemented using the Montgomery algorithm [22]. Although it integrates multiplication with modular reduction, transformations to and from the Montgomery domain are required at the beginning and end, respectively. Hence, it has a long latency [23]. Modular reduction can be simplified by Barrett reduction [19, 20]. Although the architectures based on Barrett reduction [24] offer lower latency, they require a larger area as previously mentioned. The designs in [25,26,27] integrate modular reduction into decomposed Karatsuba multiplication. For constant divisors in specific formats, their complexities are lower than those of the architectures based on Barrett reductions.

In the preliminary version of this work [28], an iterative procedure is developed to compute the quotient for the case where \(q=2^w-2^l\pm 1\). It only needs shifters and adders and leads to a shorter critical path and a lower area compared to the quotient computation architectures in [17,18,19]. However, the number of primes with only 3 nonzero bits is limited, and those suitable for the NTT transform are even more scarce.

This paper first generalizes the iterative integer divider proposed in [28] to cases where the divisor q contains more nonzero bits. Formulas are developed and rigorous mathematical proofs are provided for the number of iterations needed. The design is also adapted for modular reduction. As the number of nonzero bits in q increases, each iteration requires more adders and shifters to implement. Besides, depending on the differences between the exponents of the nonzero bits, it may need a large number of iterations. To address these issues, a second new design is proposed in this paper to compute the quotient and remainder by simplifying the Barrett reduction. Through rigorous mathematical derivations, it is discovered that the involved multiplications have a small number of partial products to add. Efficient hardware architectures are developed for both proposed designs, and additional reformulations and optimizations are introduced to reduce the hardware complexity further. Compared to the best prior quotient computation architecture [19] (modular reduction architecture [27]), our second proposed design can achieve 60% (57%) smaller area and 70% (71%) shorter latency when \(w=64\) and q has 4 nonzero bits. The improvements are more significant for larger w or primes with fewer nonzero bits.

The organization of this paper is as follows. Section 2 introduces previous research on integer division and modular reduction. Section 3 presents our preliminary work for quotient computation through an iterative process. The iterative process is generalized and adapted to carry out the modular reduction in Section 4. Section 5 proposes simplified Barrett reduction for quotient computation and modular reduction. Hardware implementation architectures and complexity comparisons are provided in Section 6. Section 7 concludes the paper.

2 Existing Integer Division and Modular Reduction Algorithms

Suppose the dividend \(\lambda \) is written as \(bq+r\), where b is the quotient and r is the remainder of the division. In [17], for the case when \(\lambda \) has 2w bits, the quotient is approximated by multiplying \(\lambda \) with a pre-computed 2w-bit constant \(J = \lfloor 2^{3w-1}/q\rfloor + 1\) and extracting the higher w bits of the product. This approach results in either \(\lfloor \lambda /q\rfloor \) or \(\lfloor \lambda /q\rfloor +1\) and requires a (\(2w\times 2w\))-bit multiplier, which occupies a large area. In [18], the quotient b is computed as \(\alpha \times \lambda +\beta \), where \(\alpha \) and \(\beta \) are determined by an algorithm that minimizes the number of nonzero bits in \(\alpha \) for a given q. \(\alpha \) and \(\beta \) have at most 2w bits from the algorithm.

The Barrett reduction [19] algorithm is primarily designed for modular reduction rather than quotient computation. When \(\lambda \) has 2w bits, it computes \(r^*=\lambda - \lfloor \lambda T / 2^{2w}\rfloor q\), where T is precomputed as \(\lfloor 2^{2w} / q \rfloor \). The result \(r^*\) lies within the range of [0, 2q). By comparing \(r^*\) with q, the final remainder is determined as either \(r^*\) or \(r^*-q\). This computation is simplified in [20], which first calculates

$$\begin{aligned} \tilde{b} = \left\lfloor \frac{\lfloor \lambda /2^{w-1}\rfloor T}{2^{w+1}}\right\rfloor . \end{aligned}$$
(1)

It has been derived that \(\tilde{b} \in \{b-2, b-1, b\}\). Subsequently, the remainder r is first approximated as

$$\begin{aligned} \tilde{r} = (\lambda \mod 2^{w+1}) - (\tilde{b} q\mod 2^{w+1}). \end{aligned}$$
(2)

\(\tilde{r} \in \{r, r+q, r+2q\}\) and r can be determined by comparing \(\tilde{r}\) with q and 2q and carrying out the corresponding subtraction. In [21], it has been shown that when the modulus q is represented in a signed binary form of minimal Hamming weight as \(q=2^{w}\pm 2^{l_1}\pm 2^{l_2}\pm \cdots \pm 2^{l_m}+1\) with \(l_m<\cdots<l_1<w/2\), \(T=\lfloor 2^{2w} / q \rfloor \) can be expressed as \(2^{w}\mp 2^{l_1}\mp 2^{l_2}\mp \cdots \mp 2^{l_m}-1\). For these cases, the multipliers of T and q are replaced by adders.

Table 1 Numbers of primes \(q=2^w-2^l+1\) such that \(q=1\mod 2N\) for \(N=2^{14}\).

Modular multiplication can be carried out utilizing the Montgomery algorithm [22, 23]. However, it requires conversion to the Montgomery domain and inverse, which leads to a high complexity. The designs in [25,26,27] propose to incorporate modular reduction into the decomposed multiplication using the Karatsuba algorithm and consider the case that q is in the format of \(q=2^w-\delta \), where [25, 26] assume \(\log _2{\delta }\le w/2\) and [27] considers \(\log _2{\delta }\le 2w/3\). Each operand for the multiplication is written as \(a = a_1 2^{w/2}+a_0\), and modular reduction is applied to the products of the segments of the two operands. Although this approach lowers the hardware complexity due to the shorter operands of the multiplications, it incurs longer latency than those based on Barrett reduction.

3 Iterative Integer Division when \(q = 2^w - 2^l \pm 1\)

Algorithm 1 was proposed in our preliminary work [28] to simplify the quotient computation when the divisor only contains three nonzero bits, i.e., \(q = 2^w - 2^l \pm 1\) \((0<l<w)\). Assume that \(\lambda = bq + r\) is a 2w-bit number. It is defined that \(c = \left\lfloor \lambda / 2^w \right\rfloor \) and \(f(x) = x + v(x)\), where \(v(x) = \left\lfloor (-1) x (2^l \mp 1) / 2^w \right\rfloor \). Then the integer b satisfying \(f(b)=c\) is the quotient of \(\lambda /q\). It has been proved in [28] that the iterative process in Algorithm 1 yields \(b^*\) such that \(f(b^*) = c\), and \(b^*\in \{b, b\pm 1\}\). The actual value of b can be determined from \(b^*\) according to Lines 4-6 in Algorithm 1. Mathematical proofs have been provided to demonstrate that the number of iterations required is at most t, where t is the unique integer satisfying \((t - 1)w / t < l \le tw / (t + 1)\).

Algorithm 1
figure d

Algorithm for calculating \(b=\lfloor \lambda /q\rfloor \) [28].

Hardware architectures were also proposed in [28] to implement Algorithm 1. Each iteration of the algorithm is mapped to separate hardware units so that the overall computation is fully pipelinable. Each iteration requires two w-bit adders, two w-bit Barrel shifters, one w-bit multiplexer, and w registers. Simplified architectures have also been developed to implement Lines 4-6 of Algorithm 1, leveraging the fact that that q has only three nonzero bits and \(r^*\) consists of at most \(w+1\) bits.

Table 2 Numbers of primes \(q=2^w-2^{l_1}\pm 2^{l_2}+1\) such that \(q=1\mod 2N\) for \(N=2^{14}\).

In the polynomial ring of HE schemes, modulo reduction by \(x^N+1\) follows multiplications of polynomials of length N. The negatively-wrapped NTT [8] is essential for significantly reducing the complexity of these modular polynomial multiplications. It requires that the modulus q of the coefficients be a prime and \(q=1\mod 2N\). Table 1 lists the numbers of primes with three nonzero bits satisfying these requirements for different bit lengths w and the ring dimension \(N = 2^{14}\), which are commonly used in many HE applications [29]. Since the primes in the format of \(2^w-2^l-1\) equals \(-1 \mod 2N\), they can not be used for negatively wrapped NTT. When l falls into one of the four ranges listed in the table, Algorithm 1 needs 1, 2, 3, and \(\ge 4\) iterations, respectively. It can be observed that the usable primes are very limited, and the total number of usable primes reduces for larger N.

The HE schemes operate on coefficients with thousands of bits. To shorten the involved multiplications, the RNS scheme is necessary and the modulus needs to be decomposed into multiple factors as \(Q=q_0q_1\cdots \), where each \(q_i\) is a prime equaling \(1 \mod 2N\). Also, all \(q_i\) factors should have l within the same range relative to w, ensuring the numbers of iterations needed by Algorithm 1 and hence the latencies for quotient computations across all factors are the same. From Table 1, there are not sufficient primes that can be used as decomposed moduli for RNS representation of HE schemes incorporating the NTT.

Table 2 presents the numbers of primes with four non-zero bits that can be used in NTT for the ring dimension \(N=2^{14}\). Comparing this table and Table 1, many more primes are available if four nonzero bits are allowed in the binary representation of q.

4 Generalized Iterative Integer Division (GID) and Modular Reduction (GMR)

This section proposes a generalized iterative integer division (GID) algorithm that allows the modulus, q, to have more than three nonzero bits. By adopting the GID, a broader range of primes can be utilized to achieve efficient HE implementations. Theorems with rigorous mathematical proofs are provided to establish the correctness of the algorithm and determine the upper bound on the number of iterations required, using primes with four nonzero bits as an example.

This paper proposes employing the same procedure as in Algorithm 1 for quotient computation in the case that the divisor q takes the general form

$$\begin{aligned} q = 2^w-2^{l_1}\pm 2^{l_2}\pm \cdots \pm 2^{l_m}+1, \end{aligned}$$
(3)

where \(l_1>l_2>\cdots>l_m>0\). In this case, f(x) is redefined as \(x+v(x)\), where

$$\begin{aligned} v(x)= \left\lfloor (-1)x\delta /2^w\right\rfloor , \end{aligned}$$
(4)

and \(\delta = 2^{l_1}\mp 2^{l_2}\mp \cdots \mp 2^{l_m}-1\).

The same analyses and mathematical derivations in [28] can be applied to prove that Algorithm 1 terminates after a finite number of iterations in the case where q follows the general format in Eq. 3.

For general q, the \(b^*\) generated by the iterative process also equals b, \(b-1\), or \(b+1\). However, the proof of this fact and the analysis of the number of iterations needed by Algorithm 1 in [28], relies on the assumption that \(q=2^w-2^l+1\). In the following, \(b^*\in \{b, b\pm 1\}\) is re-proved and the number of required iterations is re-analyzed for the general case of q.

For general q and \(\lambda = bq+r\), it can be derived that

$$\begin{aligned} c \!&=\! \left\lfloor \frac{\lambda }{2^w}\right\rfloor = b+\left\lfloor (-1)\frac{b\delta }{2^w}+\frac{r}{2^w}\right\rfloor . \end{aligned}$$
(5)

Since the \(b^*\) computed by Algorithm 1 satisfies \(f(b^*) = c\), then \(b^*\in \{x:f(x)=c\}\). Let us first show that \(f(b-1)\) may be equal to c. Plugging \(b-1\) into Eq. 4 and utilizing the fact that the remainder r is positive, it can be derived that

$$\begin{aligned} f(b-1)&= b-1 + \left\lfloor (-1)\frac{(b-1)\delta }{2^w}\right\rfloor \nonumber \\&\le b\!-\!1\!+\!\left\lfloor (-1)\frac{b\delta }{2^w}\!+\!\frac{\delta }{2^w}\!+\!\frac{r}{2^w}\right\rfloor \le c . \end{aligned}$$
(6)

Hence, f(b) may equal c. Similarly, it can be shown that f(b) and \(f(b+1)\) may also equal c. Therefore, b and \(b\pm 1\) are the possible values of \(b^*\).

Next, it is proved that \(f(x)\ne c\) for \(x\le b-2\). First, consider \(x=b-2\). Following Eq. 6,

$$\begin{aligned} f(b- 2)&\le b - 2 + \left\lfloor (-1)\frac{b\delta }{2^w}+\frac{ 2\delta }{2^w}+\frac{r}{2^w}\right\rfloor \nonumber \\&\le c-1 + \left\lfloor \frac{ 2\delta }{2^w} \right\rfloor . \end{aligned}$$
(7)

If \(l_1=w-1\), then q can actually be represented using \(w-1\) bits, which conflicts with our assumption. Hence, \(l_1<w-1\), implying that \(2\delta <2^w\). Accordingly, \(f(b-2)\le c-1\). As it was proved in [28], \(f(\cdot )\) is a non-decreasing function. Therefore, \(f(x)\le c-1\) for any \(x<b-2\). It can be also derived that

$$\begin{aligned} f(b+2)&= b+2 + \left\lfloor -\frac{(b+2)\delta }{2^w}\right\rfloor \\&\ge b+2-\frac{b\delta }{2^w}-\frac{2\delta }{2^w}\\&> b+1-\frac{b\delta }{2^w}+\frac{r}{2^w}+\frac{-2\delta }{2^w}\\&\ge b+1+\left\lfloor -\frac{b\delta }{2^w}+\frac{r}{2^w}\right\rfloor +\left\lfloor \frac{-2\delta }{2^w}\right\rfloor =c; \end{aligned}$$

In summary, \(\{b, b \pm 1\}\) are the only possible solutions of \(f(x) = c\), and hence \(b^*\in \{b, b\pm 1\}\).

It has been proved in [28] that increasing the value of the input \(\lambda \) leads to a non-decreasing number of iterations in Algorithm 1. Hence, the number of iterations needed in Algorithm 1 is upper bounded by that of the case when \(\lambda =2^{2w}-1\), which is the maximum value of the 2w-bit \(\lambda \). The number of iterations needed for \(\lambda =2^{2w}-1\) is derived through the following three theorems for the example case that q has four nonzero bits.

Theorem 1

Assume that \(q=2^w-2^{l_1}+2^{l_2}+1\). When \(\lambda = 2^{2w}-1\), the \(b_i\) computed in Lines 2-3 of Algorithm 1 in iteration i satisfies the following inequalities:

$$\begin{aligned}&b_i \le 2^{w}-2 + \sum _{j=1}^{i}2^{-(j-1)w}(2^{l_1}-2^{l_2})^{j}, \end{aligned}$$
(8)
$$\begin{aligned}&b_i \ge 2^{w}-2 + \sum _{j=1}^{i-1}2^{-(j-1)w}(2^{l_1}-2^{l_2})^{j}. \end{aligned}$$
(9)

Proof

When \(\lambda =2^{2w}-1\), \(b_0=c=2^w-1\). From Line 3 of Algorithm 1, it can be derived that

$$\begin{aligned} b_1&= b_0 + \left( c-f(b_0)\right) = c - v(b_0) \\&= 2^w+\left\lfloor \frac{(2^w-1)(2^{l_1}-2^{l_2}-1)}{2^w}\right\rfloor \\&= 2^w+2^{l_1}-2^{l_2}-1 + \left\lfloor (-1)\frac{2^{l_1}-2^{l_2}-1}{2^w}\right\rfloor \\&= 2^w-2+2^{l_1}-2^{l_2}. \end{aligned}$$

Therefore, Eq. 8 holds for \(i=1\). Since \(b_i=c-v(b_{i-1})\), it follows that \(b_i \!\!=\!\! 2^w + \left\lfloor \frac{b_{i-1}(2^{l_1}-2^{l_2}-1)}{2^w}\right\rfloor \). Suppose that Eq. 8 also holds for iteration \(i-1\). It can be derived

$$\begin{aligned} b_i&\le \!\!2^w\!\!+\!\!\left\lfloor \!\frac{\!\left( \!2^{w}\!\!-\!\!2\!\!+\!\!\sum _{j=1}^{i-1} 2^{-(j\!-\!1)w}(2^{l_1}\!\!-\!\!2^{l_2})^{j}\!\right) \!\left( \!2^{l_1}\!\!-\!\!2^{l_2}\!\!-\!\!1\!\right) }{2^w}\!\right\rfloor \\&\le 2^w+2^{l_1}-2^{l_2}-1+\sum _{j=1}^{i-1}2^{-jw}(2^{l_1}-2^{l_2})^{j+1}+\\&\quad \quad \left\lfloor -\frac{\sum _{j=1}^{i-1}2^{-(j-1)w}(2^{l_1}\!-\!2^{l_2})^j\!\!+\!\!2(2^{l_1}\!-\!2^{l_2}\!-\!1)}{2^w}\right\rfloor \\&= 2^w-2+\sum _{j=1}^{i}2^{-(j-1)w}\left( 2^{l_1}-2^{l_2}\right) ^{jw}+ \\&\quad \quad \left\lfloor \!1\!-\!\frac{\sum _{j=1}^{i-1}2^{-(j-1)w}(2^{l_1}\!-\!2^{l_2})^j\!\!+\!\!2(2^{l_1}\!-\!2^{l_2}\!-\!1)}{2^w}\!\right\rfloor \\&= 2^w-2+\sum _{j=1}^{i}2^{-(j-1)w}\left( 2^{l_1}-2^{l_2}\right) ^{jw}. \end{aligned}$$

It can be proved that the floor function in the second to the last formula above is zero. For conciseness, the proof has been omitted in this paper. Accordingly, Eq. 8 also holds for i, and the proof of Eq. 8 is completed. Similarly, Eq. 9 can also be proved by induction.\(\square \)

Theorem 2

Given an integer \(a<2^w\). There is a unique integer s, such that \(2^{(s-1)w}> a^s\) and \(a^{s+1}\le 2^{sw}\).

Proof

Since \(a<2^w\), \(log_2 a<w\). Hence \(\log _2{a}/(w-\log _2{a}) < w/(w-\log _2{a})\). Because the difference of the left and right-hand side of this inequality is 1, there exists a unique integer s such that \(\log _2{a}/(w-\log _2{a})\le s < w/(w-\log _2{a})\). \(\log _2{a}/(w-\log _2{a})\le s\) translates to \((w-\log _2{a})s\ge \log _2{a}\) and hence \(sw\ge (s+1)\log _2{a}\). Taking both sides as powers of 2, it can be derived that \(2^{sw}\ge a^{s+1}\). Similarly, it also can be proved that \(2^{(s-1)w}> a^s\).\(\square \)

Theorem 3

Let \(t_1\) be the unique integer such that \(2^{(t_1-1)w}<(2^{l_1}-2^{l_2})^{t_1}\) and \((2^{l_1}-2^{l_2})^{t_1+1}\le 2^{t_1w}\). When \(\lambda =2^{2w}-1\), the number of iterations needed in Algorithm 1 is at most \(t_1+1\).

Proof

First, let us show that \(f(b_i)\ne c\) for \(i<t_1\). For \(x\in \mathbb N\), \(f(x) = x+v(x) = \lfloor xq/2^w\rfloor \). Use \(\Delta \) to represent the sum term in Eq. 8 to simplify the notations. Since f(x) is a non-decreasing function, plugging in Eq. 8, it follows that:

$$\begin{aligned} f(b_i) \le \left\lfloor \frac{\left( 2^w\!\!-\!\!2\!\!+\!\!\Delta \right) \!\!\left( 2^w\!\!-\!\!\left( 2^{l_1}\!\!-\!\!2^{l_2}\right) \!\!+\!\!1\right) }{2^w}\right\rfloor . \end{aligned}$$

Reorganizing and canceling out common terms in the above formula, it can be derived that

$$\begin{aligned} f(b_i) \le 2^w\!\!-\!\!1\!\!+\!\!\left\lfloor \frac{2\left( 2^{l_1}\!\!-\!\!2^{l_2}\right) \!\!-\!\!2\!\!+\!\!\Delta }{2^w}\!\!-\!\!2^{-iw}\!\left( 2^{l_1}\!\!-\!\!2^{l_2}\right) ^{i\!+\!1}\right\rfloor . \end{aligned}$$
(10)

Since \(l_1\le w-2\), it follows that

$$\begin{aligned}&2(2^{l_1}-2^{l_2})-2+\Delta \nonumber \\&<2^{l_1+1}+2^{l_1}+2^{2l_1-w}+\cdots +2^{il_1-(i-1)w}\nonumber \\&\le 2^{w-1}\!\!+\!\!2^{w-2}\!+\!2^{2l_1-w}\!+\!\cdots \!+\!2^{il_1-(i-1)w}\!\!<\!\!2^w. \end{aligned}$$
(11)

For integers i and \(t_1\), \(i<t_1\) implies that \(t_1-i-1\ge 0\). Also from the assumptions, \(2^{(t_1-1)w}<(2^{l_1}-2^{l_2})^{t_1}\), it can be derived that

$$\begin{aligned} 2^{-iw}(2^{l_1}\!\!-\!\!2^{l_2})^{i\!+\!1} \!\!=\!\! \frac{(2^{l_1}\!\!-\!\!2^{l_2})^{t_1}2^{w(t_1-1-i)}}{2^{(t_1\!-\!1)w}\!(2^{l_1}\!\!-\!\!2^{l_2})^{(t_1\!-\!i\!-\!1)}}\!\!>\!\!1. \end{aligned}$$
(12)

Plugging in Eqs. 11 and 12 into Eq. 10 leads to

$$\begin{aligned} f(b_i) \le 2^{w}-2 = c-1. \end{aligned}$$
(13)

Since \(b_i\) increases in each iteration and \(f(\cdot )\) is non-decreasing, Algorithm 1 will not terminate at the end of \(i<t_1\) iterations. If \(f(b_{t_1})=c\), the algorithm terminates in iteration \(t_1\). If \(f(b_{t_1})\ne c\), applying Eqs. 8 and 9, along with derivations similar to those in the preceding proofs, yields \(c\le f(b_{t_1+1}) < c+1\). Since \(f(b_{t_1+1})\) is an integer, it follows that \(f(b_{t_1+1}) = c\), and thus Algorithm 1 terminates in iteration \(t_1+1\).\(\square \)

Similarly, it can be shown that when q is of the form \(2^w-2^{l_1}-2^{l_2}+1\), Algorithm 1 needs at most \(t_1+1\) iterations. Moreover, the algorithm can be further generalized for divisors in the format of \(q = 2^{w}-2^{l_1}\pm 2^{l_2}\pm \cdots \pm 2^{l_m}+1\) with more than four nonzero bits.

Algorithm 2
figure e

Generalized algorithm for calculating \(r= \lambda \mod q\) (GMR).

The proposed GID algorithm can be easily modified to realize a generalized modular reduction (GMR) algorithm, as shown in Algorithm 2. This algorithm follows the same procedure to compute \(b^*\) and \(r^*\) as in Algorithm 1. From Algorithm 1, the quotient, b, may equal \(b^*\), \(b^*-1\), or \(b^*+1\). Hence, the remainder, r, is \(r^*\), \(r^*+q\), and \(r^*-q\) for the corresponding cases, respectively. Therefore, Lines 4-6 are modified accordingly as in Algorithm 2 to compute the remainder r.

In [28], hardware units are replicated and connected serially to implement the iterations of Algorithm 1. This architecture can also be applied to implement the GID. However, the latency increases linearly with the number of iterations in the algorithm. To reduce the latency, a new design is proposed in the next section based on simplifying the Barrett reduction.

5 Simplified Integer Division (SID) and Modular Reduction (SMR)

The quotient can be computed using Eq. 1 according to the Barrett reduction algorithm [19, 20]. The majority of the area is attributed to the multiplication with \(T=\lfloor 2^{2w}/q\rfloor \), which is implemented by a general \(w\times w\)-bit multiplier in prior work. This section proposes a simplified integer division (SID) scheme by deriving an alternative formula for T multiplication. The proposed scheme requires a simpler multiplier when q has consecutive zero bits after the most significant bit or a limited number of nonzero bits, which cover a broad range of usable q. Additionally, a simplified modular reduction (SMR) technique is also developed by carrying out the computations in Eq. 2 after Eq. 1 is calculated by using the SID.

Theorem 4

Write q in the format of \(2^w-m+1\), where m is the decimal value of bits 1 through \(w-1\) in q. From Theorem 2, there exists an unique integer s such that \(2^{(s-1)w}\le m^s\) and \(m^{s+1}<2^{sw}\). Define

$$\begin{aligned} T^* \!\triangleq \! 2^w \!-\! 1 \!+\! m \!+\! \left\lfloor \sum _{j=1}^{s-1}2^{-jw}m^{j+1}\right\rfloor . \end{aligned}$$
(14)

Then \(T=\lfloor 2^{2w}/q\rfloor \) equals \(T^*\), \(T^*-1\), or \(T^*+1\).

Proof

Plugging Eq. 14 into \(2^{2w}-qT^*\), applying the formula of the sum of a geometric sequence, and reorganizing the terms, it can be derived that

$$\begin{aligned} 2^{2w}\!-\!qT^* = \!(m\!-\!1)^2 \!\!-\!\! q\!\left\lfloor \! \frac{m^2(1\!-\!(2^{-w}m)^{s-1})}{q-1}\!\right\rfloor \!. \end{aligned}$$

From the inequality \(\lfloor x\rfloor \le x\),

$$\begin{aligned} 2^{2w}-qT^*&\ge (m-1)^2-\frac{qm^2(1-(2^{-w}m)^{s-1})}{q-1} \nonumber \\&= 1+\frac{qm(2^{-(s-1)w}m^s-2)+1-(m-1)^2}{q-1}\nonumber \\&\ge 1+\frac{-qm+1-(m-1)^2}{q-1}\nonumber \\&=1-m-\frac{(m-1)m}{q-1}. \end{aligned}$$
(15)

The inequality in the third row of the above formulas is due to the fact that \(2^{(s-1)w}\le m^s\). Apparently, the right side of the last formula in Eq. 15 is a negative number. Next, it will be shown that the magnitude of this negative number is less than q. After multiplying out and combining terms, as well as replacing q by \(2^w-m+1\), this is equivalent to proving that

$$\begin{aligned} 2^{2w}-2^w(3m-2)+m(m-1) > 0 \end{aligned}$$
(16)

Since \(m\le 2^{w-2}\), \(2^{2w}-2^w(3m-2)>2^{2w}-2^{w}(4\times 2^{w-2})=0\). Hence Eq. 16 is proved. Using a similar process, it can be proved that \(2^{2w}\!-\!qT^*<2q\). Overall, \(-q<2^{2w}-qT^*<2q\).

When \((2^{2w}-qT^*)\in (-q,0)\), \(\lfloor 2^{2w}/q-T^*\rfloor =-1\). Since \(T^*\) is an integer, \(T=\lfloor 2^{2w}/q\rfloor =T^*-1\). When \((2^{2w}-qT^*)\in [0,2q)\), \(\lfloor 2^{2w}/q-T^*\rfloor \in \{0,1\}\). Accordingly, \(T=T^*\) or \(T^*+1\) in this case.\(\square \)

Table 3 Numbers of primes in the format of \(q = 2^w - m + 1\) and \(q=1 \mod 2^{15}\) for m in different ranges.
Fig. 1
figure 1

The block diagram for implementing the GID algorithm.

Fig. 2
figure 2

(a) The numbers that need to be added to compute \(b_{i+1}\) when \(q=2^w-2^{l_1}-2^{l_2}+1\); (b) the numbers that need to be added to compute \(b_{i+1}\) when \(q=2^w-2^{l_1}+2^{l_2}+1\); (c) hardware architecture of ItrU.

For a given q, \(T^*\), and whether T equals \(T^*\), \(T^*+1\), or \(T^*-1\) can all be pre-determined. Hence, the multiplication with T can be implemented by multiplying the constant \(T^*\), as shown in Eq. 14. The expression \(m-1+\left\lfloor \sum _{j=1}^{s-1}2^{-jw}m^{j+1} \right\rfloor \) part from Eq. 14 is less than or equal to

$$\begin{aligned} m \!-\!1\!+\! \sum _{j=1}^{s-1}2^{-jw}m^{j+1} \!\!&=\!\! m \!+\! m\frac{m}{2^w}\!+\!\cdots \!+\!m\frac{m^{s-1}}{2^{w(s-1)}}\!-\!1\\ \!&<\!\! m \!+\! \frac{m}{2} \!+\! \cdots \!+\! \frac{m}{2^{s-1}}-1 \!\!<\!\! 2m. \end{aligned}$$

The inequality in the second row of the above formula follows from the fact that \(m/2^{w-1}<1\). Therefore, \(T^*\) can be expressed by a ‘1’ in the w-th bit followed by an integer that does not exceed 2m. Table 3 lists the number of primes with \(w=32\) and 48 that can be utilized in the NTT transform for the ring dimension \(N=2^{14}\) and m in various ranges. It can be observed that there are a sufficient number of primes to use for moderate or larger w even if m is restricted to at most 3w/4 bits. In this case, the multiplication of \(T^*\) can be realized by adding up \(3w/4+2\) partial products. Compared to prior approaches that require summing w partial products for T multiplication, the proposed design significantly reduces computational complexity.

When q has a few nonzero bits, Eq. 14 can be simplified and the exact number of nonzero bits in \(T^*\) can be derived. In the case where \(q = 2^w-2^l+1\), let s be the integer such that \(2^{(s-1)w}\le 2^{ls}\) and \(2^{(s+1)l}<2^{sw}\). Then Eq. 14 is simplified to

$$\begin{aligned} T^* = 2^w-1+\sum _{j=1}^{s}2^{-(j-1)w}2^{jl} \end{aligned}$$
(17)

Apparently, \(T^*\) has \(s+2\) nonzero bits. When q has more nonzero bits, the formula for the number of nonzero bits in \(T^*\) becomes more complicated.

6 Hardware Architectures and Complexity Analyses

This section presents efficient hardware architectures for implementing the proposed GID and SID algorithms for primes in the general form of \(q=2^w-m+1\). Simplified architectures are also developed for the SID algorithm when q has three or four nonzero bits. For each design, the necessary modifications to implement the GMR and SMR algorithms are also detailed. Finally, complexity analyses and comparisons with previous designs are provided.

6.1 Hardware Architectures for GID and GMR Algorithms

As mentioned previously, the GID algorithm follows the same procedure as Algorithm 1, except that the 3-nonzero-bit input q is replaced by q with a general form. The block diagram for implementing the GID algorithm is shown in Fig. 1. It comprises two main components: the iteration unit (ItrU) and the decision unit for quotient computation (DUQ). One ItrU is employed for each iteration of Algorithm 1 so that the divider is pipelinable. The DUQ implements Lines 4-6 of Algorithm 1. In Fig. 1, assuming that \(\lambda \) has 2w bits, \(\lambda [2w-1:w]\) represents the highest w bits and it equals c. As highlighted in [28], the decisions in Lines 4-6 of Algorithm 1 depend on whether \(\lambda -b^*q\) is larger than zero or not. If it is larger than zero, only whether it is smaller or larger than q needs to be known. This information can be derived using only \(\lambda [w+1:0]\). Hence, only these bits are sent to the DUQ block in Fig. 1.

Fig. 3
figure 3

(a) The numbers that need to be added in DUQ when \(q=2^w-2^{l_1}-2^{l_2}+1\); (b) the numbers that need to be added in DUQ when \(q=2^w-2^{l_1}+2^{l_2}+1\); (c) hardware architecture of DUQ; (d) hardware architecture of DUR.

The ItrU computes Line 3 of Algorithm 1. For an example case of \(q=2^w-2^{l_1}\pm 2^{l_2}-1\), it is equivalent to \(b_{i+1} = c-v(b_i)\), where \(v(b_i) = \left\lfloor (-1)b_i(2^{l_1}\mp 2^{l_2}-1)/2^w\right\rfloor \). Since \(b_i<2^w\) and \(2^{l_1}\mp 2^{l_2}-1\) is an odd number, the numerator \(b_i(2^{l_1}\mp 2^{l_2}-1)\) does not have \(2^w\) as a factor. Accordingly, Line 3 simplifies to

$$\begin{aligned} b_{i+1} = c+\left\lfloor b_i(2^{l_1}\mp 2^{l_2}-1)/2^w \right\rfloor +1. \end{aligned}$$
(18)

The computations in the above formula can be implemented by adding up the numbers, as shown in Fig. 2. The two bars in the middle of the part (a) of this figure represent \(b_i2^{l_2}\) and \(b_i2^{l_1}\) for the case that \(q=2^w-2^{l_1}-2^{l_2}+1\). The \(-b_i\) inside the floor function of Eq. 18 is implemented by flipping every bit in the 2w-bit extension of \(b_i\), which is represented by the top bar in Fig. 2(a), and replacing the least significant bit (LSB) of \(b_i2^{l_1}\) by ‘1’. After the floor function, the lowest w bits of \(b_i(2^{l_1}\mp 2^{l_2}-1)\) are discarded. Hence, the last ‘1’ in Eq. 18 is added at the w-th bit position in Fig. 2(a). It can be pre-added to all the ‘1’ bits in the first bar, which leads to all ‘0’ in the sum with a carry-out. However, since it is known that the result \(b_{i+1}\) is positive and has w bits, the carry-out can be discarded and no further calculation is needed. In the case where \(q=2^w-2^{l_1}+2^{l_2}+1\), the 2’s complement of \(b_i2^{l_2}\) must also be carried out before the summation. This can be done by flipping every bit in the 2w-bit extended \(b_i2^{l_2}\), which is shown by the third bar in Fig. 2(b), and adding another ‘1’ to the LSB of \(b_i2^{l_1}\), which makes its last two bits equal to ‘10’.

Figure 2(c) illustrates the proposed hardware architecture for the ItrU. The two shifters align \(b_i\) with the other numbers as shown in Figs. 2(a) and (b) to generate \(b_i2^{l_1}\) and \(b_i2^{l_2}\). These shifts can be implemented using the shifter architecture in [14] by padding w bits of ‘0’s to the input. The sign of the \(2^{l_2}\) term in q is denoted by \(\text {sign}_2\). When it is negative, the output bits from the second shifter are flipped. As previously explained, all the ‘1’ bits in the top bar of Fig. 2 can be pre-added with the ‘1’ in the \((w+1)\)-th bit position. The carry-out from the 2w-th bit can be ignored, resulting in all ‘0’ bits. Hence, c in the last bar can be combined with the flipped \(b_i\) bits in the first bar to form a single 2w-bit number. This number and the outputs from the two shifters are sent to a carry-save adder (CSA) to compute the sum. Since \(b_{i+1}\) equals the highest w bits of this sum, the hardware for merging the lowest w bits of the last pair of sum and carry vectors in the CSA can be simplified accordingly. The highest w bits of carry and sum vectors are merged using a carry-select adder to reduce the critical path. The adjustment of the last two bits of \(b_i2^{l_1}\) to ‘01’ or ‘10’ as shown in Figs. 2(a) and (b) can be handled by simple hardware units and hence is not explicitly shown in Fig. 2(c).

Fig. 4
figure 4

Hardware architecture of SID for \(q=2^w-m+1\).

The value of \(b^*\) computed from the last iteration of Algorithm 1 falls within the set of \(\{b, b \pm 1\}\). Consequently, \(r^* = \lambda - b^*q\) equals r or \(r \pm q\). Therefore, only the lowest \(w+2\) bits of \(r^*\) must be calculated. In the case where \(q=2^w-2^{l_1}- 2^{l_2}+1\), \(\lambda - b^*q=\lambda - b^*(2^w+1)+b^*2^{l_1}+b^*2^{l_2}\). Hence, \(r^*\) can be calculated by adding up the lower \(w+2\) bits of the four numbers shown in Fig. 3(a). Similarly, \(b^*2^{l_1}\) and \(b^*2^{l_2}\) are represented by the third and fourth bars, respectively. To generate \(-b^*(2^w+1)\), bit-flipped \(b^*\) is put into the higher and lower bits as shown in the second bar, and ‘1’ is introduced to the LSB of \(b^*2^{l_1}\). If \(q=2^w-2^{l_1}+2^{l_2}+1\), to compute \(\lambda -b^*q\), \(b^*2^{l_2}\) is extended and bit flipped as shown by the third bar in Fig. 3(b). The ‘1’ needed for the 2’s complement is added to \(b^*2^{l_1}\) so that its two LSBs become ‘10’. According to Figs. 3(a) and (b), the DUQ is implemented by the architecture in Fig. 3(c). Once \(r^*\) is computed at the output of the CSA, its lowest \(w+1\) bits are compared with q. The comparator generates ‘1’ when \(r^*>q\) and ‘0’ otherwise. Then b equals \(b^*\), \(b^*-1\), and \(b^*+1\) when the \((w+2)\)-th bit of \(r^*\), which is its sign bit and the comparator output are ‘00’, ‘1x’, and ‘01’, respectively.

When q has more than four nonzero bits, the number of iterations in Algorithm 1, and hence the number of required ItrUs can be determined. For a q with num nonzero bits, each ItrU requires \(num-2\) shifters and a CSA with \(num-1\) inputs. A single DUQ is needed regardless of the format of q. However, it necessitates \(num-1\) shifters and a CSA handling \(num+1\) inputs.

The GID design can be easily adapted for GMR by replacing the DUQ in Fig. 1 with a decision unit for remainder computation (DUR), whose hardware implementation architecture is illustrated in Fig. 3(d). It employs the same blocks as shown in Fig. 3(c) to compute the \((w+2)\)-bit \(r^*\). Then, \(r^*\) is added and subtracted by q. The remainder, r, equals \(r^*\), \(r^*-q\), and \(r^*+q\) when the \((w+2)\)-th bit of \(r^*\), which is its sign bit and the sign bit of \(r^*-q\) are ‘01’, ‘00’, and ‘1x’, respectively.

6.2 Hardware Architectures for SID and SMR Algorithms

This subsection first presents the hardware architectures of SID and SMR for primes of the general form \(q=2^w-m+1\). Subsequently, our architecture is further refined for the case where q has a few nonzero bits, in which case \(T^*\) has a very limited number of nonzero bits.

6.2.1 Primes in the Form \(q \!=\! 2^w\!-\!m\!+\!1\)

The hardware architecture for implementing the proposed SID algorithm is depicted in Fig. 4. The components in the first dashed block multiply \(\lfloor \lambda /2^{w-1}\rfloor \) with T using the \(T^*\) formula from Eq. 14. As discussed in Section 5, if q is written in the format of \(q=2^w-m+1\), T can be expressed as \(T=2^w+n\), where \(n\le 2m\). \(q=2^w-m+1\) with \(\log _2 m \le 3w/4\) covers a wide range of primes. For this case, the multiplier width is reduced to \(3w/4\times w\). The product of \(\lfloor \lambda /2^{w-1}\rfloor n\) is added with \(\lfloor \lambda /2^{w-1}\rfloor 2^w\), which is implemented via wire shifting without any logic gate. From Eq. 1, only the highest w bits need to be computed using the adder in this block to obtain \(\tilde{b}\), allowing for further simplifications in the design.

Fig. 5
figure 5

Hardware architectures of SID for \(q=2^w-2^l+1\) with \(l\le 3w/4\). (a) the overall block diagram, (b) SAU-T, (c) SAU-q, (d) and eDUQ.

The quotient b equals \(\tilde{b}, \tilde{b}+1, \tilde{b}+2\) when \(\tilde{r}\) in Eq. 2 is \(r, r+q, r+2q\), respectively. The second dashed block in Fig. 4 calculates \(-(\tilde{b}q \mod (2^{w+1}))= (-\tilde{b}2^w - \tilde{b} + \tilde{b}m)\mod (2^{w+1})\) for Eq. 2. Since \(\tilde{b}\) is a w-bit number, \(\tilde{b}2^w + \tilde{b}\) can be derived by concatenating \(\tilde{b}\) to itself. The 2’s complement of this sum is then derived by flipping every bit and adding ‘1’ to the LSB. On the other hand, due to the modular reduction by \(2^{w+1}\) at the end, only the \(w+1\) LSBs need to be retained in \(\tilde{b}2^w + \tilde{b}\) and the product \(\tilde{b}m\). Accordingly, the multiplication by m in this dashed block can also be simplified.

The third dashed block in Fig. 4 completes the rest of Eq. 2 for computing \(\tilde{r}\), and compares it with q and 2q. In this block, \(\lambda \mod 2^{w+1}\) corresponds to the \(w+1\) LSBs of \(\lambda \), denoted by \(\lambda [w:0]\). Depending on the comparison results, \(\tilde{b}\), \(\tilde{b}+1\), or \(\tilde{b}+2\) is selected as the result b. This architecture can also be easily modified to compute the remainder r. In this case, \(\tilde{r}\) is subtracted by q and 2q, and the sign bits of the differences select either \(\tilde{r}\), \(\tilde{r}-q\), or \(\tilde{r}-2q\) as the remainder r.

6.2.2 Primes with a Few Nonzero Bits

When q has only a few nonzero bits, such as three or four, the number of nonzero bits in T is also limited. In this case, the multiplication of T and q can be efficiently implemented by adding a small number of shifted operands. Besides, the addition of the lower w bits of the shifted operands in the multiplication of T can be omitted to further simplify the multiplier by compensating the potential carries from these bits in the decision of b.

Figure 5 illustrates the hardware architectures for implementing the SID When \(q = 2^w-2^l+1\) and \(l\le 3w/4\). In this case \(s \le 3\). The shift-add unit for T (SAU-T), shift-add unit for q (SAU-q), and extended decision unit for quotient computation (eDUQ) implement the counterpart functions of the first, second, and third blocks in Fig. 4, respectively. From Eq. 17, \(T^*\) and hence T, contains up to 5 nonzero bits when q has three nonzero bits and \(s\le 3\). In this case, the multiplication with T can be simplified from a general multiplication to the shift-add operation as shown in Fig. 5(b). Assume that \(T = 2^w\pm \text {EN}_{1}\cdot 2^{u_1}\pm \text {EN}_2\cdot 2^{u_2}\pm \text {EN}_3\cdot 2^{u_3}\pm \text {EN}_4\cdot 2^{u_4}\), where \(\text {EN}_i\) is either 0 or 1 to support the cases when T has less than 5 nonzero bits. Denote the sign of each term by \(\text {sign}_i\). Each shifter in Fig. 5(b) is implemented by the architecture in [14], which shifts the sign-extended input according to the offset \(u_i\) and sign \(\text {sign}_i\). From Eq. 1, only the highest w bits of \(\lfloor \lambda /2^{w-1}\rfloor T\) contribute to \(\tilde{b}\). The lowest w bits of the five partial products for the multiplication of T generate a carry-out of at most 4 to the highest w bits. To simplify the shifters and the following CSA, only the highest w bits are retained from the shifters and added up in the CSA to derive \(\hat{b}\). It can be easily derived that \(\hat{b}\in \{b, b-1, \cdots , b-6\}\) due to the ignored carry-out from the lowest w bits. The actual b is decided in the eDUQ block.

Since \(\hat{b}\in \{b,b-1,\cdots , b-6\}\), the value of \(\hat{r} = \lambda -\hat{b}q\) lies in the set \(\{r, r+q, \cdots , r+6q\}\). Hence, two additional bits are needed to represent \(\hat{r}\) compared to \(\tilde{r}\). Therefore, \(\hat{r}\) should be computed as

$$ \hat{r} = (\lambda \mod 2^{w+3}) - (\hat{b} q\mod 2^{w+3}). $$

Similarly, only the lowest \(w+3\) bits need to be retained in the above calculations. In \(-\hat{b}q=-(2^w\hat{b}+\hat{b})+2^l\hat{b}\), the lowest \(w+3\) bits of \(2^w\hat{b}+\hat{b}\) are formed by concatenating the lowest 3 bits of \(\hat{b}\) to the left of \(\hat{b}\). Accordingly, \(\hat{r}\) is computed as shown in Fig. 5(c).

Comparing \(\hat{r}\) with 6 different values using 6 comparators to decide b leads to a large area. Alternatively, the original Barret reduction [19] formula can be applied to the \((w+3)\)-bit input \(\hat{r}\) to derive the approximate quotient as

$$ \hat{b}_1 = \left\lfloor \frac{\hat{r}\left\lfloor \frac{2^{w+3}}{q}\right\rfloor }{2^{w+3}}\right\rfloor = \hat{r}[w+2:w]. $$

Subsequently, \(\hat{r}-\hat{b}_1q\) is either r or \(r+q\), which implies that b equals to either \(\hat{b}+\hat{b}_1\) or \(\hat{b}+\hat{b}_1+1\). Accordingly, the eDUQ can be implemented as in Fig. 5(d). The three multiplexers and CSA calculate \(\hat{r}-\hat{b}_1q=\hat{r}-\hat{r}[w+2:w]q\).

With minor modifications to part (d), the architectures in Fig. 5 can be adapted to implement SMR. To compute the remainder, the output of the CSA in Fig. 5(d) is subtracted by q. The sign bit of the difference selects either the CSA output or the subtractor output as the remainder.

The architectures in Fig. 5 can be extended to implement the quotient computation for q with more than three nonzero bits. Assume that q and T have x and y nonzero bits, respectively. In the SAU-T architecture of Fig. 5(a), \(y-1\) shifters are required and the CSA has y w-bit inputs. Since the addition of the lowest w bits in the y partial products are skipped in the proposed design in Fig. 5(a), the \(\mod 2^{w+1}\) in Eq. 2 must be adjusted to \(\mod 2^{w+1+\lceil \log _2 y\rceil }\). As a result, the \(1+\lceil \log _2 y\rceil \) LSBs of \(\hat{b}\) are concatenated to the left of \(\hat{b}\) in Fig. 5(b), and the lowest \(w+1+\lceil \log _2 y\rceil \) bits of \(\lambda \) are input to the CSA. Besides, the CSA has an additional \(x-2\) inputs generated by \(x-2\) shifters within this block. For the eDUQ block, the only differences are that \(\lceil \log _2 y\rceil \) multiplexers are utilized and the number of inputs at the CSA is increased accordingly.

Table 4 Hardware complexity of integer division architectures for w-bit q.

The number of nonzero bits in T increases fast with the number of nonzero bits in q. The number of shifters in Fig. 5(b) and (c) increase linearly with the numbers of nonzero bits in T and q, respectively. When there are too many nonzero bits, the architecture for general q in Fig. 4 becomes more efficient than the design in Fig. 5.

6.3 Complexity Analyses & Comparisons

This section analyzes and compares the complexities of our proposed architectures with the best prior efforts.

The complexities of the proposed GID and SID integer division architectures are listed in Table 4. For the GID design, primes with four nonzero bits are considered. When \(\log _2{\left( 2^{l_1}\!\mp \!2^{l_2}\right) }\le \frac{3w}{4}\), which covers many q, 4 iterations are needed in Algorithm 1. In this case, 4 copies of the ItrU units are employed in the architecture shown in Fig. 1. To estimate the complexity of the design, it is assumed that the CSAs use rows of full adders (FAs) to compute the sum and carry vectors from the inputs [26]. Then, a carry-ripple adder merges the final pair of sum and carry vectors. Under these assumptions, the complexity of the CSA is broken down in terms of the number of FAs in Table 4. Additionally, if the lowest bits of the CSA outputs are discarded and hence only the carry-out to the highest bits needs to be computed, such as in the case of the CSA in Fig. 2(c), the corresponding FAs can be replaced with half adders (HAs). The detailed architectures of the shifters used in the ItrU and DUQ blocks can be found in [14]. Each shifter consists of rows of multiplexers. In the case that some of the shifter inputs are pre-set to ‘0’ or ‘1’, such as the shifters in Fig. 2(c), the corresponding multiplexers are simplified.

Table 5 The relative area, TC, latency, and normalized ADP of our proposed integer division designs compared with prior works synthesized using GlobalFoundries 22FDX process.
Table 6 Hardware complexity of modular reduction architectures for w-bit q.

The critical path of the GID design lies in the ItrU block, as shown in Fig. 2(c). It consists of a shifter with \(\log _2 w\) rows of multiplexers, an XOR gate, and a CSA that has three 2w-bit inputs but retains only the highest w bits in the output.

For the proposed SID architecture with \(q=2^w-m+1\) shown in Fig. 4, general multipliers are used. It can be assumed that a \(w\times w\)-bit multiplier is implemented by a \((w-1)\times w\) array of FAs. In the case where m is limited to 3w/4 bits, which still covers a broad range of q, n has at most \(3w/4+1\) bits. Accordingly, the number of rows of FAs in the multipliers for m and n is reduced to around 3w/4. Besides, only the lowest \(w+1\) bits of the product from the multiplication with m are needed. Therefore, around half of the array of FAs for this multiplier is eliminated. The adder in the first dashed block of Fig. 4 only needs to compute the highest w bits of the sum. Hence, it can be implemented using w HAs and w FAs in a carry-ripple architecture.

In the SID architecture for \(q=2^w-2^l+1\) with \(l<3w/4\) in Fig. 5, one operand to the last adder in part (d) has only three bits, and the adder is simplified accordingly. To estimate the logic complexity, each AND gate in part (b) of this figure is assumed to take half of the area of an XOR to implement. The complexity of the SID for \(q=2^w-2^{l_1}\pm 2^{l_2}+1\), with \(\log _2{\left( 2^{l_1}\mp 2^{l_2}\right) }\le 3w/4\), is also listed in Table 4. It is an extension of the architecture shown in Fig. 5. In this case, T has at most 13 nonzero bits, and the number of shifters, CSA inputs, and multiplexers are decided accordingly as discussed in the previous section.

For the purpose of comparison, the complexities of the CRA design with general q [18], the divider using the Barrett reduction architecture in [24], and the preliminary version of this work [28] are also listed in Table 4. For the general case of \(q=2^w-m+1\), our proposed SID design requires smaller area than the architectures from [18] and [24], mainly because of the simpler multiplication with T when m is limited to 3w/4 bits, which still covers a broad range of q. When q has three nonzero bits, the proposed SID design requires similar area to implement as the divider in [28]. When q has four nonzero bits, the proposed SID has a CSA with a large number of inputs, which may result in larger area than the GID design. Nevertheless, the SID design requires only 1 instead of 5 clock cycles, and would be more efficient in terms of area-delay product (ADP). Additionally, our SID architectures and the designs in [18, 24] can be pipelined to achieve higher clock frequency.

To further evaluate and compare the complexities of the divider architectures, the designs listed in Table 4 with \(w=48\) and 64 are implemented in Verilog and synthesized using the GlobalFoundries 22FDX process with GENUS 17.22. The post-synthesis results are listed in Table 5. Since the CRA design [18] is apparently much larger, it is not further synthsized. The timing constraints (TCs) listed in this table are the tightest without causing negative slack for each design. For the area requirement, TC, and latency, the percentage numbers in the parentheses indicate normalized values with respect to those of the design from [24]. Besides, the normalized ADPs are also listed.

Table 7 The relative area, TC, latency, and normalized ADP of modular reduction designs for \(w=64\) synthesized using GlobalFoundries 22FDX process.

For the case of \(w=64\), it can be observed from Table 5 that our SID design for general \(q=2^w-m+1\) and \(\log _2{m}\le 3w/4\) achieves 14% area saving and 16% shorter latency compared to the architecture in [24]. This is mainly due to the smaller multipliers required in the SID design. For designs whose q has only three nonzero bits, the SID design offers a slightly smaller area but only 1400/2400=58% latency compared to the iterative divider in our preliminary work [28] because only 1 instead of 4 clock cycles are needed. When q has 4 non-zero bits, while the SID design exhibits a larger area than the GID design, its latency is much shorter since it requires 1 instead of 5 clock cycles. Overall, its ADP is \(1-12.0\%/17\%=31\%\) smaller.

When w is reduced from 64 to 48, the areas of the design from [24] and the SID design with general \(q=2^w-m+1\) reduce more significantly than those of the others. This is due to the reason that the numbers of FAs required by these two designs are proportional to \(w^2\) instead of w as shown in Table 4. Moreover, the critical paths and hence the minimum achievable TCs without negative slack reduce with w in all designs.

The hardware complexities of our proposed GMR and SMR architectures as well as the previous modular reduction designs are listed in Table 6. Similar to the divider designs, the proposed SMR for \(q=2^w-m+1\) and \(\log _2{m}\le 3w/4\) requires a much smaller number of FAs in comparison to the extended decompose and reduce multiplication (e-DARM) [27] and Barrett reduction [24], mainly because of the smaller multipliers. For the designs whose q has three nonzero bits, the proposed SMR design has a similar area as the design in [28]. When q has 4 nonzero bits, the number of nonzero bits of T is larger in our SMR architecture, resulting in larger area. Despite that, its latency remains much shorter because it needs only 1 instead of 5 clock cycles.

The modular reduction designs listed in Table 6 for \(w=64\) are synthesized using the GlobalFoundries 22FDX process and the post-synthesis results are shown in Table 7. Similarly, the tightest TCs without causing negative slack are reported. From this table, our SMR design for the general case of \(q=2^w-m+1\) and \(\log _2{m}\le 3w/4\) achieves 7% area reduction and 18% shorter latency compared to the eDARM [27]. When q has only three nonzero bits, the proposed SMR design reduces the area and latency by 1-3624/3972=9% and 1-1400/2400=42%, respectively, compared to the architecture in [28] adopted for modular reduction. For q with four nonzero bits, although the SMR design requires larger area than the GMR design, its latency is much shorter because it only needs 1 clock cycle instead of 5. Overall, the ADP is reduced by 1-12.5%/18.5%=32%.

In summary, for q in general form or with 3 nonzero bits, the proposed SID and SMR architectures require less area and much shorter latency than other designs. For q with 4 nonzero bits, although the proposed designs have larger area, they achieve much higher efficiency in terms of ADP. The improvements that can be achieved by the proposed designs increase with larger w.

7 Conclusion

This paper first extends our prior iterative process for quotient computation to a GID design that can handle primes with more than three nonzero bits. Then, the SID design based on simplifying the Barret reduction is proposed. Furthermore, algorithmic reformulations have been developed to reduce the complexity of the SID architecture when the prime has a small number of nonzero bits. The proposed algorithms are also adapted for efficient modular reductions. Efficient hardware architectures have been developed to implement the proposed algorithms. Compared to prior efforts, the proposed designs achieve significant reductions in area and/or latency. Future work aims to further optimize the divider when the dividend has more bits.