# A High-Throughput VLSI Architecture for Hard and Soft SC-FDMA MIMO Detectors

Katayoun Neshatpour, Mahdi Shabany, and Glenn Gulak

Abstract—This paper introduces a novel low-complexity multiple-input multiple-output (MIMO) detector tailored for single-carrier frequency division-multiple access (SC-FDMA) systems, suitable for efficient hardware implementations. The proposed detector starts with an initial estimate of the transmitted signal based on a minimum mean square error (MMSE) detector. Subsequently, it recognizes less reliable symbols for which more candidates in the constellation are browsed to improve the initial estimate. An efficient high-throughput VLSI architecture is also introduced achieving a superior performance compared to the conventional MMSE detectors with less than 28% added complexity. The performance of the proposed design is close to the existing maximum likelihood post-detection processing (ML-PDP) scheme, while resulting in a significantly lower complexity, i.e.,  $4.5 \times 10^2$ and  $7 \times 10^4$  times fewer Euclidean distance (ED) calculations in the 16-QAM and 64-QAM schemes, respectively. The proposed design for the 16-QAM scheme is fabricated in a  $0.13\mu m$  CMOS technology and fully tested, achieving a 1.332 Gbps throughput, reporting the first fabricated design for SC-FDMA MIMO detectors to-date. A soft version of the proposed architecture is also introduced, which is customized for coded systems.

*Index Terms*—ASIC implementation, LTE, MIMO, PDP, SC-FDMA, soft decoding.

#### I. INTRODUCTION

THE 3rd generation partnership project (3GPP) defined long term evolution (LTE) to meet the requirements of 4G wireless communication. LTE combines multiple-input multiple-output (MIMO) technology with orthogonal frequency division-multiple access (OFDMA) technology in the downlink and single carrier-frequency division multiple access (SC-FDMA) in the uplink to achieve peak data rates of 300 Mbps and 75 Mbps, respectively.

LTE-Advanced (LTE-A), which is an evolution of LTE, supports single-user spatial multiplexing of up to eight layers in the downlink and four layers in the uplink, targeting peak data rates of 1 Gbps and 500 Mbps, respectively [2]. SC-FDMA utilizes a discrete Fourier transform-spread OFDM (DFT-S-OFDM) modulation with performance similar to that of OFDM. Its main advantage is a lower peak-to-average power ratio (PAPR), which makes it the technology of choice for the uplink [3]. However, the implementation of a MIMO detector in an SC-FDMA system is significantly more complicated than that of an OFDMA system. This is because the transmitted data are mixed together by the extra DFT block inherent in an SC-FDMA system. Therefore, the implementation of a low-complexity MIMO detector is needed and is the main challenge in the SC-FDMA framework.

Manuscript received July 03, 2014; revised October 26, 2014; accepted November 22, 2014. Date of publication January 06, 2015; date of current version February 23, 2015. This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Canadian Microelectronics Corporation (CMC). This paper was presented in part at ISCAS 2012 [1]. This paper was recommended by Associate Editor X. Zhang.

K. Neshatpour is with the Department of Electrical and Computer Engineering, George Mason University, Fairfax, VA 22030-4444 USA.

M. Shabany is with the Department of Electrical Engineering, Sharif University of Technology, Tehran, Iran (e-mail: mahdi@sharif.edu).

G. Gulak is with the Department of Electrical and Computer Engineering, University of Toronto, Toronto, ON M5S 3G4, Canada.

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TCSI.2014.2380637

Several designs have been proposed for SC-FDMA MIMO detectors among which the linear frequency domain equalizer (FDE) receivers, including the minimum mean square error (MMSE) and zero forcing (ZF), are often used due to their simplicity [3], [4]. Similar to the case of MIMO systems, successive interference cancellation and iterative techniques can be used to enhance the performance of the FDE receivers [5]–[7]. However, these techniques introduce long delays due to their iterative nature.

The maximum likelihood (ML) receiver, on the other hand, offers an optimal bit error rate (BER) performance but incurs very high computational complexity, especially in SC-FDMA receivers. Considering the compromise between the BER performance and the complexity, suboptimal methods are typically employed. In this paper, a detection scheme is proposed for MIMO SC-FDMA systems, which provides near-optimal performance with a significant reduction in complexity, especially for large constellation sizes. The proposed design is fabricated in a $0.13\,\mu\text{m}$ CMOS technology and fully tested. Moreover, in order to benefit from the enhanced signal integrity offered by coded systems, the proposed hard decoding architecture is also modified to create two soft decoding architectures, one optimized for area and the other optimized for a better BER performance.

#### II. SYSTEM MODEL

#### A. Transmitter

Fig. 1 shows the transmitter side of a MIMO SC-FDMA system with  $M_t$  transmit and  $M_r$  receive antennae supporting K users. The data stream on each transmit antenna is grouped into blocks of M symbols, as follows

$$\boldsymbol{s_{n_t}^{(k)}} = \left[ s_{n_t}^{(k)}(0), s_{n_t}^{(k)}(1), \dots, s_{n_t}^{(k)}(M-1) \right]^T, \quad (1)$$

where the superscript T represents the transpose operation,  $n_t$  is the antenna index, M is the DFT size, and  $\mathbf{s}_{n_t}^{(k)}$  represents the data on the transmit antenna  $n_t$  for user k, whose elements are chosen from a Q-ary quadrature amplitude modulation (QAM) constellation. After the DFT operation, the frequency domain (FD) representation of data on antenna  $n_t$  is obtained and is denoted by  $\mathbf{S}_{n_t}^{(k)}$ .

The next step in the SC-FDMA transmitter is to map the M frequency domain outputs of the DFT block to N existing orthogonal sub-carriers, denoted by the "Sub-carrier mapping" in Fig. 1. There are two typical methods for the sub-carrier allocation, i.e., the localized and distributed methods [3]. In the localized method, the FD outputs of each DFT core occupy consecutive sub-carriers in the bandwidth, while in the distributed method, the FD outputs are spread over the entire bandwidth with zeros on the unoccupied sub-carriers. Since the sub-carriers allocated to each user are consecutive in the LTE standard [8], the localized method is considered in this paper.

Fig. 1. MIMO SC-FDMA transmitter for user k with  $M_t$  transmit antennae.

1549-8328 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Using the localized sub-carrier allocation scheme, the DFT outputs are mapped to M sub-carriers allocated to each user to produce  $\boldsymbol{D}_{n_t}^{(k)}$  (Fig. 1). The localized sub-carrier mapping matrix for user k is denoted by

$$T_{N,M}^{(k)} = \left[0_{M \times (k-1)M}, I_M, 0_{M \times (N-kM)}\right]^T, \qquad (2)$$

where  $I_M$  is an *M*-dimensional identity matrix. The resulting FD SC-FDMA signal,  $D_{n_t}^{(k)}$ , is transformed into the time-domain (TD) through an *N*-point inverse fast Fourier transform (IFFT) operation, resulting in the TD signals as follows.

$$\boldsymbol{D}_{\boldsymbol{n_t}}^{(\boldsymbol{k})} = F_N^{-1} T_{N,M}^{(\boldsymbol{k})} F_M \boldsymbol{s}_{\boldsymbol{n_t}}^{(\boldsymbol{k})}, \qquad (3)$$

where  $F_M$  is the normalized *M*-point DFT matrix, and  $F_N^{-1}$  is the normalized *N*-point IFFT matrix. Finally, a cyclic prefix (CP) is inserted and the final SC-FDMA signal is ready for transmission.
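The transmitter chain described by (2) and (3) can be illustrated with a minimal sketch. The helper names (`dft_matrix`, `localized_map`, `sc_fdma_modulate`) and the toy sizes are assumptions made for illustration only; the cyclic prefix and the channel are omitted, so the last lines simply check that an ideal receiver (N-point FFT, de-mapping, M-point IDFT) recovers the transmitted block.

```python
import cmath

def dft_matrix(n, inverse=False):
    """Normalized n-point DFT (or IDFT) matrix as a list of rows."""
    sign = 1 if inverse else -1
    scale = 1 / n ** 0.5
    return [[scale * cmath.exp(sign * 2j * cmath.pi * r * c / n) for c in range(n)]
            for r in range(n)]

def matvec(a, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a_rc * x_c for a_rc, x_c in zip(row, x)) for row in a]

def localized_map(S, k, N):
    """Apply T_{N,M}^{(k)} of (2): put S on sub-carriers (k-1)M .. kM-1."""
    M = len(S)
    D = [0j] * N
    D[(k - 1) * M:k * M] = S
    return D

def localized_demap(Y_full, k, M):
    """Apply the transpose of T_{N,M}^{(k)}: keep only user k's sub-carriers."""
    return Y_full[(k - 1) * M:k * M]

def sc_fdma_modulate(s, k, N):
    """Equation (3) without the cyclic prefix: M-point DFT, mapping, N-point IFFT."""
    S = matvec(dft_matrix(len(s)), s)
    D = localized_map(S, k, N)
    return matvec(dft_matrix(N, inverse=True), D)

# Toy sizes (assumed): M = 4, N = 8, user k = 2, QPSK-like symbols.
s = [1 + 1j, -1 + 1j, 1 - 1j, -1 - 1j]
d = sc_fdma_modulate(s, k=2, N=8)

# Ideal channel, no noise: FFT, de-mapping and IDFT give back the block s.
S_hat = localized_demap(matvec(dft_matrix(8), d), k=2, M=4)
s_hat = matvec(dft_matrix(4, inverse=True), S_hat)
```

The explicit matrices keep the sketch close to the notation of (3); a practical implementation would use FFT routines instead.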

#### B. Receiver

A conventional linear SC-FDMA detector for user k is depicted in Fig. 2. After the CP removal on antenna  $n_r$  at the SC-FDMA receiver with  $M_r$  receive antennae, the received signal is denoted as

$$\boldsymbol{r_{n_r}} = \sum_{k=1}^{K} \sum_{n_t=1}^{M_t} h_{n_r,n_t}^{(k)} \otimes_N \boldsymbol{D_{n_t}^{(k)}} + \boldsymbol{w_{n_r}}, \qquad (4)$$

where  $\otimes_N$  is the *N*-point circular convolution,  $\boldsymbol{w}_{\boldsymbol{n}_r} = [w_{n_r}(0), w_{n_r}(1), \ldots, w_{n_r}(M-1)]^T$  represents the additive white Gaussian noise (AWGN) on antenna  $n_r$ , and  $h_{n_r,n_t}^{(k)}$  is the channel impulse response (CIR) between the transmit antenna  $n_t$  and the receive antenna  $n_r$  for user k. Using an N-point fast Fourier transform (FFT) and performing the sub-carrier de-mapping, the FD signal of user k, received at antenna  $n_r$  is denoted as

$$\boldsymbol{Y}_{\boldsymbol{n_r}}^{(\boldsymbol{k})} = \left[T_{N,M}^{(\boldsymbol{k})}\right]^T F_N \boldsymbol{r_{n_r}}. \qquad (5)$$

Therefore, the transmitted signal of each user can be detected individually, implying that index k can be removed, hereafter, for brevity of discussion.



Fig. 2. MIMO SC-FDMA receiver for user k with  $M_r$  receive antennae.

It is worth mentioning that the symbols in  $s_{n_t}$  are elements of a QAM constellation but due to the presence of the DFT core, which produces a linear combination of the constellation points, the  $S_{n_t}$  signals are no longer from a QAM system. This makes the design of the equalizer very challenging compared to the normal MIMO OFDMA detectors. In other words, due to this blending effect of the DFT, the ML detection is impossible in the frequency domain. However, if the effect of the DFT is taken into account in the form of an effective channel matrix (i.e.,  $H_{\text{eff}}$ ), the time-domain ML detection is theoretically feasible.

In this paper, without loss of generality, it is assumed that the number of transmit and receive antennae are the same. Let  $\boldsymbol{s} = [\boldsymbol{s_1^T}, \boldsymbol{s_2^T}, \dots, \boldsymbol{s_{M_t}^T}]^T$ , an  $(MM_t) \times 1$  matrix, be a set of constellation points at the transmitter, consisting of the signals of all sub-carriers and all antennae and  $\boldsymbol{Y} = [\boldsymbol{Y_1^T}, \boldsymbol{Y_2^T}, \dots, \boldsymbol{Y_{M_r}^T}]^T$ , an  $(MM_r) \times 1$  matrix, be the de-mapped signal obtained from the receiver antennae. Therefore, the effective channel matrix,  $H_{\text{eff}}$  can be defined as an  $(MM_r) \times (MM_t)$  matrix, highlighted in Fig. 2 by a gray box, which takes the mixed effects of both the channel and DFT block into account. Irrespective of the various existing approaches to derive the effective channel ([9], [10]), the new time-domain ML detection problem can be formulated as follows.

$$\boldsymbol{z_{ML}} = \operatorname*{arg\,min}_{\boldsymbol{\hat{s}} \in \mathcal{O}^{M \times M_t}} \|\boldsymbol{Y} - H_{\mathrm{eff}} \cdot \boldsymbol{\hat{s}}\|^2, \tag{6}$$

where $\mathcal{O}$ represents a Q-ary QAM constellation. While the ML detection provides the best performance, the complexity of the detection problem in (6) grows exponentially with  $M \times M_t$. For instance, in LTE-A, spatial multiplexing of up to four antennae is allowed (i.e., a 4 × 4 MIMO system), and the DFT size is a multiple of 12, resulting in  $M \times M_t = 48$  when only one resource block, consisting of 12 consecutive sub-carriers [8], is allocated to each user. Merely simulating (6) in software over the entire expected SNR range would take months, with no added value at the end. Therefore, intelligent methods need to be devised to drastically reduce the complexity of the detection problem.
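To give a sense of scale, the size of the search space in (6) is $Q^{M \times M_t}$. The short sketch below evaluates it for the single-resource-block case quoted above; the function name `ml_candidates` is an assumption introduced for illustration.

```python
import math

def ml_candidates(Q, M, M_t):
    """Number of candidate vectors an exhaustive ML search over (6) must visit."""
    return Q ** (M * M_t)

# One resource block (M = 12) on a 4 x 4 system, as in the text.
for Q in (4, 16, 64):
    n = ml_candidates(Q, M=12, M_t=4)
    print(f"{Q}-QAM: roughly 10^{math.log10(n):.0f} candidate vectors")
```

For 16-QAM this count equals $2^{192}$, which is why an exhaustive search is out of reach even for the smallest allocation.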

#### III. PREVIOUS WORK

A number of system-level solutions have been proposed to reduce the complexity of the MIMO detection problem in (6); however, none of them describes a detailed hardware implementation.

In [11], a grouping-based ML detector using an orthogonal projector is proposed to reduce the candidate size for the ML detection. While its performance is close to that of the ML detection with a reasonable complexity for the quadrature phase shift keying (QPSK) modulation scheme, its implementation is impractical for large constellation sizes.



**Second stage: Finding erroneous symbols in the initial estimate**

- For each $z(i)$ in $z$:
  - Find the $N_p$ closest symbols in the QAM constellation to $z(i)$ and create a set called $Q_i$.
  - For each $\alpha_i^{j}$ in $Q_i$, $\alpha_i^{j} \neq \hat{z}(i)$:
    - $ED_i^{j} = \left\| \boldsymbol{Y} - H_{\text{eff}} \left[\hat{z}(1), \hat{z}(2), \ldots, \alpha_i^{j}, \ldots, \hat{z}(M \times M_t)\right]^{T} \right\|^2$
  - $\lambda_i = \min_{1 \le j \le N_p - 1} ED_i^{j}$
- Find $\operatorname{ind}_1, \operatorname{ind}_2, \ldots, \operatorname{ind}_{N_e}$ such that $\lambda_{\operatorname{ind}_1}, \lambda_{\operatorname{ind}_2}, \ldots, \lambda_{\operatorname{ind}_{N_e}}$ have the lowest $\lambda_i$ values among all the symbols in the initial estimate.

**Third stage: Improving the initial estimate**

- For $i$ in $\operatorname{ind}_1, \operatorname{ind}_2, \ldots, \operatorname{ind}_{N_e}$:
  - Replace $\hat{z}(i)$ in $\hat{z}$ with $\alpha_i = \operatorname{argmin}_{\alpha_i^j} ED_i^j$.

Fig. 3. The proposed PDP algorithm.

In [10], a soft ZF/MMSE estimate of the signal is derived followed by a sphere decoding (SD)/Chase algorithm to improve the performance of the MMSE/ZF FDE. However, the execution of the Chase or the SD on  $M \times M_t$  symbols results in a very high complexity for large constellation sizes.

Alternatively, the post detection processing (PDP) is suggested in [9], where the ML-PDP algorithm selects the erroneous symbols from an initial estimate of the symbols obtained from an MMSE detector and performs the partial ML on them to improve the BER performance, which still incurs a high computational complexity for large  $M \times M_t$  values.

#### IV. HARD DECISION DETECTION SCHEME

The PDP algorithm in this paper, consisting of three stages, is illustrated in Fig. 3, where  $P = M \times M_t$ , H(m) is the channel matrix for the *m*-th sub-carrier and the superscript *H* is the Hermitian transform. These stages are described in the sequel.

*First Stage:* An MMSE equalizer<sup>1</sup> is utilized to produce the initial estimate of the symbol sequence by reversing the channel effect for each sub-carrier to estimate the transmitted FD signals. Subsequently, an *M*-point IDFT operation is executed on all sub-carriers to find time-domain signals. Therefore, the effect of the channel and the DFT are taken into account independently in this stage of the detection process. The IDFT outputs (i.e., z) are then mapped to the constellation points and grouped to produce  $M \times M_t$  symbols in the initial estimate,  $\hat{z}$ .

Second Stage: In order to improve the initial MMSE estimate, a number of symbols in the initial estimate are selected. For these selected symbols, extra possible candidates in the constellation are explored to see if they result in a better estimate. The selected symbols are, in fact, the ones that were initially more prone to error, called the "erroneous symbols." In order to find the erroneous symbols, a reliability metric (i.e., the error probability (EP) metric) is defined for each symbol representing its error probability. To calculate the EP metric, each symbol in the initial estimate is replaced with all other possible constellation points, with their corresponding Euclidean Distances (ED) calculated while other symbols remain unchanged. Then the lowest ED among them is defined as the EP metric for that specific symbol. It can be shown that the symbols with the lowest EP metric are the least reliable ones [12]. This process can be mathematically formulated as follows.

Let  $\hat{z}$  be the initial estimate. The sequence  $\bar{z}_i^{(j)}$,  $1 \leq j \leq Q-1$, is defined as a sequence in which the *i*-th symbol in  $\hat{z}$  (i.e.,  $\hat{z}(i)$) is replaced with the *j*-th symbol in  $\{\hat{z}_i\}^c$, where  $\{\hat{z}_i\}^c$  is the set of all constellation points excluding the *i*-th symbol in  $\hat{z}$. Thus,

$$\bar{\boldsymbol{z}}_{i}^{(j)} = \left[\hat{z}(1), \hat{z}(2), \dots, \alpha_{i}^{j}, \dots, \hat{z}(P)\right]^{T}, \quad \alpha_{i}^{j} \in \{\hat{z}_{i}\}^{c}, \qquad (7)$$

$$\{\hat{z}_{i}\}^{c} = \{z_{j} \mid \forall (1 \leq j \leq (Q-1)), z_{j} \in \mathcal{O}, z_{j} \neq \hat{z}(i)\}. \qquad (8)$$

Therefore, the EP metric for the *i*-th symbol is defined as

$$\lambda_{i} = \min_{1 \le j \le (Q-1)} \left\| \boldsymbol{Y} - H_{\text{eff}} \cdot \bar{\boldsymbol{z}}_{i}^{(j)} \right\|^{2}, \quad 1 \le i \le P, \qquad (9)$$

and  $\alpha_i$  is introduced as its corresponding constellation point.

In order to find the symbols with higher error probabilities, this metric should be calculated for all P symbols in the initial estimate, and the  $N_e$  symbols producing the lowest values of  $\lambda_i$  are considered the erroneous symbols. Thus, a total of  $P \times (Q - 1)$  EDs need to be calculated.

Moreover, in order to decrease the number of ED calculations to find the lowest values, it is proposed to search over only a subset of the constellation points, (i.e.,  $N_p$  points close to the MMSE output), where the minimum value in (9) is more likely to be. This subset can be efficiently selected using the complex Schnorr Euchner (SE) enumeration technique, which is an on-demand technique to enumerate the constellation points in the order of non-decreasing ED values.

Alternatively, the real and imaginary parts of the MMSE output can be rounded to the nearest values in  $\Omega = \{-\sqrt{Q} + 1, \dots, -1, +1, \dots, \sqrt{Q} - 1\}$  to produce the initial estimate and the first symbol in the candidate list. Subsequently,  $N_p$  points in the constellation close to each symbol in the initial estimate are selected to form the candidate list for that symbol. By choosing an appropriate value for  $N_p$  with respect to the constellation size, the BER performance loss is negligible (see Section VIII).
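The rounding-based construction of the candidate list can be sketched as follows. This is an illustrative sketch, not the enumeration used in the fabricated design: the function names are assumptions, the constellation is taken to be square with odd-integer coordinates as in $\Omega$, and the $N_p$ candidates are chosen by plain distance sorting from the MMSE output rather than by Schnorr-Euchner enumeration.

```python
def qam_constellation(Q):
    """Square Q-QAM points with coordinates in {-sqrt(Q)+1, ..., -1, +1, ..., sqrt(Q)-1}."""
    side = int(round(Q ** 0.5))
    levels = [2 * i - side + 1 for i in range(side)]
    return [complex(re, im) for re in levels for im in levels]

def slice_to_constellation(z, Q):
    """Round the real and imaginary parts of z to the nearest level in Omega."""
    side = int(round(Q ** 0.5))

    def nearest_level(x):
        level = 2 * round((x - 1) / 2) + 1          # nearest odd integer
        return max(-side + 1, min(side - 1, level))  # clip to the constellation

    return complex(nearest_level(z.real), nearest_level(z.imag))

def candidate_list(z, Q, N_p):
    """The N_p constellation points closest to the soft MMSE output z."""
    points = sorted(qam_constellation(Q), key=lambda p: abs(p - z))
    return points[:N_p]

# Toy MMSE output (assumed value) for a 16-QAM symbol.
z = 0.8 + 2.6j
Q_i = candidate_list(z, Q=16, N_p=4)
z_hat_i = slice_to_constellation(z, Q=16)
```

Away from ties, the first entry of `Q_i` coincides with the sliced initial estimate `z_hat_i`, which matches the description of the initial estimate as the first symbol in the candidate list.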

In the proposed scheme, in order to produce the minimum value in (9), only the EDs of the points in the candidate list of each symbol are calculated, and their minimum value is taken as the EP metric for that symbol. After the EP metrics are calculated for all symbols in the initial estimate, the  $N_e$  symbols with the lowest EP metrics are selected as the erroneous symbols. Using this method, the number of ED calculations is decreased from  $P \times (Q-1)$  to  $P \times (N_p-1)$  (e.g., a 5× decrease in the 16-QAM scheme with  $N_p = 4$  and a 12× decrease in the 64-QAM scheme with  $N_p = 6$ ).
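The quoted reduction factors follow directly from the ratio $(Q-1)/(N_p-1)$. A small check, assuming $P = 48$ as in the single-resource-block example of Section II (the function name `ed_counts` is hypothetical):

```python
def ed_counts(P, Q, N_p):
    """ED calculations for the full search, the candidate-list search, and their ratio."""
    full = P * (Q - 1)
    reduced = P * (N_p - 1)
    return full, reduced, full / reduced

P = 48  # M x M_t for one resource block on a 4 x 4 system
full_16, reduced_16, ratio_16 = ed_counts(P, Q=16, N_p=4)
full_64, reduced_64, ratio_64 = ed_counts(P, Q=64, N_p=6)
```

The ratio is independent of $P$: it is exactly 5 for 16-QAM with $N_p = 4$ and 12.6 for 64-QAM with $N_p = 6$, which the text rounds down to 12×.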

*Third Stage:* In the conventional ML-PDP, an ML detection is performed on a subset of the initial estimate (i.e., the  $N_e$  erroneous symbols) in order to improve the result. However, the complexity of this process grows exponentially with the number of selected symbols (i.e., requires  $Q^{N_e}$  ED calculations).

In this paper, in order to alleviate this computational complexity, the on-demand idea in the second stage is further utilized in the third stage. In other words, each erroneous symbol is substituted with the constellation points in its candidate list, and their corresponding EDs are calculated while other symbols remain unaltered. If any of these points produces an ED less than that of the initial estimate, that point replaces the initial estimate as the final decision for that symbol; otherwise, the symbol remains unaltered. This process is performed for all  $N_e$  selected symbols. Since all required ED values are derived in the previous stage, no additional ED calculations are required for this stage, which is the advantage of the proposed approach. This reduction in complexity comes at the cost of performance loss, which is shown to be negligible in Section IX.

Fig. 4. The iterative algorithm (a) with the resource sharing, (b) with additional hardware.
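The second and third stages can be summarized in a short behavioral sketch. It is a plain software model under stated assumptions, not the hardware data path: the function names are hypothetical, `cand_lists[i]` plays the role of $Q_i$, and the toy channel, received vector, and candidate lists are invented for the example.

```python
def euclid_sq(Y, H, z):
    """Squared Euclidean distance ||Y - H z||^2, with H stored as a list of rows."""
    return sum(abs(y - sum(h * s for h, s in zip(row, z))) ** 2
               for y, row in zip(Y, H))

def pdp_refine(Y, H_eff, z_hat, cand_lists, N_e):
    """Stages two and three: rank symbols by EP metric, then revise the N_e worst."""
    ed_init = euclid_sq(Y, H_eff, z_hat)
    metrics = []  # entries are (lambda_i, i, alpha_i)
    for i, Q_i in enumerate(cand_lists):
        best = None
        for alpha in Q_i:
            if alpha == z_hat[i]:
                continue  # the initial estimate itself is excluded
            trial = z_hat[:i] + [alpha] + z_hat[i + 1:]
            ed = euclid_sq(Y, H_eff, trial)
            if best is None or ed < best[0]:
                best = (ed, alpha)
        if best is not None:
            metrics.append((best[0], i, best[1]))
    metrics.sort(key=lambda m: m[0])  # lowest EP metric = least reliable
    z_out = list(z_hat)
    for lam, i, alpha in metrics[:N_e]:
        if lam < ed_init:  # accept only if it beats the initial estimate
            z_out[i] = alpha
    return z_out

# Toy 2-symbol example (all values assumed): the second symbol of z_hat is wrong.
H_eff = [[1, 0.5], [0.3, 1]]
s_true = [1 + 1j, -1 - 1j]
Y = [sum(h * s for h, s in zip(row, s_true)) for row in H_eff]
z_hat = [1 + 1j, 1 - 1j]
cand_lists = [[1 + 1j, -1 + 1j, 1 - 1j], [1 - 1j, -1 - 1j, 1 + 1j]]
z_final = pdp_refine(Y, H_eff, z_hat, cand_lists, N_e=1)
```

Note that every ED used in the final loop was already computed while ranking the symbols, which mirrors the observation that the third stage needs no additional ED calculations.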

The proposed method can also be implemented iteratively, where in each iteration, a number of erroneous symbols are selected and a partial ML is executed over them, resulting in a better BER performance. Fig. 4 shows two different methods for the realization of the iterative algorithm for one iteration. In Fig. 4(a), the resource sharing results in a little extra hardware compared to the non-iterative approach; however, the throughput of the design is halved. The unfolded architecture in Fig. 4(b), on the other hand, requires more additional hardware but causes no reduction in throughput relative to the non-iterative algorithm. In this paper, with the throughput improvement in mind, the unfolded architecture is implemented.

#### V. IMPLEMENTATION ISSUES

#### A. L2 Norm and L1 Norm

The proposed detection scheme requires  $P \times (N_p - 1)$  ED calculations in the second stage, each requiring P L2-norm calculations. In [13], the ED calculation for the SD algorithm was simplified by replacing the L2 norm with L1 and L $\infty$  norms. In a 4 × 4 system with a 16-QAM modulation scheme, the L1 and L $\infty$  implementations incur a 0.4 dB and 1.4 dB SNR penalty, respectively. In [14], a simplified K-best algorithm based on the L1 norm is introduced, which reduces the circuit complexity while causing only a small BER performance degradation. The L1 norm is selected over the L $\infty$  norm due to its superior BER performance. For the proposed algorithm in this paper, as verified by the simulation results (see Section IX), the L2 norm calculation is replaced by an L1 norm (i.e., the Euclidean distance is replaced with the Manhattan distance) to further reduce the complexity without affecting the BER performance.



Fig. 5. The first stage of the proposed architecture.

#### B. Reformulation

Based on the above modification, the equations for deriving the EP metrics can be formulated as follows.

$$MD_{i}^{j} = \sum_{t=1}^{P} \left| H_{\text{eff}}(t,1)\hat{z}(1) + \ldots + H_{\text{eff}}(t,i)\alpha_{i}^{j} + \ldots + H_{\text{eff}}(t,P)\hat{z}(P) - Y(t) \right|, \quad \alpha_{i}^{j} \in Q_{i} \setminus \{\hat{z}(i)\}, \qquad (10)$$

$$\lambda_i = \min_{1 \le j \le N_p - 1} MD_i^j, 1 \le i \le P,\tag{11}$$

where  $MD_i^j$  is the Manhattan distance (MD) from Y of the vector corresponding to the (j + 1)-th symbol in the candidate list of  $\hat{z}(i)$  (i.e.,  $\overline{z}_i^{(j)}$ ),  $H_{\text{eff}}(t, i)$  denotes the entry in the t-th row and i-th column of  $H_{\text{eff}}$, Y(t) is the t-th symbol in Y, and  $Q_i$  is the candidate list for  $\hat{z}(i)$. The computation in (10) includes P sum-of-products (SOP), each requiring P complex additions and multiplications, which is still a high computational complexity. However, as described in the sequel, by leveraging pipelining techniques and considering the fact that many of the terms used for the MD calculations are redundant, this complexity can be greatly reduced.

In fact (10) implies that each MD consists of the sum of P terms, each being an SOP. If each SOP is performed in one clock cycle, the accumulative sum of these SOPs yields the final sum after P clock cycles. Thus (10) is rearranged as follows.

$$MD_{i}^{j} = \sum_{t=1}^{P} \left| SOP_{i}^{j}(t) \right| = \sum_{t=1}^{P} \left| PartMD(t) + H_{\text{eff}}(t,i) \left( \alpha_{i}^{j} - \hat{z}(i) \right) \right|, \qquad (12)$$

$$PartMD(t) = H_{\text{eff}}(t, 1)\hat{z}(1) + \ldots + H_{\text{eff}}(t, P)\hat{z}(P) - Y(t), \qquad (13)$$

where the subscript *i* refers to the *i*-th symbol in  $\hat{z}$  and the superscript *j* refers to a symbol in the candidate list of z(i). Thus PartMD(t) is produced in the *t*-th clock cycle and as a result all MDs are calculated after *P* clock cycles.
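The rearrangement in (12) and (13) can be checked numerically with a small sketch. The function names and toy values are assumptions; `abs()` on a complex number is used for $|\cdot|$ as written in (10), whereas a hardware L1 implementation might instead add the magnitudes of the real and imaginary parts. The identity holds in either case because it is applied term by term inside the magnitude.

```python
def md_direct(Y, H, z):
    """Manhattan-style distance of (10): sum over rows t of |sum_k H(t,k) z(k) - Y(t)|."""
    return sum(abs(sum(h * s for h, s in zip(row, z)) - y) for y, row in zip(Y, H))

def part_md(Y, H, z_hat):
    """PartMD(t) of (13): shared by every candidate, one value per row t."""
    return [sum(h * s for h, s in zip(row, z_hat)) - y for y, row in zip(Y, H)]

def md_incremental(part, H, i, alpha, z_hat_i):
    """Equation (12): correct PartMD(t) by H(t,i) * (alpha - z_hat(i)) for each row."""
    return sum(abs(p + row[i] * (alpha - z_hat_i)) for p, row in zip(part, H))

# Toy 2 x 2 values (assumed) to compare the two formulations.
H = [[1 + 0.5j, 0.2], [0.3j, 1 - 0.2j]]
Y = [0.5 + 0.5j, -0.7 + 0j]
z_hat = [1 + 1j, 1 - 1j]
part = part_md(Y, H, z_hat)

i, alpha = 1, -1 - 1j                      # replace the second symbol
trial = z_hat[:i] + [alpha] + z_hat[i + 1:]
direct = md_direct(Y, H, trial)
incremental = md_incremental(part, H, i, alpha, z_hat[i])
```

Because `part` is computed once and reused, each additional candidate costs one multiply-add per row instead of a full sum of products, which is the redundancy the architecture exploits.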

#### VI. THE HARD DETECTION VLSI ARCHITECTURE

Fig. 5 depicts the first stage of the detection scheme. According to this architecture, one MMSE block and  $M_t$  number of M-point IDFT blocks are utilized. In each clock cycle, the MMSE block generates  $M_t$  outputs, which are kept in registers via the serial-to-parallel block. After M clock cycles, an M-point IDFT is executed on these outputs. Since efficient implementations of the MMSE and the IDFT block already exist in the literature, the focus of this paper is geared toward the realization of the second and third stages.



Fig. 6. The second and third stages of the proposed hard PDP architecture.

TABLE I

The Scheduling of the Inputs to the Proposed Architecture

| Input | Cycle 1            | Cycle 2            | $\ldots$ | Cycle P            |
|-------|--------------------|--------------------|----------|--------------------|
| z     | z(1)               | z(1)               | $\ldots$ | z(1)               |
|       | z(2)               | z(2)               | $\ldots$ | z(2)               |
|       | $\vdots$           | $\vdots$           |          | $\vdots$           |
|       | z(P)               | z(P)               | $\ldots$ | z(P)               |
| H     | $H_{\rm eff}(1,1)$ | $H_{\rm eff}(2,1)$ | $\ldots$ | $H_{\rm eff}(P,1)$ |
|       | $H_{\rm eff}(1,2)$ | $H_{\rm eff}(2,2)$ | $\ldots$ | $H_{\rm eff}(P,2)$ |
|       | $\vdots$           | $\vdots$           |          | $\vdots$           |
|       | $H_{\rm eff}(1,P)$ | $H_{\rm eff}(2,P)$ | $\ldots$ | $H_{\rm eff}(P,P)$ |
| Y     | Y(1)               | Y(2)               | $\ldots$ | Y(P)               |

The architecture of the second and third stages of the proposed scheme is depicted in Fig. 6, where the dashed lines denote a number of the pipelining stages. The inputs of the architecture are the channel coefficients, the outputs of the MMSE detector, and the FD signals received at the receiver. Specifically, the "H" inputs represent the values of all the terms in the *t*-th row of  $H_{\text{eff}}$  at the *t*-th clock cycle, the "Z" inputs are the outputs of the first stage, and the "Y" input represents the *t*-th element in Y at the *t*-th clock cycle. The architecture performs the detection in *P* clock cycles. The scheduling of the inputs during these *P* clock cycles is illustrated in Table I.

Disregarding the delay produced through the pipelining stages, PartMD(t) is produced in the *t*-th clock cycle. The candidate list generator (CLG) block in Fig. 6 generates the candidate list for each symbol. Using the generated list, the PMD block calculates the MDs for all the points in the candidate list of each symbol except the initial estimation after P clock cycles. The structure of the PMD block is based on the candidate list size (i.e.,  $N_p$ ). The detailed structure of the PMD calculation block is shown in Fig. 7.

Since the candidate list elements (i.e.,  $\alpha_i^j$ s) and  $\hat{z}(i)$ s are from a QAM constellation, while sharing the same real or imaginary parts, all the multiplications in (12) and (13) are constant complex multipliers, which take only specified values and can be realized by using only shift and add operations.

The Min blocks in Fig. 6 calculate the lowest values of the MDs derived by the PMD blocks for each symbol (i.e.,  $\lambda_i$ ) along with their corresponding constellation points (i.e.,  $\alpha_i$ ). Since *P* clock cycles are required to produce the EP metric values, the sorting can also be done in *P* clock cycles using minimal hardware. The sorting yields the minimum values of the EP metrics and their corresponding constellation points, along with indices indicating the symbols selected as the erroneous symbols (e.g.,  $\operatorname{ind}_1$ and  $\operatorname{ind}_2$). The detailed structure of the Sorter with  $N_e = 4$  is depicted in Fig. 8.
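One way to picture a sorter that keeps pace with the metric stream is an insertion-style selection that retains only the $N_e$ smallest entries seen so far. The sketch below is a behavioral assumption for illustration and is not claimed to reflect the internal structure of Fig. 8; the function name and sample metrics are hypothetical.

```python
def select_least_reliable(metrics, N_e):
    """Keep the N_e smallest EP metrics from a stream of (lambda_i, alpha_i) pairs.

    Returns a list of (lambda_i, index, alpha_i), sorted by ascending lambda_i.
    """
    kept = []
    for i, (lam, alpha) in enumerate(metrics):
        pos = len(kept)
        while pos > 0 and kept[pos - 1][0] > lam:
            pos -= 1                 # find the insertion point from the tail
        if pos < N_e:
            kept.insert(pos, (lam, i, alpha))
            del kept[N_e:]           # drop the entry pushed out of the window
    return kept

# Sample stream (assumed values): one metric per symbol index.
stream = [(5.0, 1 + 1j), (1.0, -1 + 1j), (3.0, 1 - 1j), (0.5, -1 - 1j), (4.0, 3 + 1j)]
worst = select_least_reliable(stream, N_e=2)
indices = [i for _, i, _ in worst]
```

Each incoming metric is compared against at most $N_e$ stored values, so the selection finishes as soon as the last EP metric arrives.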



Fig. 7. The detailed structure of the PMD block.



Fig. 8. The detailed structure of the Sorter with  $N_e = 4$ .

Since the candidate list of the symbols is the same in the second and the third stage, the MDs that are required to be calculated in the third stage are already derived. Therefore, the minimum MD value among the points in the candidate list of each erroneous symbol (derived in the second stage) is compared to the MD of the initial estimation (i.e.,  $MD_{init}$  in Fig. 6) and, if smaller, its corresponding point is selected as the detected symbol. The register bank initially holds the values of the initial estimation. The comparator blocks compare the EP metrics with the MD of the initial estimate, and eventually the final decisions are stored in the register bank.

#### VII. SOFT DETECTION SCHEME

The architecture in Section VI provides a hard decision output (i.e.,  $z_F$ ) based on the transmitted symbols. While the proposed structure provides a superior BER performance compared to the conventional MMSE receivers, a soft-coded system is proposed that complies with advanced wireless standards. In a coded system, the transmitter encodes the message by using an error-correcting code.

At the receiver, the decoding is performed based on the extrinsic log-likelihood ratios (LLR) calculated by the MIMO detector. The LLRs are in fact the soft information representing the reliability of the detection. In contrast to a hard MIMO detector where a hard decision is made for each bit, a soft MIMO detector generates a value for each bit representing the probability of its being one or zero.

In order to enhance the performance of the coded system, the MIMO detector will have to generate a soft decision based on the transmitted symbols. In a *Q*-ary QAM modulation, LLR values must be calculated for all bits in each symbol resulting in  $\log_2 Q \times P$  number of LLR calculations for P symbols.

Let's define a(i) as the *i*-th symbol in the vector *a* and a(i, b) as the *b*-th bit in the *i*-th symbol of *a*. Assuming that there is no prior knowledge of the transmitted signal, and using the *maxlog* operation, the LLR value for the *b*-th bit in the *i*-th symbol of the initial estimation (i.e.,  $\hat{z}(i, b)$ ) can be estimated as follows [15].

$$L\left(\hat{z}(i,b)|Y\right) \approx \frac{1}{2\sigma_i^2} \left(\min_{a \in \zeta^{-1}(i,b)} d(a,Y) - \min_{a \in \zeta^{+1}(i,b)} d(a,Y)\right), \qquad (14)$$

where  $\sigma_i^2$  is the per-stream SNR, *a* is a vector with *P* symbols from the constellation, d(a, Y) is the Euclidean distance of the vector *a* from *Y*,  $\zeta^{+1}(i, b) = \{a | \forall k, a(k) \in \mathcal{O}, a(i, b) = 1\}$  and  $\zeta^{-1}(i, b) = \{a | \forall k, a(k) \in \mathcal{O}, a(i, b) \neq 1\}$ . Thus  $\zeta^{+1}(i, b)$  is the set of all *a* vectors in which the *b*-th bit in the *i*-th symbol is one.

It can be inferred from (14) that  $2^{P \times \log_2 Q}$  EDs should be calculated for each LLR value, which results in a total of  $P \times \log_2 Q \times 2^{P \times \log_2 Q}$  ED calculations. This makes it infeasible even to simulate (14) in order to evaluate its performance, let alone to implement it in hardware.

In order to minimize the number of ED calculations, it is proposed to search over only a subset of  $\zeta^{+1}(i, b)$  and  $\zeta^{-1}(i, b)$ , called  $\overline{\zeta}^{+1}(i, b)$  and  $\overline{\zeta}^{-1}(i, b)$ , respectively, where the minimum values in (14) are more likely to be. In fact,  $\overline{\zeta}^{+1}(i, b)$  is proposed to be defined as follows.

$$\overline{\zeta}^{+1}(i,b) = \left\{ a \,\middle|\, \forall k, \left\{ \begin{array}{ll} a(k) \in \mathcal{O},\ a(k,b) = 1 & k = i \\ a(k) = \hat{z}(k) & k \neq i \end{array} \right. \right\}, \qquad (15)$$

and  $\overline{\zeta}^{-1}(i, b)$  is defined with the same approach for  $a(k, b) \neq 1$ . Thus for each  $\hat{z}(i, b)$  bit, instead of checking all the possible combinations, only the vectors, which are identical to the initial estimation for all symbols excluding the *i*-th symbol are investigated, resulting in only  $\log_2 Q$  Euclidean distance calculations for each bit.

Moreover, in order to implement this soft decoding algorithm for the MMSE output, the candidate list derived in the second stage of the hard PDP scheme (i.e.,  $Q_i$ ) is explored. For each symbol, the candidate list includes the constellation points close to the initial estimation (as discussed in Section V). The simplified LLR values, scaled by a factor of  $2\sigma_i^2$  are thus calculated as follows.

$$L'(\hat{z}(i,b)|Y) \approx \min_{a \in \overline{\overline{\zeta}}^{-1}(i,b)} d(a,Y) - \min_{a \in \overline{\overline{\zeta}}^{+1}(i,b)} d(a,Y), \qquad (16)$$

where

$$\begin{aligned} \overline{\overline{\zeta}}^{+1}(i,b) &= \left\{ a \,\middle|\, \forall k, \left\{ \begin{array}{ll} a(k) \in Q_i,\ a(k,b) = 1 & k = i \\ a(k) = \hat{z}(k) & k \neq i \end{array} \right. \right\}, \\ \overline{\overline{\zeta}}^{-1}(i,b) &= \left\{ a \,\middle|\, \forall k, \left\{ \begin{array}{ll} a(k) \in Q_i,\ a(k,b) \neq 1 & k = i \\ a(k) = \hat{z}(k) & k \neq i \end{array} \right. \right\}. \end{aligned} \qquad (17)$$

For each  $\hat{z}(i, b)$  bit, the candidate list elements belong to either  $\overline{\overline{\zeta}}^{+1}(i, b)$  or  $\overline{\overline{\zeta}}^{-1}(i, b)$ . Utilizing this method, and by replacing EDs with MDs, all the values required to calculate the minimum values in (16) can be derived through the proposed hard decoding architecture in Fig. 6. However, whether each calculated MD is associated with a vector in  $\overline{\overline{\zeta}}^{+1}(i, b)$  or  $\overline{\overline{\zeta}}^{-1}(i, b)$  remains to be determined, which requires additional hardware. Thus the architecture in Fig. 6 can be modified to create the architecture for the MMSE soft detector. The resulting modified architecture is depicted in Fig. 9.

Fig. 9. The second and third stage of the MMSE soft detection architecture.

Fig. 10. The detailed structure of the LLR-b block with  $N_p = 4$ .

The structure of the Manhattan distance block was depicted in Fig. 6. The LLR-*b* block calculates the LLR value for the *b*-th bit in a symbol. The detailed structure of the LLR-*b* block with  $N_p = 4$  is depicted in Fig. 10. Let  $W^+(i,b) = \{W^+_{(1)}(i,b), W^+_{(2)}(i,b), \ldots, W^+_{(N_p)}(i,b)\}$  be the set of calculated MDs corresponding to  $\overline{\overline{\zeta}}^{+1}(i,b)$ , and  $W^-(i,b) = \{W^-_{(1)}(i,b), W^-_{(2)}(i,b), \ldots, W^-_{(N_p)}(i,b)\}$  be the set of MDs corresponding to  $\overline{\overline{\zeta}}^{-1}(i,b)$ . The Decision(*b*) block in Fig. 10 determines whether the MD of each symbol in the candidate list of the *i*-th symbol belongs to  $W^+(i,b)$  or  $W^-(i,b)$ . The structure of the Decision(*b*) block is based on the partition of the constellation and the candidate list. Subsequently, the difference between the minimum values in  $W^+(i,b)$  and  $W^-(i,b)$  is calculated to derive the LLR values in (16).
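The behavior of the LLR-*b* blocks can be sketched in software as follows. This is a behavioral illustration only, not the hardware structure of Fig. 10. The function name, the integer bit labels of the candidates, and the handling of a bit for which all candidates agree (returned as `None` here) are assumptions; the MD values are arbitrary example numbers.

```python
def llr_block(candidate_labels, mds, bits_per_symbol):
    """Split the N_p distances of one symbol by bit value and subtract minima.

    candidate_labels: integer bit labels of the N_p candidates (assumed format).
    mds: the Manhattan distance computed for each candidate.
    """
    llrs = []
    for b in range(bits_per_symbol):
        # Decision(b): route each MD to W+ (bit is 1) or W- (bit is not 1).
        w_plus = [d for lab, d in zip(candidate_labels, mds)
                  if ((lab >> b) & 1) == 1]
        w_minus = [d for lab, d in zip(candidate_labels, mds)
                   if ((lab >> b) & 1) == 0]
        if w_plus and w_minus:
            llrs.append(min(w_minus) - min(w_plus))   # as in (16)
        else:
            llrs.append(None)   # assumption: one-sided list left undecided
    return llrs


# Example with N_p = 4 candidates of a 16-QAM symbol (4 bits per symbol).
labels = [0b0101, 0b0111, 0b0100, 0b1101]
mds = [0.3, 0.9, 1.1, 1.4]
llrs = llr_block(labels, mds, bits_per_symbol=4)
print(llrs)
```

In the example, all four candidates share the same value of bit 2, so one of the two sets is empty for that bit. How such a case is resolved (for instance by saturating the LLR) is a design choice that this sketch does not settle.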

The architecture in Fig. 9 provides soft decisions based on the MMSE outputs. In order to provide soft decisions based on the proposed PDP algorithm, the input to the Manhattan distance block should be the output of the hard decoding architecture instead of  $\hat{z}$ , which results in better performance (see Section IX) at the cost of extra hardware (see Table III). Therefore, the overall architecture for the PDP soft decoding scheme is the combination of the architecture in Fig. 6 and the architecture in Fig. 9, resulting in the overall architecture depicted in Fig. 11.

Fig. 11. The architecture of the PDP soft decoding scheme.

Fig. 12. The BER performance of the proposed detection scheme in a 16-QAM,  $4 \times 4$  MIMO system for various choices of  $N_e$ .

The LLR values calculated are the simplified LLRs, scaled by a factor of  $2\sigma_i^2$ . Before feeding these values to the decoder they should be scaled by  $(1/\sigma_i^2)$ , which can be calculated in the MMSE detector in the first stage of the detection.

#### VIII. THE CHOICE OF PARAMETERS

Several considerations have to be taken into account for the selection of values of  $N_e$  and  $N_p$ , described in the following. It is assumed that one resource block is allocated to each user in a  $4 \times 4$  MIMO system, thus M = 12 and P = 48.

The value of  $N_e$  is critical to the performance of the system. If its value approaches P in the ML-PDP detection scheme, the performance will approach that of ML, with the same complexity. Therefore, a judicious value should be selected to offer reasonable complexity while maintaining a good BER performance. Fig. 12 shows the software implementation results for different values of  $N_e$ , based on which the value of  $N_e$  is set to 4. Larger values will result in a better BER performance, at the cost of higher complexity.

In addition, the value of  $N_p$  is a critical parameter in defining both the complexity and the performance of the system. An appropriate value should be selected for  $N_p$  with respect to the constellation size, in order to keep the complexity low while achieving acceptable performance. According to the simulation results depicted in Fig. 13, the value of  $N_p$  is proposed to be set to 4 for the 16-QAM scheme, whereas for the 64-QAM scheme  $N_p = 6$  results in a slightly better performance compared to  $N_p = 4$ .



Fig. 13. The BER performance of the proposed detection scheme in a  $4 \times 4$  MIMO system for various choices of  $N_p$  with  $N_e = 4$ .

#### IX. SOFTWARE IMPLEMENTATION RESULTS

The BER performance of the proposed architecture for a  $4 \times 4$  MIMO detector, with one resource block allocated to each user (i.e., M = 12), was evaluated through MATLAB simulations, with  $N_e = 4$ , and with  $N_p = 4$  and  $N_p = 6$  for the 16-QAM and 64-QAM schemes, respectively. For simulation purposes, an independent and identically distributed (iid) Rayleigh fading channel model with no spatial correlation is assumed. Thus the entries of the channel matrix were chosen independently as zero-mean complex Gaussian random variables with variance one per complex dimension. Perfect channel knowledge is assumed at the receiver. While other channel matrix models can be explored to better emulate time-dispersive MIMO scenarios, the proposed architecture is independent of the channel model, and its effectiveness in enhancing the performance is based on the BER performance of the baseline MMSE detector.

The SNR is defined as the ratio between the total transmit power, normalized to one, and the variance of the noise. Thus  $SNR = 1/\sigma^2$ , where  $\sigma^2$  is the variance of the noise vector per complex dimension.
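The simulation setup described above can be illustrated with a short Python sketch using only the standard library. It is not the MATLAB code used for the reported results. The function names are hypothetical, and "variance one per complex dimension" is interpreted here as  $E|h|^2 = 1$  for each channel entry, which is an assumption.

```python
import math
import random


def noise_variance(snr_db):
    """sigma^2 for unit total transmit power, using SNR = 1 / sigma^2."""
    return 1.0 / (10.0 ** (snr_db / 10.0))


def rayleigh_channel(n_rx, n_tx, rng):
    """i.i.d. zero-mean complex Gaussian entries (assumed E|h|^2 = 1)."""
    s = math.sqrt(0.5)   # real and imaginary parts each carry half the variance
    return [[complex(rng.gauss(0.0, s), rng.gauss(0.0, s))
             for _ in range(n_tx)]
            for _ in range(n_rx)]


rng = random.Random(0)             # fixed seed so the draw is repeatable
H = rayleigh_channel(4, 4, rng)    # one 4 x 4 channel realization
sigma2 = noise_variance(10.0)      # noise variance at an SNR of 10 dB
print(len(H), len(H[0]), sigma2)
```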

For the soft decoding algorithm, a turbo cyclic redundancy check (CRC) coded system with a code rate of 1/2 is considered. The turbo code includes two non-systematic convolutional (NSC) encoders and was generated with the TurboEncode function of the iterative solutions coded modulation library (ISCML). A rate-1/2 code enhances the BER performance at the cost of doubling the bandwidth.

As Fig. 13 demonstrates, the ML-PDP architecture improves the BER performance of the MMSE by more than 2 dB at BER =  $10^{-3}$  and 1 dB at BER =  $10^{-2}$  in the 16-QAM and 64-QAM schemes, respectively. The proposed hard PDP architecture has negligible performance loss compared to the ML-PDP approach, while providing a significant saving in the hardware complexity, as addressed in Section X.

Fig. 15 compares the BER performance of the hard MMSE and the proposed iterative and non-iterative PDP with the soft decoding architectures. As expected from the error-correcting nature of coded systems, the soft decoding detectors result in a significant enhancement in the performance. Comparing the soft MMSE and the soft PDP algorithms shows that the enhancement in the BER performance of the soft PDP algorithm is significant, i.e., almost 2 dB at  $BER = 10^{-3}$ , which comes at the cost of a twofold increase in the computational complexity (see Table III).

Fig. 14. L1, L2-norm and fixed-point results for the 16-QAM scheme.

 TABLE II

 The Word-length and the Fractional-length Values

|        | $\operatorname{Re}\{z(i)\}$, $\operatorname{Im}\{z(i)\}$ | $\operatorname{Re}\{Y(i)\}$, $\operatorname{Im}\{Y(i)\}$ | $\operatorname{Re}\{H(t,i)\}$, $\operatorname{Im}\{H(t,i)\}$ | $\lambda_i$ |
|--------|----------------------------------------------------------|----------------------------------------------------------|--------------------------------------------------------------|-------------|
| 16-QAM | [5, 1]<sup>1</sup>                                       | [12, 8]                                                  | [9, 8]                                                       | [16, 8]     |
| 64-QAM | [6, 1]                                                   | [15, 8]                                                  | [9, 8]                                                       | [17, 8]     |

<sup>1</sup> The pair [WL, FL] stands for the word length and the fractional length.

Fig. 15. The BER performance of the proposed hard and soft decoding architectures for the 16-QAM scheme.

Several simulations were performed to determine the fixed-point effect of the variables on the overall design performance. Fig. 14 shows the BER performance of the proposed design with different fractional-length values for the channel coefficients and the MDs for the 16-QAM scheme, together with the L1-norm and L2-norm simulation results. Based on Fig. 14, the word length (WL) and fractional length (FL) of a number of variables were selected as listed in Table II, for which the results are almost the same over a wide range of SNR values.

#### X. COMPLEXITY ANALYSIS

The number of ED/MD calculations is the dominant factor in determining the complexity of suboptimal MIMO detectors. Table III compares the number of ED calculations in the ML-PDP and the proposed hard PDP architecture, as well as the soft MMSE and PDP architectures in the second and third stages. The ML-PDP requires P(Q - 1) and  $Q^{N_e}$  ED calculations in the second and third stages, respectively. While the selective ML-PDP method in [9] tries to reduce the complexity of the selection procedure of the erroneous symbols in the second stage, its complexity for the partial ML detection on the erroneous symbols still grows exponentially with the number of selected symbols. Moreover, its computational complexity is SNR-dependent.

TABLE III The Complexity Assessment

|           | ED/MD calculations  | Complex multipliers                                   |
|-----------|---------------------|-------------------------------------------------------|
| ML        | $Q^P$               | $Q^P \times P^2$                                      |
| MMSE      | NA                  | $\frac{5}{2}M_r M_t^2 + \frac{7}{2}M_r M_t + M_t M^2$ |
| ML-PDP    | $(Q-1)P + Q^{N_e}$  | $((Q-1)P+Q^{N_e})P^2$                                 |
| hard PDP  | $(N_p - 1)P + 1$    | $N_p \times P$                                        |
| soft MMSE | $(N_p - 1)P + 1$    | $N_p \times P$                                        |
| soft PDP  | $2((N_p - 1)P + 1)$ | $2N_p \times P$                                       |

On the other hand, the proposed design in this paper requires  $N \times N_p$  ED calculations in total, regardless of the SNR value. Table III shows that with the parameters selected in Section VIII, the proposed hard PDP architecture results in almost  $4.5 \times 10^2$  and  $7 \times 10^4$  times fewer ED calculations for the 16-QAM and 64-QAM schemes, respectively. Moreover, the hardware required for the calculation of each ED is significantly reduced based on (12) and (13) and by replacing the ED with the MD.
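The reduction factors quoted above can be checked directly from the expressions in Table III with the parameters of Section VIII (P = 48,  $N_e = 4$ ,  $N_p = 4$  for 16-QAM and  $N_p = 6$  for 64-QAM). The following Python sketch does this; the function name is hypothetical.

```python
def ed_counts(Q, P, N_e, N_p):
    """ED/MD counts of ML-PDP and the hard PDP, as listed in Table III."""
    ml_pdp = (Q - 1) * P + Q ** N_e      # second stage + third stage
    hard_pdp = (N_p - 1) * P + 1
    return ml_pdp, hard_pdp, ml_pdp / hard_pdp


P, N_e = 48, 4
results = {}
for name, Q, N_p in (("16-QAM", 16, 4), ("64-QAM", 64, 6)):
    results[name] = ed_counts(Q, P, N_e, N_p)
    ml, hard, ratio = results[name]
    print(f"{name}: ML-PDP = {ml}, hard PDP = {hard}, ratio = {ratio:.3g}")
```

The ratios evaluate to roughly 457 for 16-QAM and  $6.96 \times 10^4$  for 64-QAM, in line with the stated  $4.5 \times 10^2$  and  $7 \times 10^4$ .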

The proposed schemes utilize various computational units including multiplexers, adders, etc. However, the complex multiplier is the largest unit, making it an important factor in determining the complexity of the design. The linear MMSE detector requires  $(5/2)M_rM_t^2 + (7/2)M_rM_t$  complex multipliers [17]. The conventional MMSE detector, which is considered the baseline detector for SC-FDMA, requires  $M_t$  extra *M*-point IDFT units, each accounting for  $M^2$  multipliers.<sup>2</sup> Table III shows the number of complex multiplications for the ML-PDP, the proposed schemes and the MMSE detector for the SC-FDMA. According to Table III, the number of complex multipliers used in the hard PDP, soft MMSE and soft PDP architectures is 1.24, 1.24 and 1.48 times that of a conventional MMSE detector, respectively, which reflects the relative cost of their improved BER performance.
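These ratios can be reproduced from Table III under two assumptions that are stated here explicitly: the 16-QAM parameters ( $M_r = M_t = 4$ , M = 12, P = 48,  $N_p = 4$ ) are used, and each proposed detector is counted as the conventional MMSE detector plus its own second and third stages. The Python names below are hypothetical.

```python
from fractions import Fraction


def mmse_multipliers(M_r, M_t, M):
    """Complex multipliers of the conventional SC-FDMA MMSE detector."""
    linear = Fraction(5, 2) * M_r * M_t ** 2 + Fraction(7, 2) * M_r * M_t
    idft = M_t * M ** 2                  # M_t IDFT units, M^2 multipliers each
    return linear + idft


M_r, M_t, M, P, N_p = 4, 4, 12, 48, 4
baseline = mmse_multipliers(M_r, M_t, M)
added = {
    "hard PDP": N_p * P,
    "soft MMSE": N_p * P,
    "soft PDP": 2 * N_p * P,
}
# Assumption: total = MMSE first stage + added stages, relative to MMSE alone.
ratios = {name: float((baseline + extra) / baseline)
          for name, extra in added.items()}
for name, r in ratios.items():
    print(f"{name}: {r:.2f} x MMSE")
```

Under these assumptions the baseline amounts to 792 complex multipliers and the ratios round to 1.24, 1.24 and 1.48.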

#### XI. HARDWARE IMPLEMENTATION RESULTS

The proposed PDP architecture for the 16-QAM and 64-QAM schemes was implemented and fully tested on a Xilinx Virtex-6 XC6VLX240T using the ML605 evaluation kit. Table IV shows the results of the field programmable gate array (FPGA) implementation. The normalized throughput denotes the data rate per transmit antenna.

The hard decoding architectures for the 16-QAM and 64-QAM schemes and the soft PDP architecture for the 16-QAM scheme were also synthesized and placed and routed in a  $0.13\mu$m 1P/8M CMOS technology. The implementations include the second and the third stages of the proposed detection scheme (i.e., all the hardware in Fig. 6). An MMSE detector and *M*-point IDFT blocks must be added as the first stage to make a full receiver. The result of a deeply pipelined MMSE detector for a 64-QAM OFDM system proposed in [16] is also provided in Table V in order to compare with a conventional MMSE detector for the SC-FDMA as a baseline.

<sup>2</sup>The number of multiplications in the IDFT and the linear MMSE blocks may vary based on the implementations.

TABLE IV THE FPGA IMPLEMENTATION RESULTS FOR HARD PDP

|                               | 16-QAM hard | 16-QAM soft | 64-QAM hard |
|-------------------------------|-------------|-------------|-------------|
| Nr Slice Registers            | 20683(6.8%) | 24754(8.2%) | 23934(7.9%) |
| Nr Slice LUTs                 | 38905(26%)  | 58719(38%)  | 42684(28%)  |
| Max. freq. [MHz]              | 147         | 144         | 117.5       |
| Throughput [Mbps]             | 588         | 576         | 705         |
| Norm. Throughput <sup>1</sup> | 147         | 144         | 176         |

<sup>1</sup> Normalized throughput corresponds to data rate per transmit antenna.

TABLE V THE ASIC IMPLEMENTATION RESULTS

|                                  | 16-QAM hard | 16-QAM hard iterative | 16-QAM soft | 64-QAM hard | 64-QAM MMSE [16] |
|----------------------------------|-------------|-----------------------|-------------|-------------|------------------|
| Process [nm]                     | 130         | 130                   | 130         | 130         | 90               |
| Max. freq. [MHz]                 | 333         | 256                   | 209         | 266         | 174              |
| Throughput [Mbps]                | 1332        | 1028                  | 846         | 1596        | 1044             |
| Norm. throughput                 | 333         | 256                   | 209         | 399         | 261              |
| Supply voltage [V]               | 1.2         | 1.2                   | 1.2         | 1.2         | 1                |
| Power [mW]                       | 250         | 262                   | 255         | 281         | 700              |
| Energy [pJ/bit]                  | 188         | 255                   | 301         | 176         | 670              |
| Core area [mm<sup>2</sup>]       | 1.94        | 3.55                  | 3.93        | 3.79        | 6.23             |
| Gate count [kGE]<sup>1</sup>     | 220         | 402                   | 446         | 430         | 1559             |
| Delay [µs]                       | 0.3         | 0.65                  | 0.76        | 0.37        | 2.244            |

<sup>1</sup> Gate equivalent (GE) in 130 nm corresponds to the area of their FO4 NAND gates.



Fig. 16. (a) The Die micrograph. (b) Parameters normalized to the values at the nominal voltage.

According to Table V, the baseline MMSE detector for 64-QAM utilizes 1559 kGE, while the proposed hard decoding architecture utilizes 430 kGE more, which reflects the increased area cost of the proposed design (430/1559 ≈ 28%). Moreover, the MMSE detector for the SC-FDMA requires extra IDFT blocks and serial-to-parallel units, which will affect its area, latency and throughput, reducing the relative contribution of the proposed design to the overall area.

Table V shows the results of the application specific integrated circuit (ASIC) implementation, where the hard 16-QAM implementation reports the fully tested fabricated design. The second and third stages of the hard decoding architecture for 16-QAM were fabricated and fully tested using a Verigy 93k digital tester. Fig. 16(a) shows the die micrograph. Table VII shows the effect of the supply voltage on various parameters. Moreover, Fig. 16(b) shows how the speed, power, throughput and energy of the fabricated chip change with the supply voltage. As expected, higher supply voltages increase the speed and throughput at the cost of higher power and energy-per-bit consumption.
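The energy-per-bit figures can be cross-checked against the measured power and throughput. The Python sketch below recomputes them from the values reported in Table VII, assuming the energy per bit is simply the power divided by the throughput; the variable and function names are hypothetical.

```python
# (supply voltage [V], power [mW], throughput [Mbps], reported energy [pJ/bit])
TABLE_VII = [
    (0.80, 98, 444, 220),
    (0.90, 127, 505, 251),
    (1.00, 162, 1005, 161),
    (1.08, 196, 1130, 174),
    (1.20, 250, 1332, 188),
    (1.32, 326, 1523, 215),
]


def energy_pj_per_bit(power_mw, throughput_mbps):
    """mW / Mbps gives nJ/bit, so scale by 1000 to obtain pJ/bit."""
    return 1000.0 * power_mw / throughput_mbps


deviations = []
for vdd, p_mw, tput, reported in TABLE_VII:
    e = energy_pj_per_bit(p_mw, tput)
    deviations.append(abs(e - reported))
    print(f"{vdd:.2f} V: {e:6.1f} pJ/bit (reported {reported})")
```

The recomputed values agree with the reported ones to within 1 pJ/bit. Among the listed operating points, the lowest energy per bit occurs at 1 V.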

All of the achieved throughput values in both the FPGA and the ASIC implementations satisfy the throughput requirements of LTE and LTE-A.

 TABLE VI

 COMPARISON TABLE FOR THE MIMO SC-FDMA DETECTORS

|              | Antenna Config.                                                      | Mod.             | Soft/Hard | Tested chip  |
|--------------|----------------------------------------------------------------------|------------------|-----------|--------------|
| [4]          | $\begin{array}{c} 2 \times 2, 4 \times 4, \\ 8 \times 8 \end{array}$ | QPSK             | hard      | No           |
| [5]          | $2 \times 2$                                                         | QPSK             | hard/soft | No           |
| [6]          | $2 \times 2$                                                         | 16-QAM           | hard      | No           |
| [9]          | $2 \times 2$                                                         | QPSK<br>16-QAM   | hard      | No           |
| [10]         | $2 \times 2$                                                         | QPSK             | soft      | No           |
| [11]         | $1 \times 1$                                                         | QPSK             | hard      | No           |
| [18]         | $2 \times 2, 4 \times 4$                                             | 16-QAM           | soft      | FPGA         |
| This<br>work | $4 \times 4$                                                         | 16-QAM<br>64-QAM | hard/soft | FPGA<br>ASIC |

 TABLE VII

 The Effect of the Supply Voltage on Various Parameters

| Voltage [V]       | 0.8  | 0.9 | 1    | 1.08 | 1.2  | 1.32 |
|-------------------|------|-----|------|------|------|------|
| Min clk [ns]      | 8.99 | 7.9 | 3.97 | 3.53 | 3.00 | 2.62 |
| Power [mW]        | 98   | 127 | 162  | 196  | 250  | 326  |
| Throughput [Mbps] | 444  | 505 | 1005 | 1130 | 1332 | 1523 |
| Energy [pJ/bit]   | 220  | 251 | 161  | 174  | 188  | 215  |

 TABLE VIII

 Comparison Table for the Implementation Results

|           | Platform | Mod.   | Throughput [Mbps] (hard/soft) | Norm. throughput (hard/soft) |
|-----------|----------|--------|-------------------------------|------------------------------|
| [18]      | FPGA     | 16-QAM | NA/220                        | NA/110                       |
| This work | FPGA     | 16-QAM | 588/576                       | 147/144                      |
| This work | FPGA     | 64-QAM | 735/NA                        | 175/NA                       |
| This work | ASIC     | 16-QAM | 1332/846                      | 333/209                      |
| This work | ASIC     | 64-QAM | 1596/NA                       | 399/NA                       |

Table VI lists the specifications of the existing SC-FDMA detectors in the literature. Most of the reported works propose a system-level design without a hardware implementation. Table VIII compares the implementation results of the proposed architecture with those in [18], which shows that the proposed soft PDP architecture for the 16-QAM scheme achieves  $1.3 \times$  higher normalized throughput and  $2.6 \times$  higher throughput on the FPGA platform, compared to the best design to date for soft detection. The hardware implementation of the proposed architecture for 64-QAM is the first reported FPGA or ASIC implementation to date.

#### XII. CONCLUSION

The architecture of a practical receiver for MIMO SC-FDMA systems was implemented and tested on FPGA and ASIC platforms, resulting in a superior BER performance compared to the MMSE detector and lower complexity compared to currently reported designs. The BER performance of the proposed detection scheme is close to that of ML-PDP, while the reduction in complexity is significant for large constellation sizes. A soft decoding MIMO detector with reasonable complexity was also implemented for a MIMO SC-FDMA coded system, resulting in a significant enhancement in the performance.

#### ACKNOWLEDGMENT

The authors wish to thank Mario Milicevic for his help on the testing of the fabricated chip.

#### References

- [1] K. Neshatpour, M. Mahdavi, and M. Shabany, "A low-complexity high-throughput ASIC for the SC-FDMA MIMO detectors," in *Proc. IEEE Int. Symp. Circuits Syst. (ISCAS)*, May 2012, pp. 3065–3068.
- [2] "Requirement for further advancements for evolved universal terrestrial radio access (E-UTRA) (LTE-Advanced)," 3GPP TR 36.913 Release 9 Dec. 2009.
- [3] H. G. Myung, "Single-carrier orthogonal multiple access technique for broadband wireless communication," Ph.D dissertation, Dept. Elect. Comput., Polytechnic Univ., Brooklyn, NY, USA, 2007.
- [4] W. Bai, D. Yan, Y. Xiao, and S. Li, "Performance evaluation of MIMO SC-FDMA system with FDE receiver," in *Proc. Int. Conf. Wireless Commun. Signal Process. (WCSP)*, Nov. 2009, pp. 1–5.
- [5] Z. Pan, G. Wu, S. Fang, and D. Lin, "Practical soft-SIC detection for MIMO SC-FDMA system with co-channel interference," in *Proc. Int. Conf. Wireless Commun. Signal Process. (WCSP)*, Oct. 2010, pp. 1–5.
- [6] B. Dhivagar, K. Kuchi, and K. Giridhar, "An iterative MIMO-DFE receiver with MLD for uplink SC-FDMA," in *Proc. Natl. Conf. Commun.* (NCC), Feb. 2013, pp. 1–4.
- [7] R. Ferdian, K. Anwar, and T. Adiono, "Efficient equalization hardware architecture for SC-FDMA systems without cyclic prefix," in *Proc. Int. Symp. Commun. Inf. Technol. (ISCIT)*, Oct. 2012, pp. 936–941.
- [8] "Evolved universal terrestrial radio access (E-UTRA), physical channels and modulation," 3GPP TS 36.211 Release 9 Mar. 2010.
- [9] H. Noh, M. Kim, J. Ham, and C. Lee, "A practical MMSE-ML detector for a MIMO SC-FDMA system," *IEEE Commun. Lett.*, vol. 13, no. 12, pp. 902–904, Dec. 2009.
- [10] X. Liu, X. He, W. Ren, and S. Li, "Evaluation of near MLD algorithms in MIMO SC-FDMA system," in *Proc. Int. Conf. Wireless Commun. Netw. Mobile Comput (WiCOM)*, Sep. 2010, pp. 1–4.
- [11] S. Lim, T. Kwon, J. Lee, and D. Hong, "A new grouping-ML detector with low complexity for SC-FDMA systems," in *Proc. IEEE Int. Conf. Commun. (ICC)*, May 2010, pp. 1–5.
- [12] X. N. Tran, A. T. Le, and T. Fujino, "Performance comparison of MMSE-SIC and MMSE-ML multiuser detectors in a STBC-OFDM system," in *Proc. IEEE Int. Symp. Pers., Indoor, Mobile Radio Commun. (PIMRC)*, Sep. 2005, vol. 2, pp. 1050–1054.
- [13] A. Burg, M. Borgmann, M. Wenk, M. Zellweger, W. Fichtner, and H. Bolcskei, "VLSI implementation of MIMO detection using the sphere decoding algorithm," *IEEE J. Solid-State Circuits*, vol. 40, no. 7, pp. 1566–1577, Jul. 2005.
- [14] M. Wenk, M. Zellweger, A. Burg, N. Felber, and W. Fichtner, "K-best MIMO detection VLSI architectures achieving up to 424 Mbps," in *Proc. IEEE Int. Symp. Circuits Syst. (ISCAS)*, May 2006, pp. 1154–1158.
- [15] C. Studer, S. Fateh, and D. Seethaler, "ASIC implementation of soft-input soft-output MIMO detection using MMSE parallel interference cancellation," *IEEE J. Solid-State Circuits*, vol. 46, no. 7, pp. 1754–1765, Jul. 2011.
- [16] S. Yoshizawa, Y. Yamauchi, and Y. Miyanaga, "A complete pipelined MMSE detection architecture in a 4 × 4 MIMO-OFDM receiver," in *Proc. IEEE Int. Symp. Circuits Syst. (ISCAS)*, May 2008, pp. 2486–2489.

- [17] J. Benesty, H. Yiteng, and C. Jingdong, "A fast recursive algorithm for optimum sequential signal detection in a BLAST system," *IEEE Trans. Signal Process.*, vol. 51, no. 7, pp. 1722–1730, Jul. 2003.
- [18] G. Wang, B. Yin, K. Amiri, Y. Sun, M. Wu, and J. R. Cavallaro, "FPGA prototyping of a high data rate LTE uplink baseband receiver," in *Conf. Rec. Asilomar Conf. Signals, Syst., Comput.*, Nov. 2009, pp. 248–252.



Katayoun Neshatpour received her B.Sc. degree in electrical engineering from Isfahan University of Technology, Isfahan, Iran, in 2009, and her M.Sc. degree in electrical engineering from Sharif University of Technology, Tehran, Iran, in 2012. She is currently a Research Assistant working toward her Ph.D. degree in the Electrical and Computer Engineering Department at George Mason University, Fairfax, VA, USA. Her research interests include digital design and computer architecture.



Mahdi Shabany received his M.Sc. and Ph.D. degrees in electrical engineering from the University of Toronto, Toronto, ON, Canada, in 2004 and 2008, respectively. He is an Associate Professor in the Electrical Engineering Department at the Sharif University of Technology, Tehran, Iran. From 2007 to 2008, he was with Redline Communications Co., Toronto. He holds three U.S. patents. His main research interests include digital electronics, and VLSI architecture/algorithm design for mobile health, and broadband communication systems.



**Glenn Gulak** is a Professor in the Department of Electrical and Computer Engineering at the University of Toronto, Toronto, ON, Canada. He is a Registered Professional Engineer in the Province of Ontario. He has authored or coauthored more than 150 publications in refereed journal and refereed conference proceedings. In addition, he has received numerous teaching awards for undergraduate courses taught in both the Department of Computer Science and the Department of Electrical and Computer Engineering at the University of Toronto. From January 1985 to January 1988 he was a Research Associate in the Information Systems Laboratory and the Computer Systems Laboratory at Stanford University. He has served on the ISSCC Signal Processing Technical Subcommittee from 1990 to 1999, ISSCC Technical Vice-Chair in 2000 and served as the Technical Program Chair from 2005 to 2008. He currently serves as the Vice-President of the Publications Committee for the IEEE Solid-State Circuits Society and a member of the IEEE PSPB.