A LOW COST PACKET DETECTOR IN OFDM-BASED ULTRA-WIDEBAND SYSTEMS

Jyh-Ting (Justin) Lai, Chun-Yuan Chu, An-Yeu (Andy) Wu, and Wen-Chiang Chen*

Graduate Institute of Electronics Engineering, and Department of Electrical Engineering, National Taiwan University
*Faraday Technology Corporation

Abstract
Multiband orthogonal frequency-division multiplexing (MB-OFDM) systems employ frequency-hopping technology to achieve the capabilities of multiple access and frequency diversity. However, they also complicate the packet detector (PD) in terms of the requirement for the high hardware complexity. In this paper, we propose several low-cost design schemes for the PD, such as Walsh-Hadamard decomposition, buffered summation, and sign-bit-remaining methods. The estimated gate count of the resulting implemented PD is less than half that of existing solutions.

1. INTRODUCTION
Multiband orthogonal frequency-division multiplexing (MB-OFDM) technology (or so-called time-frequency interleaved OFDM) has recently been applied to wireless personal area networks [1][2]. This technique increases both the traffic capacity and the frequency diversity. Furthermore, MB-OFDM employs several types of time-frequency code (TFC) in the preamble part to easily identify the packet properties and frequency-hopping sequences. However, even though MB-OFDM can lever the successful experience of other OFDM systems, such as wireless local area network (WLAN) [3][4], there are still many differences between MB-OFDM and conventional OFDM systems. These differences are often the bottlenecks of system performance. As illustrated in the dashed box of Fig. 1, the potential bottlenecks of the receiver front-end are as follows:

1) Feedback control to the radio frequency circuit. Controlling the hopping of the radio frequency (RF) circuit requires a feedback path from baseband to RF. The building blocks of the analog/RF circuit and the front-end for the baseband are shown in Fig. 1. $z^D$ and $z^C$ denote the latencies of analog-to-digital converter (ADC) and packet detector (PD). $z^D$ and $z^C$ are the processing delays of the controller and the RF circuit. The processing delay of the feedback loop results in an iteration bound [7] that limits the system operation speed and performance.

2) Large hardware complexity. The receiver should identify different TFCs by detecting the preamble symbols of received packets. Hence, the order of the hardware complexity is proportional to both the frequency diversity and the types of preamble. For example, there are 6 preamble types for 6 band-hopping sequences, so the receiver PD has to handle 6 different cases and also has to control the RF hopping process very rapidly. In addition, because the OFDM symbols are transmitted sequentially in 3 bands [1], the channel parameters, such as the channel group delays, channel impulse responses (CIRs), and carrier frequency offsets (CFOs), are all unfolded 3 times. Huge hardware complexity is therefore unavoidable.

3) Low power requirement. MB-OFDM systems are often applied in portable devices, and hence low power consumption is also required.

Fig. 1 Critical components of the MB-OFDM receiver.

In this paper we systematically analyze the challenges and implementation difficulties of the MB-OFDM system, focusing on the baseband front-end circuits shown in the dashed box of Fig. 1. To address the above-mentioned difficulties, we propose a combined solution that...
includes both performance and hardware improvements. The proposed methods are the sign-bit-remaining method, Walsh-Hadamard (W-H) decomposition method, buffered-summation method, and pseudo noncausal relaxation (PNCR) method.

2. HARDWARE IMPLEMENTATION PROBLEMS

In addition to the above issues, there are many stringent hardware implementation problems associated with MB-OFDM:

1) Large hardware complexity for different TFCs: To meet the spectrum mask of the RF band defined by the FCC, the MB-OFDM specification exploits coefficients of preamble sequences with more than 5-bit precision [1]. Moreover, there are 6 TFCs, each with 128 different taps. Therefore, if we implement the circuit directly (we denote the circuit as “direct implementation” in this paper), the total coefficients of the matched filter comprise 5×128×6 bits. Low-cost design must therefore be considered both in the transmitter and the receiver.

2) The RF-baseband iteration bound: The building blocks of the RF and baseband circuits are shown in Fig. 1. Because zero padding is exploited in MB-OFDM to reduce the power consumption and the passband ripple [1], the receiver should postpone T samples to collect the postcursor of the OFDM symbol in order to add it back to the FFT window. The circuit is denoted as the overlap and add (OLA) circuit.

The total aforementioned delay should be less than the zero-padding length (denoted as ZP) between adjacent symbols, yielding

\[ ZP \geq D + P + T + R + C. \]  

(1)

Typical values used in the above equation are \( ZP = 37, \ D = 5 \) (pipeline ADC), \( R = 1, \ C = 5 \) (corresponding to a 9-ns switching time as proposed in [1] with a 1.89-ns sampling rate), and \( P = 4 \) (by the unfolding 4 method [7]). Thus, the OLA delay \( T \) is less than 22, which is not met by the required OLA period \( T \) of channels CM3 and CM4 [4]. Therefore, the long latency of the feedback loop could degrade the performance for long-duration channels.

In Section 3 we derive our PD algorithm that can alleviate the above problems.

3 THE PD ALGORITHM

The proposed PD comprises the two major components shown in Fig. 2: the Matched Filter and Power Meter. In this section we briefly introduce the algorithms for both components and the overall properties.

![Packet Detector (PD)](image)

The proposed PD structure is shown in Fig. 2. The upper block in the figure is the matched filter, which is an FIR filter with preamble coefficients as filter coefficients, and the lower block is the power meter, which provides the power sum of tap data. Before deriving the algorithm, we explain the basic properties of the preamble coefficients of the MB-OFDM system. We define the coefficient vector as

\[ C = [c_0, c_1, \ldots, c_{N-1}]^T, \]

(2)

where the \( c_i \) variables are the preamble coefficients. The inner product of the coefficient vector is normalized as [1]

\[ C^H C = N. \]

(3)

Because zero-padding signaling is adopted in the MB-OFDM system, the transmitted OFDM symbol can be expressed as the following vector form:

\[ s(n) = [s(n), s(n+1), \ldots, s(n+W-1)]^T, \]

(4)

\[ s(n) = \begin{cases} c_{((n))_W}, & \text{if } ((n))_W < N \\ 0, & \text{otherwise} \end{cases} \]

(5)

where \((n))_W \) denotes the remainder of \( n \) divided by \( W \). From the quasi-orthogonal properties of the preamble coefficients, the cross-correlation between the coefficient vector and transmitted symbol vector is

\[ C^T .0|^T_s(n) = \delta(k) + v(k) \equiv \delta(k) \]

(6)

and

\[ \kappa = ((n))_x, \]

(7)

where \( \delta(.) \) is the delta function, \( 0_l \) is the zero vector with length L, and \( v(.) \) is the small sidelobe signal which can be neglected because \( ||v(k)|| \gg ||\delta(k)|| \) for all \( \kappa \).

To simplify the derivation, we assume that the CFO is small, so the received signal of the preamble part can be expressed in the following matrix form [5]:

\[ X(n) = S(n)h + w(n), \]

(8)

\[ X(n) = [x(n), x(n+1), \ldots, x(n+W-1)]^T, \]

(9)

\[ h = [h_0, h_1, \ldots, h_{N-1}]^T, \]

(10)
The PD output is connected to the decision circuit (i.e., the Controller), and we can set a threshold level \( \eta \) in the Controller. If the PD output \( D_o \) is larger than the level in the Controller, this means that a packet has arrived. The properties of the proposed PD are as follows:

1. From the extreme case of Eqs. (21) and (22), the output is bounded from \( N \) to \( N_\alpha \), so the threshold level is independent of the input power.

2. The division E/B in Eq.(20) and in Fig. 2 can be avoided simply by checking the following inequality:

\[ E_o \geq B_o \cdot \eta, \]

where \( \eta \) is the threshold level in the controller.

Finally, we define two measures to evaluate the performance of the PD system: the maximal correctly receiving probability (MCRP), which is given by

\[ \text{MCRP} = \max_{\eta} \{ P_{\text{success}}(\eta) \} ; \]

and the threshold range (TR) for which more than 90% packets are correctly received, which is given by

\[ TR = \max_{\eta} \{ \eta \} - \min_{\eta} \{ \eta \} \]

In next section, we will use the two measurements to evaluate the performance of the proposed PD system for different kinds of implementation methods.

4 HARDWARE IMPLEMENTATION OF THE PROPOSED PD

As mentioned in Section 1, the hardware complexity of the PDs is very high, and so a low-cost design is essential. We also propose a PNCR method for relaxing the iteration bound in Eq. (1). For clarity, we list our implementation schemes and their purposes in Table 1.

<table>
<thead>
<tr>
<th>Method</th>
<th>Applied Component</th>
<th>Purpose</th>
</tr>
</thead>
<tbody>
<tr>
<td>Sign-bit remaining</td>
<td>Matched filter</td>
<td>Low cost</td>
</tr>
<tr>
<td>W-H decomposition</td>
<td>Matched filter</td>
<td>Low cost</td>
</tr>
<tr>
<td>Buffered summation</td>
<td>Power meter</td>
<td>Low cost</td>
</tr>
<tr>
<td>PNCR</td>
<td>PD</td>
<td>Relaxation of iteration bound</td>
</tr>
</tbody>
</table>

Table 1 Implementation methods and their purposes.

4.1 Only the sign-bit remaining in preamble coefficients

The coefficients in the matched filter are longer than 5 bits to fully meet the precision of the original specification. This long word length will increase the complexity of both the transmitter and receiver. We reduce the word lengths (and hence the hardware requirements)
using a *canonical sign digit* on the transmitter side and a word-length-truncation method on the receiver side.

Assume that the ideal received signal is a vector $C$ as defined in Eq. (2) and that the truncated coefficient vector of the matched filter is $C_{\text{trun}}$. Table 2 lists the matched-filter output of $C^TC_{\text{trun}}$ for different truncated word lengths of the *preamble* coefficients. For the full word length, the value of $C^TC_{\text{trun}}$ is $N$ (128), as shown by Eq. (3). It is evident that if there is only one bit (the sign bit) for each coefficient, the orthogonality property still holds true and the loss at the matched-filter output is only 0.778 dB.

![Table 2](image)

4.3 *Buffered summation to reduce the squaring circuit*

Fig. 2 shows the 128 squaring circuits that are utilized to calculate the input signal power. However, we use the *bufferedsummation* method to reduce the hardware complexity in a real implementation, as shown in Fig. 4. When new data arrive, we add the new squared data into the accumulator and subtract the oldest squared data from the accumulator. This implementation requires only two squaring circuits and two adders (one adder for adding new data and the other for subtracting the oldest data).

![Fig. 4](image)

4.4 *PNCR method*
To relax the RF-baseband iteration bound as shown in Eq. (1), we approximate the matched-filter output as follows:

\[ y[n] = \sum_{i=1}^{N_c} c_i x_{i+n} + \sum_{i=0}^{K-1} c_{Q+i} x_{i+n+Q} \approx 10^{Q N_i} \frac{x_{in} x_{N}}{N_q}, \quad (27) \]

where the \( Q \) is the relaxation factor and \( k \) is the preamble symbol index. The principle of operation is demonstrated in Fig. 5. Fig. 5(a) shows the output waveform of the received data and the original output of the matched filter. If a preamble sequence is received, the matched filter of the PD will not detect the preamble symbol until the full FFT window of length \( N \) is filled. In contrast, Fig. 5(b) shows how \( Q \) samples of the previous preamble symbol are utilized to replace those data that are not yet received, because the current preamble is almost the same as the previous one if there is no noise or frequency offset, and thus the total latency is relaxed from \( N \) to \( N-Q \). For the first preamble symbol, the sensitivity of matched filter will be degraded by \( (N-Q)/N \) because we do not have previous preamble data in this case. However, the estimation bias can be corrected in the second preamble symbol by using the FHC state machine. Therefore, the PNCR method will only slightly degrade the PD performance. In addition, the PNCR can also enhance the bit error rate (BER) by enlarging the postcursor region and adding the samples in the region into the FFT window using the OLA circuit.

Fig. 5 The PD output for (a) a typical approach waveform and (b) the PNCR method.

4.5 Cost–performance comparison of the low-cost PD

The calculated hardware reductions of the above low-cost design methods are listed in Table 3(a), where \( N \) is the number of points in the FFT, \( F_D \) is the W-H hardware reduction ratio, \( N_C \) is the original word length of the preamble coefficients in the matched filter, \( F_T \) is the word-length-reduction ratio of the preamble coefficient, and \( N_X \) is the input data word length. We assume that the complexity of an adder depends on the input width, and that the complexity of the multiplier is the product of the word lengths of both inputs. Using real numerical values in Table 3(a) results in Table 3(b), where \( F_D = 128/(8+16) = 5, F_T = 5, N = 128, N_C = 5, \) and \( N_X = 5 \). This indicates that the final gate count of the combination circuit is only 2.28% that of the direct implementation. In addition, the hardware complexity is only half that in [8], where the cost-down approach reduces the tap number by 4, which degrades the performance by 6 dB (i.e., \( 10 \log 4 \) dB).

We have introduced several low-cost implementation methods to reduce the area and power requirements. To clarify the performance degradation for each low-cost method, we define PDs I–IV to enable a fair comparison of the performance. The used methods for PDs I–IV are listed in Table 4, in which the related MCRP and TR are also given. The table indicates that the TR is degraded by 8.1 (i.e., 18.2–10.1) after the fixed-point implementation. In addition, there is almost no degradation for the W-H decomposition (PD III) relative to the fixed-point version (PD II). The TR is further decreased by 2.7 (i.e., 9.8–7.1) after the implementation of the PNCR in PD IV. However, all of the PDs meet the stringent
specification of the MCRPs all being larger than 1-10^-5.

5 CONCLUSIONS

We have described the difficulties both in algorithm and hardware implementations. We propose a PD that requires only 2.28% of the gates and 50% of the hardware complexity relative to the direct implementation method and other existing solutions described in [8]. Moreover, the PD can cope with a worst-case SNR with a PDER of less than 10^-5.

<table>
<thead>
<tr>
<th>Component</th>
<th>Device</th>
<th>Direct Implementation</th>
<th>After Coefficient Truncation (Factor: F_T)</th>
<th>After W-H Decomposition (Ratio: F_D)</th>
<th>After Buffered Summation</th>
<th>Final Design</th>
</tr>
</thead>
<tbody>
<tr>
<td>Matched Filter</td>
<td>Adder</td>
<td>(N-1)N_x</td>
<td>-</td>
<td>(N/F_D-1)N_x</td>
<td>-</td>
<td></td>
</tr>
<tr>
<td></td>
<td>Multiplier</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Power Meter</td>
<td>Adder</td>
<td>(N-1)N_x</td>
<td>-</td>
<td>2N_x</td>
<td>2N_x</td>
<td></td>
</tr>
<tr>
<td></td>
<td>Square</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

Table 3 Comparison of hardware complexity: (a) mathematical derivation and (b) numerical values.

<table>
<thead>
<tr>
<th>Component</th>
<th>Device</th>
<th>Direct Implementation</th>
<th>After Coefficient Truncation (Factor: F_T)</th>
<th>After W-H Decomposition (Ratio: F_D)</th>
<th>After Buffered Summation</th>
<th>Final Design</th>
</tr>
</thead>
<tbody>
<tr>
<td>Matched Filter</td>
<td>Adder</td>
<td>635</td>
<td>635</td>
<td>115</td>
<td>115</td>
<td>115</td>
</tr>
<tr>
<td></td>
<td>Multiplier</td>
<td>3200</td>
<td>640</td>
<td>0</td>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>Power Meter</td>
<td>Adder</td>
<td>635</td>
<td>635</td>
<td>635</td>
<td>50</td>
<td>50</td>
</tr>
<tr>
<td></td>
<td>Square</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Total gate / remaining area Percentage</td>
<td>7670</td>
<td>66.62%</td>
<td>51.50%</td>
<td>2.28%</td>
<td>2.28%</td>
<td></td>
</tr>
</tbody>
</table>

"-": No modification

(b)

Table 4 PD hardware schemes and related implementation losses.

<table>
<thead>
<tr>
<th>PD scheme name</th>
<th>Fixed Point</th>
<th>Implementation methods All low-cost methods without PNCR</th>
<th>All low-cost methods with PNCR</th>
<th>MCRP</th>
<th>TR</th>
</tr>
</thead>
<tbody>
<tr>
<td>PD I</td>
<td>-</td>
<td>-</td>
<td>-</td>
<td>&gt;1-10^-5</td>
<td>18.2</td>
</tr>
<tr>
<td>PD II</td>
<td>✓</td>
<td>-</td>
<td>-</td>
<td>&gt;1-10^-5</td>
<td>10.1</td>
</tr>
<tr>
<td>PD III</td>
<td>✓</td>
<td>✓</td>
<td>-</td>
<td>&gt;1-10^-5</td>
<td>9.8</td>
</tr>
<tr>
<td>PD IV</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>&gt;1-10^-5</td>
<td>7.1</td>
</tr>
</tbody>
</table>

"✓/✓/✓": denote using/not using the method.

REFERENCES