# VLSI Reconfiguration Architecture of Radix-2 FFT on 130nm Technology

Pham Dang Lam, Nguyen Trong Ngo Nhat Du, Ngo Thanh Dat, Hoang Trang Faculty of Electrics & Electronics, Ho Chi Minh city University of Technology

Email: lamd.pham@hcmut.edu.vn, nhatdu.bentre@gmail.com, thanhdat5494@gmail.com, hoangtrang@hcm.edu.vn

Abstract- It is irrefutable fact that Fast Fourier Transform (FFT) has applied popularly to a wide range of applications such as Orthogonal Frequency Division Multiplexing (OFDM), Mel Frequency Cepstral Coefficients known as one of methods applying to audio feature extraction in speech recognition systems, and FFT architecture has become an essential component in any digital signal systems. Currently, almost researches have approached FFT feature through software solutions due to flexibility or high accuracy that are also drawbacks as regards hardware architecture while hardware approaches have presented high performance in terms of realtime applications and ability of integration in applied circuits. By approaching hardware development, an effective hardware architecture of radix-2 FFT, in which not only a great variety of FFT points can be reconfigured but high accuracy is also satisfied mainly because of applying floating point operations, is proposed to be able to tackle issues realting to dynamic FFT points and accuracy. Such this structure follows Application-Specific Integrated Circuit (ASIC) design flow so that it can be integrated in any system feasibly.

**Keywords-** FFT (Fast Fourier Transform), MDC (Multipath Delay Commutator), OFDM (Orthogonal Frequency Division Multi-plexing), MFCC (Mel Frequency Cepstral Coefficients), dáu chấm động

## I. INTRODUCTION

With increasing development of digital signal processing field, FFT has become essential component so that such FFT feature need to be concerned as regards convenient architecture as well as high accuracy. It has experienced that a majority number of FFT algorithms have approached basing on software applications accompanying certain high performance computers [1]. As regards hardware approach, because of high requirements of ability of integration as well as low cost of implementation, a considerable number of hardware structure of FFT have proposed specially to be able to integrate in MFFC as well as OFDM architectures. Particularly, current survey [2, 3, 4] in table 1 has shown that authors proposed different FFT configurations that is reasonable for obtaining high performance of whole MFCC while another survey [5, 6, 7, 8, 9] described in table 2 witnessed a narrow range of FFT points with the highest number of FFT points recoded at 256. Through such surveys, it can be concluded that significant change of the number of FFT points bases on certain applications or to be able to achieve final high performance of whole system. The second worth mentioned is that hardware

architectures of FFT have referred to small number of FFT points causing of cost of hardware-base implementation. In order to cover such current issues, a dynamic architecture of FFT with ability of reconfiguring FFT points from 8 up to 4096 as well as integration of 32-bit floating point operations flowing IEEE 754 standard is proposed in this research.

The rest of paper is organized follows. Section II introduces detailed architecture of proposed. Next, Section III experiences obtained results regarding to highest frequency, area report and comparison to other designs, and Section IV concludes the paper and future work.

| Table 1 FFT configuration applied in MFCC hardware architectu |
|---------------------------------------------------------------|
|---------------------------------------------------------------|

| Authors                  | Number<br>of FFT<br>points | Hardware<br>approach | Integrated<br>in whole<br>structure | Timing<br>consumption |
|--------------------------|----------------------------|----------------------|-------------------------------------|-----------------------|
| GIN-DER<br>WU [2]        | 256<br>(radix-2<br>FFT)    | ASIC<br>(0.18µm)     | MFCC in<br>speech<br>recognition    | 10,4 µs               |
| Chin-<br>Teng Lin<br>[3] | 256<br>(radix-16<br>FFT)   | ASIC<br>(0.13µm)     | MFCC in<br>speech<br>recognition    | -                     |
| Dongsuk<br>Jeon [4]      | 1024<br>(radix-4<br>FFT)   | ASIC (65<br>nm)      | MFCC in<br>speech<br>recognition    | 6,7 µs                |

| Author                    | Numb<br>er of<br>FFT<br>points  | Hardware<br>approach        | How to<br>impleme<br>nt | Integrat<br>ed in<br>whole<br>tructur<br>e | Timing<br>consum<br>ption |
|---------------------------|---------------------------------|-----------------------------|-------------------------|--------------------------------------------|---------------------------|
| Lihong<br>Jia [5]         | 128<br>(radix-<br>2/4/8<br>FFT) | ASIC<br>(0.6µm)             | Pipeline                | OFDM                                       | 3 µs                      |
| Atin<br>Mukherj<br>ee [6] | 8<br>(radix-<br>2 FFT)          | FPGA<br>(Xilinx<br>Virtx-6) | Butterfly               | OFDM                                       | 19,598<br>ns              |

Table 2 FFT configuration applied in OFDM system

| Jungmi<br>n<br>Park[7] | 64 –<br>8K<br>(radix-<br>8 FFT) | FPGA<br>(Xilinx<br>Virtex-5) | Butterfly | OFDM          | 0,33 µs<br>(64-point<br>FFT)<br>96,20 µs<br>(8000-<br>point<br>FFT) |
|------------------------|---------------------------------|------------------------------|-----------|---------------|---------------------------------------------------------------------|
| K.<br>Umapat<br>hy[8]  | 128<br>(radix-<br>2/4<br>FFT)   | ASIC (90<br>nm)              | Pipeline  | MIMO-<br>OFDM | 40 µs                                                               |
| Ediz<br>Çetin[9]       | 256<br>(radix-<br>2 FFT)        | ASIC<br>(0.7µm)              | Butterly  | OFDM          | 102,4 µs                                                            |

## II. PROPOSED HARDWARE ARCHITECTURE OF FFT

#### 2.1 FFT algorithm

FFT is one of algorithms applied to calculate DFT quickly in almost digital system popularly. Currently, there are two ways of calculating DFT that base on timing domain or frequency domain in which different radix such as 2, 4, 8 or higher radixes can be selected optionally. Such proposed architecture selects radix-2 configuration in timing domain applied in almost digital system as previous surveys in table 1 and table 2. From such this approach, DFT with N points will be implemented though FFT with m stages where  $N = 2^m$ . Mathematically, DFT is transferred to sum of two components in which the first component is DFT for odd number while another is even one presented from the first to the fifth statement.

$$X[k] = \sum_{n=0}^{N-1} x(n) W_N^{nk} , k = 0, 1, 2, ..., N-1$$
  
với  $W_N^{nk} = e^{-j \left(\frac{2\pi}{N}\right)nk}$  (1)

$$=\sum_{n=0}^{\frac{N}{2}-1} x(2n) e^{-\left(j\frac{2\pi \times (2n)k}{N}\right)} + \sum_{n=0}^{\frac{N}{2}-1} x(2n+1) e^{-\left(j\frac{2\pi \times (2n+1)k}{N}\right)}$$
(2)

$$=\sum_{n=0}^{\frac{N}{2}-1} x(2n) e^{-\left(j\frac{2\pi nk}{N}\right)} + e^{-\left(j\frac{2\pi k}{N}\right)} \sum_{n=0}^{\frac{N}{2}-1} x(2n+1) e^{-\left(j\frac{2\pi nk}{N}\right)}$$
(3)

$$= DFT_{\frac{N}{2}} \left[ \left[ x(0), x(2), ..., x(N-2) \right] \right] + W_{N}^{k} DFT_{\frac{N}{2}} \left[ \left[ x(1), x(3), ..., x(N-1) \right] \right]$$
(4)

$$=DFT_{\frac{N}{2}}\left[x_{even}\left(n\right)\right]+W_{N}^{k}DFT_{\frac{N}{2}}\left[x_{odd}\left(n\right)\right]$$
(5)

According to the fifth statement, butterfly diagram for calculating DFT is proposed as figure 1 bellow in which there are totally m stages. In every stage, different pairs of inputs are calculated basing on rule of butterfly diagram so that total operations after all stages perform statement 5. The operations between two FFT-point values are similar in any butterfly unit as figure 2.



Figure 1 Butterfly diagram for N-point FFT



Figure 2 Operations inside butterfly unit

#### 2.2 Proposed FFT architecture

According to butterfly diagram, it is fact that every butterfly unit is called many times in every stage and that is similar in other stages, but every butterfly unit receives different inputs accompanying different weights. Form such circumstance, an effective FFT architecture in which all operations for a butterfly unit is called FFT core and an FFT-controller structure keeps the role of controlling suitable data inputs transferring into butterfly unit is proposed. As regard flow of input data, initially both value of N points and weight  $W_{N}^{k}$ , initial integer values of N point is loaded from Input\_Memory to MEM\_INIT\_1 and MEM\_INIT\_4 to be able to calculate for the first stage of butterfly diagram. During operation in every butterfly unit inside FFT core, weight  $W_N^k$  is read from *W*-Real and *W*\_image memories simultaneously. Real and Image results for every butterfly unit will be written into MEM\_INT\_3 and MEM\_IN\_6 memories. After finishing all butterfly units per stage and storing all results into such two memories, all data will be read out and written into MEM\_INT\_1, MEM\_INT\_2, MEM\_INT\_4 and <u>MEM\_INT\_5</u> memories to be used for next stage's operations. In every stage, FFT core is called N/2 times corresponding to N points. Basing on operations of all internal memories, interface of proposed FFT architecture is described in figure 3, figure 4 and table 3.



Figure 3. Block diagram of proposed FFT



Figure 4. Interface of proposed FFT architecture

Table 3. Detailed interface of proposed FFT architecture

| Name         | Туре  | Bit<br>Number | Describe                         |
|--------------|-------|---------------|----------------------------------|
| addr_input   | input | 12            | Address for FFT input data       |
| data_input   | input | 32            | FFT input data                   |
| rd_ena_input | input | 1             | Enable signal for FFT input data |
| fft_ena      | input | 1             | Enable signal for complete FFT   |
| addr_w_r     | input | 12            | Address for real value of weight |

| data_w_r   | input  | 32 | Real value of weight                                  |
|------------|--------|----|-------------------------------------------------------|
| addr_w_i   | input  | 12 | Address for image value of<br>weight                  |
| data_w_i   | input  | 32 | Image value of weight                                 |
| rd_ena_w   | output | 1  | Enable signal for weight input<br>data                |
| addr_out_r | output | 12 | Address for real value of FFT output data             |
| data_out_r | output | 32 | Real value of FFT output data                         |
| addr_out_i | output | 12 | Address for image value of FFT output data            |
| data_out_i | output | 32 | Image value of FFT output data                        |
| wr_ena     | output | 1  | Enable signal to write data into<br>internal memories |
| end_fft    | input  | 1  | Notify FFT finished                                   |

In order to read or write data during operating butterfly unit, a controller is considered as figure 5.



Figure 5 FFT controller

There are three main states comprising of initial prepare for input data, FFT core operation and rearrangement after every stage. Detailed states and related output signals are described in figure 6 and table 4, 5, 6.

| Present<br>state | Function                                                |  |  |
|------------------|---------------------------------------------------------|--|--|
| RESET            | Reset a system                                          |  |  |
| INITIAL          | Initial state of internal memories reading              |  |  |
| PREPARE          | Implement the bit-reserved algotithm for the input data |  |  |
| FFT_CORE         | Implement the butterfly computation                     |  |  |
| WAIT             | Until a butterfly computation has been finished         |  |  |
| ARRANGE          | Implement the arrangement of output data of             |  |  |



Figure 5 State machine of FFT with N points.

# Table 5 State machine in FFT block

| Present  | Next state | Condition                            |
|----------|------------|--------------------------------------|
| state    |            |                                      |
| RESET    | INITIAL    | When receive an enable signal from   |
|          |            | Main Control                         |
| INITIAL  | PREPARE    | When receive an enable signal from   |
|          |            | Main Control                         |
| PREPARE  | FFT_CORE   | When receive an enable signal from   |
|          |            | Main Control                         |
| FFT_CORE | WAIT       | Butterfly computation has already    |
|          |            | done (The counter is over)           |
| WAIT     | ARRANGE    | After 1 clock                        |
| ARRANGE  | FFT_CORE   | 1 stage has not yet done             |
|          | INITIAL    | Final stage finishes (The counter is |
|          |            | over)                                |

Table 6 Function of internal memories inside FFT

| Interal<br>Memory | Description                                                                                    |
|-------------------|------------------------------------------------------------------------------------------------|
| MEM_INPUT         | Contain the value of input data ( just has the real part)                                      |
| 0                 | The imaginary part of input data is zero at initial sate                                       |
| MEM_INT_1         | Contain the real part of the first complex number<br>participates in butter fly computation    |
| MEM_INT_2         | Contain the real part of the second complex number participates in butter fly computation      |
| MEM_INT_4         | Contain the imaginary part of the first complex number participates in butter fly computation  |
| MEM_INT_5         | Contain the imaginary part of the second complex number participates in butter fly computation |
| MEM_INT_3         | Contain the real part of the complex result number after butter fly computation                |
| MEM_INT_6         | Contain the imaginary part of the complex result number after butter fly computation           |
| W_REAL            | Contain the real part of a twiddle factor                                                      |
| W_IMAGE           | Contain the imaginary part of a twiddle factor                                                 |

## **III. EXPERIMENTS AND RESULTS**

According to ASIC design flow, initially FFT architecture is developed by Verilog HDL in Register Transfer level, followed

by RTL verification with Matlab model to confirm mean of absolute error as table 7.

| Table 7 Mean of absolute error between | Verilog result and I | Matlab |
|----------------------------------------|----------------------|--------|
| model                                  |                      |        |

| FFT    | Mean of absolute error  | Mean of absolute error  |
|--------|-------------------------|-------------------------|
| points | for real value          | for image value         |
| 8      | 1,034 .10 <sup>-7</sup> | 2,649 .10 <sup>-7</sup> |
| 16     | 2,915 .10 <sup>-7</sup> | 6,786 .10 <sup>-7</sup> |
| 32     | 5,278 .10 <sup>-8</sup> | 2,930 .10 <sup>-8</sup> |
| 64     | 1,286 .10 <sup>-7</sup> | 2,459 .10 <sup>-8</sup> |
| 128    | 3,083 .10 <sup>-7</sup> | 1,326 .10 <sup>-8</sup> |
| 256    | 1,009 .10 <sup>-7</sup> | 1,351 .10 <sup>-8</sup> |
| 512    | 1,107 .10 <sup>-7</sup> | 7,432 .10 <sup>-8</sup> |
| 1024   | 1,138 .10 <sup>-7</sup> | 9,462 .10 <sup>-9</sup> |

Where the formulate

$$E = \frac{\sum_{i}^{n} / \mathbf{x}_{i} - \mathbf{y}_{i} / \mathbf{x}_{i}}{n}$$
(6)

For the number of FFT points from 8 to 4096, table 8 presents the number of cycles and corresponding timing consumption with achieved frequency at 500 MHz. According to FFT architecture and obtained waveforms, timing consumption can be calculated as statement 7.

## **Table 8 Timing consumption**

| Number of<br>FFT points | The total cycle consumption | The total timing consumption (ns) |
|-------------------------|-----------------------------|-----------------------------------|
| 8                       | 780                         | 1560                              |
| 16                      | 1710                        | 3420                              |
| 32                      | 3760                        | 7520                              |
| 64                      | 8370                        | 16740                             |
| 128                     | 18740                       | 37480                             |
| 256                     | 41910                       | 83820                             |
| 512                     | 93240                       | 186480                            |
| 1024                    | 206010                      | 412020                            |
| 2048                    | 451900                      | 903800                            |
| 4096                    | 984510                      | 1969020                           |

$$T = 10 \times [T_{FIRST \ ARRANGE} + (\log_2 N - 2) \times T_{BUTTERFLY \& WAITE \& LOOP \ ARRANGE} + T_{LOOP \ ARRANGE}]$$

$$= 10 [(3N + 15) + (\log_2 N - 2) \times (2N + 13) + (N + 2)]$$

$$= 10 [(2N + 13) \times \log_2 N - 9]$$
(7)

#### Table 9 Comparison to other designs

| Authors                  | Target<br>Hardwar<br>e and<br>Technolo<br>gy | Numb<br>er of<br>FFT<br>points | Achieve<br>d<br>Freque<br>ncy<br>(MHz) | Power<br>(mW) | Timing<br>consumpti<br>on |
|--------------------------|----------------------------------------------|--------------------------------|----------------------------------------|---------------|---------------------------|
| GIN-DER<br>WU [2]        | ASIC<br>(0.18µm)                             | 256                            | 100                                    | 89,18         | 10.4 µs                   |
| Chin-Teng<br>Lin [3]     | ASIC<br>(0.13µm)                             | 256                            | 100                                    | 22.37         | -                         |
| Dongsuk<br>Jeon [4]      | ASIC<br>(65 nm)                              | 1024                           | 19                                     | -             | 6,7 µs                    |
| Lihong Jia<br>[5]        | ASIC<br>(0.6µm)                              | 128                            | 50                                     | 400           | 3 µs                      |
| Atin<br>Mukherjee<br>[6] | FPGA<br>(Xilinx<br>Virtx-6)                  | 8                              | 51                                     | -             | 19.598 ns                 |
| K.<br>Umapathy<br>[8]    | ASIC<br>(90 nm)                              | 128                            | 40                                     | -             | 40 µs                     |
| Ediz Çetin<br>[9]        | ASIC<br>(0.7µm)                              | 256                            | 40                                     | -             | 102,4 µs                  |
| FFT đề<br>nghị           | ASIC<br>(130nm)                              | 8-4096                         | 500                                    | 3.44          | 1.5µs-<br>1.969ms         |

Basing on the table 9 and statement 7, the number of cycle consumption is up to the number of FFT points. However with the achieved highest frequency at 500 MHz, timing consumption is very small with the recorded figure at only 2ms for the largest configuration reported at 4096 points that satisfies any real-time applications. In addition, result of synthesis on target technology at 130nm not only shows high performance in comparison to other designs but it also proves high flexibility with a wide range of number of FFT points. One of concerned issue is that such architecture need considerable

number of internal memories that also constrains both area cost and limitation of FFT points. In this proposed architecture 128Kb memory for maximum FFT points at 4096 for every internal memory is compatible to current state of art embedded memory integrated in chip.

#### **IV.** CONCLUTION

In this research, an effective hardware architecture of FFT in which user can reconfigure the number of FFT points by input data through internal memory is proposed. According to proposed design, not only flexibility is very compatible to a wide range of application but it has experienced high performance compared with other hardware design. Such this FFFT is the first phase of complete dynamic MFCC architecture that will be included in ASR system in the future.

**Acknowledgement** - The research is funded by Vietnam Government through project with KC.01.23/11-15

#### REFERENCES

- Teo Cupaiuolo, Daniele Lo Iacono, "A Flexible and Fast Software Implementation of FFT on the BPE platform" in Design, Automation & Test in Europe Conference & Exhibition, March 2012, pp.1467-1470.
- [2] Gin-der Wu, Ying Lei, "A Register Array Based Low Power FFT Processor" in Journal of Information Science and Engineering, vol.24, Issue 3, pp. 981-991, 2008.
- [3] Chin-Teng Lin, Yuan-Chu Yu, Lan-Da Van, "Cost-Effective Triple-Mode Reconfigurable Pipeline" in IEEE Transactions On Very Large Scale Integration (VLSI) Systems, vol. 16, no. 8, pp. 1058-1071, 2008.
- [4] Dongsuk Jeon, Mingoo Seok, Chaitali Chakrabarti, David Blaauw, Dennis Sylvester, "Energy-Optimized High Performance FFT Processor" in ICASSP, 2011, pp. 1701-1704.
- [5] Lihong Jia, Bingxin Li, Yonghong Gao, Hannu Tenhunen, "Implementation of A Low Power 128-Point FFT" in Solid-State and Integrated Circuit Technology, Beijing, 1998, pp.369-372.
- [6] Atin Mukherjee, Amitabha Sinha, Debesh Choudhury, "A Novel Architecture of Area Efficient FFT" in ACM SIGARCH Computer Architecture News, December 2014.
- Jungmin Park, "Design of a radix-8/4/2 FFT processor for OFDM" in Iowa State University of Science and Technology, Ames, Iowa, 2011.
- [8] K. Umapathy, D. Rajaveerappa, "Low Power 128-Point Pipeline FFT Processor using Mixed Radix 4/2 for MIMO OFDM Systems" in International Journal of Soft Computing and Engineering (IJSCE), vol. 2, no. 5, pp. 177-179, November 2008.
- [9] Ediz Çetin, Richard C. S. Morling, Izzet Kale, "An Integrated 256point Complex FFT Processor for Real-time Spectrum Analysis and Measurement" in IEEE Proceedings of Instrumentation and MeasurementTechnology Conference, Canada, May 1997, pp.96-101.