Towards Deep Learning using TensorFlow Lite on RISC-V

Marcia Sahaya Louis
marcia93@bu.edu
Boston University

Zahra Azad
zazad@bu.edu
Boston University

Leila Delshadtehrani
delshad@bu.edu
Boston University

Suyog Gupta
suyoggupta@google.com
Google Inc.

Pete Warden
peteward@google.com
Google Inc.

Vijay Janapa Reddi
vj@eecs.harvard.edu
Harvard University

Ajay Joshi
joshi@bu.edu
Boston University

Abstract
Deep neural networks have been extensively adopted for a myriad of applications due to their ability to learn patterns from large amounts of data. The desire to preserve user privacy and reduce user-perceived latency has created the need to perform deep neural network inference tasks on low-power consumer edge devices. Since such tasks tend to be computationally intensive, offloading this compute from the mobile/embedded CPU to purpose-designed “Neural Processing Engines” is a commonly adopted solution for accelerating deep learning computations. While these accelerators offer significant speed-ups for key machine learning kernels, overheads resulting from frequent host-accelerator communication often diminish the net application-level benefit of this heterogeneous system. Our solution for accelerating such workloads involves developing ISA extensions customized for machine learning kernels and designing a custom in-pipeline execution unit for these specialized instructions. We base our ISA extensions on RISC-V: an open ISA specification that lends itself to such specializations. In this paper, we present the software infrastructure for optimizing neural network execution on RISC-V with ISA extensions. Our ISA extensions are derived from the RISC-V Vector ISA proposal, and we develop optimized implementations of critical kernels such as convolution and matrix multiplication using these instructions. These optimized functions are subsequently added to the TensorFlow Lite source code and cross-compiled for RISC-V. We find that only a small set of instruction extensions achieves coverage over a wide variety of deep neural networks designed for vision and speech-related tasks. On average, our software implementation using the extended instruction set reduces the executed instruction count by 8X in comparison to the baseline host implementation. In parallel, we are also working on the hardware design of the in-pipeline machine learning accelerator. We plan to open-source our software modifications to TF Lite, as well as the micro-architecture design, in due course.

Keywords
Deep Learning, RISC-V Vector ISA extension, TensorFlow Lite

1 Introduction
Recent developments in deep learning have led to a resurgence in artificial intelligence. Various cognitive tasks such as image recognition [19, 23], speech recognition [31], and natural language processing [6, 20] extensively use deep neural networks. As these “intelligent applications” pervade mobile/Internet of Things (IoT) platforms, there is a growing demand for efficient execution of deep neural networks on these low-power and resource-constrained platforms. However, state-of-the-art neural networks routinely have millions of parameters, and a single inference task can invoke billions of arithmetic operations and memory accesses. Offloading the neural network execution to a dedicated hardware accelerator has emerged as a widely adopted solution for improving the execution time and energy efficiency. Manifestations of this concept are abundant: the Apple A12 Bionic [27], which has an integrated Neural Processing Unit; the Qualcomm SD 855, which has a Hexagon DSP [5, 12] and an integrated Neural Processing Unit; Huawei’s Kirin 980 SoC, which has a Dual Neural Processing Unit [3]; and the Samsung Exynos 9820, which has an integrated Neural Processing Unit [4].

A heterogeneous solution composed of accelerators and a CPU often requires partitioning the work between the host CPU and the neural accelerator(s) and may trigger frequent host-accelerator communication. Consider a canonical machine learning application that comprises a) pre-processing the inputs to render them consumable by a neural network, b) running a neural network inference using these inputs, and c) post-processing the predictions generated by the network. The net application-level speed-up is determined by the relative computational complexities of the components listed above as well as the overheads associated with communication between the host and the accelerator. Applications that involve frequent data and/or control exchanges between the host and the accelerator end up severely under-utilizing the accelerator and may not see a net benefit from offloading work from the host.

In this paper, we present our work on developing a solution that seeks to eliminate these overheads that surface in a typical
We implemented a subset of instructions from the RISC-V V ISA extension. To verify the functionality of the extended instructions, we modified the Spike simulator, extending the \texttt{class regfile\_t} to support them. Subsequently, we used Spike for functional verification and for benchmarking machine learning models, with the executed instruction count as the metric.

The C inline assembly functions are compiled into assembly code using the RISC-V GCC tool-chain. The assembly code is then converted into machine code using the GNU assembler (GAS) [11]. GAS is implemented in two parts: the \texttt{front-end}, which parses the assembly code, and the \texttt{back-end}, which generates the machine code. We added support for each of the instructions in Table 1 to the GAS front-end so that it parses the extended instructions and checks that each instruction has a valid opcode and operands. Subsequently, the GAS back-end generates the corresponding machine code for the extended instructions. We then modified the Spike ISA simulator to verify the functionality of the extended instructions.

**Listing 1: A function to load vector elements.**

```cpp
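template <class T>
inline void __VectorLoadInput(const T* load_address) {
    // Load a vector from load_address into vector register va1.
    // There are no output operands; load_address is the only input.
    asm volatile("vls va1, 0(%0), v"
                 :
                 : "r"(load_address));
}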
```

The code snippet in Listing 1 shows the implementation of the vector load template function. The function loads an array of elements into the vector register "va1". The number of elements to load is configured at run-time by setting two Control and Status Registers (CSRs), vcfg and vl, as required by the RISC-V V ISA extension.

### 2.2 Instruction simulation support on Spike ISS

Spike is a RISC-V Instruction Set Simulator (ISS) [7] that implements a functional model of a RISC-V processor. As a functional simulator, it ignores internal delays such as those of I/O accesses or memory transactions, so the simulations are not cycle-accurate. Spike executes a user-space program using the proxy kernel to handle system calls from C standard library functions.

To support the simulation of the instructions in Table 1, we modified the \texttt{class regfile\_t} with vector registers and macros to read/write values to the registers. In order to load/store data from memory, we extended the \texttt{class mmu\_t} with macros for loading/storing multiple data from memory. Similar to the scalar pipeline, a memory request is handled by the TLB unit in Spike.
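
As a rough illustration of the kind of state such a change adds to a simulator, the sketch below models a small vector register file with element read/write accessors and a strided load from a flat byte array. The names `VectorRegFile`, `kNumVRegs`, `kMaxElems` and `StridedLoad` are hypothetical and are not taken from Spike; the real simulator routes memory requests through its MMU/TLB model rather than a plain array.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Hypothetical sizes, chosen for illustration only.
constexpr std::size_t kNumVRegs = 32;
constexpr std::size_t kMaxElems = 8;  // 32-bit elements per vector register

// Minimal model of a vector register file with bounds-checked accessors.
class VectorRegFile {
 public:
  uint32_t Read(std::size_t reg, std::size_t elem) const {
    return regs_.at(reg).at(elem);
  }
  void Write(std::size_t reg, std::size_t elem, uint32_t value) {
    regs_.at(reg).at(elem) = value;
  }

 private:
  std::array<std::array<uint32_t, kMaxElems>, kNumVRegs> regs_{};
};

// Loads `vl` 32-bit elements into register `vd`, starting at byte offset
// `base` in `mem` and advancing `stride_bytes` between elements.
inline void StridedLoad(VectorRegFile& rf, std::size_t vd,
                        const std::vector<uint8_t>& mem, std::size_t base,
                        std::size_t stride_bytes, std::size_t vl) {
  if (vl > kMaxElems) throw std::out_of_range("vl exceeds register length");
  for (std::size_t i = 0; i < vl; ++i) {
    const std::size_t addr = base + i * stride_bytes;
    if (addr + sizeof(uint32_t) > mem.size()) {
      throw std::out_of_range("load outside memory");
    }
    uint32_t value;
    std::memcpy(&value, mem.data() + addr, sizeof(value));
    rf.Write(vd, i, value);
  }
}
```

With a stride of 4 bytes this behaves like a unit-stride load of 32-bit elements; larger strides skip over intermediate words.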

We also modified the \texttt{class processor\_t} to configure the two vector CSRs, \texttt{vcfg CSR} and \texttt{vl CSR}. As specified in RISC-V Vector ISA extension [22], the \texttt{vcfg CSR} configures the vector unit by

Table 1: The subset of RISC-V Vector ISA extension [22] implemented in our software ecosystem.

<table>
<thead>
<tr>
<th>Inst. Type</th>
<th>Instructions</th>
<th>Function</th>
</tr>
</thead>
<tbody>
<tr>
<td>Memory access</td>
<td>vls[b,h,s,d] VRd, RS1, RS2, m</td>
<td>Loads a vector into VRd from memory address in RS1 with unit/const stride in RS2 or indexed stride in VRS2</td>
</tr>
<tr>
<td></td>
<td>vx[b,h,s,d] VRd, RS1, VRS2, m</td>
<td></td>
</tr>
<tr>
<td></td>
<td>vss[b,h,s,d] VRS3, RS1, RS2, m</td>
<td>Stores a vector in VRS3 to memory address in RS1 with unit/const stride in RS2 or with indexed stride in VRS2</td>
</tr>
<tr>
<td></td>
<td>vssx[b,h,s,d] VRS3, RS1, VRS2, m</td>
<td></td>
</tr>
<tr>
<td>Arithmetic Instructions</td>
<td>vadd VRd, VRS1, VRS2, m</td>
<td>Adds/multiplies the values in VRS1 and VRS2 and writes the result to VRd</td>
</tr>
<tr>
<td></td>
<td>vmul VRd, VRS1, VRS2, m</td>
<td></td>
</tr>
<tr>
<td></td>
<td>vfadd VRd, VRS1, VRS2, m</td>
<td></td>
</tr>
<tr>
<td></td>
<td>vfmul VRd, VRS1, VRS2, m</td>
<td></td>
</tr>
<tr>
<td></td>
<td>vmax VRd, VRS1, VRS2, m</td>
<td>Multiply values in VRS1, VRS2 and add VRS3, and writes to VRd</td>
</tr>
<tr>
<td></td>
<td>vmin VRd, VRS1, VRS2, m</td>
<td></td>
</tr>
<tr>
<td></td>
<td>vfmmax VRd, VRS1, VRS2, m</td>
<td>Element-wise maximum/minimum of values in VRS1, VRS2 and writes to VRd</td>
</tr>
<tr>
<td></td>
<td>vfmmin VRd, VRS1, VRS2, m</td>
<td></td>
</tr>
<tr>
<td>Data Movement</td>
<td>vsplat VRd, VRS1, RS2, m</td>
<td>Splats the element in VRS1[RS2] to VRd</td>
</tr>
<tr>
<td></td>
<td>vbcastx VRd, RS1</td>
<td>Broadcasts value in RS1/FRS1 to VRd</td>
</tr>
<tr>
<td></td>
<td>vbcastf VRd, FRS1</td>
<td></td>
</tr>
<tr>
<td></td>
<td>vredsum VRd, VRS1</td>
<td>Reduction of VRS1 based on sum/max/min, broadcast and store the result to VRd</td>
</tr>
<tr>
<td></td>
<td>vredmin VRd, VRS1</td>
<td></td>
</tr>
<tr>
<td></td>
<td>vredmax VRd, VRS1</td>
<td></td>
</tr>
<tr>
<td></td>
<td>vfredsum VRd, VRS1</td>
<td></td>
</tr>
</tbody>
</table>

VRd: Vector destination registers
VRS1,2,3: Vector source registers.
m: Two bit encoding for masking; m=00 -> scalar shape destination, m=01 -> unmasked vector operation, m=10 -> mask enabled where v1.LSB=0, m=11 -> mask enabled where v1.LSB=1; here v1 is the mask register.

setting the highest number of enabled vector registers in the vregmax CSR and the maximum width of elements in the vmaxw CSR. The vl CSR holds the current active vector length. Finally, we added support in Spike for all the instructions in Table 1. Listing 2 is an example of the implementation of the vadd instruction in Spike. These modifications enabled simulation of the vector instructions. We also added functionality to Spike's interactive debug mode to facilitate tracing and debugging.

**Listing 2: Implementation of the vadd instruction in Spike.**

```c
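// The instruction is only valid when the V extension is present.
require_extension("V");
// The instruction is only valid on a 64-bit (RV64) target.
require_rv64;
// Add the two source vectors under the mask mode and active vector length,
// then write the result to the destination vector register.
WRITE_VRD(v_add(VRS1, VRS2, EW, insn.m(), VMASK, VL));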
```
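
The body of the `v_add` helper is not shown in the paper. The following is a minimal sketch of how an element-wise add could honor the two-bit mask encoding described under Table 1 and the active vector length. The function `MaskedAdd`, the `MaskMode` enum, and the treatment of the scalar-shape case (only element 0 is written) are assumptions made for illustration, as is the choice to leave masked-off elements unchanged.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Two-bit mask mode, following the note under Table 1.
enum class MaskMode : uint8_t {
  kScalarDest = 0b00,   // scalar-shape destination
  kUnmasked = 0b01,     // unmasked vector operation
  kMaskLsbZero = 0b10,  // active where the mask element's LSB is 0
  kMaskLsbOne = 0b11,   // active where the mask element's LSB is 1
};

// Hypothetical element-wise add over the first `vl` elements.
// `vd` holds the previous destination contents; inactive elements keep them.
// All vectors must hold at least `vl` elements.
inline void MaskedAdd(std::vector<int32_t>& vd,
                      const std::vector<int32_t>& vs1,
                      const std::vector<int32_t>& vs2,
                      const std::vector<uint32_t>& mask, MaskMode m,
                      std::size_t vl) {
  // Assumption: a scalar-shape destination only updates element 0.
  const std::size_t n = (m == MaskMode::kScalarDest && vl > 0) ? 1 : vl;
  for (std::size_t i = 0; i < n; ++i) {
    const bool lsb = (mask[i] & 1u) != 0;
    if (m == MaskMode::kMaskLsbZero && lsb) continue;
    if (m == MaskMode::kMaskLsbOne && !lsb) continue;
    vd[i] = vs1[i] + vs2[i];
  }
}
```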

2.3 RISC-V target for TensorFlow Lite

TensorFlow Lite is a lightweight deep learning framework for mobile and embedded devices [1]. It compresses a TensorFlow model to a .tflite model that has a small binary size. This enables on-device machine learning and uses hardware acceleration to improve performance. The TensorFlow Lite source code has two implementations: reference_ops and optimized_ops, for machine learning kernels such as convolution and depthwise-convolution. The reference_ops implementation is portable, hardware-independent and uses standard C/C++ libraries. The optimized_ops is a hardware specific optimized implementation of kernel operations using gemmlowp, Eigen libraries [13, 18] and other processor specific optimizations. For example, in the case of ARM processors, the optimized_ops implementation leverages gemmlowp, Eigen libraries and Neon instructions [21] to optimize kernel operations.

To support a RISC-V target for TensorFlow Lite, we modified some functions in reference_ops to remove library dependencies not supported by Newlib [29]. This made the reference_ops implementation portable and capable of running on mobile and embedded devices with RISC-V processors. The C inline assembly functions were used for constructing SIMD-aware optimized functions to be used in the optimized_ops implementation for RISC-V vector processors. Listing 3 shows the implementation of a function that performs element-wise addition of two arrays. Using the instructions in Table 1, we can support a wide range of machine learning models.

1 Newlib is a C standard library implementation intended for use on embedded systems.

We cross-compiled the TensorFlow Lite source code for RISC-V ISA and executed .tflite models on Spike. With the infrastructure in place, we generate a binary that can run on a RISC-V processor that has micro-architectural support for the RISC-V V ISA extension.

Listing 3: An example function for element-wise addition of two arrays.

```c
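// Element-wise addition of two float arrays using the vector extension.
void VectorVectorAdd(const float* input1, const float* input2, float* output, int len) {
    // Split len into a multiple of the vector length plus a remainder
    // (the bit mask assumes kMaxVectorLength32 is a power of two).
    int new_len = len - (len & (kMaxVectorLength32 - 1));
    int len_diff = len & (kMaxVectorLength32 - 1);
    // Configure 32-bit elements and the full vector length.
    SetConfig(kElementWidthMax32, kMaxVectorLength32);
    for (int i = 0; i < new_len; i += kMaxVectorLength32) {
        __VectorLoad((input1 + i), (input2 + i));
        __VectorAddFloat();
        __VectorStore((output + i));
    }
    // Process the remaining elements with a shorter vector length.
    if (len_diff != 0) {
        SetVl(len_diff);
        __VectorLoad((input1 + new_len), (input2 + new_len));
        __VectorAddFloat();
        __VectorStore((output + new_len));
    }
}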
```
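
The same strip-mining structure can be tried on any host with a portable sketch in which a scalar loop stands in for one vector load/add/store sequence. The names `kLanes`, `AddChunk` and `StripMinedAdd` are hypothetical, and `kLanes = 4` assumes a 128-bit register holding 32-bit floats; as in Listing 3, the bit-mask arithmetic only works when the lane count is a power of two.

```cpp
#include <cstddef>

// Hypothetical lane count: a 128-bit register holding 32-bit floats.
constexpr std::size_t kLanes = 4;
static_assert((kLanes & (kLanes - 1)) == 0, "kLanes must be a power of two");

// Scalar stand-in for one vector load/add/store over `vl` elements.
inline void AddChunk(const float* a, const float* b, float* out,
                     std::size_t vl) {
  for (std::size_t i = 0; i < vl; ++i) out[i] = a[i] + b[i];
}

// Processes full chunks of kLanes elements, then one shorter tail chunk.
inline void StripMinedAdd(const float* a, const float* b, float* out,
                          std::size_t len) {
  const std::size_t tail = len & (kLanes - 1);  // elements left over
  const std::size_t body = len - tail;          // multiple of kLanes
  for (std::size_t i = 0; i < body; i += kLanes) {
    AddChunk(a + i, b + i, out + i, kLanes);
  }
  if (tail != 0) {
    AddChunk(a + body, b + body, out + body, tail);
  }
}
```

For `len = 10` this issues two full chunks of four elements followed by one tail chunk of two, which mirrors the main loop and the `SetVl(len_diff)` branch of Listing 3.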

3 Evaluation

In this section, we evaluate the code optimizations for RISC-V and compare them with ARM processors, as ARM processors are the most commonly used processors for mobile systems. For comparison purposes, we define the Region Of Interest (ROI) as the execution of the interpreter->Invoke() function in TensorFlow Lite. The deep learning models [14–17, 25, 26] used in our evaluation are listed in Table 2. These are commonly used machine learning inference models that are deployed on mobile devices. We cover a wide range of applications using these benchmark models. The models are 32-bit floating point .tflite models and are hosted on the TensorFlow Lite website [2].

To evaluate the performance of the deep learning models listed in Table 2 on an ARM processor, we used gem5 [9] in full system mode with an ARM A-class, 4-stage pipeline High Performance In-order (HPI) core configuration [28]. The ARM HPI core was configured with a 16KB L1 I$ and a 16KB L1 D$, and without an L2$. In this section, we use the term ARM-base for the baseline implementation of TensorFlow Lite using reference_ops, and ARM-opt for the implementation of TensorFlow Lite using optimized_ops. We inserted the m5_reset_stats and m5_dump_stats functions in the TensorFlow Lite source code to get gem5 performance stats for the ROI. We used the number of cycles and committed instructions as our performance metrics for evaluation.

For RISC-V, RV-base and RV-opt represent the RISC-V cross-compiled binaries of TensorFlow Lite using reference_ops and optimized_ops, respectively. We mapped an in-order 5-stage pipeline Rocket core [7] to a Zedboard [10] to evaluate the performance of the benchmarks in Table 2 for RV-base. The Rocket core is configured with a 16KB L1 I$ and a 16KB L1 D$, and without an L2$, as the current version of Rocket chip does not support an L2$. We used hardware performance counters, specifically the cycle CSR and the instret CSR, for evaluation. Currently, the microarchitecture enhancement to the Rocket-chip processor for supporting the extended instructions in Table 1 is in the ‘pre-pre-alpha stage’. For this paper, we use Spike to benchmark the number of committed instructions of the deep learning benchmarks listed in Table 2 for RV-opt.
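
A minimal sketch of how a region of interest could be bracketed with reads of the cycle and instret counters is shown below. It uses the standard `rdcycle` and `rdinstret` pseudo-instructions on RV64 and returns zero on other hosts so that it still compiles there. The helper names `ReadCycle`, `ReadInstret`, `RoiCounts` and `MeasureRoi` are hypothetical, and the sketch assumes that user-mode access to the counters is enabled on the target; it is not the instrumentation used for the measurements in this paper.

```cpp
#include <cstdint>

// Reads the cycle counter on RV64; returns 0 on other hosts.
inline uint64_t ReadCycle() {
#if defined(__riscv) && (__riscv_xlen == 64)
  uint64_t value;
  asm volatile("rdcycle %0" : "=r"(value));
  return value;
#else
  return 0;
#endif
}

// Reads the retired-instruction counter on RV64; returns 0 on other hosts.
inline uint64_t ReadInstret() {
#if defined(__riscv) && (__riscv_xlen == 64)
  uint64_t value;
  asm volatile("rdinstret %0" : "=r"(value));
  return value;
#else
  return 0;
#endif
}

struct RoiCounts {
  uint64_t cycles;
  uint64_t instructions;
};

// Runs `region` and returns the counter deltas observed around it.
template <typename Fn>
RoiCounts MeasureRoi(Fn&& region) {
  const uint64_t cycles_before = ReadCycle();
  const uint64_t instret_before = ReadInstret();
  region();
  const uint64_t instret_after = ReadInstret();
  const uint64_t cycles_after = ReadCycle();
  return {cycles_after - cycles_before, instret_after - instret_before};
}
```

In this setting the callable passed to `MeasureRoi` would wrap the call to interpreter->Invoke(); the counter reads themselves add a few instructions to the measured region.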

---

2 We simulated the ARM core without an L2$ to perform a fair comparison with the RISC-V Rocket core.

ARM-base and RV-base are cross-compiled from the same source code. Although we used the “-O3” compiler flag in both cases, we noticed that the number of committed instructions and the corresponding cycles in the ROI were higher for RV-base than for ARM-base. Figure 2 shows the number of committed instructions and the number of cycles for four variants of the MobileNet [14] model used in an image classification workload. Figure 2 shows that the number of committed instructions and cycles is ~2X higher for RV-base-v1 in comparison to ARM-base. Here, RV-base-v1 corresponds to the binary cross-compiled from the TensorFlow Lite reference_ops. The difference in instruction and cycle count is due to the difference in compiler optimizations. As the ARM cross-compiler has matured over the years, it optimizes nested loops in the source code such that the inner-most loop has few instructions. We updated the source code to replicate the compiler loop optimizations. We refer to this updated version as RV-base-v2. As shown in Figures 2a and 2b, the loop optimization reduced the number of committed instructions and cycles for RV-base-v2, and these numbers are now comparable to those of ARM-base. For the rest of our analysis, we use RV-base-v2 and ARM-base as our baseline implementations.
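
The exact source changes are not listed here. As one plausible example of a manual loop optimization of this kind, the sketch below hoists the row-base address computation out of the inner loop so that the inner-most loop contains only the load, multiply and store. The functions `Offset`, `ScaleNaive` and `ScaleHoisted` are hypothetical and only illustrate the transformation; whether a given compiler already performs it automatically depends on the code and the target.

```cpp
#include <cstddef>

// Flattened index of element (row, col) in a row-major matrix.
inline std::size_t Offset(std::size_t cols, std::size_t row, std::size_t col) {
  return row * cols + col;
}

// Straightforward version: the full offset is recomputed for every element.
inline void ScaleNaive(const float* in, float* out, std::size_t rows,
                       std::size_t cols, float scale) {
  for (std::size_t r = 0; r < rows; ++r) {
    for (std::size_t c = 0; c < cols; ++c) {
      out[Offset(cols, r, c)] = in[Offset(cols, r, c)] * scale;
    }
  }
}

// Hand-optimized version: row base pointers are computed once per row.
inline void ScaleHoisted(const float* in, float* out, std::size_t rows,
                         std::size_t cols, float scale) {
  for (std::size_t r = 0; r < rows; ++r) {
    const float* in_row = in + r * cols;
    float* out_row = out + r * cols;
    for (std::size_t c = 0; c < cols; ++c) {
      out_row[c] = in_row[c] * scale;
    }
  }
}
```

Both functions compute the same result; only the amount of address arithmetic left in the inner loop differs.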

We next compare the ARM-opt and RV-opt implementations using the number of committed instructions for the deep learning models listed in Table 2. ARM-opt uses the ARM Neon extension [21], which has a fixed SIMD width of 128 bits. The benchmark models use single-precision floating point; therefore, the processor operates on 4 single-precision floating point values in one instruction. We set the RISC-V vector register width to 128 bits for a fair comparison with the ARM processor with the Neon extension. We also evaluated the setup with a vector register width of 256 bits. Figure 3 shows the comparison of ARM-base, RV-base-v2, ARM-opt, RV-opt-v1 with 128-bit registers and RV-opt-v2 with 256-bit registers, using the deep learning models in Table 2. As expected, the number of committed instructions is similar (across all the models) for ARM-base and RV-base-v2. On average, across all benchmarks, the number of committed instructions for RV-opt-v1 is 1.25X lower than for ARM-opt. In deep learning models where ‘CONV’ layers are dominant, RV-opt-v1 has consistently fewer instructions than ARM-opt. In models where LSTM layers are dominant, ARM-opt has consistently fewer instructions than RV-opt-v1. This is because of a difference in code optimization for ARM and RISC-V: the ARM-opt implementation uses block vector-matrix multiplication for LSTM layers. The instruction count for RV-opt-v1 can be improved by implementing block vector-matrix multiplication.
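
The ARM kernel is not described in detail here. The following is a generic sketch of row-blocked matrix-vector multiplication, in which each element of the input vector is loaded once and reused for several rows of the matrix, reducing the number of loads per multiply-accumulate. The function `BlockedMatVec` and the block height `kRowBlock` are hypothetical, and the scalar inner loop stands in for what would be vector instructions in an optimized kernel.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical block height: rows processed together per pass over `vec`.
constexpr std::size_t kRowBlock = 4;

// Computes out[r] = sum over c of matrix[r * cols + c] * vec[c]
// for a row-major matrix with `rows` rows and `cols` columns.
inline void BlockedMatVec(const float* matrix, const float* vec, float* out,
                          std::size_t rows, std::size_t cols) {
  for (std::size_t r0 = 0; r0 < rows; r0 += kRowBlock) {
    const std::size_t block = std::min(kRowBlock, rows - r0);
    float acc[kRowBlock] = {0.0f};  // one accumulator per row in the block
    for (std::size_t c = 0; c < cols; ++c) {
      const float v = vec[c];  // loaded once, reused for every row in block
      for (std::size_t b = 0; b < block; ++b) {
        acc[b] += matrix[(r0 + b) * cols + c] * v;
      }
    }
    for (std::size_t b = 0; b < block; ++b) out[r0 + b] = acc[b];
  }
}
```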

On average, we achieved an 8X reduction in the number of committed instructions using the RV-opt-v1 implementation in comparison to RV-base. We see an additional ~2X reduction in the number of committed instructions using RV-opt with a 256-bit register width.
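
A back-of-the-envelope model helps explain why doubling the register width roughly halves the count for vectorized loops. With 32-bit elements, a register of a given width holds width/32 lanes, and covering an array takes the array length divided by the lane count, rounded up. The helpers `Lanes` and `VectorOps` below are hypothetical and model only the number of vector operations per pass over an array, not the measured instruction counts reported in this section.

```cpp
#include <cstddef>

// Number of 32-bit lanes in a vector register of the given width.
constexpr std::size_t Lanes(std::size_t register_bits) {
  return register_bits / 32;
}

// Vector operations needed to cover `len` elements, counting a partial
// tail chunk as one operation.
constexpr std::size_t VectorOps(std::size_t len, std::size_t register_bits) {
  return (len + Lanes(register_bits) - 1) / Lanes(register_bits);
}

// 1024 floats: 256 operations at 128 bits, 128 operations at 256 bits.
static_assert(VectorOps(1024, 128) == 256, "128-bit registers hold 4 floats");
static_assert(VectorOps(1024, 256) == 128, "256-bit registers hold 8 floats");
```

Scalar bookkeeping such as loop control and address updates also shrinks with the trip count, but code that is not vectorized is unaffected, which is one reason the overall reduction is approximate.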

### 4 Summary and Future Work

In this paper, we presented the software infrastructure we developed to support compilation and execution of machine learning models used in the TensorFlow Lite framework. We are able to support a large range of machine learning applications using a subset of RISC-V Vector instructions. On average, we are able to reduce the number of committed instructions by 8X using the RV-opt implementation in comparison to the RISC-V reference implementation.

In our current software pipeline, we handle register naming and register allocation ourselves. Moving forward, we want the compiler to handle this task. To enable this, we need to support the new instructions in GCC using intrinsics. The GCC compiler has three stages: the front-end, middle-end and back-end [24]. At a high level, the front-end generates a parse-tree from the input program, the middle-end uses the parse-tree to generate a generic-tree, and the back-end converts the generic-tree to assembly code. We will also modify the back-end of GCC to create machine descriptions of the new instructions and generate the assembly code.

We are developing in-pipeline microarchitectural support for a subset of the vector instructions for the machine learning accelerator. Our microarchitecture will include dedicated vector registers, vector caches and support for multiple-precision arithmetic and logical operations. Additionally, we will explore sparsity-aware microarchitecture design for the in-pipeline accelerator. As we develop the in-pipeline accelerator, we will modify the subset of instructions in Table 1 as needed and update the software tool-chain accordingly. We will evaluate our design in terms of performance, power and area. We will open-source our software modifications to TF Lite, as well as the micro-architecture design, to the wider community in due course.

References