Tensilica Xtensa

July 30, 2022


Slide 2

Overview
Background
Changes in progress from Xtensa to Xtensa LX
Automated Development Process
ISA
TIE Language
Benchmarks

Slide 3

Tensilica
Founded in 1997 in Santa Clara, California, by a group of engineers from Intel, SGI, MIPS, and Synopsys to compete with ARC
Goal: to address the need for application-specific microprocessor cores and software development tools by designing the first configurable and extensible processor core

Slide 4

Why?
Embedded applications face a choice between high-cost custom designs and low-performance (inefficient) processors
System on a Chip (SoC) challenge
Traditionally solved using hardwired RTL blocks

Slide 5

The Problem with RTL
Rapidly increasing transistor counts require more RTL blocks on chip
Hardcoded RTL blocks are not flexible
Hand-optimized for application-specific purposes

Slide 6

Tensilica’s Solution
Xtensa
Focuses design on the processor rather than on hardwired RTL

Slide 7

Xtensa
First appeared in 1999
32-bit microprocessor core with a graphical configuration interface and integrated tool chain
Designed from the start to be user-customizable
Emphasizes instruction-set configurability as the primary feature distinguishing it from other core offerings
Has revolutionized the System on a Chip (SoC) challenge throughout its development
Configurable and extensible

Slide 8

Xtensa – In a Nutshell
Enables embedded system designers to build better, more highly integrated products in significantly less time
Designers can add specialized functions or instructions to the processor and have them recognized as “native” by the entire software development tool chain
Moves design to a higher level of abstraction by designing with processors rather than RTL

Slide 9

Xtensa - Deliverables
Provided as synthesizable RTL cores
Gate count range: 25,000 – 150,000+
Gate count increases as the customer adds instructions or optional features
Software development tools

Slide 10

Xtensa – Verification Challenges
To extensively verify the configurable processor so that each possible configuration is bug-free
To enable the customer to rapidly integrate the core while limiting support costs

Slide 11

Xtensa – Basic Architecture
78 instructions
Five-stage pipeline that supports single-cycle execution
1 load/store model
32-entry orthogonal register file
32 optional extra registers

Slide 12

Xtensa – Basic Architecture
Processor Configuration
Power usage: 200 mW (0.25 μm, 1.5 V)
Clock speed: 170 MHz
Cache:
16 KB I-cache
16 KB D-cache
Direct mapped
32 registers (32 bits each)
Extensible via TIE instructions
No floating-point processor
Zero-overhead loops

Slide 13

Xtensa - ISA
Priorities used in ISA development
Code size, configurability, processor cost, energy efficiency, scalability, features
ISA influences
MIPS
IBM Power
Sun SPARC
ARM Thumb
HP PlayDoh
DSPs

Slide 14

Xtensa III
Developed with Virtual IP Group: an MP3 audio decoder for Tensilica's Xtensa configurable microprocessor architecture. The decoder offers hardware extensions and optimized code for accelerating MP3 decoding
32-bit floating-point processing
32x32-bit hardware multiplier
First coprocessor interface
Vectra DSP enhancements

Slide 15

Xtensa IV
Used a white-box verification methodology for the original development
Includes 0-In Check and the CheckerWare Library made by Mentor Graphics
Instructions could be repartitioned up until the point of manufacturing
Supports multiple processors in an ASIC
128-bit-wide local memory interface

Slide 16

Xtensa V
350 MHz (synthesized), as small as 18K gates (0.25 mm²)
More flexible interfaces for multiple processors
Write-back and write-through caches
Enhanced Xtensa Local Memory Interface
Shared data memories
More automation
Xtensa C/C++ Compiler and TIE language improvements
XT2000 emulation kit
World’s fastest embedded core

Slide 17

Xtensa V – Performance Cost Timeline

Slide 18

Xtensa 6
Extremely fast customization path
Three major enhancements from Xtensa V
Automatic processor customization from a C/C++-based algorithm using the XPRES Compiler
30% less power consumption
Advanced security provisions in MMU-enabled configurations

Slide 19

Xtensa LX
“Fastest processor core ever” – Tensilica
I/O bandwidth, compute parallelism, and low-power optimization equivalent to hand-optimized, non-programmable, RTL-designed hardware blocks
XPRES Compiler and automated processor generator
Uses Flexible Length Instruction Xtensions (FLIX)
Ideal for:
Embedded processor control tasks
Compute-intensive datapath hardware tasks

Slide 20

Xtensa 6 Vs Xtensa LX

Slide 21

Xtensa LX
Strongest selling point is performance
DSP operations can be encapsulated into custom instructions
High performance leads to power savings
Custom instructions target a specific application

Slide 22

Xtensa LX Vs General Purpose

Slide 23

Xtensa LX – Traditional Limitations
1 Operation / cycle
Load/Store overhead

Slide 24

Xtensa LX
Options:
Extra load/store unit, wide interfaces, compound instructions
Up to 19 GB/sec of throughput

Slide 25

Xtensa LX – Highlights
Lower power usage
I/O throughput at RTL speeds
Outstanding compute performance
XPRES Compiler

Slide 26

Xtensa LX – Lower Power Usage
Automated insertion of fine-grain clock gating for every functional element of the Xtensa LX processor
This includes functions created by the designer
Direct I/O capability, like RTL

Slide 27

Outstanding Computing Performance
Extensible using FLIX (Flexible Length Instruction Xtensions)
Similar to VLIW, but customizable to fit the application code’s needs
Significant improvement over competitors and previous Xtensa designs
DSP instructions formed using FLIX are recognized as native by the entire development system

Slide 28

XPRES Compiler
Powerful synthesis tool
Creates tailored processor descriptions
Runs on native C/C++ code

Slide 29

Automated Development
Clients log into the website
Access the Processor Generator
It builds a model in RTL Verilog or VHDL
Sends the result via the Internet to the client’s site
Clients also receive:
Preconfigured synthesis scripts, test benches, and software-development tools
Software tools include:
Assembler, C/C++ compiler, linker, debugger, and instruction-set simulator, already modified to match the hardware configuration

Slide 30

Automated Development
Create special instructions described and written in TIE
TIE semantics allow the system to modify the software-development tools
Integrates changes into the processor design
Compile with synthesis tool – test – order

Slide 31

Xtensa LX – Basic Architecture
Processor Configuration
Power usage: 76 μW/MHz, 47 μW/MHz (5- and 7-stage pipeline)
Clock speed: 350 MHz, 400 MHz (5- and 7-stage pipeline)
Cache:
Up to 32 KB; 1-, 2-, 3-, or 4-way set associative
64 general-purpose physical registers (32 bits each)
6 special-purpose registers
Extensible via TIE and FLIX instructions
Zero-overhead loops

Slide 32

Xtensa LX Architecture
32-bit ALU
1 or 2 load/store units
Registers
32-bit general-purpose register file
32-bit program counter
16 optional 1-bit Boolean registers
16 optional 32-bit floating-point registers
4 optional 32-bit MAC16 data registers
Optional Vectra LX DSP registers

Slide 33

Xtensa LX Architecture
General-purpose AR register file
32 or 64 registers
Instructions access it through a “sliding window” of 16 registers. The window can rotate by 4, 8, or 12 registers
The register window reduces code size by limiting the number of bits needed for register addresses, and it reduces the need to save and restore registers on calls
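
The sliding window can be pictured with a small software model. This is only a sketch under assumptions: the 64-entry file is one of the two sizes named on the slide, the names `ar_file`, `ar_index`, `ar_rotate`, and `ar_restore` are hypothetical, and window overflow and underflow handling is left out entirely.

```c
#include <assert.h>
#include <stdint.h>

enum { NUM_PHYS_REGS = 64, WINDOW_SIZE = 16 };

typedef struct {
    uint32_t phys[NUM_PHYS_REGS]; /* physical AR register file */
    unsigned base;                /* physical index of window register 0 */
} ar_file;

/* Map a window-relative register number (0..15) to a physical index. */
static unsigned ar_index(const ar_file *rf, unsigned reg)
{
    assert(reg < WINDOW_SIZE);
    return (rf->base + reg) % NUM_PHYS_REGS;
}

/* Rotate the window forward on a call; only 4, 8, or 12 are allowed. */
static void ar_rotate(ar_file *rf, unsigned by)
{
    assert(by == 4 || by == 8 || by == 12);
    rf->base = (rf->base + by) % NUM_PHYS_REGS;
}

/* Rotate the window back on return by the amount used for the call. */
static void ar_restore(ar_file *rf, unsigned by)
{
    assert(by == 4 || by == 8 || by == 12);
    rf->base = (rf->base + NUM_PHYS_REGS - by) % NUM_PHYS_REGS;
}
```

In this model, a value the caller writes to window register 10 is visible to the callee as window register 2 after a rotation by 8, so arguments can be passed without copying. Each instruction needs only 4 bits per register operand because it names one of 16 window registers.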

Slide 34

Xtensa LX Architecture

Slide 35

Xtensa LX Pipelining
5- or 7-stage pipeline design
The 5-stage pipeline has the stages: instruction fetch (IF), register access, execute, data-memory access, and register writeback
The 5-stage pipeline accesses memory in two stages. The 7-stage pipeline is an extended version of the 5-stage pipeline with an extra IF stage and an extra memory-access stage. The extra stages provide more time for memory access, so the designer can run at a higher clock speed while using slower memory, improving performance

Slide 36

Xtensa LX Instruction Set
The ISA consists of 80 core instructions, including both 16- and 24-bit instructions

Slide 37

Xtensa LX Instruction Set
Processor Control Instructions
RSR, WSR, XSR
Read Special Register, Write Special Register, Exchange Special Register
Used for saving and restoring context, processing interrupts and exceptions, and controlling address translation
RUR, WUR
Read and write user registers
Used for coprocessor registers and registers created with TIE
ISYNC – waits for instruction-fetch-related changes to resolve
RSYNC – waits for dispatch-related changes to resolve
ESYNC/DSYNC – wait for memory/data-execution-related changes to resolve

Slide 38

Xtensa LX ISA – Building Blocks
MUL32
Adds a 32-bit multiplier
MUL16 and MAC16
MUL16 adds a 16x16-bit multiplier
MAC16 adds a 16x16-bit multiplier and a 40-bit accumulator
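
To illustrate what a 16x16 multiplier feeding a 40-bit accumulator computes, here is a minimal software model. It is a sketch, not the hardware's definition: the function names are hypothetical, and wrapping on overflow (rather than saturating) is an assumption made for the example.

```c
#include <assert.h>
#include <stdint.h>

#define ACC_BITS 40

/* Wrap a value into the signed 40-bit two's-complement range. */
static int64_t wrap40(int64_t v)
{
    const uint64_t mask = (UINT64_C(1) << ACC_BITS) - 1;
    uint64_t u = (uint64_t)v & mask;
    if (u & (UINT64_C(1) << (ACC_BITS - 1)))  /* sign bit of the 40-bit value */
        return (int64_t)u - (INT64_C(1) << ACC_BITS);
    return (int64_t)u;
}

/* One multiply-accumulate step: acc += a * b, kept to 40 bits. */
static int64_t mac16(int64_t acc, int16_t a, int16_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;  /* always fits in 32 bits */
    return wrap40(acc + product);
}

/* Dot product of two 16-bit vectors using the accumulator model. */
static int64_t dot16(const int16_t *x, const int16_t *y, unsigned n)
{
    assert(x && y);
    int64_t acc = 0;
    for (unsigned i = 0; i < n; i++)
        acc = mac16(acc, x[i], y[i]);
    return acc;
}
```

The 8 bits of headroom above a 32-bit product allow 256 worst-case products to be summed before the accumulator can overflow, which is why DSP kernels such as dot products and FIR filters favor a wide accumulator.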

Slide 39

Xtensa LX ISA – Building Blocks
Floating Point Unit
32-bit, single-precision, floating-point coprocessor
Vectra LX DSP Engine
Optimized to handle digital signal processing applications

Slide 40

Vectra LX DSP Engine
FLIX-based, which is why its instructions are 64 bits wide
Vectra LX instructions are encoded in 64 bits
Bits 0:3 of an Xtensa instruction determine its length and format; a value of 14 specifies a Vectra LX instruction
Bits 4:27 contain either an Xtensa LX core instruction or a Vectra LX load or store instruction
Bits 28:45 contain either a MAC instruction or a select instruction
Bits 46:63 contain either ALU and shift instructions or a load or store instruction for the second Vectra LX load/store unit
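
The field layout above can be expressed as a small decoder. This is a sketch under assumptions: bit 0 is taken to be the least significant bit of the 64-bit word, and the type and function names (`vectra_word`, `bits`, `vectra_decode`, `is_vectra`) are hypothetical. It only splits the word into its three slots and does not interpret the operations inside them.

```c
#include <assert.h>
#include <stdint.h>

#define VECTRA_FORMAT_ID 14u  /* format value quoted on the slide */

typedef struct {
    unsigned format;  /* bits 0:3   length/format selector */
    uint32_t slot0;   /* bits 4:27  core op or Vectra LX load/store */
    uint32_t slot1;   /* bits 28:45 MAC or select op */
    uint32_t slot2;   /* bits 46:63 ALU/shift op or second load/store */
} vectra_word;

/* Extract bits lo..hi (inclusive) of a 64-bit word; bit 0 is the LSB. */
static uint64_t bits(uint64_t w, unsigned lo, unsigned hi)
{
    assert(lo <= hi && hi < 64);
    unsigned width = hi - lo + 1;
    uint64_t mask = (width == 64) ? ~UINT64_C(0) : (UINT64_C(1) << width) - 1;
    return (w >> lo) & mask;
}

/* Split a 64-bit instruction word into its format field and three slots. */
static vectra_word vectra_decode(uint64_t w)
{
    vectra_word d;
    d.format = (unsigned)bits(w, 0, 3);
    d.slot0 = (uint32_t)bits(w, 4, 27);
    d.slot1 = (uint32_t)bits(w, 28, 45);
    d.slot2 = (uint32_t)bits(w, 46, 63);
    return d;
}

/* True when the format field marks the word as a Vectra LX instruction. */
static int is_vectra(uint64_t w)
{
    return bits(w, 0, 3) == VECTRA_FORMAT_ID;
}
```

The slot widths follow directly from the bit ranges: 24 bits for the first slot and 18 bits for each of the other two, which together with the 4-bit format field fill the 64-bit word.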

Slide 41

Vectra LX DSP Engine

Slide 42

Tensilica Instruction Extension
Method used to extend the processor’s architecture and instruction set
Can be used in two ways:
With the TIE Compiler
With the Processor Generator

Slide 43

Tensilica Instruction Extension
TIE Compiler
Generates files used to configure the software development tools so that they recognize TIE extensions
Estimates the hardware size of a new instruction
You can modify application code to take advantage of the new instruction and simulate it to decide whether the speed advantage is worth the hardware cost

Slide 44

TIE
Resembles Verilog
More concise than RTL (it omits all sequential logic, pipeline registers, and initialization sequences)
The custom instructions and registers described in TIE are part of the processor’s programming model

Slide 45

TIE Queues and Ports
New way to communicate with external devices
Queues: data can be sent or read through queues. A queue is defined in TIE, and the compiler generates the interface signals for the additional port needed to connect to the queue. The supporting logic is also generated automatically
Import-wire: the processor can sample the value of an external signal
Export-state: the processor drives an output based on TIE state

Slide 46

TIE
TIE combines multiple operations into one using:
Fusion
SIMD/Vector Transformation
FLIX

Slide 47

Fusion
Allows you to combine dependent operations into a single instruction
Consider: computing the average of two arrays
    unsigned short *a, *b, *c;
    . . .
    for (i = 0; i < n; i++)
        c[i] = (a[i] + b[i]) >> 1;
Two Xtensa LX core instructions (an add and a shift) are required, in addition to the load/store instructions

Slide 48

Fusion
Fuse the two operations into a single TIE instruction
    operation AVERAGE {out AR res, in AR input0, in AR input1} {} {
        wire [16:0] tmp = input0[15:0] + input1[15:0];
        assign res = tmp[16:1];
    }
Essentially an add feeding a shift, described using standard Verilog-like syntax
Using the instruction from C/C++
    #include
    unsigned short *a, *b, *c;
    . . .
    for (i = 0; i < n; i++)
        c[i] = AVERAGE(a[i], b[i]);
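
For checking results without the custom instruction, the fused add-and-shift can be modeled in plain C. This is a sketch: `average_model` and `average_arrays` are hypothetical names, and the model mirrors the 17-bit intermediate sum and the `[16:1]` bit selection of the TIE description rather than using any Tensilica header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Software model of the fused add-and-shift on two 16-bit inputs. */
static uint16_t average_model(uint16_t in0, uint16_t in1)
{
    uint32_t tmp = (uint32_t)in0 + (uint32_t)in1;  /* sum needs 17 bits */
    return (uint16_t)(tmp >> 1);                   /* keep bits 16:1 */
}

/* Element-wise average of two arrays, one model call per element. */
static void average_arrays(const uint16_t *a, const uint16_t *b,
                           uint16_t *c, size_t n)
{
    assert(a && b && c);
    for (size_t i = 0; i < n; i++)
        c[i] = average_model(a[i], b[i]);
}
```

Keeping the carry bit of the sum before shifting is what makes the average of two large values (for example 65535 and 65535) come out right instead of wrapping.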

Slide 49

SIMD/Vector Transformation
Single Instruction, Multiple Data
Fuses instructions into a “vector”
Allows replication of the same operation multiple times in one instruction
Consider: computing four averages in one instruction
The following TIE code computes multiple iterations in a single instruction by combining fusion and SIMD
    regfile VEC 64 8 v
    operation VAVERAGE {out VEC res, in VEC input0, in VEC input1} {} {
        wire [67:0] tmp = { input0[63:48] + input1[63:48],
                            input0[47:32] + input1[47:32],
                            input0[31:16] + input1[31:16],
                            input0[15:0]  + input1[15:0] };
        assign res = { tmp[67:52], tmp[50:35], tmp[33:18], tmp[16:1] };
    }

Slide 50

SIMD/Vector Transformation
Computing four 16-bit averages
Each data vector must be 64 bits (4 x 16 bits)
Create a new register file and a new instruction
VEC - eight 64-bit registers to hold data vectors
VAVERAGE - takes operands from VEC, computes the averages, saves the results into VEC
    VEC *a, *b, *c;
    /* n counts 16-bit elements; each VEC holds four of them */
    for (i = 0; i < n / 4; i++)
        c[i] = VAVERAGE(a[i], b[i]);
New datatype recognized
TIE automatically creates new load and store instructions to move 64-bit vectors between the VEC register file and memory
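
The four-lane average can likewise be modeled on an ordinary 64-bit integer. This is a sketch under assumptions: a `uint64_t` stands in for the VEC type, lane 0 is taken to be bits 15:0, and the names `pack4`, `vaverage_model`, and `vaverage_arrays` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pack four 16-bit lanes into one 64-bit vector; l0 is bits 15:0. */
static uint64_t pack4(uint16_t l3, uint16_t l2, uint16_t l1, uint16_t l0)
{
    return ((uint64_t)l3 << 48) | ((uint64_t)l2 << 32) |
           ((uint64_t)l1 << 16) | (uint64_t)l0;
}

/* Average four packed unsigned 16-bit lanes of two 64-bit vectors. */
static uint64_t vaverage_model(uint64_t in0, uint64_t in1)
{
    uint64_t res = 0;
    for (unsigned lane = 0; lane < 4; lane++) {
        unsigned shift = 16 * lane;
        uint32_t x = (uint32_t)((in0 >> shift) & 0xFFFFu);
        uint32_t y = (uint32_t)((in1 >> shift) & 0xFFFFu);
        uint64_t avg = (x + y) >> 1;  /* 17-bit sum, keep bits 16:1 */
        res |= avg << shift;
    }
    return res;
}

/* Average n 16-bit elements stored four per vector. */
static void vaverage_arrays(const uint64_t *a, const uint64_t *b,
                            uint64_t *c, size_t n)
{
    assert(n % 4 == 0);  /* this sketch handles whole vectors only */
    for (size_t i = 0; i < n / 4; i++)
        c[i] = vaverage_model(a[i], b[i]);
}
```

Each lane is widened before the add so that a carry out of one lane cannot spill into its neighbor, which is the same reason the TIE version builds a 68-bit intermediate from four 17-bit sums.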

Slide 51

FLIX
Flexible Length Instruction Xtensions
Key to extreme extensibility
Huge performance gains possible
Code size reduction without code bloat
Similar to VLIW
Created by the XPRES Compiler

Slide 52

FLIX

Slide 53

FLIX - Usage
Used selectively when parallelism is needed
Avoids code bloat
Used seamlessly and modelessly alongside standard 16- and 24-bit instructions

Slide 54

XPRES Compiler
Powerful synthesis tool
Creates tailored processor descriptions
Runs on native C/C++ code
Three optimization methods
Returns optimal configurations along with pros and cons (tradeoffs)

Slide 55

XPRES Compiler
Analyzes C/C++ code
Generates possible configurations
Compares performance criteria to silicon size (cost)
Returns possible configurations
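
One way to picture the performance-versus-size comparison is as a filter that keeps only configurations not beaten on both axes by another one. The sketch below is purely illustrative and is not a description of how XPRES works internally: the `config` type, the function names, and any numbers fed to it are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    const char *name;      /* label for a candidate configuration */
    unsigned long cycles;  /* estimated run time; lower is faster */
    unsigned long gates;   /* estimated silicon size; lower is cheaper */
} config;

/* A config is dominated if another is no worse on both axes and
   strictly better on at least one. */
static int dominated(const config *all, size_t n, size_t idx)
{
    assert(idx < n);
    for (size_t j = 0; j < n; j++) {
        if (j == idx)
            continue;
        int no_worse = all[j].cycles <= all[idx].cycles &&
                       all[j].gates <= all[idx].gates;
        int better = all[j].cycles < all[idx].cycles ||
                     all[j].gates < all[idx].gates;
        if (no_worse && better)
            return 1;
    }
    return 0;
}

/* Copy the non-dominated configs into out; returns how many were kept. */
static size_t pareto_front(const config *all, size_t n, config *out)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++)
        if (!dominated(all, n, i))
            out[kept++] = all[i];
    return kept;
}
```

What remains after the filter is a set of trade-offs rather than a single winner: every surviving configuration is either faster or smaller than each of the others, so the final choice depends on the designer's cost budget.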

Slide 56

XPRES Compiler - Results
Application dependent
Compute-intensive programs
Data-intensive programs
More is sometimes less
Operation slots in FLIX

Slide 57

XPRES – 4 Program Test
“Bit Manipulator” program
Cut cycles to a third

Slide 58

XPRES – 4 Program Test
H.264 Deblocking Filter
6% performance improvement

Slide 59

XPRES – 4 Program Test
MPEG4 decoder
23% performance increase

Slide 60

XPRES – 4 Program Test
SAD – sum of absolute differences
63% performance increase

Slide 61

Xtensa Hi-Fi 2 Audio Engine
Add-on package for Xtensa LX
Advantages over common audio processors:
Better sound quality for compressed files because of the increased precision available for intermediate calculations (24 bits rather than 16)
24-bit audio fully compatible with modern audio standards

Slide 62

Xtensa Hi-Fi 2 Audio Engine
Audio packages are integrated into an SoC design, so no additional codec development is required
Integrated audio packages:
Dolby Digital AC-3 Decoder, Dolby Digital AC-3 Consumer Encoder, QSound MicroQ, MP3 Encoder/Decoder, MPEG-4 aacPlus v1 and v2 Encoder/Decoder, MPEG-2/4 AAC LC Encoder/Decoder, WMA Encoder/Decoder, AMR narrowband speech codec, AMR wideband speech codec

Slide 63

Xtensa Hi-Fi 2 Audio Engine
Uses over 300 audio-specific DLP instructions
Features dual multiply-accumulate for 24x24-bit and 32x16-bit arithmetic on both units
“Delivers noticeably superior sound quality even when decoding prerecorded 16-bit encoded music files.”

Slide 64

Speed-up Example
GSM audio codec, written in C
Profiling the code on an unaltered RISC architecture showed that 80% of the processor cycles were devoted to multiplication
Simply by adding a hardware multiplier, the designer can reduce the number of cycles required from 204 million to 28 million
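
The cycle counts quoted above translate into a speed-up factor with simple arithmetic. The helper names below are hypothetical; the only inputs taken from the slide are the two cycle counts.

```c
#include <assert.h>

/* Speed-up factor: how many times faster the new version runs. */
static double speedup(double cycles_before, double cycles_after)
{
    assert(cycles_before > 0.0 && cycles_after > 0.0);
    return cycles_before / cycles_after;
}

/* Fraction of the original cycles that the change removed. */
static double fraction_removed(double cycles_before, double cycles_after)
{
    assert(cycles_before > 0.0 && cycles_after >= 0.0 &&
           cycles_after <= cycles_before);
    return (cycles_before - cycles_after) / cycles_before;
}
```

With the slide's figures, `speedup(204e6, 28e6)` is about 7.3 and `fraction_removed(204e6, 28e6)` is about 0.86, that is, roughly 86% of the original cycles are eliminated.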

Slide 65

Speed-up Example
Viterbi butterfly instruction
Viterbi decoding recovers data from a convolutionally encoded stream
The butterfly consists of 8 logical operations
8 butterfly operations are used to decode each symbol in the received digital information stream
The designer can add a Viterbi instruction to the Xtensa ISA. The extension can use the 128-bit memory bus to load data for 8 symbols at once. This results in an average execution time of 0.16 cycles per butterfly. An unaugmented Xtensa LX executes a Viterbi butterfly in 42 cycles.
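
A textbook add-compare-select butterfly gives a feel for the work that such an instruction would fold into one operation. This is a generic sketch, not the GSM codec's code or Tensilica's instruction: the breakdown of the 8 operations into 4 additions, 2 comparisons, and 2 selections is an assumption, as are the type and function names.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t metric0, metric1;     /* surviving path metrics */
    uint8_t decision0, decision1;  /* 0: first predecessor, 1: second */
} butterfly_out;

/* One butterfly: two predecessor path metrics (pm0, pm1) and two branch
   metrics (bm0, bm1) produce two new path metrics, smaller is better. */
static butterfly_out viterbi_butterfly(uint32_t pm0, uint32_t pm1,
                                       uint32_t bm0, uint32_t bm1)
{
    /* Guard against wrap-around in the additions below. */
    assert(pm0 <= UINT32_MAX - bm0 && pm0 <= UINT32_MAX - bm1);
    assert(pm1 <= UINT32_MAX - bm0 && pm1 <= UINT32_MAX - bm1);

    /* Four additions: candidate metrics for the two successor states. */
    uint32_t c00 = pm0 + bm0;
    uint32_t c10 = pm1 + bm1;
    uint32_t c01 = pm0 + bm1;
    uint32_t c11 = pm1 + bm0;

    /* Two comparisons and two selections: keep the smaller candidate. */
    butterfly_out out;
    out.decision0 = (uint8_t)(c10 < c00);
    out.metric0 = out.decision0 ? c10 : c00;
    out.decision1 = (uint8_t)(c11 < c01);
    out.metric1 = out.decision1 ? c11 : c01;
    return out;
}
```

On a plain RISC core each of these steps is a separate instruction plus the loads and stores around it, which is why fusing the butterfly and widening the memory path pays off so strongly.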

Slide 66

EEMBC Networking Benchmark
Xtensa LX received the highest score ever achieved on the EEMBC Networking version 2 test
Xtensa LX has a 4x code density advantage and a 100x advantage in both die area and power dissipation

Slide 67

EEMBC Networking Benchmark
Normalized (per MHz) EEMBC TCPmark
Simulates Internet-enabled client-side performance

Slide 68

EEMBC Networking Benchmark
Normalized (per MHz) EEMBC IPmark
Simulates performance in network routers, gateways, and switches

Slide 69

EEMBC Networking Benchmark
Total Code Size

Slide 70

Slide 71

How Xtensa Compares

Slide 72

How Xtensa Compares

Slide 73

How Xtensa Compares (cont)

Slide 74

Uses of Xtensa Products
NVIDIA – licensed Xtensa LX
“We were very impressed with Tensilica's automated approach for both the processor extensions and the generation of the associated software tools”

Slide 75

Uses of Xtensa Products
LG cell phone
The phone is digital-broadcast enabled
The Xtensa processor was used because it enabled LG to “cut design time significantly and be first to market with this exciting new technology.”
Terrestrial digital multimedia broadcast system in Korea

Slide 76

In case you are wondering...
Tensilica's announced licensees include Agilent, ALPS, AMCC (JNI Corporation), Astute Networks, ATI, Avision, Bay Microsystems, Berkeley Wireless Research Center, Broadcom, Cisco Systems, Conexant Systems, Cypress, Crimson Microsystems, ETRI, FUJIFILM Microdevices, Fujitsu Ltd., Hudson Soft, Hughes Network Systems, Ikanos Communications, LG Electronics, Marvell, NEC Laboratories America, NEC Corporation, NetEffect, Neterion, Nippon Telegraph and Telephone (NTT), NVIDIA, Olympus Optical Co. Ltd., sci-worx, Seiko Epson, Solid State Systems, Sony, STMicroelectronics, Stretch, TranSwitch Corporation, and Victor Company of Japan (JVC).

Tensilica Xtensa

Содержание

OverviewBackgroundChanges in progress from Xtensa to Xtensa LXAutomated Development ProcessISATIE LanguageBenchmarks

TensilicaFounded in 1997 in Santa Clara, California by a group of

Why?Embedded application problems with high cost custom designs or low performance

The Problem with RTLRapidly increasing number of transistors require more RTL

Tensilica’s SolutionXtensaFocusing on design through the processor, and not through hardwired

XtensaFirst appearing in 199932-bit microprocessor core with a graphical configuration interface

Xtensa – In a NutshellEnables embedded system designers to build better,

Xtensa - DeliverablesProvided as synthesizable RTL coresGate count range: 25,000 –

Xtensa – Verification ChallengesTo extensively verify the configurable processor to ensure

Xtensa – Basic Architecture78 instructionsfive-stage pipeline that supports single-cycle execution1 -

Xtensa – Basic ArchitectureProcessor Configuration Power Usage: 200mW, 0.25 μm, 1.5V Clock Speed:

Xtensa - ISAPriorities used in ISA DevelopmentCode Size, Configurability, Processor Cost,

Xtensa IIIWith Virtual IP Group developed an MP3 audio decoder for

Xtensa IV Used white box verification methodology for the original developmentIncludes 0-In

Xtensa V350MHz (synthesized), as small as 18K gates (0.25mm2) More flexible

Xtensa V – Performance Cost Timeline

Xtensa 6Extremely fast customization pathThree major enhancements from Xtensa VAuto customize

Xtensa LX“Fastest processor core ever” – TensilicaI/O bandwidth, compute parallelism, and

Xtensa 6 Vs Xtensa LX

Xtensa LXStrongest selling point is performanceDSP operations can be encapsulated into

Xtensa LX Vs General Purpose

Xtensa LX – Traditional Limitations1 Operation / cycleLoad/Store overhead

Xtensa LXOptions:Extra load/store unit, wide interfaces, compound instructionsUp to 19 GB/sec

Xtensa LX – HighlightsLower power usageI/O throughput at RTL speedsOutstanding computer

Xtensa LX – Lower Power UseageAutomated the insertion of fine-grain clock

Outstanding Computing PerformanceExtensible using FLIX (Flexible Length Instruction Xtensions)Similar to VLIW

XPRES CompilerPowerful synthesis toolCreates tailored processor descriptionsRun on native C/C++ code

Automated DevelopmentClients log into websiteAccessing Process GeneratorBuilds a model in RTL

Automated DevelopmentCreate special instructions described and written in TIETIE semantics allow

Xtensa LX – Basic ArchitectureProcessor Configuration Power Usage: 76 μW/MHz , 47

Xtensa LX Architecture32-bit ALU1 or 2 Load/Store ModelRegisters32-bit general purpose register

Xtensa LX ArchitectureGeneral Purpose AR Register File32 or 64 registersInstructions have

Xtensa LX Architecture

Xtensa LX Pipelining5 or 7 Stage Pipeline Design5 stage pipeline has

Xtensa LX Instruction SetISA consists of 80 core instructions including both

Xtensa LX Instruction SetProcessor Control InstructionsRSR, WSR, XSRRead Special Register, Write

Xtensa LX ISA – Building BlocksMUL32MUL32 adds 32 bit multiplierMUL16 and

Xtensa LX ISA – Building BlocksFloating Point Unit32-bit, single precision, floating-point

Vectra LX DSP EngineFLIX-based (why it is 64 bits)Vectra LX instructions

Vectra LX DSP Engine

Tensilica Instruction ExtensionMethod used to extend the processor’s architecture and instruction

Tensilica Instruction ExtensionTIE CompilerGenerates file used to configure software development tools

TIEResembles VerilogMore concise than RTL (it omits all sequential logic, pipeline

TIE Queues and PortsNew way to communicate with external devicesQueues: data

TIETIE Combines multiple operations into one using:FusionSIMD/Vector TransformationFLIX

FusionAllows you to combine dependent operations into a single instructionConsider: computing

FusionFuse the two operations into a single TIE instruction operation AVERAGE{out AR

SIMD/Vector TransformationSingle Instruction, Multiple DataFusing instructions into a “vector”Allows replication of

SIMD/Vector TransformationComputing four 16-bit averagesEach data vector must be 64 bits

FLIXFlexible length instruction extensionKey in extreme extensibilityHuge performance gains possibleCode size

FLIX

FLIX - UsageUsed selectively when parallelism is neededAvoids code bloatUsed seemlessly

XPRES CompilerPowerful synthesis toolCreates tailored processor descriptionsRun on native C/C++ codeThree

XPRES CompilerAnalyzes C/C++ codeGenerates possible configurationsCompares performance criteria to silicon size

XPRES Compiler - ResultsApplication dependentCompute intensive programsData intensive programs More is

XPRES – 4 Program Test“Bit Manipulator” programCut cycles to a third

XPRES – 4 Program TestH.264 Deblocking Filter6% performance improvement

XPRES – 4 Program TestMPEG4 decoder23% performance increase

XPRES – 4 Program TestSAD – sum of absolute difference63% performance

Xtensa Hi-Fi 2 Audio EngineAdd-on package for Xtensa LXAdvantages over common

Xtensa Hi-Fi 2 Audio EngineAudio packages integrated into an SOC design,

Xtensa Hi-Fi 2 Audio EngineUses over 300 audio specific DLP instructions.

Speed-up ExampleGSM Audio Codec – written in CProfiling code using unaltered

Speed-up ExampleViterbi butterfly instructionActs like compression for the dataConsists of 8

EEMBC Networking BenchmarkXtensa LX received highest benchmark ever achieved on the

EEMBC Networking BenchmarkNormalized (per MHz) EEMBC TCPmark Simulates performance in internet

EEMBC Networking BenchmarkNormalized (by MHz) EEMBC IPmarkSimulates performance in network routers,

EEMBC Networking BenchmarkTotal Code Size

How Xtensa Compares

How Xtensa Compares

How Xtensa Compares (cont)

Uses of Xtensa ProductsNVIDIA – Licensed Xtensa LX “We were very

Uses of Xtensa ProductsLG Cell PhonePhone is digital broadcast enabledXtensa processor

Overview
Background
Changes in progress from Xtensa to Xtensa LX
Automated Development Process
ISA
TIE Language
Benchmarks

Tensilica
Founded in 1997 in Santa Clara, California by a group of

Why?
Embedded application problems with high cost custom designs or low performance

The Problem with RTL
Rapidly increasing number of transistors require more RTL

Tensilica’s Solution
Xtensa
Focusing on design through the processor, and not through hardwired

Xtensa
First appearing in 1999
32-bit microprocessor core with a graphical configuration interface

Xtensa – In a Nutshell
Enables embedded system designers to build better,

Xtensa - Deliverables
Provided as synthesizable RTL cores
Gate count range: 25,000 –

Xtensa – Verification Challenges
To extensively verify the configurable processor to ensure

Xtensa – Basic Architecture
78 instructions
five-stage pipeline that supports single-cycle execution
1 -

Xtensa – Basic Architecture
Processor Configuration
Power Usage: 200mW, 0.25 μm, 1.5V
Clock Speed:

Xtensa - ISA
Priorities used in ISA Development
Code Size, Configurability, Processor Cost,

Xtensa III
With Virtual IP Group developed an MP3 audio decoder for

Xtensa IV
Used white box verification methodology for the original development
Includes 0-In

Xtensa V
350 MHz (synthesized), as small as 18K gates (0.25 mm²)
More flexible

Xtensa 6
Extremely fast customization path
Three major enhancements from Xtensa V
Auto customize

Xtensa LX
“Fastest processor core ever” – Tensilica
I/O bandwidth, compute parallelism, and

Xtensa LX
Strongest selling point is performance
DSP operations can be encapsulated into

Xtensa LX – Traditional Limitations
1 Operation / cycle
Load/Store overhead

Xtensa LX
Options:
Extra load/store unit, wide interfaces, compound instructions
Up to 19 GB/sec

Xtensa LX – Highlights
Lower power usage
I/O throughput at RTL speeds
Outstanding computer

Xtensa LX – Lower Power Usage
Automated the insertion of fine-grain clock

Outstanding Computing Performance
Extensible using FLIX
(Flexible Length Instruction Xtensions)
Similar to VLIW

XPRES Compiler
Powerful synthesis tool
Creates tailored processor descriptions
Runs on native C/C++ code

Automated Development
Clients log into website
Accessing the Processor Generator
Builds a model in RTL

Automated Development
Create special instructions described and written in TIE
TIE semantics allow

Xtensa LX – Basic Architecture
Processor Configuration
Power Usage: 76 μW/MHz, 47

Xtensa LX Architecture
32-bit ALU
1 or 2 Load/Store Model
Registers
32-bit general purpose register

Xtensa LX Architecture
General Purpose AR Register File
32 or 64 registers
Instructions have

Xtensa LX Pipelining
5 or 7 Stage Pipeline Design
5 stage pipeline has

Xtensa LX Instruction Set
ISA consists of 80 core instructions including both

Xtensa LX Instruction Set
Processor Control Instructions
RSR, WSR, XSR
Read Special Register, Write
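
The three register-transfer instructions can be illustrated with a small behavioral model in Python. This is only a sketch: the class name, the 16-entry register list, and the special-register number 3 are illustrative assumptions, not details taken from the slide. XSR is modeled as an exchange between an address register and a special register.

```python
class RegisterModel:
    """Toy model of address registers (AR) and special registers (SR)."""

    def __init__(self, num_ar=16):
        self.ar = [0] * num_ar  # address registers; size is an arbitrary choice
        self.sr = {}            # special registers, keyed by register number

    def rsr(self, at, sr):
        """Read Special Register: copy SR[sr] into AR[at]."""
        self.ar[at] = self.sr.get(sr, 0)

    def wsr(self, at, sr):
        """Write Special Register: copy AR[at] into SR[sr]."""
        self.sr[sr] = self.ar[at]

    def xsr(self, at, sr):
        """Exchange: swap AR[at] and SR[sr] in one step."""
        self.ar[at], self.sr[sr] = self.sr.get(sr, 0), self.ar[at]
```

In this model an exchange replaces a read followed by a write through a temporary register, which is the usual motivation for a swap-style instruction.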

Xtensa LX ISA – Building Blocks
MUL32
MUL32 adds a 32-bit multiplier
MUL16 and

Xtensa LX ISA – Building Blocks
Floating Point Unit
32-bit, single precision, floating-point

Vectra LX DSP Engine
FLIX-based (which is why it is 64 bits wide)
Vectra LX instructions

Tensilica Instruction Extension
Method used to extend the processor’s architecture and instruction

Tensilica Instruction Extension
TIE Compiler
Generates file used to configure software development tools

TIE
Resembles Verilog
More concise than RTL (it omits all sequential logic, pipeline

TIE Queues and Ports
New way to communicate with external devices
Queues: data

TIE
TIE combines multiple operations into one using:
Fusion
SIMD/Vector Transformation
FLIX

Fusion
Allows you to combine dependent operations into a single instruction
Consider: computing

Fusion
Fuse the two operations into a single TIE instruction
operation AVERAGE{out AR
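
The TIE text on this slide is cut off, so the operation itself cannot be reproduced here. As a rough behavioral model in Python (not TIE), fusion can be pictured as replacing an add followed by a shift with one step that does both. The function names and the 16-bit operand width are assumptions made for the sketch.

```python
MASK16 = 0xFFFF  # assumed operand width; the slide's actual width is not shown


def average_two_steps(a, b):
    """Unfused version: an add, then a separate shift."""
    total = (a & MASK16) + (b & MASK16)  # step 1: add (result may need 17 bits)
    return total >> 1                    # step 2: shift right by one


def average_fused(a, b):
    """Model of one fused operation: add and halve in a single step."""
    return ((a & MASK16) + (b & MASK16)) >> 1
```

Both functions return the same value; the point of fusing is that the dependent add and shift occupy one instruction instead of two.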

SIMD/Vector Transformation
Single Instruction, Multiple Data
Fusing instructions into a “vector”
Allows replication of

SIMD/Vector Transformation
Computing four 16-bit averages
Each data vector must be 64 bits
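
A minimal Python sketch of the idea, assuming four unsigned 16-bit lanes packed into one 64-bit value with lane 0 in the low bits. The helper names and the lane order are illustrative assumptions, not taken from the slide.

```python
LANES = 4
LANE_BITS = 16
LANE_MASK = (1 << LANE_BITS) - 1


def pack(values):
    """Pack four 16-bit values into one 64-bit word (lane 0 in the low bits)."""
    if len(values) != LANES:
        raise ValueError("expected exactly four lane values")
    word = 0
    for i, v in enumerate(values):
        word |= (v & LANE_MASK) << (i * LANE_BITS)
    return word


def unpack(word):
    """Split a 64-bit word back into its four 16-bit lanes."""
    return [(word >> (i * LANE_BITS)) & LANE_MASK for i in range(LANES)]


def average_4x16(a, b):
    """Average corresponding lanes of two 64-bit vectors independently."""
    result = 0
    for i in range(LANES):
        shift = i * LANE_BITS
        x = (a >> shift) & LANE_MASK
        y = (b >> shift) & LANE_MASK
        result |= ((x + y) >> 1) << shift  # each lane's carry stays in its lane
    return result
```

The loop here runs the lanes one after another; in hardware the four lane datapaths would be replicated and evaluated side by side.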

FLIX
Flexible Length Instruction Xtensions
Key to extreme extensibility
Huge performance gains possible
Code size

FLIX - Usage
Used selectively when parallelism is needed
Avoids code bloat
Used seamlessly

XPRES Compiler
Powerful synthesis tool
Creates tailored processor descriptions
Runs on native C/C++ code
Three

XPRES Compiler
Analyzes C/C++ code
Generates possible configurations
Compares performance criteria to silicon size

XPRES Compiler - Results
Application dependent
Compute intensive programs
Data intensive programs
More is

XPRES – 4 Program Test
“Bit Manipulator” program
Cut cycles to a third

XPRES – 4 Program Test
H.264 Deblocking Filter
6% performance improvement

XPRES – 4 Program Test
MPEG4 decoder
23% performance increase

XPRES – 4 Program Test
SAD – sum of absolute differences
63% performance
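
For reference, the SAD kernel itself is small. The Python sketch below shows the computation in its plain form; the function name and the flat-list input format are assumptions for illustration, and this is not the benchmark code used in the test.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equal-length sample blocks."""
    if len(block_a) != len(block_b):
        raise ValueError("blocks must have the same length")
    return sum(abs(a - b) for a, b in zip(block_a, block_b))
```

Each element pair needs a subtract, an absolute value, and an accumulate, and the pairs are independent of one another, which makes this kind of loop a natural candidate for fused and SIMD instructions.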

Xtensa Hi-Fi 2 Audio Engine
Add-on package for Xtensa LX
Advantages over common

Xtensa Hi-Fi 2 Audio Engine
Audio packages integrated into an SOC design,

Xtensa Hi-Fi 2 Audio Engine
Uses over 300 audio-specific DLP instructions.

Speed-up Example
GSM Audio Codec – written in C
Profiling code using unaltered

Speed-up Example
Viterbi butterfly instruction
Acts like compression for the data
Consists of 8
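
The slide text is truncated, so the exact makeup of the instruction is not recoverable here. As general background, a textbook Viterbi butterfly is an add-compare-select step on two path metrics. The Python sketch below assumes a minimum-distance convention and uses hypothetical parameter names; it is not a description of the Tensilica instruction.

```python
def viterbi_butterfly(pm_a, pm_b, bm, bm_alt):
    """Generic add-compare-select butterfly (minimum-metric convention).

    pm_a, pm_b: path metrics of the two predecessor states
    bm, bm_alt: branch metrics for the two possible transitions
    Returns (new_lo, dec_lo, new_hi, dec_hi); a decision bit of 1 means
    the path through the second predecessor (pm_b) was selected.
    """
    # Add: form both candidate metrics for each successor state.
    lo_from_a, lo_from_b = pm_a + bm, pm_b + bm_alt
    hi_from_a, hi_from_b = pm_a + bm_alt, pm_b + bm

    # Compare and select: keep the smaller metric and record the choice.
    dec_lo = 1 if lo_from_b < lo_from_a else 0
    dec_hi = 1 if hi_from_b < hi_from_a else 0
    new_lo = min(lo_from_a, lo_from_b)
    new_hi = min(hi_from_a, hi_from_b)
    return new_lo, dec_lo, new_hi, dec_hi
```

Written as ordinary code this takes four adds, two compares, and two selects per butterfly, which is why decoders benefit from a single instruction that performs the whole step.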

EEMBC Networking Benchmark
Xtensa LX received the highest benchmark score ever achieved on the

EEMBC Networking Benchmark
Normalized (per MHz) EEMBC TCPmark
Simulates performance in internet

EEMBC Networking Benchmark
Normalized (by MHz) EEMBC IPmark
Simulates performance in network routers,

EEMBC Networking Benchmark
Total Code Size

Uses of Xtensa Products
NVIDIA – Licensed Xtensa LX
“We were very

Uses of Xtensa Products
LG Cell Phone
Phone is digital broadcast enabled
Xtensa processor