

# MARK HOROWITZ

STANFORD UNIVERSITY



#### SCALING PROVIDED A GREAT RIDE



For a 2x scaling

Get 4x more gates,

Gates get 2x faster,

**Energy per operation decreases 8x**

Dennard, JSSC, pp. 256-268, Oct. 1974

No Exponential is Forever...but We Can Delay 'Forever', Moore ISSCC 2002
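The figures above follow from ideal constant-field (Dennard) scaling, in which every linear dimension and the supply voltage shrink by the same factor. A minimal sketch of that arithmetic, where the function name `dennard_scaling` and the dictionary keys are illustrative and not from the talk:

```python
def dennard_scaling(k: float) -> dict:
    """Ideal constant-field scaling for a linear shrink factor k (assumed model)."""
    return {
        # Both width and length shrink by k, so k*k more gates fit per unit area.
        "gates_per_area": k ** 2,
        # Gate delay scales as 1/k, so gates switch k times faster.
        "gate_speed": k,
        # Switching energy ~ C * V^2, with C ~ 1/k and V ~ 1/k, giving 1/k^3.
        "energy_per_switch": 1 / k ** 3,
    }


if __name__ == "__main__":
    for name, factor in dennard_scaling(2).items():
        print(f"{name}: {factor}")
```

For `k = 2` this reproduces the 4x, 2x, and 8x figures on the slide. It models only the ideal rules, not the voltage and leakage limits that make them break down.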

#### HOUSTON, WE HAVE A PROBLEM



#### TO CONTINUE TO SCALE PERFORMANCE



Apple's A9 (2015)

https://www.anandtech.com/show/9686/the-apple-iphone-6s-and-iphone-6s-plus-review/3

#### HOW TO CREATE THESE ACCELERATORS?

Study application

Design hardware

```
// Fragment 1: SRAM bank configuration read/write
always_comb begin
    if (config_en && config_wr) begin
        // Configuration assumes that 2 * CONFIG_DATA_WIDTH >= BANK_DATA_WIDTH
        if (CONFIG_DATA_WIDTH * 2 < BANK_DATA_WIDTH)
            $error("Configuration data width must be at least hal…
        if (config_addr[ADDR_OFFSET-1] == 0) begin
            // configuring LSB bits
            sram_to_mem_wen = 1;
            sram_to_mem_ren = 0;
            sram_to_mem_cen = 1;
            sram_to_mem_addr = config_addr[ADDR_OFFSET +: BANK_AD…
            sram_to_mem_data = {{{BANK_DATA_WIDTH-CONFIG_DATA_WID…
            sram_to_mem_bit_sel = {{{BANK_DATA_WIDTH-CONFIG_DATA_…
            config_rd_data = 0;
        end
        else begin
            // configuring MSB bits
            sram_to_mem_wen = 1;
            sram_to_mem_ren = 0;
            sram_to_mem_cen = 1;
            sram_to_mem_addr = config_addr[ADDR_OFFSET +: BANK_AD…
            sram_to_mem_data = {config_wr_data[BANK_DATA_WIDTH-CO…
            sram_to_mem_bit_sel = {{{BANK_DATA_WIDTH-CONFIG_DATA_…
            config_rd_data = 0;
        end
    end
    else if (config_en && config_rd) begin
        sram_to_mem_wen = 0;
        sram_to_mem_ren = 1;
        sram_to_mem_cen = 1;
        sram_to_mem_addr = config_addr[ADDR_OFFSET +: BANK_ADDR_W…
        sram_to_mem_data = 0;
        sram_to_mem_bit_sel = 0;
        if (config_addr[ADDR_OFFSET-1] == 0) begin
            config_rd_data = data_out[0 +: CONFIG_DATA_WIDTH];
        end
        else begin
            config_rd_data = data_out[BANK_DATA_WIDTH-1 -: CONFIG_…

// Fragment 2: I/O controller configuration decode
integer j, k;

wire [CONFIG_FEATURE_WIDTH-1:0] config_feature_addr;
wire [CONFIG_REG_WIDTH-1:0] config_reg_addr;
reg config_en_io_ctrl [`$num_io_channels-1`:0];
reg config_en_io_int;

assign config_feature_addr = config_addr[0 +: CONFIG_FEATURE_WIDTH];
assign config_reg_addr = config_addr[CONFIG_FEATURE_WIDTH +: CONFIG_REG_WIDTH];

always_comb begin
    for(j=0; j<`$num_io_channels`; j=j+1) begin
        config_en_io_ctrl[j] = config_en && (config_feature_addr == j);
    end
    config_en_io_int = config_en && (config_feature_addr == `$num_io_channels`);
end

always_ff @(posedge clk or posedge reset) begin
    if (reset) begin
        switch_sel <= 0;
    end
    else begin
        if (config_en_io_int && config_wr) begin
            case (config_reg_addr)
                0: switch_sel <= config_wr_data;
            endcase
        end
    end
end

always_ff @(posedge clk or posedge reset) begin
    if (reset) begin
        for(j=0; j<`$num_io_channels`; j=j+1) begin
            io_ctrl_mode[j] <= 0;
            io_ctrl_start_addr[j] <= 0;
            io_ctrl_num_words[j] <= 0;
```
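The `config_feature_addr` and `config_reg_addr` assignments in the RTL above split the configuration address into a feature field (low bits) and a register field (the bits above it), and the enable logic then selects at most one target. A rough Python model of that decode, assuming example field widths of 4 bits each (the real `CONFIG_FEATURE_WIDTH` and `CONFIG_REG_WIDTH` values are not given on the slide) and hypothetical function names:

```python
def decode_config_addr(config_addr: int, feature_width: int = 4, reg_width: int = 4):
    """Mirror config_addr[0 +: F] and config_addr[F +: R] (widths are assumed)."""
    feature = config_addr & ((1 << feature_width) - 1)
    reg = (config_addr >> feature_width) & ((1 << reg_width) - 1)
    return feature, reg


def config_enables(config_en: bool, config_addr: int, num_io_channels: int):
    """Return (per-channel enables, interconnect enable), as in the always_comb block."""
    feature, _ = decode_config_addr(config_addr)
    # One enable per I/O channel: asserted only when the feature field names it.
    io_ctrl = [config_en and feature == j for j in range(num_io_channels)]
    # The feature index just past the last channel selects the interconnect registers.
    io_int = config_en and feature == num_io_channels
    return io_ctrl, io_int


if __name__ == "__main__":
    print(decode_config_addr(0b1011_0010))
    print(config_enables(True, 0b1011_0010, num_io_channels=4))
```

Because the enables are derived from a single decoded field, at most one of them is asserted for any address, which is the property the generated RTL relies on.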

Write software

```
// Fragment 1: IR visitor helpers

// Identifies for loop name in code statement.
// Gives name of first for loop
string name_for_loop(Stmt s) {
    ContainForLoop cfl;
    s.accept(&cfl);
    return cfl.varnames[0];
}

// Identifies all for loop names in code statement…
vector<string> contained_for_loop_names(Stmt s) {
    ContainForLoop cfl;
    s.accept(&cfl);
    return cfl.varnames;
}

class UsesVariable : public IRVisitor {
    using IRVisitor::visit;
    void visit(const Variable *op) {
        if (op->name == varname) {
            …
        }
        return;
    }

    void visit(const Call *op) {
        // only go first two variables, not loop bou…
        if (op->name == "write_stream" && op->args.si…
            op->args[0].accept(this);
            op->args[1].accept(this);
        } else {
            IRVisitor::visit(op);
        }
    }
    …
    UsesVariable(string varname) : used(false), var…
};

// identifies target variable string in code stat…
bool variable_used(Stmt s, string varname) {

// Fragment 2: Halide generator for demosaicing
class Demosaic : public Halide::Generator<Demosaic> {
public:
    GeneratorParam<LoopLevel> intermed_compute_at{"intermed_compute_at", LoopLevel::inlined()};
    GeneratorParam<LoopLevel> intermed_store_at{"intermed_store_at", LoopLevel::inlined()};
    GeneratorParam<LoopLevel> output_compute_at{"output_compute_at", LoopLevel::inlined()};

    // Inputs and outputs
    Input<Func> deinterleaved{ "deinterleaved", Int(16), 3 };
    Output<Func> output{ "output", Int(16), 3 };

    // Defines outputs using inputs
    void generate() {
        // These are the values we already know from the input
        // x_y = the value of channel x at a site in the input of channel y
        // gb refers to green sites in the blue rows
        // gr refers to green sites in the red rows

        // Give more convenient names to the four channels we know
        Func r_r, g_gr, g_gb, b_b;
        g_gr(x, y) = deinterleaved(x, y, 0);
        r_r(x, y) = deinterleaved(x, y, 1);
        b_b(x, y) = deinterleaved(x, y, 2);
        g_gb(x, y) = deinterleaved(x, y, 3);

        // These are the ones we need to interpolate
        Func b_r, g_r, b_gr, r_gr, b_gb, r_gb, r_b, g_b;

        // First calculate green at the red and blue sites

        // Try interpolating vertically and horizontally. Also compute
        // differences vertically and horizontally. Use interpolation in
        // whichever direction had the smallest difference.
        Expr gv_r = avg(g_gb(x, y-1), g_gb(x, y));
        Expr gvd_r = absd(g_gb(x, y-1), g_gb(x, y));
        Expr gh_r = avg(g_gr(x+1, y), g_gr(x, y));
        Expr ghd_r = absd(g_gr(x+1, y), g_gr(x, y));

        g_r(x, y) = select(ghd_r < gvd_r, gh_r, gv_r);
```
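The green-at-red step in the Halide code above interpolates in whichever direction varies less. A plain-Python sketch of that selection on nested lists, assuming `avg` is a rounding average and `absd` an absolute difference (their definitions are not shown on the slide), with `green_at_red` a hypothetical helper name:

```python
def avg(a: int, b: int) -> int:
    """Rounding average of two samples (assumed definition)."""
    return (a + b + 1) // 2


def absd(a: int, b: int) -> int:
    """Absolute difference of two samples (assumed definition)."""
    return abs(a - b)


def green_at_red(g_gr, g_gb, x: int, y: int) -> int:
    """Estimate green at a red site; planes are indexed as plane[y][x]."""
    # Vertical candidate and its variation, from green samples in the blue rows.
    gv = avg(g_gb[y - 1][x], g_gb[y][x])
    gvd = absd(g_gb[y - 1][x], g_gb[y][x])
    # Horizontal candidate and its variation, from green samples in the red rows.
    gh = avg(g_gr[y][x + 1], g_gr[y][x])
    ghd = absd(g_gr[y][x + 1], g_gr[y][x])
    # Mirror select(ghd_r < gvd_r, gh_r, gv_r): prefer the smoother direction.
    return gh if ghd < gvd else gv


if __name__ == "__main__":
    g_gb = [[10, 10], [50, 50]]   # large vertical change
    g_gr = [[20, 22], [20, 22]]   # small horizontal change
    print(green_at_red(g_gr, g_gb, x=0, y=1))
```

Picking the direction with the smaller difference avoids averaging across an edge, which is what would otherwise blur or miscolor sharp boundaries.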

DISTRIBUTION STATEMENT A. Approved for public release

#### NOT SO SECRET DOWNSIDE

\$100M

Many years

#### **NOT SURPRISING**



It is a waterfall model of design!


#### SOFTWARE ISN'T BUILT THAT WAY

Moved away from that style decades ago

Enables small teams to build amazing apps



#### **AGILE DESIGN**

Rapidly iterate on the end-to-end system

Learn about the real problems and goals



Agile Hardware

#### **AGILE DESIGN**

It is about reuse

It is about clean interfaces

It is about constructors, not instances
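One way to read "constructors, not instances": instead of hand-writing one fixed block, write a program that builds the block for any parameters. A small sketch, assuming a hypothetical `make_mux` generator that emits Verilog text; it is not taken from the AHA code base:

```python
def make_mux(num_inputs: int, width: int) -> str:
    """Return Verilog source for a num_inputs-to-1 mux of the given bit width."""
    if num_inputs < 2:
        raise ValueError("a mux needs at least two inputs")
    # Number of select bits needed to address every input.
    sel_bits = (num_inputs - 1).bit_length()
    ports = [f"    input  [{width - 1}:0] in{i}," for i in range(num_inputs)]
    cases = [f"            {sel_bits}'d{i}: out = in{i};" for i in range(num_inputs)]
    lines = [
        f"module mux{num_inputs}_w{width} (",
        *ports,
        f"    input  [{sel_bits - 1}:0] sel,",
        f"    output reg [{width - 1}:0] out",
        ");",
        "    always @(*) begin",
        "        case (sel)",
        *cases,
        f"            default: out = {width}'d0;",
        "        endcase",
        "    end",
        "endmodule",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # The same constructor yields differently sized instances on demand.
    print(make_mux(4, 16))
    print(make_mux(3, 8))
```

The instance is a by-product; the reusable artifact is the constructor, so a change in width or input count means re-running it instead of editing RTL by hand.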

#### PRODUCTIVITY IS THE ISSUE IN HARDWARE



Source: IBS

#### STILL NEED TO DEAL WITH FABRICATION



https://upload.wikimedia.org/wikipedia/commons/thumb/e/eb/12-inch\_silicon\_wafer.jpg/1024px-12-inch\_silicon\_wafer.jpg

#### **NEED TO EVOLVE THE HARDWARE**

Use a CGRA (coarse-grained reconfigurable array): a configurable framework



#### **AHA VISUAL COMPUTING**

A new way to create DSSoCs (domain-specific SoCs)



[Figure labels: Compile, Evolves, Optimize]


#### FIRST GENERATION CGRA







#### APPLICATION COMPILATION FLOW

#### Halide








#### MAINTAINING THE FLOW THROUGH CHANGES

Many tools need to know about your design

You are building a "world"
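A simple illustration of why this matters: when the hardware changes, every tool's view of it should be regenerated from one description instead of being edited by hand. The `PeSpec` class and both emitter functions below are hypothetical and only sketch the idea; they are not the project's actual tooling:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeSpec:
    """Single description of a processing element (illustrative fields only)."""
    ops: tuple          # operation names; an op's opcode is its position
    data_width: int


def rtl_localparams(spec: PeSpec) -> str:
    """Hardware view: localparam declarations for the RTL."""
    lines = [f"localparam DATA_WIDTH = {spec.data_width};"]
    lines += [f"localparam OP_{op.upper()} = {i};" for i, op in enumerate(spec.ops)]
    return "\n".join(lines)


def compiler_op_table(spec: PeSpec) -> dict:
    """Compiler view: map from operation name to opcode."""
    return {op: i for i, op in enumerate(spec.ops)}


if __name__ == "__main__":
    spec = PeSpec(ops=("add", "mul"), data_width=16)
    print(rtl_localparams(spec))
    print(compiler_op_table(spec))
```

Adding an operation to `ops` updates the RTL constants and the compiler table together, so the two views cannot drift apart.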






#### DSSOC DOESN'T NEED TO BE EXPENSIVE

One just needs to think about the problem differently

We have already created one working chip/system using this flow

And have the next generation of the system working

Stay Tuned for Future Results ...



# ERI ELECTRONICS RESURGENCE INITIATIVE

### SUMMIT

2019 | Detroit, MI | July 15 - 17