

# WADE SHEN

#### **PROGRAM MANAGER** DARPA/MTO

# HIERARCHICAL IDENTIFY VERIFY EXPLOIT (HIVE)

DISTRIBUTION STATEMENT A. Approved for public release. Distribution is Unlimited


#### THE HIVE PROGRAM

**Program Objective:** A graph processing stack yielding a 1,000x increase in computational efficiency over GPU solutions for DoD graph analytics.



### MANY DOD PROBLEMS ARE GRAPH PROBLEMS

| Mission area | Task                             | Graph algorithm           |
|--------------|----------------------------------|---------------------------|
| Intelligence | Geolocation inference            | Label Propagation         |
| Intelligence | Persona de-aliasing              | Stochastic Graph Matching |
| Intelligence | Target prioritization            | Personalized PageRank     |
| Intelligence | Seeded target discovery          | Vertex Nomination         |
| Intelligence | Organization discovery           | Local Community Detection |
| Intelligence | Detection of money laundering    | Query by Example          |
| Intelligence | Leadership detection             | Role prediction           |
| Operations   | Target audience discovery        | Snowball Sampling         |
| Operations   | Network mapping                  | Community Detection       |
| Operations   | Network infrastructure discovery | Graph Projection          |
| Operations   | Cyber attack detection           | Anomaly Detection         |
| Support      | Logistics/route plan             | Hierarchical Hub Labeling |
| Support      | HR selection                     | K-nearest neighbors       |
| Support      | Ops planning                     | Trellis search            |


## **GRAPH VS. NUMERIC WORKLOADS**

- The graph processing fallacy
  - GPUs/supercomputers designed for matrix math
  - All graphs = sparse matrices
  - Therefore GPUs/supercomputers process graphs well (the fallacy)
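
The middle premise holds as a matter of representation: a graph can be stored as a sparse adjacency matrix. The sketch below is a toy illustration, not anything from the HIVE program. The graph and the helper names `to_csr` and `bfs_levels` are assumptions. It stores a small graph in compressed sparse row (CSR) form and runs a breadth-first search over it. The inner loop does no arithmetic beyond an increment; its work is data-dependent index lookups, which is the opposite of the dense, regular math that matrix hardware is built for.

```python
from collections import deque


def to_csr(num_vertices, edges):
    """Build CSR arrays (row_ptr, col_idx) from a directed edge list."""
    counts = [0] * num_vertices
    for src, _ in edges:
        counts[src] += 1
    row_ptr = [0] * (num_vertices + 1)
    for v in range(num_vertices):
        row_ptr[v + 1] = row_ptr[v] + counts[v]
    col_idx = [0] * len(edges)
    cursor = row_ptr[:-1].copy()  # next free slot in each row
    for src, dst in edges:
        col_idx[cursor[src]] = dst
        cursor[src] += 1
    return row_ptr, col_idx


def bfs_levels(row_ptr, col_idx, source):
    """Return the BFS depth of every vertex (-1 if unreachable)."""
    level = [-1] * (len(row_ptr) - 1)
    level[source] = 0
    frontier = deque([source])
    while frontier:
        u = frontier.popleft()
        for k in range(row_ptr[u], row_ptr[u + 1]):
            v = col_idx[k]      # index loaded from memory ...
            if level[v] == -1:  # ... then used for a second, scattered load
                level[v] = level[u] + 1
                frontier.append(v)
    return level


# Toy graph (assumed for illustration): 0->1, 0->2, 1->3, 2->3, 3->4
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
row_ptr, col_idx = to_csr(5, edges)
levels = bfs_levels(row_ptr, col_idx, 0)
```

Each visited edge costs two dependent memory reads (`col_idx[k]`, then `level[v]`) and a branch, which is why the sparse-matrix view alone does not make a graph workload GPU-friendly.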



### **PROGRAM STRUCTURE**





# ERI ELECTRONICS RESURGENCE INITIATIVE

## SUMMIT

#### 2018 | SAN FRANCISCO, CA | JULY 23-25



# JOSHUA FRYMAN

#### SENIOR PRINCIPAL ENGINEER, PHD INTEL DCG, DARPA HIVE PI



# INTEL'S HIVE: FUTURE GRAPH ANALYTICS


## WHAT'S SO HARD ABOUT GRAPHS . . . ?



| Behavior                                  | Dense Compute                                                                                   | Graph Compute                                                                                  |
|-------------------------------------------|-------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------|
| Compute intensity / arithmetic properties | Lots of computationally intensive math ops; some data massaging that's spatio-temporal friendly | Not computationally (math) intensive; mostly scheduling memory accesses and control flow problems |
| Cacheline access behavior                 | Worker threads use ~95% of their full cacheline; control threads are complicated                | ~50% of cachelines evicted with ≤ 16 B used and ~75% with ≤ 32 B used                          |
| Control flow inter-arrival behavior       | Workers have long runs between branches; control threads are sufficiently predictable           | ~80% of branches occur inside dependent memory chains; extreme stress on pipelines and structures |
| Memory flow inter-arrival behavior        | ~65% of memory references back-to-back; excellent locality effectiveness                        | ~65% of memory references back-to-back; nested dependent pointer chains                        |
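
The cacheline row can be made concrete with a small counting exercise. The sketch below is a toy model under stated assumptions: 64-byte cachelines, 8-byte elements, and a fixed 520-byte stride standing in for pointer chasing. It ignores cache capacity and eviction, and it does not attempt to reproduce the ~50% and ~75% figures in the table. It only counts how many bytes of each touched line are actually used.

```python
LINE_BYTES = 64  # assumed cacheline size
ELEM_BYTES = 8   # assumed size of each accessed element


def bytes_used_per_line(addresses):
    """Map each touched cacheline to the number of distinct bytes accessed in it."""
    used = {}
    for addr in addresses:
        for byte in range(addr, addr + ELEM_BYTES):
            used.setdefault(byte // LINE_BYTES, set()).add(byte % LINE_BYTES)
    return {line: len(offsets) for line, offsets in used.items()}


def fraction_with_at_most(per_line, limit_bytes):
    """Fraction of touched lines in which at most limit_bytes were used."""
    return sum(1 for n in per_line.values() if n <= limit_bytes) / len(per_line)


# Dense-style access: 64 consecutive elements fill 8 lines completely.
sequential = [i * ELEM_BYTES for i in range(64)]
# Graph-style stand-in: 64 elements, each landing on a different line.
scattered = [i * 520 for i in range(64)]

seq_lines = bytes_used_per_line(sequential)
sct_lines = bytes_used_per_line(scattered)
seq_low = fraction_with_at_most(seq_lines, 16)
sct_low = fraction_with_at_most(sct_lines, 16)
```

In this toy case the sequential scan uses every byte it fetches, while the scattered pattern fetches eight times as many lines and uses 8 of 64 bytes in each.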




### WHERE DOES "BUSINESS AS USUAL" TAKE US . . . ?



#### **BACK TO THE BASICS – SYSTEM CO-DESIGN**




## **BUILDING A GRAPH ANALYTICS SYSTEM**

- Intel is developing a HIVE solution
  - 1,000x Perf/W gain target on 100+TB
- Locality will be problematic
  - Divide and Conquer has imbalance
  - Dynamic graphs warp partitions
- Focus on a scalable platform at all levels
  - Memory, Network, and Compute
  - Target O(seconds) for 100+TB kernels
- Support multiple representations equally
  - Sparse matrix operations & GraphBLAS
  - Meta-data laden graph abstractions
- Opportunities to engage and partner
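
To illustrate what supporting both representations means, the sketch below expresses the same one-hop frontier expansion twice: once as a sparse vector-matrix product over a Boolean (OR, AND) semiring, in the spirit of GraphBLAS, and once over a property graph whose edges carry metadata. It is plain Python, not the GraphBLAS API and not Intel's design. The edge data, the `kind` field, and the function names are hypothetical.

```python
# Hypothetical property graph: every edge carries a metadata field "kind".
EDGES = [
    {"src": "a", "dst": "b", "kind": "call"},
    {"src": "a", "dst": "c", "kind": "email"},
    {"src": "b", "dst": "d", "kind": "call"},
    {"src": "c", "dst": "d", "kind": "email"},
]


def to_sparse_matrix(edges):
    """Sparse Boolean adjacency matrix stored as {row: set(columns)}."""
    matrix = {}
    for e in edges:
        matrix.setdefault(e["src"], set()).add(e["dst"])
    return matrix


def vxm_or_and(vector, matrix):
    """Sparse vector-matrix product over (OR, AND): the one-hop neighbors."""
    result = set()
    for row in vector:
        result |= matrix.get(row, set())
    return result


def expand(edges, frontier, predicate=lambda e: True):
    """Same expansion on the property graph, optionally filtered on metadata."""
    return {e["dst"] for e in edges if e["src"] in frontier and predicate(e)}


A = to_sparse_matrix(EDGES)
hop1_matrix = vxm_or_and({"a"}, A)
hop1_graph = expand(EDGES, {"a"})
hop1_calls = expand(EDGES, {"a"}, lambda e: e["kind"] == "call")
hop2_matrix = vxm_or_and(hop1_matrix, A)
```

The two forms agree on the unfiltered traversal. The metadata filter is natural in the property-graph form, whereas the matrix form would need one matrix per edge kind or a mask.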



## **IMPACT AND OPPORTUNITIES**



- Open-source Graph primitives and tools
- Actively seeking workloads and datasets
- Co-design targets with customer input










# SHEKHAR BORKAR

#### SENIOR DIRECTOR OF TECHNOLOGY QUALCOMM

#### HONEYCOMB: A GRAPH ANALYTICS PROCESSOR WITH HIGH EFFICIENCY

SHEKHAR BORKAR (PI), MATT RADECIC (PM)

QUALCOMM INTELLIGENT SOLUTIONS, INC., JULY 2018

This research was developed with funding from the Defense Advanced Research Projects Agency (DARPA).


The views, opinions and/or findings expressed are those of the author and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.

#### **PROGRAM GOALS**

| Goal                  | Target                                          |
|-----------------------|-------------------------------------------------|
| Performance           | 10 GTEPS / Node                                 |
| Energy Efficiency     | 0.5 GTEPS / W (2 nJ / TE)                       |
| Processing Efficiency | 100x in Hardware, 10x in Software, 1,000x Total |
| Memory Efficiency     | 90% for both random and sequential accesses     |
| Demonstration         | 16 Nodes, 160 GTEPS system                      |
| Scalability           | Beyond 16 nodes to Tera TEPS                    |

#### SCALABLE 160 GTEPS SYSTEM, CONSUMING < 320 WATTS
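
The headline numbers follow from the goals table by simple arithmetic. The short check below uses only values stated on the slide. The variable names are illustrative, and TE is read as "traversed edge".

```python
GTEPS_PER_NODE = 10.0   # performance goal per node
NODES = 16              # demonstration system size
GTEPS_PER_WATT = 0.5    # energy-efficiency goal

# System throughput: 10 GTEPS/node * 16 nodes
system_gteps = GTEPS_PER_NODE * NODES

# Power budget implied by the efficiency goal
system_watts = system_gteps / GTEPS_PER_WATT
watts_per_node = system_watts / NODES

# Energy per traversed edge: 1 W / (0.5e9 TE/s) = 2e-9 J, i.e. 1 / (GTEPS per W) in nJ
nanojoules_per_edge = 1.0 / GTEPS_PER_WATT

# Processing-efficiency split: hardware gain times software gain
total_gain = 100 * 10
```

This gives 160 GTEPS at 320 W, or 20 W per node, with 2 nJ available for every traversed edge, covering compute, memory, and interconnect together.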

#### HIVE GOALS COMPARED TO GRAPH-500 (Q4-2016)







#### GOALS ARE EVEN HARDER CONSIDERING GRAPH-500 IS PROBABLY NOT A REPRESENTATIVE WORKLOAD

#### CHALLENGES: 160 GTEPS @ 320 W



#### DRAM Power (Watts) vs DRAM Bytes/Edge



#### Interconnect Power vs Interconnect Bytes/Edge
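
The relationship behind both plots can be sketched as power = edge rate × bytes moved per edge × energy per byte. The code below is a back-of-envelope model, not the presenters' analysis. The 10 pJ/byte figure is a placeholder assumption chosen for round numbers, not a measured DRAM or interconnect value.

```python
def data_movement_watts(gteps, bytes_per_edge, pj_per_byte):
    """Power spent moving data.

    GTEPS * bytes/edge * pJ/byte = 1e9 * 1e-12 W = 1e-3 W, hence the / 1000.
    """
    return gteps * bytes_per_edge * pj_per_byte / 1000


def max_bytes_per_edge(budget_watts, gteps, pj_per_byte):
    """Bytes each edge may move if the whole budget went to data movement."""
    return budget_watts * 1000 / (gteps * pj_per_byte)


ASSUMED_PJ_PER_BYTE = 10  # placeholder assumption, not from the slides

watts_at_100_bytes = data_movement_watts(160, 100, ASSUMED_PJ_PER_BYTE)
byte_budget = max_bytes_per_edge(320, 160, ASSUMED_PJ_PER_BYTE)
```

Under that assumed cost, moving 100 bytes per edge at 160 GTEPS would take 160 W, half of the 320 W budget, and the ceiling is 200 bytes per edge before anything is left for compute. Power scales linearly with bytes per edge.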

#### INVESTIGATION PRIORITIES: (1) INTRA-NODE DATA MOVEMENT, (2) INTERCONNECT, (3) COMPUTE

| Memory Subsystem                    | Interconnects                           |
|-------------------------------------|-----------------------------------------|
| Intelligent memory controller       | Hierarchical & heterogeneous            |
| Fine-grain data movement management | Simple, high-radix interconnects        |
| Optimized data layout               | Right balance of Electrical and Optical |

#### SYSTEM ARCHITECTURE



#### Scalable system




#### **Captured in Functional Simulator**



#### Fully functional and being used for analysis



| MIPS comparison | Xeon | HC   |
|-----------------|------|------|
| Kernel 1        | 1973 | 946  |
| Kernel 2        | 1478 | 724  |
| Kernel 3        | 2716 | 1107 |

0.5X performance @ 100X lower power
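
The "0.5X" figure can be checked against the table directly. The snippet below computes the HC-to-Xeon MIPS ratio per kernel. The last step takes the slide's "100X lower power" claim at face value to derive a performance-per-watt gain, which is an inference and not a number stated in the source.

```python
MIPS = {
    "Kernel 1": {"xeon": 1973, "hc": 946},
    "Kernel 2": {"xeon": 1478, "hc": 724},
    "Kernel 3": {"xeon": 2716, "hc": 1107},
}
POWER_REDUCTION = 100  # "100X lower power", as stated on the slide

# Relative performance of HC versus Xeon for each kernel
ratios = {name: row["hc"] / row["xeon"] for name, row in MIPS.items()}
mean_ratio = sum(ratios.values()) / len(ratios)

# Implied performance-per-watt gain if power really is 100x lower
perf_per_watt_gain = {name: r * POWER_REDUCTION for name, r in ratios.items()}

for name, r in ratios.items():
    print(f"{name}: {r:.2f}x performance, ~{perf_per_watt_gain[name]:.0f}x perf/W")
```

The ratios come out at roughly 0.48, 0.49, and 0.41, so "0.5X" is a slight round-up, and the implied efficiency gain is in the range of about 41x to 49x.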

#### SENSITIVITY TO PROCESSOR FREQUENCY



#### WORKLOADS ARE NOT VERY SENSITIVE TO PROCESSOR FREQUENCY

#### **SENSITIVITY TO DATA-MOVEMENT PERFORMANCE**



#### WORKLOADS ARE MORE SENSITIVE TO DATA-MOVEMENT PERFORMANCE

### **MULTI-THREADED WORKLOAD BEHAVIOR (NODE)**



- Single thread:
  - Minimal performance change with BW
  - BW overprovisioned by the platform
  - Same behavior with 50% higher latency

### SIMULATED PERFORMANCE, ENERGY, POWER




























- **DARPA-hard** goals, yet achievable!
  - Simulation-based workload analysis shows data movement dominates
    - Compute contributes comparatively little

## Therefore...

#### SYSTEM DESIGN MUST BE OPTIMIZED FOR DATA MOVEMENT!


