Real-time Process Network Sonar Beamformer
Gregory E. Allen Brian L. Evans
Applied Research Laboratories Dept. Electrical and Computer
Engineering
******@*****.******.*** ******@***.******.***
The University of Texas at Austin, Austin, TX 78712-1084
http://signal.ece.utexas.edu/
P o s te r. f m Copyright 1998, The University of Texas.
All rights reserved.
Introduction
Problem: Beamforming is computationally intensive
Example: High-resolution sonar beamformers (GFLOPS)
Generation 1: expensive custom digital hardware (large state machines)
Generation 2: custom integration of programmable digital signal processors
on commercial-off-the-shelf hardware (e.g. 120 DSPs in a VME rack)
Objective: Unix workstation beamformers
Analysis: evaluate performance of beamforming kernels and systems
Modeling: capture parallelism, guarantee determinate bounded execution
Implementation: use portable, scalable software to achieve real-time
performance on commodity hardware and lower development costs
Solution: Real-time beamforming on workstations
Analysis: optimize kernels and pro le beamfomers to measure scalability
Modeling: Process Networks
Implementation: Real-time POSIX threads using C++ on symmetric
multiprocessor UltraSPARC-II workstation with native signal processing
Time-Domain Beamforming
Delay and sum weighted sensor outputs
Geometrically project the sensor elements onto a line to
compute the time delays
Projection for a beam pointing 20 off axis
M
1 i xi(t i)
20
b(t) =
i= 15
y position, inches
b(t) beam outputi 10
ith sensor output 20
xi(t)
5
i ith sensor delay
i ith sensor weight 0
sensor element
projected sensor element
-5
-20 -15 -10 -5 0 5 10 15 20
x position, inches
Interpolation Beamforming
Quantized time delays perturb beam pattern
Sample at just above the Nyquist rate
Interpolate to obtain desired time-delay resolution
Sample at Interpolate up to Time delay
1
interval interval = /L at interval
z-N1
A/D Interpolate
M
b[n]
z-NM
A/D Interpolate
Sensor Array Digital Interpolation Beamformer
Kernel implementation on UltraSPARC-II
Highly optimized C++ (loop unrolling and SPARCompiler5.0EA)
Currntly operating at 60% of peak, which is 2 FLOPs per cycle
Vertical Beamforming
stave
Multiple vertical transducers for every
horizontal position
Each vertical sensor column is combined into a stave
No time delay or interpolation is required
Staves are calculated by a simple integer dot product
Integer-to- oat conversion must be performed
Output data must be interleaved
Kernel implementation on UltraSPARC-II
Native signal processing with Visual Instruction Set (VIS)
Software data prefetching to hide memory latency and keep the pipeline full
Formal Design Methodology
The Process Network model [Kahn, 1974]
Superset of data ow models of computation
Captures concurrency and parallelism
Provides correctness
Guarantees determinate execution of the program
A program is represented as a directed graph
Each node is an independent process
P
A B Each edge is a one-way queue of data
Blocking reads, non-blocking writes, in nite queues
Scheduling for bounded queues is possible [Parks, 1995]
Blocking reads and writes
Dynamically increase queue capacities to prevent arti cial deadlock
Fits the thread model of concurrent programming
Process Network Implementation
Each node corresponds to a thread
Implemented in C++ using POSIX Pthreads
Low-overhead, high-performance, scalable
Granularity larger than a thread context switch (~10 us)
Symmetric multiprocessing operating system dynamically schedules threads
Ef cient utilization of multiple processors
Optimize queues for high-throughput signal processing
Nodes operate directly on queue memory, avoiding copying
Queues use mirroring to keep data contiguous
Mirrored data
Queue data region Mirror region
Compensates for the lack of circular address buffers
Queues trade-off memory usage for overhead
Virtual memory manager maintains data circularity
System Implementation
Vertical beamformer forms 3 sets of 80 staves from 10
vertical elements per stave
Each horizontal beamformer forms 61 beams from the 80
staves, using a two-point interpolation lter
4 GFLOPS total computation
Digital
sensor Fan 0
Interpolation
data Element data Stave data Beams
Beamformer
sensor
Three-fan Digital
data
Vertical Fan 1
Interpolation
Beamformer Beams
Beamformer
sensor
data
Digital
500 MFLOPS
sensor Fan 2
Interpolation
data Beams
Beamformer
40 MB/sec
each
1200
MFLOPS
each
Performance Results
100 trial mean execution time for 2.6 seconds of data
Sun Ultra Enterprise 4000 with eight 336-MHz
UltraSPARC-IIs, 2 Gb RAM, running Solaris 2.6
Implementation Time (s) MFLOPS Mbytes Process network is 7%
faster than thread pool,
thread pool 3.607 3024.8 832
overhead is small
process network 3.354 3252.8 654
Process network uses 20%
Execution time and MFLOPS vs CPUs
less memory with lower
35 3500
latency
30 3000
Process network scalability is
nearly linear
25 2500
Will continue to scale with
MFLOPS
20 2000
seconds
additional CPUs
15 1500
Real-time performance
achievable with 12 CPUs
10 1000
5 500
0 0
1 2 3 4 5 6 7 8
CPUs
Conclusion
Third generation beamformers: Workstation hardware
Commodity hardware saves development/manufacturing costs
Multiprocessor servers, native signal processing
Upgradable hardware, Moore s Law
Software model: Process Networks
Captures parallelism, guarantees determinate bounded execution
Portable, reusable, scalable C++ code
High-performance, low overhead POSIX threads
Symmetric multiprocessing operating system
The example 4-GFLOPS 1-Gb 3-D sonar beamforming
system does execute in real time using a Sun Ultra
Enterprise 4000 server with twelve 336 MHz UltraSPARC-
II CPUs with 14.5% to spare
http://www.ece.utexas.edu/~allen/Beamforming/
Copyright © 1998, The University of Texas.