Sudnya Padalikar
San Jose, CA
**********@*****.***
OBJECTIVE
To shape the development of a new vertically integrated product using disruptive technology from
the fields of computer vision, computational photography, machine learning, and high performance
parallel computing.
RELEVANT PROJECTS
Multi-Modal Gesture Recognition
• Performs 20-class gesture recognition using a training set of 7,754 labeled RGB-D images.
• Classification is performed using a multi-layer neural network.
• Feature selection is performed automatically using a sparse autoencoder on unlabeled video.
Neural Network Classification Library
• C++ implementation of a neural network classifier for arbitrary input data.
• GPU accelerated, scales to millions of input features.
• Applied to image gesture recognition.
Discrete and Linear Optimization Library
• From-scratch C++ implementation of a constrained linear and discrete optimization solver for
general-purpose NP-hard optimization problems.
• The high-level design uses simulated annealing, greedy heuristics, local search, and tabu search.
• Applied to neural network training and vehicle routing.
EDUCATION
Georgia Institute of Technology, Atlanta, GA
• MS, Computer Science – August 2007 – December 2008
Vishwakarma Institute of Technology, Pune University, India
• BE, Computer Engineering – August 2002 – July 2006
EXPERIENCE
NVIDIA: Santa Clara, CA
GPU Streaming Multiprocessor Architect – February 2012 – Present
• Modeled new architecture features in the processor performance simulator.
• Evaluated new instructions by modeling them in the simulator and writing directed performance
tests; this implementation was also used to evaluate the RTL design and identify performance
bottlenecks.
• Implemented hardware support for precise exceptions in the simulator.
• Debugged complex workloads (e.g. CUDA Nested Parallelism uScheduler) on a full-chip
simulator involving multiple interacting units.
NVIDIA: Santa Clara, CA
GPU Architecture Engineer – March 2010 – February 2012
• Ported an architecture simulation framework to enable evaluation of next-generation GPUs.
• Wrote bring-up tests for a new feature in the next-generation GPU.
• Developed a tool to translate application traces into unit tests that run on RTL.
• Worked on an instrumentation tool that captures GPU processor state for performance studies.
NFinTes: Marietta, GA
Research Intern – April 2009 – February 2010
• Wrote a discrete event simulator to evaluate GPU remote procedure calls among nodes.
• The simulator was written in C++ with MPI.
• Used CUDA applications as workloads to perform the evaluation.
Qualcomm Inc: San Diego, CA
Summer Engineering Intern – May 2008 – August 2008
• Integrated layer 2 software on a WiMax base station.
• Implemented a framework for performance and functional evaluation of WiMax base station
software on a QDSP6 processor.
IBM: Pune, India
Associate Systems Engineer – August 2006 – July 2007
• Profiled and tested Proventia integrated security appliances.
G.S. Labs: Pune, India
Engineering Project Intern – June 2005 – July 2006
• Designed and implemented a voice bridge tying together Skype (P2P VoIP network) and
Asterisk (open source IP-PBX).
SELECTED PUBLICATIONS
Sudnya Padalikar and Gregory Diamos. "GPU-RPC: Exploiting The Latency Tolerance of CUDA
Applications." In NVIDIA Research Summit, San Jose, California, USA, September 2009.
MAJOR PROJECTS
A Massively Parallel Simulator - Archaeopteryx
• Built a massively parallel, CUDA-based simulator to simulate future parallel processor
architectures on current GPUs.
• Explicit separation between functional and timing model.
• Achieves high performance by exploiting data-structure locality, hierarchical synchronization,
and minimal per-thread state.
Parallel Discrete Event Simulator
• The simulator is partitioned into models which describe the system being simulated and a
kernel that manages events and time synchronization.
• Detailed timing models for network links, Ethernet devices, and layer 3 and 4 protocols.
• Sequential and parallel implementations of the simulator kernel, each with identical interfaces;
the parallel version implements the Chandy-Misra-Bryant time synchronization algorithm
using MPI.
SKILLS
Languages:
• C++, Python, Java, Matlab, C, CUDA, PHP, Perl, Intel x86 assembly, ARM assembly,
NVIDIA GPU Assembly, Scripting (Csh, Bash).
Libraries:
• STL, Boost, MPI, Pthreads, OpenMP, BLAS, LAPACK, Numpy.
REFERENCES
• Available upon request.