Christian Engelmann, PhD
Web: www.csm.ornl.gov/~engelman  Email: abqk1g@r.postjobfree.com
Computer Science Research Group
P.O. Box 2008, Oak Ridge, TN 37831-6173, USA
Computer Science and Mathematics Division
T +1-865-***-**** v +1-865-***-****
Oak Ridge National Laboratory
Dr. Christian Engelmann has 10+ years of experience in software research and development for next-generation
extreme-scale high-performance computing (HPC) systems, with a strong research funding and publication
record. In collaboration with other U.S. Department of Energy laboratories and universities worldwide, his
research addresses computer science challenges in HPC system software, such as dependability, scalability, and
portability.
Accomplishments
7 research grants (1 as lead PI); 5 journal articles; 52 conference papers;
8 co-advised Master's theses; 10 journal article and book proposal reviews; 51 committees at 27 conference series;
31 invited talks and seminars; 6 conference posters; 9 conference booth exhibitions
Skills
principal investigator for federally funded software research and development; MSc thesis co-advisor; small team lead;
project- and milestone-oriented teamwork; publishing research and development results
Expertise
high-performance computing; supercomputing; cluster computing; parallel and distributed computing; resilience;
dependability; reliability; high availability; fault tolerance; fault prediction; fault injection; system monitoring;
virtualization; middleware; parallel discrete event simulation; programming in MPI, C, C++, Java, Perl, Fortran, shell and PHP
Education
2008 : PhD in Computer Science, University of Reading, UK
2001 : MSc in Computer Science, University of Reading, UK
2001 : MSc in Computer Systems Engineering, University of Applied Sciences Berlin, Germany
Experience
2009 : Research and Development Staff, Oak Ridge National Laboratory, USA
HPC checkpoint storage virtualization and software dual-modular redundancy in HPC
HPC resiliency system software for monitoring, prediction and proactive fault tolerance
Light-weight simulation of future HPC architectures at large scale (10,000,000 cores)
2004-2009 : Research and Development Associate, Oak Ridge National Laboratory, USA
Fault tolerance support for MPI: Scalable membership, job pause and process migration
99.9997% high availability for HPC head/service node services: Torque and PVFS MDS
Virtual system environments for plug-and-play supercomputing with HPC hypervisors
Enhancing application scientist productivity via a common view across HPC platforms
2004-2008 : Collaborating Research Assistant, University of Reading, UK
PhD thesis research: Symmetric active/active high availability for HPC system services
2001-2004 : Post-Master's Research Associate, Oak Ridge National Laboratory, USA
Light-weight simulation of HPC architectures at large scale (1,000,000 cores)
Harness Distributed Virtual Machine: Pluggable, lightweight, adaptive and fault tolerant
2000-2001 : Software Developer, Oak Ridge National Laboratory, USA
MSc thesis research: Distributed peer-to-peer control for Harness
1998-1999 : Software Developer, Hewlett-Packard, Germany
GUI server (model-view-controller architecture) for a mobile patient monitoring system
Select Publications
[1] S. L. Scott, G. R. Vallée, T. Naughton, A. Tikotekar, C. Engelmann, and H. H. Ong. System-level virtualization
research at Oak Ridge National Laboratory. Future Generation Computer Systems (FGCS), 26(3):304-307, 2010.
[2] M. Li, S. S. Vazhkudai, A. R. Butt, F. Meng, X. Ma, Y. Kim, C. Engelmann, and G. Shipman. Functional partitioning
to optimize end-to-end performance on many-core architectures. In 23rd IEEE/ACM International Conference on
High Performance Computing, Networking, Storage and Analysis (SC), pages 1-12, 2010.
[3] X. He, L. Ou, C. Engelmann, X. Chen, and S. L. Scott. Symmetric active/active metadata service for high availability
parallel file systems. Journal of Parallel and Distributed Computing (JPDC), 69(12):961-973, 2009.
[4] C. Wang, F. Mueller, C. Engelmann, and S. L. Scott. Proactive process-level live migration in HPC environments.
In 21st IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis
(SC), pages 1-12, 2008.
[5] A. B. Nagarajan, F. Mueller, C. Engelmann, and S. L. Scott. Proactive fault tolerance for HPC with Xen
virtualization. In 21st ACM International Conference on Supercomputing (ICS), pages 23-32, 2007.
[6] C. Wang, F. Mueller, C. Engelmann, and S. L. Scott. A job pause service under LAM/MPI+BLCR for transparent
fault tolerance. In 21st IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 1-10,
2007.
[7] X. He, L. Ou, M. J. Kosa, S. L. Scott, and C. Engelmann. A unified multiple-level cache for high performance
cluster storage systems. International Journal of High Performance Computing and Networking (IJHPCN), 5(1-2):
97-109, 2007.
[8] K. Uhlemann, C. Engelmann, and S. L. Scott. JOSHUA: Symmetric active/active replication for highly available
HPC job and resource management. In 8th IEEE International Conference on Cluster Computing (Cluster), pages
1-10, 2006.
[9] C. Engelmann, S. L. Scott, C. Leangsuksun, and X. He. Symmetric active/active high availability for
high-performance computing system services. Journal of Computers (JCP), 1(8):43-54, 2006.
[10] J. Varma, C. Wang, F. Mueller, C. Engelmann, and S. L. Scott. Scalable, fault-tolerant membership for MPI tasks
on HPC systems. In 20th ACM International Conference on Supercomputing (ICS), pages 219-228, 2006.
[11] C. Engelmann, S. L. Scott, D. E. Bernholdt, N. R. Gottumukkala, C. Leangsuksun, J. Varma, C. Wang, F. Mueller,
A. G. Shet, and P. Sadayappan. MOLAR: Adaptive runtime support for high-end computing operating and runtime
systems. ACM SIGOPS Operating Systems Review (OSR), 40(2):63-72, 2006.
Select Synergistic Activities
2009 : Invited talk and panel on Metrics and Modeling at the National HPC Workshop on Resilience
2009 : Panel on Resiliency at the Workshop on Distributed Supercomputing (SOS)
2009 : Technical program co-chair for the Workshop on Resiliency in High-Performance Computing (Resilience)
at the Intl. Symposium on High Performance Distributed Computing (HPDC)
2007-2010 : Technical program committee for the Intl. Conference on Availability, Reliability and Security (ARES)
2006-2010 : Technical program co-chair for the HPC Resiliency Summit: Workshop on Resiliency for Petascale HPC at
the Los Alamos Computer Science Symposium (LACSS), and for the High Availability and Performance
Workshop (HAPCW) at the Los Alamos Computer Science Institute (LACSI) Symposium
Memberships
ACM, ACM SIGOPS, IEEE, IEEE CS, IEEE CS TCSC/TCPP/TCDP/TCFT
Advisors
Prof. V. N. Alexandrov, University of Reading, UK; G. A. Geist, Oak Ridge National Laboratory, USA; S. L. Scott, Oak
Ridge National Laboratory, USA; Prof. U. Metzler, University of Applied Sciences Berlin, Germany
Advisees
R. Baumann, EADS, Germany; S. Böhm, Oak Ridge National Laboratory; I. Jones, University of Reading, UK; F. Lauer,
University of Tennessee, Knoxville; A. Litvinova, University of Reading, UK; B. Könning, Technical University of
Berlin, Germany; K. Uhlemann, Coca Cola, Germany; M. Weber, Technical University of Dresden, Germany
Citizenship/U.S. Immigration Status
Germany/U.S. permanent resident (Green Card)
Full Curriculum Vitae/References
Available upon request