Resume

Computer Science Software

Location:

Oak Ridge, TN

Posted:

February 15, 2013

Resume:

Christian Engelmann, PhD

Computer Science Research Group

Computer Science and Mathematics Division

Oak Ridge National Laboratory

Address: P.O. Box 2008, Oak Ridge, TN 37831-6173, USA

Phone: +1-865-***-****  Fax: +1-865-***-****

Web: www.csm.ornl.gov/~engelman  Email: abqk1g@r.postjobfree.com

Dr. Christian Engelmann has 10+ years of experience in software research and development for next-generation

extreme-scale high-performance computing (HPC) systems with a strong research funding and publication

record. In collaboration with other U.S. Department of Energy laboratories and universities worldwide, his

research aims at computer science challenges in HPC system software, such as dependability, scalability, and

portability.

Accomplishments

7 research grants (1 as lead PI); 5 journal articles; 52 conference papers;

8 co-advised Master's theses; 10 journal article and book proposal reviews; 51 committees at 27 conference series;

31 invited talks and seminars; 6 conference posters; 9 conference booth exhibitions

Skills

principal investigator for federally funded software research and development; MSc thesis co-advisor; small team lead;

project- and milestone-oriented teamwork; publishing research and development results

Expertise

high-performance computing; supercomputing; cluster computing; parallel and distributed computing; resilience;

dependability; reliability; high availability; fault tolerance; fault prediction; fault injection; system monitoring;

virtualization; middleware; parallel discrete event simulation; programming in MPI, C, C++, Java, Perl, Fortran, shell, and PHP

Education

2008 : PhD in Computer Science, University of Reading, UK

2001 : MSc in Computer Science, University of Reading, UK

2001 : MSc in Computer Systems Engineering, University of Applied Sciences Berlin, Germany

Experience

2009 : Research and Development Staff, Oak Ridge National Laboratory, USA

HPC checkpoint storage virtualization and software dual-modular redundancy in HPC

HPC resiliency system software for monitoring, prediction and proactive fault tolerance

Light-weight simulation of future HPC architectures at large scale (10,000,000 cores)

2004-2009 : Research and Development Associate, Oak Ridge National Laboratory, USA

Fault tolerance support for MPI: Scalable membership, job pause and process migration

99.9997% high availability for HPC head/service node services: Torque and PVFS MDS

Virtual system environments for plug-and-play supercomputing with HPC hypervisors

Enhancing application scientist productivity via a common view across HPC platforms

2004-2008 : Collaborating Research Assistant, University of Reading, UK

PhD thesis research: Symmetric active/active high availability for HPC system services

2001-2004 : Post-Master's Research Associate, Oak Ridge National Laboratory, USA

Light-weight simulation of HPC architectures at large scale (1,000,000 cores)

Harness Distributed Virtual Machine: Pluggable, lightweight, adaptive and fault tolerant

2000-2001 : Software Developer, Oak Ridge National Laboratory, USA

MSc thesis research: Distributed peer-to-peer control for Harness

1998-1999 : Software Developer, Hewlett-Packard, Germany

GUI server (model-view-controller architecture) for a mobile patient monitoring system

Select Publications

[1] S. L. Scott, G. R. Vallée, T. Naughton, A. Tikotekar, C. Engelmann, and H. H. Ong. System-level virtualization

research at Oak Ridge National Laboratory. Future Generation Computer Systems (FGCS), 26(3):304-307, 2010.

[2] M. Li, S. S. Vazhkudai, A. R. Butt, F. Meng, X. Ma, Y. Kim, C. Engelmann, and G. Shipman. Functional partitioning

to optimize end-to-end performance on many-core architectures. In 23rd IEEE/ACM International Conference on

High Performance Computing, Networking, Storage and Analysis (SC), pages 1-12, 2010.

[3] X. He, L. Ou, C. Engelmann, X. Chen, and S. L. Scott. Symmetric active/active metadata service for high availability

parallel file systems. Journal of Parallel and Distributed Computing (JPDC), 69(12):961-973, 2009.

[4] C. Wang, F. Mueller, C. Engelmann, and S. L. Scott. Proactive process-level live migration in HPC environments.

In 21st IEEE/ACM International Conference on High Performance Computing, Networking, Storage and Analysis

(SC), pages 1-12, 2008.

[5] A. B. Nagarajan, F. Mueller, C. Engelmann, and S. L. Scott. Proactive fault tolerance for HPC with Xen

virtualization. In 21st ACM International Conference on Supercomputing (ICS), pages 23-32, 2007.

[6] C. Wang, F. Mueller, C. Engelmann, and S. L. Scott. A job pause service under LAM/MPI+BLCR for transparent

fault tolerance. In 21st IEEE International Parallel and Distributed Processing Symposium (IPDPS), pages 1-10,

2007.

[7] X. He, L. Ou, M. J. Kosa, S. L. Scott, and C. Engelmann. A unified multiple-level cache for high performance

cluster storage systems. International Journal of High Performance Computing and Networking (IJHPCN), 5(1-2):

97-109, 2007.

[8] K. Uhlemann, C. Engelmann, and S. L. Scott. JOSHUA: Symmetric active/active replication for highly available

HPC job and resource management. In 8th IEEE International Conference on Cluster Computing (Cluster), pages

1-10, 2006.

[9] C. Engelmann, S. L. Scott, C. Leangsuksun, and X. He. Symmetric active/active high availability for high-

performance computing system services. Journal of Computers (JCP), 1(8):43-54, 2006.

[10] J. Varma, C. Wang, F. Mueller, C. Engelmann, and S. L. Scott. Scalable, fault-tolerant membership for MPI tasks

on HPC systems. In 20th ACM International Conference on Supercomputing (ICS), pages 219-228, 2006.

[11] C. Engelmann, S. L. Scott, D. E. Bernholdt, N. R. Gottumukkala, C. Leangsuksun, J. Varma, C. Wang, F. Mueller,

A. G. Shet, and P. Sadayappan. MOLAR: Adaptive runtime support for high-end computing operating and runtime

systems. ACM SIGOPS Operating Systems Review (OSR), 40(2):63-72, 2006.

Select Synergistic Activities

2009 : Invited talk and panel on Metrics and Modeling at the National HPC Workshop on Resilience

2009 : Panel on Resiliency at the Workshop on Distributed Supercomputing (SOS)

2009 : Technical program co-chair for the Workshop on Resiliency in High-Performance Computing (Resilience)

at the Intl. Symposium on High Performance Distributed Computing (HPDC)

2007-2010 : Technical program committee for the Intl. Conference on Availability, Reliability and Security (ARES)

2006-2010 : Technical program co-chair for the HPC Resiliency Summit: Workshop on Resiliency for Petascale HPC at

the Los Alamos Computer Science Symposium (LACSS), and for the High Availability and Performance

Workshop (HAPCW) at the Los Alamos Computer Science Institute (LACSI) Symposium

Memberships

ACM, ACM SIGOPS, IEEE, IEEE CS, IEEE CS TCSC/TCPP/TCDP/TCFT

Advisors

Prof. V. N. Alexandrov, University of Reading, UK; G. A. Geist, Oak Ridge National Laboratory, USA; S. L. Scott, Oak

Ridge National Laboratory, USA; Prof. U. Metzler, University of Applied Sciences Berlin, Germany

Advisees

R. Baumann, EADS, Germany; S. Böhm, Oak Ridge National Laboratory; I. Jones, University of Reading, UK; F. Lauer,

University of Tennessee, Knoxville; A. Litvinova, University of Reading, UK; B. Könning, Technical University of

Berlin, Germany; K. Uhlemann, Coca Cola, Germany; M. Weber, Technical University of Dresden, Germany

Citizenship/U.S. Immigration Status

Germany/U.S. permanent resident (Green Card)

Full Curriculum Vitae/References

Available upon request
