Principal Software Engineer
OCI is Oracle's next-generation cloud platform, built for the most demanding enterprise workloads. The AI Platform, Services & Solutions organization within OCI is building a robust ecosystem to support the end-to-end lifecycle of AI and machine learning workloads. From GPU infrastructure and training pipelines to model serving and deployment tools, we empower teams across Oracle and our customers to build and deploy AI at scale. We are looking for a Principal Software Engineer to join our growing team and help shape the future of AI infrastructure and services at Oracle. You will work on critical components of OCI's AI platform, including high-scale GPU cluster management, self-service ML infrastructure, and model serving systems. Collaborate with top engineers and researchers in a fast-paced, innovation-driven environment. Grow your career in a supportive, mission-driven team building the future of enterprise AI.
What You'll Do
Build cloud services on top of the modern Infrastructure as a Service (IaaS) building blocks at OCI. Design and build distributed, scalable, fault-tolerant software systems. Participate in the entire software lifecycle: development, testing, CI, and production operations. Design and lead software projects without needing significant guidance, and guide, mentor, and coach junior engineers. Balance product feature development with production operational concerns such as writing runbooks, ops automation, structured logging, and instrumentation for metrics and events. Leverage internal tooling at OCI to develop, build, deploy, and troubleshoot software. Participate in the team's on-call rotation for the service.
Qualifications
8+ years of experience shipping scalable, cloud-native distributed systems. Experience building multi-tenant Kubernetes environments and security isolation. Experience building Kubernetes controllers, operators, and CRDs to automate lifecycle management of AI/ML workloads. Experience implementing advanced optimizations: distributed and disaggregated inference serving, multi-node inference, and KV-cache reuse. Experience building intelligent request routing and adaptive scheduling to maximize GPU utilization. Experience with inference solutions such as Nvidia Dynamo, vLLM, and Ray Serve. Experience with production operations and best practices for putting quality code in production and troubleshooting issues when they arise. Able to communicate technical ideas effectively, both verbally and in writing (technical proposals, design specs, architecture diagrams, and presentations). BS in Computer Science, or equivalent experience. Experience in Go, Java, Python.
Preferred Qualifications
MS in Computer Science. Experience building control plane/data plane solutions for cloud-native companies. Experience diagnosing, troubleshooting, and resolving performance issues in complex environments. Deep understanding of Unix-like operating systems. Production experience with cloud and ML technologies. Experience with generative AI, LLMs, and machine learning.
Disclaimer: Certain US customer or client-facing roles may be required to comply with applicable requirements, such as immunization and occupational health mandates. Range and benefit information provided in this posting is specific to the stated locations only. US: Hiring range in USD: $96,800 to $223,400 per annum. May be eligible for bonus and equity. Oracle maintains broad salary ranges for its roles in order to account for variations in knowledge, skills, experience, market conditions and locations, as well as to reflect Oracle's differing products, industries and lines of business. Candidates are typically placed into the range based on the preceding factors as well as internal peer equity. Oracle US offers a comprehensive benefits package which includes the following: Medical, dental, and vision insurance, including expert medical opinion; Short term disability and long term disability; Life insurance and AD&D; Supplemental life insurance (Employee/Spouse/Child); Health care and dependent care Flexible Spending Accounts; Pre-tax commuter and parking benefits; 401(k) Savings and Investment Plan with company match; Paid time off: Flexible Vacation is provided to all eligible employees assigned to a salaried (non-overtime eligible) position. Accrued Vacation is provided to all other employees eligible for vacation benefits. For employees working at least 35 hours per week, the vacation accrual rate is 13 days annually for the first three years of employment and 18 days annually for subsequent years of employment. Vacation accrual is prorated for employees working between 20 and 34 hours per week. Employees working fewer than 20 hours per week are not eligible for vacation; 11 paid holidays; Paid sick leave: 72 hours of paid sick leave upon date of hire, refreshed each calendar year. Unused balance will carry over each year up to a maximum cap of 112 hours; Paid parental leave; Adoption assistance; Employee Stock Purchase Plan; Financial planning and group legal; Voluntary benefits including auto, homeowner and pet insurance.