DevOps Engineer
Max Vendor Rate: $84
Location: Remote
JD:
We are seeking a highly skilled and experienced Principal Software Engineer focused on Agentic AI and DevOps. The ideal candidate will architect and deliver agentic microservices and platform capabilities, lead cloud-native DevOps at scale, and partner with organizational leaders to communicate strategy, status, and results. Deep hands-on expertise with Azure, Kubernetes, CI/CD, infrastructure as code, and LLM/agent frameworks (LangChain/LangSmith/OpenAI/LiteLLM) is essential. Experience with dataflow orchestration (Apache NiFi), enterprise integrations (ServiceNow/Snowflake/Power BI/SharePoint), and production-grade observability is highly desirable.
What You'll Do:
• Architect, build, and operate agentic AI services and microservices leveraging LangChain, LangSmith, OpenAI/Azure OpenAI, and LiteLLM; implement tool-use orchestration, evaluation, and guardrails.
• Design, build, and maintain CI/CD pipelines using Azure DevOps (ADO) YAML and GitHub Actions; enforce trunk-based workflows, quality gates, progressive delivery, and automated rollbacks.
• Stand up and manage Azure infrastructure (AKS, Service Bus, Event Hubs, Storage Accounts, Key Vault, Bastion); codify environments with Terraform; implement secure networking, secrets, and RBAC.
• Containerize and ship services with Docker/Buildah; operate Kubernetes with CNI networking and Linkerd service mesh; implement canary/blue-green strategies and autoscaling.
• Create and operate Apache NiFi dataflows; deploy and manage NiFi clusters on AKS with VM Scale Sets, enabling resilient, scalable ingestion and orchestration.
• Implement enterprise-grade observability and logging: ELK/EFK (Elasticsearch, Fluentd/Fluent Bit, Kibana), Prometheus metrics, Azure Dashboards, and KQL-based alerting.
• Engineer data and analytics integrations: Azure Databricks, PostgreSQL, Snowflake; operationalize Power BI, SharePoint, and Jupyter-based workflows.
• Build robust platform and app integrations: ServiceNow APIs, REST APIs, SMTP/IMAP/POP email automations; configure and manage NGINX/HAProxy load balancers.
• Lead incident response, root-cause analysis, and postmortems; continuously improve reliability, performance, security, and cost.
• Mentor teams, drive architectural runway, and communicate plans, trade-offs, and outcomes to stakeholders and leadership.
Key Qualifications / Experience Required:
DevOps Experience
• Expert-level hands-on DevOps across Azure and Kubernetes: CI/CD, Git workflows, infrastructure as code, automated testing, monitoring, and secure deployment.
• Proficiency with Azure DevOps (ADO) YAML pipelines and GitHub Actions; experience optimizing pipelines for cloud-native systems.
• Strong Kubernetes operations including CNI networking and service mesh (Linkerd); container build and supply chain (Docker, Buildah).
• Observability at scale using ELK/EFK, Prometheus, Fluentd/Fluent Bit, and Azure Monitor dashboards and alerting (KQL).
Automation Skills
• Deep automation with PowerShell, Bash, and Python to eliminate toil across build, release, environment, and operational workflows.
• Infrastructure as Code expertise with Terraform (Azure resources: AKS, Service Bus, Event Hubs, Storage, Key Vault, Bastion).
• Proven track record of reducing manual intervention, increasing repeatability, and lowering MTTR through automation.
Agentic AI Experience
• Practical production experience delivering agentic AI solutions (task orchestration, tool use, planning, retrieval, and evaluation).
• Hands-on with LangChain, LangSmith (tracing/eval), OpenAI/Azure OpenAI, and LiteLLM integration; familiarity with prompt engineering, safety/guardrails, and LLM observability (e.g., Arize).
• Experience operationalizing AI services within DevOps pipelines and platform governance.
Technical Proficiency
• Apache NiFi expertise: authoring and governing dataflows; deploying and scaling NiFi clusters on AKS with VM Scale Sets.
• Azure services: AKS, Service Bus, Event Hubs (setup and integration), Storage Accounts (setup and integration), Key Vault, Bastion, Azure Dashboards & Kusto Query Language (KQL).
• Data/analytics: Azure Databricks, PostgreSQL, Snowflake; Power BI and SharePoint integrations; Jupyter Notebook workflows.
• Networking fundamentals: DHCP/DNS; load balancer configuration and operations (NGINX, HAProxy); Kubernetes ingress best practices.
• Messaging and email protocols: SMTP, IMAP/POP.
• Microservices and app frameworks: Python and Node.js microservices (REST APIs), Electron build and packaging.
Required Technical Skills
• Windows PowerShell; Linux/Unix administration; Bash and Python.
• Azure Cloud (architecture, security, cost, RBAC); Azure DevOps (ADO) with YAML; GitHub Actions.
• Docker and Buildah; Kubernetes (CNI), Linkerd; ELK/EFK, Prometheus, Fluentd/Fluent Bit.
• Apache NiFi flow development and clustered operations on Kubernetes (AKS) with VM Scale Sets.
• Azure Databricks; PostgreSQL; Snowflake; REST APIs; ServiceNow APIs; Power BI; SharePoint.
• Azure Service Bus, Azure Event Hubs, Storage Accounts, Key Vault, Bastion.
• Jira; Jupyter Notebook; Azure Dashboards and KQL; SMTP/IMAP/POP.
• Python and Node.js microservice architecture; Electron build and packaging.
Project Management Skills
• Plan, schedule, and coordinate multi-team deliveries and releases; manage dependencies, risks, and change.
• Drive execution across platform, app, data, and AI workstreams with clear milestones and success criteria.
• Establish SLOs/SLAs and error budgets; align roadmaps to business priorities.
Communication and Interpersonal Skills
• Communicate architectural decisions, roadmaps, and trade-offs to technical and executive audiences.
• Lead cross-functional ceremonies; produce clear runbooks, architecture docs, and dashboards.
• Foster collaboration across engineering, product, security, and operations.
Analytical and Problem-Solving Abilities
• Rapid diagnosis and resolution of complex production issues; strong RCA and remediation planning.
• Attention to detail in reliability, security, performance, and cost optimization.
Adaptability and Continuous Learning
• Track and adopt evolving best practices in cloud, containers, DevOps, and agentic AI.
• Champion continuous improvement in engineering excellence and platform governance.
Experience and Education
• Typically requires 10-15+ years of experience in software engineering, DevOps/SRE, or platform engineering with principal-level impact.
• Bachelor's degree in Computer Science, Information Technology, or related field preferred (or equivalent experience).
Secondary Skills and Experience (Desired):
Design and Development
• Define and design subsystems and interfaces; allocate responsibilities across services and platforms.
• Translate non-functional requirements (security, reliability, scalability) into concrete designs.
Technical Enablement
• Provide technical enablement for components and subsystems; drive critical design decisions and reviews.
• Establish patterns and reusable templates for CI/CD, IaC, and agentic service scaffolding.
Continuous Delivery Pipeline
• Plan, define, and implement the continuous delivery pipeline with quality gates, progressive delivery, and rollback strategies.
Architectural Runway
• Develop the architectural runway to support new features and capabilities; align with Solution and Enterprise Architects and portfolio stakeholders.
Integration
• Architect and implement integrations with external components, systems, and platforms (ServiceNow, Snowflake, Power BI, SharePoint, email systems, and enterprise identity/secrets).
Top Skills:
• Windows PowerShell; Linux/Unix administration; Bash and Python
• Azure Cloud (architecture, security, cost, RBAC); Azure DevOps (ADO) with YAML; GitHub Actions
• Docker and Buildah; Kubernetes (CNI), Linkerd; ELK/EFK, Prometheus, Fluentd/Fluent Bit