As the Site Reliability Engineer, you will be responsible for ensuring the availability, reliability, and performance of our customer-facing software applications. This role combines planning, engineering, monitoring, incident response, and administration to create highly scalable and fault-tolerant systems. You will handle complex and escalated application and network infrastructure support cases, including troubleshooting of Ethernet networking problems. You will develop, review, and approve customer-facing and internal documentation on best practices, troubleshooting flowcharts, training materials, and FAQs. You will act as a technical team lead, technical resource, and coach for the Associate Engineer – Support and Engineer – Support team members.
Responsibilities:
Collaborate with engineering, technical services, and quality assurance divisions on any problems, software bugs or emerging customer needs
Ensure the high availability and reliability of the production environment by monitoring system health and performance
Provide primary operational support for large-scale distributed software applications
Facilitate incident resolution via triage, communication, engagement, escalation, and documentation
Partner with platform administration (both internal and external) to define and achieve stability and scalability objectives
Collaborate with technical and quality teams to improve services by identifying areas of risk and helping to define and proactively implement solutions
Drive continual improvement in system performance by setting service level objectives in collaboration with a performance center of practice and/or product development teams
Participate in system design, capacity planning, and platform management
Analyze and publish metrics from operating systems and applications to assist in performance tuning and fault finding
Pursue opportunities for automation and process improvements
Deliver exceptional levels of customer service, providing infrastructure support per Service Level Agreements (SLAs)
Handle escalated cases, including troubleshooting of complex audio, video, and Ethernet networking problems
Handle security patch and vulnerability management
Evaluate, identify, and replicate issues, and follow an escalation process to reach desirable outcomes and ensure a positive customer experience
Serve as a technical resource to other functional groups and individuals to improve service quality and user experience
Develop customer-facing and internal documentation on best practices, troubleshooting flowcharts, training materials, and FAQs to ensure a consistent customer experience
Take ownership of cases escalated by Associate Engineers and Engineers and see them through to resolution
Qualifications:
Bachelor's degree in an engineering-related discipline required; Master’s degree preferred
Experience providing first-level incident response and troubleshooting with technical teams to resolve end-user issues
Proficiency with enterprise system monitoring software (examples: New Relic, Nagios, SolarWinds, Dynatrace, Datadog, Azure Monitor, Splunk)
Experience with cloud-based infrastructure, databases, and applications
Experience with performance tuning and fault finding in large-scale distributed systems.
Experience with designing, implementing, and managing performance testing practices, including specific tools and frameworks
Knowledge of disaster recovery planning and execution.
Ability to effectively work in a highly matrixed organization
Strong understanding of coding, automation, and engineering principles to build resilient, self-healing systems
Familiarity with DevOps practices and tools
Experience with Jira (or an equivalent work management tool) and Confluence (or an equivalent knowledge management tool)
Licenses/Certificates/Designations - IT industry networking certifications such as CCNP or JNCIP; ITIL or equivalent
Minimum 5 years of experience supporting network and AV operations
5 years of experience required delivering support for Ethernet technologies, AV, and networking concepts
Advanced knowledge of platform OS (router platform, VXLAN, WAN, LAN, and routing protocols) and how they interact with the network
Ability to apply principles, theories, and concepts, as well as knowledge of related networking/AV disciplines
Advanced knowledge of, and adherence to, change management processes
Proficiency in network routing and switching
Possess a customer-centric mindset
Possess strong computer skills, including proficiency with Microsoft Office (Outlook, Word, Excel, and PowerPoint)
Excellent oral and written communication
Wesco International, Inc., including its subsidiaries and affiliates (“Wesco”), provides equal employment opportunities to all employees and applicants for employment. Employment decisions are made without regard to race, religion, color, national or ethnic origin, sex, sexual orientation, gender identity or expression, age, disability, or other characteristics protected by law. US applicants only: we are an Equal Opportunity and Affirmative Action Employer.