Site Reliability Engineering (SRE) at TikTok combines software and systems engineering to build and run large-scale, massively distributed, and fault-tolerant systems.
In our team, you'll have the opportunity to manage the complex challenges of scale, while using expertise in coding, algorithms, complexity analysis, and large-scale system design.
We embrace a culture of diversity, intellectual curiosity, openness, and problem-solving.
We encourage close collaboration while promoting self-direction.
To enhance collaboration and cross-functional partnerships, among other things, our organization currently follows a hybrid work schedule that requires employees to work in the office 3 days a week, or as directed by their department.
We regularly review our hybrid work model, and the specific requirements may change at any time.
Responsibilities
* Engage in and improve the whole lifecycle of services, from inception and design, through development, capacity planning, and launch reviews, to deployment, operation, and automation
* Design and implement dashboards and monitoring frameworks for efficient, automated, and intelligent service-oriented architecture (SOA) governance
* Scale systems elastically through mechanisms such as automation; evolve system reliability, efficiency, and velocity by pushing for changes
* Practice efficient customer support, incident response, and blameless postmortems