Job Description
About the job
Site Reliability Engineer
Contract-to-hire role
Must reside in the Minneapolis area
W2 only
We are seeking a Senior Site Reliability Engineer who will be at the forefront of establishing and driving best practices in system reliability, performance optimization, and observability. With over five years of experience, you bring deep expertise in software development and infrastructure operations, particularly in building and maintaining scalable, data-intensive systems. Your key focus will be on defining and implementing Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to ensure our solutions meet rigorous performance standards. You will work closely with cross-functional teams to build observability frameworks that empower teams to monitor, diagnose, and improve system performance proactively. Your leadership and persistence will be vital in identifying and resolving performance bottlenecks, ensuring long-term scalability and efficiency across our systems.
What You'll Be Doing
- Collaborate with development and operations teams to design, implement, and maintain observability frameworks that provide deep insights into system performance, particularly for data and ML pipelines.
- Lead the establishment of Service Level Objectives (SLOs) and Service Level Indicators (SLIs), ensuring they align with business goals and drive continuous performance improvements.
- Partner with stakeholders to understand system performance requirements and translate them into actionable performance engineering strategies.
- Proactively identify performance bottlenecks and collaborate with teams to implement solutions that enhance system scalability and reliability.
- Design and execute performance regression test suites, focusing on data-intensive and ML workloads, to ensure continuous performance optimization.
- Own the reliability and performance metrics of our systems, driving a culture of performance excellence and proactive issue resolution.
- Collaborate with subject matter experts to gain a deep understanding of domain-specific performance challenges, particularly in data and ML pipelines.
- Utilize tools like Datadog, Jira, and GitHub to monitor system performance, manage projects, and track issues, with a strong emphasis on performance-related metrics.
- Define and monitor success metrics, ensuring our systems consistently meet or exceed performance and reliability targets.
- Actively contribute to the continuous improvement of performance engineering practices across the team, fostering a culture of excellence in observability and system performance.
- Perform other duties as assigned.
What You'll Bring To Us
- Bachelor's degree in Computer Science, Engineering, or a related field.
- Five years of experience in a site-reliability-focused role responsible for establishing reliability standards in a cloud-native environment.
- Strong expertise in establishing SLOs/SLIs and building observability frameworks for complex systems.
- Proficiency with cloud services, particularly AWS, and experience in designing scalable and reliable architectures.
- Hands-on experience with performance monitoring and observability tools like Datadog.
- Proficiency in version control systems like Git/GitHub and infrastructure as code tools like Terraform.
- Strong interpersonal skills and excellent communication abilities, with a focus on driving performance improvements across teams.
Preferred
- Proficiency in Java programming and hands-on experience with REST, Spring, and microservices development.
- Proficiency in RDBMS schema design and index utilization.
Job Tags
Contract work