Site Reliability Engineering Job Description
Site Reliability Engineering Duties & Responsibilities
To write an effective site reliability engineering job description, begin by listing detailed duties, responsibilities and expectations. We have included site reliability engineering job description templates that you can modify and use.
Sample responsibilities for this position include:
Site Reliability Engineering Qualifications
Qualifications for a job description may include education, certification, and experience.
Licensing or Certifications for Site Reliability Engineering
List any licenses or certifications required by the position: ITIL, AWS, DNS, CCNA, RHCE, SSL, HTTP, TCP, TLS, SQL
Education for Site Reliability Engineering
Typically a job would require a certain level of education.
Employers hiring for a site reliability engineering role most commonly prefer candidates with a relevant degree, such as a Bachelor's or Master's degree in Computer Science, Technical, Engineering, Mathematics, Software Engineering, Science, Systems Engineering, Management, Information Systems, or Physics.
Skills for Site Reliability Engineering
Desired skills for site reliability engineering include:
Desired experience for site reliability engineering includes:
Site Reliability Engineering Examples
Site Reliability Engineering Job Description
- Ensure adherence to SLAs and quality standards
- Help design and improve data pipelines with the goal of making them easily monitored and cost effective
- Commit code for instrumenting new and existing data pipelines with stats and monitoring hooks
- Install, upgrade, and maintain our production Splunk infrastructure
- Own, scale, and improve our in-house stats infrastructure, supporting over 600,000 individual metrics per minute
- Evaluate next-generation monitoring and metrics collection tools and utilities
- Help guide our Data Engineering team towards SRE best practices
- Own and operate the architecture and systems that collect data in real-time from over 120 million unique users per month
- Select and develop automation tools and scripts to improve the availability, manageability, scalability and operability of services
- Solve performance and stability issues and prevent their recurrence
- A passion towards automating things
- An understanding of the 12 Factor App
- A high degree of interest in Linux containers and smart clustering solutions like Kubernetes/Mesos/fleet
- Strong experience in at least one infrastructure component (operating systems, compute, storage, networking, distributed systems, big data, cloud, containers) and solid understanding of the rest and how they impact services
- Bachelor's degree in Computer Science or equivalent qualification/experience
Site Reliability Engineering Job Description
- Implement comprehensive service monitoring to ensure uptime and performance, including synthetic, real user, system, application performance, dashboards
- Define, measure, and meet key Service Level Objectives including availability, performance, incidents and chronic problems
- Partner with application and business stakeholders to ensure a high-quality product is developed and released into production
- Partner with application owners to ensure adequate performance, scalability, and reliability of underlying infrastructure
- Establish the annual release calendar in partnership with application owners and monitor adherence to the Release Management processes, policies and procedures
- Roll up your sleeves and debug/tune/code/fix alongside your team
- Coach and mentor junior engineers and new college graduates
- Evaluate, innovate, develop, and support any variety of internal PE&O automation systems geared to produce efficiency at scale
- Able to recognize and articulate the difference between good and bad design at numerous levels
- Provide internal production system support
- Significant experience in designing, delivering and managing data infrastructure at scale
- A deep technical understanding of modern batch and real-time data technologies
- A proven track record of managing large volumes of data in cloud services while controlling costs
- Advanced knowledge of Unix/Linux systems
- Ability to write code
- Ability to learn rapidly and communicate value of new technologies to technical and non-technical audiences
Site Reliability Engineering Job Description
- Provide standardized offerings that facilitate secure access to stacks and the cloud environment overall
- Manage engineers working with the engineering teams on our back-end services such as Hadoop, HDFS, Memcached, Redis, Kubernetes, AWS, Java, Golang, and Linux
- Directly leading and training a team of Site Reliability Engineers focused on high availability
- Coach and train engineers to actively diagnose the production environment in real time by analyzing code, log files, network traces and request/response pairs
- Ensure team is working efficiently and effectively to identify root cause of failures, determine quickest path to resolution, and take actions to prevent similar issues from occurring in the future
- Build and maintain relationships with product managers, support teams and leadership
- Interface with front-end and back-end developers providing performance data and guidance on areas for improvement
- Work with vendor contacts to manage business relationship and support needs
- Participate in shared on-call support phone rotation and handle escalations
- Define and execute on a roadmap evolving our monitoring and reliability capabilities
- Meticulous and careful
- Experience with web-based tool development (Python/Django, Java, Ruby/RoR), and building infrastructure tooling and reporting
- Automation mindset - if you can automate it, do it
- Have expert level skills in Linux/Windows system and network administration and agile implementation of production systems
- 10+ years of hands-on technical experience combined with strong management and communication skills
- Solid understanding of Windows, Linux, Networking, TCP/IP, Routing, Switching, Firewalls, Load balancers and other infrastructure components
Site Reliability Engineering Job Description
- Lead lifecycle management process to ensure clearly defined roadmaps for new technology solutions ensuring seamless transformation and adoption by the business
- Define and document technical requirements for new capabilities, working with key suppliers to design, build, and lab-certify solutions, ensuring compliance with all functional, operational and business objectives
- Serve as the highest escalation point for critical and/or chronic incidents, providing subject matter leadership to Operations and helping to restore service
- Lead/Contributor on key network projects, representing the organization and working closely with Project Managers, Operations, Business Units, Suppliers, Peer Organizations and IT stakeholders
- Development experience with automation functions to set up, configure, and upgrade various network technologies, improving quality and reducing manual efforts
- Lead, coach and develop engineers across 3 shifts, including remote employees
- Manage shift leads who each have direct reports
- Oversee 24/7/365 coverage in support of our domestic and international businesses
- Run major incidents
- Work with the product, infrastructure, and engineering teams daily
- Strong troubleshooting experience and skillset to resolve incidents across multiple domains
- Demonstrated ability to establish and maintain metrics-based process improvement
- Interest or experience in cloud technologies (AWS, Docker, Kubernetes)
- Practical expertise in managing and leading application reliability practices for consumer-facing web and mobile experiences
- Ability to work across teams to continuously analyze system performance in production, troubleshoot consumer-reported issues, and proactively identify areas in need of optimization
- Previous experience with developing and driving real-time monitoring solutions that provide visibility into site health and key performance indicators
Site Reliability Engineering Job Description
- Lead and support a highly experienced PaaS/SaaS product deployment and maintenance team
- Cloud, IT service and support vendor management experience
- Maintain SLAs of Data Enabled Business’ cloud service and application offerings
- Designing and developing tools and processes to maintain large applications and services at scale
- Helping our engineers and data scientists build software that scales in terms of performance and stability
- Ruthlessly identifying and removing system bottlenecks before they ever impact performance
- Working side by side with on-call engineers to handle emergencies and then running postmortems to ensure they don’t happen in the future
- Establishing best practices inside the organization, proving that they work and then bringing them to other DEB teams and JCI
- Where you can provide the most value
- Promptly respond to incoming communications (telephone calls, emails, instant messaging) and direct reports and information requests appropriately
- 3-7 years’ technical experience working with consumer-facing (e-commerce) software applications
- Experience with service discovery tools such as
- Can read and write in programming languages
- A strong desire to dig deep into a wide range of technologies, and a relentless drive to make the customer experience better through investments in automation and infrastructure improvements
- Proven experience working with infrastructure components (operating systems, compute, storage, networking, distributed systems, big data, cloud, containers) and solid understanding of the rest and how they impact services
- Experience applying SRE skills to both drive quality improvements of already deployed services and, more importantly, cross-train colleagues to build site and service quality into new development and integration efforts