Site Reliability Engineer Job Description
Site Reliability Engineer Duties & Responsibilities
To write an effective site reliability engineer job description, begin by listing detailed duties, responsibilities and expectations. We have included site reliability engineer job description templates that you can modify and use.
Sample responsibilities for this position include:
Site Reliability Engineer Qualifications
Qualifications for a job description may include education, certification, and experience.
Licensing or Certifications for Site Reliability Engineer
List any licenses or certifications required by the position: AWS, ITIL V3, MCSE, IAT II, RHCSA, SSL, DNS, HIPAA
Education for Site Reliability Engineer
Typically a job would require a certain level of education.
Employers hiring for a site reliability engineer most commonly prefer candidates with a relevant degree, such as a Bachelor's or Master's degree in Computer Science, Technical, Software Engineering, Engineering, Computer Engineering, Business, Education, Science, Technology, or Information Systems.
Skills for Site Reliability Engineer
Desired skills for site reliability engineer include:
Desired experience for site reliability engineer includes:
Site Reliability Engineer Examples
Site Reliability Engineer Job Description
- Code Ansible Playbooks in an Amazon Web Services (AWS) Public Cloud environment
- Maintain and operate existing applications via configuration management (Ansible), implementing it for new systems as needed
- Collaborate with the centralized infrastructure team as the engineering stakeholder for UA Record, providing feedback and helping to implement the infrastructure roadmap as applicable
- Act as a proactive advisor to the UA Record team on how to build scalable, manageable services
- Evangelize and educate the development team on scalability, security, and reliability concerns while assisting the team in efforts to build these checks into the development workflow
- 40% - Performance Testing / Optimization of Applications
- Fixing the interesting problems we face in the best way possible
- Participates in Company product lifecycle process
- A love of SRE, open-source, self-service tools, and micro-services
- Experience with AWS multi-region/multi-AZ deployed systems, auto scaling of EC2 instances, CloudFormation, ELBs, VPCs, CloudWatch, SNS, SQS, S3, Route53, RDS, IAM roles, security groups, blue/green deployments, and A/B testing
- Solid understanding of fundamental technologies like TCP/IP, HTTP
- Strong working knowledge of Linux systems and applications
- Must work well with and be able to influence myriad personalities at all levels
- Experience with automation languages like Ruby, PowerShell, or Unix shell
- Experience with automation tooling such as Chef, Docker, AWS
- A bachelor’s degree in Computer Science, a related discipline, or equivalent practical experience
Site Reliability Engineer Job Description
- Monitor and maintain applications per agreed-upon Service Level Objectives
- Support and maintain configuration management for various applications and systems
- Identify and resolve a broad range of problems that occur in production applications and systems
- Take part in the architecture and development lifecycle when implementing systems
- Support the recovery and resiliency strategy and architecture for various applications and systems
- Proactively support capacity planning and disaster recovery and resiliency aspects
- Govern support processes, resiliency and automation principles for the larger organization
- Provide direction and guidance to other infrastructure and DevOps engineers
- Work with business teams to identify complex requirements and their integration into existing and new technologies
- Building large scale messaging infrastructure, data replication, auto-scaling and stream processing
- Comfortable with large scale production systems and technologies, for example load balancing, monitoring, distributed systems, and configuration management
- Strong coding skills in at least one programming language, and a desire to pick up more
- Familiarity with and enthusiasm for software engineering best practices such as testing, continuous integration and continuous delivery
- A passion for solving problems using open source software
- The ability to thrive in a rapidly evolving, globally distributed environment
- Strong Security mindset
Site Reliability Engineer Job Description
- Rapidly debug and respond to user-reported issues on the DGX Platform software stack
- Contribute to the overall health, performance, and capacity planning of DGX Services
- Deliver AWS-based infrastructure solutions using AWS CloudFormation (JSON) for configuration management
- Design, develop and implement software that improves the stability, scalability, availability and latency of the Booking.com products
- Take ownership of services and have the freedom to do what is best for our business and customers
- Build effective monitoring to track the health of your systems, and jump in to handle outages
- Build and run capacity tests to manage the growth of your systems
- Plan for reliability by designing systems to work across our multinational data centers
- Develop tools to assist the product development teams with successfully deploying 1000s of change sets every day
- Share the on-call rotation and be an escalation contact for incidents
- A passion for elegantly solving problems using open source software whenever possible, avoiding complex solutions and reinventing wheels
- A passion for contributing to open source software, fixing bugs and implementing features
- Exposure to cloud platforms, Amazon Web Services (AWS), and APIs
- Experience with automation tooling such as Chef, Docker
- Applies full use and application of engineering methodologies related
- Windows and Apple desktop
Site Reliability Engineer Job Description
- Troubleshoot/understand reliability issues
- Production readiness: ensuring the environment is available and reliable
- Work heavily with AWS technologies (All our systems are in AWS)
- Ensure all systems are highly available, with proper DR solutions in place
- Work to identify and improve upon latency issues
- Work to ensure we are squeezing every bit of performance out of our systems
- Writing code to automate our way through AWS and all related ops processes
- Monitor infrastructure and applications (creating custom metrics, new alarms, dashboards, etc)
- Serving as level 2 escalation for production issues
- Capacity planning, ensuring new systems can support production load and scale appropriately
- Mastery of Linux or Unix
- Proficiency in development languages (Bash, Clojure, Go, Java, JavaScript, Python, Ruby, etc.)
- In-depth understanding of web application models and key components, including the HTTP
- Experience in a similar role or project
- Experience with various data technologies including relational and non-relational databases and message queues
- At least 2 years' experience with virtualization and cloud platforms
Site Reliability Engineer Job Description
- Gathering and analyzing data to root out errors, discern trends, and diagnose complex customer-facing issues (pre- and post-sale)
- Responding to incidents, but more importantly preventing incidents through proactive analysis and monitoring
- Identify and communicate the need for technology improvements/software updates and product innovation
- Work with development engineers to solve platform problems and design code fixes
- Implement changes and design regression tests to make permanent solutions to platform problems
- Experience managing large-scale database systems in a cloud environment
- Strong preference for shipping incrementally with an understanding of the fundamentals of CI / CD
- Design and deliver solutions to improve the availability, scalability, latency, and efficiency of CircleCI’s services
- Diagnose and resolve production issues in conjunction with software engineering teams
- Architect and implement shared infrastructure used by all services within the CircleCI platform, for both SaaS and on-prem configurations
- 4+ years' experience with development and management of Java APIs
- 1+ years' experience with JavaScript frameworks, AngularJS and Node.js
- 1+ years' experience working with cloud automation/orchestration technologies (ie
- Various programming and scripting languages
- Various automation languages/platforms
- Multiple application platforms