Site Reliability Engineer, Senior Associate
We have an exciting opportunity in our Blockchain Innovation group, supporting Onyx by J.P. Morgan. As a Site Reliability Engineer, you will lead efforts to help the blockchain team support multiple blockchain networks. In this role, you'll work with the global blockchain engineering team to build a scalable and flexible support model for the production platform, pipeline, monitoring, analytics and reporting.
This role requires a wide variety of strengths and capabilities, including:
- Deep understanding of the DevOps philosophy, technologies, platforms and tools, SLA management, incident resolution, and automation.
- Mastery of application, data and infrastructure architecture disciplines
- Command of architecture, design and business processes with a keen understanding of financial control and budget management
- Expertise in working in partnership with colleagues throughout the firm, and in leading collaborative teams to achieve common goals
- Hands-on experience managing operations of large-scale internet-centric production environments for application or infrastructure services.
- Prior experience in large-scale distributed applications, where uptime and continuous availability were core to the business.
- Identify automation opportunities and partner with Infrastructure teams and AD teams to implement them, driving down time to market and support effort.
- Apply standards of cloud compliance to application design to achieve reliability
- Understanding of networking and cloud technologies, for example security, load balancing and network routing protocols.
- Participate in the weekend support rota.
- Be a self-starter and develop the knowledge needed to support the Quorum network and dApps.
- Leads failure analysis and writes root cause analyses when required.
- Provides support to develop & improve the quality of technical engineering documentation.
- Provides support to drive the maturity of the software development lifecycle.
- Provides expertise in evaluating new support requirements & operational improvements
- Provides quality control of engineering deliverables.
- Implement Ansible automation and scripts that cover repeated tasks and deployment solutions
- Implement diagnostic tools to quickly determine possible causes of application issues
- Champion a DevOps model so that services are automated and elastic across all platforms.
- Bachelor's degree in Computer Science, Information Technology, or equivalent technical field
- Minimum 5 years in a Developer/Support role in a mission-critical distributed application environment
- In-depth OS experience (RHEL, Ubuntu, Windows Server) with strong debugging, troubleshooting, and problem-solving skills
- Experience in site reliability engineering using at least one of the following languages: Python, Java, PowerShell, shell scripting or Go
- Hands-on experience with cloud-based technologies and tools, especially in deployment, monitoring and operations, such as Prometheus, Splunk, Elasticsearch and Grafana
- Strong working knowledge of modern development practices and tools such as Agile, CI/CD, Git, Terraform and Jenkins.
- Deep knowledge of Internet protocols and web services technologies such as HTTP, DNS, TCP/UDP, SOAP, JSON and REST
- Good understanding of networking protocols and cybersecurity best practices in cloud environments
- AWS certification is highly desirable