Systems Reliability Engineer

New York, NY

Posted Aug 4, 2016 - Requisition No. 53185

Our Team:

Bloomberg systems are fast and reliable and we're the team that makes that possible. We build middleware - the software infrastructure designed for creating large-scale, fault-tolerant applications that run on thousands of machines throughout the world. We’re two dozen C++ programmers building a complex infrastructure using a variety of programming paradigms such as RPC, publish/subscribe and message queues. With thousands of clients depending on our infrastructure solutions, we are looking to grow our SRE team. That’s where you come in.

What's in it for you:

As a Systems Reliability Engineer (SRE) working on this critical infrastructure, you’ll focus on automating everything from build and deployment to outage reaction and remediation. You will work on all aspects of this end-to-end system to support the Bloomberg API.

We'll trust you to:

Take responsibility for deployment after Beta for Bloomberg's messaging and multicast services
Ensure level 1 support for production issues
Automate everything from outage response to quality checks for new builds
Provide feedback to developers to make this infrastructure increasingly resilient

You need to have:

3+ years of experience as a software engineer or developer working on high availability, large-scale distributed applications
Excellent programming skills. You don't need to be a rock star C++ programmer, but you need to know at least a little bit about C++, and you do need to be a great programmer in other languages such as Python, Ruby, Perl, Scala or JavaScript.
A strong understanding of the UNIX/Linux command line
A passion for performance excellence and an engineering mindset
Previous experience with data, statistics and latency numbers
A Bachelor's degree in Computer Science or equivalent experience

We'd love to see:

Strong leadership skills
Prior experience as a systems performance or site/systems reliability engineer
Extensive experience working with fault-tolerant approaches in a large-scale distributed environment with high performance systems
A deep understanding of Internet and networking protocols, including IP multicast (PGM)
Knowledge of how to analyze network, performance and application issues using standard tools (tcpdump or Wireshark)
2+ years of Chef, Puppet or Ansible system configuration experience (error handling, idempotency, configuration management)
Experience with virtualization and Infrastructure as a Service models
The ability to handle periodic on-call duties as well as out-of-band requests

Our infrastructure is built to fully automate deployment and operations using Chef, and it is developed and open sourced at https://github.com/bloomberg/chef-bcpc. We want to work with others who are passionate about automation and community-driven development, both within the company and with the wider open source community. If this sounds like you, submit an application, and check out some of our other open source contributions:

https://github.com/bloomberg/nginx-cookbook
https://github.com/bloomberg/zookeeper-cookbook