It is not every day that you get the opportunity to lead the software engineering team that is responsible for the reliability and production quality of some of the most critical application frameworks at Bloomberg.
Our group builds middleware: the software infrastructure used to create large-scale, fault-tolerant applications that run on thousands of machines throughout the world. This infrastructure includes a handful of systems such as BAS (Bloomberg Application Services, a rich framework for microservices), MBUS (Message Bus, a publish/subscribe system using multicast) and BMQ (Bloomberg Message Queues, a high-performance MQ system). Our end users are software engineers with widely varying needs, and we’re trusted to make architectural decisions that will scale across a wide range of use cases.
With thousands of clients depending on our infrastructure solutions, we are looking for a Team Lead for our System Reliability Engineering team. That’s where you come in.
What's in it for you:
Our application frameworks run on tens of thousands of machines and are used every day by over 5,000 engineers, so your work will have an impact across the entire organization. You’ll be trusted to define the processes that make the system as reliable, performant and scalable as possible: able to sustain high volume and high throughput at low latency, with self-healing characteristics. On any given day, you'll make decisions that impact some of the most critical systems at Bloomberg.
You will be a key member of a development team that our clients rely on, leading a highly technical team and influencing the products' technical direction. The job is very hands-on, and all team members spend the majority of their time writing code.
We'll trust you to:
- Inspire and motivate a high-performing team to achieve great results, while supporting individual growth and development
- Establish best practices that result in the highest quality in our products and services
- Review and influence the design and standards of our software
- Respond to and resolve unexpected service problems. Your team will write software to prevent the same problem from happening again
- Manage system releases, write production software acceptance tests and coordinate all aspects of the release including coverage and communication plans
- Create dashboards and instrument the code to capture and publish essential metrics, and use this data to define alerts
- Build data analysis tools to keep track of important service level indicators, predict future capacity needs and audit application configurations
- Automate everything end-to-end, from deployment and configuration management to mitigation of outages
You need to have:
- Demonstrated experience leading a team of software engineers
- An ability to cultivate a collaborative environment through driving a strong culture of teamwork and taking advantage of team diversity
- Strong programming ability
- 3+ years of experience with C++ and Python (or other scripting languages)
- A solid understanding of data structures, algorithms and complexity analysis
- Experience in all phases of the software development lifecycle
- Good knowledge of Linux/Unix
- Excellent problem-solving skills
- Ability to handle periodic on-call duty as well as urgent requests
- Excellent stakeholder relationship management
- The ability to effectively listen, communicate, challenge and influence team members, peers and senior managers
- The desire to take ownership of and responsibility for issues and handle them effectively through to resolution
We'd love to see:
- Extensive experience with fault-tolerant approaches in large-scale distributed environments and high-performance systems
- Good understanding of internet and networking protocols
- Experience with Git, CMake, Jenkins, DPKG, RPM, Docker, Chef, Terraform, OpenStack, VMware, Grafana, time-series databases