Senior Software & Reliability Engineer - Data Distribution
Posted Jul 16, 2021 - Requisition No. 92299
At Bloomberg, data is our business and we deliver it with speed. We source information from more than 370 exchanges, 4,000 FX feeds and 80,000 news wires worldwide – a combined load of more than 60 billion messages, every day.
The Real-time Distribution Platform (RDP) group builds Bloomberg's data-distribution infrastructure. We provide low-latency exchange-sourced market data (for example, stock prices) and value-add Bloomberg-derived data via our open API. We develop scalable, distributed, high-performance software that delivers this mission-critical information to all Bloomberg desktop customers and many enterprise applications.
In the SRE team, our mission is to ensure the optimal availability, latency, scalability and efficiency of the RDP infrastructure for more than ten thousand client-facing applications.
We achieve this with a balance of operational support and software development, applying software engineering principles to improve the overall reliability of the system.
We’ll trust you to:
- Build services and UIs to manage the application configuration for thousands of machines
- Develop and maintain tools to automate and simplify investigating and resolving production problems
- Help to create dashboards, monitoring and alerting to track the health of the live system
- Understand the current system capacity and load, predict future demand and make appropriate scaling recommendations
- Define standards and best practices with respect to logging, latency, troubleshooting and monitoring
- Work with application teams to review and influence the design of software to improve its reliability
- Facilitate continuous integration / continuous deployment to automate deployment and quality control (including functional and capacity testing)
- Investigate, triage, and troubleshoot production problems as they occur
You'll need to have:
- Hands-on software development experience in C/C++, Python or any other programming language
- A strong understanding of how large-scale distributed systems are designed and built
- A proven track record of triaging and solving live production problems with such systems
- The ability to work in a collaborative and inclusive team environment
- The skills to effectively listen to, communicate with, challenge and influence team members and peers
In addition, we'd love to see:
- Experience with monitoring software such as Splunk, Humio or Grafana
- Practical knowledge of networking protocols such as TCP, UDP and IP
- Experience of latency monitoring and capacity planning
- Good knowledge of Linux
- Knowledge of continuous integration / deployment systems such as Jenkins, and experience of system testing
We are an equal opportunity employer and value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.