Senior SRE - Telemetry
New York, NY
Posted May 8, 2020 - Requisition No. 82707
The Telemetry Engineering teams are the authority on monitoring and drive stability behavior across Bloomberg engineering. We have built a comprehensive telemetry platform, with everything from data access to storage and visualization-as-a-service for all of Bloomberg. The Grand Unified Telemetry System (GUTS) improves our ability to detect and resolve issues before they impact our clients.
Our platform stores application and system logs, metrics and traces to provide better insight and analytics to our engineers and operations staff. We have also built an interface that allows users to search and visualize this data, providing a clear picture of what’s happening and who needs to be alerted. This serves as the foundation for several firm-wide stability initiatives.
Challenges we are working on:
- The Telemetry platform is the backbone of our stability initiatives, so it needs to be done the right way the first time
- We are delivering Telemetry as a service for all of Engineering, so ensuring the platform can scale and maintains high uptime is critical
- Many of the technologies we use are unique or new to Bloomberg, so we are also setting standards for how to manage some of the software
Key focuses of the role:
- Automate machine builds, system configuration and infrastructure monitoring (we use Chef for configuration management)
- Contribute to all aspects of the SDLC for our telemetry platform
- Develop features incrementally to add to our live platform, so we are regularly delivering value to our customers
- Work with multiple "best fit for the job" technologies such as the following:
  - Message bus - Kafka, Zookeeper
  - Cluster manager - Kubernetes
  - Time series database - Metrictank
  - Big data - Cassandra, ScyllaDB
  - Log aggregation - Splunk, Humio
  - Health check - NRPE, Icinga
  - Monitoring UI - Grafana
  - CI/CD - Spinnaker
What's in it for you?
- You'll be managing the full stack, which is built on several open source systems
- We're building a telemetry system as a product for the entire company, so you'll have the chance to impact the entire organization
- We work with open source technologies, and you will be encouraged to contribute back to the projects
- This is a fast-maturing team at Bloomberg, so you'll have the chance to help define how our product works during the current and future phases
We’ll expect you to:
- Automate manual, repeatable processes and ensure they run quickly and reliably
- Conduct stress/benchmark testing, analyze performance, find bottlenecks and fine-tune the platform
- Troubleshoot and debug runtime issues with software and hardware
You’ll need to have:
- Proven experience in a programming and/or scripting language (e.g. Python, Golang, Java, Ruby)
- Strong understanding of Linux systems
- Proficiency and experience in provisioning and building infrastructure as code using tools like Chef and SaltStack
- Excellent problem-solving skills
- Proven experience building and scaling out mission-critical, highly available and high-throughput distributed systems
- A self-starter approach with a strong sense of ownership
- A strong familiarity with CI/CD methodologies
We would love to see:
- Open source experience (a well-curated blog, an accepted upstream contribution or community presence)
- Familiarity with one or more of the following technologies: Kafka, Kubernetes, Cassandra
If this sounds like you, please don’t hesitate to apply! If we believe you’re a good match, we’ll get in touch and get started with a technical phone interview.
Bloomberg is an equal opportunity employer, and we value diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.