Machine Learning Infrastructure Engineer
New York, NY
Posted Apr 12, 2022 - Requisition No. 102458
Bloomberg’s Data Science Platform was established to support development efforts around data-driven science, machine learning, and business analytics. The platform aims to provide scalable compute, specialized hardware and first-class support for a variety of workloads such as PyTorch, Spark, TensorFlow and Jupyter. It was developed to provide a standard set of tooling for the Model Development Life Cycle, from experimentation and training to inference. It provides advanced features such as Hyperparameter Tuning as a Service and is beginning to invest in Model Management and Governance. The platform is built on containerization, container orchestration and cloud architecture, on top of 100% open source foundations.
As the needs of distributed model training, continual learning, data exploration and analysis advance, so do the needs of the compute platform that underpins them. Our platform is poised for continued growth to accommodate the endless number of products across Bloomberg that rely on a robust compute environment. Highlights from our upcoming roadmap focus on creating a highly scaled and performant compute platform that abstracts away common requirements that appear across many use cases. These include creating a highly available federation layer for batch jobs such as training and Spark, increasing compute resource (CPU, memory, GPU) usage efficiency and visibility, enhancing the training experience and integrating with public cloud.
That’s where you come in. As a member of the multi-disciplinary Data Science Platform team, you’ll have the opportunity to make key technical decisions to keep this platform moving forward. Our team makes extensive use of open source (e.g. Kubernetes, Buildpack, Kubeflow, PyTorch, TensorFlow, Jupyter, etc.) and is deeply involved in a number of communities. We collaborate widely with the industry, contribute back to the open source projects, and even present at conferences. While working on the platform, the backbone for many of Bloomberg's up-and-coming products, you will have the opportunity to meet engineers across the company and learn about the technology that delivers products from the news to financial instruments. If you are a software engineer who is passionate about building resilient, highly available infrastructure and seamless, usable full stack solutions, we'd like to talk to you about an opening on our team.
We’ll trust you to:
- Interact with data engineers and ML experts across the company to understand their workflows and requirements to inform the next set of features for the platform.
- Provide GPU management solutions to enhance distributed training performance and resource usage efficiency
- Enhance the distributed training user experience using mainstream and internal training frameworks
- Design a seamless workflow from model training to model inference
- Troubleshoot and debug user issues
- Provide operational and user-facing documentation
- Provide performance analysis and capacity planning for clusters
What we are looking for:
- Have a strong sense of curiosity to solve new problems and keep learning new technologies.
- Have a passion for providing reliable and scalable infrastructure
- Experience building and scaling Docker-based systems using Kubernetes, Swarm or Mesos
- Experience with ML infrastructure open source projects such as Kubeflow, Triton, MLflow, Feast
- Experience with mainstream machine learning frameworks such as PyTorch, TensorFlow
- Experience with distributed systems, e.g. Kubernetes, Kafka, ZooKeeper, Spark
Nice to haves:
- Experience working with authentication & authorization systems such as SPIFFE and SPIRE
- Experience with data encryption
- Experience working with GPU compute software and hardware
- Ability to identify and perform OS and hardware-level optimizations
- Open source involvement such as a well-curated blog, accepted contribution, or community presence
- Experience with cloud providers such as AWS, GCP or Azure
- Experience with configuration management systems (Chef, Puppet, Ansible, or Salt)
- Experience with continuous integration tools and technologies (Jenkins, Git, ChatOps)
- Passion for education, e.g. providing workshops for tenants
Bloomberg is an equal opportunity employer and we value diversity at our company. We do not discriminate on the basis of age, ancestry, color, gender identity or expression, genetic predisposition or carrier status, marital status, national or ethnic origin, race, religion or belief, sex, sexual orientation, sexual and other reproductive health decisions, parental or caring status, physical or mental disability, pregnancy or maternity/parental leave, protected veteran status, status as a victim of domestic violence, or any other classification protected by applicable law.
Bloomberg is a disability inclusive employer. Please let us know if you require any reasonable adjustments to be made for the recruitment process. If you would prefer to discuss this confidentially, please email firstname.lastname@example.org