Machine Learning Engineer

Rethink recruit
Full-time
On-site
Palo Alto, United States

About FEDML

FEDML is an AI research and development company focused on generative AI platforms at scale. It provides a generative AI platform and foundation models that enable developers and enterprises to build and commercialize their own generative AI applications easily, scalably, and economically. Its flagship product, FEDML Nexus AI, provides unique features spanning enterprise AI platforms, model deployment, model serving, AI agent APIs, launching training and inference jobs on serverless or decentralized GPU clouds, experiment tracking for distributed training, federated learning, security, and privacy.


We are an equal opportunity employer and do not discriminate on the basis of race, religion, national origin, gender, sexual orientation, age, veteran status, disability or any other legally protected status.

FEDML’s HQ is located at The Lucky Building, 165 University Ave, Palo Alto, CA. Join us in shaping the future of technology.


 


Our Benefits

  • Floating holidays
  • Medical, dental, and vision coverage for employees and family members
  • Truly competitive salary and equity
  • 401(k) (U.S. specific)




Machine Learning Engineer 


Responsibilities

  • Solve complex, large-scale problems in model training, fine-tuning, and inference
  • Play a key role in the end-to-end design and implementation of our product, a platform that powers use cases across training and serving of generative AI models
  • Work closely with the company's ML researchers and with customers to identify key areas of development for our generative AI platform
  • Take strong end-to-end product ownership: translate product requirements into user interfaces and backend distributed system designs, and own the implementation of those designs
  • Design and build the core platform infrastructure that supports our customer-facing product features
  • Ensure the reliability, security, and scalability of the backend distributed systems that power all aspects of our product
  • Collaborate with product managers and cross-functional teams to drive technology-first initiatives that enable novel business strategies and product roadmaps
  • Support our user community through documentation, talks, tutorials, and collaborations
  • Contribute to the broader AI community by publishing research, presenting at conferences, and actively participating in open-source projects




We Look For

  • Hands-on experience with the internals of deep learning frameworks (e.g., PyTorch, TensorFlow) and GenAI models (e.g., GPT, Stable Diffusion, Mistral)
  • Experience with large-scale distributed training on GPUs
  • Strong sense of design and usability
  • Effective communication skills and the ability to articulate complex technical ideas to cross-disciplinary internal and external stakeholders
  • A history of contributing to or developing open-source projects is a bonus, not a requirement