Site Reliability Engineer Job at Berkley Hunt, San Francisco, CA

  • Berkley Hunt
  • San Francisco, CA

Job Description

Senior Site Reliability Engineer (GPU Compute) | Hybrid – Bay Area, CA

Berkley Hunt is supporting a fast-growing AI startup building a high-performance, cloud-native platform to power cutting-edge machine learning workloads. As they scale, they’re hiring a Senior/Staff Infrastructure Engineer to lead the development of a scalable GPU compute environment from the ground up.

About the Role:

This is a high-impact role for an experienced infrastructure engineer who thrives in fast-paced environments and wants to shape the future of AI infrastructure. You’ll design, build, and operate the systems that enable high-throughput GPU workloads at scale—collaborating closely with the core engineering team to optimize performance, efficiency, and reliability.

If you're excited about solving deep technical challenges in distributed compute and cloud automation, this could be a standout opportunity.

Responsibilities:

  • Build and maintain a large-scale, distributed GPU compute platform powering AI workloads.
  • Develop backend systems in Python to orchestrate GPU jobs and manage routing, observability, and capacity.
  • Design and implement infrastructure with tools like Terraform, Ansible, and Kubernetes across cloud and bare metal environments.
  • Own the reliability, scalability, and performance of the platform, from provisioning to deployment and monitoring.
  • Collaborate with the engineering team to shape infrastructure vision and technical strategy over the next 1–5 years.
  • Drive automation and improvements to minimize operational overhead and scale efficiently.

Requirements:

  • 6+ years of experience in cloud infrastructure or backend engineering roles.
  • Deep knowledge of distributed compute systems, especially involving GPU orchestration.
  • Proficiency with Python and infrastructure-as-code tools (e.g., Terraform, Ansible).
  • Solid experience with Kubernetes and CI/CD pipelines.
  • Strong understanding of cloud platforms (AWS, GCP, or Azure); bare metal experience is a plus.
  • Excellent problem-solving skills and a proactive, ownership-driven mindset.

Nice to Have:

  • Experience at a high-growth startup or in scaling large infrastructure systems.
  • Familiarity with GPU resource scheduling and performance optimization.
  • Hands-on experience with observability stacks (Prometheus, Grafana, Loki, Thanos).
  • A passion for automation, infrastructure design, and moving fast without breaking things.

