Senior Site Reliability Engineer

Information Technology
Altais Health
Requisition # 2100379

About Our Company 

At Altais, we're looking for bold and curious innovators who share our passion for enabling better healthcare experiences and revolutionizing the healthcare system for physicians, patients, and the clinical community. Doctors today are faced with the reality of spending more time on administrative tasks than caring for patients. Physician burnout and fatigue are an epidemic, and the healthcare experience and quality suffer as a result. At Altais, we’re building breakthrough clinical support tools, technology, and services to let doctors do what they do best: care for people. Come join us as an early member of our passionate and growing team as we change the game for the future of healthcare and enable the experience that people need and deserve.


About Your Team

Do you enjoy working with a highly motivated and talented team to deliver mission-critical healthcare solutions that change the way healthcare is delivered? Altais is growing our Site Reliability Engineering team to help deploy, manage, troubleshoot, and enhance our complex cloud-based services for our customers.

Do you want to push the limits on Amazon Web Services to drive value-based care, using Athena, EMR, Kinesis, Redshift, Glue, MQ, Neptune, Greengrass, SageMaker, Kendra, Lex, Textract, PyTorch, TensorFlow, Transcribe, Polly, and Macie?

We are looking for a highly technical, hands-on engineer with experience using several open-source projects commonly found in large-scale deployments. You will manage our Kubernetes lifecycle: deployments, upgrades, monitoring, and uptime of all K8s clusters. You will help advance the process of deploying software into Kubernetes with GitLab at massive scale. Additionally, you will work towards perfecting the metrics and alerting from Datadog and PagerDuty so that all events are actionable.

Your focus will be on maximizing system uptime. Team members all participate in an on-call rotation.

You will build innovative automated solutions and tools to help debug and resolve problems in production and prevent them from recurring. Further, you will proactively seek out system weaknesses and fix them before they cause production issues by using monitoring data, watching trends, and applying Chaos Engineering.

This position is located in our brand-new Oakland City Center location.


About Your Work

  • Keeping your assigned site or service up and running or getting it back up and running quickly when a failure occurs
  • Automating work including infrastructure needs, testing, fail-over mitigation, and much more
  • Developing CI/CD processes to improve cadence
  • Working closely with internal partners and teams to ensure that we ship software that meets security, SLA, and performance requirements
  • Debugging complex problems across an entire stack and creating solid solutions
  • Conducting post-incident reviews to find out what’s working and what’s not, and improving the process by filling the gaps
  • Writing and updating user documentation, including runbooks/playbooks
  • Using Chaos Engineering to test what you build under real-world conditions
  • Running monthly Chaos Engineering “Game Days”

The Skills, Experience & Education You Bring:

  • 10 years of experience with software engineering, software development, or system operations.
  • Experience designing, building, and operating large-scale production Software-as-a-Service platforms.
  • Experience with monitoring and observability tools such as Datadog and Prometheus.
  • Production experience with DevOps or site reliability engineering running web and/or mobile applications.
  • Excellent communication skills, both verbal and written.
  • Advanced experience with Terraform (Optional: CloudFormation).
  • Hands-on experience with the AWS cloud platform (Optional: GCP or Azure).
  • Experience debugging complex problems, including applications running on Kubernetes and EC2 instances.
  • Knows their way around a Unix/Linux shell, can write shell scripts, and understands Linux internals.
  • A solid understanding of Node.js and Java.
  • Moderate understanding of how databases work, including writing queries and troubleshooting complex data layers, particularly with open-source databases (MySQL, Postgres, Redis, Cassandra, etc.).
  • A solid understanding of networking and core Internet protocols (e.g. TCP/IP, DNS, SMTP, HTTP, and distributed networks).
  • Understands networking and messaging, especially between services.
  • Has hands-on experience using source control (Git, GitHub, GitLab) and feature branching strategies.
  • Has a track record of embedding security into the fabric of an organization and infrastructure.


You Share our Mission & Values: 

  • You are passionate about improving the healthcare experience and want to be part of the Altais mission.
  • You are bold and curious: willing to take risks, try new things, and be creative.
  • You take pride in your work and are accountable for the quality of everything you do, holding yourself and others to a high standard.
  • You are compassionate and are known as someone who demonstrates emotional intelligence, considers others when making decisions and always tries to do the right thing.
  • You co-create, knowing that we can be better as a team than as individuals. You work well with others, collaborating and valuing diversity of thought and perspective.
  • You build trust with your colleagues and customers by demonstrating that you are someone who values honesty and transparency.
