Site Reliability Engineer

San Francisco, CA · Full Time - Posted by Jacob Perkins on November 7, 2018

“SRE is what happens when you ask a software engineer to design an operations team.” – Benjamin Treynor Sloss

At Insight Engines, cloud operations is a vital part of the technology and engineering team. Our goal is to be able to sleep peacefully through the night while terabytes of data fly through our systems. We are looking for someone to help us build this large-scale data processing platform and lead site reliability engineering.

But enough about us, let’s talk about you. Do you enjoy developing and deploying auto-scaling clusters in the cloud? How about digging into and analyzing metrics for process optimization? Are you down with data at cloud-scale? As an integral member of our technology team, you’ll engineer, operate, and maintain everything from ETL pipelines to OLAP datastores, operationalizing the infrastructure that powers our groundbreaking natural language platform. You’ll wear many hats (no fedoras, please), touch many parts of our system, and have a significant impact on our products.

The kinds of problems you’ll work on include:

  • Scaling high-volume data systems
  • Deploying, maintaining, and owning high-performance ETL and OLAP systems
  • Designing, implementing, and maintaining robust monitoring and alerting to improve performance and reliability
  • Leveraging existing open-source technologies like Kafka, Hadoop, Druid, Spark, Kubernetes, Postgres, Docker, and other tools

When applying, please tell us about your real-world, large-scale data platform experience. Women, People of Color, Minorities, and LGBTQ candidates are encouraged to apply.

Qualifications

  • BS, MS, or PhD in Computer Science, Engineering, or related discipline, or 3+ years equivalent technology experience
  • 3+ years of software development (Go, Python, Java, or equivalent)
  • 3+ years of GNU/Linux and/or remote system administration experience or equivalent
  • Experience designing and operating robust, large-scale distributed systems
  • Operational experience with technologies such as Hadoop, Kafka, or Spark
  • Cloud deployment experience on AWS, GCP, or equivalent
  • Experience with automation, configuration management, and developing infrastructure as code
  • Commitment to engineering best practices: delivering high-quality production code, using automated testing, and building reusable components
  • Authorized to work in the United States

Company benefits

  • Open vacation policy
  • Health care insurance
  • Dental & vision insurance
  • Life insurance
  • Short-term & long-term disability insurance
  • Health care FSA
  • Transit & parking FSA
  • Free lunch at SF office
  • Flexible work hours
  • Holiday time off