Engineer, Backend (Events Team)

Job Description

Zapier's mission is to make work easier and more productive for everyone. The Events team is at the heart of that: we own and operate the event streaming and queuing infrastructure that powers Zapier at scale, processing over 24 billion events per day.

If you're the kind of engineer who gets excited about owning critical infrastructure, solving hard reliability problems, and building tooling that hundreds of engineers depend on every day, this role is for you.

The Work

You'll be joining a small, senior team that operates Kafka (MSK), SQS, and RabbitMQ as shared infrastructure for most of Zapier's engineering organization. We sit within Zapier's Internal Platform Zone (ZIP), and our systems are Tier-1: when something goes wrong, it affects everyone.

The work is a blend of platform operations and backend engineering. In any given week that might mean tuning Kafka consumer groups, building a library that simplifies how other teams emit events, contributing to our multi-region expansion, or running an incident response. We work in Kanban, pair frequently, and take on-call seriously but sustainably.

Current projects include:

Building a new SQS-based queuing library to eventually replace our Celery/RabbitMQ setup
Multi-region infrastructure to support Zapier's enterprise growth
An AI-powered Events chatbot to help internal teams self-serve faster
Ongoing observability improvements and SLO work

About You

You have 4+ years of software engineering experience with Python as your primary language. You write Python daily and have used it to build production systems, with at least 2 years focused on building and operating event streaming systems at scale. Go or TypeScript experience is a plus, but won't substitute for Python depth.
You value collaboration. You understand that building modern software is a team sport, and you enjoy working as part of a tight-knit team. You're happy to pitch in and help the team, whether by reviewing code, pairing on a tricky problem, or just thinking about how to solve the challenges we're facing.
You value exploration and versatility. You are comfortable working on problems that may not be well defined. You are eager to jump in to learn, research, and propose options to consider when solving a hard problem that the team has not faced before. You love researching new technology, experimenting with new ideas, and driving forward with implementation details.
You can balance lots of concerns. You manage incoming work, prioritize tasks effectively, and stay organized. Your role is crucial to meeting project and internal customer demands and delivering reliable, well-tested solutions, all while working in a fast-paced environment.
You advocate for the user. You have a keen eye for great design, and you're empathetic to the needs of the end-user. When you see users struggling to succeed you take it as a personal challenge to understand why and help the team build a better product.
You embody our values. At Zapier, our values are at the heart of how we work together and how we think about our customers. In our remote setting, they help develop trust and ensure we work and collaborate to democratize automation.

Required Technical Skills/Experience:

Experience working with event architectures and services based on technologies like Kafka (MSK) and Avro. You have supported event-system infrastructure to ensure resiliency and uptime.
Experience designing or maintaining highly available, cloud-based infrastructure in AWS or another cloud provider. You understand how to use infrastructure-as-code tools (Terraform) and have learned best practices for reliability and observability.
Strong experience with AWS services, cloud computing technologies, and distributed data stores.
Experience building automated tools in languages like Python or Go. You believe in hands-off deployments and infrastructure as code.

Nice-to-Have Skills/Experience:

Strong problem-solving and analytical thinking skills, combined with excellent collaboration and communication abilities.
A natural curiosity and eagerness to learn and explore new technologies and solutions.
SRE experience working with and supporting existing systems to ensure uptime and reliability.
Experience working with queues in the cloud or SaaS solutions. SQS experience is highly preferred.
Knowledge of CI/CD pipelines (e.g., using a tool like GitLab).

Things You'll Do

Operate and improve Kafka (MSK), SQS, and RabbitMQ infrastructure: configuration, performance tuning, schema management (Avro), and reliability work
Build backend services, toolkits, and libraries in Python (primary) and Go or TypeScript that help other teams integrate with our event systems
Use Terraform to manage infrastructure as code across AWS: MSK, SQS, Lambda, S3, Aurora, and Redis
Monitor system health using Grafana and Datadog, define and maintain SLOs, and troubleshoot issues proactively
Participate in on-call rotations (3-day, 24-hour shifts) and lead or support incident response for Tier-1 systems
Contribute to data governance: schema design, event structure, and data hygiene practices across Zapier
Collaborate with cross-functional teams including Edge, Runner, Data, and SRE

You'll also have the opportunity to specialize in a variety of areas of the Zapier codebase. Focusing on a specialization will not limit your growth at Zapier as we believe that each engineer brings a unique perspective and can contribute in all areas. We encourage collaboration and will frequently have engineers contribute across teams to assist with projects as needed.

Job Information

Salary

USD 143,900 - 215,900 / year

Employment Type

Full-time

Job Category

Engineering

Location

Canada, North America

Company

Zapier

At Zapier, we build and use automation every day to make work more efficient, creative, and human. So if you're using AI tools while applying here, that's great! We just ask that you use them responsibly and transparently.