Site Reliability Engineer, Zalo
Full-time
A Backend Reliability Engineer (BRE) at Zalo plays a crucial role in ensuring the constant availability, optimal performance, and robust scalability of ZA's in-house backend systems. This position blends the skills of a traditional backend administrator with the principles of software engineering and site reliability engineering (SRE). BREs are proactive problem-solvers who leverage automation, deep technical expertise, and a collaborative mindset to build and maintain resilient and efficient data infrastructure.
🤖 What you will do
- System Reliability and Availability: Design, build, and maintain highly available and fault-tolerant backend systems. Develop and implement strategies for disaster recovery, backup, and restore processes to minimize downtime and data loss;
- Performance and Scalability: Proactively monitor backend performance, identifying and resolving bottlenecks. Optimize queries, tune backend configurations, and plan for future capacity needs to ensure the system can handle growing data volumes and user loads;
- Automation and Tooling: Develop and implement automation for routine backend tasks, such as provisioning, configuration management, and patching. Build and maintain tools to improve the observability and manageability of the backend environment;
- Incident Response and Troubleshooting: Serve as a primary point of contact for backend-related incidents. Troubleshoot and resolve complex production issues, conducting root cause analysis to prevent recurrence. Participate in on-call rotations;
- Collaboration and Consultation: Work closely with software development teams to advise on backend design, schema changes, and query optimization. Collaborate with infrastructure and SRE teams to ensure the backend environment aligns with overall system architecture and reliability goals;
- Security and Compliance: Implement and maintain security best practices for backends, including access control, encryption, and auditing. Ensure compliance with relevant data protection regulations;
- Documentation and Knowledge Sharing: Create and maintain comprehensive documentation for backend architecture, processes, and procedures. Share knowledge and best practices with other engineering teams.
👾 What you will need
- Proven experience in backend administration, backend engineering, or a similar role. Experience with SRE principles is highly desirable;
- Experience with NoSQL databases such as MongoDB, Cassandra, Redis, Scylla, etc.;
- Proficiency in programming languages such as C++, Python, Java, etc.;
- Strong understanding of cloud platforms (AWS, Google Cloud, Azure) and their database services (e.g., RDS, Aurora, Cloud SQL);
- Experience with infrastructure-as-code tools like Terraform or Ansible;
- Knowledge of monitoring and observability tools (e.g., Prometheus, Grafana, Datadog);
- Familiarity with containerization and orchestration technologies (Docker, Kubernetes).