Objectives

You'll be the backbone of the team, responsible for establishing a stable system foundation and driving automation to boost operational efficiency. You'll ensure our systems and services excel in reliability, availability, and performance. Behind every successful system there's a robust infrastructure, and yes, we're talking about you. Join us to make major infrastructure decisions that better serve our internal customers, backed by an inclusive culture, frequent brainstorming sessions, and real business challenges. You might also put on some extra weight enjoying the company snacks and beverages along the way.

Responsibilities

1. Collaborate with the development team to design, build, and maintain scalable and reliable infrastructure and services.

2. Take ownership of the overall performance and service availability of application systems.

3. Build and maintain monitoring, alerting, and incident response processes to proactively identify and resolve system issues.

4. Carry out continuous deployment tasks using CI/CD tools.

5. Assist with performance analysis and capacity planning to optimize system performance and resource utilization.

6. Actively research industry best practices and emerging technologies in Site Reliability Engineering and share your findings.

Qualifications

Required

- Bachelor's degree in Computer Science, IT, or a related field.

- At least 1 year of practical experience in SRE or a related field.

- Proficiency in networking concepts (e.g., TCP/IP, HTTP, DNS).

- Experience setting up and configuring web servers (e.g., Nginx, Apache) for hosting websites and applications.

- Familiarity with Linux system administration (e.g., Ubuntu, CentOS) and an understanding of basic underlying operations.

- Familiarity with GCP or other cloud platforms (e.g., AWS, Azure).

- Familiarity with microservices architecture and Docker.

- Familiarity with CI/CD pipelines and tools (e.g., GitLab CI, Jenkins).

- Familiarity with shell scripting.

- Familiarity with monitoring and alerting tools (e.g., ELK, Prometheus, or Grafana).

- Strong problem-solving and troubleshooting skills.

- Excellent communication and collaboration abilities.

Preferred

- Experience with caching and queuing systems (e.g., Redis, RabbitMQ, Celery, Kafka).

- Experience with container orchestration tools such as Kubernetes.

- Familiarity with mainstream relational and NoSQL databases (e.g., MySQL, MongoDB).

- Familiarity with at least one programming or scripting language (e.g., Python, JavaScript/Node.js, or Bash).