Site Reliability Engineer

Hybrid
- Kuala Lumpur, Wilayah Persekutuan Kuala Lumpur, Malaysia
Engineering

Job description

Join us in providing work that matters

WCC is a global leader in software solutions for HR and employment services. Since 1996, we have supported governments, workforce agencies, and large enterprises with Responsible AI–driven technology that connects people, skills, and opportunities in meaningful and sustainable ways.

With deep roots in the public employment sector, WCC has helped governments worldwide to improve labor-market performance, strengthen citizen services, and achieve better employment outcomes at scale. This experience now extends to large organizations seeking stronger workforce insights, enhanced internal mobility, and more effective career development for their employees.

Every day, hundreds of millions of people rely on our technology to find suitable opportunities, powered by advanced matching, skills intelligence, and real-time labor-market insights.

What’s in it for you?

Our team - Our people believe unity is one of our strengths. So, if teamwork is important to you, we trust you will enjoy working in a team where people feel welcome, valued, and respected.

Work environment - We focus on talent and possibilities, not limitations. We love challenges and exploring new creative horizons. WCC has a diverse environment that gives every person the freedom to express their ideas.

We want to give you the conditions to do your best work, so here are the perks and benefits we provide:

Competitive salary
Indefinite contract
Health insurance
Travel allowance
21 vacation days
13th salary
Personal development opportunities
Hybrid policy (working from home / working from the office)
Home office budget
An opportunity to create an international and diverse network

Role

As a Site Reliability Engineer, you have a unique role in our organization, working at the intersection of software development, system administration, and IT operations. You support our product owners and DevOps team in determining which new features can be launched and when, using service-level agreements (SLAs) to define the required reliability of the system, expressed through service-level indicators (SLIs) and service-level objectives (SLOs).

Responsibilities

Ensure the availability and efficient working of the services in compliance with the non-functional expectations
Plan and implement continuous improvements and changes in the ecosystem through automation
Handle service interruptions through to resolution within the defined SLAs, with a mindset of continuous improvement

React to events (monitoring alerts, support escalations, internal incidents), i.e. incidents that affect the application or the underlying infrastructure. Troubleshoot and resolve the service interruption, either hands-on or by guiding third parties through incident resolution with clear instructions
Provide information for root cause analysis and/or conduct postmortem and provide reports
Provide recommendations/workarounds for identified problems
Liaise and act with others (vendors, internal teams) on incident and problem management
Provide and implement improvements in proactive actions: extend monitoring, tune alerting and alert thresholds, increase observability of the services and log management

Documentation: Create documentation tuned for the intended audience, including runbooks, Knowledge Base articles, how-to articles
Communication: Communicate with different stakeholders and vendors on a technical level, and translate the impact of technical issues and concepts for non-technical users to support impact assessment
Increase observability and manageability by:
- Building and configuring logging, monitoring, and alerting
- Providing information about what needs to be monitored, how, and the recommended thresholds

Participate in tuning and extending the monitoring implementation
Provide the mechanisms and preparation for possible system failures and outages and increase the robustness of the system
Participate in performance and capacity planning
Standby/on-call roster participation

Job requirements

In order to succeed in this role you must have:

Strong AWS knowledge and experience
At least 5 years of experience in running distributed production workloads on a variety of technical stacks, with the ability to dive deep into complex problems
Proven experience in using software tools to automate IT operations tasks, including production system management, change management, and application monitoring. For example:
- Knowledge of maintaining continuous integration (CI) and continuous deployment/delivery (CD) systems for complex, distributed applications, using tools like GitLab and Jenkins
- Experience automating all aspects of deployment with CI/CD pipelines and Infrastructure as Code (IaC)
Proven ability to triage problems quickly, assess a problem’s impact and severity, and provide an appropriate response. Ability to provide workarounds that keep the system running without neglecting root cause troubleshooting
Good working knowledge of ITIL processes and procedures (e.g. incident, problem, emergency change)
Broad technology experience with automation software
Fluent in English (written and spoken)

Bonus points for:

Experience in software development
Containers and container orchestration experience
Terraform, Ansible
Experience with Couchbase, Keycloak, Nginx, MySQL/MariaDB, PostgreSQL, ActiveMQ, ELK Stack
Good sense of humor

Sounds good?

Upload your motivation letter and CV in English via the "Apply" button. You will hear back from us within the blink of an eye.
