Site Reliability Engineer

Hybrid
- Kuala Lumpur, Wilayah Persekutuan Kuala Lumpur, Malaysia
Engineering

Job description

Join us in providing work that matters

WCC has changed lives since 1996. We are a group of highly ambitious professionals who believe in the greater story. WCC is more than just a software organization; we are a community that strives to improve human life. We provide software that matters.

Our product is an advanced Search and Match engine used in solutions for the private and public sector.

We specialize in:

ID & Security Solutions - WCC enables governments to manage large volumes of Identity and Security data, protecting borders and citizens while providing legal identity for all.

Employment Solutions - WCC enables Public and Private Employment Services to match people quickly and expertly with suitable and sustainable jobs.

What’s in it For You?

Our team - Our people believe unity is one of our strengths. If teamwork is important to you, we trust you will enjoy working in a team where people feel welcome, valued, and respected.

Work environment - We focus on talent and possibilities, not limitations. We love challenges and exploring new creative horizons. WCC has a diverse environment that gives every person the freedom to express their ideas.

We want to give you the conditions to do your best work, so here are the Perks and Benefits we provide:

Competitive salary
Indefinite contract
Health insurance
Travel allowance
21 vacation days
13th salary
Personal development opportunities
Hybrid working-from-home / working-from-the-office policy
Home office budget
An opportunity to build an international and diverse network

Role

As a Site Reliability Engineer, you have a unique role in our organization, working at the intersection of software development, system administration, and IT operations, and drawing on your operations experience. You support our product owners and DevOps team in determining which new features can be launched and when, using service-level agreements (SLAs) to define the required reliability of the system through service-level indicators (SLIs) and service-level objectives (SLOs).

Responsibilities

Ensure the availability and efficient working of the services in compliance with the non-functional expectations
Plan and implement continuous improvements and changes in the ecosystem through automation
Handle service interruptions towards resolution within the defined SLAs with a mindset of continuous improvement

React to events (monitoring alerts, support escalations, internal incidents), i.e. incidents that hit the application or the underlying infrastructure. Troubleshoot and resolve the service interruption, either hands-on or by guiding a third party through incident resolution with clear instructions
Provide information for root cause analysis and/or conduct postmortems and provide reports
Provide recommendations/workarounds for identified problems
Liaise and act with others (vendors, internal teams) for incident and problem management
Propose and implement proactive improvements: extend monitoring, tune alerting and alert thresholds, and increase the observability of the services and log management

Documentation: Create documentation tuned for the intended audience, including runbooks, Knowledge Base articles, and how-to articles
Communication: Communicate with different stakeholders and vendors at a technical level. Translate the impact of technical issues and concepts for non-technical users to support impact assessment
Increase observability and manageability by:
Building and configuring logging, monitoring, and alerting
Providing information about what needs to be monitored, how, and the recommended thresholds

Participate in tuning and extending the monitoring implementation
Provide the mechanisms and preparation needed to handle possible system failures and outages, and increase the robustness of the system
Participate in performance and capacity planning
Standby/on-call roster participation

Job requirements

In order to succeed in this role you must have:

Strong AWS knowledge and experience
At least 5 years of experience in running distributed production loads across a variety of technical stacks, with the ability to dive deep into complex problems
Proven experience in using software tools to automate IT operations tasks, including production system management, change management, application monitoring, etc. For example:
- Maintaining continuous integration (CI) and continuous deployment/delivery (CD) systems for complex, distributed applications, using tools like GitLab, Jenkins, etc.
- Automating all aspects of deployment with CI/CD pipelines and Infrastructure as Code (IaC)
Proven ability to triage problems quickly, assess a problem's impact and severity, and provide an appropriate response. Ability to provide workarounds that keep the system running without neglecting root cause troubleshooting
Good working knowledge of ITIL processes and procedures (e.g. incident, problem, emergency change)
Broad technology experience with automation software
Fluent in English (written and spoken)

Bonus points for:

Experience in software development
Containers and container orchestration experience
Terraform, Ansible
Experience with Couchbase, Keycloak, Nginx, MySQL/MariaDB, PostgreSQL, ActiveMQ, ELK Stack
Good sense of humor

Sounds good?

Upload your motivation letter and CV in English via the "Apply" button. You will hear back from us within the blink of an eye.
