Site Reliability Engineer
Kuala Lumpur, Wilayah Persekutuan Kuala Lumpur, Malaysia
Engineering
Join us in providing work that matters
WCC has changed lives since 1996. We are a group of highly ambitious professionals who believe in the greater story. WCC is more than just a software organization: we are a community that strives to improve human life. We provide software that matters.
Our product is an advanced Search and Match engine used in solutions for the private and public sector.
We specialize in:
- ID & Security Solutions - WCC enables governments to manage large volumes of Identity and Security data, protecting borders and citizens while providing legal identity for all
- Employment Solutions - WCC enables Public and Private Employment Services to match people quickly and expertly with suitable and sustainable jobs
What’s in it for you?
Our team - Our people believe unity is one of our strengths. So, if teamwork is important to you, we trust you will enjoy working in a team where people feel welcome, valued, and respected.
Work environment - We focus on talent and possibilities, not limitations. We love challenges and exploring new creative horizons. WCC has a diverse environment that gives every person the freedom to express their ideas.
We want to give you the conditions to do your best work, so here are the Perks and Benefits we provide:
- Competitive salary
- Indefinite contract
- Health insurance
- Travel allowance
- 21 vacation days
- 13th salary
- Personal development opportunities
- Hybrid working policy (working from home / working from the office)
- Home office budget
- An opportunity to create an international and diverse network
As a Site Reliability Engineer, you have a unique role in our organization. You work at the intersection of software development, system administration, and IT operations. You support our product owners and DevOps team in determining which new features can be launched and when, using service-level agreements (SLAs) to define the required reliability of the system through service-level indicators (SLIs) and service-level objectives (SLOs). In this role, you will:
- Ensure the availability and efficient working of the services in compliance with the non-functional expectations
- Plan and implement continuous improvements and changes in the ecosystem through automation
- Drive service interruptions to resolution within the defined SLAs, with a mindset of continuous improvement
- React to events (monitoring alerts, support escalations, internal incidents), i.e. incidents that hit the application or the underlying infrastructure. Troubleshoot and resolve the service interruption, either hands-on or by guiding a third party through incident resolution with clear instructions
- Provide information for root cause analysis and/or conduct postmortems and provide reports
- Provide recommendations/workarounds for identified problems
- Liaise and act with others (vendors, internal teams) for incident and problem management
- Propose and implement proactive improvements: extend monitoring, tune alerting and alert thresholds, and increase observability of the services and log management
- Documentation: Create documentation tuned for the intended audience, including runbooks, Knowledge Base articles, how-to articles
- Communication: Communicate with different stakeholders and vendors on a technical level. Translate the impact of technical issues and concepts for non-technical users so they can assess it.
- Increase observability and manageability by:
  - Building and configuring logging, monitoring, and alerting
  - Providing information about what needs to be monitored, how, and the recommended thresholds
  - Participating in tuning and extending the monitoring implementation
- Provide the mechanisms and preparation for possible system failures and outages and increase the robustness of the system
- Participate in performance and capacity planning
- Standby/on-call roster participation
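To give a flavour of the SLI/SLO work mentioned above, here is a minimal sketch of how an availability SLI and an SLO can feed a launch decision through an error budget. It is an illustration only, not a description of WCC's tooling: the names `Window`, `availability_sli`, `error_budget_remaining` and `can_launch`, the 99.9% target, and the 25% minimum budget are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """Request counts observed over one SLO window (hypothetical shape)."""
    total_requests: int
    failed_requests: int


def availability_sli(w: Window) -> float:
    """SLI: fraction of requests that succeeded in the window."""
    if w.total_requests == 0:
        return 1.0  # no traffic, so nothing failed
    return 1.0 - w.failed_requests / w.total_requests


def error_budget_remaining(w: Window, slo_target: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if w.total_requests == 0:
        return 1.0  # budget untouched without traffic
    # The SLO allows this many failures in the window.
    allowed_failures = (1.0 - slo_target) * w.total_requests
    return 1.0 - w.failed_requests / allowed_failures


def can_launch(w: Window, slo_target: float, min_budget: float = 0.25) -> bool:
    """Assumed policy: launch only while enough error budget is left."""
    return error_budget_remaining(w, slo_target) >= min_budget


if __name__ == "__main__":
    window = Window(total_requests=1_000_000, failed_requests=250)
    print(f"SLI: {availability_sli(window):.5f}")
    print(f"Budget left: {error_budget_remaining(window, 0.999):.2%}")
    print(f"Launch allowed: {can_launch(window, 0.999)}")
```

In this sketch, a service with a 99.9% target that has used most of its allowed failures would have new launches held back until reliability recovers; the actual policy and thresholds are agreed with product owners.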
In order to succeed in this role, you must have:
- Strong AWS knowledge and experience
- At least 5 years of experience in running distributed production loads on a variety of technical stacks, with the ability to dive deep into complex problems
- Proven ability to triage problems quickly, assess their impact and severity, and provide an appropriate response. Ability to provide workarounds that keep the system working without neglecting root cause troubleshooting
- Good working knowledge of ITIL processes and procedures (e.g. incident, problem, emergency change)
- Broad technology experience with Automation Software
- Fluent in English (written and spoken)
Bonus points for:
- Experience in software development
- Containers and container orchestration experience
- Terraform, Ansible
- Experience with Couchbase, Keycloak, Nginx, MySQL/MariaDB, PostgreSQL, ActiveMQ, ELK Stack
- Good sense of humor
Upload your motivation letter and CV in English via the "Apply" button. You will hear back from us within the blink of an eye.