AVP/VP, Production Engineer

Date: Jul 8, 2026

Location: Singapore

Office Location: One@Changi City, Singapore

As a Production Support Engineer based in our Singapore office at SMBC, you will be a key technical resource responsible for maintaining the stability, reliability, and performance of our next-generation digital banking platform in Asia Pacific. Working under the guidance of the Production Engineer Lead, this role is heavily focused on hands-on technical troubleshooting, cloud-native app diagnostics, and critical issue resolution for our enterprise-scale web and mobile ecosystems.

You will be based in our regional hub, collaborating daily with production support colleagues in Chennai and across the wider APAC footprint to maintain operational stability. As part of the first line of defence for our production ecosystem, you will provide 24/7 on-call support to manage live incidents, isolate multi-tier interface failures, and deliver rapid technical workarounds.

About the Opportunity

Hands-On Triage Mastery: Gain extensive experience dissecting real hybrid and cloud-native software application failures in a live banking environment.
UI & API Triage Focus: Develop deep diagnostic expertise isolating complex issues across web/mobile user interfaces and microservice API transaction pathways.
Production-First Impact: Directly safeguard customer banking journeys by minimizing downtime through active participation in 24/7 on-call support frameworks.
Cross-Regional Integration: Operate within an interconnected engineering group, partnering closely with our technology offices in Singapore, Chennai, and the wider APAC region.
Collaborative Engineering: Actively bridge gaps between operations and engineering by collaborating with development squads to roll out permanent bug fixes.

Key Responsibilities

Technical Troubleshooting & Issue Resolution

Live Incident Handling: Actively respond to automated alerts and regional escalations, initiating immediate technical troubleshooting across web user interfaces, mobile clients, and backend services.
UI & API Dissection: Perform technical triage on user interface breakdowns (web/mobile layout regressions, network delays) and API connection faults (payload errors, status anomalies, broken gateways).
Rapid Restoration: Diagnose system behaviors under pressure to isolate runtime exceptions and apply safe, verified technical workarounds to restore critical banking pathways.
24/7 Support Readiness: Actively participate in rotating 24/7 on-call support cycles to manage critical system failures and guarantee reliable service coverage outside standard regional business hours.
RCA Contribution: Participate in post-incident reviews and compile technical details for Root Cause Analysis (RCA) reports, tracking underlying software or infrastructure defects.

Cloud & Database Triage Operations

Cloud Application Diagnostics: Isolate and resolve system exceptions nested inside cloud-native app infrastructures, evaluating cloud instance availability, tracing connection states, and examining distributed microservice transactions.
Data Layer Queries: Run queries against production relational (SQL) and NoSQL databases to inspect data anomalies, check table states, and trace transaction failures.
Telemetry & Log Parsing: Utilize enterprise monitoring tools to isolate production anomalies, build customized queries, correlate distributed transactions, and systematically parse runtime exceptions.

Team & Developer Collaboration

Cross-Hub Handovers: Coordinate daily triage efforts and execute structured technical incident handovers with support colleagues based in Chennai, Singapore, and across other APAC offices to maintain operational continuity.
Developer Handshake: Work closely with core development and DevOps engineering teams to escalate complex software bugs, provide clean stack traces, and test permanent hotfixes in lower environments.
Knowledge Base Ownership: Document newly discovered failure modes and their respective solutions to constantly build and enrich the team's shared operational knowledge base.

Security, Risk & Compliance

Compliant Triage: Perform all production troubleshooting and incident remedies in strict accordance with banking security protocols, least-privilege standards, and relevant regional regulatory mandates (such as MAS guidelines for Singapore or local security standards).
Change Governance: Ensure emergency fixes receive the required ITIL emergency change management approvals before deployment.

Required Qualifications

Technical Expertise

Must demonstrate high proficiency in at least four of the following areas:

UI & API Troubleshooting Exposure: Practical experience debugging frontend/user interfaces (web/mobile browser tools, inspect element, network waterfall graphs) and testing/analyzing backend APIs (Postman, cURL, JSON payload analysis, HTTP status code validation).
Cloud Application Troubleshooting: Strong operational knowledge of isolating faults inside cloud-hosted applications (specifically Azure or AWS environments), analyzing distributed application traces, and reviewing cloud platform diagnostic metrics.
Monitoring & Observability Platforms: Hands-on experience navigating and troubleshooting with enterprise logging, monitoring, and alerting suites (specifically Splunk, Azure Log Analytics, Azure Monitor, and Grafana) to isolate transaction errors, interpret threshold alerts, and visualize platform telemetry.
SQL & Database Inspection: Strong practical ability to write and run SQL queries across enterprise databases (Oracle, SQL Server, or PostgreSQL) to analyze lock behaviors, trace data states, and triage errors.
Shell Scripting (Bash/Unix): Proficiency working inside Linux/Unix environments, utilizing command-line utilities to manipulate files, grep production logs, and execute standard triage scripts.
Multi-Tier Application Knowledge: Clear conceptual understanding of how web/mobile applications, APIs, and microservices interact, allowing for efficient isolation of failure points.
Code Diagnostic Familiarity: Basic ability to read Python or Java code in order to interpret stack traces and error logs during deep-dive investigations.
Containerization Exposure: Basic operational awareness of container states (Docker/Kubernetes) to navigate environments and pull container-level logs during incidents.
CI/CD & Version Control Awareness: Familiarity with Git branching and deployment pipelines (GitLab CI, Azure DevOps, or Jenkins) to understand deployment schedules and recognize build-related regressions.

Experience & Education

5+ years of hands-on experience in Production Support, Application Support, Systems Engineering, or IT Operations.
Proven experience actively participating in 24/7 on-call support frameworks to manage critical runtime incidents.
Experience working in a distributed or cross-regional team structure, collaborating with team members across different technical hubs.
Experience operating within a financial institution, banking environment, or a highly regulated, high-transaction fintech ecosystem.
Education: Bachelor’s degree in Computer Science, Information Technology, or a related technical field (or equivalent practical experience).

Professional Qualities

Analytical Problem Solver: Approaches complex system behavior with a structured, step-by-step diagnostic mentality.
Resilient Under Pressure: Maintains strict focus and objective decision-making during high-severity production incidents with tight recovery windows.
Strong Communicator: Able to explain technical log failures clearly to developers, and translate technical states into clear, straightforward incident updates for managers.
Cross-Border Collaborator: Enjoys pairing with team members across different geographical hubs (e.g., Chennai, Singapore, and other APAC offices) to solve multi-tier problems and pass on clean, well-documented issue handovers.
Accountable & On-Call Ready: Highly responsible, punctual, and comfortable taking ownership of critical technical issues within an extended 24/7 support framework.
