Fault Management
The process of finding, isolating and troubleshooting network faults as quickly as possible. Resolving faults rapidly minimises downtime and helps prevent device failures, thereby ensuring optimal network availability and preventing business losses. It includes monitoring the network from the Network Operations Centre (NOC) and undertaking configuration changes, upgrades and node backup activities.
Proficiency Level
Level 1 (Follow)
- Ensure continuous monitoring of network alarms on the Network Monitoring System (NMS).
- Ensure monitoring of threshold levels to prevent occurrence of faults.
- Ensure tickets are raised for all alarms as per the priority matrix.
- Coordinate with the Infra NOC to verify whether an alarm was caused by a fault at a passive infrastructure site.
- Follow agreed procedures to identify, register and categorise incidents.
- Gather information to enable incident resolution and allocate incidents as appropriate.
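The Level 1 duties of watching thresholds and raising tickets per the priority matrix can be sketched in a few lines of Python. This is an illustration only: the matrix values, the 90% early-warning margin, and all names (`PRIORITY_MATRIX`, `Alarm`, `raise_ticket`, `threshold_warnings`) are hypothetical and not taken from any particular NMS or ticketing tool.

```python
from dataclasses import dataclass

# Hypothetical priority matrix: (alarm severity, service impact) -> ticket priority.
PRIORITY_MATRIX = {
    ("critical", "service_affecting"): "P1",
    ("critical", "non_service_affecting"): "P2",
    ("major", "service_affecting"): "P2",
    ("major", "non_service_affecting"): "P3",
    ("minor", "service_affecting"): "P3",
    ("minor", "non_service_affecting"): "P4",
}


@dataclass
class Alarm:
    node: str
    severity: str
    impact: str


def raise_ticket(alarm: Alarm) -> dict:
    """Build a ticket record; unknown combinations fall back to the lowest priority."""
    priority = PRIORITY_MATRIX.get((alarm.severity, alarm.impact), "P4")
    return {"node": alarm.node, "priority": priority, "status": "open"}


def threshold_warnings(readings: dict, limits: dict, margin: float = 0.9) -> list:
    """Return the metrics that have reached `margin` of their limit (early warning)."""
    return sorted(
        name
        for name, value in readings.items()
        if name in limits and value >= margin * limits[name]
    )
```

For example, `raise_ticket(Alarm("BTS-01", "critical", "service_affecting"))` yields a P1 ticket under the assumed matrix, and a CPU reading of 85 against a limit of 90 is flagged before the limit itself is crossed.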
Level 2 (Assist)
- Provide first line investigation and gather information to enable incident resolution and allocate incidents.
- Determine alarm severity, priority, Service Level Agreements (SLAs) and the affected network elements.
- Conduct diagnosis from the NOC to identify the root cause of the fault.
- Isolate the cause of the fault by conducting appropriate diagnostic tests, such as remotely interrogating the active equipment.
- Determine the options to rectify the fault and confirm with supervisors if required.
- Advise relevant persons of actions taken.
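One simple heuristic for isolating a fault and determining the affected network elements is to correlate alarms against the network topology: an alarming node whose upstream parent is not alarming is a likely root cause, and everything downstream of it is potentially affected. The sketch below assumes a tree-shaped topology held in a hypothetical `parent_of` mapping (child to upstream parent); real networks and NMS correlation engines are more involved.

```python
def probable_root_causes(alarming: set, parent_of: dict) -> list:
    """Alarming nodes whose upstream parent is not itself alarming."""
    return sorted(n for n in alarming if parent_of.get(n) not in alarming)


def affected_elements(root: str, parent_of: dict) -> list:
    """All nodes downstream of `root` in the (assumed tree-shaped) topology."""
    # Invert the child -> parent map into parent -> children.
    children = {}
    for child, parent in parent_of.items():
        children.setdefault(parent, []).append(child)

    affected, stack = [], [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            affected.append(child)
            stack.append(child)
    return sorted(affected)
```

With `parent_of = {"site-a": "agg-1", "site-b": "agg-1", "agg-1": "core-1"}` and alarms on `agg-1`, `site-a` and `site-b`, the heuristic points at `agg-1` as the probable root cause, with both sites listed as affected.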
Level 3 (Apply)
- Maintain network uptime by coordinating with the field team.
- Direct and coordinate with the field team to carry out corrective/change activities on site.
- Ensure clear and concise instructions are given to field staff to facilitate fault rectification efforts.
- Ensure network problems and faults are rectified within the alarm SLAs, and monitor the activities performed by the Infra engineers and technicians.
- Upgrade configurations and perform backups.
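Tracking rectification against the alarm SLAs amounts to comparing elapsed time with a per-priority deadline. The following is a minimal sketch, assuming hypothetical SLA durations (`SLA_HOURS`) and an arbitrary 75% "at risk" warning point; actual values would come from the agreed SLAs.

```python
from datetime import datetime, timedelta

# Hypothetical resolution targets per ticket priority, in hours.
SLA_HOURS = {"P1": 4, "P2": 8, "P3": 24, "P4": 72}


def sla_deadline(raised_at: datetime, priority: str) -> datetime:
    """Time by which the fault must be rectified."""
    return raised_at + timedelta(hours=SLA_HOURS[priority])


def sla_status(raised_at: datetime, priority: str, now: datetime,
               warn_fraction: float = 0.75) -> str:
    """Classify a ticket as 'on_track', 'at_risk' or 'breached'."""
    deadline = sla_deadline(raised_at, priority)
    if now >= deadline:
        return "breached"
    # Dividing two timedeltas gives the fraction of the SLA window already used.
    used = (now - raised_at) / (deadline - raised_at)
    return "at_risk" if used >= warn_fraction else "on_track"
```

A status of "at_risk" is a natural trigger for the NOC to chase the field team before the SLA is actually breached.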
Level 4 (Ensure)
- Ensure that fault incidents are handled according to agreed procedures.
- Prioritise and diagnose incidents. Investigate causes of incidents and seek resolution. Escalate unresolved incidents.
- Facilitate recovery following resolution of incidents. Document and close resolved incidents.
- Contribute to testing and improving incident management procedures.
- Ensure the Standard Operating Procedures (SOPs) are updated periodically so that repeat faults are corrected promptly.
Level 5 (Strategise)
- Analyse performance reports and identify sites with deteriorating performance.
- Develop, maintain and test incident management procedures in agreement with service owners.
- Investigate escalated, non-routine and high-impact incidents with the responsible service owners and seek resolution.
- Facilitate recovery, following resolution of incidents. Ensure that resolved incidents are properly documented and closed.
- Analyse the causes of incidents and inform service owners, in order to minimise the probability of recurrence and contribute to service improvement. Analyse metrics and reports on the performance of the incident management process.
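Two metrics commonly used when analysing incident management performance are the mean time to restore and the number of repeat faults per site and cause. The sketch below shows one way to compute them, assuming each incident is a dictionary with hypothetical `site`, `cause`, `raised` and `resolved` fields; the field names and the repeat threshold are illustrative only.

```python
from collections import Counter
from datetime import datetime


def mean_time_to_restore_hours(incidents: list) -> float:
    """Average time from ticket raised to resolved, in hours (0.0 if no incidents)."""
    durations = [
        (i["resolved"] - i["raised"]).total_seconds() / 3600 for i in incidents
    ]
    return sum(durations) / len(durations) if durations else 0.0


def repeat_faults(incidents: list, min_count: int = 2) -> dict:
    """(site, cause) pairs that occurred at least `min_count` times."""
    counts = Counter((i["site"], i["cause"]) for i in incidents)
    return {key: n for key, n in counts.items() if n >= min_count}
```

Pairs returned by `repeat_faults` are candidates for root cause review with the service owner and for updates to the relevant procedures.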