IT & Technology Continuity Plan

1.0 Objectives

1.1 Summary Statement

This document outlines the comprehensive strategy, procedures, and responsibilities for ensuring the continuity of critical IT services in the event of a significant disruption. The plan is designed to be a proactive and actionable guide for the IT team to minimize operational downtime, maintain data integrity, and facilitate a timely and orderly restoration of all essential technological services that support the business.

1.2 Problem Statement

The organization is heavily reliant on its IT infrastructure for all core business functions. Any significant disruption—ranging from hardware failure and cyberattacks to natural disasters—poses a direct threat to operational continuity, leading to potential financial loss, reputational damage, and an inability to serve customers. Without a formal, tested continuity plan, the response to such incidents would be reactive, disorganized, and inefficient, magnifying the impact of the disruption.

1.3 Key Result Areas & S.M.A.R.T. Goals

1.3.1 KRA 1: Service Availability
- Goal: Achieve a Recovery Time Objective (RTO) of less than 4 hours for High Priority systems (ERPNext, NextCloud) following a declared disaster.
- Deliverable: A fully restored and functional ERPNext instance at the designated DR site, validated within 4 hours during the quarterly drill.
1.3.2 KRA 2: Data Integrity
- Goal: Ensure a Recovery Point Objective (RPO) of less than 1 hour for ERPNext and less than 2 hours for NextCloud.
- Deliverable: Implement and verify automated backups that run every hour for ERPNext and every two hours for NextCloud, with daily reports confirming success.
1.3.3 KRA 3: Infrastructure Resilience
- Goal: Ensure 100% of critical systems are protected from common power and environmental failures.
- Deliverable: All critical servers, network hardware, and storage are connected to a tested Uninterruptible Power Supply (UPS). Documented graceful shutdown procedures are tested semi-annually.
1.3.4 KRA 4: Plan Readiness
- Goal: Validate the effectiveness of the DR plan and team preparedness on a consistent basis.
- Deliverable: Conduct and document quarterly restoration drills for at least one critical system, and full-team tabletop exercises semi-annually.

1.4 Background

This plan is established as a core component of the organization's overall Business Continuity strategy. It recognizes that IT is a foundational pillar of business operations and that its resilience is paramount. This document formalizes previously informal processes, introduces auditable procedures, and assigns clear responsibilities to ensure a coordinated response to any IT incident.

2.0 Scope

2.1 Systems Covered: This plan applies to all critical IT infrastructure, including virtualized environments (Docker), network hardware (PFSense firewalls, switches), and data storage (Synology, TrueNAS).
2.2 Locations Covered: Makati Office, Tech Center (TC), and Cabuyao Manufacturing Plant.

3.0 Definitions

BCP (Business Continuity Plan): The overall organizational plan for maintaining business functions.
DR (Disaster Recovery): The subset of BCP focused on restoring IT infrastructure and operations.
RTO (Recovery Time Objective): The maximum tolerable duration of an outage for a specific system.
RPO (Recovery Point Objective): The maximum acceptable amount of data loss in an outage, expressed as the age of the most recent recoverable backup.
Docker: A containerization platform used to package applications and their dependencies, enabling portability and rapid deployment.

4.0 References

4.1 Organizational Process Assets

171023 CSC Basic Documentation Methodology
180818 IMS-01 MS DESCRIPTION (C).pdf

5.0 Responsible Parties and Roles

IT Super Admin: Overall authority for plan activation, critical system changes, and privilege delegation during an incident.
IT Admin Team: Responsible for executing recovery procedures, monitoring system status, and validating data integrity post-recovery.
Site IT Personnel: First responders for on-site issues, responsible for local hardware management and assisting the central IT Admin team.

6.0 IT Continuity Processes

6.1 System Monitoring

Proactive monitoring is the first line of defense, enabling the IT team to identify and address potential issues before they escalate into major incidents.

6.1.1 Monitoring Scope:

Hardware Health: Monitor CPU temperature, disk health (S.M.A.R.T.), memory usage, and power supply status on all physical servers and NAS devices (Synology, TrueNAS).
Network Performance: Track bandwidth utilization, latency, and packet loss on firewalls (PFSense), switches, and key network links. Monitor VPN tunnel status (Wireguard).
Docker Container Health: Use tools like Portainer to monitor the status (up/down), resource consumption (CPU/RAM), and logs of all critical containers.
Application Performance: Implement basic checks to ensure key applications (ERPNext, NextCloud, WordPress) are responsive.
Backup Job Status: Monitor backup logs daily to confirm successful completion, check for errors, and verify data transfer volumes.
Security Logs: Centralize and review logs from firewalls, servers, and key applications for unusual or malicious activity.
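
The application performance check above could be sketched as follows. This is a minimal illustration using only the Python standard library, not the team's actual tooling. The 3-second latency threshold and the three status labels are assumptions, since the plan does not define what "responsive" means.

```python
import time
import urllib.error
import urllib.request

# Hypothetical threshold; the plan does not specify one.
MAX_LATENCY_SECONDS = 3.0


def classify(status_code, elapsed_seconds, max_latency=MAX_LATENCY_SECONDS):
    """Map one probe result to 'ok', 'slow', or 'down'."""
    # No answer at all, or a server-side error, counts as down.
    if status_code is None or status_code >= 500:
        return "down"
    # The server answered (including 3xx/4xx such as a login redirect)
    # but took longer than the allowed latency.
    if elapsed_seconds > max_latency:
        return "slow"
    return "ok"


def probe(url, timeout=10.0):
    """Fetch url once and return (status_code or None, elapsed_seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        status = err.code  # the server answered with an error status
    except (urllib.error.URLError, OSError):
        status = None  # DNS failure, connection refused, or timeout
    return status, time.monotonic() - start
```

A scheduled job could call probe() against the ERPNext, NextCloud, and WordPress addresses and pass each result to classify(), raising an alert on anything other than "ok".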

6.1.2 Monitoring Tools:

Portainer: For real-time monitoring and management of all Docker containers.
PFSense Dashboard: For network traffic, gateway status, and VPN monitoring.
Synology/TrueNAS UI: For storage pool health, disk status, and hardware alerts.
Custom Scripts/Alerts: Implement scripts to send email or messaging alerts for critical events, such as backup failures or high resource utilization.
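
A custom alert script of the kind described above might look like the sketch below, which checks disk utilization and prepares an email alert. The 90% threshold, the mail relay host, and both email addresses are placeholders chosen for illustration; none of them come from this plan.

```python
import shutil
import smtplib
from email.message import EmailMessage

# Hypothetical values; replace with the organization's real settings.
DISK_ALERT_THRESHOLD = 0.90
SMTP_HOST = "mail.example.internal"
ALERT_FROM = "monitor@example.internal"
ALERT_TO = "it-admin@example.internal"


def disk_used_fraction(path):
    """Return the used fraction (0.0 to 1.0) of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def build_alert(path, used_fraction, threshold=DISK_ALERT_THRESHOLD):
    """Return an EmailMessage if usage reaches the threshold, else None."""
    if used_fraction < threshold:
        return None
    message = EmailMessage()
    message["Subject"] = f"[ALERT] Disk usage {used_fraction:.0%} on {path}"
    message["From"] = ALERT_FROM
    message["To"] = ALERT_TO
    message.set_content(
        f"Disk usage on {path} is {used_fraction:.1%}, "
        f"at or above the {threshold:.0%} alert threshold."
    )
    return message


def send(message):
    """Deliver the alert through the (assumed) internal mail relay."""
    with smtplib.SMTP(SMTP_HOST) as smtp:
        smtp.send_message(message)
```

Run from a scheduler, the script would call build_alert(path, disk_used_fraction(path)) for each monitored path and pass any non-empty result to send(). The same pattern could be reused for backup failures by swapping the measurement function.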

6.2 Backup and Restoration Policy

A multi-layered backup and restoration strategy is crucial for data protection and system recovery.

6.2.1 Policy Regarding Backups and Restores

Inventory of Systems: All critical systems will be inventoried with their designated backup schedule and recovery priority. Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO) should be defined for each.
Critical Systems Inventory (Virtual Machines - Docker Containers):
- ERPNext (Dr): High Priority. RTO: < 4 hours. RPO: < 1 hour.
- NextCloud (Dr): High Priority. RTO: < 4 hours. RPO: < 2 hours.
- WordPress (Dr): Medium Priority. RTO: < 8 hours. RPO: < 24 hours.
- NGINX Proxy Manager (Dr): High Priority. RTO: < 2 hours. RPO: < 24 hours.
- Wireguard GUI (Dr): High Priority. RTO: < 2 hours. RPO: < 24 hours.
- Portainer (Dr): High Priority. RTO: < 2 hours. RPO: < 24 hours.
- SYNX (Synology Drive Sync) (Dr): Configuration backup. RTO: < 4 hours. RPO: < 24 hours.
- Cicada (Dr): (Define Priority). RTO: (Define). RPO: (Define).
- Synopsis (Dr): (Define Priority). RTO: (Define). RPO: (Define).
Restoration Tests: Full restoration drills for at least one critical system will be conducted quarterly. Individual file/data restoration tests will be conducted monthly to validate backup integrity. All test results will be documented.
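
To make the inventory auditable, the RTO and RPO targets could be held as data and checked automatically. The sketch below is one possible shape for such a check, not an existing tool. Only the first three inventory rows are included to keep it short, and the function names and the idea of feeding in "last successful backup" timestamps are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class SystemTarget:
    name: str
    priority: str
    rto: timedelta  # restoration must take less than this
    rpo: timedelta  # newest backup must be younger than this


# Values copied from the inventory above (first three rows only).
INVENTORY = [
    SystemTarget("ERPNext", "High", timedelta(hours=4), timedelta(hours=1)),
    SystemTarget("NextCloud", "High", timedelta(hours=4), timedelta(hours=2)),
    SystemTarget("WordPress", "Medium", timedelta(hours=8), timedelta(hours=24)),
]


def rpo_violations(last_backup, now):
    """Return names of systems whose newest backup is missing or too old.

    last_backup maps system name -> datetime of last successful backup.
    """
    violations = []
    for target in INVENTORY:
        backup_time = last_backup.get(target.name)
        if backup_time is None or now - backup_time >= target.rpo:
            violations.append(target.name)
    return violations


def rto_met(target, outage_start, service_restored):
    """Return True if a drill or real recovery finished within the RTO."""
    return service_restored - outage_start < target.rto
```

The daily backup report could include the output of rpo_violations(), and each quarterly drill could record the result of rto_met() alongside the measured restoration time.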

6.2.2 3-2-1 Backup Policy

Three Copies of Data: We will maintain the primary data and at least two additional backups.
Two Different Media: Backups will be stored on physically separate systems (e.g., primary server to a TrueNAS/Synology unit).
One Off-site Copy: A complete backup copy will be maintained at a different geographical location to protect against site-wide disasters.
- Makati Data: Primary on local Synology, with an off-site copy synced to the Tech Center.
- Tech Center Data: Primary on local servers, with an off-site copy synced to the Cabuyao Plant.
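
The three rules can be verified mechanically against a list of where each copy lives. The following is a minimal sketch under stated assumptions: the BackupCopy record and check_321 function are hypothetical names, and "two different media" is interpreted here, as in the policy text, as two physically separate systems.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    system: str             # device that holds this copy
    site: str               # physical location of that device
    is_primary: bool = False


def check_321(copies):
    """Evaluate one data set's copies against the 3-2-1 rules."""
    primary_sites = {c.site for c in copies if c.is_primary}
    backups = [c for c in copies if not c.is_primary]
    return {
        # Primary data plus at least two backups.
        "three_copies": len(copies) >= 3,
        # Copies spread over at least two separate systems.
        "two_systems": len({c.system for c in copies}) >= 2,
        # At least one backup stored away from every primary site.
        "one_offsite": bool(primary_sites)
        and any(c.site not in primary_sites for c in backups),
    }
```

For example, a layout with only the primary on the Makati Synology and one synced copy at the Tech Center would pass the off-site and separate-system rules but fail the three-copies rule, which signals that a second backup target still needs to be named.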

6.2.3 Virtualization & Containerization (Docker) Policy

Strategy: The use of Docker containers simplifies disaster recovery by ensuring application environment consistency. Recovery focuses on restoring persistent data and re-deploying the container configuration.
- Backup Process:
  - Persistent Data: All Docker containers MUST use mounted volumes for persistent data. These volumes will be included in the host machine's regular backup schedule.
  - Configuration: Docker Compose (docker-compose.yml) files for all application stacks will be stored in a version-controlled repository (e.g., a local Git server) which is also backed up.
- Recovery Process: To restore a service, the IT team will:
  - Restore the persistent data volume from backup to a new host.
  - Pull the corresponding docker-compose.yml file.
  - Run docker-compose up -d to recreate the application stack. This allows for rapid and consistent redeployment on any machine with Docker installed.
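
The three recovery steps above could be scripted roughly as follows. This is a sketch under several assumptions that the plan does not state: the volume backup is a gzip-compressed tar archive, the Compose files are fetched with git clone onto the new host, and all paths and the repository address are placeholders.

```python
import subprocess
from pathlib import PurePosixPath


def recovery_commands(volume_archive, data_dir, compose_repo, stack_dir):
    """Build the ordered shell commands for restoring one application stack."""
    compose_file = PurePosixPath(stack_dir) / "docker-compose.yml"
    return [
        # Step 1: restore the persistent data volume onto the new host.
        ["mkdir", "-p", str(data_dir)],
        ["tar", "-xzf", str(volume_archive), "-C", str(data_dir)],
        # Step 2: fetch the version-controlled Compose configuration.
        ["git", "clone", compose_repo, str(stack_dir)],
        # Step 3: recreate the application stack in the background.
        ["docker-compose", "-f", str(compose_file), "up", "-d"],
    ]


def run_recovery(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            # check=True stops the sequence at the first failing step.
            subprocess.run(command, check=True)
```

Keeping dry_run as the default lets the team review the exact commands during a tabletop exercise before running them in a restoration drill. On hosts that use Docker Compose V2, the last command would be invoked as docker compose rather than docker-compose.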

6.3 Risk Management

A structured approach to identifying, assessing, and mitigating risks to IT operations.

6.3.1 Risk Identification: The IT team will hold an annual workshop to identify potential risks across categories: technical (e.g., hardware failure), operational (e.g., human error), and environmental (e.g., typhoon, power outage).
6.3.2 Risk Analysis & Evaluation: Each identified risk will be evaluated based on its likelihood and potential impact on business operations. The results of this evaluation will be used to prioritize mitigation efforts.
6.3.3 Risk Treatment: For each significant risk, a mitigation strategy will be chosen:
- Accept: For low-impact/low-likelihood risks.
- Mitigate: Implement controls to reduce the likelihood or impact (e.g., redundant hardware, UPS).
- Transfer: Shift the risk to a third party (e.g., insurance, cloud services).
- Avoid: Change processes to eliminate the risk entirely.
6.3.4 Monitoring & Review: The risk register will be reviewed and updated quarterly or after any significant incident.
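
One common way to combine likelihood and impact is a simple scoring matrix. The sketch below assumes a 1 to 5 scale for each factor and a score equal to their product. The scale, the acceptance cutoff, and the sample ratings are illustrative assumptions and are not defined by this plan.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    name: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (almost certain)
    impact: int      # assumed scale: 1 (negligible) to 5 (severe)

    @property
    def score(self):
        """Likelihood multiplied by impact, from 1 to 25."""
        return self.likelihood * self.impact


def prioritize(risks):
    """Return risks ordered from highest to lowest score."""
    return sorted(risks, key=lambda risk: risk.score, reverse=True)


def suggested_treatment(risk, accept_below=5):
    """Suggest 'Accept' for low scores; otherwise flag for active treatment."""
    if risk.score < accept_below:
        return "Accept"
    return "Mitigate, Transfer, or Avoid"


# Hypothetical ratings for risks named in 6.3.1.
REGISTER = [
    Risk("Hardware failure", likelihood=3, impact=4),
    Risk("Typhoon", likelihood=2, impact=5),
    Risk("Power outage", likelihood=4, impact=4),
]
```

Calling prioritize(REGISTER) gives the order in which mitigation efforts would be reviewed. The final choice among Mitigate, Transfer, and Avoid remains a judgment for the IT team.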

6.4 Disaster Recovery (DR) Scenarios

6.4.1 Scenario: Manpower Disruption & Function Redundancy
- Description: Key IT personnel are unavailable due to illness, resignation, or other emergencies.
- Mitigation & Response:
  - Documentation: All system configurations, procedures, and network diagrams are to be kept up to date in a central repository (e.g., NextCloud).
  - Cross-Training: At least two team members must be trained on the recovery procedures for critical systems (ERPNext, NextCloud, Core Networking).
  - Password Management: Critical system credentials will be stored in a secure, shared password manager accessible to authorized IT personnel.
  - Succession Plan: A clear succession plan for the IT Super Admin role will be documented.
6.4.2 Scenario: Malware and Security Breach
- Description: A ransomware attack or other security breach compromises servers and data.
- Mitigation & Response (Incident Response Plan):
  - Isolate: Immediately disconnect the affected systems from the network to prevent further spread.
  - Investigate: Determine the entry point and scope of the breach without compromising evidence.
  - Eradicate: Remove the malware and patch the vulnerability.
  - Recover: If systems are unrecoverable, perform a bare-metal restore: wipe the affected systems, reinstall the OS, and restore configurations and data from a clean, verified backup taken before the breach.
  - Post-Mortem: Document the incident and implement changes to prevent recurrence.
6.4.3 Scenario: Branch Disruption
- Description: A primary site (e.g., Makati Office) becomes completely inaccessible due to fire, natural disaster, or other major event.
- Mitigation & Response:
  - Activation: The IT Super Admin declares a disaster and activates the DR plan.
  - Failover: Operations will fail over to the designated DR site (e.g., Tech Center for Makati).
  - System Recovery: The IT Admin Team will begin restoring critical systems at the DR site using the off-site backups. The Docker recovery process (6.2.3) will be initiated for containerized applications.
  - Network Rerouting: DNS records will be updated to point to the services running at the DR site.
  - Communication: All employees will be notified of the situation and provided with new access instructions (e.g., updated VPN details).
6.4.4 Scenario: ISP Telecom Outages
- Description: The primary internet connection at a key site fails.
- Mitigation & Response:
  - Redundancy: Maintain a secondary, backup internet connection from a different ISP at the Tech Center and Cabuyao Plant.
  - Automatic Failover: The PFSense firewall will be configured to automatically fail over to the secondary ISP if the primary connection is lost.
  - VPN Stability: The Wireguard VPN will be configured to function over either connection, ensuring remote and inter-branch connectivity is maintained.
  - Communication: If both connections fail, use mobile data hotspots for essential communication and coordination.
6.4.5 Scenario: Power and Water Interruptions
- Description: Short-term or long-term power outages, or water damage to the server room.
- Mitigation & Response:
  - UPS: All servers, network equipment, and NAS devices are connected to an Uninterruptible Power Supply (UPS) to allow for graceful shutdown during short outages.
  - Generator: For sites with a generator, procedures for starting and switching over will be documented and tested.
  - Graceful Shutdown: If a prolonged outage is expected and no generator is available, a documented shutdown sequence will be initiated to prevent data corruption.
  - Environmental Monitoring: Implement temperature and humidity sensors in server rooms to alert for HVAC failures or water leaks.
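
The graceful shutdown decision depends on whether the remaining UPS runtime still covers the full shutdown sequence. The sketch below shows that arithmetic. The shutdown order, the per-step durations, and the 5-minute safety margin are all assumptions for illustration and must be replaced with values measured during the semi-annual tests.

```python
# Hypothetical sequence: services that depend on others stop first,
# storage stops after its clients, and network hardware stops last.
SHUTDOWN_SEQUENCE = [
    ("Application containers", 3.0),  # minutes
    ("Docker hosts", 4.0),
    ("NAS devices", 5.0),
    ("Network hardware", 1.0),
]


def total_shutdown_minutes(sequence):
    """Sum the estimated duration of every step in the sequence."""
    return sum(minutes for _, minutes in sequence)


def must_start_shutdown(runtime_remaining_min, shutdown_duration_min,
                        safety_margin_min=5.0):
    """Return True once remaining UPS runtime no longer exceeds the time
    needed to shut down plus a safety margin."""
    return runtime_remaining_min <= shutdown_duration_min + safety_margin_min
```

With these sample figures the sequence needs 13 minutes, so the shutdown would be triggered once the UPS reports 18 minutes or less of runtime.
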
6.4.6 Scenario: Limited On-site Access / Remote Work Mandate
- Description: Access to physical offices is restricted due to health crises, civil unrest, or other external factors, forcing all work to be done remotely.
- Mitigation & Response:
  - VPN Capacity: Ensure the Wireguard VPN can handle the entire workforce connecting simultaneously. Monitor bandwidth and server performance.
  - Remote Access: Confirm all critical applications (ERPNext, NextCloud) are accessible and performant over the VPN.
  - Endpoint Security: Enforce security policies on remote devices (antivirus, disk encryption, secure passwords).
  - Communication: Utilize cloud-based communication and collaboration tools to maintain operational effectiveness.
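
A first estimate of VPN capacity can be made by comparing the aggregate bandwidth the workforce would need with the usable uplink at the VPN site. The sketch below shows the calculation. The per-user bandwidth, the uplink speed, and the 25% headroom reserved for other traffic are hypothetical inputs, not measurements from this environment.

```python
def vpn_capacity(users, per_user_mbps, uplink_mbps, headroom=0.25):
    """Compare required VPN bandwidth with the usable share of the uplink.

    headroom is the fraction of the uplink kept free for non-VPN traffic.
    """
    required = users * per_user_mbps
    usable = uplink_mbps * (1 - headroom)
    return {
        "required_mbps": required,
        "usable_mbps": usable,
        "sufficient": required <= usable,
    }
```

For example, 100 simultaneous users at an assumed 2 Mbps each need 200 Mbps, which fits within the 375 Mbps usable on a 500 Mbps uplink, while 300 users would not. Bandwidth is only one constraint, so server CPU and connection counts should still be monitored as stated above.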

7.0 Documentation

(This section will be detailed to outline the schedule and scope of BCP/DR testing.)

8.0 Plan Review & Improvement

(This section will be detailed to establish a formal process for reviewing and updating this plan annually or post-incident.)