Building a disaster recovery (DR) microservices platform for Talentnet

1. About Talentnet

Formerly the Human Resources Services of PricewaterhouseCoopers Vietnam, Talentnet Vietnam (Talent Connection Company) is now regarded as a leading human resources consulting firm, with consultants who have over 20 years of experience in HR management. Talentnet provides comprehensive, professional, and practical HR management solutions tailored to the needs of businesses of all sizes.

2. Needs and Challenges for Talentnet's IT Team

To meet the diverse needs of its clients, Talentnet has developed a system that provides "Online HR Management" services. The system runs entirely on Docker and Kubernetes, today's leading container and microservices platforms.

For Talentnet's clients, the security and availability of HR management data, accessible in real time from anywhere, is a top priority when using the "Online HR Management" service. This puts pressure on Talentnet to demonstrate that the system can operate continuously and recover quickly and completely after a disaster. Talentnet therefore needs a disaster recovery system with data synchronization that meets the following requirements:

Build a similar system with redundancy capabilities for the main system at a different location.
Develop scenarios for potential service disruptions and system recovery processes at the backup site.
Ensure continuous data synchronization from the main system to the backup system.
Automatically update the backup system when the application source code changes.

However, deploying a backup system for a container and microservices platform within a short period is a technical challenge, because the platform integrates closely with many other components, such as Nginx, PostgreSQL Database, MinIO Object Storage, a local registry, and Jenkins. Moreover, the customer's Kubernetes-based application is still being improved, so the related components must be tightly integrated to keep the two systems compatible.

Lastly, there is a challenge in terms of schedule. The project needs to be deployed as quickly as possible to allow Talentnet to deliver the product to the customer on time and demonstrate the backup system's recovery capability directly to the customer. The timeframe from the start of design to project completion should be under one month.

3. Solution

For a project with complex components and tight deadlines, HPT's engineering team conducted thorough research and designed the system components meticulously, applying the most suitable and innovative technologies. In addition, a detailed and specific plan was prepared to ensure a smooth and favorable deployment, minimizing risks and deployment time.

Building a similar Kubernetes system at the DR site

The first task was to construct a complete system at the secondary site, including every component needed for it to function like the primary site. The team thoroughly surveyed the existing system to capture all components and their configurations, then designed a comprehensive system for the DR site.

The system needs to include the following components:

Nginx system as the load balancer.
Virtual machine cluster running Kubernetes on the CentOS operating system, consisting of 1 master node and 2 worker nodes.
Local registry system to store container images.
PostgreSQL Database and MinIO Storage system with seamless connections to the Kubernetes cluster.
Jenkins system for automation tasks.

Data synchronization between the DC site and DR site

The second task was to build an automatic synchronization system in which all data from the primary site (DC) is continuously synchronized to the secondary site (DR), limiting data loss if an incident occurs at the primary site. Because the system consists of separate components, each one needs its own synchronization method to be efficient.

Synchronization methods:

MinIO Storage: A mirror job in the MinIO Client synchronizes data on a defined schedule.
PostgreSQL Database: The replication feature of PostgreSQL, in a master-slave model, synchronizes data in real time.
Jenkins Automation: Jenkins connects to the Kubernetes systems at both sites and automatically updates container images at the DR site whenever they change at the DC site.
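The MinIO and Jenkins items above can be sketched in a few lines of Python. This is only an illustration, not the project's actual tooling: it builds an `mc mirror` command line for a scheduler to run and works out which container images exist at the DC registry but not yet at the DR registry. The aliases `dc` and `dr`, the bucket name `hr-data`, and the image names are assumptions made up for the example.

```python
from typing import List, Set


def mirror_command(source: str, target: str) -> List[str]:
    """Build an `mc mirror` call that copies objects from source to target.

    --overwrite lets objects changed at the source replace older copies
    at the target.
    """
    return ["mc", "mirror", "--overwrite", source, target]


def images_missing_at_dr(dc_images: Set[str], dr_images: Set[str]) -> List[str]:
    """Return image references present at the DC registry but not at DR."""
    return sorted(dc_images - dr_images)


if __name__ == "__main__":
    # Hypothetical aliases and bucket name, for illustration only.
    print(" ".join(mirror_command("dc/hr-data", "dr/hr-data")))

    # Hypothetical image references.
    dc = {"hr-api:1.4.0", "hr-api:1.5.0", "hr-web:2.1.0"}
    dr = {"hr-api:1.4.0", "hr-web:2.1.0"}
    print(images_missing_at_dr(dc, dr))  # ['hr-api:1.5.0']
```

A scheduled job (for example, one triggered from Jenkins) could run the generated command and then push each missing image to the DR registry. How the two image lists are obtained depends on the registry in use and is left out here.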

Developing specific recovery scenarios for potential incidents

Detailed scenarios were written so that operators follow the correct recovery procedure and the system is restored as quickly as possible without errors.

The team ran simulations and repeated tests to refine the recovery process and confirm its accuracy.
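One way to make a recovery scenario repeatable in tests is to encode it as an ordered list of steps and stop at the first failure. The sketch below assumes this approach; the source does not describe how the scenarios were actually executed, and the step names in the example are hypothetical placeholders.

```python
from typing import Callable, List, Tuple

# A step is a name plus a callable that returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_runbook(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Run steps in order and stop at the first one that fails.

    Returns (overall success, names of the steps that completed).
    """
    completed: List[str] = []
    for name, action in steps:
        if not action():
            return False, completed
        completed.append(name)
    return True, completed


if __name__ == "__main__":
    # Hypothetical steps; real actions would call the relevant tools.
    steps: List[Step] = [
        ("promote standby database", lambda: True),
        ("switch load balancer to DR cluster", lambda: True),
        ("verify application health", lambda: True),
    ]
    ok, done = run_runbook(steps)
    print(ok, done)
```

Stopping at the first failed step keeps a partially recovered system from being exposed to users and shows exactly which step in the scenario needs to be refined.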

4. Benefits achieved

The entire disaster recovery site was completed after nearly a month of deployment, within the set timeline. It provides a backup system capable of rapid recovery in the event of an incident, ensuring uninterrupted service delivery to customers.

Data between the two systems is continuously and seamlessly synchronized, ensuring data safety.
The DR site operates stably before and after the recovery process.
The achieved recovery point meets the established Recovery Point Objective (RPO) of under 15 minutes.
The disaster recovery scenario has been successfully completed and directly demonstrated to Talentnet's customers.
All simulated disaster exercises achieved a 100% success rate as expected.
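The RPO figure above can be checked with simple arithmetic: the data at risk is the time between the last change applied at the DR site and the moment of failure. The sketch below is a minimal illustration under that assumption. On a PostgreSQL standby, the last applied timestamp could be read with the built-in function `pg_last_xact_replay_timestamp()`; how the project actually measured its RPO is not stated in the source.

```python
from datetime import datetime, timedelta, timezone

# RPO target taken from the text above: under 15 minutes.
RPO_TARGET = timedelta(minutes=15)


def replication_lag(last_replay: datetime, now: datetime) -> timedelta:
    """Return how far the DR copy is behind the primary, never negative."""
    return max(now - last_replay, timedelta(0))


def meets_rpo(last_replay: datetime, now: datetime,
              target: timedelta = RPO_TARGET) -> bool:
    """True when the potential data loss is strictly under the target."""
    return replication_lag(last_replay, now) < target


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    print(meets_rpo(now - timedelta(minutes=4), now))   # True
    print(meets_rpo(now - timedelta(minutes=20), now))  # False
```

The same check applies to the scheduled MinIO mirror job: if the interval between runs plus the run duration stays under 15 minutes, the object data stays within the target as well.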

Talentnet is satisfied with the DR site and wishes to continue working with HPT to further develop and strengthen the entire system.

For HPT, the project affirmed the technical team's ability to deploy systems built on new, complex, and deeply technical technologies. It also demonstrates the ability to design and build container, Docker, Kubernetes, and disaster recovery (DR site) systems on different platforms.

With experience in deploying microservices, Docker, and Kubernetes systems, as well as building disaster recovery (DR site) systems, HPT is fully capable of providing comprehensive solutions, including consultation, deployment, and maintenance, for systems of different scales, ensuring system safety and making business operations more reliable.