Client: We want integration between our systems. We will create customer complaints and pass them on to you, and you will manage the workflow around them.
Project Manager: Sounds good. We are experts in it.
Tech Lead (thinking): Uh-oh…
Conversations like this happen often, and they send expectations through the roof. The common understanding across the board is that 100% of the records created in the client system will flow effortlessly through the integration, and that the heavy lifting will come later, in handling the workflows around those records.
High-Impact Problems
For any stakeholder involved in implementing integration logic, it is critical to understand the issues that can crop up; implementation plans often get derailed by them.
Integrations, however simple they may sound, come with their own set of challenges. A typical use case was the integration of a Field Service app for one of our brand partners, who receives close to 10,000 service requests every day. Even a 1% failure rate means losing 100 service requests a day, which, at that volume, would seriously damage the business relationship.
Some of the reasons why records might be lost in transit are as follows:
1. Data validation: is the data sent in the correct format?
2. High volumes of data received within a short duration.
3. The same data received more than once.
4. Acknowledgment of receipt sent successfully, but an error (known or unknown) occurred while processing the data.
5. Acknowledgment of processed data sent, but the client did not accept it.
The risk from these issues compounds when there are no clear dashboards around them. Unlike business-logic issues, which are visible to various stakeholders in the UI and can, to an extent, be validated and rectified there, the only way to identify integration failures is to dig into logs or the database, which creates a dependency on developers.
Given these unique challenges of integration, it is critical to have a framework in place so that no records are lost in transit.
So, what’s the solution?
Break the integration flow down into transactional tasks that are as granular as possible, and have a retry mechanism in place for every transaction.
Zooming in further, there are three fundamental transactional steps, each of which can be performed independently:
- Receiving the data, performing base-level validation and dumping it.
- Processing records from the dump table.
- Sending acknowledgment back to the client.
If these three steps are performed properly, along with retry logic, they make a robust framework for integration.
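The three steps above can be sketched as a status lifecycle for each record in the dump table. The minimal Python sketch below is illustrative only: the status names and the transition table are assumptions, not the exact flags of the actual implementation.

```python
from enum import Enum

# Hypothetical status lifecycle for a record in the dump table.
class Status(Enum):
    RECEIVED = "received"          # step 1: validated and dumped
    PROCESSING = "processing"      # step 2: picked up by the scheduler
    PROCESSED = "processed"        # step 2: business logic succeeded
    RETRY = "retry"                # step 2: transient (system) error, retry later
    FAILED = "failed"              # step 2: data error / duplicate, no retry
    ACKNOWLEDGED = "acknowledged"  # step 3: client confirmed the status update

# Allowed transitions between the three transactional steps.
TRANSITIONS = {
    Status.RECEIVED: {Status.PROCESSING},
    Status.PROCESSING: {Status.PROCESSED, Status.RETRY, Status.FAILED},
    Status.RETRY: {Status.PROCESSING},
    Status.PROCESSED: {Status.ACKNOWLEDGED},
    Status.FAILED: {Status.ACKNOWLEDGED},
    Status.ACKNOWLEDGED: set(),
}

def advance(current: Status, target: Status) -> Status:
    """Move a record to `target`, rejecting illegal jumps (e.g. RECEIVED -> PROCESSED)."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

Keeping the lifecycle explicit like this makes it easy to query the dump table for a dashboard: every lost record is sitting in some non-terminal state.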
Detailed implementation plan for each of these three steps:
- Receiving the data, performing base-level validation and dumping it:
- At this stage, all the records sent by the third party are stored directly in a single table. While storing them, basic validations such as data-type and sanity checks are performed.
- An acknowledgment should be sent to the client stating that the records have been received. This improves performance because the client does not need to keep resending the same records.
- This step will help us resolve challenges no. 1 and 2.
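A minimal sketch of step 1, using an in-memory SQLite table to stand in for the dump table. The field names (`client_ref`, `complaint_text`) and the response shape are hypothetical assumptions for illustration:

```python
import json
import sqlite3

# In-memory SQLite stands in for the dump table of step 1.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE dump (
    client_ref TEXT PRIMARY KEY,   -- client's record id; PK rejects resends
    body       TEXT NOT NULL,
    status     TEXT NOT NULL DEFAULT 'received'
)""")

REQUIRED_FIELDS = {"client_ref", "complaint_text"}  # hypothetical schema

def receive(payload: str) -> dict:
    """Validate, dump, and acknowledge one incoming record."""
    try:
        record = json.loads(payload)
    except json.JSONDecodeError:
        return {"ok": False, "error": "malformed JSON"}
    if not isinstance(record, dict):
        return {"ok": False, "error": "expected a JSON object"}
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        return {"ok": False, "error": f"missing fields: {sorted(missing)}"}
    try:
        conn.execute("INSERT INTO dump (client_ref, body) VALUES (?, ?)",
                     (record["client_ref"], payload))
        conn.commit()
    except sqlite3.IntegrityError:
        # Already stored: still acknowledge, so the client stops resending.
        pass
    return {"ok": True, "client_ref": record["client_ref"]}
```

Note that no business logic runs here: the record is validated, dumped, and acknowledged, which keeps this step fast enough to absorb volume spikes.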
- Processing records from the dump table:
- Processing records is the step in which third-party records are processed and inserted into the main tables.
- The entire business logic for translating client keys into generic platform-level keys and performing the relevant CRUD operations happens at this stage. It is critical that, if a very high spike of records is received in step 1, they are not all passed straight on to this step; the load should be smoothed out in such cases.
- The best way to go about it is to implement a scheduler that processes a specific number of records at a time. A flag should be set in the dump table to indicate which records are out for processing.
- Once processing completes successfully, a flag should be set in the dump table to indicate it.
- If processing is not successful, the error needs to be identified. Error identification tells us whether it is a one-time system-related error (did I hear "system timed out"?), a data-related error, or a duplicate record. For a one-time system-related error, a flag should be set indicating that the record needs a retry. For a data-related error or a duplicate record, a flag should be set indicating that no retry is needed.
- Records selected for retry should be pushed through the same scheduler again. Depending on business input, priority can be given to retry records or to newer records.
- This step will help us resolve challenges no. 3 and 4.
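Step 2 can be sketched as a scheduler tick over the dump table. The sketch below assumes an SQLite table like the one in step 1, a placeholder `apply_business_logic`, and illustrative flag values, batch size, and error classes:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE dump (
    client_ref TEXT PRIMARY KEY,
    body       TEXT NOT NULL,
    status     TEXT NOT NULL DEFAULT 'received'  -- received/processing/processed/retry/failed
)""")

class TransientError(Exception): pass   # e.g. a system timeout: worth retrying
class DataError(Exception): pass        # bad data or duplicate: no retry

def apply_business_logic(body: str) -> None:
    """Placeholder for key translation and CRUD into the main tables."""
    if "timeout" in body:
        raise TransientError
    if "bad" in body:
        raise DataError

BATCH_SIZE = 2  # smooths out spikes: only this many records per scheduler run

def scheduler_tick() -> None:
    # Pick a bounded batch; 'received' sorts before 'retry', so new records go
    # first here -- flip the ORDER BY to prioritise retries instead.
    rows = conn.execute(
        "SELECT client_ref, body FROM dump WHERE status IN ('received', 'retry') "
        "ORDER BY status LIMIT ?", (BATCH_SIZE,)).fetchall()
    for ref, body in rows:
        # Flag the record as out for processing.
        conn.execute("UPDATE dump SET status='processing' WHERE client_ref=?", (ref,))
        try:
            apply_business_logic(body)
            new_status = "processed"
        except TransientError:
            new_status = "retry"    # picked up again on a later tick
        except DataError:
            new_status = "failed"   # flagged, never retried
        conn.execute("UPDATE dump SET status=? WHERE client_ref=?", (new_status, ref))
    conn.commit()
```

The bounded batch is what decouples the spike in step 1 from the load on step 2: however many records arrive at once, only `BATCH_SIZE` of them hit the business logic per tick.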
- Sending acknowledgment back to the client:
- For all records whose processing is complete, we can optionally send a status update back to the client. For this step, it is important to ask the client to expose an API to which we can send the status updates.
- Implement a scheduler which will send a specific number of record updates back to the client.
- A status update is sent for records whose processing has completed either successfully or with errors (but not for those pending retry).
- The client should send us an acknowledgment for the same. In case acknowledgment is not received, a retry flag should be set to resend the record.
- This step will help us resolve challenge no. 5.
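Step 3 can be sketched as another bounded scheduler that resends until the client acknowledges. Here `post_status_to_client` is a hypothetical stand-in for the client's status API, and the in-memory `records` list stands in for the dump table:

```python
# Records whose processing has finished (or is still pending retry).
records = [
    {"client_ref": "c1", "status": "processed", "ack_pending": True},
    {"client_ref": "c2", "status": "failed", "ack_pending": True},
    {"client_ref": "c3", "status": "retry", "ack_pending": True},  # still retrying: skip
]

def post_status_to_client(ref: str, status: str) -> bool:
    """Stand-in for the client's status API; a real call would POST over HTTP."""
    return ref != "c2"  # simulate one lost acknowledgment

UPDATE_BATCH = 10  # bounded, like the processing scheduler

def ack_scheduler_tick() -> None:
    # Only final states (processed/failed) go back; retry-pending records wait.
    outbound = [r for r in records
                if r["ack_pending"] and r["status"] in ("processed", "failed")]
    for rec in outbound[:UPDATE_BATCH]:
        if post_status_to_client(rec["client_ref"], rec["status"]):
            rec["ack_pending"] = False   # client confirmed receipt
        # else: ack_pending stays True, so the next tick resends (retry flag)
```

Because the `ack_pending` flag is only cleared on a confirmed acknowledgment, a lost response simply means the update is resent on a later tick, rather than being silently dropped.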
And what was the outcome?
Implementing the above framework at the scale at which the integration was running helped us bring the number of failed cases down to almost zero; the only remaining gap was data-integrity issues, and even those are highlighted via the acknowledgments sent back to our brand partner.
Thinking more broadly, this framework for integration is language-, domain- and use-case-agnostic. As long as the basics are in place, any form of integration can be done seamlessly.
- Jinav Shah