
Mule Batch - Adding Resiliency to the Manual Pagination Approach

By Irving Casillas, Cloud Engineer at IO Connect Services

November 27, 2020

In the previous post, Benchmarking Mule Batch Approaches, written by my friend and colleague Victor Sosa, we demonstrated different approaches for processing big files (1-10 GB) in a batch fashion. The manual pagination strategy proved to be the fastest, but with one important drawback: it is not resilient. After a restart of the server or the application, all processing progress is lost. Some commenters on that post pointed out that resiliency was needed to compare this approach fairly against the Mule Batch components, which provide resiliency by default. In this post, I show how to make the manual pagination approach resilient. To test it, I used a Linux virtual machine with the following hardware configuration:

Intel Core i5 7200U @ 2.5 GHz (2 cores)
8 GB RAM
100 GB SSD

Using the following software:

MySQL Community Server 5.7.19
Anypoint Studio Enterprise Edition 6.2.5
Mule Runtime Enterprise Edition 3.8.4

To process a comma-separated values (.csv) file that contains 821,000+ records with 40 columns each, the steps are as follows:

Read the file.
Build the pages.
Store the pages in a database.

You can find the code in our GitHub repository http://github.com/ioconnectservices/upsertpoc.

The Approach.

We built on the manual pagination approach from the aforementioned article and created a Mule app that processes .csv files. This time we added a VM connector to decouple the file read and page creation from the bulk, page-based database upsert. We configured the VM connector to use a persistent queue so that messages are stored on disk.
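To illustrate the idea of decoupling through a disk-backed queue, here is a minimal sketch in Python. It is not Mule configuration and not how the VM connector is implemented internally; the FileBackedQueue class, its one-file-per-message layout, and the .msg naming are hypothetical and only show why messages written to disk survive a restart.

```python
import json
import os
import tempfile


class FileBackedQueue:
    """Hypothetical stand-in for a persistent queue: one file per message."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)
        # Continue numbering after any messages left over from a previous run.
        existing = [int(name.split(".")[0])
                    for name in os.listdir(directory) if name.endswith(".msg")]
        self._next_id = max(existing, default=-1) + 1

    def put(self, page):
        """Persist one page; it becomes visible to consumers only when complete."""
        final_path = os.path.join(self.directory, f"{self._next_id:012d}.msg")
        tmp_path = final_path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as handle:
            json.dump(page, handle)
            handle.flush()
            os.fsync(handle.fileno())
        # Rename after the write so a crash never exposes a half-written file.
        os.replace(tmp_path, final_path)
        self._next_id += 1

    def drain(self, handler):
        """Pass each pending page to handler; delete it only after success."""
        handled = 0
        pending = sorted(n for n in os.listdir(self.directory) if n.endswith(".msg"))
        for name in pending:
            path = os.path.join(self.directory, name)
            with open(path, encoding="utf-8") as handle:
                page = json.load(handle)
            handler(page)
            os.remove(path)
            handled += 1
        return handled


queue_dir = tempfile.mkdtemp()

# Producer side: the flow that reads the file and builds pages.
producer = FileBackedQueue(queue_dir)
producer.put([["1", "alice"], ["2", "bob"]])
producer.put([["3", "carol"]])

# Simulated restart: a brand-new object over the same directory still
# finds the pages that were written before the "crash".
consumer = FileBackedQueue(queue_dir)
received = []
handled = consumer.drain(received.append)
```

Because a message file is removed only after the handler returns, a page whose processing is interrupted stays on disk and is picked up again on the next run. That also means the handler may see the same page twice, which is one reason an upsert is a good fit on the database side.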

Description of the processing flow:

1. The .csv file is read and the number of pages is calculated from the configured number of records per page; in this case, 800 records per page.

2. The file is put in the payload as a stream so that it can be read forward to create one page in each iteration of the ForEach loop.

3. Each page is sent to a persistent VM connector so that the pages are stored in the DB in a different flow. Making the VM connector persistent means that pages are written to files on disk. After an application reboot, the inbound VM connector can resume consuming those messages, so the records they contain still get upserted into the database.
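The first two steps can be sketched in a few lines of Python. This is an illustration of the pagination logic only, not the Mule flow from the repository; the function names page_count and iter_pages are hypothetical, and header-row handling is left out.

```python
import csv
import io
import math
from itertools import islice


def page_count(total_records, page_size):
    """Number of pages needed; the last page may be partially filled."""
    return math.ceil(total_records / page_size)


def iter_pages(stream, page_size):
    """Read the stream forward only, yielding one list of rows per page."""
    reader = csv.reader(stream)
    while True:
        page = list(islice(reader, page_size))
        if not page:
            return
        yield page


# With exactly 821,000 records and 800 records per page:
pages_needed = page_count(821000, 800)  # 1026 full pages plus one of 200 records

# Small in-memory example: 5 records, 2 per page.
sample = io.StringIO("1,a\n2,b\n3,c\n4,d\n5,e\n")
pages = list(iter_pages(sample, 2))
```

Only one page is held in memory at a time while iterating, which is what keeps the memory footprint independent of the file size.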

Metrics.

I took the metrics used in the previous Mule Batch article as a baseline to compare the efficiency of this new approach. I recreated very similar flows to test in my environment and I obtained the following results:

Out-of-the-box Mule batch jobs and batch commit components.

Total execution time: 7 minutes on average.
Total memory usage: 1.34 GB on average.

Custom pagination.

Total execution time: 6 minutes on average.
Total memory usage: 1.2 GB on average.

Custom Pagination with VM connector (This Approach).

At first, I obtained good results with this approach, but it was 30 seconds slower than the "Custom Pagination" one:

Total execution time (without stopping the server): 6 minutes and 30 seconds on average.
Total memory usage: 1.2 GB on average.

After increasing the number of threads from 25 to 30 in the Async connector configuration, these are the results:

Total execution time (without stopping the server): 6 minutes on average.
Total memory usage: 1.2 GB on average.
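As a rough illustration of what the thread setting controls, the following Python sketch processes pages with a worker pool whose size is a parameter. It is not the Mule Async configuration; upsert_page is a hypothetical placeholder that only simulates the I/O wait of a database call, and the numbers it produces say nothing about the benchmark above.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def upsert_page(page):
    """Hypothetical placeholder for the bulk upsert of one page."""
    time.sleep(0.01)  # simulate waiting on the database
    return len(page)


def process_pages(pages, max_threads):
    """Upsert pages concurrently and return the number of records handled."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        return sum(pool.map(upsert_page, pages))


# 60 pages of 800 records each, processed by 30 worker threads.
pages = [[("id", i)] * 800 for i in range(60)]
total = process_pages(pages, max_threads=30)
```

Whether more threads help depends on where the time is spent: for work that mostly waits on the database, a few extra workers can hide latency, while too many can simply move the bottleneck to the database or the disk.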

Conclusions

When designing an enterprise system, many factors come into play, and we have to make sure it keeps working even in disastrous events. Adding resiliency to a solution is a must-have in every system. For us, the VM connector brings this resiliency while keeping the execution costs within the desired parameters. Keep in mind that some performance tuning is needed to obtain resiliency without compromising performance.