A Data Pipeline Architecture For Classification Of Potential Claimants In Reunification Of Unclaimed Financial Assets

Mudambo, Nick A

Date

2021

Abstract

As data grows exponentially, organizations are leveraging technology to generate knowledge that supports decision making. However, storing and processing data through traditional data pipeline architectures presents the risk of a single point of failure. The Unclaimed Financial Assets Authority, like other organizations, has faced such challenges when reunifying unclaimed financial assets because it has been unable to harness and process the data received from its disintegrated systems. The main aim of this study was to develop a modern data pipeline architecture for the classification of potential claimants in the reunification of unclaimed assets. The target population comprised potential claimants who had registered on the various platforms provided by the Authority and the records submitted by holders between July 1, 2020 and November 1, 2020. Secondary data was extracted from the various platforms and systems: 210,587 records for potential claimants and 1,378,953 records from holders’ reports. Data cleaning was done using Python’s Pandas library. A use-modify-create development approach was used to design and implement the proposed pipeline architecture for classifying potential claimants’ data, building on the Lambda architecture and a data lake approach. This approach facilitated activities such as ingestion into a Hadoop data lake. PySpark was used to transform the data through a MapReduce approach before the classification algorithm was applied. HiBench was used to evaluate the implemented architecture, and its micro-benchmark metrics were used to refine the architecture. The major findings were the high utilization of allocated resources by non-DFS storage and non-heap memory, which calls for management and monitoring to avoid out-of-storage and out-of-memory issues. The study recommends the Neural Network algorithm for classification, with an accuracy of 94.27% and an F1-score of 1. It also recommends using micro-benchmark workloads to indicate where the CPU requires optimization and where disk I/O utilization is heavy. A further comparative study that includes other ML techniques and uses different datasets, evaluation metrics, and HiBench workloads is recommended.
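
The MapReduce-style transformation mentioned in the abstract can be illustrated with a minimal sketch in plain Python. This is not the study's PySpark code: the record fields (`owner_id`, `amount`) and the helper names are hypothetical, chosen only to show how a map step emits key-value pairs and a reduce step aggregates them by key.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical holder-report records; the field names are illustrative only.
records = [
    {"owner_id": "A1", "amount": 120.0},
    {"owner_id": "B2", "amount": 75.5},
    {"owner_id": "A1", "amount": 30.0},
]


def map_record(record):
    """Map step: emit one (key, value) pair per record."""
    return record["owner_id"], record["amount"]


def reduce_pairs(totals, pair):
    """Reduce step: fold one (key, value) pair into the running totals."""
    key, value = pair
    totals[key] += value
    return totals


def total_by_owner(rows):
    """Apply the map step to every row, then reduce the pairs by key."""
    return dict(reduce(reduce_pairs, map(map_record, rows), defaultdict(float)))


totals = total_by_owner(records)
```

In a distributed engine the map and reduce steps would run in parallel across partitions of the data lake; the single-process version here only shows the shape of the computation.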

URI

http://repository.kca.ac.ke/handle/123456789/546

Collections

Faculty of Computing and Information Management [112]