
Data Science Intern
PRGX Global is a leading analytics-driven company specializing in recovery audit, spend analytics, and information management. With a global presence and a client base that includes many Fortune 500 companies, PRGX helps organizations identify cost recovery opportunities. I joined the Audit Operations Team and contributed to its data initiatives.
I focused on building data-driven infrastructure to streamline audit operations and improve query efficiency. I began by implementing and improving ETL pipelines that extract and transform data from large relational databases for downstream QA screening, which included optimizing the underlying SQL queries and views; a sketch of the extraction pattern appears below.

One of the core projects I worked on was a GUI-driven pipeline in Python that automates the extraction and organization of embedded objects from complex Excel files, a tool that drastically simplified internal audit processes (the core extraction trick is also sketched below). To manage client datasets, I developed parallel column-level fluctuation detection scripts aimed at identifying optimal segmentation criteria, which enabled smart partitioning of combined tables and supported efficient deduplication logic. Since this work leaned heavily on system stored procedures, I also optimized SSPs within the server environment.
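A minimal sketch of the chunked extract-and-transform pattern, assuming a SQL Server source accessed via SQLAlchemy; the connection string, table name, and cleanup steps are illustrative, not the actual internal pipeline:

```python
import pandas as pd
from sqlalchemy import create_engine

# Illustrative connection string -- not the real audit server.
engine = create_engine(
    "mssql+pyodbc://user:pass@audit-db/AuditDW?driver=ODBC+Driver+17+for+SQL+Server"
)

def extract_for_qa(query: str, out_path: str, chunksize: int = 50_000) -> None:
    """Stream a large result set in chunks, apply light normalization,
    and land it in one file for downstream QA screening."""
    first = True
    for chunk in pd.read_sql(query, engine, chunksize=chunksize):
        chunk.columns = [c.strip().lower() for c in chunk.columns]  # tidy headers
        chunk = chunk.dropna(how="all")                             # drop empty rows
        chunk.to_csv(out_path, mode="w" if first else "a", header=first, index=False)
        first = False

extract_for_qa(
    "SELECT * FROM invoice_lines WHERE audit_year = 2023",  # hypothetical table
    "invoice_lines_2023.csv",
)
```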
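The Excel tool itself is internal, but the core trick is that an .xlsx file is a ZIP archive whose embedded objects live under fixed internal paths. A minimal sketch of that extraction step (the function name and output layout are my own; the real tool wraps this in a GUI and organizes the results per audit):

```python
import zipfile
from pathlib import Path

def extract_embedded_objects(xlsx_path: str, out_dir: str) -> list:
    """Pull embedded objects out of an .xlsx archive.

    An .xlsx file is a ZIP container: embedded OLE objects live under
    xl/embeddings/ and pasted images under xl/media/.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(xlsx_path) as zf:
        for name in zf.namelist():
            if name.startswith(("xl/embeddings/", "xl/media/")):
                target = out / Path(name).name
                target.write_bytes(zf.read(name))  # dump the raw embedded blob
                extracted.append(str(target))
    return extracted
```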
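For the fluctuation scan, the essence is scoring each column's value variability and ranking columns as partition-key candidates. A hedged sketch of that idea; the scoring heuristic and function names are illustrative, and a production run over very large tables would likely swap the thread pool for processes:

```python
from concurrent.futures import ThreadPoolExecutor
import pandas as pd

def fluctuation_score(series: pd.Series) -> float:
    """Fraction of distinct values: ~0 for near-constant columns,
    ~1 for high-cardinality, noisy ones."""
    n = series.size
    return series.nunique(dropna=True) / n if n else 0.0

def rank_segmentation_candidates(df: pd.DataFrame) -> pd.Series:
    """Score every column concurrently; columns with a small, stable
    set of distinct values (low scores) are natural partition keys."""
    with ThreadPoolExecutor() as pool:
        scores = list(pool.map(lambda c: fluctuation_score(df[c]), df.columns))
    return pd.Series(scores, index=list(df.columns)).sort_values()

# Usage on a combined client table:
# rank_segmentation_candidates(combined_df).head(5)
```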
I also conducted research into an Isolation Forest approach to anomaly detection for client stat-report analysis. An internal tool uses SQL scripts to produce statistical reports that are then sent back to the client; before that happens, though, fields such as "vendor_name", "vendor_population", and "field_population" must be checked against previous years' values. This was a hefty manual task that took hours to finish, and it can be simplified by leveraging historical validated reports: we train a model on them to predict whether a row needs QA attention (a binary label or a confidence score). Admittedly, this approach is more involved than, say, a One-Class SVM, since Isolation Forest is a tree-ensemble method in the same family as Random Forest. Overall, it was an insightful experience researching this approach and understanding its relevance to the data I was working with; a hedged sketch follows below.
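A minimal sketch of that approach using scikit-learn's IsolationForest. The file paths, the contamination rate, and any feature beyond the fields quoted above are assumptions for illustration, not the internal setup:

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

# Historical, already-validated stat reports (illustrative path).
history = pd.read_csv("validated_stat_reports.csv")
features = ["vendor_population", "field_population"]  # numeric report fields

model = IsolationForest(
    n_estimators=200,     # number of isolation trees in the ensemble
    contamination=0.05,   # assumed share of rows needing QA attention
    random_state=42,
)
model.fit(history[features])

# Score a freshly generated report: predict() gives a binary label
# (-1 = anomaly), while score_samples() gives a continuous score
# (lower = more abnormal, so we negate it for a "QA risk" ranking).
report = pd.read_csv("current_stat_report.csv")  # illustrative path
report["needs_qa"] = model.predict(report[features]) == -1
report["qa_score"] = -model.score_samples(report[features])
print(report.loc[report["needs_qa"], ["vendor_name", "qa_score"]])
```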
Saved on average ≈ 2.75 hours per audit session
Reduced QA screening time by 2x