Lily Yuli Zheng - Data Scientist & Analyst

Technical Skills

Python, SQL, Snowflake, Spark SQL, Airflow, Power BI, Tableau, AWS, Machine Learning, Data Visualization

University of Ottawa

Master of Mathematics and Statistics (Co-op Option)

Ottawa, ON

Graduating Summer 2025

University of Waterloo

Bachelor of Computer Science and Statistics (Artificial Intelligence Option)

Waterloo, ON

Canada Post

Junior Data Analyst Intern - Pricing and Costing Team

Ottawa, ON

Jan 2024 – Aug 2024

Built and orchestrated an automated ETL pipeline with Apache Airflow, Spark SQL, Snowflake and Python to integrate competitor bidding-price data from multiple sources, cutting operational lead time by 20%
Delivered actionable real-time business insights into competitor pricing strategies by developing Power BI dashboards in collaboration with cross-functional teams, earning praise during team presentations
Designed advanced dynamic pricing models on AWS (Amazon S3) using Ridge Regression and SVM (scikit-learn), increasing the likelihood of securing high-value contracts and maximizing revenue potential in real time

Hamilton Health Sciences

Data Analyst Intern - Corporate Planning and Analysis Team

Hamilton, ON

May 2023 – Aug 2023

Developed automated JavaScript tests to validate website database outputs, adding new modules to the codebase and recording test results in JIRA tickets
Communicated with healthcare providers including hospitals and public health organizations, presenting monthly financial reports and automated dashboards using Excel
Acquired, analyzed, and visualized healthcare financial data in Tableau, identifying patterns and insights from variance in monthly funding letters for stakeholders and reducing the month-end deficit by 30%

Experian

Data Scientist Intern - Data Team

Shanghai, China

May 2022 – Aug 2022

Integrated pre-PBC rules targeting small-business clients and operationalized them into production-ready Python scripts, improving risk stratification that led to an estimated 12% decrease in average PCL per account
Automated and extended waterfall reports into independent rule-impact reports across all campaigns targeting 18M clients by isolating individual rule effects in Python
Redesigned campaign-execution workflow and migrated from SAS to Python & SQL, eliminating repetitive stacking and enabling scalable implementation following HSBC acquisition

WPP

Data Scientist Intern - Data Team

Shanghai, China

May 2019 – Aug 2019

Scraped competitor customer product reviews with Python (Selenium) and applied NLP sentiment analysis to identify actionable insights, leading to a 10% increase in ROI
Developed a KNN-based model to optimize targeting strategies for advertising campaigns, improving click-through rate by 20%
Established automated visualization dashboards in Tableau to provide monthly P&L reports and insights, reducing operational lead time by 15%

Loan Approval Prediction with Machine Learning

Cleaned and engineered the 598-record Loan Approval Prediction dataset: imputed missing values, one-hot encoded 9 categorical fields, and derived Total Income & Income-to-Loan ratio in R
Benchmarked logistic regression, decision-tree, and 100-tree random-forest classifiers; tuned hyper-parameters to boost test-set accuracy to ≈ 82% while improving denial-class recall by 14pp
Applied PCA plus K-Means/SOM clustering to uncover three borrower segments and confirmed credit-history as the dominant approval driver

Coding Style Learner with LSTM

Aimed to train a code-autocompletion model that learns a user's coding style using LSTM neural networks
Scraped data from GitHub with Python (Selenium) and preprocessed it into one-hot encodings using pandas and NumPy
Implemented a character-wise LSTM RNN and built a GPU-accelerated training pipeline using PyTorch

Conversational Medical-Literature Recommender

Built a chat-based assistant that retrieves the 3 most relevant papers across five neuro-medical domains using SciBERT embeddings and cosine-similarity search
Orchestrated intent handling in Google Dialogflow and fulfilled requests via a Python (Flask) webhook, delivering recommend/compare/refine actions with ≈ 0.85 Precision@3
Indexed 5,000 pre-embedded abstracts with FAISS and applied K-Means clustering to diversify results, cutting manual literature-screening time by 60%

Research in Cancer Gene Search with Genetic Algorithms

with Professor Shirley Mills

Applied GA-CFS feature-selection to prostate (12,600 genes) and lung (12,534 genes) microarray datasets, pruning ≈ 52% of features while retaining signal
Built ensemble classifiers in R (bagged decision trees, SVM, weighted stacking), boosting test accuracy to 94%, an 8pp gain over the published baseline
Led development of the bagging and stacking modules, interpreted results, and produced a slide deck summarizing biomarker insights and model performance