Technical Projects
Technical Portfolio
My technical work focuses on language technology, natural language processing, and software development. Here are selected projects that demonstrate my technical skills and approach to solving complex problems.
NLP & Machine Learning Projects
Beyond Pattern Matching: Dataset Artifacts in SQuAD
Systematic analysis of shortcuts in reading comprehension models with two mitigation strategies: adversarial training and question-type aware loss.
- Uncovered model brittleness through comprehensive testing: 26.1% adversarial vulnerability and 35.5% reasoning gap
- Achieved 1.4x robustness improvement through adversarial training (50.1% → 72.5% EM)
- Improved reasoning performance (+2.6% on reasoning questions) while maintaining 77.2% overall accuracy
- Technologies: Python, PyTorch, Transformers, ELECTRA, NLP research methods
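For readers unfamiliar with the EM figures above, here is a minimal sketch of how an exact-match score is typically computed for SQuAD-style answers. It is illustrative and is not the project's evaluation code. The normalization steps follow the common SQuAD convention, and the function names are chosen here for the example.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """True if the prediction equals any gold answer after normalization."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(gold) for gold in gold_answers)


def em_score(predictions: list[str], references: list[list[str]]) -> float:
    """Percentage of predictions that exactly match one of their references."""
    hits = sum(exact_match(p, refs) for p, refs in zip(predictions, references))
    return 100.0 * hits / len(predictions)
```

Running the same function on a clean and an adversarial version of the evaluation set gives the two numbers needed to measure a robustness gap.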
Fact-Checking LLM Outputs with Textual Entailment
An automated verification system that validates ChatGPT-generated biographies against Wikipedia using bag-of-words and neural entailment models.
- Implemented a verification pipeline to decompose model outputs into atomic facts and validate them against BM25-retrieved Wikipedia passages
- Developed a high-precision fact-checker using a fine-tuned DeBERTa-v3 model to determine logical entailment between claims and source text
- Conducted detailed error analysis of false positives and negatives to identify linguistic patterns where LLMs struggle with factual consistency
- Technologies: Python, PyTorch, DeBERTa-v3, Textual Entailment (NLI), FActScore, BM25
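The retrieval step of the pipeline above can be sketched in plain Python. This is a simplified BM25 scorer, not the project's code. It assumes pre-tokenized input, uses the non-negative IDF variant popularized by Lucene, and takes the common defaults `k1=1.5` and `b=0.75` as assumed values.

```python
import math
from collections import Counter


def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each tokenized passage against a tokenized query with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of passages containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization penalizes passages longer than average.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

In a fact-checking setting, each atomic fact would serve as the query, and the top-scoring passages would be passed to the entailment model as premises.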
Transformer-Based Character Language Model
A custom-built Transformer architecture designed for sequence-to-sequence character counting and next-token prediction.
- Engineered a Transformer encoder from scratch, implementing self-attention, residual connections, and positional encodings without high-level library abstractions
- Developed a causal-masked language model trained on the text8 Wikipedia collection to predict next-character probability distributions
- Tuned hyperparameters and inspected attention map visualizations to reach the target perplexity of less than 7
- Technologies: Python, PyTorch, Transformer Architecture, Self-Attention, Positional Encoding, Language Modeling
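The two core ideas above, causal masking and perplexity, can be illustrated without any framework. The sketch below is a single-head, pure-Python version written for this page. The actual project used PyTorch tensors, and the function names here are illustrative.

```python
import math


def softmax(xs: list[float]) -> list[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v are lists of T vectors. Position t attends only to positions 0..t.
    """
    d = len(q[0])
    out = []
    for t in range(len(q)):
        # Masking is implemented by never scoring positions after t.
        scores = [sum(a * b for a, b in zip(q[t], k[s])) / math.sqrt(d)
                  for s in range(t + 1)]
        weights = softmax(scores)
        out.append([sum(w * v[s][j] for s, w in enumerate(weights))
                    for j in range(len(v[0]))])
    return out


def perplexity(log_probs: list[float]) -> float:
    """Perplexity from per-character natural-log probabilities of the gold text."""
    return math.exp(-sum(log_probs) / len(log_probs))
```

Because of the mask, changing a later character never changes the output at an earlier position, which is what makes next-character training valid.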
Deep Averaging Networks for Robust Sentiment Analysis
A neural text classification system exploring the impact of word embeddings and architectural depth on sentiment detection.
- Implemented a Deep Averaging Network (DAN) using GloVe embeddings to classify movie review sentiment into binary positive/negative labels
- Developed a typo-robust generalization module using prefix embeddings to maintain performance on misspelled text where standard word-level models fail
- Optimized training performance through mini-batching and dynamic sequence padding in PyTorch to handle varying sentence lengths efficiently
- Technologies: Python, PyTorch, GloVe Embeddings, Deep Averaging Networks, String Edit Distance
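The prefix-embedding idea can be sketched as follows: words are looked up by their first few characters, so a typo later in the word still maps to the same vector. This is an illustrative sketch, not the project's module. The prefix length of 3 and the toy lookup table are assumptions.

```python
def prefix_key(word: str, n: int = 3) -> str:
    """Map a word to the lookup key formed by its first n characters."""
    return word.lower()[:n]


def average_embedding(tokens: list[str], table: dict[str, list[float]],
                      dim: int, n: int = 3) -> list[float]:
    """Average the prefix embeddings of all tokens found in the table.

    This is the input layer of a Deep Averaging Network; the averaged
    vector would then be fed to feedforward layers for classification.
    """
    total = [0.0] * dim
    count = 0
    for tok in tokens:
        vec = table.get(prefix_key(tok, n))
        if vec is None:
            continue  # unknown prefix: skip the token
        total = [a + b for a, b in zip(total, vec)]
        count += 1
    if count == 0:
        return total
    return [x / count for x in total]
```

With a table keyed on `"gre"` and `"mov"`, the inputs `["great", "movie"]` and `["greaat", "movei"]` yield the same averaged vector, whereas a word-level lookup would treat the misspellings as unknown.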
Lexical Substitution System
An NLP system that identifies contextually appropriate word replacements using multiple approaches.
- Combined WordNet, pre-trained Word2Vec embeddings, and BERT for contextual word substitution
- Achieved 10% higher accuracy than baseline methods in suggesting replacements
- Technologies: Python, NLTK, Gensim, BERT
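One simple way to combine a candidate list with embeddings is to rank candidates by their similarity to the target word and its context. The sketch below shows that idea with plain Python and a hypothetical `vectors` dictionary. In the project, candidates would come from WordNet and vectors from Word2Vec or BERT, and the actual scoring may differ.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_substitutes(target: str, context: list[str], candidates: list[str],
                     vectors: dict[str, list[float]]) -> list[tuple[str, float]]:
    """Rank candidates by mean cosine similarity to the target and context words."""
    anchors = [vectors[w] for w in [target, *context] if w in vectors]
    scored = []
    for cand in candidates:
        if cand not in vectors or not anchors:
            continue
        sim = sum(cosine(vectors[cand], a) for a in anchors) / len(anchors)
        scored.append((cand, sim))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Including context words in the score is what lets the ranking prefer "sunny" over "clever" as a substitute for "bright" when the sentence is about the weather.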
PCFG Parsing Implementation
Implementation of the CKY algorithm for parsing with Probabilistic Context-Free Grammars (PCFGs).
- Developed an efficient implementation of the CKY dynamic programming algorithm
- Created probabilistic grammar handling for syntactic analysis
- Technologies: Python, NLTK
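A compact sketch of probabilistic CKY is shown below. It assumes a grammar in Chomsky normal form supplied as two plain dictionaries, a representation chosen for this example rather than the project's own data structures, and it returns only the best log-probability, not the parse tree.

```python
import math
from collections import defaultdict


def cky_best_logprob(tokens, lexical, binary, start="S"):
    """Return the log-probability of the best parse, or None if none exists.

    lexical: {(A, word): prob} for rules A -> word
    binary:  {(A, B, C): prob} for rules A -> B C
    """
    n = len(tokens)
    # chart[(i, j)][A] = best log-prob of nonterminal A spanning tokens[i:j]
    chart = defaultdict(dict)
    for i, word in enumerate(tokens):
        for (a, w), p in lexical.items():
            if w == word:
                chart[(i, i + 1)][a] = math.log(p)
    # Fill longer spans from shorter ones, trying every split point.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for split in range(i + 1, j):
                left, right = chart[(i, split)], chart[(split, j)]
                for (a, b, c), p in binary.items():
                    if b in left and c in right:
                        score = math.log(p) + left[b] + right[c]
                        if score > chart[(i, j)].get(a, -math.inf):
                            chart[(i, j)][a] = score
    return chart[(0, n)].get(start)
```

Storing a backpointer alongside each score would allow the best tree to be recovered. The chart has O(n²) cells and each cell tries O(n) split points, which gives the algorithm's cubic running time in sentence length.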
Computational Research Projects
Neighborhood-based Clustering for Visual Mental Imagery
Applied machine learning techniques to categorize visual mental imagery and perceptual domains.
- Developed clustering algorithms to identify domain-specific patterns in visual processing
- Implemented dimensionality reduction techniques to analyze performance score relationships
- Created visualization tools for complex cognitive data
- Technologies: Python, scikit-learn, pandas, matplotlib, k-means clustering
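Since k-means is listed among the techniques, here is a small pure-Python sketch of Lloyd's algorithm to show what the clustering step does. The project itself used scikit-learn. Taking the first `k` points as initial centroids is an assumption made here to keep the example deterministic.

```python
def kmeans(points, k, iters=20):
    """Cluster points with Lloyd's algorithm; returns (labels, centroids)."""
    # Assumed initialization: the first k points serve as starting centroids.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids
```

Applied to rows of per-participant scores, the resulting labels group participants with similar performance profiles.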
Multilingual Acquisition Analysis
Comparative corpus-driven study of sentence-final particle acquisition across different language backgrounds.
- Designed a data processing pipeline for analyzing 10,000+ multilingual utterances
- Implemented statistical models to identify acquisition patterns across language groups
- Created quantitative metrics for cross-linguistic influence measurement
- Technologies: Python, R, NLTK, pandas, lme4, statistical modeling
Web & Application Development
CantoLeap - Cantonese Vocabulary Learning App
A beginner-friendly mobile application designed to help users learn essential Cantonese vocabulary through interactive exercises.
- Developed a vocabulary learning system with flashcard-style exercises
- Implemented practice quizzes to reinforce retention
- Created a community notes feature for users to share learning tips and insights
- Technologies: Kotlin, Android Development
Django Web Application
Backend system developed with the Django framework.
- Created RESTful API endpoints and database models
- Implemented authentication and authorization systems
- Technologies: Python, Django, SQL, REST APIs
Speech Recognition System
Multi-API speech recognition project with sentiment analysis capabilities.
- Integrated multiple speech recognition APIs to compare their performance
- Implemented real-time transcription and sentiment analysis
- Technologies: Python, AssemblyAI API, OpenAI API
