AI Code Reviewer
Semantic Code Review System
Motivation
Exploring how LLMs can power smarter developer tooling. This project analysed submitted code and returned high-quality, contextual feedback similar to pull request comments from experienced engineers.
How it Worked
Data Collection — Custom web scraper collected real-world code diffs and PR comments from GitHub, alongside public datasets.
Embedding & Storage — Each code snippet embedded using OpenAI embeddings and stored in a Qdrant vector database alongside associated review comments.
Semantic Search — When users submitted code, the system performed semantic similarity search to retrieve the most relevant code + comment pairs.
Review Generation — Retrieved pairs passed into GPT-4o, which generated professional-style code reviews based on contextually similar examples.
Key Learnings
Semantic Search Pipelines
Structuring retrieval pipelines with OpenAI embeddings for code-level similarity.
Vector Database Management
Querying and managing Qdrant for vector-based retrieval at scale.
Prompt Engineering
Crafting prompts for natural, contextual feedback using LLMs.
Production Deployment
Building a scalable Java + Quarkus API and deploying via Docker.