DeepVision — Vision-and-Language Transformer (ViLT) Implementation

Recreated the ViLT multimodal transformer architecture from the original research paper, implementing the full pipeline from data loading and preprocessing to model inference and evaluation on the MSCOCO dataset.

Year

2025

Scope

Machine Learning / AI Research

Client

Independent Research

Duration

2 weeks

Research implementation project built to develop genuine architectural understanding of multimodal AI systems.

Challenge:

Most ML practitioners use pretrained models off the shelf. Understanding a multimodal model well enough to implement it from the research paper demands much deeper engagement with the architecture.

Solution:

Implemented the complete ViLT pipeline from scratch: data loading, preprocessing, model architecture, inference, and evaluation on the MSCOCO dataset. Built custom vision-language question answering on top of it, following the paper's methodology rather than wrapping an off-the-shelf pretrained model.
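As an illustration of the architecture step, here is a minimal pure-Python sketch of how a ViLT-style model builds its single input sequence: the image is cut into flat patches and linearly projected (no CNN or region detector), text tokens are looked up in an embedding table, and both get position and modality-type embeddings before being concatenated for one shared transformer. This is not the project's code. All names (`patchify`, `build_sequence`, `rand_matrix`), the toy sizes, and the random weights are assumptions chosen for readability, and the class tokens and the transformer encoder itself are omitted.

```python
import random


def patchify(image, patch):
    """Split an H x W x C nested-list image into flattened, non-overlapping patches (row-major)."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image height and width must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = []
            for r in range(top, top + patch):
                for c in range(left, left + patch):
                    flat.extend(image[r][c])  # append the C channel values of one pixel
            patches.append(flat)
    return patches


def linear(vec, weight):
    """Multiply a length-n vector by an n x d weight matrix, giving a length-d vector."""
    return [sum(v * row[j] for v, row in zip(vec, weight)) for j in range(len(weight[0]))]


def add(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def build_sequence(token_ids, image, *, patch, word_emb, patch_proj, type_emb, text_pos, image_pos):
    """Return text embeddings followed by patch embeddings, each tagged with its modality type."""
    text = [
        add(add(word_emb[tok], text_pos[i]), type_emb[0])  # type 0 = text
        for i, tok in enumerate(token_ids)
    ]
    visual = [
        add(add(linear(p, patch_proj), image_pos[i]), type_emb[1])  # type 1 = image
        for i, p in enumerate(patchify(image, patch))
    ]
    return text + visual


def rand_matrix(rng, rows, cols):
    """Small random matrix standing in for learned weights (hypothetical initialisation)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


# Toy sizes, far smaller than anything used in practice.
rng = random.Random(0)
DIM, VOCAB, PATCH, CHANNELS = 8, 10, 2, 3
image = [[[rng.random() for _ in range(CHANNELS)] for _ in range(4)] for _ in range(4)]
token_ids = [1, 5, 2]

params = dict(
    word_emb=rand_matrix(rng, VOCAB, DIM),
    patch_proj=rand_matrix(rng, PATCH * PATCH * CHANNELS, DIM),
    type_emb=rand_matrix(rng, 2, DIM),
    text_pos=rand_matrix(rng, 16, DIM),
    image_pos=rand_matrix(rng, 4, DIM),
)

sequence = build_sequence(token_ids, image, patch=PATCH, **params)
print(len(sequence), len(sequence[0]))  # 7 8: three text tokens plus four patches, each of width DIM
```

In a full model this combined sequence would be fed to a standard transformer encoder, and a task head such as a question-answering classifier would read from the pooled output.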
