
Machine Learning Intern
Walker Industries is a Miami-based VR/XR startup working on integrating operating systems into extended and virtual reality. I joined the Project Replicant team, which works on local data ingestion for LLMs. Part of Project Replicant was fully open-sourced by the Walker Industries developers and accepts outside contributions.
I began working on data collection and formatting using Python and web scraping. This was followed by a filtering stage focused on high-quality, domain-specific content, since the LLM itself was domain-specific. At the time, OpenAI had released a new query (input) format called ChatML, or Chat Markup Language. Traditionally, LLMs consumed unstructured text, but OpenAI's initial v0 ChatML models expected the query to arrive in ChatML form. See below for an example:
[
  {"token": "<|im_start|>"},
  "system\nYou are ChatGPT, a large language model trained by OpenAI.",
  {"token": "<|im_end|>"}, "\n",
  {"token": "<|im_start|>"},
  "user\nHow are you",
  {"token": "<|im_end|>"}, "\n",
  {"token": "<|im_start|>"},
  "assistant\nI am doing well!",
  {"token": "<|im_end|>"}, "\n",
  {"token": "<|im_start|>"},
  "user\nHow are you now?",
  {"token": "<|im_end|>"}, "\n"
]

Although real-world ChatML can be more detailed, it follows this general structure. The data we collected through scraping was then formatted using tags like <|system|> and <|user|>. The formatted data fed a modified RAG pipeline that used a Vector DB for semantic search and retrieval. The LLM in use was a transformer, and the conversation between the AI and the user was guaranteed to remain within the VR/XR domain, which meant the volume of training data could stay relatively small.

When a conversation between the AI and the user concluded, a transcript was generated and post-processed by running it through a lightweight prebuilt NLP model to extract key details. I also made this process session-aware: each conversation transcript was stored as an embedding in the Vector DB, which allowed the system to re-inject prior conversation context into an ongoing conversation. This was possible because every session generated a unique chat_name. Processed transcripts were stored in the Vector DB with metadata tags like {chat_name: "X", timestamp: "T", vector: [...]}. When a user wants to recall an old conversation "X", we simply filter the Vector DB for chat_name == "X", provided the user supplies the exact metadata tag saved from the previous chat.
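To make the formatting step concrete, here is a minimal Python sketch of wrapping conversation turns in ChatML-style tags. The helper name build_chatml and the sample prompt are my own illustrations, not the project's actual code; only the tag structure comes from the format shown above.

```python
def build_chatml(system_prompt, turns):
    """Render a conversation as a single ChatML-tagged string.

    turns: list of (role, text) tuples, e.g. [("user", "..."), ("assistant", "...")]
    """
    # Every message is wrapped as <|im_start|>role\ntext<|im_end|>\n
    parts = [f"<|im_start|>system\n{system_prompt}<|im_end|>\n"]
    for role, text in turns:
        parts.append(f"<|im_start|>{role}\n{text}<|im_end|>\n")
    return "".join(parts)

example = build_chatml(
    "You are a domain-specific VR/XR assistant.",
    [("user", "How are you?"), ("assistant", "I am doing well!")],
)
print(example)
```

In practice each scraped, filtered document would be converted to such tagged strings before being used for fine-tuning or retrieval.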
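The session-aware recall described above can be sketched with a small in-memory stand-in for the Vector DB: each processed transcript is stored as an embedding with {chat_name, timestamp} metadata, and recall applies an exact chat_name filter before ranking the survivors semantically. The SessionStore class, the toy 2-D vectors, and the cosine helper are all assumptions for illustration.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class SessionStore:
    def __init__(self):
        # Each record: {"chat_name", "timestamp", "vector", "text"}
        self.records = []

    def add(self, chat_name, timestamp, vector, text):
        self.records.append(
            {"chat_name": chat_name, "timestamp": timestamp,
             "vector": vector, "text": text}
        )

    def recall(self, chat_name, query_vector, top_k=1):
        # Exact metadata filter first, then semantic ranking on what remains.
        hits = [r for r in self.records if r["chat_name"] == chat_name]
        hits.sort(key=lambda r: cosine(r["vector"], query_vector), reverse=True)
        return hits[:top_k]

store = SessionStore()
store.add("X", "2023-01-01T10:00", [1.0, 0.0], "We discussed VR headset drivers.")
store.add("Y", "2023-01-02T11:00", [0.0, 1.0], "Unrelated session.")
best = store.recall("X", [0.9, 0.1])
print(best[0]["text"])
```

A production Vector DB would replace the linear scan with an approximate-nearest-neighbor index, but the filter-then-rank shape of the query is the same.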
Achieved a 67% increase in text synthesis efficiency, especially for knowledge-heavy and context-driven queries
Increased LLM query response accuracy by 15%