LLM Projects Structure
Go from proof of concept to production-grade infrastructure.
Blueprint for Data Science Teams: From ML/LLM Experimentation to Production
This blueprint provides a comprehensive roadmap for data science teams looking to move their ML and LLM projects from experimentation to production.
Dean, CEO of DagsHub, who has spoken with 300+ ML operators over the last five years, shares the essential stages, processes, and best practices for scaling machine learning operations effectively.
The Core Distinction Between MLOps and LLMOps
While the high-level workflow remains similar across both traditional machine learning and LLM projects, the implementation details differ significantly.
One example that Dean shared:
Even if the chain of thought is very, very long and complicated, in the end the reason you're tracking it is not because you know what you're going to do with it; it's because you don't know what you'll need in the future.
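Dean's point is that traces should be captured now and mined later. As a minimal sketch of what that could look like (the file path and field names are illustrative assumptions, not a prescribed tool), each LLM interaction, chain of thought included, can be appended to a local JSONL log:

```python
import json
import time
from pathlib import Path

TRACE_LOG = Path("llm_traces.jsonl")  # illustrative path, not a specific product convention

def log_trace(prompt: str, chain_of_thought: str, answer: str, model: str, **metadata) -> None:
    """Append one full LLM interaction to an append-only JSONL log."""
    record = {
        "timestamp": time.time(),
        "model": model,
        "prompt": prompt,
        "chain_of_thought": chain_of_thought,  # stored even if there is no use for it today
        "answer": answer,
        "metadata": metadata,
    }
    with TRACE_LOG.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Usage: log_trace(prompt, cot, answer, model="my-model", temperature=0.2)
```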
The stack used for MLOps is different from the stack used for LLMOps.
Dean breaks down the stack for each of them - this chart shows the difference.
ML/LLM Project Workflow
Each node represents a critical phase with its associated tools and activities.
- Dataset Building: Create, curate, and version datasets
- Experimentation: Develop, test, and optimize models
- Evaluation: Comprehensive testing before production
- Deployment: Operationalize with monitoring
The Four Stages of ML/LLM Development
The journey from idea to production can be divided into four critical stages:
Stage 1: Dataset Building
This foundational stage involves preparing the data that will power your models and evaluations. It must happen early in the development process, closely tied to your problem definition.
Key Activities:
- Data collection and organization from structured and unstructured sources
- Dataset versioning for reproducibility
- Data cleaning and preprocessing
- For RAG applications: preparing data for context retrieval
- Building separate datasets for training and evaluation
Data curation isn't a one-time activity but an ongoing process that supports both initial development and continuous improvement.
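As a rough sketch of two of these activities, splitting data into training and evaluation sets and versioning the result, the snippet below stamps each split with a content hash so later experiments can reference the exact data they used. The file names and the hash-as-version convention are assumptions for illustration, not a prescribed tooling choice.

```python
import hashlib
import json
import random
from pathlib import Path

def version_and_split(records: list[dict], eval_fraction: float = 0.2, seed: int = 42) -> dict:
    """Deterministically shuffle records, split into train/eval, and tag with a content hash."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - eval_fraction))
    train, evaluation = shuffled[:cut], shuffled[cut:]

    # A content hash acts as a lightweight dataset version identifier for reproducibility.
    digest = hashlib.sha256(json.dumps(records, sort_keys=True).encode()).hexdigest()[:12]

    Path(f"train-{digest}.jsonl").write_text("\n".join(json.dumps(r) for r in train))
    Path(f"eval-{digest}.jsonl").write_text("\n".join(json.dumps(r) for r in evaluation))
    return {"version": digest, "train_size": len(train), "eval_size": len(evaluation)}
```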
Stage 2: Experimentation
The experimentation phase encompasses all activities related to model development, selection, and optimization.
For Traditional ML:
- Feature engineering and selection
- Model architecture selection
- Hyperparameter tuning
- Training and validation

For LLMs:
- Prompt engineering and optimization
- Model selection (comparing commercial APIs vs. open-source models)
- Fine-tuning strategies (particularly with LoRA for efficiency; see the sketch after this list)
- Tracking different combinations of prompts, models, and parameters
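For the LoRA item above, here is a minimal configuration sketch using the Hugging Face peft library. The base checkpoint, rank, and target modules are illustrative assumptions and depend on the architecture being fine-tuned.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Illustrative base model; swap in whatever checkpoint your project actually uses.
base_model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

lora_config = LoraConfig(
    r=8,                                   # low-rank dimension: small adapters, few trainable params
    lora_alpha=16,                         # scaling factor applied to the adapter output
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections; names are architecture-dependent
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically a small fraction of the full model's parameters
```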
Regardless of approach, systematic tracking of experiments is crucial.
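As one hedged example of such tracking, the sketch below logs each prompt/model/parameter combination and its score with MLflow. The parameter names, the eval_fn hook, and the metric name are illustrative; any experiment tracker would serve the same purpose.

```python
import mlflow

def run_experiment(prompt_template: str, model_name: str, temperature: float, eval_fn) -> float:
    """Log one prompt/model/parameter combination and its evaluation score."""
    with mlflow.start_run():
        mlflow.log_params({"model": model_name, "temperature": temperature})
        mlflow.log_text(prompt_template, "prompt_template.txt")  # store the prompt as an artifact
        score = eval_fn(prompt_template, model_name, temperature)  # your evaluation harness
        mlflow.log_metric("eval_score", score)
    return score
```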
Stage 3: Evaluation
Before deploying to production, comprehensive evaluation ensures your model meets all requirements, both functional and ethical.
Key Evaluation Areas:
- Task performance against metrics
- Testing for bias, fairness, and potential harmful outputs
- Evaluating for hallucinations (particularly for LLMs)
- Edge case handling
- Performance at scale
Evaluation should include two aspects:
- Your model's performance on the task
- Basic guardrails to ensure the model doesn't produce harmful outputs
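Below is a deliberately simplified sketch of an evaluation loop that covers both aspects. Exact-match accuracy stands in for a real task metric, a keyword blocklist stands in for real guardrails, and the "input"/"expected" field names are assumptions about the eval set's schema.

```python
BLOCKLIST = {"ssn", "credit card", "home address"}  # placeholder; real guardrails are far richer

def evaluate(model_fn, eval_set: list[dict]) -> dict:
    """Score task accuracy over an eval set and flag guardrail violations."""
    correct, flagged = 0, []
    for example in eval_set:  # each example assumed to have "input" and "expected" keys
        output = model_fn(example["input"])
        if output.strip().lower() == example["expected"].strip().lower():
            correct += 1
        if any(term in output.lower() for term in BLOCKLIST):
            flagged.append(example["input"])
    return {
        "accuracy": correct / len(eval_set),
        "guardrail_violations": len(flagged),
        "flagged_inputs": flagged,
    }
```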
Stage 4: Deployment
The final stage involves operationalizing your model and establishing ongoing monitoring and improvement cycles.
Critical Components:
- Deployment infrastructure
- Real-time guardrails and safety measures
- Monitoring systems for performance and drift
- Feedback collection mechanisms
- Iterative improvement processes
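A minimal sketch of how these components can be wired together at serving time: a wrapper that applies a real-time guardrail, records latency and outputs for monitoring and drift analysis, and exposes a feedback hook. The JSONL sinks and the fallback message are illustrative assumptions; production systems would stream to a proper monitoring store.

```python
import json
import time

MONITOR_LOG = "predictions.jsonl"   # illustrative sink for monitoring data
FEEDBACK_LOG = "feedback.jsonl"     # illustrative sink for user feedback

def serve(model_fn, guardrail_fn, request: str) -> str:
    """Serve one request with a real-time guardrail and monitoring hooks."""
    start = time.time()
    output = model_fn(request)
    if not guardrail_fn(output):
        output = "I can't help with that."  # safe fallback when the guardrail trips
    with open(MONITOR_LOG, "a", encoding="utf-8") as f:
        f.write(json.dumps({
            "request": request,
            "response": output,
            "latency_s": round(time.time() - start, 3),
            "ts": time.time(),
        }) + "\n")
    return output

def record_feedback(request_id: str, rating: int) -> None:
    """Feedback collection hook that feeds the iterative improvement loop."""
    with open(FEEDBACK_LOG, "a", encoding="utf-8") as f:
        f.write(json.dumps({"request_id": request_id, "rating": rating}) + "\n")
```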
From Notebook to Production: Evolution of ML/LLM Workflows
Many teams begin their AI journey with simple workflows—often a Jupyter notebook with separate cells for each stage of the process. While this approach works for proof-of-concept, scaling to production requires more structured systems.
V1: Notebook-based workflow
- Individual cells handling different parts of the workflow
- Manual tracking of experiments
- Limited reproducibility and collaboration
- Suitable for initial exploration but not production

V2: Production-grade workflow
- Version-controlled datasets
- Systematic experiment tracking
- Standardized evaluation protocols
- Dedicated deployment pipelines
- Team collaboration support
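One way the V1-to-V2 shift tends to show up in code is that notebook cells become named pipeline stages with explicit inputs and outputs, so any stage can be rerun, tested, and tracked on its own. The sketch below only outlines that structure; the stage bodies are elided and the promotion threshold is an illustrative assumption.

```python
# Each stage has an explicit signature, so it can be tested and rerun in isolation --
# the opposite of notebook cells that share hidden state.

def build_dataset(path: str) -> dict: ...            # stage 1: curate and version data
def experiment(train: list) -> object: ...           # stage 2: train / prompt / fine-tune
def evaluate(model: object, eval_set: list) -> dict: ...  # stage 3: metrics + guardrails
def deploy(model: object) -> None: ...               # stage 4: serve with monitoring

def pipeline(raw_data_path: str) -> dict:
    dataset = build_dataset(raw_data_path)
    model = experiment(dataset["train"])
    report = evaluate(model, dataset["eval"])
    if report.get("accuracy", 0.0) >= 0.9:            # illustrative promotion gate
        deploy(model)
    return report
```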
The transition from V1 to V2 is crucial for organizations serious about implementing AI solutions at scale. As Dean points out, "the 99th percentile performance is crucial for production applications." In other words, moving beyond isolated examples to reliable, consistent performance requires robust infrastructure.
Implementation Recommendations
To successfully implement this blueprint, consider the following recommendations:
- Data Management First: Invest early in data management infrastructure, as data quality and organization impact all downstream processes
- Documentation Standards: Create clear documentation standards for experiments and models
- Evaluation Protocols: Build evaluation protocols before finalizing models
- Iterative Design: Design for iterative improvement from the beginning
- Cross-functional Collaboration: Establish cross-functional collaboration between data scientists, engineers, and business stakeholders
Conclusion
The key differentiator between successful and struggling AI initiatives often lies not in the sophistication of individual components but in the robustness of the overall workflow and the team's ability to learn and iterate effectively.