LLM Projects Structure
Go from proof of concept to production-grade infrastructure.
Blueprint for Data Science Teams: From ML/LLM Experimentation to Production
This blueprint provides a comprehensive roadmap for data science teams looking to move their ML and LLM projects from experimentation to production.
Dean, CEO of DagsHub, who has spoken with 300+ ML operators over the last five years, shares the essential stages, processes, and best practices for scaling machine learning operations effectively.
The Core Distinction Between MLOps and LLMOps
While the high-level workflow remains similar across both traditional machine learning and LLM projects, the implementation details differ significantly.
One example that Dean shared:
Even if the chain of thought is very, very long and complicated, in the end the reason you're tracking it is not because you know what you're going to do with it; it's because you don't know what you'll need in the future.
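Dean's point is that traces should be captured now and mined later. As a minimal sketch of what that could look like (the file path and field names are illustrative assumptions, not a prescribed tool), each LLM interaction, chain of thought included, can be appended to a local JSONL log:

```python
import json
import time
from pathlib import Path

TRACE_LOG = Path("llm_traces.jsonl")  # illustrative path, not a specific product convention

def log_trace(prompt: str, chain_of_thought: str, answer: str, model: str, **metadata) -> None:
    """Append one full LLM interaction to an append-only JSONL log."""
    record = {
        "timestamp": time.time(),
        "model": model,
        "prompt": prompt,
        "chain_of_thought": chain_of_thought,  # stored even if there is no use for it today
        "answer": answer,
        "metadata": metadata,
    }
    with TRACE_LOG.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

# Usage: log_trace(prompt, cot, answer, model="my-model", temperature=0.2)
```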
The stack used for MLOps is different from the stack used for LLMOps.
Dean breaks down the stack for each of them - this chart shows the difference.
ML/LLM Project Workflow
Each node represents a critical phase with its associated tools and activities.
- Dataset Building: Create, curate, and version datasets
- Experimentation: Develop, test, and optimize models
- Evaluation: Comprehensive testing before production
- Deployment: Operationalize with monitoring
The Four Stages of ML/LLM Development
The journey from idea to production can be divided into four critical stages:
Stage 1: Dataset Building
This foundational stage involves preparing the data that will power your models and evaluations. It must happen early in the development process, closely tied to your problem definition.
Key Activities:
- Data collection and organization from structured and unstructured sources
- Dataset versioning for reproducibility
- Data cleaning and preprocessing
- For RAG applications: preparing data for context retrieval
- Building separate datasets for training and evaluation
Data curation isn't a one-time activity but an ongoing process that supports both initial development and continuous improvement.
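As a rough sketch of two of these activities, splitting data into training and evaluation sets and versioning the result, the snippet below stamps each split with a content hash so later experiments can reference the exact data they used. The file names and the hash-as-version convention are assumptions for illustration, not a prescribed tooling choice.

```python
import hashlib
import json
import random
from pathlib import Path

def version_and_split(records: list[dict], eval_fraction: float = 0.2, seed: int = 42) -> dict:
    """Deterministically shuffle records, split into train/eval, and tag with a content hash."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - eval_fraction))
    train, evaluation = shuffled[:cut], shuffled[cut:]

    # A content hash acts as a lightweight dataset version identifier for reproducibility.
    digest = hashlib.sha256(json.dumps(records, sort_keys=True).encode()).hexdigest()[:12]

    Path(f"train-{digest}.jsonl").write_text("\n".join(json.dumps(r) for r in train))
    Path(f"eval-{digest}.jsonl").write_text("\n".join(json.dumps(r) for r in evaluation))
    return {"version": digest, "train_size": len(train), "eval_size": len(evaluation)}
```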
Stage 2: Experimentation
The experimentation phase encompasses all activities related to model development, selection, and optimization.
For Traditional ML:
- Feature engineering and selection
- Model architecture selection
- Hyperparameter tuning
- Training and validation

For LLMs:
- Prompt engineering and optimization
- Model selection (comparing commercial APIs vs. open-source models)
- Fine-tuning strategies (particularly with LoRA for efficiency; see the sketch after this list)
- Tracking different combinations of prompts, models, and parameters
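For the LoRA item above, here is a minimal configuration sketch using the Hugging Face peft library. The base checkpoint, rank, and target modules are illustrative assumptions and depend on the architecture being fine-tuned.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Illustrative base model; swap in whatever checkpoint your project actually uses.
base_model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

lora_config = LoraConfig(
    r=8,                                   # low-rank dimension: small adapters, few trainable params
    lora_alpha=16,                         # scaling factor applied to the adapter output
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections; names are architecture-dependent
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically a small fraction of the full model's parameters
```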
Regardless of approach, systematic tracking of experiments is crucial.
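As one hedged example of such tracking, the sketch below logs each prompt/model/parameter combination and its score with MLflow. The parameter names, the eval_fn hook, and the metric name are illustrative; any experiment tracker would serve the same purpose.

```python
import mlflow

def run_experiment(prompt_template: str, model_name: str, temperature: float, eval_fn) -> float:
    """Log one prompt/model/parameter combination and its evaluation score."""
    with mlflow.start_run():
        mlflow.log_params({"model": model_name, "temperature": temperature})
        mlflow.log_text(prompt_template, "prompt_template.txt")  # store the prompt as an artifact
        score = eval_fn(prompt_template, model_name, temperature)  # your evaluation harness
        mlflow.log_metric("eval_score", score)
    return score
```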
Stage 3: Evaluation
Before deploying to production, comprehensive evaluation ensures your model meets all requirements, both functional and ethical.
Key Evaluation Areas:
- Task performance against metrics
- Testing for bias, fairness, and potential harmful outputs
- Evaluating for hallucinations (particularly for LLMs)
- Edge case handling
- Performance at scale
Evaluation should include two aspects:
- Your model's performance on the task
- Basic guardrails to ensure the model doesn't produce harmful outputs
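Below is a deliberately simplified sketch of an evaluation loop that covers both aspects. Exact-match accuracy stands in for a real task metric, a keyword blocklist stands in for real guardrails, and the "input"/"expected" field names are assumptions about the eval set's schema.

```python
BLOCKLIST = {"ssn", "credit card", "home address"}  # placeholder; real guardrails are far richer

def evaluate(model_fn, eval_set: list[dict]) -> dict:
    """Score task accuracy over an eval set and flag guardrail violations."""
    correct, flagged = 0, []
    for example in eval_set:  # each example assumed to have "input" and "expected" keys
        output = model_fn(example["input"])
        if output.strip().lower() == example["expected"].strip().lower():
            correct += 1
        if any(term in output.lower() for term in BLOCKLIST):
            flagged.append(example["input"])
    return {
        "accuracy": correct / len(eval_set),
        "guardrail_violations": len(flagged),
        "flagged_inputs": flagged,
    }
```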
Stage 4: Deployment
The final stage involves operationalizing your model and establishing ongoing monitoring and improvement cycles.
Critical Components:
- Deployment infrastructure
- Real-time guardrails and safety measures
- Monitoring systems for performance and drift
- Feedback collection mechanisms
- Iterative improvement processes
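A minimal sketch of how these components can be wired together at serving time: a wrapper that applies a real-time guardrail, records latency and outputs for monitoring and drift analysis, and exposes a feedback hook. The JSONL sinks and the fallback message are illustrative assumptions; production systems would stream to a proper monitoring store.

```python
import json
import time

MONITOR_LOG = "predictions.jsonl"   # illustrative sink for monitoring data
FEEDBACK_LOG = "feedback.jsonl"     # illustrative sink for user feedback

def serve(model_fn, guardrail_fn, request: str) -> str:
    """Serve one request with a real-time guardrail and monitoring hooks."""
    start = time.time()
    output = model_fn(request)
    if not guardrail_fn(output):
        output = "I can't help with that."  # safe fallback when the guardrail trips
    with open(MONITOR_LOG, "a", encoding="utf-8") as f:
        f.write(json.dumps({
            "request": request,
            "response": output,
            "latency_s": round(time.time() - start, 3),
            "ts": time.time(),
        }) + "\n")
    return output

def record_feedback(request_id: str, rating: int) -> None:
    """Feedback collection hook that feeds the iterative improvement loop."""
    with open(FEEDBACK_LOG, "a", encoding="utf-8") as f:
        f.write(json.dumps({"request_id": request_id, "rating": rating}) + "\n")
```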
From Notebook to Production: Evolution of ML/LLM Workflows
Many teams begin their AI journey with simple workflows—often a Jupyter notebook with separate cells for each stage of the process. While this approach works for proof-of-concept, scaling to production requires more structured systems.
V1: Notebook-based workflow
- Individual cells handling different parts of the workflow
- Manual tracking of experiments
- Limited reproducibility and collaboration
- Suitable for initial exploration but not production

V2: Production-grade workflow
- Version-controlled datasets
- Systematic experiment tracking
- Standardized evaluation protocols
- Dedicated deployment pipelines
- Team collaboration support
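One way the V1-to-V2 shift tends to show up in code is that notebook cells become named pipeline stages with explicit inputs and outputs, so any stage can be rerun, tested, and tracked on its own. The sketch below only outlines that structure; the stage bodies are elided and the promotion threshold is an illustrative assumption.

```python
# Each stage has an explicit signature, so it can be tested and rerun in isolation --
# the opposite of notebook cells that share hidden state.

def build_dataset(path: str) -> dict: ...            # stage 1: curate and version data
def experiment(train: list) -> object: ...           # stage 2: train / prompt / fine-tune
def evaluate(model: object, eval_set: list) -> dict: ...  # stage 3: metrics + guardrails
def deploy(model: object) -> None: ...               # stage 4: serve with monitoring

def pipeline(raw_data_path: str) -> dict:
    dataset = build_dataset(raw_data_path)
    model = experiment(dataset["train"])
    report = evaluate(model, dataset["eval"])
    if report.get("accuracy", 0.0) >= 0.9:            # illustrative promotion gate
        deploy(model)
    return report
```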
The transition from V1 to V2 is crucial for organizations serious about implementing AI solutions at scale. As Dean points out, "the 99th percentile performance is crucial for production applications." In other words, moving beyond isolated examples to reliable, consistent performance requires robust infrastructure.
Implementation Recommendations
To successfully implement this blueprint, consider the following recommendations:
- Data Management First: Invest early in data management infrastructure, as data quality and organization impact all downstream processes
- Documentation Standards: Create clear documentation standards for experiments and models
- Evaluation Protocols: Build evaluation protocols before finalizing models
- Iterative Design: Design for iterative improvement from the beginning
- Cross-functional Collaboration: Establish cross-functional collaboration between data scientists, engineers, and business stakeholders
Conclusion
The key differentiator between successful and struggling AI initiatives often lies not in the sophistication of individual components but in the robustness of the overall workflow and the team's ability to learn and iterate effectively.