LLM Projects Structure

Go from proof of concept to production-grade infrastructure.

Blueprint for Data Science Teams: From ML/LLM Experimentation to Production

This blueprint provides a comprehensive roadmap for data science teams looking to move their ML and LLM projects from experimentation to production.

Dean, CEO of DagsHub, who has spoken with 300+ ML operators over the last five years, shares the essential stages, processes, and best practices for scaling machine learning operations effectively.

The Core Distinction Between MLOps and LLMOps

While the high-level workflow remains similar across both traditional machine learning and LLM projects, the implementation details differ significantly.
One example that Dean shared:

"Even if the Chain of Thought is very, very long and complicated, in the end the reason you're tracking it is not because you know what you're going to do with it; it's because you don't know what you'll need in the future."

The stack used for MLOps is different from the stack used for LLMOps. Dean breaks down the stack for each of them, and the chart below shows the difference.

ML/LLM Project Workflow

Each node represents a critical phase with its associated tools and activities.

Data Set Building

Create, curate, and version datasets.

  • Data Collection
  • Preprocessing
  • Versioning

Tools: DagsHub, Git LFS, DVC

Experimentation

Develop, test, and optimize models.

  • Model Architecture
  • Feature Engineering
  • Hyperparameter Tuning

Tools: Jupyter, W&B, MLflow

Evaluation

Comprehensive testing before production.

  • Performance Metrics
  • Bias Testing
  • Prediction Analysis

Tools: Scikit-learn, TensorBoard

Deployment

Operationalize with monitoring.

  • Model Serving
  • API Development
  • Performance Monitoring

Tools: Docker, Kubernetes, MLflow

The Four Stages of ML/LLM Development

The journey from idea to production can be divided into four critical stages:

"data is still the hardest thing to handle in the data science and AI realm"1. Data Set Building and Curation

This foundational stage involves preparing the data that will power your models and evaluations. It must happen early in the development process, closely tied to your problem definition.

Key Activities:
  • Data collection and organization from structured and unstructured sources
  • Dataset versioning for reproducibility
  • Data cleaning and preprocessing
  • For RAG applications: preparing data for context retrieval
  • Building separate datasets for training and evaluation

Data curation isn't a one-time activity but an ongoing process that supports both initial development and continuous improvement.
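As a minimal sketch of what versioned data access can look like with DVC (one of the tools mentioned in the workflow above), the snippet below reads a pinned dataset revision so every experiment can be traced back to the exact data it used. The repository URL, file paths, and tag name are hypothetical placeholders, not a prescribed setup:

```python
import dvc.api
import pandas as pd

# Hypothetical repo URL, file paths, and tag -- substitute your own.
REPO = "https://github.com/your-org/your-project"

# Read a specific, immutable revision of the training data so the
# experiment is reproducible against the exact dataset it used.
with dvc.api.open("data/train.csv", repo=REPO, rev="dataset-v1.0") as f:
    train_df = pd.read_csv(f)

# Evaluation data is versioned separately, mirroring the split between
# training and evaluation datasets described above.
with dvc.api.open("data/eval.csv", repo=REPO, rev="dataset-v1.0") as f:
    eval_df = pd.read_csv(f)
```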

"while you can manage this in spreadsheets or notebooks initially, structured tracking becomes essential as projects scale"2. Experimentation

The experimentation phase encompasses all activities related to model development, selection, and optimization.

For Traditional ML:
  • Feature engineering and selection
  • Model architecture selection
  • Hyperparameter tuning
  • Training and validation

For LLMs:
  • Prompt engineering and optimization
  • Model selection (comparing commercial APIs vs. open-source models)
  • Fine-tuning strategies (particularly with LoRA for efficiency)
  • Tracking different combinations of prompts, models, and parameters

Regardless of approach, systematic tracking of experiments is crucial.
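As one possible shape for that tracking, here is a hedged sketch using MLflow (listed in the workflow above). The experiment name, parameter values, and metric are illustrative stand-ins:

```python
import mlflow

# Hypothetical experiment name -- substitute your own.
mlflow.set_experiment("llm-summarization")

# Each combination of prompt, model, and parameters becomes one tracked run,
# so results stay comparable across iterations instead of living in a notebook.
with mlflow.start_run(run_name="prompt-v3"):
    mlflow.log_param("model", "gpt-4o-mini")
    mlflow.log_param("prompt_version", "v3")
    mlflow.log_param("temperature", 0.2)
    mlflow.log_param("lora_rank", 8)  # relevant when fine-tuning with LoRA

    eval_score = 0.87  # stand-in for a score from your evaluation harness
    mlflow.log_metric("eval_score", eval_score)
```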

"making sure that it not only improves on the task that we want it to be better at but also that it doesn't become racist or biased"3. Evaluation and Pre-Production

Before deploying to production, comprehensive evaluation ensures your model meets all requirements—both functional and ethical.

Key Evaluation Areas:
  • Task performance against metrics
  • Testing for bias, fairness, and potential harmful outputs
  • Evaluating for hallucinations (particularly for LLMs)
  • Edge case handling
  • Performance at scale

Evaluation should include two aspects, both sketched in the example below:

  • Your model's performance on the task.
  • Basic guardrails to ensure the model doesn't produce harmful outputs.
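A minimal sketch of both aspects together, using scikit-learn (listed above) for the task metric. The labels, outputs, and blocklist are toy placeholders; real guardrails would rely on dedicated classifiers or moderation tooling rather than string matching:

```python
from sklearn.metrics import accuracy_score, f1_score

# Aspect 1: task performance, with toy labels standing in for real eval data.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1]
print("accuracy:", accuracy_score(y_true, y_pred))
print("f1:", f1_score(y_true, y_pred))

# Aspect 2: a basic guardrail check. This naive blocklist is only a
# placeholder; real guardrails use dedicated classifiers or moderation APIs.
BLOCKLIST = {"badword"}  # hypothetical
outputs = ["a helpful answer", "another safe reply"]
flagged = [o for o in outputs if any(term in o.lower() for term in BLOCKLIST)]
assert not flagged, f"potentially harmful outputs: {flagged}"
```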

"report back samples from your customers if you're allowed to do that, and use that for both monitoring but also improving the model in the next iteration"4. Deployment and Production

The final stage involves operationalizing your model and establishing ongoing monitoring and improvement cycles.

Critical Components:
  • Deployment infrastructure
  • Real-time guardrails and safety measures
  • Monitoring systems for performance and drift
  • Feedback collection mechanisms
  • Iterative improvement processes
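To make this concrete, here is a hedged sketch of serving a registered model with basic prediction logging via MLflow. The model name, stage, and payload shape are hypothetical, and it assumes a model was registered in the MLflow Model Registry during experimentation:

```python
import logging

import mlflow.pyfunc
import numpy as np
import pandas as pd

logging.basicConfig(level=logging.INFO)

# Hypothetical model name and stage in the MLflow Model Registry.
model = mlflow.pyfunc.load_model("models:/churn-model/Production")

def predict(payload: dict) -> list:
    """Serve one prediction and log it for monitoring and feedback review."""
    frame = pd.DataFrame([payload])
    prediction = np.asarray(model.predict(frame)).tolist()
    # Logged samples feed the monitoring and feedback loops described above;
    # in production these would go to a metrics store, not just a log line.
    logging.info("input=%s prediction=%s", payload, prediction)
    return prediction
```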

From Notebook to Production: Evolution of ML/LLM Workflows

Many teams begin their AI journey with simple workflows—often a Jupyter notebook with separate cells for each stage of the process. While this approach works for proof-of-concept, scaling to production requires more structured systems.

V1: The Notebook Approach
  • Individual cells handling different parts of the workflow
  • Manual tracking of experiments
  • Limited reproducibility and collaboration
  • Suitable for initial exploration but not production

V2: Structured Development
  • Version-controlled datasets
  • Systematic experiment tracking
  • Standardized evaluation protocols
  • Dedicated deployment pipelines
  • Team collaboration support
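To illustrate the difference, a minimal V2-style pipeline skeleton might look like the sketch below, where each former notebook cell becomes a named, testable stage. Every stage body here is a placeholder rather than a real implementation:

```python
# A sketch only: stages a scheduler or CI job could run end to end.

def build_dataset() -> str:
    return "data/train.csv"  # would pull a versioned dataset (e.g., via DVC)

def train(dataset_path: str) -> str:
    return "models/model.bin"  # would train and log the run (e.g., to MLflow)

def evaluate(model_path: str) -> float:
    return 0.9  # would compute task metrics plus guardrail checks

def deploy(model_path: str) -> None:
    print(f"deploying {model_path}")  # would push to serving infrastructure

if __name__ == "__main__":
    data = build_dataset()
    model = train(data)
    if evaluate(model) >= 0.85:  # hypothetical release threshold
        deploy(model)
```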

The transition from V1 to V2 is crucial for organizations serious about implementing AI solutions at scale. As Dean points out,

The 99th percentile performance is crucial for production applications

— meaning that moving beyond isolated examples to reliable, consistent performance requires robust infrastructure.

Implementation Recommendations

To successfully implement this blueprint, consider the following recommendations:

  • Data Management First: Invest early in data management infrastructure, as data quality and organization impact all downstream processes
  • Documentation Standards: Create clear documentation standards for experiments and models
  • Evaluation Protocols: Build evaluation protocols before finalizing models
  • Iterative Design: Design for iterative improvement from the beginning
  • Cross-functional Collaboration: Establish cross-functional collaboration between data scientists, engineers, and business stakeholders

Conclusion

The key differentiator between successful and struggling AI initiatives often lies not in the sophistication of individual components but in the robustness of the overall workflow and the team's ability to learn and iterate effectively.

ReadyForAgents

By Omer