Data Science Documentation

Alamsyah Hanza
Jan 23, 2023


Documentation Creation

Data scientists do not work alone. The problems they solve need support from other teams to design and implement the solutions. However, those teams do not have unlimited time to discuss every detail with the Data Science team. Clear communication through written documents is therefore one of the most effective ways to speed things up.

Disclaimer: The documentation types and communication lines described here are based on the writer's own experience.

Background

Data scientists usually use machine learning or other mathematical methods to solve stakeholders’ problems. They need to share the results with stakeholders, either orally or in writing. They must also communicate the results to the engineering team so that the solution can be implemented. In addition, they sometimes need to share knowledge with other teams to showcase the result or increase its utilization. These communication lines have a widespread impact on the organization.

Data Science Communication

Hence, well-organized documentation contributes significantly to the organization’s success. This means a data scientist needs not just one document but several, depending on the purpose and the target audience. Based on the target audience, data science documentation can be divided into two categories: Specs documents and Technical documents.

Specs Document

Product Requirements Document

In general, the Data Science team creates a “product” from data using machine learning or other algorithms. The Product Requirements Document is the first document written for that product, and it details components such as:

  • Overview / Background
  • Main Objective of the product (Problem Statement)
  • Stakeholders related to the product
  • Milestones of the product (related to versioning)
  • Related Documents/Dashboard
  • Important Tables
  • Feature Set

The Product Requirements Document (PRD) is the root of all documentation related to the project. It gives a high-level explanation of the whole process behind the product, including its urgency and its current state. For that reason, the PRD is a good starting point for new joiners, other Product Managers, or Board members. Here is an example template of the PRD.

Feature Spec/Changes Docs

Alongside the PRD, Feature Specs detail how the product is developed and deployed across teams. This doc also serves as a place to record any inputs or questions raised during development. It should contain:

  • Roadmaps
  • Plan Discussion
  • Timeline
  • PIC (person in charge) of each part
  • Questions from related teams
  • Minutes of Meeting

A feature spec usually explains all steps of the product journey within one particular version. If there is a plan for changes or for a whole new version of the product, a new feature spec is written. This means one product can have multiple feature specs.

Technical Document

Model Governance

This document describes, step by step, how the data science team generates the model or solution. It is useful for internal teams, who can learn from the case and apply it to another problem. The Model Governance document also helps stakeholders discuss in more depth how the model actually works and which features are the most important in their case. Here are some elements of the doc:

  • Problem Statement
  • Data Preparation
  • Modeling and Evaluation
  • Business Impact

Experimentation Document

This document contains the plan and the results of an experiment related to a particular model version. It records how the experiment was conducted, the sample size calculations, and the metrics that were monitored. These are the sections of the experiment doc:

  • Experiment Design
  • Hypothesis
  • Sample Size
  • Impact analysis
  • Insight and Next Steps
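
The Sample Size section is easier to review when the calculation itself is recorded next to the numbers. Below is a minimal sketch, assuming the experiment is a two-group A/B test on a conversion rate and that the standard normal-approximation formula for comparing two proportions is acceptable. The function name `sample_size_per_group` and its default significance level and power are illustrative assumptions, not something prescribed by this article.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate users needed per group for a two-sided test of two proportions.

    Hypothetical helper for an experiment doc. It uses the common
    normal-approximation formula:
        n = (z_{1-alpha/2} + z_{power})^2 * (p1(1-p1) + p2(1-p2)) / (p1 - p2)^2
    """
    if not (0 < p_control < 1 and 0 < p_treatment < 1):
        raise ValueError("conversion rates must be strictly between 0 and 1")
    if p_control == p_treatment:
        raise ValueError("the two rates must differ to define an effect size")
    if not (0 < alpha < 1 and 0 < power < 1):
        raise ValueError("alpha and power must be strictly between 0 and 1")

    standard_normal = NormalDist()  # mean 0, standard deviation 1
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = standard_normal.inv_cdf(power)

    # Sum of the Bernoulli variances of the two groups
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control

    # Round up, because a fraction of a user cannot be sampled
    return math.ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


if __name__ == "__main__":
    # Illustrative inputs only: baseline 10% conversion, hoping to detect 12%
    print(sample_size_per_group(0.10, 0.12))
```

Recording the inputs (baseline rate, expected rate, significance level, power) together with the formula lets a reader reproduce the sample size later, or see how it changes if an assumption is revised.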

Deployment Docs and Load Test Docs

This doc is mostly for communicating with Backend Engineers about the flow or the API contract of the model. It usually describes the differences between the current version and the new version. In addition, the Deployment doc must record the load test results if an API call is involved. Here are possible sections for the doc:

  • Changes from the previous version
  • API Contract
  • Load Test
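
To make the API Contract section concrete, the agreed request and response shapes can be written down as code that both teams can read. The following is only a sketch under assumed names: `PredictRequest`, `PredictResponse`, `parse_request`, and all of their fields are hypothetical and would be replaced by whatever the Data Science and Backend teams actually agree on.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict


@dataclass(frozen=True)
class PredictRequest:
    """Hypothetical request body the backend sends to the model service."""
    user_id: str
    features: Dict[str, float]


@dataclass(frozen=True)
class PredictResponse:
    """Hypothetical response body returned by the model service."""
    model_version: str  # lets the backend tell the current and new versions apart
    score: float        # assumed here to be a probability between 0 and 1

    def __post_init__(self):
        if not 0.0 <= self.score <= 1.0:
            raise ValueError("score must be between 0 and 1")


def parse_request(raw: str) -> PredictRequest:
    """Validate a raw JSON string against the assumed request contract."""
    payload = json.loads(raw)
    missing = {"user_id", "features"} - payload.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    features = {name: float(value) for name, value in payload["features"].items()}
    return PredictRequest(user_id=str(payload["user_id"]), features=features)


if __name__ == "__main__":
    request = parse_request('{"user_id": "u1", "features": {"age": 30}}')
    response = PredictResponse(model_version="v2", score=0.5)  # placeholder values
    print(request)
    print(json.dumps(asdict(response)))
```

Keeping `model_version` in the response is one possible way to support the "Changes from the previous version" section, since the backend can log which version served each call.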

Root cause analysis

Last but not least, the RCA document details any incident that happens to the model. The doc is updated regularly until the incident is resolved. It contains the cause of the incident and all possible ways to solve the problem. The doc helps the team, especially Backend Engineers and Data Scientists, to prevent the same problem in the future or to resolve a similar incident faster.

Conclusion

Documentation hierarchy

Structured documentation not only helps a data scientist explain a model or solution to stakeholders but also helps the team trace solution versions and reproduce them. Documentation also explains the technical solution and its detailed impact on the product and the business. With clear written communication in the docs, the data scientist's impact becomes more visible across the organization or company. In conclusion, start writing the docs now.

Reference

https://www.datascience-pm.com/documentation-best-practices/

https://drive.google.com/drive/folders/1R78Rs3hlfPp2ngTocKcgKVtbrrji9aG-?usp=share_link
