Welcome to the ever-present challenge of code documentation. We all know its importance: onboarding new team members, maintaining long-term projects, reducing technical debt. Yet, it’s often the first thing to be deprioritized in a fast-paced development cycle. The result? Outdated, incomplete, or non-existent documentation that ultimately slows us down.
This guide will walk us through a practical approach to using AI to automate the generation of high-quality code documentation. We’ll learn how to build a basic script that extracts code, prompts an AI model for documentation, and integrates it back into our codebase. By the end, we’ll have a foundational understanding of how to streamline this often-tedious task, ensuring our documentation stays more consistent and up-to-date, even if it still requires human oversight.
## Prerequisites
Before we dive in, ensure we have the following set up:
- Python 3.8+: We’ll be using Python for our scripting. We can download it from the official Python website or install it with a tool like `pyenv` or `conda`.
- `pip`: Python’s package installer, which usually comes bundled with Python.
- An IDE or text editor: Visual Studio Code is recommended for its excellent Python support.
- An OpenAI API Key: We’ll use OpenAI’s models (like GPT-4o mini or GPT-4o) for generating documentation. We can obtain an API key from the OpenAI developer platform. Be aware that API usage incurs costs, so monitor our usage.
- Basic understanding of Python: Familiarity with functions, classes, and file I/O will be helpful.
- Basic understanding of Git (optional but recommended): For integrating with pre-commit hooks later.
## Step-by-Step Guide
### Step 1: Set Up Our Environment and API Key
First, let’s create a new project directory and install the necessary Python library.
1. **Create a project directory:**
```bash
mkdir ai-doc-generator
cd ai-doc-generator
```
2. **Create a virtual environment (recommended):**
```bash
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
```
3. **Install the OpenAI Python client:**
```bash
pip install openai
```
4. **Set our OpenAI API key as an environment variable:**
Storing the key in an environment variable keeps it out of our source code and version control. Replace `sk-YOUR_OPENAI_API_KEY` with our actual key.
```bash
export OPENAI_API_KEY="sk-YOUR_OPENAI_API_KEY"
```
To make the key persistent, add this line to our shell's profile file (e.g., `~/.bashrc` or `~/.zshrc`). On Windows, set it under the system environment variables instead.
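Since every later step depends on the key being present, it can help to fail fast with a clear message. The following is a minimal sketch; the helper name `require_api_key` is hypothetical and is not part of the OpenAI client.

```python
import os


def require_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key from the environment or raise a descriptive error."""
    # Accept a mapping so the check can be exercised without touching os.environ.
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set. Export it in the shell before running the script."
        )
    return key
```

Calling `require_api_key()` at the top of a script surfaces a missing key before any request is sent.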
### Step 2: Identify Code for Documentation and Initial Extraction
We'll start by focusing on generating docstrings for Python functions. We need a way to read a Python file, identify functions, and extract their source code. Python's `ast` (Abstract Syntax Tree) module is perfect for this.
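To get a feel for what `ast` exposes, here is a small sketch that lists every function and method in a source string together with its line range. The `SAMPLE` source and the helper name `list_functions` are illustrative assumptions; `end_lineno` is available from Python 3.8.

```python
import ast

SAMPLE = '''\
def top_level():
    return 1

class Greeter:
    def greet(self, name):
        return f"Hello, {name}"
'''


def list_functions(source):
    """Return (name, first_line, last_line) for every function or method."""
    tree = ast.parse(source)
    found = [
        (node.name, node.lineno, node.end_lineno)
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]
    # ast.walk gives no ordering guarantee we want to rely on, so sort by line.
    return sorted(found, key=lambda item: item[1])


print(list_functions(SAMPLE))  # [('top_level', 1, 2), ('greet', 5, 6)]
```

Line numbers are 1-based and `end_lineno` is inclusive, which is what makes slicing the original source lines straightforward.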
Let's create a sample Python file named `my_module.py`:
```python
# my_module.py
def calculate_average(numbers):
    """
    Calculates the average of a list of numbers.
    """
    if not numbers:
        raise ValueError("Input list cannot be empty.")
    return sum(numbers) / len(numbers)


class DataProcessor:
    def __init__(self, data):
        self.data = data

    def process_data(self):
        # This function needs a docstring!
        processed = [x * 2 for x in self.data]
        return processed

    def get_summary(self):
        """
        Generates a summary of the processed data.
        """
        return f"Processed {len(self.data)} items."
```
Now, let’s create a script `doc_generator.py` to extract the `process_data` function’s code.
```python
# doc_generator.py
import ast


def extract_function_source(file_path, function_name):
    """
    Extracts the source code of a specific function from a Python file.
    """
    with open(file_path, "r") as f:
        source_lines = f.readlines()

    tree = ast.parse("".join(source_lines))

    # ast.walk visits every node, so methods defined inside classes are found too.
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name == function_name:
            # lineno is 1-based and end_lineno is inclusive (Python 3.8+),
            # so this slice covers the whole definition.
            return "".join(source_lines[node.lineno - 1 : node.end_lineno])
    return None


if __name__ == "__main__":
    file_path = "my_module.py"
    function_to_document = "process_data"
    code_to_document = extract_function_source(file_path, function_to_document)
    if code_to_document:
        print(f"Extracted code for '{function_to_document}':\n```python\n{code_to_document.strip()}\n```")
    else:
        print(f"Function '{function_to_document}' not found.")
```
Run this script with `python doc_generator.py`. It should print the source code of `process_data`.
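As an alternative to slicing lines by hand, the standard library offers `ast.get_source_segment` (Python 3.8+), which returns the exact text of a node. The sketch below assumes an inline `SOURCE` string and a hypothetical helper name, `get_function_segment`.

```python
import ast

SOURCE = '''\
class DataProcessor:
    def process_data(self):
        processed = [x * 2 for x in self.data]
        return processed
'''


def get_function_segment(source, function_name):
    """Return the exact source text of the named function, or None."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name == function_name:
            # padded=True keeps the first line's indentation, which matters for methods.
            return ast.get_source_segment(source, node, padded=True)
    return None


print(get_function_segment(SOURCE, "process_data"))
```

Unlike the line slice, the returned segment stops at the last character of the function, so it carries no trailing newline.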
### Step 3: Craft the Initial AI Prompt
The quality of our generated documentation heavily depends on the prompt we provide to the AI. Let’s start with a basic prompt and then refine it. We aim for Google-style Python docstrings.
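For reference, this is roughly what a Google-style docstring looks like when written by hand for `calculate_average` from `my_module.py`. The wording of the descriptions is our own illustration of the target format, not model output.

```python
def calculate_average(numbers):
    """Calculates the average of a list of numbers.

    Args:
        numbers: A non-empty list of ints or floats.

    Returns:
        The arithmetic mean of the values as a float.

    Raises:
        ValueError: If the input list is empty.
    """
    if not numbers:
        raise ValueError("Input list cannot be empty.")
    return sum(numbers) / len(numbers)
```

The `Args:`, `Returns:`, and `Raises:` sections are the parts the prompt asks the model to reproduce.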
Add the following to `doc_generator.py`:
```python
import os
from openai import OpenAI

# ... (rest of the previous code) ...

# Built at runtime so the prompt can fence the code without a literal fence in this file.
FENCE = "`" * 3

PROMPT_TEMPLATE = """\
You are an expert Python developer. Your task is to generate a Google-style Python docstring for the following function.
The docstring should include:
- A concise summary of what the function does.
- A description of each argument, prefixed with `Args:`.
- A description of what the function returns, prefixed with `Returns:`.
- A description of any exceptions raised, prefixed with `Raises:`.
- Ensure correct indentation and formatting for a Python docstring.
Do not include the function signature or any example usage. Just provide the docstring content.
{fence}python
{code_snippet}
{fence}
"""


def generate_docstring(code_snippet, model="gpt-4o-mini"):
    """
    Generates a docstring for a given Python code snippet using an AI model.
    """
    client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
    prompt = PROMPT_TEMPLATE.format(fence=FENCE, code_snippet=code_snippet)
    try:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "You are a helpful AI assistant."},
                {"role": "user", "content": prompt},
            ],
            temperature=0.7,  # Controls randomness. Lower for more deterministic output.
            max_tokens=500,  # Adjust based on expected docstring length
        )
        return response.choices[0].message.content.strip()
    except Exception as e:
        print(f"Error generating docstring: {e}")
        return None


if __name__ == "__main__":
    file_path = "my_module.py"
    function_to_document = "process_data"
    code_to_document = extract_function_source(file_path, function_to_document)
    if code_to_document:
        print(f"Extracted code for '{function_to_document}':\n```python\n{code_to_document.strip()}\n```")
        print("\nGenerating docstring with AI...")
        generated_doc = generate_docstring(code_to_document)
        if generated_doc:
            print("\nGenerated Docstring:\n")
            print(f'"""\n{generated_doc}\n"""')
        else:
            print("Failed to generate docstring.")
    else:
        print(f"Function '{function_to_document}' not found.")
```
Run `python doc_generator.py` again. We should now see a generated docstring for `process_data`.
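Even when asked for docstring content only, a model may wrap its answer in a Markdown fence or in triple quotes. A small clean-up pass before insertion guards against that. This is a sketch under that assumption; `clean_docstring` is a hypothetical helper, not part of any library.

```python
import re


def clean_docstring(raw):
    """Strip Markdown fences and surrounding triple quotes from model output."""
    text = raw.strip()
    # Drop a leading fence line (with an optional language tag) and a trailing fence.
    text = re.sub(r"^`{3}[\w-]*\n", "", text)
    text = re.sub(r"\n`{3}$", "", text)
    text = text.strip()
    # Drop triple quotes if the model wrapped the docstring in them anyway.
    for quote in ('"""', "'''"):
        if len(text) >= 6 and text.startswith(quote) and text.endswith(quote):
            text = text[3:-3]
            break
    return text.strip()
```

One way to use it is to pass the model's reply through `clean_docstring` before writing the result to a file.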
### Step 4: Integrate the Generated Docstring Back into the File
Now for the tricky part: inserting the generated docstring into the actual file. We need to parse the file, find the function, and insert the docstring in the correct place, respecting indentation.
Rebuilding the file from a modified AST is complex and discards comments and formatting. A simpler approach for this guide is to edit the file as text: we locate the function definition and splice the docstring in directly below it, matching the indentation of the function body.
Let's modify `doc_generator.py` to insert the docstring:
```python
# doc_generator.py
import ast
import os
from openai import OpenAI

# ... (extract_function_source and generate_docstring functions from above) ...


def insert_docstring_into_function(file_path, function_name, docstring_content):
    """
    Inserts a generated docstring into the specified function in the file.
    The docstring is placed directly below the function signature.
    Handles existing docstrings by replacing them.
    """
    with open(file_path, "r") as f:
        lines = f.readlines()

    # Use the AST only to locate the function; the edit itself is a text splice.
    tree = ast.parse("".join(lines))
    node = next(
        (
            n
            for n in ast.walk(tree)
            if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef)) and n.name == function_name
        ),
        None,
    )
    if node is None:
        return False

    first_stmt = node.body[0]
    body_line = lines[first_stmt.lineno - 1]
    if body_line[: first_stmt.col_offset].strip():
        # The body shares a line with the signature (e.g. "def f(): return 1"); not handled.
        return False
    # Reuse the body's own leading whitespace so spaces and tabs are both respected.
    indentation = body_line[: first_stmt.col_offset]

    new_lines = [f'{indentation}"""\n']
    for doc_line in docstring_content.splitlines():
        # Leave blank lines empty instead of padding them with whitespace.
        new_lines.append(f"{indentation}{doc_line}\n" if doc_line.strip() else "\n")
    new_lines.append(f'{indentation}"""\n')

    if ast.get_docstring(node, clean=False) is not None:
        # An existing docstring is always the first statement: replace its lines.
        start, end = first_stmt.lineno - 1, first_stmt.end_lineno
    else:
        # Insert right below the signature, above any leading comments or blank lines.
        start = first_stmt.lineno - 1
        while start > node.lineno:
            previous = lines[start - 1].strip()
            if previous and not previous.startswith("#"):
                break
            start -= 1
        end = start

    lines[start:end] = new_lines
    with open(file_path, "w") as f:
        f.writelines(lines)
    return True


if __name__ == "__main__":
    file_path = "my_module.py"
    function_to_document = "process_data"
    code_to_document = extract_function_source(file_path, function_to_document)
    if code_to_document:
        print(f"Extracted code for '{function_to_document}':\n```python\n{code_to_document.strip()}\n```")
        print("\nGenerating docstring with AI...")
        generated_doc = generate_docstring(code_to_document)
        if generated_doc:
            print("\nGenerated Docstring:\n")
            print(f'"""\n{generated_doc}\n"""')
            print(f"\nInserting docstring into '{file_path}'...")
            if insert_docstring_into_function(file_path, function_to_document, generated_doc):
                print("Docstring inserted successfully. Please review 'my_module.py'.")
            else:
                print("Failed to insert docstring (function might not be found or other issue).")
        else:
            print("Failed to generate docstring.")
    else:
        print(f"Function '{function_to_document}' not found.")
```
Now, run `python doc_generator.py`. After execution, open `my_module.py` and observe the inserted docstring for `process_data`. Always review the changes!
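Part of that review can be automated: after the file is rewritten, we can parse it again and confirm that the function really carries a docstring. The sketch below works on a source string; `read_back_docstring` and the `REWRITTEN` sample are illustrative assumptions.

```python
import ast

REWRITTEN = '''\
def process_data(data):
    """
    Doubles every item.
    """
    return [x * 2 for x in data]
'''


def read_back_docstring(source, function_name):
    """Parse the source and return the named function's docstring, or None.

    ast.parse raises SyntaxError if the rewritten source is no longer valid Python.
    """
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name == function_name:
            # get_docstring strips the surrounding indentation and blank lines.
            return ast.get_docstring(node)
    return None


print(read_back_docstring(REWRITTEN, "process_data"))  # Doubles every item.
```

A `SyntaxError` or a `None` result here is a signal to restore the file from version control before going further.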
### Step 5: Automate with a Pre-commit Hook
Integrating this into our development workflow can ensure documentation is generated (and reviewed) before commits. `pre-commit` is a framework for managing and maintaining multi-language pre-commit hooks.
1. **Install `pre-commit`:**
```bash
pip install pre-commit
```
2. **Create a wrapper script for our doc generator:**
Let's create `generate_docs_hook.py`, which takes one or more file paths as arguments and adds a docstring to every function that lacks one. It reuses the helper functions from `doc_generator.py`.
```python
# generate_docs_hook.py
import ast
import sys

# Reuse the helpers from doc_generator.py, which must sit next to this script.
# Its __main__ guard keeps the demo code from running on import.
from doc_generator import (
    extract_function_source,
    generate_docstring,
    insert_docstring_into_function,
)


def get_undocumented_functions(file_path):
    """
    Finds all functions in a file that do not have a docstring.
    Returns a list of (function_name, is_method) tuples.
    """
    with open(file_path, "r") as f:
        tree = ast.parse(f.read())

    # Collect functions defined directly in a class body so they can be flagged as methods.
    method_ids = {
        id(item)
        for node in ast.walk(tree)
        if isinstance(node, ast.ClassDef)
        for item in node.body
        if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef))
    }

    undocumented = []
    for node in ast.walk(tree):
        # ast.walk already descends into classes, so each function is visited exactly once.
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and not ast.get_docstring(node):
            undocumented.append((node.name, id(node) in method_ids))
    return undocumented


def main():
    if len(sys.argv) < 2:
        print("Usage: python generate_docs_hook.py <file_path>...", file=sys.stderr)
        sys.exit(1)

    for file_path in sys.argv[1:]:
        if not file_path.endswith(".py"):
            continue
        print(f"Processing {file_path} for undocumented functions...")
        undocumented_funcs = get_undocumented_functions(file_path)
        if not undocumented_funcs:
            print(f"No undocumented functions found in {file_path}.")
            continue
        for func_name, is_method in undocumented_funcs:
            kind = "method" if is_method else "function"
            print(f"  Attempting to document {kind}: {func_name}")
            # The file is re-read on every call, so earlier insertions cannot shift line numbers.
            code_to_document = extract_function_source(file_path, func_name)
            if not code_to_document:
                print(f"  Could not extract source for '{func_name}'.")
                continue
            generated_doc = generate_docstring(code_to_document)
            if not generated_doc:
                print(f"  Failed to generate docstring for '{func_name}'.")
                continue
            if insert_docstring_into_function(file_path, func_name, generated_doc):
                print(f"  Successfully added docstring to '{func_name}' in '{file_path}'.")
            else:
                print(f"  Failed to insert docstring for '{func_name}'.")

    # pre-commit marks a hook as failed whenever it modifies files, even on exit code 0,
    # so the commit is aborted and the generated docstrings can be reviewed and re-staged.
    sys.exit(0)


if __name__ == "__main__":
    main()
```
Note: The hook imports its helpers from `doc_generator.py`, so both files need to sit in the repository root. In a larger project, we’d move the shared functions into a proper package.
3. **Initialize Git and `pre-commit`:**
```bash
git init
pre-commit install
```
4. **Create a `.pre-commit-config.yaml` file:**
```yaml
# .pre-commit-config.yaml
repos:
  - repo: local
    hooks:
      - id: generate-python-docs
        name: Generate Python Docs with AI
        entry: python generate_docs_hook.py
        language: system
        files: \.py$
        pass_filenames: true
        stages: [commit]
        # If we want to force review and prevent commit on auto-generation:
        # always_run: false
        # fail_fast: true
```
5. **Test the hook:**
Modify `my_module.py` by removing the docstring from `get_summary`:
```python
# my_module.py (modified)
# ...
    def get_summary(self):
        # This docstring is removed for testing the hook
        return f"Processed {len(self.data)} items."
```
Then, try to commit the changes:
```bash
git add my_module.py
git commit -m "Test AI doc generation"
```
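The `pre-commit` hook should run, detect `get_summary` (and any other function still lacking a docstring, such as `__init__`) as undocumented, generate docstrings, and modify `my_module.py`. Because `pre-commit` fails any hook that modifies files, the commit is aborted. We then review the changes, run `git add my_module.py` again, and re-commit to include the generated documentation. This two-step process allows for human review.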
## Common Issues
* **API Rate Limits and Cost:** OpenAI API usage isn't free. High volumes of documentation requests can quickly accumulate costs. Monitor our usage on the OpenAI dashboard. Consider setting usage limits or using cheaper models like `gpt-4o-mini` for initial drafts.
* **Inaccurate or Hallucinated Documentation:** AI models can sometimes generate incorrect or misleading information, especially for complex or ambiguous code. *Human review is absolutely essential.* This automation is a productivity enhancer, not a replacement for understanding.
* **Formatting Inconsistencies:** While we prompt for Google-style docstrings, the AI might occasionally deviate. Post-processing the generated docstrings with a linter or formatter (like `black` or `flake8` with docstring plugins) can help maintain consistency.
* **Large Codebases and Token Limits:** For very large functions or files, we might hit the AI model's token limits. Strategies include:
    * Processing smaller chunks of code.
    * Using models with larger context windows (e.g., `gpt-4-turbo`).
    * Sending only the function signature and a minimal context, relying more on the AI's general programming knowledge.
* **Security and Privacy:** Sending proprietary or sensitive code to external AI APIs might be a concern for some organizations. Evaluate our company's policies. Consider self-hosting open-source LLMs if privacy is important.
* **Integration Complexity:** Making this work across different languages, frameworks, and existing documentation tools can be challenging. Our basic script is a starting point.
* **Overwriting Existing Documentation:** Our current script will replace existing docstrings. While useful for updating, ensure we have version control and review processes to prevent accidental loss of valuable hand-written documentation.
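One inexpensive guard against hallucinated documentation is to compare the parameters listed under `Args:` with the parameters the function actually accepts. The sketch below assumes Google-style docstrings; the helper names `documented_args` and `check_args` are hypothetical, and the check does not replace human review.

```python
import ast
import re
import textwrap


def documented_args(docstring):
    """Collect the parameter names listed in a Google-style 'Args:' section."""
    names, section_indent, entry_indent = set(), None, None
    for line in docstring.splitlines():
        stripped = line.strip()
        indent = len(line) - len(line.lstrip())
        if section_indent is None:
            if stripped == "Args:":
                section_indent = indent
            continue
        if not stripped:
            continue
        if indent <= section_indent:
            break  # Reached the next section, such as "Returns:".
        if entry_indent is None:
            entry_indent = indent
        if indent == entry_indent:  # Deeper lines are wrapped descriptions.
            match = re.match(r"\*{0,2}(\w+)\s*(?:\([^)]*\))?:", stripped)
            if match:
                names.add(match.group(1))
    return names


def check_args(function_source, docstring):
    """Return (missing, invented) parameter names for a function and its docstring."""
    func = ast.parse(textwrap.dedent(function_source)).body[0]
    params = func.args.posonlyargs + func.args.args + func.args.kwonlyargs
    params += [a for a in (func.args.vararg, func.args.kwarg) if a is not None]
    actual = {p.arg for p in params} - {"self", "cls"}
    documented = documented_args(docstring)
    return sorted(actual - documented), sorted(documented - actual)
```

A non-empty `invented` list is a strong hint that the model described a parameter that does not exist, and the docstring should be rejected or regenerated.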
## Next Steps
After mastering the basics, here are some avenues to explore:
* **Refine Prompt Engineering:** Experiment with more advanced prompting techniques (e.g., few-shot learning by providing examples, Chain-of-Thought prompting) to improve docstring quality and adherence to specific style guides.
* **Extend Language Support:** Adapt the `extract_function_source` and `insert_docstring_into_function` logic to support other languages like JavaScript (JSDoc), Java (JavaDoc), Go, or C#.
* **Document Classes and Modules:** Expand the script to identify and generate documentation for entire classes, methods, and even module-level docstrings.
* **Integrate with CI/CD:** Instead of just a pre-commit hook, consider a CI/CD job that identifies undocumented code, generates documentation, and creates a pull request for review. This can be useful for maintaining documentation across the entire repository.
* **Use Open-Source LLMs:** Explore using local or self-hosted open-source language models (e.g., from Hugging Face) for documentation generation, especially if privacy or cost is a major concern. Tools like `ollama` can make this easier.
* **IDE Extensions:** Look into existing IDE extensions that offer AI-powered documentation or consider developing a custom one using our script as a backend.
* **Dynamic Doc Generation:** Explore generating documentation on-the-fly or integrating with tools like Sphinx, MkDocs, or Docusaurus to build comprehensive documentation sites from our generated docstrings.
* **Semantic Search:** Once we have rich docstrings, we could use embeddings and vector databases to enable semantic search over our codebase's documentation, making it easier for developers to find relevant information.
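As a starting point for the few-shot idea above, worked examples can be supplied as earlier user and assistant turns before the real request. This is a sketch only: the example pair, the system message, and the helper name `build_few_shot_messages` are our own assumptions.

```python
FEW_SHOT_EXAMPLES = [
    (
        "def add(a, b):\n    return a + b",
        "Adds two numbers.\n\nArgs:\n    a: The first addend.\n    b: The second addend.\n\n"
        "Returns:\n    The sum of a and b.",
    ),
]


def build_few_shot_messages(code_snippet, examples=FEW_SHOT_EXAMPLES):
    """Build a chat message list with worked examples ahead of the real request."""
    messages = [
        {
            "role": "system",
            "content": "You write Google-style Python docstrings. Reply with the docstring text only.",
        }
    ]
    for example_code, example_doc in examples:
        # Each example is a user turn (code) followed by the desired assistant reply.
        messages.append({"role": "user", "content": example_code})
        messages.append({"role": "assistant", "content": example_doc})
    messages.append({"role": "user", "content": code_snippet})
    return messages
```

The resulting list can be passed as the `messages` argument of `client.chat.completions.create` in place of the single-prompt version.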
Automating documentation with AI is a powerful way to improve developer productivity and code quality. Remember that the AI is a co-pilot, not an autopilot. Human review and judgment remain critical to ensure the generated documentation is accurate, clear, and truly helpful.