Retrieving Data from GitHub for Reporting: CLI, REST API, and Python SDK Compared
READER BEWARE: THE FOLLOWING WAS WRITTEN ENTIRELY BY AI WITHOUT HUMAN EDITING.
Introduction
GitHub isn’t just a code repository—it’s a treasure trove of project data, analytics, and configuration information. Whether you need to generate reports on repository activity, audit security configurations, track issue metrics, or analyze team productivity, GitHub provides multiple ways to extract this data programmatically.
This comprehensive guide explores three primary methods for retrieving GitHub data for reporting purposes:
- GitHub CLI (gh) - A command-line interface for quick queries and scripts
- GitHub REST API - Direct HTTP access for maximum flexibility and control
- PyGithub - A widely used Python library for object-oriented GitHub interaction
We’ll compare these approaches, show how to authenticate with each, and provide practical examples for common reporting scenarios.
Why Extract GitHub Data Programmatically?
Common Use Cases
- Repository Analytics: Track commits, pull requests, issues, and contributor activity
- Security Auditing: Review access permissions, scan for vulnerabilities, and monitor security alerts
- Team Metrics: Measure code review turnaround times, issue resolution, and sprint velocity
- Configuration Management: Document repository settings, branch protection rules, and webhooks
- Compliance Reporting: Generate evidence for audits and regulatory requirements
- Custom Dashboards: Build tailored visualizations beyond GitHub’s built-in insights
- Automated Notifications: Alert on specific events or threshold breaches
Benefits of Automation
- Consistency: Eliminate manual errors and ensure repeatable processes
- Scalability: Process data across hundreds of repositories simultaneously
- Timeliness: Schedule regular reports and real-time monitoring
- Integration: Combine GitHub data with other systems (Jira, Slack, etc.)
- Historical Analysis: Track trends and patterns over time
Method 1: GitHub CLI (gh)
The GitHub CLI is the fastest way to get started with GitHub automation. It’s perfect for quick queries, shell scripts, and interactive exploration.
Installation
macOS:
brew install gh
Linux (Debian/Ubuntu):
curl -fsSL https://cli.github.com/packages/githubcli-archive-keyring.gpg | sudo dd of=/usr/share/keyrings/githubcli-archive-keyring.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/githubcli-archive-keyring.gpg] https://cli.github.com/packages stable main" | sudo tee /etc/apt/sources.list.d/github-cli.list > /dev/null
sudo apt update
sudo apt install gh
Windows (Scoop):
scoop install gh
From Binary:
# Download from https://github.com/cli/cli/releases
# Extract and place in your PATH
Authentication
The GitHub CLI supports multiple authentication methods:
# Interactive authentication (recommended for getting started)
gh auth login
# Authenticate with a token
gh auth login --with-token < token.txt
# Or use environment variable
export GITHUB_TOKEN="ghp_yourpersonalaccesstoken"
gh auth status
During interactive authentication, you’ll choose:
- GitHub.com or GitHub Enterprise Server
- HTTPS or SSH protocol
- Authentication method (web browser or token)
Creating a Personal Access Token:
- Go to GitHub Settings → Developer settings → Personal access tokens → Tokens (classic)
- Click “Generate new token”
- Select appropriate scopes (repo, read:org, read:user, etc.)
- Save the token securely
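"Save the token securely" in practice means keeping it out of source code. For the Python examples later in this guide, a minimal sketch of loading it from the environment instead (the variable name GITHUB_TOKEN matches the one used throughout this guide):

```python
import os

def load_token(var="GITHUB_TOKEN"):
    """Read the token from an environment variable, failing loudly when it's missing."""
    token = os.environ.get(var)
    if not token:
        raise RuntimeError(f"Set {var} before running reporting scripts")
    return token
```

The CLI picks the variable up automatically; scripts that call the API directly can use a helper like this so a missing token fails immediately rather than as a confusing 401 later.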
Basic Usage
# View current user
gh auth status
# List repositories
gh repo list
# View repository details
gh repo view owner/repo
# List issues
gh issue list --repo owner/repo
# List pull requests
gh pr list --repo owner/repo
# View GitHub Actions workflows
gh workflow list --repo owner/repo
Reporting Examples with GitHub CLI
1. Repository Activity Report
#!/bin/bash
# repo-activity.sh - Generate repository activity report
REPO="owner/repo"
OUTPUT="activity_report.txt"
{
echo "GitHub Repository Activity Report"
echo "Repository: $REPO"
echo "Generated: $(date)"
echo "================================"
echo ""
echo "Recent Commits (Last 7 Days):"
# Note: "date -d" is GNU date; on macOS, use $(date -v-7d +%Y-%m-%d) instead
gh api "repos/$REPO/commits?since=$(date -d '7 days ago' -I)T00:00:00Z" \
--jq '.[] | "\(.commit.author.date) - \(.commit.author.name): \(.commit.message | split("\n")[0])"'
echo ""
echo "Open Pull Requests:"
gh pr list --repo "$REPO" --state open --json number,title,author,createdAt \
--jq '.[] | "#\(.number) - \(.title) by @\(.author.login) (created: \(.createdAt))"'
echo ""
echo "Recently Closed Issues:"
gh issue list --repo "$REPO" --state closed --limit 10 \
--json number,title,closedAt \
--jq '.[] | "#\(.number) - \(.title) (closed: \(.closedAt))"'
echo ""
echo "Top Contributors (All Time):"
# stats/contributors reports all-time totals; GitHub may return 202 while it computes the stats
gh api "repos/$REPO/stats/contributors" \
--jq 'sort_by(-.total) | .[:5] | .[] | "\(.author.login): \(.total) commits"'
} > "$OUTPUT"
echo "Report saved to $OUTPUT"
2. Pull Request Metrics
#!/bin/bash
# pr-metrics.sh - Calculate PR review turnaround times
REPO="owner/repo"
gh pr list --repo "$REPO" --state closed --limit 50 --json number,createdAt,closedAt,title | \
jq -r '
.[] |
{
number: .number,
title: .title,
created: .createdAt,
closed: .closedAt,
hours: (((.closedAt | fromdateiso8601) - (.createdAt | fromdateiso8601)) / 3600 | floor)
} |
"PR #\(.number): \(.hours)h - \(.title)"
' | sort -t: -k2 -n
echo ""
echo "Average turnaround time:"
gh pr list --repo "$REPO" --state closed --limit 50 --json createdAt,closedAt | \
jq -r '[.[] | (((.closedAt | fromdateiso8601) - (.createdAt | fromdateiso8601)) / 3600)] | add / length | floor | "\(.) hours"'
3. Security Audit Report
#!/bin/bash
# security-audit.sh - Repository security configuration audit
REPO="owner/repo"
echo "Security Audit Report for $REPO"
echo "Generated: $(date)"
echo "================================"
echo ""
echo "Branch Protection Rules:"
gh api "repos/$REPO/branches" --jq '.[] | select(.protected == true) | .name' | while read branch; do
echo "Branch: $branch"
gh api "repos/$REPO/branches/$branch/protection" --jq '
" - Require PR reviews: \(.required_pull_request_reviews != null)
- Required approvals: \(.required_pull_request_reviews.required_approving_review_count // 0)
- Dismiss stale reviews: \(.required_pull_request_reviews.dismiss_stale_reviews // false)
- Require status checks: \(.required_status_checks != null)
- Enforce for admins: \(.enforce_admins.enabled // false)"
'
echo ""
done
echo "Dependabot Alerts:"
gh api "repos/$REPO/dependabot/alerts" --jq '.[] | "[\(.state)] \(.security_advisory.severity | ascii_upcase): \(.security_advisory.summary)"'
echo ""
echo "Secret Scanning Alerts:"
gh api "repos/$REPO/secret-scanning/alerts" --jq '.[] | "[\(.state)] \(.secret_type): \(.html_url)"'
echo ""
echo "Repository Settings:"
gh api "repos/$REPO" --jq '
"Visibility: \(.visibility)
Default Branch: \(.default_branch)
Allow Merge Commits: \(.allow_merge_commit)
Allow Squash Merge: \(.allow_squash_merge)
Allow Rebase Merge: \(.allow_rebase_merge)
Delete Branch on Merge: \(.delete_branch_on_merge)
Has Issues: \(.has_issues)
Has Wiki: \(.has_wiki)
Has Downloads: \(.has_downloads)"
'
4. Team Contribution Analysis
#!/bin/bash
# team-contributions.sh - Analyze team member contributions
ORG="your-org"
SINCE="2025-01-01"
echo "Team Contribution Analysis"
echo "Organization: $ORG"
echo "Period: Since $SINCE"
echo "================================"
echo ""
# Get all org members
gh api "orgs/$ORG/members" --jq '.[].login' | while read member; do
echo "Analyzing $member..." >&2  # progress goes to stderr so it doesn't pollute the sorted output
# Count commits across all org repos
commit_count=$(gh api "search/commits?q=author:$member+org:$ORG+author-date:>$SINCE" \
--jq '.total_count')
# Count PRs
pr_count=$(gh api "search/issues?q=author:$member+org:$ORG+type:pr+created:>$SINCE" \
--jq '.total_count')
# Count issues opened
issue_count=$(gh api "search/issues?q=author:$member+org:$ORG+type:issue+created:>$SINCE" \
--jq '.total_count')
echo "$member: $commit_count commits, $pr_count PRs, $issue_count issues"
sleep 6  # the search API allows roughly 30 requests/minute; pace the three searches per member
done | sort -t: -k2 -rn
5. Workflow Run Statistics
#!/bin/bash
# workflow-stats.sh - GitHub Actions workflow statistics
REPO="owner/repo"
echo "GitHub Actions Workflow Statistics"
echo "Repository: $REPO"
echo "================================"
echo ""
gh api "repos/$REPO/actions/workflows" --jq '.workflows[] | .id' | while read workflow_id; do
workflow_name=$(gh api "repos/$REPO/actions/workflows/$workflow_id" --jq '.name')
echo "Workflow: $workflow_name"
# Get last 10 runs
runs=$(gh api "repos/$REPO/actions/workflows/$workflow_id/runs?per_page=10")
total=$(echo "$runs" | jq '.workflow_runs | length')
successful=$(echo "$runs" | jq '[.workflow_runs[] | select(.conclusion == "success")] | length')
failed=$(echo "$runs" | jq '[.workflow_runs[] | select(.conclusion == "failure")] | length')
echo " Recent runs: $total"
echo " Successful: $successful"
echo " Failed: $failed"
if [ "$total" -gt 0 ]; then
success_rate=$(( successful * 100 / total ))
echo " Success rate: ${success_rate}%"
fi
echo ""
done
Pros and Cons of GitHub CLI
Pros:
- ✅ Quick to get started - Simple installation and authentication
- ✅ Great for scripts - Easily integrated into shell scripts
- ✅ Interactive features - Built-in pagination, formatting, and filtering
- ✅ No code required - Perfect for bash scripting
- ✅ Built-in helpers - Convenient commands for common operations
- ✅ Cross-platform - Works on macOS, Linux, and Windows
Cons:
- ❌ Limited to shell environments - Not ideal for complex applications
- ❌ Text processing required - Output often needs parsing with jq or awk
- ❌ Less programmatic control - Harder to build complex logic
- ❌ Shell script maintenance - Can become complex for large projects
Method 2: GitHub REST API
The REST API provides direct access to GitHub’s functionality via HTTP requests. It’s the most flexible option and works with any programming language.
Authentication
GitHub REST API supports multiple authentication methods:
1. Personal Access Token (Recommended)
# Create a token at: https://github.com/settings/tokens
# Use in curl
curl -H "Authorization: token ghp_yourtoken" \
https://api.github.com/user
# Use in HTTP headers
Authorization: token ghp_yourtoken
# Or for fine-grained tokens:
Authorization: Bearer github_pat_yourtoken
2. OAuth Apps
# Authenticate via OAuth flow
# Redirect users to:
https://github.com/login/oauth/authorize?client_id=YOUR_CLIENT_ID&scope=repo,read:org
# Exchange code for token
curl -X POST https://github.com/login/oauth/access_token \
-d "client_id=YOUR_CLIENT_ID" \
-d "client_secret=YOUR_CLIENT_SECRET" \
-d "code=CODE_FROM_OAUTH"
3. GitHub App
# Install GitHub App and generate JWT
# Use JWT to get installation access token
curl -X POST https://api.github.com/app/installations/:installation_id/access_tokens \
-H "Authorization: Bearer YOUR_JWT" \
-H "Accept: application/vnd.github+json"
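The JWT in the step above must be signed with the app's RSA private key (RS256) and carry specific claims: iss set to the App ID, iat slightly in the past to tolerate clock drift, and exp at most ten minutes out. A minimal sketch of building those claims in Python; the signing step is shown with the third-party PyJWT library as an assumption, not a requirement:

```python
import time

def app_jwt_payload(app_id, now=None, ttl=540):
    """Claims for a GitHub App JWT: iat backdated 60s for clock skew,
    exp kept under GitHub's 10-minute cap."""
    now = int(now if now is not None else time.time())
    return {"iat": now - 60, "exp": now + ttl, "iss": str(app_id)}

# Signing (sketch, assuming PyJWT and the app's private key PEM are available):
# import jwt
# encoded = jwt.encode(app_jwt_payload(APP_ID), private_key_pem, algorithm="RS256")
```

The encoded value is what goes in the `Authorization: Bearer` header of the installation-token request above.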
Basic Usage with curl
# Set token as environment variable
export GITHUB_TOKEN="ghp_yourtoken"
# Get current user
curl -H "Authorization: token $GITHUB_TOKEN" \
https://api.github.com/user
# List repositories
curl -H "Authorization: token $GITHUB_TOKEN" \
https://api.github.com/user/repos
# Get repository details
curl -H "Authorization: token $GITHUB_TOKEN" \
https://api.github.com/repos/owner/repo
# List issues
curl -H "Authorization: token $GITHUB_TOKEN" \
https://api.github.com/repos/owner/repo/issues
# Get pull request
curl -H "Authorization: token $GITHUB_TOKEN" \
https://api.github.com/repos/owner/repo/pulls/123
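One thing the curl one-liners above hide: every list endpoint is paginated (30 items by default, up to 100 with per_page), and the Link response header points at the next page. A minimal sketch of a reusable pagination helper in Python; `get` here is any requests-style callable (e.g. requests.get), injected so the helper itself stays dependency-free:

```python
def paginate(url, headers, get, params=None):
    """Yield every item from a paginated GitHub list endpoint by following
    the Link response header. `get` is a requests-style callable."""
    params = dict(params or {}, per_page=100)  # ask for the maximum page size
    while url:
        response = get(url, headers=headers, params=params)
        response.raise_for_status()
        yield from response.json()
        url = response.links.get("next", {}).get("url")  # absolute URL of the next page, if any
        params = None  # the next-page URL already embeds its query string
```

Usage would look like `for issue in paginate(f"{API}/repos/{owner}/{repo}/issues", headers, requests.get): ...`, and the same helper works for commits, pulls, and org repos.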
Reporting Examples with REST API
1. Python Script for Repository Statistics
#!/usr/bin/env python3
"""
GitHub Repository Statistics Reporter
Fetches and analyzes repository metrics using the REST API
"""
import requests
import json
from datetime import datetime, timedelta
from collections import defaultdict
# Configuration (in real use, prefer os.environ["GITHUB_TOKEN"] over a hardcoded token)
GITHUB_TOKEN = "ghp_yourtoken"
OWNER = "owner"
REPO = "repo"
BASE_URL = "https://api.github.com"
headers = {
"Authorization": f"token {GITHUB_TOKEN}",
"Accept": "application/vnd.github+json"
}
def get_commit_activity(owner, repo, since_days=30):
"""Get commit activity for the last N days"""
since_date = (datetime.now() - timedelta(days=since_days)).isoformat()
url = f"{BASE_URL}/repos/{owner}/{repo}/commits"
params = {
"since": since_date,
"per_page": 100
}
commits = []
page = 1
while True:
params['page'] = page
response = requests.get(url, headers=headers, params=params)
response.raise_for_status()
page_commits = response.json()
if not page_commits:
break
commits.extend(page_commits)
page += 1
# Stop when there is no further page; requests parses the Link header into response.links
if 'next' not in response.links:
break
return commits
def analyze_commits(commits):
"""Analyze commit data"""
authors = defaultdict(int)
daily_commits = defaultdict(int)
for commit in commits:
author = commit['commit']['author']['name']
date = commit['commit']['author']['date'][:10]
authors[author] += 1
daily_commits[date] += 1
return {
'total_commits': len(commits),
'unique_authors': len(authors),
'top_contributors': sorted(authors.items(), key=lambda x: x[1], reverse=True)[:5],
'daily_activity': sorted(daily_commits.items())
}
def get_pr_metrics(owner, repo, state='all'):
"""Get pull request metrics (first 100 PRs only; paginate for a complete picture)"""
url = f"{BASE_URL}/repos/{owner}/{repo}/pulls"
params = {
"state": state,
"per_page": 100
}
response = requests.get(url, headers=headers, params=params)
response.raise_for_status()
prs = response.json()
metrics = {
'total': len(prs),
'open': sum(1 for pr in prs if pr['state'] == 'open'),
'merged': sum(1 for pr in prs if pr.get('merged_at')),
'closed_unmerged': sum(1 for pr in prs if pr['state'] == 'closed' and not pr.get('merged_at'))
}
# Calculate average time to merge
merge_times = []
for pr in prs:
if pr.get('merged_at'):
created = datetime.fromisoformat(pr['created_at'].replace('Z', '+00:00'))
merged = datetime.fromisoformat(pr['merged_at'].replace('Z', '+00:00'))
hours = (merged - created).total_seconds() / 3600
merge_times.append(hours)
if merge_times:
metrics['avg_merge_time_hours'] = sum(merge_times) / len(merge_times)
return metrics
def get_issue_metrics(owner, repo):
"""Get issue metrics (first 100 issues only; paginate for a complete picture)"""
url = f"{BASE_URL}/repos/{owner}/{repo}/issues"
params = {
"state": "all",
"per_page": 100
}
response = requests.get(url, headers=headers, params=params)
response.raise_for_status()
issues = response.json()
# Filter out pull requests (they appear in issues endpoint)
issues = [i for i in issues if 'pull_request' not in i]
metrics = {
'total': len(issues),
'open': sum(1 for i in issues if i['state'] == 'open'),
'closed': sum(1 for i in issues if i['state'] == 'closed')
}
# Calculate average time to close
close_times = []
for issue in issues:
if issue['state'] == 'closed' and issue.get('closed_at'):
created = datetime.fromisoformat(issue['created_at'].replace('Z', '+00:00'))
closed = datetime.fromisoformat(issue['closed_at'].replace('Z', '+00:00'))
hours = (closed - created).total_seconds() / 3600
close_times.append(hours)
if close_times:
metrics['avg_close_time_hours'] = sum(close_times) / len(close_times)
return metrics
def generate_report(owner, repo):
"""Generate comprehensive repository report"""
print(f"GitHub Repository Report: {owner}/{repo}")
print(f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print("=" * 70)
print()
# Commit analysis
print("📊 Commit Activity (Last 30 Days)")
print("-" * 70)
commits = get_commit_activity(owner, repo, since_days=30)
analysis = analyze_commits(commits)
print(f"Total commits: {analysis['total_commits']}")
print(f"Unique authors: {analysis['unique_authors']}")
print()
print("Top contributors:")
for author, count in analysis['top_contributors']:
print(f" {author}: {count} commits")
print()
# Pull request metrics
print("🔀 Pull Request Metrics")
print("-" * 70)
pr_metrics = get_pr_metrics(owner, repo)
print(f"Total PRs: {pr_metrics['total']}")
print(f"Open: {pr_metrics['open']}")
print(f"Merged: {pr_metrics['merged']}")
print(f"Closed (unmerged): {pr_metrics['closed_unmerged']}")
if 'avg_merge_time_hours' in pr_metrics:
print(f"Average time to merge: {pr_metrics['avg_merge_time_hours']:.1f} hours")
print()
# Issue metrics
print("🐛 Issue Metrics")
print("-" * 70)
issue_metrics = get_issue_metrics(owner, repo)
print(f"Total issues: {issue_metrics['total']}")
print(f"Open: {issue_metrics['open']}")
print(f"Closed: {issue_metrics['closed']}")
if 'avg_close_time_hours' in issue_metrics:
print(f"Average time to close: {issue_metrics['avg_close_time_hours']:.1f} hours")
if __name__ == "__main__":
generate_report(OWNER, REPO)
2. Bash Script Using curl and jq
#!/bin/bash
# github-api-report.sh - Generate report using GitHub REST API
GITHUB_TOKEN="ghp_yourtoken"
OWNER="owner"
REPO="repo"
API_URL="https://api.github.com"
# Helper function for API calls
github_api() {
curl -s -H "Authorization: token $GITHUB_TOKEN" \
-H "Accept: application/vnd.github+json" \
"$API_URL/$1"
}
echo "GitHub Repository Report: $OWNER/$REPO"
echo "Generated: $(date)"
echo "======================================"
echo ""
# Repository info
echo "Repository Information:"
github_api "repos/$OWNER/$REPO" | jq -r '
"Name: \(.name)
Description: \(.description // "N/A")
Language: \(.language // "N/A")
Stars: \(.stargazers_count)
Forks: \(.forks_count)
Open Issues: \(.open_issues_count)
Created: \(.created_at[:10])
Last Updated: \(.updated_at[:10])"
'
echo ""
# Contributors
echo "Top 5 Contributors:"
github_api "repos/$OWNER/$REPO/contributors?per_page=5" | jq -r '
.[] | " \(.login): \(.contributions) contributions"
'
echo ""
# Recent releases
echo "Recent Releases:"
github_api "repos/$OWNER/$REPO/releases?per_page=3" | jq -r '
.[] | " \(.tag_name) - \(.name) (\(.published_at[:10]))"
'
echo ""
# Workflow runs
echo "Recent Workflow Runs:"
github_api "repos/$OWNER/$REPO/actions/runs?per_page=5" | jq -r '
.workflow_runs[] | " \(.name): \(.conclusion) (\(.created_at[:10]))"
'
3. Organization-Wide Reporting
#!/usr/bin/env python3
"""
GitHub Organization Reporter
Generates reports across all repositories in an organization
"""
import requests
import csv
from datetime import datetime
GITHUB_TOKEN = "ghp_yourtoken"
ORG = "your-org"
BASE_URL = "https://api.github.com"
headers = {
"Authorization": f"token {GITHUB_TOKEN}",
"Accept": "application/vnd.github+json"
}
def get_all_repos(org):
"""Get all repositories in an organization"""
repos = []
page = 1
while True:
url = f"{BASE_URL}/orgs/{org}/repos"
params = {"per_page": 100, "page": page}
response = requests.get(url, headers=headers, params=params)
response.raise_for_status()
page_repos = response.json()
if not page_repos:
break
repos.extend(page_repos)
page += 1
return repos
def get_repo_metrics(owner, repo):
"""Get key metrics for a repository"""
url = f"{BASE_URL}/repos/{owner}/{repo}"
response = requests.get(url, headers=headers)
if response.status_code != 200:
return None
data = response.json()
# Get additional metrics (single page only; the issues endpoint also returns PRs, so filter them out)
issues_url = f"{BASE_URL}/repos/{owner}/{repo}/issues"
issues_response = requests.get(issues_url, headers=headers, params={"state": "open", "per_page": 100})
open_issues = len([i for i in issues_response.json() if "pull_request" not in i]) if issues_response.status_code == 200 else 0
return {
"name": data["name"],
"visibility": data["visibility"],
"language": data.get("language", "N/A"),
"stars": data["stargazers_count"],
"forks": data["forks_count"],
"open_issues": open_issues,
"size_kb": data["size"],
"created_at": data["created_at"][:10],
"updated_at": data["updated_at"][:10],
"default_branch": data["default_branch"],
"archived": data["archived"]
}
def generate_org_report(org, output_file="org_report.csv"):
"""Generate organization-wide report"""
print(f"Fetching repositories for organization: {org}")
repos = get_all_repos(org)
print(f"Found {len(repos)} repositories")
# Collect metrics for each repo
metrics_list = []
for i, repo in enumerate(repos, 1):
print(f"Processing {i}/{len(repos)}: {repo['name']}")
metrics = get_repo_metrics(org, repo['name'])
if metrics:
metrics_list.append(metrics)
# Write to CSV
if metrics_list:
keys = metrics_list[0].keys()
with open(output_file, 'w', newline='') as f:
writer = csv.DictWriter(f, fieldnames=keys)
writer.writeheader()
writer.writerows(metrics_list)
print(f"\nReport saved to {output_file}")
# Print summary
print(f"\nOrganization Summary:")
print(f"Total repositories: {len(metrics_list)}")
print(f"Total stars: {sum(m['stars'] for m in metrics_list)}")
print(f"Total forks: {sum(m['forks'] for m in metrics_list)}")
print(f"Archived repos: {sum(1 for m in metrics_list if m['archived'])}")
# Language breakdown
languages = {}
for m in metrics_list:
lang = m['language']
languages[lang] = languages.get(lang, 0) + 1
print(f"\nTop languages:")
for lang, count in sorted(languages.items(), key=lambda x: x[1], reverse=True)[:5]:
print(f" {lang}: {count} repos")
if __name__ == "__main__":
generate_org_report(ORG)
Pros and Cons of REST API
Pros:
- ✅ Maximum flexibility - Full control over requests and responses
- ✅ Language agnostic - Works with any HTTP client
- ✅ Well documented - Comprehensive API documentation
- ✅ Fine-grained control - Access to all GitHub features
- ✅ No dependencies - Just HTTP requests
Cons:
- ❌ More verbose - Requires more code than SDK
- ❌ Manual pagination - Must handle pagination yourself
- ❌ Rate limiting complexity - Need to implement rate limit handling
- ❌ Authentication management - Must manage tokens manually
- ❌ No type safety - Working with raw JSON
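Of these cons, rate limiting is usually the first to bite in large reports. The API reports remaining quota in the X-RateLimit-Remaining header and the reset time (as an epoch timestamp) in X-RateLimit-Reset. A minimal backoff sketch built on those headers; the requests calls are illustrative, not part of the helper:

```python
import time

def seconds_until_reset(headers, now=None):
    """Return how long to wait before the next request: 0 while quota remains,
    otherwise the seconds until the X-RateLimit-Reset timestamp (plus a 1s buffer)."""
    if int(headers.get("X-RateLimit-Remaining", 1)) > 0:
        return 0
    reset = int(headers.get("X-RateLimit-Reset", 0))
    now = now if now is not None else time.time()
    return max(0, reset - now) + 1

# Illustrative use with requests:
# response = requests.get(url, headers=auth_headers)
# wait = seconds_until_reset(response.headers)
# if wait:
#     time.sleep(wait)
#     response = requests.get(url, headers=auth_headers)  # retry after the quota resets
```

Wrapping every request in a helper like this keeps long-running org-wide reports from dying halfway through with a 403.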
Method 3: PyGithub (Python SDK)
PyGithub is the most widely used Python library for GitHub's API (community-maintained, not an official GitHub SDK). It provides a high-level, object-oriented interface that makes Python-based reporting clean and maintainable.
Installation
pip install PyGithub
Authentication
PyGithub supports the same authentication methods as the REST API:
from github import Github, Auth
# 1. Personal Access Token (recommended)
auth = Auth.Token("ghp_yourtoken")
g = Github(auth=auth)
# 2. Username and password (removed by GitHub.com in November 2020 - no longer works)
# g = Github("username", "password")
# 3. GitHub App
auth = Auth.AppAuth(app_id, private_key)
g = Github(auth=auth)
# 4. Using environment variable
import os
token = os.environ.get('GITHUB_TOKEN')
g = Github(auth=Auth.Token(token))
# Test authentication
user = g.get_user()
print(f"Authenticated as: {user.login}")
Basic Usage
from github import Github, Auth
# Initialize
g = Github(auth=Auth.Token("ghp_yourtoken"))
# Get current user
user = g.get_user()
print(f"Hello {user.name}")
# Get a specific repository
repo = g.get_repo("owner/repo")
print(f"Repository: {repo.full_name}")
print(f"Stars: {repo.stargazers_count}")
# List user repositories
for repo in g.get_user().get_repos():
print(repo.name)
# Get issues
issues = repo.get_issues(state='open')
for issue in issues:
print(f"#{issue.number}: {issue.title}")
# Get pull requests
pulls = repo.get_pulls(state='all')
for pr in pulls:
print(f"PR #{pr.number}: {pr.title}")
# Get commits
commits = repo.get_commits()
for commit in commits[:10]:
first_line = commit.commit.message.split('\n')[0]  # first line of the commit message
print(f"{commit.sha[:7]}: {first_line}")
Reporting Examples with PyGithub
1. Comprehensive Repository Report
#!/usr/bin/env python3
"""
GitHub Repository Comprehensive Report using PyGithub
"""
from github import Github
from datetime import datetime, timedelta
from collections import defaultdict
import sys
# Configuration
GITHUB_TOKEN = "ghp_yourtoken"
OWNER = "owner"
REPO = "repo"
def generate_comprehensive_report(owner, repo_name):
"""Generate a comprehensive repository report"""
g = Github(GITHUB_TOKEN)
repo = g.get_repo(f"{owner}/{repo_name}")
print(f"=" * 80)
print(f"GitHub Repository Comprehensive Report")
print(f"Repository: {repo.full_name}")
print(f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print(f"=" * 80)
print()
# Basic Information
print("📋 BASIC INFORMATION")
print("-" * 80)
print(f"Description: {repo.description or 'N/A'}")
print(f"Homepage: {repo.homepage or 'N/A'}")
print(f"Primary Language: {repo.language or 'N/A'}")
print(f"Created: {repo.created_at.strftime('%Y-%m-%d')}")
print(f"Last Updated: {repo.updated_at.strftime('%Y-%m-%d')}")
print(f"Default Branch: {repo.default_branch}")
print(f"Visibility: {'Private' if repo.private else 'Public'}")
print(f"Archived: {'Yes' if repo.archived else 'No'}")
print()
# Statistics
print("📊 STATISTICS")
print("-" * 80)
print(f"Stars: {repo.stargazers_count:,}")
print(f"Forks: {repo.forks_count:,}")
print(f"Watchers: {repo.watchers_count:,}")
print(f"Open Issues: {repo.open_issues_count:,}")
print(f"Repository Size: {repo.size:,} KB")
print()
# Contributors
print("👥 TOP 10 CONTRIBUTORS")
print("-" * 80)
contributors = repo.get_contributors()
for i, contributor in enumerate(contributors[:10], 1):
print(f"{i:2}. {contributor.login:20} - {contributor.contributions:,} contributions")
print()
# Recent Commits (Last 30 Days)
print("💻 COMMIT ACTIVITY (LAST 30 DAYS)")
print("-" * 80)
thirty_days_ago = datetime.now() - timedelta(days=30)
commits = repo.get_commits(since=thirty_days_ago)
commit_list = list(commits)
commit_by_author = defaultdict(int)
commit_by_day = defaultdict(int)
for commit in commit_list:
if commit.author:
commit_by_author[commit.author.login] += 1
day = commit.commit.author.date.strftime('%Y-%m-%d')
commit_by_day[day] += 1
print(f"Total commits: {len(commit_list)}")
print(f"Unique authors: {len(commit_by_author)}")
print()
print("Most active committers:")
for author, count in sorted(commit_by_author.items(), key=lambda x: x[1], reverse=True)[:5]:
print(f" {author}: {count} commits")
print()
# Pull Requests
print("🔀 PULL REQUEST METRICS")
print("-" * 80)
open_prs = list(repo.get_pulls(state='open'))
closed_prs = list(repo.get_pulls(state='closed')[:50])  # slice the lazy PaginatedList first so only ~2 pages are fetched
print(f"Open PRs: {len(open_prs)}")
print(f"Recently Closed PRs: {len(closed_prs)}")
# Calculate average merge time for closed PRs
merge_times = []
merged_count = 0
for pr in closed_prs:
if pr.merged:
merged_count += 1
time_to_merge = (pr.merged_at - pr.created_at).total_seconds() / 3600
merge_times.append(time_to_merge)
if merge_times:
avg_merge_time = sum(merge_times) / len(merge_times)
print(f"Merged PRs: {merged_count}")
print(f"Average time to merge: {avg_merge_time:.1f} hours ({avg_merge_time/24:.1f} days)")
print()
if open_prs:
print("Currently Open PRs:")
for pr in open_prs[:5]:
age_days = (datetime.now() - pr.created_at.replace(tzinfo=None)).days
print(f" #{pr.number}: {pr.title[:60]} (age: {age_days} days)")
print()
# Issues
print("🐛 ISSUE METRICS")
print("-" * 80)
open_issues = list(repo.get_issues(state='open'))
# Filter out pull requests
open_issues = [i for i in open_issues if not i.pull_request]
print(f"Open Issues: {len(open_issues)}")
# Group by labels
label_counts = defaultdict(int)
for issue in open_issues:
for label in issue.labels:
label_counts[label.name] += 1
if label_counts:
print()
print("Issues by label:")
for label, count in sorted(label_counts.items(), key=lambda x: x[1], reverse=True)[:10]:
print(f" {label}: {count}")
print()
# Languages
print("💾 LANGUAGES")
print("-" * 80)
languages = repo.get_languages()
total_bytes = sum(languages.values())
for lang, bytes_count in sorted(languages.items(), key=lambda x: x[1], reverse=True):
percentage = (bytes_count / total_bytes) * 100
print(f"{lang:15} {percentage:5.1f}% ({bytes_count:,} bytes)")
print()
# Branch Protection
print("🔒 BRANCH PROTECTION")
print("-" * 80)
try:
default_branch = repo.get_branch(repo.default_branch)
if default_branch.protected:
protection = default_branch.get_protection()
print(f"Branch '{repo.default_branch}' is protected:")
if protection.required_pull_request_reviews:
print(f" - Required approving reviews: {protection.required_pull_request_reviews.required_approving_review_count}")
print(f" - Dismiss stale reviews: {protection.required_pull_request_reviews.dismiss_stale_reviews}")
else:
print(f" - No PR review requirements")
if protection.required_status_checks:
print(f" - Status checks required: {', '.join(protection.required_status_checks.contexts)}")
else:
print(f" - No required status checks")
print(f" - Enforce for admins: {protection.enforce_admins.enabled}")
else:
print(f"Branch '{repo.default_branch}' is NOT protected")
except Exception as e:
print(f"Could not retrieve branch protection info: {e}")
print()
# Recent Releases
print("🚀 RECENT RELEASES")
print("-" * 80)
releases = repo.get_releases()
release_list = list(releases[:5])
if release_list:
for release in release_list:
print(f"{release.tag_name:15} {release.title or '(no title)':30} ({release.published_at.strftime('%Y-%m-%d')})")
else:
print("No releases found")
print()
# Workflows
print("⚙️ GITHUB ACTIONS WORKFLOWS")
print("-" * 80)
try:
workflows = repo.get_workflows()
for workflow in workflows:
print(f"Workflow: {workflow.name}")
print(f" Path: {workflow.path}")
print(f" State: {workflow.state}")
# Get recent runs
runs = workflow.get_runs()
recent_runs = list(runs[:5])
if recent_runs:
success = sum(1 for r in recent_runs if r.conclusion == 'success')
failure = sum(1 for r in recent_runs if r.conclusion == 'failure')
print(f" Recent runs (last 5): ✅ {success} success, ❌ {failure} failure")
print()
except Exception as e:
print(f"Could not retrieve workflow info: {e}")
print("=" * 80)
print("Report complete")
print("=" * 80)
if __name__ == "__main__":
if len(sys.argv) > 2:
OWNER = sys.argv[1]
REPO = sys.argv[2]
generate_comprehensive_report(OWNER, REPO)
2. Organization Security Audit
#!/usr/bin/env python3
"""
GitHub Organization Security Audit using PyGithub
"""
from github import Github
from datetime import datetime
import csv
GITHUB_TOKEN = "ghp_yourtoken"
ORG = "your-org"
def audit_repository_security(repo):
"""Audit security settings for a repository"""
audit = {
"repo_name": repo.name,
"visibility": "private" if repo.private else "public",
"archived": repo.archived,
"default_branch": repo.default_branch,
}
# Branch protection
try:
branch = repo.get_branch(repo.default_branch)
audit["branch_protected"] = branch.protected
if branch.protected:
protection = branch.get_protection()
audit["require_pr_reviews"] = protection.required_pull_request_reviews is not None
if protection.required_pull_request_reviews:
audit["required_approvals"] = protection.required_pull_request_reviews.required_approving_review_count
else:
audit["required_approvals"] = 0
audit["enforce_admins"] = protection.enforce_admins.enabled if protection.enforce_admins else False
else:
audit["require_pr_reviews"] = False
audit["required_approvals"] = 0
audit["enforce_admins"] = False
except Exception:  # e.g. 404 when the branch has no readable protection
audit["branch_protected"] = False
audit["require_pr_reviews"] = False
audit["required_approvals"] = 0
audit["enforce_admins"] = False
# Repository settings
audit["has_issues"] = repo.has_issues
audit["has_wiki"] = repo.has_wiki
audit["has_downloads"] = repo.has_downloads
audit["allow_merge_commit"] = repo.allow_merge_commit
audit["allow_squash_merge"] = repo.allow_squash_merge
audit["allow_rebase_merge"] = repo.allow_rebase_merge
audit["delete_branch_on_merge"] = repo.delete_branch_on_merge
# Vulnerability alerts
audit["has_vulnerability_alerts"] = repo.get_vulnerability_alert()
return audit
def generate_org_security_audit(org_name, output_file="security_audit.csv"):
"""Generate security audit for all repositories in an organization"""
g = Github(GITHUB_TOKEN)
org = g.get_organization(org_name)
print(f"GitHub Organization Security Audit")
print(f"Organization: {org_name}")
print(f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print("=" * 80)
print()
repos = org.get_repos()
audits = []
for i, repo in enumerate(repos, 1):
print(f"Auditing {i}: {repo.name}")
audit = audit_repository_security(repo)
audits.append(audit)
# Write to CSV
if audits:
fieldnames = audits[0].keys()
with open(output_file, 'w', newline='') as f:
writer = csv.DictWriter(f, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(audits)
print(f"\n✅ Audit report saved to {output_file}")
    # Summary statistics (guard against an organization with no repositories)
    total = len(audits)
    if total:
        protected = sum(1 for a in audits if a['branch_protected'])
        private = sum(1 for a in audits if a['visibility'] == 'private')
        archived = sum(1 for a in audits if a['archived'])
        print(f"\n📊 Summary:")
        print(f"Total repositories: {total}")
        print(f"Private repositories: {private} ({private/total*100:.1f}%)")
        print(f"Archived repositories: {archived}")
        print(f"Repositories with branch protection: {protected} ({protected/total*100:.1f}%)")
        print(f"Repositories WITHOUT branch protection: {total - protected} ⚠️")
if __name__ == "__main__":
generate_org_security_audit(ORG)
3. Team Activity Dashboard
#!/usr/bin/env python3
"""
GitHub Team Activity Dashboard using PyGithub
Tracks team member contributions across repositories
"""
from github import Github
from datetime import datetime, timedelta
import json
import os

# Read the token from the environment rather than hard-coding it (see Best Practices)
GITHUB_TOKEN = os.environ["GITHUB_TOKEN"]
ORG = "your-org"
DAYS = 30
def get_user_activity(g, org_name, username, since_date):
"""Get activity for a specific user"""
activity = {
"username": username,
"commits": 0,
"prs_opened": 0,
"prs_merged": 0,
"issues_opened": 0,
"issues_closed": 0,
"reviews": 0,
"comments": 0
}
org = g.get_organization(org_name)
# Iterate through org repositories
for repo in org.get_repos():
try:
# Count commits
commits = repo.get_commits(author=username, since=since_date)
activity["commits"] += commits.totalCount
            # Count PRs and reviews in a single pass (the list is newest-first by default)
            for pr in repo.get_pulls(state='all', sort='created', direction='desc'):
                if pr.created_at < since_date:
                    break  # everything beyond this point is older than the window
                if pr.user and pr.user.login == username:
                    activity["prs_opened"] += 1
                    if pr.merged:
                        activity["prs_merged"] += 1
                for review in pr.get_reviews():
                    if review.user and review.user.login == username:
                        activity["reviews"] += 1
            # Count issues (the issues endpoint also returns PRs, so exclude them)
            for issue in repo.get_issues(state='all', creator=username, since=since_date):
                if not issue.pull_request:
                    activity["issues_opened"] += 1
                    if issue.state == 'closed':
                        activity["issues_closed"] += 1
except Exception as e:
# Skip repositories where user has no access or other errors
continue
return activity
def generate_team_dashboard(org_name, days=30):
"""Generate team activity dashboard"""
g = Github(GITHUB_TOKEN)
org = g.get_organization(org_name)
    # Use an aware datetime so comparisons with PyGithub 2.x timestamps work
    since_date = datetime.now().astimezone() - timedelta(days=days)
print(f"GitHub Team Activity Dashboard")
print(f"Organization: {org_name}")
print(f"Period: Last {days} days (since {since_date.strftime('%Y-%m-%d')})")
print(f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print("=" * 100)
print()
# Get all team members
members = org.get_members()
activities = []
for member in members:
print(f"Analyzing {member.login}...")
activity = get_user_activity(g, org_name, member.login, since_date)
activities.append(activity)
# Sort by total activity
activities.sort(key=lambda x: x['commits'] + x['prs_opened'] + x['issues_opened'], reverse=True)
# Print formatted table
print()
print(f"{'Username':<20} {'Commits':<10} {'PRs':<10} {'Merged':<10} {'Issues':<10} {'Reviews':<10}")
print("-" * 100)
for activity in activities:
print(f"{activity['username']:<20} "
f"{activity['commits']:<10} "
f"{activity['prs_opened']:<10} "
f"{activity['prs_merged']:<10} "
f"{activity['issues_opened']:<10} "
f"{activity['reviews']:<10}")
# Summary statistics
print()
print("=" * 100)
print("📊 Team Summary:")
total_commits = sum(a['commits'] for a in activities)
total_prs = sum(a['prs_opened'] for a in activities)
total_issues = sum(a['issues_opened'] for a in activities)
total_reviews = sum(a['reviews'] for a in activities)
print(f"Total commits: {total_commits}")
print(f"Total PRs opened: {total_prs}")
print(f"Total issues opened: {total_issues}")
print(f"Total code reviews: {total_reviews}")
print(f"Active contributors: {len([a for a in activities if a['commits'] > 0])}")
# Save to JSON
output_file = f"team_activity_{datetime.now().strftime('%Y%m%d')}.json"
with open(output_file, 'w') as f:
json.dump({
"generated_at": datetime.now().isoformat(),
"period_days": days,
"activities": activities
}, f, indent=2)
print(f"\n✅ Detailed report saved to {output_file}")
if __name__ == "__main__":
generate_team_dashboard(ORG, DAYS)
Pros and Cons of PyGithub
Pros:
- ✅ Clean, Pythonic API - Intuitive object-oriented interface
- ✅ Type hints - Better IDE support and code completion
- ✅ Automatic pagination - Handles pagination transparently
- ✅ Built-in rate limiting - Automatic rate limit handling
- ✅ Actively maintained - Mature, community-maintained library (not an official GitHub product)
- ✅ Comprehensive - Covers most GitHub API features
- ✅ Easy authentication - Simple auth setup
Cons:
- ❌ Python only - Limited to Python projects
- ❌ Dependency required - Must install library
- ❌ Memory usage - Can consume more memory with large datasets
- ❌ Learning curve - Need to understand the object model
- ❌ Some API lag - New GitHub features may take time to be added
Comparison Matrix
| Feature | GitHub CLI | REST API | PyGithub |
|---|---|---|---|
| Setup Complexity | Low (single command) | Medium (HTTP client) | Low (pip install) |
| Language | Shell/Bash | Any | Python only |
| Code Verbosity | Low for simple tasks | High | Medium |
| Type Safety | None | None | Partial (with hints) |
| Pagination | Manual or built-in flags | Manual | Automatic |
| Rate Limiting | Manual handling | Manual handling | Automatic |
| Authentication | Built-in login flow | Manual token management | Simple Auth object |
| Best For | Quick scripts, CLI users | Multi-language projects | Python applications |
| Performance | Good for small queries | Best (direct HTTP) | Good |
| Maintainability | Poor for complex logic | Medium | Excellent |
| Documentation | Excellent | Excellent | Good |
| Community | Large | Very large | Large |
When to Use Each Method
Use GitHub CLI When:
- ✅ Writing quick shell scripts
- ✅ Need interactive exploration
- ✅ Working in terminal-centric workflows
- ✅ Want minimal setup
- ✅ Running one-off queries
- ✅ Integrating with existing bash scripts
Use REST API When:
- ✅ Working in non-Python languages
- ✅ Need maximum performance
- ✅ Building microservices
- ✅ Want no dependencies
- ✅ Require fine-grained control
- ✅ Implementing custom retry logic
Use PyGithub When:
- ✅ Building Python applications
- ✅ Need clean, maintainable code
- ✅ Want automatic pagination
- ✅ Prefer object-oriented approach
- ✅ Need type hints and IDE support
- ✅ Building long-term reporting systems
Best Practices Across All Methods
1. Secure Token Management
Never commit tokens to version control:
# Use environment variables
export GITHUB_TOKEN="ghp_yourtoken"
# Use .env files (add to .gitignore)
echo "GITHUB_TOKEN=ghp_yourtoken" > .env
# Use secrets managers
aws secretsmanager get-secret-value --secret-id github-token
Python example with python-dotenv:
from dotenv import load_dotenv
import os
load_dotenv()
GITHUB_TOKEN = os.getenv('GITHUB_TOKEN')
2. Rate Limiting
GitHub has rate limits (5,000 requests/hour for authenticated users):
Check rate limit status:
# CLI
gh api rate_limit
# Python
rate_limit = g.get_rate_limit()
print(f"Remaining: {rate_limit.core.remaining}/{rate_limit.core.limit}")
print(f"Reset at: {rate_limit.core.reset}")
Handle rate limiting:
import time
from datetime import datetime, timezone
from github import RateLimitExceededException
try:
    # Your API calls
    repos = g.get_user().get_repos()
except RateLimitExceededException:
    # Wait until the limit resets; PyGithub 2.x returns timezone-aware datetimes
    reset_time = g.get_rate_limit().core.reset
    sleep_time = (reset_time - datetime.now(timezone.utc)).total_seconds() + 60
    print(f"Rate limit exceeded. Sleeping for {sleep_time:.0f} seconds")
    time.sleep(sleep_time)
3. Pagination
Always handle pagination for complete results:
# PyGithub - automatic pagination
for repo in org.get_repos():
print(repo.name)
# Manual pagination with the REST API (requires the `requests` package)
import requests

url = "https://api.github.com/orgs/your-org/repos"
headers = {"Authorization": f"token {GITHUB_TOKEN}"}
page = 1
while True:
    response = requests.get(url, headers=headers, params={"page": page, "per_page": 100})
    response.raise_for_status()
    data = response.json()
    if not data:  # an empty page means there are no more results
        break
    # Process data
    page += 1
4. Error Handling
Implement robust error handling:
from github import GithubException, RateLimitExceededException
try:
repo = g.get_repo("owner/repo")
issues = repo.get_issues(state='open')
for issue in issues:
print(issue.title)
except RateLimitExceededException:
print("Rate limit exceeded")
except GithubException as e:
print(f"GitHub API error: {e.status} - {e.data}")
except Exception as e:
print(f"Unexpected error: {e}")
5. Performance Optimization
Fetch only what you need: PyGithub lazy-loads objects, and accessing an attribute that is not part of the list response triggers an extra API request per object:
# Listing endpoints already include common fields like number and title
repo = g.get_repo("owner/repo")
issues = repo.get_issues(state='open')
for issue in issues:
    print(f"{issue.number}: {issue.title}")  # avoid attributes that force a full fetch
Cache data when appropriate, and give the cache an expiry so reports don't go stale:
import pickle
import time
from pathlib import Path

cache_file = Path("repo_cache.pkl")
MAX_AGE = 3600  # seconds before the cached data is considered stale

if cache_file.exists() and time.time() - cache_file.stat().st_mtime < MAX_AGE:
    with open(cache_file, 'rb') as f:
        repo_data = pickle.load(f)
else:
    repo_data = fetch_repo_data()
    with open(cache_file, 'wb') as f:
        pickle.dump(repo_data, f)
6. Logging
Implement logging for troubleshooting:
import logging
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
handlers=[
logging.FileHandler('github_report.log'),
logging.StreamHandler()
]
)
logger = logging.getLogger(__name__)
logger.info("Starting report generation")
logger.debug(f"Fetching data for repo: {repo_name}")
logger.error(f"Failed to fetch data: {error}")
Advanced Topics
GraphQL API
For complex queries, consider GitHub’s GraphQL API:
import requests
query = """
{
repository(owner: "owner", name: "repo") {
issues(first: 10, states: OPEN) {
nodes {
number
title
author {
login
}
labels(first: 5) {
nodes {
name
}
}
}
}
}
}
"""
headers = {
"Authorization": f"bearer {GITHUB_TOKEN}",
"Content-Type": "application/json"
}
response = requests.post(
"https://api.github.com/graphql",
json={"query": query},
headers=headers
)
data = response.json()
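The query above fetches only the first 10 issues. To retrieve everything, GraphQL uses cursor pagination: request `pageInfo { hasNextPage endCursor }` and pass the cursor back as the `after` argument on the next request. A minimal sketch of following cursors (the owner/name values and the helper names are placeholders, not part of any library):

```python
import requests

PAGED_QUERY = """
query($owner: String!, $name: String!, $cursor: String) {
  repository(owner: $owner, name: $name) {
    issues(first: 100, after: $cursor, states: OPEN) {
      pageInfo { hasNextPage endCursor }
      nodes { number title }
    }
  }
}
"""

def take_page(issues_conn):
    """Extract this page's nodes and the cursor for the next request (None when done)."""
    info = issues_conn["pageInfo"]
    return issues_conn["nodes"], (info["endCursor"] if info["hasNextPage"] else None)

def fetch_all_open_issues(token, owner, name):
    """Follow cursors until hasNextPage is false, accumulating every open issue."""
    cursor, issues = None, []
    while True:
        resp = requests.post(
            "https://api.github.com/graphql",
            json={"query": PAGED_QUERY,
                  "variables": {"owner": owner, "name": name, "cursor": cursor}},
            headers={"Authorization": f"bearer {token}"},
        )
        resp.raise_for_status()
        nodes, cursor = take_page(resp.json()["data"]["repository"]["issues"])
        issues.extend(nodes)
        if cursor is None:
            return issues
```

Requesting 100 items per page (the maximum for most connections) keeps the number of round trips low.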
Webhooks for Real-Time Reporting
Instead of polling, use webhooks for real-time updates:
from flask import Flask, request, jsonify
app = Flask(__name__)
@app.route('/webhook', methods=['POST'])
def github_webhook():
event = request.headers.get('X-GitHub-Event')
payload = request.json
if event == 'push':
# Handle push event
commits = payload['commits']
print(f"Received {len(commits)} commits")
elif event == 'pull_request':
# Handle PR event
action = payload['action']
pr = payload['pull_request']
print(f"PR #{pr['number']} was {action}")
return jsonify({"status": "success"}), 200
if __name__ == '__main__':
app.run(port=5000)
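As written, the endpoint accepts any POST. GitHub signs each delivery with the webhook secret you configure and sends the HMAC digest in the `X-Hub-Signature-256` header, and verifying it before processing the payload is strongly recommended. A sketch of the check (the secret value is a placeholder you would load from the environment):

```python
import hashlib
import hmac

WEBHOOK_SECRET = b"replace-with-your-webhook-secret"  # placeholder

def verify_signature(secret: bytes, payload: bytes, signature_header: str) -> bool:
    """Check GitHub's X-Hub-Signature-256 header ('sha256=<hex digest>' of the raw body)."""
    expected = "sha256=" + hmac.new(secret, payload, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking information through timing differences
    return hmac.compare_digest(expected, signature_header or "")
```

Inside the Flask route, call it with `request.get_data()` (the raw body, not the parsed JSON) and `request.headers.get('X-Hub-Signature-256')`, and return a 401 when verification fails.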
Troubleshooting
Common Issues
1. Authentication Failed
Error: Bad credentials (401)
- Verify token is correct and not expired
- Check token has required scopes
- Ensure token is properly formatted (no extra spaces)
2. Rate Limit Exceeded
Error: API rate limit exceeded
- Wait for rate limit to reset
- Use authentication (higher limits)
- Implement exponential backoff
- Cache results when possible
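For the exponential backoff mentioned above, a small generic retry wrapper is usually enough. A sketch (the delay and retry values are illustrative, not tuned):

```python
import random
import time

def with_backoff(fn, max_retries=5, base_delay=1.0):
    """Call fn, retrying on any exception with exponentially growing, jittered delays."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            # 1s, 2s, 4s, ... plus jitter so concurrent clients don't retry in lockstep
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```

Usage is as simple as `repos = with_backoff(lambda: list(g.get_user().get_repos()))`.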
3. Resource Not Found
Error: Not Found (404)
- Verify repository/organization name
- Check token has access to private resources
- Ensure resource exists
4. Permission Denied
Error: Forbidden (403)
- Token missing required scopes
- User lacks repository access
- Organization settings restrict API access
Debug Tips
Enable debug logging:
import logging
logging.basicConfig(level=logging.DEBUG)
# For PyGithub
from github import enable_console_debug_logging
enable_console_debug_logging()
Test API access:
# Test with curl
curl -H "Authorization: token $GITHUB_TOKEN" https://api.github.com/user
# Check token scopes (classic PATs only; HTTP/2 lowercases header names, so match case-insensitively)
curl -sI -H "Authorization: token $GITHUB_TOKEN" https://api.github.com/user | grep -i x-oauth-scopes
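The same scope check can be done from Python. This sketch reads the `X-OAuth-Scopes` response header; note that only classic personal access tokens report scopes this way (fine-grained tokens omit the header), and the helper names are illustrative:

```python
import requests

def parse_scopes(header_value):
    """Split a comma-separated X-OAuth-Scopes value into a list of scope names."""
    return [s.strip() for s in (header_value or "").split(",") if s.strip()]

def token_scopes(token):
    """Return the scopes GitHub reports for a classic personal access token."""
    resp = requests.get(
        "https://api.github.com/user",
        headers={"Authorization": f"token {token}"},
    )
    resp.raise_for_status()
    # requests' header lookup is case-insensitive, so HTTP/2's lowercase names are fine
    return parse_scopes(resp.headers.get("X-OAuth-Scopes"))
```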
Conclusion
GitHub provides powerful tools for extracting data for reporting and analytics. The right choice depends on your specific needs:
- GitHub CLI excels at quick queries and shell scripting
- REST API offers maximum flexibility for any language
- PyGithub provides the cleanest Python experience
All three methods can authenticate securely using Personal Access Tokens, support comprehensive data extraction, and can be integrated into automated reporting workflows.
Start with the GitHub CLI for exploration, move to PyGithub for production Python applications, or use the REST API when working in other languages or requiring maximum control.
Key Takeaways
- ✅ Authentication is critical - Use Personal Access Tokens with appropriate scopes
- ✅ Respect rate limits - Implement rate limit handling and caching
- ✅ Handle errors gracefully - Expect API failures and retry appropriately
- ✅ Choose the right tool - Match the method to your use case
- ✅ Secure your tokens - Never commit credentials to version control
- ✅ Start simple - Begin with basic queries and add complexity as needed
- ✅ Automate reporting - Schedule regular reports for consistent insights
Next Steps
- Explore GitHub REST API documentation
- Try PyGithub documentation
- Read GitHub CLI manual
- Experiment with GitHub GraphQL API
- Build custom dashboards with your preferred visualization tools
- Integrate GitHub data with your existing reporting systems