Retrieving Data from GitHub for Reporting: CLI, REST API, and Python SDK Compared

READER BEWARE: THE FOLLOWING WAS WRITTEN ENTIRELY BY AI WITHOUT HUMAN EDITING.

Introduction

GitHub isn’t just a code repository—it’s a treasure trove of project data, analytics, and configuration information. Whether you need to generate reports on repository activity, audit security configurations, track issue metrics, or analyze team productivity, GitHub provides multiple ways to extract this data programmatically.

This comprehensive guide explores three primary methods for retrieving GitHub data for reporting purposes:

  1. GitHub CLI (gh) - A command-line interface for quick queries and scripts
  2. GitHub REST API - Direct HTTP access for maximum flexibility and control
  3. PyGithub - A widely used, community-maintained Python library for object-oriented GitHub interaction

We’ll compare these approaches, show how to authenticate with each, and provide practical examples for common reporting scenarios.

Why Extract GitHub Data Programmatically?

Common Use Cases

  • Repository Analytics: Track commits, pull requests, issues, and contributor activity
  • Security Auditing: Review access permissions, scan for vulnerabilities, and monitor security alerts
  • Team Metrics: Measure code review turnaround times, issue resolution, and sprint velocity
  • Configuration Management: Document repository settings, branch protection rules, and webhooks
  • Compliance Reporting: Generate evidence for audits and regulatory requirements
  • Custom Dashboards: Build tailored visualizations beyond GitHub’s built-in insights
  • Automated Notifications: Alert on specific events or threshold breaches

Benefits of Automation

  • Consistency: Eliminate manual errors and ensure repeatable processes
  • Scalability: Process data across hundreds of repositories simultaneously
  • Timeliness: Schedule regular reports and real-time monitoring
  • Integration: Combine GitHub data with other systems (Jira, Slack, etc.)
  • Historical Analysis: Track trends and patterns over time

Method 1: GitHub CLI (gh)

The GitHub CLI is the fastest way to get started with GitHub automation. It’s perfect for quick queries, shell scripts, and interactive exploration.

Installation

macOS:

brew install gh

Linux (Debian/Ubuntu):

curl -fsSL https://cli.github.com/packages/githubcli-archive-keyring.gpg | sudo dd of=/usr/share/keyrings/githubcli-archive-keyring.gpg
echo "deb [arch=$(dpkg --print-architecture) signed-by=/usr/share/keyrings/githubcli-archive-keyring.gpg] https://cli.github.com/packages stable main" | sudo tee /etc/apt/sources.list.d/github-cli.list > /dev/null
sudo apt update
sudo apt install gh

Windows (Scoop):

scoop install gh

From Binary:

# Download from https://github.com/cli/cli/releases
# Extract and place in your PATH

Authentication

The GitHub CLI supports multiple authentication methods:

# Interactive authentication (recommended for getting started)
gh auth login

# Authenticate with a token
gh auth login --with-token < token.txt

# Or use environment variable
export GITHUB_TOKEN="ghp_yourpersonalaccesstoken"
gh auth status

During interactive authentication, you’ll choose:

  • GitHub.com or GitHub Enterprise Server
  • HTTPS or SSH protocol
  • Authentication method (web browser or token)

Creating a Personal Access Token:

  1. Go to GitHub Settings → Developer settings → Personal access tokens → Tokens (classic)
  2. Click “Generate new token”
  3. Select appropriate scopes (repo, read:org, read:user, etc.)
  4. Save the token securely

Basic Usage

# View current user
gh auth status

# List repositories
gh repo list

# View repository details
gh repo view owner/repo

# List issues
gh issue list --repo owner/repo

# List pull requests
gh pr list --repo owner/repo

# View GitHub Actions workflows
gh workflow list --repo owner/repo
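These commands also compose well from Python when a report needs both shell convenience and programmatic post-processing. A minimal sketch, assuming `gh` is installed and authenticated (`build_gh_cmd` and `gh_json` are illustrative helper names, not part of the CLI):

```python
import json
import subprocess

def build_gh_cmd(subcommand, repo, json_fields=()):
    """Assemble a gh invocation that emits JSON via --json."""
    cmd = ["gh", *subcommand.split(), "--repo", repo]
    if json_fields:
        cmd += ["--json", ",".join(json_fields)]
    return cmd

def gh_json(subcommand, repo, json_fields):
    """Run gh and parse its JSON output (requires gh on PATH)."""
    cmd = build_gh_cmd(subcommand, repo, json_fields)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(result.stdout)

# e.g. gh_json("pr list", "owner/repo", ["number", "title"])
```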

Reporting Examples with GitHub CLI

1. Repository Activity Report

#!/bin/bash
# repo-activity.sh - Generate repository activity report

REPO="owner/repo"
OUTPUT="activity_report.txt"

{
  echo "GitHub Repository Activity Report"
  echo "Repository: $REPO"
  echo "Generated: $(date)"
  echo "================================"
  echo ""
  
  echo "Recent Commits (Last 7 Days):"
  # `date -d` is GNU date; on macOS use: date -v-7d +%Y-%m-%d
  gh api "repos/$REPO/commits?since=$(date -d '7 days ago' -I)T00:00:00Z" \
    --jq '.[] | "\(.commit.author.date) - \(.commit.author.name): \(.commit.message | split("\n")[0])"'
  echo ""
  
  echo "Open Pull Requests:"
  gh pr list --repo "$REPO" --state open --json number,title,author,createdAt \
    --jq '.[] | "#\(.number) - \(.title) by @\(.author.login) (created: \(.createdAt))"'
  echo ""
  
  echo "Recently Closed Issues:"
  gh issue list --repo "$REPO" --state closed --limit 10 \
    --json number,title,closedAt \
    --jq '.[] | "#\(.number) - \(.title) (closed: \(.closedAt))"'
  echo ""
  
  echo "Top Contributors (All Time):"
  # Note: stats/contributors returns all-time totals and may respond 202
  # while GitHub computes them; re-run if the output is empty
  gh api "repos/$REPO/stats/contributors" \
    --jq 'sort_by(-.total) | .[:5] | .[] | "\(.author.login): \(.total) commits"'
    
} > "$OUTPUT"

echo "Report saved to $OUTPUT"

2. Pull Request Metrics

#!/bin/bash
# pr-metrics.sh - Calculate PR review turnaround times

REPO="owner/repo"

gh pr list --repo "$REPO" --state closed --limit 50 --json number,createdAt,closedAt,title | \
jq -r '
  .[] | 
  {
    number: .number,
    title: .title,
    created: .createdAt,
    closed: .closedAt,
    hours: (((.closedAt | fromdateiso8601) - (.createdAt | fromdateiso8601)) / 3600 | floor)
  } | 
  "PR #\(.number): \(.hours)h - \(.title)"
' | sort -t: -k2 -n

echo ""
echo "Average turnaround time:"
gh pr list --repo "$REPO" --state closed --limit 50 --json createdAt,closedAt | \
jq -r '[.[] | (((.closedAt | fromdateiso8601) - (.createdAt | fromdateiso8601)) / 3600)] | add / length | floor | "\(.) hours"'
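The same turnaround arithmetic is easy to unit-test in Python if the jq feels opaque. This standalone helper (an illustrative sketch, not part of the script above) mirrors the `fromdateiso8601` math:

```python
from datetime import datetime

def turnaround_hours(created_iso, closed_iso):
    """Whole hours between two GitHub ISO-8601 timestamps (floored)."""
    created = datetime.fromisoformat(created_iso.replace("Z", "+00:00"))
    closed = datetime.fromisoformat(closed_iso.replace("Z", "+00:00"))
    return int((closed - created).total_seconds() // 3600)

# e.g. turnaround_hours("2025-01-01T00:00:00Z", "2025-01-02T06:30:00Z") -> 30
```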

3. Security Audit Report

#!/bin/bash
# security-audit.sh - Repository security configuration audit

REPO="owner/repo"

echo "Security Audit Report for $REPO"
echo "Generated: $(date)"
echo "================================"
echo ""

echo "Branch Protection Rules:"
gh api "repos/$REPO/branches" --jq '.[] | select(.protected == true) | .name' | while read branch; do
  echo "Branch: $branch"
  gh api "repos/$REPO/branches/$branch/protection" --jq '
    "  - Require PR reviews: \(.required_pull_request_reviews != null)
     - Required approvals: \(.required_pull_request_reviews.required_approving_review_count // 0)
     - Dismiss stale reviews: \(.required_pull_request_reviews.dismiss_stale_reviews // false)
     - Require status checks: \(.required_status_checks != null)
     - Enforce for admins: \(.enforce_admins.enabled // false)"
  '
  echo ""
done

echo "Dependabot Alerts:"
gh api "repos/$REPO/dependabot/alerts" --jq '.[] | "[\(.state)] \(.security_advisory.severity | ascii_upcase): \(.security_advisory.summary)"'
echo ""

echo "Secret Scanning Alerts:"
gh api "repos/$REPO/secret-scanning/alerts" --jq '.[] | "[\(.state)] \(.secret_type): \(.html_url)"'
echo ""

echo "Repository Settings:"
gh api "repos/$REPO" --jq '
  "Visibility: \(.visibility)
   Default Branch: \(.default_branch)
   Allow Merge Commits: \(.allow_merge_commit)
   Allow Squash Merge: \(.allow_squash_merge)
   Allow Rebase Merge: \(.allow_rebase_merge)
   Delete Branch on Merge: \(.delete_branch_on_merge)
   Has Issues: \(.has_issues)
   Has Wiki: \(.has_wiki)
   Has Downloads: \(.has_downloads)"
'

4. Team Contribution Analysis

#!/bin/bash
# team-contributions.sh - Analyze team member contributions

ORG="your-org"
SINCE="2025-01-01"

echo "Team Contribution Analysis"
echo "Organization: $ORG"
echo "Period: Since $SINCE"
echo "================================"
echo ""

# Get all org members
gh api "orgs/$ORG/members" --jq '.[].login' | while read member; do
  echo "Analyzing $member..."
  
  # Count commits across all org repos
  commit_count=$(gh api "search/commits?q=author:$member+org:$ORG+author-date:>$SINCE" \
    --jq '.total_count')
  
  # Count PRs
  pr_count=$(gh api "search/issues?q=author:$member+org:$ORG+type:pr+created:>$SINCE" \
    --jq '.total_count')
  
  # Count issues opened
  issue_count=$(gh api "search/issues?q=author:$member+org:$ORG+type:issue+created:>$SINCE" \
    --jq '.total_count')
  
  echo "$member: $commit_count commits, $pr_count PRs, $issue_count issues"
done | sort -t: -k2 -rn

5. Workflow Run Statistics

#!/bin/bash
# workflow-stats.sh - GitHub Actions workflow statistics

REPO="owner/repo"

echo "GitHub Actions Workflow Statistics"
echo "Repository: $REPO"
echo "================================"
echo ""

gh api "repos/$REPO/actions/workflows" --jq '.workflows[] | .id' | while read workflow_id; do
  workflow_name=$(gh api "repos/$REPO/actions/workflows/$workflow_id" --jq '.name')
  
  echo "Workflow: $workflow_name"
  
  # Get last 10 runs
  runs=$(gh api "repos/$REPO/actions/workflows/$workflow_id/runs?per_page=10")
  
  total=$(echo "$runs" | jq '.workflow_runs | length')
  successful=$(echo "$runs" | jq '[.workflow_runs[] | select(.conclusion == "success")] | length')
  failed=$(echo "$runs" | jq '[.workflow_runs[] | select(.conclusion == "failure")] | length')
  
  echo "  Recent runs: $total"
  echo "  Successful: $successful"
  echo "  Failed: $failed"
  
  if [ "$total" -gt 0 ]; then
    success_rate=$(( successful * 100 / total ))
    echo "  Success rate: ${success_rate}%"
  fi
  
  echo ""
done
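The bash integer math above truncates toward zero; if the rate feeds a dashboard, a rounded Python equivalent is easier to test in isolation (a sketch; `runs` stands in for the `.workflow_runs[]` objects from the API response):

```python
def success_rate(runs):
    """Percent of finished runs whose conclusion is 'success' (rounded).
    In-progress runs (conclusion is None) are excluded from the denominator."""
    finished = [r for r in runs if r.get("conclusion")]
    if not finished:
        return None
    ok = sum(1 for r in finished if r["conclusion"] == "success")
    return round(100 * ok / len(finished))
```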

Pros and Cons of GitHub CLI

Pros:

  • Quick to get started - Simple installation and authentication
  • Great for scripts - Easily integrated into shell scripts
  • Interactive features - Built-in pagination, formatting, and filtering
  • No code required - Perfect for bash scripting
  • Built-in helpers - Convenient commands for common operations
  • Cross-platform - Works on macOS, Linux, and Windows

Cons:

  • Limited to shell environments - Not ideal for complex applications
  • Text processing required - Output often needs parsing with jq or awk
  • Less programmatic control - Harder to build complex logic
  • Shell script maintenance - Can become complex for large projects

Method 2: GitHub REST API

The REST API provides direct access to GitHub’s functionality via HTTP requests. It’s the most flexible option and works with any programming language.

Authentication

The GitHub REST API supports several authentication methods:

1. Personal Access Tokens

# Create a token at: https://github.com/settings/tokens

# Use in curl
curl -H "Authorization: token ghp_yourtoken" \
  https://api.github.com/user

# Use in HTTP headers
Authorization: token ghp_yourtoken
# Or for fine-grained tokens:
Authorization: Bearer github_pat_yourtoken

2. OAuth Apps

# Authenticate via OAuth flow
# Redirect users to:
https://github.com/login/oauth/authorize?client_id=YOUR_CLIENT_ID&scope=repo,read:org

# Exchange code for token
curl -X POST https://github.com/login/oauth/access_token \
  -d "client_id=YOUR_CLIENT_ID" \
  -d "client_secret=YOUR_CLIENT_SECRET" \
  -d "code=CODE_FROM_OAUTH"

3. GitHub App

# Install GitHub App and generate JWT
# Use JWT to get installation access token
curl -X POST https://api.github.com/app/installations/:installation_id/access_tokens \
  -H "Authorization: Bearer YOUR_JWT" \
  -H "Accept: application/vnd.github+json"
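Which Authorization scheme to send depends on the token type, as the header examples above hint. A small helper keyed on GitHub's documented token prefixes (`ghp_` classic, `github_pat_` fine-grained, `ghs_` app installation) keeps this in one place; note GitHub also accepts `Bearer` for all token types, so this mainly matters when matching older documentation:

```python
def auth_header(token):
    """Choose the Authorization scheme by token prefix."""
    if token.startswith(("github_pat_", "ghs_")):
        return {"Authorization": f"Bearer {token}"}
    # classic personal access tokens (ghp_) and OAuth tokens (gho_)
    return {"Authorization": f"token {token}"}
```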

Basic Usage with curl

# Set token as environment variable
export GITHUB_TOKEN="ghp_yourtoken"

# Get current user
curl -H "Authorization: token $GITHUB_TOKEN" \
  https://api.github.com/user

# List repositories
curl -H "Authorization: token $GITHUB_TOKEN" \
  https://api.github.com/user/repos

# Get repository details
curl -H "Authorization: token $GITHUB_TOKEN" \
  https://api.github.com/repos/owner/repo

# List issues
curl -H "Authorization: token $GITHUB_TOKEN" \
  https://api.github.com/repos/owner/repo/issues

# Get pull request
curl -H "Authorization: token $GITHUB_TOKEN" \
  https://api.github.com/repos/owner/repo/pulls/123
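All of the list endpoints above are paginated (30 items per page by default, up to 100 with `per_page`); the URL of the next page arrives in the `Link` response header. A dependency-free parser for that header (illustrative; the format follows RFC 5988 web linking):

```python
def parse_link_header(value):
    """Parse a Link header like '<url1>; rel="next", <url2>; rel="last"'
    into a {rel: url} dict."""
    links = {}
    for part in (value or "").split(","):
        if ";" not in part:
            continue
        url_part, *params = part.split(";")
        url = url_part.strip().strip("<>")
        for param in params:
            key, _, val = param.strip().partition("=")
            if key == "rel":
                links[val.strip('"')] = url
    return links
```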

Reporting Examples with REST API

1. Python Script for Repository Statistics

#!/usr/bin/env python3
"""
GitHub Repository Statistics Reporter
Fetches and analyzes repository metrics using the REST API
"""

import requests
import json
from datetime import datetime, timedelta
from collections import defaultdict

# Configuration
GITHUB_TOKEN = "ghp_yourtoken"
OWNER = "owner"
REPO = "repo"
BASE_URL = "https://api.github.com"

headers = {
    "Authorization": f"token {GITHUB_TOKEN}",
    "Accept": "application/vnd.github+json"
}

def get_commit_activity(owner, repo, since_days=30):
    """Get commit activity for the last N days"""
    since_date = (datetime.now() - timedelta(days=since_days)).isoformat()
    url = f"{BASE_URL}/repos/{owner}/{repo}/commits"
    
    params = {
        "since": since_date,
        "per_page": 100
    }
    
    commits = []
    page = 1
    
    while True:
        params['page'] = page
        response = requests.get(url, headers=headers, params=params)
        response.raise_for_status()
        
        page_commits = response.json()
        if not page_commits:
            break
            
        commits.extend(page_commits)
        page += 1
        
        # Check if there are more pages
        if 'Link' not in response.headers:
            break
    
    return commits

def analyze_commits(commits):
    """Analyze commit data"""
    authors = defaultdict(int)
    daily_commits = defaultdict(int)
    
    for commit in commits:
        author = commit['commit']['author']['name']
        date = commit['commit']['author']['date'][:10]
        
        authors[author] += 1
        daily_commits[date] += 1
    
    return {
        'total_commits': len(commits),
        'unique_authors': len(authors),
        'top_contributors': sorted(authors.items(), key=lambda x: x[1], reverse=True)[:5],
        'daily_activity': sorted(daily_commits.items())
    }

def get_pr_metrics(owner, repo, state='all'):
    """Get pull request metrics"""
    url = f"{BASE_URL}/repos/{owner}/{repo}/pulls"
    params = {
        "state": state,
        "per_page": 100
    }
    
    # Note: only the first page (up to 100 PRs) is fetched here
    response = requests.get(url, headers=headers, params=params)
    response.raise_for_status()
    prs = response.json()
    
    metrics = {
        'total': len(prs),
        'open': sum(1 for pr in prs if pr['state'] == 'open'),
        'merged': sum(1 for pr in prs if pr.get('merged_at')),
        'closed_unmerged': sum(1 for pr in prs if pr['state'] == 'closed' and not pr.get('merged_at'))
    }
    
    # Calculate average time to merge
    merge_times = []
    for pr in prs:
        if pr.get('merged_at'):
            created = datetime.fromisoformat(pr['created_at'].replace('Z', '+00:00'))
            merged = datetime.fromisoformat(pr['merged_at'].replace('Z', '+00:00'))
            hours = (merged - created).total_seconds() / 3600
            merge_times.append(hours)
    
    if merge_times:
        metrics['avg_merge_time_hours'] = sum(merge_times) / len(merge_times)
    
    return metrics

def get_issue_metrics(owner, repo):
    """Get issue metrics"""
    url = f"{BASE_URL}/repos/{owner}/{repo}/issues"
    params = {
        "state": "all",
        "per_page": 100
    }
    
    response = requests.get(url, headers=headers, params=params)
    response.raise_for_status()
    issues = response.json()
    
    # Filter out pull requests (they appear in issues endpoint)
    issues = [i for i in issues if 'pull_request' not in i]
    
    metrics = {
        'total': len(issues),
        'open': sum(1 for i in issues if i['state'] == 'open'),
        'closed': sum(1 for i in issues if i['state'] == 'closed')
    }
    
    # Calculate average time to close
    close_times = []
    for issue in issues:
        if issue['state'] == 'closed' and issue.get('closed_at'):
            created = datetime.fromisoformat(issue['created_at'].replace('Z', '+00:00'))
            closed = datetime.fromisoformat(issue['closed_at'].replace('Z', '+00:00'))
            hours = (closed - created).total_seconds() / 3600
            close_times.append(hours)
    
    if close_times:
        metrics['avg_close_time_hours'] = sum(close_times) / len(close_times)
    
    return metrics

def generate_report(owner, repo):
    """Generate comprehensive repository report"""
    print(f"GitHub Repository Report: {owner}/{repo}")
    print(f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    print("=" * 70)
    print()
    
    # Commit analysis
    print("📊 Commit Activity (Last 30 Days)")
    print("-" * 70)
    commits = get_commit_activity(owner, repo, since_days=30)
    analysis = analyze_commits(commits)
    
    print(f"Total commits: {analysis['total_commits']}")
    print(f"Unique authors: {analysis['unique_authors']}")
    print()
    print("Top contributors:")
    for author, count in analysis['top_contributors']:
        print(f"  {author}: {count} commits")
    print()
    
    # Pull request metrics
    print("🔀 Pull Request Metrics")
    print("-" * 70)
    pr_metrics = get_pr_metrics(owner, repo)
    print(f"Total PRs: {pr_metrics['total']}")
    print(f"Open: {pr_metrics['open']}")
    print(f"Merged: {pr_metrics['merged']}")
    print(f"Closed (unmerged): {pr_metrics['closed_unmerged']}")
    if 'avg_merge_time_hours' in pr_metrics:
        print(f"Average time to merge: {pr_metrics['avg_merge_time_hours']:.1f} hours")
    print()
    
    # Issue metrics
    print("🐛 Issue Metrics")
    print("-" * 70)
    issue_metrics = get_issue_metrics(owner, repo)
    print(f"Total issues: {issue_metrics['total']}")
    print(f"Open: {issue_metrics['open']}")
    print(f"Closed: {issue_metrics['closed']}")
    if 'avg_close_time_hours' in issue_metrics:
        print(f"Average time to close: {issue_metrics['avg_close_time_hours']:.1f} hours")

if __name__ == "__main__":
    generate_report(OWNER, REPO)
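Averages like `avg_merge_time_hours` above are easily skewed by a single long-lived PR; a median and rough percentile are often more honest. A standalone helper to bolt onto `get_pr_metrics` if desired (a sketch, not part of the script above):

```python
def merge_time_stats(hours):
    """Median and rough 90th percentile of a list of durations in hours."""
    if not hours:
        return {}
    s = sorted(hours)
    n = len(s)
    median = s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2
    p90 = s[min(n - 1, int(0.9 * n))]
    return {"median_hours": median, "p90_hours": p90}
```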

2. Bash Script Using curl and jq

#!/bin/bash
# github-api-report.sh - Generate report using GitHub REST API

GITHUB_TOKEN="ghp_yourtoken"
OWNER="owner"
REPO="repo"
API_URL="https://api.github.com"

# Helper function for API calls
github_api() {
  curl -s -H "Authorization: token $GITHUB_TOKEN" \
       -H "Accept: application/vnd.github+json" \
       "$API_URL/$1"
}

echo "GitHub Repository Report: $OWNER/$REPO"
echo "Generated: $(date)"
echo "======================================"
echo ""

# Repository info
echo "Repository Information:"
github_api "repos/$OWNER/$REPO" | jq -r '
  "Name: \(.name)
   Description: \(.description // "N/A")
   Language: \(.language // "N/A")
   Stars: \(.stargazers_count)
   Forks: \(.forks_count)
   Open Issues: \(.open_issues_count)
   Created: \(.created_at[:10])
   Last Updated: \(.updated_at[:10])"
'
echo ""

# Contributors
echo "Top 5 Contributors:"
github_api "repos/$OWNER/$REPO/contributors?per_page=5" | jq -r '
  .[] | "  \(.login): \(.contributions) contributions"
'
echo ""

# Recent releases
echo "Recent Releases:"
github_api "repos/$OWNER/$REPO/releases?per_page=3" | jq -r '
  .[] | "  \(.tag_name) - \(.name) (\(.published_at[:10]))"
'
echo ""

# Workflow runs
echo "Recent Workflow Runs:"
github_api "repos/$OWNER/$REPO/actions/runs?per_page=5" | jq -r '
  .workflow_runs[] | "  \(.name): \(.conclusion) (\(.created_at[:10]))"
'

3. Organization-Wide Reporting

#!/usr/bin/env python3
"""
GitHub Organization Reporter
Generates reports across all repositories in an organization
"""

import requests
import csv
from datetime import datetime

GITHUB_TOKEN = "ghp_yourtoken"
ORG = "your-org"
BASE_URL = "https://api.github.com"

headers = {
    "Authorization": f"token {GITHUB_TOKEN}",
    "Accept": "application/vnd.github+json"
}

def get_all_repos(org):
    """Get all repositories in an organization"""
    repos = []
    page = 1
    
    while True:
        url = f"{BASE_URL}/orgs/{org}/repos"
        params = {"per_page": 100, "page": page}
        
        response = requests.get(url, headers=headers, params=params)
        response.raise_for_status()
        
        page_repos = response.json()
        if not page_repos:
            break
            
        repos.extend(page_repos)
        page += 1
    
    return repos

def get_repo_metrics(owner, repo):
    """Get key metrics for a repository"""
    url = f"{BASE_URL}/repos/{owner}/{repo}"
    response = requests.get(url, headers=headers)
    
    if response.status_code != 200:
        return None
    
    data = response.json()
    
    # Count open issues excluding PRs (open_issues_count includes PRs);
    # only the first page is fetched, so this caps at 100
    issues_url = f"{BASE_URL}/repos/{owner}/{repo}/issues"
    issues_response = requests.get(issues_url, headers=headers,
                                   params={"state": "open", "per_page": 100})
    issues = issues_response.json() if issues_response.status_code == 200 else []
    open_issues = len([i for i in issues if "pull_request" not in i])
    
    return {
        "name": data["name"],
        "visibility": data["visibility"],
        "language": data.get("language") or "N/A",
        "stars": data["stargazers_count"],
        "forks": data["forks_count"],
        "open_issues": open_issues,
        "size_kb": data["size"],
        "created_at": data["created_at"][:10],
        "updated_at": data["updated_at"][:10],
        "default_branch": data["default_branch"],
        "archived": data["archived"]
    }

def generate_org_report(org, output_file="org_report.csv"):
    """Generate organization-wide report"""
    print(f"Fetching repositories for organization: {org}")
    repos = get_all_repos(org)
    print(f"Found {len(repos)} repositories")
    
    # Collect metrics for each repo
    metrics_list = []
    for i, repo in enumerate(repos, 1):
        print(f"Processing {i}/{len(repos)}: {repo['name']}")
        metrics = get_repo_metrics(org, repo['name'])
        if metrics:
            metrics_list.append(metrics)
    
    # Write to CSV
    if metrics_list:
        keys = metrics_list[0].keys()
        with open(output_file, 'w', newline='') as f:
            writer = csv.DictWriter(f, fieldnames=keys)
            writer.writeheader()
            writer.writerows(metrics_list)
        
        print(f"\nReport saved to {output_file}")
        
        # Print summary
        print(f"\nOrganization Summary:")
        print(f"Total repositories: {len(metrics_list)}")
        print(f"Total stars: {sum(m['stars'] for m in metrics_list)}")
        print(f"Total forks: {sum(m['forks'] for m in metrics_list)}")
        print(f"Archived repos: {sum(1 for m in metrics_list if m['archived'])}")
        
        # Language breakdown
        languages = {}
        for m in metrics_list:
            lang = m['language']
            languages[lang] = languages.get(lang, 0) + 1
        
        print(f"\nTop languages:")
        for lang, count in sorted(languages.items(), key=lambda x: x[1], reverse=True)[:5]:
            print(f"  {lang}: {count} repos")

if __name__ == "__main__":
    generate_org_report(ORG)

Pros and Cons of REST API

Pros:

  • Maximum flexibility - Full control over requests and responses
  • Language agnostic - Works with any HTTP client
  • Well documented - Comprehensive API documentation
  • Fine-grained control - Access to all GitHub features
  • No dependencies - Just HTTP requests

Cons:

  • More verbose - Requires more code than SDK
  • Manual pagination - Must handle pagination yourself
  • Rate limiting complexity - Need to implement rate limit handling
  • Authentication management - Must manage tokens manually
  • No type safety - Working with raw JSON
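The rate-limiting point deserves a concrete shape: GitHub reports quota in the `X-RateLimit-Remaining` and `X-RateLimit-Reset` (Unix epoch seconds) response headers. A pure function deciding how long to back off (illustrative; pass `now` explicitly when testing):

```python
import time

def seconds_until_reset(headers, now=None):
    """Seconds to sleep before retrying; 0 if quota remains."""
    if int(headers.get("X-RateLimit-Remaining", "1")) > 0:
        return 0
    reset = int(headers.get("X-RateLimit-Reset", "0"))
    now = time.time() if now is None else now
    return max(0, int(reset - now)) + 1  # +1 guards against clock skew
```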

Method 3: PyGithub (Python SDK)

PyGithub is a widely used, community-maintained Python library for GitHub's REST API (GitHub does not publish an official Python SDK). It provides a high-level, object-oriented interface that makes Python-based reporting clean and maintainable.

Installation

pip install PyGithub

Authentication

PyGithub supports the same authentication methods as the REST API:

from github import Github, Auth

# 1. Personal Access Token (recommended)
auth = Auth.Token("ghp_yourtoken")
g = Github(auth=auth)

# 2. Username and password (removed by GitHub; no longer works)
# g = Github("username", "password")

# 3. GitHub App
auth = Auth.AppAuth(app_id, private_key)
g = Github(auth=auth)

# 4. Using an environment variable
import os
token = os.environ.get('GITHUB_TOKEN')
g = Github(auth=Auth.Token(token))

# Test authentication
user = g.get_user()
print(f"Authenticated as: {user.login}")

Basic Usage

from github import Github, Auth

# Initialize
g = Github(auth=Auth.Token("ghp_yourtoken"))

# Get current user
user = g.get_user()
print(f"Hello {user.name}")

# Get a specific repository
repo = g.get_repo("owner/repo")
print(f"Repository: {repo.full_name}")
print(f"Stars: {repo.stargazers_count}")

# List user repositories
for repo in g.get_user().get_repos():
    print(repo.name)

# Get issues
issues = repo.get_issues(state='open')
for issue in issues:
    print(f"#{issue.number}: {issue.title}")

# Get pull requests
pulls = repo.get_pulls(state='all')
for pr in pulls:
    print(f"PR #{pr.number}: {pr.title}")

# Get commits
commits = repo.get_commits()
for commit in commits[:10]:
    print(f"{commit.sha[:7]}: {commit.commit.message.splitlines()[0]}")
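Slicing a PyGithub `PaginatedList` as above works, but `itertools.islice` makes the "stop after n items, don't fetch every page" intent explicit and works on any iterable:

```python
from itertools import islice

def take(iterable, n):
    """Materialize at most n items without exhausting the source."""
    return list(islice(iterable, n))

# e.g. for commit in take(repo.get_commits(), 10): ...
```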

Reporting Examples with PyGithub

1. Comprehensive Repository Report

#!/usr/bin/env python3
"""
GitHub Repository Comprehensive Report using PyGithub
"""

from github import Github
from datetime import datetime, timedelta
from collections import defaultdict
import sys

# Configuration
GITHUB_TOKEN = "ghp_yourtoken"
OWNER = "owner"
REPO = "repo"

def generate_comprehensive_report(owner, repo_name):
    """Generate a comprehensive repository report"""
    g = Github(GITHUB_TOKEN)
    repo = g.get_repo(f"{owner}/{repo_name}")
    
    print(f"=" * 80)
    print(f"GitHub Repository Comprehensive Report")
    print(f"Repository: {repo.full_name}")
    print(f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    print(f"=" * 80)
    print()
    
    # Basic Information
    print("📋 BASIC INFORMATION")
    print("-" * 80)
    print(f"Description: {repo.description or 'N/A'}")
    print(f"Homepage: {repo.homepage or 'N/A'}")
    print(f"Primary Language: {repo.language or 'N/A'}")
    print(f"Created: {repo.created_at.strftime('%Y-%m-%d')}")
    print(f"Last Updated: {repo.updated_at.strftime('%Y-%m-%d')}")
    print(f"Default Branch: {repo.default_branch}")
    print(f"Visibility: {'Private' if repo.private else 'Public'}")
    print(f"Archived: {'Yes' if repo.archived else 'No'}")
    print()
    
    # Statistics
    print("📊 STATISTICS")
    print("-" * 80)
    print(f"Stars: {repo.stargazers_count:,}")
    print(f"Forks: {repo.forks_count:,}")
    print(f"Watchers: {repo.watchers_count:,}")
    print(f"Open Issues: {repo.open_issues_count:,}")
    print(f"Repository Size: {repo.size:,} KB")
    print()
    
    # Contributors
    print("👥 TOP 10 CONTRIBUTORS")
    print("-" * 80)
    contributors = repo.get_contributors()
    for i, contributor in enumerate(contributors[:10], 1):
        print(f"{i:2}. {contributor.login:20} - {contributor.contributions:,} contributions")
    print()
    
    # Recent Commits (Last 30 Days)
    print("💻 COMMIT ACTIVITY (LAST 30 DAYS)")
    print("-" * 80)
    thirty_days_ago = datetime.now() - timedelta(days=30)
    commits = repo.get_commits(since=thirty_days_ago)
    
    commit_list = list(commits)
    commit_by_author = defaultdict(int)
    commit_by_day = defaultdict(int)
    
    for commit in commit_list:
        if commit.author:
            commit_by_author[commit.author.login] += 1
        day = commit.commit.author.date.strftime('%Y-%m-%d')
        commit_by_day[day] += 1
    
    print(f"Total commits: {len(commit_list)}")
    print(f"Unique authors: {len(commit_by_author)}")
    print()
    print("Most active committers:")
    for author, count in sorted(commit_by_author.items(), key=lambda x: x[1], reverse=True)[:5]:
        print(f"  {author}: {count} commits")
    print()
    
    # Pull Requests
    print("🔀 PULL REQUEST METRICS")
    print("-" * 80)
    
    open_prs = list(repo.get_pulls(state='open'))
    closed_prs = list(repo.get_pulls(state='closed'))[:50]  # Limit for performance
    
    print(f"Open PRs: {len(open_prs)}")
    print(f"Recently Closed PRs: {len(closed_prs)}")
    
    # Calculate average merge time for closed PRs
    merge_times = []
    merged_count = 0
    for pr in closed_prs:
        if pr.merged:
            merged_count += 1
            time_to_merge = (pr.merged_at - pr.created_at).total_seconds() / 3600
            merge_times.append(time_to_merge)
    
    if merge_times:
        avg_merge_time = sum(merge_times) / len(merge_times)
        print(f"Merged PRs: {merged_count}")
        print(f"Average time to merge: {avg_merge_time:.1f} hours ({avg_merge_time/24:.1f} days)")
    
    print()
    
    if open_prs:
        print("Currently Open PRs:")
        for pr in open_prs[:5]:
            age_days = (datetime.now() - pr.created_at.replace(tzinfo=None)).days
            print(f"  #{pr.number}: {pr.title[:60]} (age: {age_days} days)")
    print()
    
    # Issues
    print("🐛 ISSUE METRICS")
    print("-" * 80)
    
    open_issues = list(repo.get_issues(state='open'))
    # Filter out pull requests
    open_issues = [i for i in open_issues if not i.pull_request]
    
    print(f"Open Issues: {len(open_issues)}")
    
    # Group by labels
    label_counts = defaultdict(int)
    for issue in open_issues:
        for label in issue.labels:
            label_counts[label.name] += 1
    
    if label_counts:
        print()
        print("Issues by label:")
        for label, count in sorted(label_counts.items(), key=lambda x: x[1], reverse=True)[:10]:
            print(f"  {label}: {count}")
    
    print()
    
    # Languages
    print("💾 LANGUAGES")
    print("-" * 80)
    languages = repo.get_languages()
    total_bytes = sum(languages.values())
    
    for lang, bytes_count in sorted(languages.items(), key=lambda x: x[1], reverse=True):
        percentage = (bytes_count / total_bytes) * 100
        print(f"{lang:15} {percentage:5.1f}% ({bytes_count:,} bytes)")
    print()
    
    # Branch Protection
    print("🔒 BRANCH PROTECTION")
    print("-" * 80)
    try:
        default_branch = repo.get_branch(repo.default_branch)
        if default_branch.protected:
            protection = default_branch.get_protection()
            print(f"Branch '{repo.default_branch}' is protected:")
            
            if protection.required_pull_request_reviews:
                print(f"  - Required approving reviews: {protection.required_pull_request_reviews.required_approving_review_count}")
                print(f"  - Dismiss stale reviews: {protection.required_pull_request_reviews.dismiss_stale_reviews}")
            else:
                print(f"  - No PR review requirements")
            
            if protection.required_status_checks:
                print(f"  - Status checks required: {', '.join(protection.required_status_checks.contexts)}")
            else:
                print(f"  - No required status checks")
            
            print(f"  - Enforce for admins: {protection.enforce_admins.enabled}")
        else:
            print(f"Branch '{repo.default_branch}' is NOT protected")
    except Exception as e:
        print(f"Could not retrieve branch protection info: {e}")
    print()
    
    # Recent Releases
    print("🚀 RECENT RELEASES")
    print("-" * 80)
    releases = repo.get_releases()
    release_list = list(releases[:5])
    
    if release_list:
        for release in release_list:
            print(f"{release.tag_name:15} {release.title or '(no title)':30} ({release.published_at.strftime('%Y-%m-%d')})")
    else:
        print("No releases found")
    print()
    
    # Workflows
    print("⚙️  GITHUB ACTIONS WORKFLOWS")
    print("-" * 80)
    try:
        workflows = repo.get_workflows()
        for workflow in workflows:
            print(f"Workflow: {workflow.name}")
            print(f"  Path: {workflow.path}")
            print(f"  State: {workflow.state}")
            
            # Get recent runs
            runs = workflow.get_runs()
            recent_runs = list(runs[:5])
            
            if recent_runs:
                success = sum(1 for r in recent_runs if r.conclusion == 'success')
                failure = sum(1 for r in recent_runs if r.conclusion == 'failure')
                print(f"  Recent runs (last 5): ✅ {success} success, ❌ {failure} failure")
            print()
    except Exception as e:
        print(f"Could not retrieve workflow info: {e}")
    
    print("=" * 80)
    print("Report complete")
    print("=" * 80)

if __name__ == "__main__":
    if len(sys.argv) > 2:
        OWNER = sys.argv[1]
        REPO = sys.argv[2]
    
    generate_comprehensive_report(OWNER, REPO)

2. Organization Security Audit

#!/usr/bin/env python3
"""
GitHub Organization Security Audit using PyGithub
"""

from github import Github
from datetime import datetime
import csv

GITHUB_TOKEN = "ghp_yourtoken"
ORG = "your-org"

def audit_repository_security(repo):
    """Audit security settings for a repository"""
    audit = {
        "repo_name": repo.name,
        "visibility": "private" if repo.private else "public",
        "archived": repo.archived,
        "default_branch": repo.default_branch,
    }
    
    # Branch protection
    try:
        branch = repo.get_branch(repo.default_branch)
        audit["branch_protected"] = branch.protected
        
        if branch.protected:
            protection = branch.get_protection()
            audit["require_pr_reviews"] = protection.required_pull_request_reviews is not None
            if protection.required_pull_request_reviews:
                audit["required_approvals"] = protection.required_pull_request_reviews.required_approving_review_count
            else:
                audit["required_approvals"] = 0
            audit["enforce_admins"] = protection.enforce_admins.enabled if protection.enforce_admins else False
        else:
            audit["require_pr_reviews"] = False
            audit["required_approvals"] = 0
            audit["enforce_admins"] = False
    except Exception:  # branch lookup or protection fetch failed (no access / no protection)
        audit["branch_protected"] = False
        audit["require_pr_reviews"] = False
        audit["required_approvals"] = 0
        audit["enforce_admins"] = False
    
    # Repository settings
    audit["has_issues"] = repo.has_issues
    audit["has_wiki"] = repo.has_wiki
    audit["has_downloads"] = repo.has_downloads
    audit["allow_merge_commit"] = repo.allow_merge_commit
    audit["allow_squash_merge"] = repo.allow_squash_merge
    audit["allow_rebase_merge"] = repo.allow_rebase_merge
    audit["delete_branch_on_merge"] = repo.delete_branch_on_merge
    
    # Vulnerability alerts
    audit["has_vulnerability_alerts"] = repo.get_vulnerability_alert()  # True if Dependabot alerts are enabled
    
    return audit

def generate_org_security_audit(org_name, output_file="security_audit.csv"):
    """Generate security audit for all repositories in an organization"""
    g = Github(GITHUB_TOKEN)
    org = g.get_organization(org_name)
    
    print(f"GitHub Organization Security Audit")
    print(f"Organization: {org_name}")
    print(f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    print("=" * 80)
    print()
    
    repos = org.get_repos()
    audits = []
    
    for i, repo in enumerate(repos, 1):
        print(f"Auditing {i}: {repo.name}")
        audit = audit_repository_security(repo)
        audits.append(audit)
    
    # Write to CSV
    if audits:
        fieldnames = audits[0].keys()
        with open(output_file, 'w', newline='') as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(audits)
        
        print(f"\n✅ Audit report saved to {output_file}")
        
        # Summary statistics
        total = len(audits)
        protected = sum(1 for a in audits if a['branch_protected'])
        private = sum(1 for a in audits if a['visibility'] == 'private')
        archived = sum(1 for a in audits if a['archived'])
        
        print(f"\n📊 Summary:")
        print(f"Total repositories: {total}")
        print(f"Private repositories: {private} ({private/total*100:.1f}%)")
        print(f"Archived repositories: {archived}")
        print(f"Repositories with branch protection: {protected} ({protected/total*100:.1f}%)")
        print(f"Repositories WITHOUT branch protection: {total - protected} ⚠️")

if __name__ == "__main__":
    generate_org_security_audit(ORG)

3. Team Activity Dashboard

#!/usr/bin/env python3
"""
GitHub Team Activity Dashboard using PyGithub
Tracks team member contributions across repositories
"""

from github import Github
from datetime import datetime, timedelta
from collections import defaultdict
import json

GITHUB_TOKEN = "ghp_yourtoken"
ORG = "your-org"
DAYS = 30

def get_user_activity(g, org_name, username, since_date):
    """Get activity for a specific user"""
    activity = {
        "username": username,
        "commits": 0,
        "prs_opened": 0,
        "prs_merged": 0,
        "issues_opened": 0,
        "issues_closed": 0,
        "reviews": 0,
        "comments": 0
    }
    
    org = g.get_organization(org_name)
    
    # Iterate through org repositories
    for repo in org.get_repos():
        try:
            # Count commits
            commits = repo.get_commits(author=username, since=since_date)
            activity["commits"] += commits.totalCount
            
            # Count PRs and reviews in a single pass (scanning every PR for every
            # member is slow for large orgs; the search API is a better fit at scale)
            prs = repo.get_pulls(state='all')
            for pr in prs:
                if pr.created_at >= since_date:
                    if pr.user.login == username:
                        activity["prs_opened"] += 1
                        if pr.merged_at:  # merged_at avoids the extra API call that pr.merged triggers
                            activity["prs_merged"] += 1
                    # Count this member's reviews on the PR
                    for review in pr.get_reviews():
                        if review.user and review.user.login == username:
                            activity["reviews"] += 1
            
            # Count issues
            issues = repo.get_issues(state='all', creator=username, since=since_date)
            for issue in issues:
                if not issue.pull_request:  # Exclude PRs
                    activity["issues_opened"] += 1
                    if issue.state == 'closed':
                        activity["issues_closed"] += 1
            
        except Exception as e:
            # Skip repositories where user has no access or other errors
            continue
    
    return activity

def generate_team_dashboard(org_name, days=30):
    """Generate team activity dashboard"""
    g = Github(GITHUB_TOKEN)
    org = g.get_organization(org_name)
    
    # Timezone-aware, so it can be compared with PyGithub's UTC datetimes
    since_date = datetime.now().astimezone() - timedelta(days=days)
    
    print(f"GitHub Team Activity Dashboard")
    print(f"Organization: {org_name}")
    print(f"Period: Last {days} days (since {since_date.strftime('%Y-%m-%d')})")
    print(f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
    print("=" * 100)
    print()
    
    # Get all team members
    members = org.get_members()
    
    activities = []
    for member in members:
        print(f"Analyzing {member.login}...")
        activity = get_user_activity(g, org_name, member.login, since_date)
        activities.append(activity)
    
    # Sort by total activity
    activities.sort(key=lambda x: x['commits'] + x['prs_opened'] + x['issues_opened'], reverse=True)
    
    # Print formatted table
    print()
    print(f"{'Username':<20} {'Commits':<10} {'PRs':<10} {'Merged':<10} {'Issues':<10} {'Reviews':<10}")
    print("-" * 100)
    
    for activity in activities:
        print(f"{activity['username']:<20} "
              f"{activity['commits']:<10} "
              f"{activity['prs_opened']:<10} "
              f"{activity['prs_merged']:<10} "
              f"{activity['issues_opened']:<10} "
              f"{activity['reviews']:<10}")
    
    # Summary statistics
    print()
    print("=" * 100)
    print("📊 Team Summary:")
    total_commits = sum(a['commits'] for a in activities)
    total_prs = sum(a['prs_opened'] for a in activities)
    total_issues = sum(a['issues_opened'] for a in activities)
    total_reviews = sum(a['reviews'] for a in activities)
    
    print(f"Total commits: {total_commits}")
    print(f"Total PRs opened: {total_prs}")
    print(f"Total issues opened: {total_issues}")
    print(f"Total code reviews: {total_reviews}")
    print(f"Active contributors: {len([a for a in activities if a['commits'] > 0])}")
    
    # Save to JSON
    output_file = f"team_activity_{datetime.now().strftime('%Y%m%d')}.json"
    with open(output_file, 'w') as f:
        json.dump({
            "generated_at": datetime.now().isoformat(),
            "period_days": days,
            "activities": activities
        }, f, indent=2)
    
    print(f"\n✅ Detailed report saved to {output_file}")

if __name__ == "__main__":
    generate_team_dashboard(ORG, DAYS)

Pros and Cons of PyGithub

Pros:

  • Clean, Pythonic API - Intuitive object-oriented interface
  • Type hints - Better IDE support and code completion
  • Automatic pagination - Handles pagination transparently
  • Rate limit awareness - Surfaces rate limit status and raises a dedicated exception when limits are hit
  • Actively maintained - Popular, widely used community library (not an official GitHub product)
  • Comprehensive - Covers most GitHub API features
  • Easy authentication - Simple auth setup

Cons:

  • Python only - Limited to Python projects
  • Dependency required - Must install library
  • Memory usage - Can consume more memory with large datasets
  • Learning curve - Need to understand the object model
  • Some API lag - New GitHub features may take time to be added

Comparison Matrix

Feature            GitHub CLI                 REST API                  PyGithub
-----------------  -------------------------  ------------------------  ------------------------
Setup Complexity   Low (single command)       Medium (HTTP client)      Low (pip install)
Language           Shell/Bash                 Any                       Python only
Code Verbosity     Low for simple tasks       High                      Medium
Type Safety        None                       None                      Partial (with hints)
Pagination         Manual or built-in flags   Manual                    Automatic
Rate Limiting      Manual handling            Manual handling           Exception-based helpers
Authentication     Built-in login flow        Manual token management   Simple Auth object
Best For           Quick scripts, CLI users   Multi-language projects   Python applications
Performance        Good for small queries     Best (direct HTTP)        Good
Maintainability    Poor for complex logic     Medium                    Excellent
Documentation      Excellent                  Excellent                 Good
Community          Large                      Very large                Large

When to Use Each Method

Use GitHub CLI When:

  • ✅ Writing quick shell scripts
  • ✅ Need interactive exploration
  • ✅ Working in terminal-centric workflows
  • ✅ Want minimal setup
  • ✅ Running one-off queries
  • ✅ Integrating with existing bash scripts

Use REST API When:

  • ✅ Working in non-Python languages
  • ✅ Need maximum performance
  • ✅ Building microservices
  • ✅ Want no dependencies
  • ✅ Require fine-grained control
  • ✅ Implementing custom retry logic

Use PyGithub When:

  • ✅ Building Python applications
  • ✅ Need clean, maintainable code
  • ✅ Want automatic pagination
  • ✅ Prefer object-oriented approach
  • ✅ Need type hints and IDE support
  • ✅ Building long-term reporting systems

Best Practices Across All Methods

1. Secure Token Management

Never commit tokens to version control:

# Use environment variables
export GITHUB_TOKEN="ghp_yourtoken"

# Use .env files (add to .gitignore)
echo "GITHUB_TOKEN=ghp_yourtoken" > .env

# Use secrets managers
aws secretsmanager get-secret-value --secret-id github-token

Python example with python-dotenv:

from dotenv import load_dotenv
import os

load_dotenv()
GITHUB_TOKEN = os.getenv('GITHUB_TOKEN')
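Once loaded, the token can be validated before handing it to the client. A minimal sketch; the `load_token` helper is illustrative, and the usage comment assumes PyGithub ≥ 1.59, where the `Auth.Token` helper was introduced:

```python
import os


def load_token(env=os.environ):
    """Return the GitHub token from the environment, or raise a clear error."""
    token = env.get("GITHUB_TOKEN", "").strip()
    if not token:
        raise RuntimeError("GITHUB_TOKEN is not set")
    return token

# Usage with PyGithub (requires the library to be installed):
#   from github import Auth, Github
#   g = Github(auth=Auth.Token(load_token()))
```

Failing fast with a clear message beats the cryptic 401 you get when an empty string is silently passed through.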

2. Rate Limiting

GitHub enforces rate limits: 5,000 requests/hour for authenticated users, but only 60/hour unauthenticated:

Check rate limit status:

# CLI
gh api rate_limit

# Python
rate_limit = g.get_rate_limit()
print(f"Remaining: {rate_limit.core.remaining}/{rate_limit.core.limit}")
print(f"Reset at: {rate_limit.core.reset}")

Handle rate limiting:

import time
from datetime import datetime, timezone

from github import RateLimitExceededException

try:
    # Your API calls
    repos = g.get_user().get_repos()
except RateLimitExceededException:
    # Wait until the rate limit resets (reset is a timezone-aware UTC datetime)
    reset_time = g.get_rate_limit().core.reset
    sleep_time = (reset_time - datetime.now(timezone.utc)).total_seconds() + 60
    print(f"Rate limit exceeded. Sleeping for {sleep_time:.0f} seconds")
    time.sleep(sleep_time)
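For transient failures short of a full rate-limit lockout, exponential backoff with jitter is the usual pattern. A generic sketch (the retry loop and delay formula are illustrative, not a PyGithub feature):

```python
import random
import time


def with_backoff(func, max_attempts=5, base_delay=1.0):
    """Call func(), retrying with exponential backoff and jitter on failure."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of retries; let the caller see the error
            # Exponential delay (1s, 2s, 4s, ...) with up to 100% jitter
            # so many clients don't retry in lockstep
            delay = base_delay * (2 ** attempt) * (1 + random.random())
            time.sleep(delay)
```

Wrap any flaky call site, e.g. `repos = with_backoff(lambda: g.get_user().get_repos())`.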

3. Pagination

Always handle pagination for complete results:

# PyGithub - automatic pagination
for repo in org.get_repos():
    print(repo.name)

# Manual pagination with REST API
import requests

url = "https://api.github.com/orgs/your-org/repos"  # example endpoint
headers = {"Authorization": f"token {GITHUB_TOKEN}"}

page = 1
while True:
    response = requests.get(url, headers=headers, params={"page": page, "per_page": 100})
    data = response.json()
    if not data:
        break
    # Process data
    page += 1
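The empty-page check works, but GitHub's documented approach is to follow the `Link` response header, which `requests` already parses into `response.links`. A sketch of the helper; the surrounding loop in the comment assumes the same `requests`/`headers` setup as above:

```python
def next_page_url(links):
    """Given requests' parsed Link header (response.links), return the next page URL or None."""
    return links.get("next", {}).get("url")

# Typical loop:
#   url = "https://api.github.com/orgs/your-org/repos?per_page=100"
#   while url:
#       response = requests.get(url, headers=headers)
#       process(response.json())
#       url = next_page_url(response.links)
```

Following `Link` also survives endpoints where an empty page is not a reliable termination signal.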

4. Error Handling

Implement robust error handling:

from github import GithubException, RateLimitExceededException

try:
    repo = g.get_repo("owner/repo")
    issues = repo.get_issues(state='open')
    for issue in issues:
        print(issue.title)
except RateLimitExceededException:
    print("Rate limit exceeded")
except GithubException as e:
    print(f"GitHub API error: {e.status} - {e.data}")
except Exception as e:
    print(f"Unexpected error: {e}")

5. Performance Optimization

Fetch only the data you need. PyGithub loads many attributes lazily, so touching extra fields can trigger additional API requests:

# Only fetch needed fields
repo = g.get_repo("owner/repo")
issues = repo.get_issues(state='open')
# Process only what you need
for issue in issues:
    print(f"{issue.number}: {issue.title}")  # accessing unneeded attributes may cost extra API calls

Cache data when appropriate:

import pickle
from pathlib import Path

cache_file = Path("repo_cache.pkl")

if cache_file.exists():
    with open(cache_file, 'rb') as f:
        repo_data = pickle.load(f)
else:
    repo_data = fetch_repo_data()  # placeholder for your actual API calls
    with open(cache_file, 'wb') as f:
        pickle.dump(repo_data, f)
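The pickle cache above never expires. A time-to-live check based on the file's modification time keeps reports reasonably fresh; this is a sketch, and the one-hour default TTL is an arbitrary choice:

```python
import pickle
import time
from pathlib import Path


def load_cached(cache_file, fetch, ttl_seconds=3600):
    """Return cached data if the file is younger than ttl_seconds, else call fetch() and cache it."""
    path = Path(cache_file)
    if path.exists() and time.time() - path.stat().st_mtime < ttl_seconds:
        with open(path, "rb") as f:
            return pickle.load(f)
    data = fetch()
    with open(path, "wb") as f:
        pickle.dump(data, f)
    return data
```

Call it as `repo_data = load_cached("repo_cache.pkl", fetch_repo_data)`; only unpickle files your own process wrote, since pickle is not safe for untrusted input.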

6. Logging

Implement logging for troubleshooting:

import logging

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('github_report.log'),
        logging.StreamHandler()
    ]
)

logger = logging.getLogger(__name__)

logger.info("Starting report generation")
logger.debug(f"Fetching data for repo: {repo_name}")
logger.error(f"Failed to fetch data: {error}")

Advanced Topics

GraphQL API

For complex queries, consider GitHub’s GraphQL API:

import requests

query = """
{
  repository(owner: "owner", name: "repo") {
    issues(first: 10, states: OPEN) {
      nodes {
        number
        title
        author {
          login
        }
        labels(first: 5) {
          nodes {
            name
          }
        }
      }
    }
  }
}
"""

headers = {
    "Authorization": f"bearer {GITHUB_TOKEN}",
    "Content-Type": "application/json"
}

response = requests.post(
    "https://api.github.com/graphql",
    json={"query": query},
    headers=headers
)

data = response.json()
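GraphQL responses are deeply nested dictionaries, and errors arrive in an `errors` key alongside an HTTP 200 status, so they are easy to miss. A small helper keeps the traversal readable; this sketch matches the issue query above, and the output shape is my own choice:

```python
def extract_open_issues(data):
    """Flatten the GraphQL issues response into a list of simple dicts."""
    if "errors" in data:
        # GraphQL reports failures in the body, usually still with HTTP 200
        raise RuntimeError(f"GraphQL errors: {data['errors']}")
    nodes = data["data"]["repository"]["issues"]["nodes"]
    return [
        {
            "number": n["number"],
            "title": n["title"],
            "author": (n.get("author") or {}).get("login"),  # author is null for deleted users
            "labels": [label["name"] for label in n["labels"]["nodes"]],
        }
        for n in nodes
    ]
```

Used as `issues = extract_open_issues(response.json())` after the `requests.post` call above.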

Webhooks for Real-Time Reporting

Instead of polling, use webhooks for real-time updates:

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/webhook', methods=['POST'])
def github_webhook():
    event = request.headers.get('X-GitHub-Event')
    payload = request.json
    
    if event == 'push':
        # Handle push event
        commits = payload['commits']
        print(f"Received {len(commits)} commits")
    elif event == 'pull_request':
        # Handle PR event
        action = payload['action']
        pr = payload['pull_request']
        print(f"PR #{pr['number']} was {action}")
    
    return jsonify({"status": "success"}), 200

if __name__ == '__main__':
    app.run(port=5000)
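Before trusting a webhook payload, verify GitHub's `X-Hub-Signature-256` header: GitHub signs the raw request body with HMAC-SHA256 using the secret you configured on the webhook. A minimal verifier (stdlib only):

```python
import hashlib
import hmac


def verify_signature(secret, body, signature_header):
    """Return True if signature_header matches HMAC-SHA256 of body under secret.

    secret and body are bytes; signature_header looks like 'sha256=<hexdigest>'.
    """
    if not signature_header or not signature_header.startswith("sha256="):
        return False
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # Constant-time comparison to avoid leaking information via timing
    return hmac.compare_digest(expected, signature_header)
```

In the Flask handler above, check `verify_signature(WEBHOOK_SECRET, request.get_data(), request.headers.get('X-Hub-Signature-256'))` first and return 401 on failure (`WEBHOOK_SECRET` being whatever you configured on the webhook).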

Troubleshooting

Common Issues

1. Authentication Failed

Error: Bad credentials (401)
  • Verify token is correct and not expired
  • Check token has required scopes
  • Ensure token is properly formatted (no extra spaces)

2. Rate Limit Exceeded

Error: API rate limit exceeded
  • Wait for rate limit to reset
  • Use authentication (higher limits)
  • Implement exponential backoff
  • Cache results when possible

3. Resource Not Found

Error: Not Found (404)
  • Verify repository/organization name
  • Check token has access to private resources
  • Ensure resource exists

4. Permission Denied

Error: Forbidden (403)
  • Token missing required scopes
  • User lacks repository access
  • Organization settings restrict API access

Debug Tips

Enable debug logging:

import logging
logging.basicConfig(level=logging.DEBUG)

# For PyGithub
from github import enable_console_debug_logging
enable_console_debug_logging()

Test API access:

# Test with curl
curl -H "Authorization: token $GITHUB_TOKEN" https://api.github.com/user

# Check token scopes
curl -I -H "Authorization: token $GITHUB_TOKEN" https://api.github.com/user | grep X-OAuth-Scopes

Conclusion

GitHub provides powerful tools for extracting data for reporting and analytics. The right choice depends on your specific needs:

  • GitHub CLI excels at quick queries and shell scripting
  • REST API offers maximum flexibility for any language
  • PyGithub provides the cleanest Python experience

All three methods can authenticate securely using Personal Access Tokens, support comprehensive data extraction, and can be integrated into automated reporting workflows.

Start with the GitHub CLI for exploration, move to PyGithub for production Python applications, or use the REST API when working in other languages or requiring maximum control.

Key Takeaways

  1. Authentication is critical - Use Personal Access Tokens with appropriate scopes
  2. Respect rate limits - Implement rate limit handling and caching
  3. Handle errors gracefully - Expect API failures and retry appropriately
  4. Choose the right tool - Match the method to your use case
  5. Secure your tokens - Never commit credentials to version control
  6. Start simple - Begin with basic queries and add complexity as needed
  7. Automate reporting - Schedule regular reports for consistent insights

Next Steps

Resources