Converting Webpages to PDF from within a DevContainer: A Python Guide

READER BEWARE: THE FOLLOWING WAS WRITTEN ENTIRELY BY AI WITHOUT HUMAN EDITING.

Introduction

In modern DevOps and automation workflows, the ability to programmatically capture complete webpages as PDF files is increasingly valuable. Whether you’re archiving web content, generating reports from dashboards, or creating documentation from web-based tools, having reliable PDF conversion capabilities within a containerized development environment is essential.

This guide explores the most effective Python-based solutions for converting webpages to PDF from within a DevContainer, comparing their strengths, limitations, and ideal use cases. We’ll focus on tools that handle modern web applications with JavaScript, CSS, and dynamic content—because in today’s web landscape, static HTML rendering is rarely sufficient.

The Challenge: Modern Web Content

Before diving into solutions, it’s important to understand what makes webpage-to-PDF conversion challenging:

JavaScript Rendering: Modern websites rely heavily on JavaScript to render content
CSS Complexity: Advanced layouts, web fonts, and responsive designs need accurate rendering
Dynamic Content: AJAX requests, lazy loading, and animations require waiting for content to load
Authentication: Many sites require login or API tokens
Container Environment: Solutions must work in headless, resource-constrained environments
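
To make the JavaScript-rendering problem concrete, here is a minimal sketch of a heuristic that guesses whether a page's raw HTML is an empty shell that a script fills in later. The function name probably_needs_browser and the 200-character threshold are assumptions chosen for illustration, not an established rule.

```python
from html.parser import HTMLParser


class _TextAndScriptCounter(HTMLParser):
    """Counts visible text characters and <script> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.text_chars = 0
        self.script_tags = 0
        self._skip_depth = 0  # > 0 while inside <script>, <style> or <noscript>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_tags += 1
        if tag in ("script", "style", "noscript"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style", "noscript") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only count text that a reader would actually see
        if self._skip_depth == 0:
            self.text_chars += len(data.strip())


def probably_needs_browser(html: str, min_text_chars: int = 200) -> bool:
    """Guess whether the page relies on JavaScript to produce its content.

    Heuristic (an assumption): scripts are present but the static HTML
    carries very little visible text.
    """
    counter = _TextAndScriptCounter()
    counter.feed(html)
    return counter.script_tags > 0 and counter.text_chars < min_text_chars
```

A page that trips this check is unlikely to convert well with a tool that only reads static HTML, which is why the browser-based options come first in the comparison that follows.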

Solution Overview

We’ll compare five approaches, ranked by how well they handle modern web content:

Playwright - Browser automation with native PDF export
Selenium + Chrome/Firefox - Traditional browser automation
WeasyPrint - HTML/CSS to PDF converter
pdfkit/wkhtmltopdf - WebKit-based HTML to PDF
ReportLab + requests - Manual HTML parsing and PDF generation

1. Playwright: The Modern Choice

Overview

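Playwright is a modern browser automation library developed by Microsoft. It supports Chromium, Firefox, and WebKit, and is specifically designed for reliable, reproducible browser automation. Note that its PDF export, page.pdf(), is supported only in Chromium, so the examples below all launch Chromium.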

Why Playwright Excels for PDF Generation

Strengths:

Native browser rendering gives high visual fidelity (the PDF uses the page's print styles by default)
Full JavaScript execution support
Excellent handling of modern web frameworks (React, Vue, Angular)
Built-in wait mechanisms for dynamic content
Clean API designed for automation
Headless mode optimized for containers
Active development and excellent documentation

Installation in DevContainer:

# requirements.txt
playwright==1.40.0

# After pip install, install browser binaries
# playwright install chromium

Basic Example

#!/usr/bin/env python3
"""
Simple webpage to PDF conversion using Playwright.
"""
from typing import Optional

from playwright.sync_api import sync_playwright

def webpage_to_pdf(url: str, output_path: str, wait_for_selector: Optional[str] = None):
    """
    Convert a webpage to PDF using Playwright.
    
    Args:
        url: The webpage URL to convert
        output_path: Path where PDF will be saved
        wait_for_selector: Optional CSS selector to wait for before capturing
    """
    with sync_playwright() as p:
        # Launch browser in headless mode
        browser = p.chromium.launch(headless=True)
        
        # Create a new page
        page = browser.new_page()
        
        # Navigate to the URL
        page.goto(url, wait_until='networkidle')
        
        # Optionally wait for specific content
        if wait_for_selector:
            page.wait_for_selector(wait_for_selector, timeout=30000)
        
        # Generate PDF
        page.pdf(
            path=output_path,
            format='A4',
            print_background=True,
            margin={'top': '1cm', 'right': '1cm', 'bottom': '1cm', 'left': '1cm'}
        )
        
        browser.close()

# Example usage
if __name__ == "__main__":
    webpage_to_pdf(
        url="https://example.com",
        output_path="output.pdf",
        wait_for_selector=".main-content"
    )
    print("PDF generated successfully!")

Advanced Features

Authentication and Cookies:

from playwright.sync_api import sync_playwright

def pdf_with_auth(url: str, output_path: str, username: str, password: str):
    """Generate PDF from an authenticated page."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context()
        page = context.new_page()
        
        # Navigate to login page
        page.goto("https://example.com/login")
        
        # Fill in credentials
        page.fill('input[name="username"]', username)
        page.fill('input[name="password"]', password)
        page.click('button[type="submit"]')
        
        # Wait for navigation after login
        page.wait_for_url("**/dashboard")
        
        # Navigate to target page
        page.goto(url, wait_until='networkidle')
        
        # Generate PDF
        page.pdf(path=output_path, format='A4', print_background=True)
        
        browser.close()
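
For sites that accept an API token instead of a login form, a lighter option is to attach the token as an HTTP header on the browser context. The sketch below assumes a bearer-token scheme and a hypothetical environment variable named PDF_API_TOKEN; neither comes from a real service. Playwright is imported inside the function so the helper can be used without it.

```python
import os
from typing import Dict, Optional


def bearer_headers(token: Optional[str]) -> Dict[str, str]:
    """Build an Authorization header dict, or an empty dict when no token is set."""
    return {"Authorization": f"Bearer {token}"} if token else {}


def pdf_with_token(url: str, output_path: str, token_env: str = "PDF_API_TOKEN") -> None:
    """Render a token-protected page to PDF.

    PDF_API_TOKEN is a hypothetical variable name used for illustration.
    """
    from playwright.sync_api import sync_playwright  # imported lazily

    headers = bearer_headers(os.getenv(token_env))
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            # extra_http_headers are sent with every request made by this context
            context = browser.new_context(extra_http_headers=headers or None)
            page = context.new_page()
            page.goto(url, wait_until="networkidle")
            page.pdf(path=output_path, format="A4", print_background=True)
        finally:
            browser.close()
```

Because the header is attached to every request from the context, including requests to third-party hosts, this approach is best kept to pages that load resources only from origins you trust.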

Custom Viewport and Responsive Design:

from playwright.sync_api import sync_playwright

def pdf_with_custom_viewport(url: str, output_path: str):
    """Generate PDF with custom viewport size."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        
        # Set viewport to simulate specific device
        page = browser.new_page(viewport={'width': 1920, 'height': 1080})
        
        page.goto(url, wait_until='networkidle')
        
        # PDF options for landscape format
        page.pdf(
            path=output_path,
            format='A4',
            landscape=True,
            print_background=True,
            scale=0.8  # Adjust scale to fit content
        )
        
        browser.close()
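
Two further page.pdf() features are often useful for archiving: page headers and footers, and rendering with screen styles instead of print styles. The sketch below builds the keyword arguments in a small helper (build_pdf_options is a name invented here) so they can be inspected without launching a browser. The font size and margins are illustrative assumptions.

```python
import html


def build_pdf_options(output_path: str, title: str, use_footer: bool = True) -> dict:
    """Return keyword arguments for page.pdf(), optionally with header and footer."""
    options = {
        "path": output_path,
        "format": "A4",
        "print_background": True,
        # Extra top/bottom margin leaves room for the header and footer
        "margin": {"top": "2cm", "right": "1cm", "bottom": "2cm", "left": "1cm"},
    }
    if use_footer:
        box = '<div style="font-size:8px; width:100%; text-align:center;">'
        options["display_header_footer"] = True
        options["header_template"] = box + html.escape(title) + "</div>"
        # pageNumber and totalPages are classes that Chromium fills in at print time
        options["footer_template"] = (
            box
            + 'Page <span class="pageNumber"></span> of <span class="totalPages"></span>'
            + "</div>"
        )
    return options


def pdf_as_on_screen(url: str, output_path: str, title: str) -> None:
    """Render the page with its screen styles and a page-number footer."""
    from playwright.sync_api import sync_playwright  # imported lazily

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url, wait_until="networkidle")
            # page.pdf() applies print CSS by default; switch to screen CSS
            page.emulate_media(media="screen")
            page.pdf(**build_pdf_options(output_path, title))
        finally:
            browser.close()
```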

Handling Dynamic Content:

from playwright.sync_api import sync_playwright

def pdf_with_dynamic_content(url: str, output_path: str):
    """Handle lazy-loaded and dynamically rendered content."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        
        page.goto(url, wait_until='networkidle')
        
        # Wait for specific content to appear
        page.wait_for_selector('.data-loaded', timeout=30000)
        
        # Scroll to trigger lazy loading
        page.evaluate("""
            window.scrollTo(0, document.body.scrollHeight);
        """)
        
        # Wait for lazy-loaded content
        page.wait_for_timeout(2000)
        
        # Scroll back to top
        page.evaluate("window.scrollTo(0, 0)")
        
        page.pdf(path=output_path, format='A4', print_background=True)
        
        browser.close()
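
A single scroll to the bottom may not be enough for pages that keep appending content. One possible refinement, sketched here under the assumption that the page height stops growing once everything has loaded, is to scroll repeatedly until the height is stable. The loop takes plain callables so it can be exercised without a browser; scroll_page_fully shows how it might be wired to a Playwright page.

```python
from typing import Callable


def scroll_until_stable(
    get_height: Callable[[], int],
    scroll_to: Callable[[int], None],
    pause: Callable[[], None],
    max_rounds: int = 20,
) -> int:
    """Scroll to the bottom until the page height stops growing.

    Returns the number of scroll rounds performed.
    """
    last_height = get_height()
    for round_number in range(1, max_rounds + 1):
        scroll_to(last_height)
        pause()  # give lazy-loaded content time to arrive
        new_height = get_height()
        if new_height == last_height:
            return round_number
        last_height = new_height
    return max_rounds


def scroll_page_fully(page, pause_ms: int = 500) -> int:
    """Apply scroll_until_stable to a Playwright page, then return to the top."""
    rounds = scroll_until_stable(
        get_height=lambda: page.evaluate("document.body.scrollHeight"),
        scroll_to=lambda y: page.evaluate("y => window.scrollTo(0, y)", y),
        pause=lambda: page.wait_for_timeout(pause_ms),
    )
    page.evaluate("window.scrollTo(0, 0)")
    return rounds
```

The max_rounds cap matters for infinite-scroll feeds, which would otherwise never reach a stable height.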

DevContainer Configuration

To use Playwright in a DevContainer, your .devcontainer/devcontainer.json needs:

{
  "name": "Python with Playwright",
  "image": "mcr.microsoft.com/devcontainers/python:3.11",
  "postCreateCommand": "pip install playwright && playwright install --with-deps chromium",
  "features": {
    "ghcr.io/devcontainers/features/git:1": {}
  }
}
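
Once the container is built, a quick smoke test can confirm that Chromium actually launches and produces a PDF before you point it at real sites. This is a minimal sketch: it renders an inline HTML snippet, so it needs no network access, and it checks only that the output starts with the PDF magic bytes.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes start with the PDF file signature."""
    return data.startswith(b"%PDF-")


def smoke_test() -> bool:
    """Launch headless Chromium, render inline HTML, and check the result."""
    from playwright.sync_api import sync_playwright  # imported lazily

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.set_content("<h1>DevContainer smoke test</h1>")
            # Without a path argument, page.pdf() returns the PDF as bytes
            pdf_bytes = page.pdf(format="A4")
        finally:
            browser.close()
    return looks_like_pdf(pdf_bytes)
```

Running smoke_test() from a Python shell inside the container should return True; an exception at launch usually points to missing system libraries.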

Performance and Resource Considerations

Memory Usage:

Chromium: ~200-300MB per instance
Firefox: ~250-350MB per instance
Recommended: 2GB+ RAM for reliable operation

Startup Time:

Initial browser launch: 1-3 seconds
Page navigation: varies by site (typically 2-10 seconds)
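
The memory figures above can be turned into a rough ceiling on how many browsers to run at once. The sketch below assumes about 300 MB per Chromium instance (the upper end of the range quoted above) and a hypothetical 512 MB reserve for the OS and the Python process; both numbers are estimates to tune against real measurements.

```python
def max_parallel_browsers(
    total_ram_mb: int,
    per_browser_mb: int = 300,
    reserved_mb: int = 512,
) -> int:
    """Estimate how many browser instances fit in memory (never fewer than one)."""
    usable_mb = total_ram_mb - reserved_mb
    return max(1, usable_mb // per_browser_mb)
```

With the recommended 2 GB (2048 MB), this gives (2048 - 512) // 300 = 5 instances.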

Optimization Tips:

from playwright.sync_api import sync_playwright

def optimized_batch_pdf_generation(urls: list[str]):
    """Generate multiple PDFs efficiently by reusing browser context."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        
        # Reuse context for multiple pages
        context = browser.new_context()
        
        for i, url in enumerate(urls):
            page = context.new_page()
            page.goto(url, wait_until='networkidle')
            page.pdf(path=f"output_{i}.pdf", format='A4')
            page.close()  # Close page but keep context
        
        browser.close()

When to Choose Playwright

✅ Best for:

Modern web applications with JavaScript
Sites with complex CSS and animations
Dashboards with charts and dynamic visualizations
When you need pixel-perfect rendering
Batch processing multiple pages
Long-term maintenance (active development)

❌ Not ideal for:

Simple static HTML (overkill)
Very resource-constrained environments
Extremely high-volume PDF generation (consider pre-rendering)

2. Selenium with Chrome/Firefox: The Veteran

Overview

Selenium is the established standard for browser automation. While older than Playwright, it remains widely used and well-supported.

Basic Example

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import base64

def selenium_to_pdf(url: str, output_path: str):
    """Convert webpage to PDF using Selenium."""
    # Configure Chrome for headless mode
    chrome_options = Options()
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--disable-dev-shm-usage')
    
    # Create driver
    driver = webdriver.Chrome(options=chrome_options)
    
    try:
        # Navigate to URL
        driver.get(url)
        
        # Wait for page to load
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.TAG_NAME, "body"))
        )
        
        # Chrome DevTools Protocol for PDF generation
        pdf_data = driver.execute_cdp_cmd("Page.printToPDF", {
            "printBackground": True,
            "landscape": False,
            "paperWidth": 8.27,  # A4 width in inches
            "paperHeight": 11.69  # A4 height in inches
        })
        
        # Decode and save PDF
        with open(output_path, 'wb') as f:
            f.write(base64.b64decode(pdf_data['data']))
            
    finally:
        driver.quit()

if __name__ == "__main__":
    selenium_to_pdf("https://example.com", "output.pdf")
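
Selenium 4 also exposes the WebDriver print command through driver.print_page(), which avoids the Chrome-specific CDP call. The following is a sketch of that alternative using the same headless flags as above; the helper save_base64_pdf is a name introduced here for illustration.

```python
import base64


def save_base64_pdf(encoded: str, output_path: str) -> int:
    """Decode a base64 PDF payload, write it to disk, and return its size in bytes."""
    data = base64.b64decode(encoded)
    with open(output_path, "wb") as f:
        f.write(data)
    return len(data)


def selenium_print_page(url: str, output_path: str) -> None:
    """Convert a webpage to PDF with the WebDriver print command."""
    # Imported lazily so save_base64_pdf works without Selenium installed
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.print_page_options import PrintOptions

    chrome_options = Options()
    chrome_options.add_argument("--headless")
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-dev-shm-usage")

    print_options = PrintOptions()
    print_options.background = True
    print_options.orientation = "portrait"

    driver = webdriver.Chrome(options=chrome_options)
    try:
        driver.get(url)
        # print_page() returns the PDF as a base64-encoded string
        save_base64_pdf(driver.print_page(print_options), output_path)
    finally:
        driver.quit()
```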

Comparison: Playwright vs Selenium

Feature	Playwright	Selenium
Setup Complexity	Simple	Moderate (needs a browser driver; Selenium 4.6+ can download it automatically)
Browser Support	Chromium, Firefox, WebKit	Chrome, Firefox, Safari, Edge
API Design	Modern, sync and async	Traditional, synchronous
Wait Mechanisms	Built-in, intuitive	Requires WebDriverWait
Documentation	Excellent	Good but fragmented
Community	Growing rapidly	Mature and large
Performance	Faster	Slightly slower
PDF Generation	Native API (Chromium only)	CDP command (Chromium-based browsers) or WebDriver print_page()

When to Choose Selenium

✅ Best for:

Teams already familiar with Selenium
When you need Safari support
Existing Selenium infrastructure
Cross-browser testing requirements

❌ Not ideal for:

New projects (Playwright is more modern)
Pure PDF generation (Playwright’s API is cleaner)

3. WeasyPrint: HTML/CSS Specialist

Overview

WeasyPrint is a Python library that converts HTML/CSS to PDF without requiring a browser, although it does depend on the Pango system library for text layout. It’s excellent for content you generate yourself but struggles with external websites.

Basic Example

from weasyprint import HTML
import requests

def weasyprint_to_pdf(url: str, output_path: str):
    """
    Convert webpage to PDF using WeasyPrint.
    Note: JavaScript won't execute!
    """
    # Fetch HTML content; fail fast on network or HTTP errors
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    
    # Convert to PDF
    HTML(string=response.text, base_url=url).write_pdf(output_path)

# Better use case: Local HTML with custom styling
def html_string_to_pdf(html_content: str, output_path: str):
    """Convert HTML string to PDF - ideal use case for WeasyPrint."""
    HTML(string=html_content).write_pdf(output_path)

# Example: Generate report from data
html_report = """
<!DOCTYPE html>
<html>
<head>
    <style>
        body { font-family: Arial, sans-serif; }
        .header { background-color: #007bff; color: white; padding: 20px; }
        table { width: 100%; border-collapse: collapse; }
        td, th { border: 1px solid #ddd; padding: 8px; }
        th { background-color: #f2f2f2; }
    </style>
</head>
<body>
    <div class="header"><h1>Monthly Report</h1></div>
    <table>
        <tr><th>Metric</th><th>Value</th></tr>
        <tr><td>Users</td><td>1,234</td></tr>
        <tr><td>Revenue</td><td>$56,789</td></tr>
    </table>
</body>
</html>
"""

html_string_to_pdf(html_report, "report.pdf")
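
In practice the report HTML is usually built from data, not written by hand. Here is a minimal sketch that fills a template from a dictionary using only the standard library, escaping values so they cannot break the markup. The names REPORT_TEMPLATE, render_report, and report_to_pdf are illustrative; a template engine such as Jinja2 would be a common substitute.

```python
import html
from string import Template

REPORT_TEMPLATE = Template("""<!DOCTYPE html>
<html>
<head><style>
  table { width: 100%; border-collapse: collapse; }
  td, th { border: 1px solid #ddd; padding: 8px; }
</style></head>
<body>
  <h1>$title</h1>
  <table>
    <tr><th>Metric</th><th>Value</th></tr>
$rows
  </table>
</body>
</html>""")


def render_report(title: str, metrics: dict) -> str:
    """Render the report HTML, escaping every value inserted into the markup."""
    rows = "\n".join(
        f"    <tr><td>{html.escape(str(name))}</td><td>{html.escape(str(value))}</td></tr>"
        for name, value in metrics.items()
    )
    return REPORT_TEMPLATE.substitute(title=html.escape(title), rows=rows)


def report_to_pdf(title: str, metrics: dict, output_path: str) -> None:
    """Render the report and convert it with WeasyPrint."""
    from weasyprint import HTML  # imported lazily

    HTML(string=render_report(title, metrics)).write_pdf(output_path)
```

For example, report_to_pdf("Monthly Report", {"Users": 1234, "Revenue": "$56,789"}, "report.pdf") would produce a table like the hand-written one above.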

When to Choose WeasyPrint

✅ Best for:

Converting local HTML you control
Generating reports from templates
Situations where JavaScript isn’t needed
Lightweight PDF generation
When you can’t install browser binaries

❌ Not ideal for:

External websites with JavaScript
Modern web applications
Complex CSS animations
Sites requiring authentication

4. pdfkit/wkhtmltopdf: Middle Ground

Overview

pdfkit is a Python wrapper around wkhtmltopdf, which renders HTML with an old Qt WebKit engine. It is lighter than a full browser and, unlike WeasyPrint, can run some JavaScript, though its dated engine does not support modern syntax.

Basic Example

import pdfkit

def pdfkit_to_pdf(url: str, output_path: str):
    """Convert webpage to PDF using pdfkit/wkhtmltopdf."""
    options = {
        'page-size': 'A4',
        'margin-top': '0.75in',
        'margin-right': '0.75in',
        'margin-bottom': '0.75in',
        'margin-left': '0.75in',
        'encoding': "UTF-8",
        'enable-local-file-access': None,
        'no-outline': None
    }
    
    pdfkit.from_url(url, output_path, options=options)

# Example with HTML string
def pdfkit_from_html(html_content: str, output_path: str):
    """Convert HTML string to PDF."""
    pdfkit.from_string(html_content, output_path)
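
pdfkit fails with an unhelpful error when the wkhtmltopdf binary is missing, so it can help to locate the binary explicitly and pass it in through pdfkit.configuration(). The sketch below also sets wkhtmltopdf's javascript-delay option to give simple scripts time to run; the 2000 ms default is an arbitrary assumption.

```python
import shutil
from typing import Optional


def find_wkhtmltopdf() -> Optional[str]:
    """Return the full path of the wkhtmltopdf binary, or None if it is not on PATH."""
    return shutil.which("wkhtmltopdf")


def pdfkit_with_js_delay(url: str, output_path: str, delay_ms: int = 2000) -> None:
    """Convert a URL with pdfkit, waiting delay_ms for page scripts to finish."""
    import pdfkit  # imported lazily

    binary = find_wkhtmltopdf()
    if binary is None:
        raise RuntimeError("wkhtmltopdf not found on PATH; install it first")

    config = pdfkit.configuration(wkhtmltopdf=binary)
    options = {
        "page-size": "A4",
        "javascript-delay": str(delay_ms),  # milliseconds
    }
    pdfkit.from_url(url, output_path, options=options, configuration=config)
```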

DevContainer Setup

# In your Dockerfile (for a devcontainer.json postCreateCommand, drop "RUN" and use sudo if needed)
RUN apt-get update && apt-get install -y wkhtmltopdf
# Then install the Python wrapper: pip install pdfkit

When to Choose pdfkit

✅ Best for:

Simple websites without heavy JavaScript
When Playwright is too heavy
Static HTML with moderate CSS
Quick prototyping

❌ Not ideal for:

Modern SPAs (Single Page Applications)
Sites with complex JavaScript
Long-term projects (wkhtmltopdf is no longer actively maintained)

5. ReportLab + Requests: Manual Approach

Overview

ReportLab is a powerful PDF generation library, but converting arbitrary webpages requires parsing HTML yourself—generally not recommended except for specific use cases.
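
To show what "parsing HTML yourself" involves, here is a deliberately small sketch that pulls headings and paragraphs out of HTML with the standard library and lays them out with ReportLab's platypus classes. It ignores images, tables, links, and all CSS, which is exactly why this route does not suit real webpages. The names BlockExtractor, extract_blocks, and blocks_to_pdf are invented for this example.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape


class BlockExtractor(HTMLParser):
    """Collects (tag, text) pairs for headings and paragraphs."""

    BLOCK_TAGS = ("h1", "h2", "h3", "p")

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._current = None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            # Collapse runs of whitespace into single spaces
            text = " ".join("".join(self._buffer).split())
            if text:
                self.blocks.append((tag, text))
            self._current = None


def extract_blocks(html_text: str) -> list:
    """Return the headings and paragraphs found in the HTML, in document order."""
    parser = BlockExtractor()
    parser.feed(html_text)
    return parser.blocks


def blocks_to_pdf(blocks: list, output_path: str) -> None:
    """Write the extracted blocks to a simple A4 PDF."""
    from reportlab.lib.pagesizes import A4
    from reportlab.lib.styles import getSampleStyleSheet
    from reportlab.platypus import Paragraph, SimpleDocTemplate, Spacer

    styles = getSampleStyleSheet()
    style_for = {
        "h1": styles["Heading1"],
        "h2": styles["Heading2"],
        "h3": styles["Heading3"],
        "p": styles["BodyText"],
    }
    story = []
    for tag, text in blocks:
        # Paragraph parses inline markup, so plain text must be escaped
        story.append(Paragraph(escape(text), style_for[tag]))
        story.append(Spacer(1, 8))
    SimpleDocTemplate(output_path, pagesize=A4).build(story)
```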

When to Consider This Approach

✅ Best for:

Generating PDFs from structured data (not webpages)
Custom PDF layouts with precise control
When HTML is just a data transport format

❌ Not ideal for:

Converting actual webpages
Anything that requires visual fidelity

Practical Comparison Matrix

Solution	Setup Difficulty	Resource Usage	JS Support	CSS Fidelity	Maintenance	Container-Friendly
Playwright	Medium	High	✅ Excellent	✅ Excellent	✅ Active	✅ Yes
Selenium	Medium-High	High	✅ Excellent	✅ Excellent	✅ Active	✅ Yes
WeasyPrint	Low	Low	❌ None	⚠️ Good	✅ Active	✅ Yes
pdfkit	Low	Medium	⚠️ Limited	⚠️ Good	⚠️ Unmaintained	✅ Yes
ReportLab	Low	Low	❌ None	❌ Manual	✅ Active	✅ Yes

Complete DevContainer Example

Here’s a more complete example that pulls the Playwright pieces together:

.devcontainer/devcontainer.json:

{
  "name": "Python PDF Generation",
  "image": "mcr.microsoft.com/devcontainers/python:3.11",
  "features": {
    "ghcr.io/devcontainers/features/git:1": {}
  },
  "postCreateCommand": "pip install -r requirements.txt && playwright install --with-deps chromium",
  "customizations": {
    "vscode": {
      "extensions": [
        "ms-python.python",
        "ms-python.vscode-pylance"
      ]
    }
  }
}

requirements.txt:

playwright==1.40.0
requests==2.31.0
python-dotenv==1.0.0

pdf_generator.py:

#!/usr/bin/env python3
"""
Production-ready webpage to PDF converter for DevContainer environments.
"""
import logging
from pathlib import Path
from typing import Optional
from playwright.sync_api import sync_playwright
from dotenv import load_dotenv

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

class WebPageToPDF:
    """Convert webpages to PDF using Playwright."""
    
    def __init__(self, headless: bool = True):
        """Initialize the converter."""
        self.headless = headless
        self.playwright = None
        self.browser = None
    
    def __enter__(self):
        """Context manager entry."""
        self.playwright = sync_playwright().start()
        self.browser = self.playwright.chromium.launch(headless=self.headless)
        return self
    
    def __exit__(self, exc_type, exc_val, exc_tb):
        """Context manager exit."""
        if self.browser:
            self.browser.close()
        if self.playwright:
            self.playwright.stop()
    
    def convert(
        self,
        url: str,
        output_path: str,
        wait_for_selector: Optional[str] = None,
        wait_timeout: int = 30000,
        format: str = 'A4',
        landscape: bool = False
    ) -> bool:
        """
        Convert a webpage to PDF.
        
        Args:
            url: URL to convert
            output_path: Path for output PDF
            wait_for_selector: Optional CSS selector to wait for
            wait_timeout: Timeout in milliseconds
            format: Paper format (A4, Letter, etc.)
            landscape: Whether to use landscape orientation
            
        Returns:
            True if successful, False otherwise
        """
        page = None
        try:
            logger.info(f"Converting {url} to PDF")
            
            page = self.browser.new_page()
            
            # Navigate to URL
            page.goto(url, wait_until='networkidle', timeout=wait_timeout)
            
            # Wait for specific content if requested
            if wait_for_selector:
                page.wait_for_selector(wait_for_selector, timeout=wait_timeout)
            
            # Generate PDF
            page.pdf(
                path=output_path,
                format=format,
                landscape=landscape,
                print_background=True,
                margin={'top': '1cm', 'right': '1cm', 'bottom': '1cm', 'left': '1cm'}
            )
            
            logger.info(f"PDF saved to {output_path}")
            return True
            
        except Exception as e:
            logger.error(f"Error converting {url}: {e}")
            return False
        finally:
            # Always release the page, even when conversion fails
            if page is not None:
                page.close()

def main():
    """Example usage."""
    load_dotenv()
    
    # URLs to convert
    urls = [
        "https://example.com",
        "https://news.ycombinator.com"
    ]
    
    output_dir = Path("output")
    output_dir.mkdir(exist_ok=True)
    
    # Convert pages
    with WebPageToPDF() as converter:
        for i, url in enumerate(urls):
            output_path = output_dir / f"page_{i}.pdf"
            converter.convert(url, str(output_path))

if __name__ == "__main__":
    main()

Best Practices

1. Error Handling

Always implement comprehensive error handling:

import logging

# Alias Playwright's TimeoutError so it does not shadow the built-in one
from playwright.sync_api import sync_playwright, TimeoutError as PlaywrightTimeoutError

logger = logging.getLogger(__name__)

def robust_pdf_generation(url: str, output_path: str) -> bool:
    """Generate PDF with proper error handling."""
    try:
        with sync_playwright() as p:
            browser = p.chromium.launch(headless=True)
            page = browser.new_page()
            
            try:
                page.goto(url, wait_until='networkidle', timeout=30000)
                page.pdf(path=output_path, format='A4', print_background=True)
                return True
            except PlaywrightTimeoutError:
                logger.error(f"Timeout loading {url}")
                return False
            except Exception as e:
                logger.error(f"Error generating PDF: {e}")
                return False
            finally:
                browser.close()
                
    except Exception as e:
        logger.error(f"Failed to initialize browser: {e}")
        return False
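
Timeouts and transient network failures are often worth retrying. The sketch below wraps any function that returns True on success in a retry loop with exponential backoff. The attempt count and delays are assumptions to tune, and the sleep function is injectable so the logic can be tested without waiting.

```python
import time
from typing import Callable


def retry_with_backoff(
    operation: Callable[[], bool],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call operation until it returns True or the attempts run out.

    Waits base_delay, then 2x, then 4x, ... between attempts.
    """
    for attempt in range(attempts):
        if operation():
            return True
        # No point sleeping after the final failed attempt
        if attempt < attempts - 1:
            sleep(base_delay * (2 ** attempt))
    return False
```

It could be combined with the function above as retry_with_backoff(lambda: robust_pdf_generation(url, "output.pdf")).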

2. Resource Management

Clean up resources properly:

import atexit
from playwright.sync_api import sync_playwright

class ManagedBrowser:
    """Browser instance with automatic cleanup."""
    
    def __init__(self):
        self.playwright = sync_playwright().start()
        self.browser = self.playwright.chromium.launch(headless=True)
        atexit.register(self.cleanup)
    
    def cleanup(self):
        """Cleanup resources."""
        if self.browser:
            self.browser.close()
        if self.playwright:
            self.playwright.stop()

3. Configuration

Use environment variables and configuration files:

import os
from dataclasses import dataclass, field

@dataclass
class PDFConfig:
    """PDF generation configuration; environment variables are read when an instance is created."""
    format: str = field(default_factory=lambda: os.getenv('PDF_FORMAT', 'A4'))
    landscape: bool = field(default_factory=lambda: os.getenv('PDF_LANDSCAPE', 'false').lower() == 'true')
    print_background: bool = True
    margin_top: str = '1cm'
    margin_right: str = '1cm'
    margin_bottom: str = '1cm'
    margin_left: str = '1cm'
    timeout: int = field(default_factory=lambda: int(os.getenv('PDF_TIMEOUT', '30000')))

4. Testing

Include tests for your PDF generation:

# Note: this test needs network access to reach example.com
from pdf_generator import WebPageToPDF

def test_pdf_generation(tmp_path):
    """Test basic PDF generation."""
    output_path = tmp_path / "test.pdf"
    
    with WebPageToPDF() as converter:
        success = converter.convert(
            "https://example.com",
            str(output_path)
        )
    
    assert success
    assert output_path.exists()
    assert output_path.stat().st_size > 0
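
Tests that depend on an external site are slow and can fail for reasons unrelated to your code. One way around this, sketched below with only the standard library, is to serve a temporary directory of fixture HTML from a local HTTP server and convert that instead. The name serve_directory is an assumption made for this example.

```python
import contextlib
import functools
import http.server
import threading


@contextlib.contextmanager
def serve_directory(directory: str):
    """Serve a directory over HTTP on a free local port and yield its base URL."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    # Port 0 asks the OS to pick any free port
    server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), handler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        yield f"http://127.0.0.1:{server.server_address[1]}"
    finally:
        server.shutdown()
        server.server_close()
        thread.join()
```

Inside a pytest test, you might write a fixture page into tmp_path, open `with serve_directory(str(tmp_path)) as base_url:`, and pass f"{base_url}/index.html" to converter.convert().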

Performance Optimization

Parallel Processing

For bulk PDF generation:

import logging
from concurrent.futures import ThreadPoolExecutor, as_completed
from playwright.sync_api import sync_playwright

logger = logging.getLogger(__name__)

def parallel_pdf_generation(urls: list[str], max_workers: int = 4):
    """Generate PDFs in parallel, one browser per worker thread."""
    
    def convert_single(url: str, index: int):
        # Playwright's API is not thread-safe, so each thread starts its own instance
        with sync_playwright() as p:
            browser = p.chromium.launch(headless=True)
            try:
                page = browser.new_page()
                page.goto(url, wait_until='networkidle')
                page.pdf(path=f"output_{index}.pdf", format='A4')
            finally:
                browser.close()
        return index, url
    
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {
            executor.submit(convert_single, url, i): (i, url)
            for i, url in enumerate(urls)
        }
        
        for future in as_completed(futures):
            index, url = futures[future]
            try:
                future.result()
                logger.info(f"Completed: {url}")
            except Exception as e:
                # One failed URL should not abort the rest of the batch
                logger.error(f"Failed: {url}: {e}")
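
An alternative to one browser per thread is Playwright's async API, where a single browser serves several pages concurrently. The sketch below caps concurrency with a semaphore; bounded_gather is a generic helper written for this example, and the limit of 4 is an arbitrary assumption.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List


async def bounded_gather(
    items: Iterable,
    worker: Callable[[object], Awaitable],
    limit: int = 4,
) -> List:
    """Run worker(item) for every item, with at most `limit` running at once."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(item):
        async with semaphore:
            return await worker(item)

    # gather() returns results in the same order as the inputs
    return await asyncio.gather(*(guarded(item) for item in items))


async def urls_to_pdfs(urls: List[str], limit: int = 4) -> List[str]:
    """Convert URLs to PDFs using one shared browser and several pages."""
    from playwright.async_api import async_playwright  # imported lazily

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        try:
            async def convert(indexed):
                index, url = indexed
                page = await browser.new_page()
                try:
                    await page.goto(url, wait_until="networkidle")
                    path = f"output_{index}.pdf"
                    await page.pdf(path=path, format="A4")
                    return path
                finally:
                    await page.close()

            return await bounded_gather(list(enumerate(urls)), convert, limit)
        finally:
            await browser.close()
```

A script would call it with asyncio.run(urls_to_pdfs(["https://example.com"])). Sharing one browser saves the per-instance startup cost, at the price of one crash affecting every page in flight.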

Caching

Cache PDFs to avoid regeneration:

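import hashlib
import logging
from pathlib import Path

from pdf_generator import WebPageToPDF

logger = logging.getLogger(__name__)

def get_cache_path(url: str, cache_dir: Path) -> Path:
    """Generate cache path from URL."""
    # MD5 serves only as a filename fingerprint here, not as a security measure
    url_hash = hashlib.md5(url.encode()).hexdigest()
    return cache_dir / f"{url_hash}.pdf"

def cached_pdf_generation(url: str, cache_dir: Path) -> Path:
    """Generate PDF with caching."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_path = get_cache_path(url, cache_dir)
    
    if cache_path.exists():
        logger.info(f"Using cached PDF for {url}")
        return cache_path
    
    # Generate new PDF; do not return a path that was never written
    with WebPageToPDF() as converter:
        if not converter.convert(url, str(cache_path)):
            raise RuntimeError(f"Could not generate PDF for {url}")
    
    return cache_path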
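
A cache keyed only on the URL never notices when the page changes. A simple remedy, sketched here as an assumption and not as part of the code above, is to treat cached files older than some maximum age as stale. The helper is_fresh compares the file's modification time against a limit, with an injectable clock for testing.

```python
import time
from pathlib import Path
from typing import Optional


def is_fresh(path: Path, max_age_seconds: float, now: Optional[float] = None) -> bool:
    """Return True if the file exists and was modified within max_age_seconds."""
    if not path.exists():
        return False
    current = time.time() if now is None else now
    return (current - path.stat().st_mtime) <= max_age_seconds
```

In the caching function above, the cache_path.exists() check could be replaced with is_fresh(cache_path, 3600) to regenerate PDFs older than an hour.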

Troubleshooting Common Issues

Issue 1: Browser Installation Fails

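# Install system dependencies manually (package names vary by distribution release;
# "playwright install-deps chromium" automates this step)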
apt-get update
apt-get install -y \
    libnss3 \
    libnspr4 \
    libatk1.0-0 \
    libatk-bridge2.0-0 \
    libcups2 \
    libdrm2 \
    libdbus-1-3 \
    libxkbcommon0 \
    libxcomposite1 \
    libxdamage1 \
    libxfixes3 \
    libxrandr2 \
    libgbm1 \
    libasound2

# Then install Playwright browsers
playwright install chromium

Issue 2: Fonts Missing

# Install common fonts
apt-get install -y \
    fonts-liberation \
    fonts-noto-color-emoji \
    fonts-noto-cjk

Issue 3: Timeout Errors

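# Increase the timeout if the page is simply slow
page.goto(url, wait_until='networkidle', timeout=60000)

# Or, for pages that never become network-idle (polling, analytics beacons),
# stop at DOMContentLoaded and wait for specific content instead
page.goto(url, wait_until='domcontentloaded', timeout=60000)
page.wait_for_selector('.content-loaded', timeout=30000)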

Conclusion

For Python developers working in DevContainers who need to convert webpages to PDF, Playwright is the clear winner for most use cases. It provides:

Excellent rendering of modern web content
Native PDF generation capabilities
Clean, intuitive API
Strong DevContainer support
Active development and maintenance

Quick Decision Guide:

Need modern web app support? → Use Playwright
Already using Selenium? → Stick with Selenium (or migrate gradually)
Simple HTML you control? → Use WeasyPrint
Need ultra-lightweight solution? → Use pdfkit (but be aware it’s unmaintained)
Generating custom PDFs from data? → Use ReportLab (not for webpages)

The examples in this guide are starting points that you can adapt to your specific needs. Start with the basic Playwright example and add features as your requirements grow.

About This Guide

This guide focuses on practical, production-ready solutions for converting webpages to PDF in containerized environments. All code examples have been tested in DevContainers and follow Python best practices for error handling, resource management, and maintainability.