Converting Webpages to PDF from within a DevContainer: A Python Guide
READER BEWARE: THE FOLLOWING WRITTEN ENTIRELY BY AI WITHOUT HUMAN EDITING.
Introduction
In modern DevOps and automation workflows, the ability to programmatically capture complete webpages as PDF files is increasingly valuable. Whether you’re archiving web content, generating reports from dashboards, or creating documentation from web-based tools, having reliable PDF conversion capabilities within a containerized development environment is essential.
This guide explores the most effective Python-based solutions for converting webpages to PDF from within a DevContainer, comparing their strengths, limitations, and ideal use cases. We’ll focus on tools that handle modern web applications with JavaScript, CSS, and dynamic content—because in today’s web landscape, static HTML rendering is rarely sufficient.
The Challenge: Modern Web Content
Before diving into solutions, it’s important to understand what makes webpage-to-PDF conversion challenging:
- JavaScript Rendering: Modern websites rely heavily on JavaScript to render content
- CSS Complexity: Advanced layouts, web fonts, and responsive designs need accurate rendering
- Dynamic Content: AJAX requests, lazy loading, and animations require waiting for content to load
- Authentication: Many sites require login or API tokens
- Container Environment: Solutions must work in headless, resource-constrained environments
Solution Overview
We’ll compare five approaches, ranked by how well they handle modern web content:
- Playwright - Browser automation with native PDF export
- Selenium + Chrome/Firefox - Traditional browser automation
- WeasyPrint - HTML/CSS to PDF converter
- pdfkit/wkhtmltopdf - WebKit-based HTML to PDF
- ReportLab + requests - Manual HTML parsing and PDF generation
1. Playwright: The Modern Choice
Overview
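Playwright is a modern browser automation library developed by Microsoft. It supports Chromium, Firefox, and WebKit, and is specifically designed for reliable, reproducible browser automation. Note that PDF export via page.pdf() is only available with Chromium, which is why every example below launches p.chromium.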
Why Playwright Excels for PDF Generation
Strengths:
- Native Chromium rendering gives high visual fidelity (the PDF uses the page’s print styles unless you emulate screen media first)
- Full JavaScript execution support
- Excellent handling of modern web frameworks (React, Vue, Angular)
- Built-in wait mechanisms for dynamic content
- Clean API designed for automation
- Headless mode optimized for containers
- Active development and excellent documentation
Installation in DevContainer:
# requirements.txt
playwright==1.40.0
# After pip install, install browser binaries
# playwright install chromium
Basic Example
#!/usr/bin/env python3
"""
Simple webpage to PDF conversion using Playwright.
"""
from typing import Optional

from playwright.sync_api import sync_playwright


def webpage_to_pdf(url: str, output_path: str, wait_for_selector: Optional[str] = None):
    """
    Convert a webpage to PDF using Playwright.

    Args:
        url: The webpage URL to convert
        output_path: Path where PDF will be saved
        wait_for_selector: Optional CSS selector to wait for before capturing
    """
    with sync_playwright() as p:
        # Launch browser in headless mode
        browser = p.chromium.launch(headless=True)
        # Create a new page
        page = browser.new_page()
        # Navigate to the URL
        page.goto(url, wait_until='networkidle')
        # Optionally wait for specific content
        if wait_for_selector:
            page.wait_for_selector(wait_for_selector, timeout=30000)
        # Generate PDF
        page.pdf(
            path=output_path,
            format='A4',
            print_background=True,
            margin={'top': '1cm', 'right': '1cm', 'bottom': '1cm', 'left': '1cm'}
        )
        browser.close()


# Example usage
if __name__ == "__main__":
    webpage_to_pdf(
        url="https://example.com",
        output_path="output.pdf",
        wait_for_selector=".main-content"
    )
    print("PDF generated successfully!")
Advanced Features
Authentication and Cookies:
from playwright.sync_api import sync_playwright


def pdf_with_auth(url: str, output_path: str, username: str, password: str):
    """Generate PDF from an authenticated page."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context()
        page = context.new_page()
        # Navigate to login page
        page.goto("https://example.com/login")
        # Fill in credentials
        page.fill('input[name="username"]', username)
        page.fill('input[name="password"]', password)
        page.click('button[type="submit"]')
        # Wait for navigation after login
        page.wait_for_url("**/dashboard")
        # Navigate to target page
        page.goto(url, wait_until='networkidle')
        # Generate PDF
        page.pdf(path=output_path, format='A4', print_background=True)
        browser.close()
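The heading mentions cookies, but the example above only shows a form login. When a session cookie or token is already available, one option is to hand it to Playwright through context.add_cookies(), which takes a list of cookie dicts. The helper below is a minimal sketch that only builds that list; the name build_cookies, the cookie name, and the domain are illustrative assumptions rather than part of the original example.

```python
from typing import Dict, List


def build_cookies(values: Dict[str, str], domain: str, secure: bool = True) -> List[dict]:
    """Turn {name: value} pairs into cookie dicts in the shape that
    Playwright's BrowserContext.add_cookies() accepts."""
    return [
        {
            "name": name,
            "value": value,
            "domain": domain,   # cookie is scoped to this domain
            "path": "/",        # and to every path under it
            "secure": secure,
            "httpOnly": True,
        }
        for name, value in values.items()
    ]


# Hypothetical session cookie; in practice the value would come from an
# environment variable or a secrets store, not from source code.
cookies = build_cookies({"sessionid": "abc123"}, domain="example.com")
# With a Playwright context this would be applied before page.goto():
#     context.add_cookies(cookies)
```

Setting cookies on the context before navigating skips the login form entirely, which tends to be less brittle than filling in fields by selector.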
Custom Viewport and Responsive Design:
from playwright.sync_api import sync_playwright


def pdf_with_custom_viewport(url: str, output_path: str):
    """Generate PDF with custom viewport size."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        # Set viewport to simulate specific device
        page = browser.new_page(viewport={'width': 1920, 'height': 1080})
        page.goto(url, wait_until='networkidle')
        # PDF options for landscape format
        page.pdf(
            path=output_path,
            format='A4',
            landscape=True,
            print_background=True,
            scale=0.8  # Adjust scale to fit content
        )
        browser.close()
Handling Dynamic Content:
from playwright.sync_api import sync_playwright


def pdf_with_dynamic_content(url: str, output_path: str):
    """Handle lazy-loaded and dynamically rendered content."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until='networkidle')
        # Wait for specific content to appear
        page.wait_for_selector('.data-loaded', timeout=30000)
        # Scroll to trigger lazy loading
        page.evaluate("""
            window.scrollTo(0, document.body.scrollHeight);
        """)
        # Wait for lazy-loaded content
        page.wait_for_timeout(2000)
        # Scroll back to top
        page.evaluate("window.scrollTo(0, 0)")
        page.pdf(path=output_path, format='A4', print_background=True)
        browser.close()
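A single jump to the bottom of the page can skip lazy loaders that only fire when an element actually enters the viewport. One possible refinement is to scroll in steps. The sketch below assumes an object that behaves like a Playwright sync Page (it only uses evaluate and wait_for_timeout); the function name, step size, and pause are illustrative assumptions to tune per site.

```python
def scroll_through_page(page, step_px: int = 800, pause_ms: int = 250,
                        max_steps: int = 200) -> int:
    """Scroll down in fixed steps so lazy loaders get a chance to fire.

    Returns the number of scroll steps taken. `page` is expected to offer
    Playwright's sync Page methods `evaluate` and `wait_for_timeout`.
    """
    position = 0
    steps = 0
    while steps < max_steps:
        # Re-read the height each time: it grows as content is loaded
        height = page.evaluate("document.body.scrollHeight")
        if position >= height:
            break
        position += step_px
        page.evaluate("y => window.scrollTo(0, y)", position)
        page.wait_for_timeout(pause_ms)
        steps += 1
    # Return to the top so sticky headers render in their initial state
    page.evaluate("window.scrollTo(0, 0)")
    return steps
```

The max_steps cap is there for infinite-scroll pages, where the document height never stops growing.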
DevContainer Configuration
To use Playwright in a DevContainer, your .devcontainer/devcontainer.json needs:
{
  "name": "Python with Playwright",
  "image": "mcr.microsoft.com/devcontainers/python:3.11",
  "postCreateCommand": "pip install playwright && playwright install --with-deps chromium",
  "features": {
    "ghcr.io/devcontainers/features/git:1": {}
  }
}
Performance and Resource Considerations
Memory Usage (rough figures; actual usage varies with the page being rendered):
- Chromium: ~200-300MB per instance
- Firefox: ~250-350MB per instance
- Recommended: 2GB+ RAM for reliable operation
Startup Time:
- Initial browser launch: 1-3 seconds
- Page navigation: varies by site (typically 2-10 seconds)
Optimization Tips:
from playwright.sync_api import sync_playwright


def optimized_batch_pdf_generation(urls: list[str]):
    """Generate multiple PDFs efficiently by reusing browser context."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        # Reuse context for multiple pages
        context = browser.new_context()
        for i, url in enumerate(urls):
            page = context.new_page()
            page.goto(url, wait_until='networkidle')
            page.pdf(path=f"output_{i}.pdf", format='A4')
            page.close()  # Close page but keep context
        browser.close()
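Index-only names such as output_0.pdf become hard to trace back to their source once a batch grows. One option is to derive the file name from the URL. This is a small sketch using only the standard library; the helper name pdf_name_for_url and the naming scheme are assumptions, not something the batch example above requires.

```python
import re
from urllib.parse import urlparse


def pdf_name_for_url(url: str, index: int) -> str:
    """Build a filesystem-safe PDF name from a URL's host and path."""
    parsed = urlparse(url)
    raw = f"{parsed.netloc}{parsed.path}"
    # Collapse every run of non-alphanumeric characters into one hyphen
    slug = re.sub(r"[^A-Za-z0-9]+", "-", raw).strip("-").lower()
    # Keep the index as a prefix so names stay unique and sortable
    return f"{index:03d}_{slug or 'page'}.pdf"


print(pdf_name_for_url("https://example.com/docs/Intro", 0))
# → 000_example-com-docs-intro.pdf
```

Query strings are ignored here on purpose; if two URLs differ only in their query, the index prefix still keeps the names distinct.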
When to Choose Playwright
✅ Best for:
- Modern web applications with JavaScript
- Sites with complex CSS and animations
- Dashboards with charts and dynamic visualizations
- When the PDF must closely match what a real browser renders
- Batch processing multiple pages
- Long-term maintenance (active development)
❌ Not ideal for:
- Simple static HTML (overkill)
- Very resource-constrained environments
- Extremely high-volume PDF generation (consider pre-rendering)
2. Selenium with Chrome/Firefox: The Veteran
Overview
Selenium is the established standard for browser automation. While older than Playwright, it remains widely used and well-supported.
Basic Example
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import base64


def selenium_to_pdf(url: str, output_path: str):
    """Convert webpage to PDF using Selenium."""
    # Configure Chrome for headless mode
    chrome_options = Options()
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--no-sandbox')
    chrome_options.add_argument('--disable-dev-shm-usage')
    # Create driver
    driver = webdriver.Chrome(options=chrome_options)
    try:
        # Navigate to URL
        driver.get(url)
        # Wait for page to load
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.TAG_NAME, "body"))
        )
        # Chrome DevTools Protocol for PDF generation
        pdf_data = driver.execute_cdp_cmd("Page.printToPDF", {
            "printBackground": True,
            "landscape": False,
            "paperWidth": 8.27,   # A4 width in inches
            "paperHeight": 11.69  # A4 height in inches
        })
        # Decode and save PDF
        with open(output_path, 'wb') as f:
            f.write(base64.b64decode(pdf_data['data']))
    finally:
        driver.quit()


if __name__ == "__main__":
    selenium_to_pdf("https://example.com", "output.pdf")
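The paperWidth and paperHeight values above are in inches, so paper sizes specified in millimetres have to be converted (1 inch = 25.4 mm). The following sketch shows that arithmetic; the lookup table and the function name paper_size_inches are illustrative assumptions.

```python
from typing import Tuple

# Portrait paper sizes in millimetres (width, height)
PAPER_SIZES_MM = {
    "A4": (210.0, 297.0),
    "Letter": (215.9, 279.4),
}

MM_PER_INCH = 25.4


def paper_size_inches(name: str) -> Tuple[float, float]:
    """Return (width, height) in inches, rounded to two decimals."""
    width_mm, height_mm = PAPER_SIZES_MM[name]
    return round(width_mm / MM_PER_INCH, 2), round(height_mm / MM_PER_INCH, 2)


print(paper_size_inches("A4"))      # → (8.27, 11.69)
print(paper_size_inches("Letter"))  # → (8.5, 11.0)
```

For A4 this reproduces the 8.27 and 11.69 used in the Selenium example: 210 / 25.4 ≈ 8.27 and 297 / 25.4 ≈ 11.69.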
Comparison: Playwright vs Selenium
| Feature | Playwright | Selenium |
|---|---|---|
| Setup Complexity | Simple | Moderate (needs a browser driver; Selenium 4.6+ can download it automatically via Selenium Manager) |
| Browser Support | Chromium, Firefox, WebKit | Chrome, Firefox, Safari, Edge |
| API Design | Modern, sync and async APIs | Traditional, synchronous |
| Wait Mechanisms | Built-in, intuitive | Requires WebDriverWait |
| Documentation | Excellent | Good but fragmented |
| Community | Growing rapidly | Mature and large |
| Performance | Faster | Slightly slower |
| PDF Generation | Native API (Chromium only) | WebDriver print command (`print_page()`) or CDP `Page.printToPDF` (Chromium-based browsers only) |
When to Choose Selenium
✅ Best for:
- Teams already familiar with Selenium
- When you need Safari support
- Existing Selenium infrastructure
- Cross-browser testing requirements
❌ Not ideal for:
- New projects (Playwright is more modern)
- Pure PDF generation (Playwright’s API is cleaner)
3. WeasyPrint: HTML/CSS Specialist
Overview
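WeasyPrint is a Python library that converts HTML/CSS to PDF without requiring a browser, although it does depend on native libraries such as Pango being installed. It’s excellent for content you generate yourself but struggles with external websites, because it never executes JavaScript.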
Basic Example
from weasyprint import HTML
import requests


def weasyprint_to_pdf(url: str, output_path: str):
    """
    Convert webpage to PDF using WeasyPrint.
    Note: JavaScript won't execute!
    """
    # Fetch HTML content (fail early on HTTP errors or a hung connection)
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    # Convert to PDF; base_url lets relative CSS and image links resolve
    HTML(string=response.text, base_url=url).write_pdf(output_path)


# Better use case: Local HTML with custom styling
def html_string_to_pdf(html_content: str, output_path: str):
    """Convert HTML string to PDF - ideal use case for WeasyPrint."""
    HTML(string=html_content).write_pdf(output_path)


# Example: Generate report from data
html_report = """
<!DOCTYPE html>
<html>
<head>
    <style>
        body { font-family: Arial, sans-serif; }
        .header { background-color: #007bff; color: white; padding: 20px; }
        table { width: 100%; border-collapse: collapse; }
        td, th { border: 1px solid #ddd; padding: 8px; }
        th { background-color: #f2f2f2; }
    </style>
</head>
<body>
    <div class="header"><h1>Monthly Report</h1></div>
    <table>
        <tr><th>Metric</th><th>Value</th></tr>
        <tr><td>Users</td><td>1,234</td></tr>
        <tr><td>Revenue</td><td>$56,789</td></tr>
    </table>
</body>
</html>
"""
html_string_to_pdf(html_report, "report.pdf")
When to Choose WeasyPrint
✅ Best for:
- Converting local HTML you control
- Generating reports from templates
- Situations where JavaScript isn’t needed
- Lightweight PDF generation
- When you can’t install browser binaries
❌ Not ideal for:
- External websites with JavaScript
- Modern web applications
- Complex CSS animations
- Sites requiring authentication
4. pdfkit/wkhtmltopdf: Middle Ground
Overview
pdfkit is a Python wrapper around wkhtmltopdf, which uses WebKit to render HTML. It’s lighter than full browsers but more capable than WeasyPrint.
Basic Example
import pdfkit


def pdfkit_to_pdf(url: str, output_path: str):
    """Convert webpage to PDF using pdfkit/wkhtmltopdf."""
    options = {
        'page-size': 'A4',
        'margin-top': '0.75in',
        'margin-right': '0.75in',
        'margin-bottom': '0.75in',
        'margin-left': '0.75in',
        'encoding': "UTF-8",
        'enable-local-file-access': None,
        'no-outline': None
    }
    pdfkit.from_url(url, output_path, options=options)


# Example with HTML string
def pdfkit_from_html(html_content: str, output_path: str):
    """Convert HTML string to PDF."""
    pdfkit.from_string(html_content, output_path)
DevContainer Setup
# In your Dockerfile (for a devcontainer.json postCreateCommand, drop RUN and prefix the commands with sudo)
RUN apt-get update && apt-get install -y wkhtmltopdf
When to Choose pdfkit
✅ Best for:
- Simple websites without heavy JavaScript
- When Playwright is too heavy
- Static HTML with moderate CSS
- Quick prototyping
❌ Not ideal for:
- Modern SPAs (Single Page Applications)
- Sites with complex JavaScript
- Long-term projects (wkhtmltopdf is no longer actively maintained)
5. ReportLab + Requests: Manual Approach
Overview
ReportLab is a powerful PDF generation library, but converting arbitrary webpages requires parsing HTML yourself—generally not recommended except for specific use cases.
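To give a sense of what "parsing HTML yourself" involves, here is a minimal sketch that uses only the standard library's html.parser to pull headings and paragraphs out of a page as (tag, text) pairs. A ReportLab script would then have to lay out each pair itself. The class and function names are illustrative assumptions, and the set of tags handled is deliberately tiny; nested block tags, tables, images, and CSS are all ignored, which is exactly why this route loses visual fidelity.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class BlockExtractor(HTMLParser):
    """Collect (tag, text) pairs for a few block-level tags."""

    BLOCK_TAGS = {"h1", "h2", "h3", "p", "li"}

    def __init__(self) -> None:
        super().__init__()
        self.blocks: List[Tuple[str, str]] = []
        self._current: Optional[str] = None
        self._buffer: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCK_TAGS:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        # Inline tags such as <b> are ignored, but their text is kept
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            # Normalise whitespace before storing the block
            text = " ".join("".join(self._buffer).split())
            if text:
                self.blocks.append((tag, text))
            self._current = None


def extract_blocks(html: str) -> List[Tuple[str, str]]:
    parser = BlockExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks


print(extract_blocks("<h1>Report</h1><p>Total <b>users</b>: 1,234</p>"))
# → [('h1', 'Report'), ('p', 'Total users: 1,234')]
```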
When to Consider This Approach
✅ Best for:
- Generating PDFs from structured data (not webpages)
- Custom PDF layouts with precise control
- When HTML is just a data transport format
❌ Not ideal for:
- Converting actual webpages
- Anything that requires visual fidelity
Practical Comparison Matrix
| Solution | Setup Difficulty | Resource Usage | JS Support | CSS Fidelity | Maintenance | Container-Friendly |
|---|---|---|---|---|---|---|
| Playwright | Medium | High | ✅ Excellent | ✅ Excellent | ✅ Active | ✅ Yes |
| Selenium | Medium-High | High | ✅ Excellent | ✅ Excellent | ✅ Active | ✅ Yes |
| WeasyPrint | Low | Low | ❌ None | ⚠️ Good | ✅ Active | ✅ Yes |
| pdfkit | Low | Medium | ⚠️ Limited | ⚠️ Good | ⚠️ Unmaintained | ✅ Yes |
| ReportLab | Low | Low | ❌ None | ❌ Manual | ✅ Active | ✅ Yes |
Complete DevContainer Example
Here’s a more complete example that combines these practices; treat it as a starting point to harden for your own environment:
.devcontainer/devcontainer.json:
{
  "name": "Python PDF Generation",
  "image": "mcr.microsoft.com/devcontainers/python:3.11",
  "features": {
    "ghcr.io/devcontainers/features/git:1": {}
  },
  "postCreateCommand": "pip install -r requirements.txt && playwright install --with-deps chromium",
  "customizations": {
    "vscode": {
      "extensions": [
        "ms-python.python",
        "ms-python.vscode-pylance"
      ]
    }
  }
}
requirements.txt:
playwright==1.40.0
requests==2.31.0
python-dotenv==1.0.0
pdf_generator.py:
#!/usr/bin/env python3
"""
Production-ready webpage to PDF converter for DevContainer environments.
"""
import logging
from pathlib import Path
from typing import Optional

from playwright.sync_api import sync_playwright
from dotenv import load_dotenv

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)


class WebPageToPDF:
    """Convert webpages to PDF using Playwright."""

    def __init__(self, headless: bool = True):
        """Initialize the converter."""
        self.headless = headless
        self.playwright = None
        self.browser = None

    def __enter__(self):
        """Context manager entry."""
        self.playwright = sync_playwright().start()
        self.browser = self.playwright.chromium.launch(headless=self.headless)
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        """Context manager exit."""
        if self.browser:
            self.browser.close()
        if self.playwright:
            self.playwright.stop()

    def convert(
        self,
        url: str,
        output_path: str,
        wait_for_selector: Optional[str] = None,
        wait_timeout: int = 30000,
        format: str = 'A4',
        landscape: bool = False
    ) -> bool:
        """
        Convert a webpage to PDF.

        Args:
            url: URL to convert
            output_path: Path for output PDF
            wait_for_selector: Optional CSS selector to wait for
            wait_timeout: Timeout in milliseconds
            format: Paper format (A4, Letter, etc.)
            landscape: Whether to use landscape orientation

        Returns:
            True if successful, False otherwise
        """
        page = None
        try:
            logger.info(f"Converting {url} to PDF")
            page = self.browser.new_page()
            # Navigate to URL
            page.goto(url, wait_until='networkidle', timeout=wait_timeout)
            # Wait for specific content if requested
            if wait_for_selector:
                page.wait_for_selector(wait_for_selector, timeout=wait_timeout)
            # Generate PDF
            page.pdf(
                path=output_path,
                format=format,
                landscape=landscape,
                print_background=True,
                margin={'top': '1cm', 'right': '1cm', 'bottom': '1cm', 'left': '1cm'}
            )
            logger.info(f"PDF saved to {output_path}")
            return True
        except Exception as e:
            logger.error(f"Error converting {url}: {e}")
            return False
        finally:
            # Close the page even when navigation or PDF export fails
            if page is not None:
                page.close()


def main():
    """Example usage."""
    load_dotenv()
    # URLs to convert
    urls = [
        "https://example.com",
        "https://news.ycombinator.com"
    ]
    output_dir = Path("output")
    output_dir.mkdir(exist_ok=True)
    # Convert pages
    with WebPageToPDF() as converter:
        for i, url in enumerate(urls):
            output_path = output_dir / f"page_{i}.pdf"
            converter.convert(url, str(output_path))


if __name__ == "__main__":
    main()
Best Practices
1. Error Handling
Always implement comprehensive error handling:
import logging

# Alias Playwright's TimeoutError so it does not shadow the built-in one
from playwright.sync_api import sync_playwright, TimeoutError as PlaywrightTimeoutError

logger = logging.getLogger(__name__)


def robust_pdf_generation(url: str, output_path: str) -> bool:
    """Generate PDF with proper error handling."""
    try:
        with sync_playwright() as p:
            browser = p.chromium.launch(headless=True)
            page = browser.new_page()
            try:
                page.goto(url, wait_until='networkidle', timeout=30000)
                page.pdf(path=output_path, format='A4', print_background=True)
                return True
            except PlaywrightTimeoutError:
                logger.error(f"Timeout loading {url}")
                return False
            except Exception as e:
                logger.error(f"Error generating PDF: {e}")
                return False
            finally:
                browser.close()
    except Exception as e:
        logger.error(f"Failed to initialize browser: {e}")
        return False
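Timeouts are often transient, so a function that reports success as a boolean pairs naturally with a retry wrapper. The following is a minimal sketch of retrying with exponential backoff; the name retry, its parameters, and the injectable sleep function (there to make it testable) are assumptions for illustration.

```python
import time
from typing import Callable


def retry(operation: Callable[[], bool], attempts: int = 3,
          base_delay: float = 1.0,
          sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call `operation` until it returns True or attempts run out.

    The wait doubles after each failure: base_delay, 2x, 4x, ...
    """
    for attempt in range(attempts):
        if operation():
            return True
        # No point sleeping after the final failed attempt
        if attempt < attempts - 1:
            sleep(base_delay * (2 ** attempt))
    return False
```

With the function above, usage could look like retry(lambda: robust_pdf_generation(url, "out.pdf")). Retrying only helps with transient failures; a wrong selector or a 404 will fail every time.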
2. Resource Management
Clean up resources properly:
import atexit
from playwright.sync_api import sync_playwright


class ManagedBrowser:
    """Browser instance with automatic cleanup."""

    def __init__(self):
        self.playwright = sync_playwright().start()
        self.browser = self.playwright.chromium.launch(headless=True)
        atexit.register(self.cleanup)

    def cleanup(self):
        """Cleanup resources."""
        if self.browser:
            self.browser.close()
        if self.playwright:
            self.playwright.stop()
3. Configuration
Use environment variables and configuration files:
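import os
from dataclasses import dataclass, field


@dataclass
class PDFConfig:
    """PDF generation configuration.

    Environment variables are read when an instance is created, not when
    the module is imported, so values loaded later (e.g. via dotenv) apply.
    """
    format: str = field(default_factory=lambda: os.getenv('PDF_FORMAT', 'A4'))
    landscape: bool = field(
        default_factory=lambda: os.getenv('PDF_LANDSCAPE', 'false').lower() == 'true'
    )
    print_background: bool = True
    margin_top: str = '1cm'
    margin_right: str = '1cm'
    margin_bottom: str = '1cm'
    margin_left: str = '1cm'
    timeout: int = field(default_factory=lambda: int(os.getenv('PDF_TIMEOUT', '30000')))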
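A configuration object is only useful once it is turned into the arguments the PDF call expects. The sketch below shows one way to map settings onto the keyword arguments used with page.pdf() elsewhere in this guide (path, format, landscape, print_background, margin). The trimmed-down PdfSettings class and the helper to_pdf_kwargs are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class PdfSettings:
    """Hypothetical, trimmed-down settings object for this sketch."""
    format: str = "A4"
    landscape: bool = False
    print_background: bool = True
    margin: str = "1cm"  # one value applied to all four sides


def to_pdf_kwargs(settings: PdfSettings, output_path: str) -> Dict[str, Any]:
    """Build the keyword arguments for a page.pdf(**kwargs) call."""
    sides = ("top", "right", "bottom", "left")
    return {
        "path": output_path,
        "format": settings.format,
        "landscape": settings.landscape,
        "print_background": settings.print_background,
        "margin": {side: settings.margin for side in sides},
    }


kwargs = to_pdf_kwargs(PdfSettings(landscape=True), "report.pdf")
# With a Playwright page this would be used as: page.pdf(**kwargs)
```

Keeping the mapping in one function means the rest of the code never has to know which option names the PDF backend uses.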
4. Testing
Include tests for your PDF generation:
from pdf_generator import WebPageToPDF


def test_pdf_generation(tmp_path):
    """Test basic PDF generation (requires network access and Chromium)."""
    output_path = tmp_path / "test.pdf"
    with WebPageToPDF() as converter:
        success = converter.convert(
            "https://example.com",
            str(output_path)
        )
    assert success
    assert output_path.exists()
    assert output_path.stat().st_size > 0
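A non-empty file is not necessarily a PDF; an error page saved under a .pdf name would pass the size check above. PDF files begin with the header %PDF-, so a slightly stronger assertion can check those first bytes. This is a small sketch using only the standard library; the helper name looks_like_pdf is an assumption.

```python
from pathlib import Path


def looks_like_pdf(path: Path) -> bool:
    """Return True if the file starts with the PDF header bytes."""
    with open(path, "rb") as handle:
        return handle.read(5) == b"%PDF-"
```

In the test above this could be added as assert looks_like_pdf(output_path). It only validates the header, not that the document is complete or renders correctly.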
Performance Optimization
Parallel Processing
For bulk PDF generation:
import logging
from concurrent.futures import ThreadPoolExecutor, as_completed

from playwright.sync_api import sync_playwright

logger = logging.getLogger(__name__)


def parallel_pdf_generation(urls: list[str], max_workers: int = 4):
    """Generate PDFs in parallel."""

    def convert_single(url: str, index: int):
        # Each thread starts its own Playwright instance; sync API objects
        # must not be shared across threads.
        with sync_playwright() as p:
            browser = p.chromium.launch(headless=True)
            page = browser.new_page()
            page.goto(url, wait_until='networkidle')
            page.pdf(path=f"output_{index}.pdf", format='A4')
            browser.close()
        return index, url

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {
            executor.submit(convert_single, url, i): (i, url)
            for i, url in enumerate(urls)
        }
        for future in as_completed(futures):
            index, url = future.result()
            logger.info(f"Completed: {url}")
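In the loop above, future.result() re-raises the first exception from any worker, which aborts the reporting for the rest of the batch. One possible variation is to record failures per URL and keep going. The sketch below is independent of Playwright: worker is any callable that takes a URL and raises on failure, and the name run_all is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, List, Tuple


def run_all(urls: List[str], worker: Callable[[str], None],
            max_workers: int = 4) -> Tuple[List[str], Dict[str, str]]:
    """Run `worker` for every URL; return (succeeded, {url: error message})."""
    succeeded: List[str] = []
    failed: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to its URL so errors can be attributed
        futures = {executor.submit(worker, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                future.result()
                succeeded.append(url)
            except Exception as exc:
                failed[url] = str(exc)
    return succeeded, failed
```

The caller can then log or retry only the entries in the failed mapping instead of rerunning the whole batch.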
Caching
Cache PDFs to avoid regeneration:
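import hashlib
import logging
from pathlib import Path

from pdf_generator import WebPageToPDF

logger = logging.getLogger(__name__)


def get_cache_path(url: str, cache_dir: Path) -> Path:
    """Generate cache path from URL."""
    # MD5 is used only as a filename key here, not for security
    url_hash = hashlib.md5(url.encode()).hexdigest()
    return cache_dir / f"{url_hash}.pdf"


def cached_pdf_generation(url: str, cache_dir: Path) -> Path:
    """Generate PDF with caching."""
    cache_path = get_cache_path(url, cache_dir)
    if cache_path.exists():
        logger.info(f"Using cached PDF for {url}")
        return cache_path
    # Generate new PDF
    cache_dir.mkdir(parents=True, exist_ok=True)
    with WebPageToPDF() as converter:
        # convert() returns False on failure; don't hand back a missing file
        if not converter.convert(url, str(cache_path)):
            raise RuntimeError(f"Failed to generate PDF for {url}")
    return cache_path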
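A cache keyed only on the URL never expires, so a page that changes will keep returning its old PDF. One simple option is to treat a cached file as valid only for a limited time, based on its modification time. This is a minimal sketch; the helper name is_cache_fresh and the idea of a fixed maximum age are assumptions, and the right age depends on how often the source pages change.

```python
import time
from pathlib import Path
from typing import Optional


def is_cache_fresh(path: Path, max_age_seconds: float,
                   now: Optional[float] = None) -> bool:
    """Return True if `path` exists and was modified recently enough.

    `now` can be injected for testing; it defaults to the current time.
    """
    if not path.exists():
        return False
    current = time.time() if now is None else now
    age = current - path.stat().st_mtime
    return age <= max_age_seconds
```

In a caching function, the cache_path.exists() check could be replaced by something like is_cache_fresh(cache_path, 24 * 3600) to regenerate PDFs older than a day.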
Troubleshooting Common Issues
Issue 1: Browser Installation Fails
# Install system dependencies manually
apt-get update
apt-get install -y \
    libnss3 \
    libnspr4 \
    libatk1.0-0 \
    libatk-bridge2.0-0 \
    libcups2 \
    libdrm2 \
    libdbus-1-3 \
    libxkbcommon0 \
    libxcomposite1 \
    libxdamage1 \
    libxfixes3 \
    libxrandr2 \
    libgbm1 \
    libasound2
# Then install Playwright browsers
playwright install chromium
Issue 2: Fonts Missing
# Install common fonts
apt-get install -y \
fonts-liberation \
fonts-noto-color-emoji \
fonts-noto-cjk
Issue 3: Timeout Errors
# Increase timeout and use network idle strategy
page.goto(url, wait_until='networkidle', timeout=60000)
# Or wait for specific content
page.wait_for_load_state('domcontentloaded')
page.wait_for_selector('.content-loaded', timeout=30000)
Conclusion
For Python developers working in DevContainers who need to convert webpages to PDF, Playwright is the clear winner for most use cases. It provides:
- Excellent rendering of modern web content
- Native PDF generation capabilities
- Clean, intuitive API
- Strong DevContainer support
- Active development and maintenance
Quick Decision Guide:
- Need modern web app support? → Use Playwright
- Already using Selenium? → Stick with Selenium (or migrate gradually)
- Simple HTML you control? → Use WeasyPrint
- Need ultra-lightweight solution? → Use pdfkit (but be aware it’s unmaintained)
- Generating custom PDFs from data? → Use ReportLab (not for webpages)
The examples in this guide are starting points that can be adapted to your specific needs. Start with the basic Playwright example and add features as your requirements grow.
Additional Resources
- Playwright Documentation
- Playwright GitHub
- WeasyPrint Documentation
- Selenium Documentation
- DevContainer Specification
About This Guide
This guide focuses on practical, production-ready solutions for converting webpages to PDF in containerized environments. All code examples have been tested in DevContainers and follow Python best practices for error handling, resource management, and maintainability.