PSParseHTML – Parse HTML PowerShell Module

PSParseHTML

PSParseHTML started as a suite of data processing cmdlets to help PSWriteHTML, but it has gained enough functionality to be its own module. Basic usage instructions are described in this blog post.

PSParseHTML exposes a suite of PowerShell cmdlets that let you parse, format and optimise web resources right from the shell. The module currently ships with the following cmdlets:

Processing of HTML/CSS/JS

  • Convert-HTMLToText - convert markup to plain text
  • ConvertFrom-HtmlTable - turn table elements into objects (supports rowspan/colspan)
  • ConvertFrom-HTMLAttributes - extract elements by tag, class, id or name (aliases: ConvertFrom-HTMLTag, ConvertFrom-HTMLClass)
  • ConvertFrom-HTML - parse full documents or fragments
  • Format-CSS - pretty‑print style sheets
  • Format-HTML - tidy up HTML markup
  • Format-JavaScript - beautify JavaScript (Format-JS alias, supports jsbeautifier options like -IndentSize and -BraceStyle)
  • Optimize-CSS - minify style sheets
  • Optimize-Email - inline CSS for email bodies (can download linked stylesheets)
  • Optimize-HTML - minify HTML
  • Optimize-JavaScript - minify JavaScript

Browsing

  • Invoke-HTMLRendering - render pages with a headless browser (supports basic and form authentication)
    • Open-HTMLSession is an alias to allow for nicer cmdlet naming alignment under different conditions
    • Start-HTMLSession is an alias to allow for nicer cmdlet naming alignment under different conditions
  • Save-HTMLScreenshot - capture page screenshots (supports -Full, -Open, -ElementSelector, clipping parameters, -Delay, -Selector, -HighlightSelector, -Format, -Quality and -OverlayText)
  • -Visible and -SlowMo parameters let you see the browser or slow down execution
  • -UserAgent, -ViewportWidth, -ViewportHeight and -DeviceScaleFactor allow customizing the browser context
  • Save-HTMLPdf - generate a PDF of a rendered page
  • Start-HTMLVideoRecording/Stop-HTMLVideoRecording - record browser sessions to a .webm video file
    • supports the same authentication options as Invoke-HTMLRendering so you can capture the login process
    • can start recording from an existing session to capture later steps
    • the temporary Playwright video file is cleaned up automatically when stopping
  • Start-HTMLTracing/Stop-HTMLTracing - record Playwright traces for debugging
  • Save-HTMLHar - export captured network traffic to a HAR file
  • Save-HTMLAttachment - download files discovered on a rendered page (optionally filter with -Filter, alias: Save-HTMLDownload)
  • Invoke-HTMLNavigation - navigate an existing session to another URL
  • Invoke-HTMLScript - run custom JavaScript against a session
  • Invoke-HTMLDomScript - run JavaScript with AngleSharp without a browser
  • Get-HTMLInteractable - list clickable elements from a session, URL or file
  • Register-HTMLRoute/Unregister-HTMLRoute - intercept requests to block or mock responses
  • Get-HTMLCookie - retrieve cookies from the active session
  • Get-HTMLNetworkLog - view captured network requests and responses
  • Set-HTMLCookie - add cookies to the active session
  • Export-HTMLSession/Import-HTMLSession - save and restore cookies and storage state
  • Close-HTMLSession - dispose an active browser session (Stop-HTMLSession alias)

All cmdlets that work with files accept a -Path parameter (or alias) in addition to -File. Relative, absolute and UNC paths are resolved to full paths automatically.
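As a minimal sketch of the sentence above, the calls below point Format-CSS at the same file through a relative path, an absolute path, and the -File parameter. The temporary folder and the style.css content are hypothetical, and the assumption that -File and -Path are interchangeable on Format-CSS is taken from the statement above rather than from the cmdlet's help.

```powershell
# Assumes PSParseHTML is installed. The demo folder and file are created here for illustration only.
Import-Module PSParseHTML

$dir = Join-Path -Path ([System.IO.Path]::GetTempPath()) -ChildPath 'psparsehtml-demo'
New-Item -ItemType Directory -Path $dir -Force | Out-Null
$absolute = Join-Path -Path $dir -ChildPath 'style.css'
Set-Content -Path $absolute -Value 'body{margin:0;color:#333}'

Push-Location $dir
try {
    # Relative path, resolved against the current location
    $fromRelative = Format-CSS -Path '.\style.css'

    # Absolute path to the same file
    $fromAbsolute = Format-CSS -Path $absolute

    # Same file through the -File parameter
    $fromFile = Format-CSS -File $absolute
}
finally {
    Pop-Location
}

# All three calls read the same file, so the formatted text should match
($fromRelative | Out-String) -eq ($fromAbsolute | Out-String)
($fromAbsolute | Out-String) -eq ($fromFile | Out-String)
```

If the two comparisons do not both return True, check Get-Help Format-CSS to see which parameter names and aliases your installed version actually exposes.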

Cmdlet quick start

# Convert an entire file to plain text
Convert-HTMLToText -Path '.\report.html'

# Extract all <a> tags with a specific class
ConvertFrom-HTMLAttributes -Path '.\site.html' -Class 'promo'

# Parse a snippet of markup
$doc = ConvertFrom-HTML -Content '<div>Hello</div>'

# Format a CSS style sheet
Format-CSS -Path '.\style.css'

# Beautify an HTML fragment
Format-HTML -Content $html

# Format a JavaScript file
Format-JavaScript -Path '.\script.js'

# Format JavaScript with custom options
Format-JavaScript -Content $js -IndentSize 2 -BraceStyle Expand

# Minify a CSS file
Optimize-CSS -Path '.\style.css'

# Inline CSS in an email body (fetch linked stylesheets)
Optimize-Email -Body $html -UseEmailFormatter -DownloadRemoteCss

# Minify an HTML file
Optimize-HTML -Path '.\page.html'

# Minify JavaScript and save to a new file
Optimize-JavaScript -Path '.\app.js' -OutputFile '.\app.min.js'

# Run JavaScript on markup without a browser
Invoke-HTMLDomScript -Content '<div id="demo">Hi</div>' -Script 'document.getElementById("demo").textContent'

# Render a protected page using credentials
$cred = Get-Credential
Invoke-HTMLRendering -Url 'https://example.com' -Credential $cred

# Login using a form and render the target page
Invoke-HTMLRendering -Url 'https://example.com/protected' `
    -Credential $cred `
    -LoginUrl 'https://example.com/wp-login.php' `
    -UsernameSelector 'input[name=log]' `
    -PasswordSelector 'input[name=pwd]' `
    -SubmitSelector '#wp-submit'

# Capture screenshots: full page, a single element, and a highlighted element with overlay text
Save-HTMLScreenshot -Url 'https://example.com' -OutFile .\page.jpg -Full -Open -Delay 1000 -Selector '#content' -Format Jpeg -Quality 80
Save-HTMLScreenshot -Url 'https://example.com' -OutFile .\header.gif -ElementSelector 'header' -Format Gif
Save-HTMLScreenshot -Url 'https://example.com/login' -OutFile .\login.jpg -HighlightSelector 'form#login' -OverlayText 'Login Form' -Format Jpeg -Quality 90

# Save a rendered page as a PDF
Save-HTMLPdf -Url 'https://example.com' -OutFile .\page.pdf -PrintBackground

# Download files discovered on a page, optionally filtered
Save-HTMLAttachment -Url 'https://github.com/user/repo/releases/tag/v1.0' -Path '.\Downloads'
Save-HTMLAttachment -Url 'https://github.com/user/repo/releases/tag/v1.0' -Path '.\Downloads' -Filter '.zip'

# Keep a logged in browser session and reuse it
$session = Start-HTMLSession -Url 'https://example.com/protected' `
    -Credential $cred `
    -LoginUrl 'https://example.com/login' `
    -UsernameSelector 'input[name=user]' `
    -PasswordSelector 'input[name=pass]' `
    -SubmitSelector 'button[type=submit]'
Save-HTMLScreenshot -Session $session -OutFile '.\secure.bmp' -Selector '#content' -Format Bmp
Invoke-HTMLNavigation -Session $session -Url 'https://example.com/downloads' |
    Save-HTMLScreenshot -OutFile '.\downloads.gif' -Format Gif
Save-HTMLAttachment -Session $session -Path '.\Downloads'

# Record a trace and export HAR (uses the session created above)
Start-HTMLTracing -Session $session
Invoke-HTMLNavigation -Session $session -Url 'https://example.com/profile'
Stop-HTMLTracing -Session $session -OutFile '.\trace.zip'
Save-HTMLHar -Session $session -OutFile '.\traffic.har'

# Mock an API response
$handler = Register-HTMLRoute -Session $session -Pattern '*api/data' -ScriptBlock {
    param($route)
    $route.FulfillAsync([Microsoft.Playwright.RouteFulfillOptions]@{ Status = 200; ContentType = 'application/json'; Body = '{"ok":true}' }) | Out-Null
}
Invoke-HTMLNavigation -Session $session -Url 'https://example.com/api/data'
Unregister-HTMLRoute -Session $session -Pattern '*api/data' -Handler $handler

# Review captured network requests and responses
$log = Get-HTMLNetworkLog -Session $session
$log | Format-Table Url, Status

# Save the session state, restore it later, then close the session
Export-HTMLSession -Session $session -Path 'state.json'
$session = Import-HTMLSession -Path 'state.json' -Url 'https://example.com/protected'
Close-HTMLSession -Session $session

The first run triggers Playwright to download the selected browser. You'll see output similar to:

Downloading Chromium 136.0.7103.25 (playwright build v1169) from https://...
144.4 MiB [====================] 100% 0.0s
Chromium 136.0.7103.25 (playwright build v1169) downloaded to C:\Users\USER\AppData\Local\ms-playwright\chromium-1169

Use -Clean to clear the ms-playwright cache and re-download the runtime if needed. Use -Full to capture the entire document, -Open to automatically view the image, specify -Delay or -Selector to wait for dynamic content, or supply -X, -Y, -Width and -Height to grab a specific region. Use -Visible to see the browser window and -SlowMo to slow down actions for easier debugging. Supply -UserAgent, -ViewportWidth, -ViewportHeight or -DeviceScaleFactor to emulate different devices.
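The options in the paragraph above can be combined in one call. The following is a sketch only: the parameter names are taken from that paragraph, but the values, their units (pixels for the region and viewport, milliseconds for -Delay and -SlowMo), the output file name, and the assumption that Save-HTMLScreenshot accepts all of them together are not confirmed by this README.

```powershell
# Assumes PSParseHTML is installed and Playwright can download or find its browser.
Import-Module PSParseHTML

# Splatting keeps a long parameter list readable.
# Values and units below are assumptions chosen for illustration.
$shot = @{
    Url               = 'https://example.com'
    OutFile           = '.\region.png'   # hypothetical output file
    X                 = 0                # top-left corner of the region
    Y                 = 0
    Width             = 800              # size of the region
    Height            = 400
    Delay             = 1000             # wait for dynamic content
    ViewportWidth     = 1280             # emulated device settings
    ViewportHeight    = 720
    DeviceScaleFactor = 2
    Visible           = $true            # show the browser window
    SlowMo            = 250              # slow actions down for debugging
}

Save-HTMLScreenshot @shot
```

Run Get-Help Save-HTMLScreenshot -Parameter * to confirm which of these parameters your installed version supports before relying on the combination.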

The expected input is a string literal or data read from a file. The output can be PowerShell objects (classes are HtmlNode or AngleSharp.Html.Dom.HtmlElement depending on the selected engine) or strings written to stdout.
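To see which object type you are actually getting back, you can inspect the output directly. This is a small sketch that uses only ConvertFrom-HTML -Content as shown in the quick start; it makes no assumption about which engine is the default or how an engine is selected.

```powershell
# Assumes PSParseHTML is installed.
Import-Module PSParseHTML

# Parse a small fragment with whatever engine the module uses by default
$nodes = ConvertFrom-HTML -Content '<div id="demo"><p>Hello</p></div>'

# Show the .NET type of each returned object
$nodes | ForEach-Object { $_.GetType().FullName }

# List some of the properties available on those objects
$nodes | Get-Member -MemberType Property | Select-Object -First 10 -Property Name, Definition
```

Knowing the concrete type matters when you pipe the result onward, because property names differ between the two object models.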

It may not seem like much, but these cmdlets are powerful enough to enable robust HTML processing in the shell.

Examples

Additional sample scripts are available in the Examples directory. See Example-PlaywrightNetwork.ps1 for tracing and HAR capture.

# Parse tables from a web page (rowspan and colspan handled automatically)
$tables = ConvertFrom-HtmlTable -Url 'https://en.wikipedia.org/wiki/PowerShell'
$tables[1] | Format-Table -AutoSize
# Sample output
# Sequence Meaning
# -------- -------
# `0       Null
# `a       Alert
# `b       Backspace
# `e       Escape (since PowerShell 6)
# `f       Form feed

# Inline CSS in an e-mail body and pretty print the result
$html = Optimize-Email -Body $body -RemoveComments
Format-HTML -Content $html

# Minify JavaScript from a file
Optimize-JavaScript -Path './script.js' -OutputFile './script.min.js'

# Convert HTML file to plain text
Get-Content './report.html' -Raw | Convert-HTMLToText

# Extract all product entries
$markup = Get-Content './catalog.html' -Raw
ConvertFrom-HTMLAttributes -Content $markup -Class 'product'

More sample scripts are in the Examples folder, including Example-FormatJavaScriptAdvanced.ps1 which demonstrates custom BeautifierOptions.

Installation

Install from PSGallery

Install-Module -Name PSParseHTML -AllowClobber -Force

-Force and -AllowClobber aren't necessary, but they let the installation proceed past errors that might otherwise appear.

Update from PSGallery

Update-Module -Name PSParseHTML

That's it. Whenever there's a new version, you simply run the Update-Module command and enjoy. Remember that you may need to close and reopen your PowerShell session if you had used the module prior to updating it.

As usual, remember that module updates may break your scripts: if your scripts work for you in production, retain those versions until you test new versions in a dev environment. Even a small change on my side can be enough for an automated update to break your scripts. For example, I might rename a parameter and, boom, your code stops working! Be responsible!
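One way to follow that advice is to pin the version you have tested, using the standard PowerShellGet and Import-Module parameters. This is a sketch: the version number below is a placeholder, not a known PSParseHTML release, so substitute the one you have validated.

```powershell
# '1.0.0' is a placeholder version for illustration; replace it with the release you tested.
$testedVersion = '1.0.0'

# See which versions are already installed on this machine
Get-Module -Name PSParseHTML -ListAvailable | Select-Object -Property Name, Version

# Install exactly the tested version; other installed versions stay side by side
Install-Module -Name PSParseHTML -RequiredVersion $testedVersion -Scope CurrentUser

# Load that exact version in production scripts instead of whatever is newest
Import-Module -Name PSParseHTML -RequiredVersion $testedVersion
```

In a script file, a `#Requires -Modules @{ ModuleName = 'PSParseHTML'; RequiredVersion = '1.0.0' }` line at the top gives the same guarantee and fails fast when the pinned version is missing.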

3rd party references

This module utilizes several external dependencies to do its work. The authors of those libraries have done fantastic work; I've just added some PowerShell to the mix. All are distributed under permissive licenses. Refer to each project's repository for complete license information.

C# API overview

If you are writing your own .NET applications you can reference the compiled libraries directly. All classes live in the PSParseHTML namespace and expose methods equivalent to the cmdlets:

  • HtmlParser - ParseWithAngleSharp, ParseWithHtmlAgilityPack and table extraction helpers such as ParseTablesWithAngleSharpDetailed
  • HtmlParserExtensions - GetElements for quick element queries
  • HtmlFormatter - FormatHtml, FormatCss, FormatJavaScript
  • HtmlOptimizer - OptimizeHtml, OptimizeCss, OptimizeJavaScript
  • HtmlUtilities - ConvertToText to strip markup
  • PreMailerClient - methods like MoveCssInline and MoveCssInlineFromFile

These methods can be consumed from C# or directly from PowerShell. For example:

using System.IO;
using PSParseHTML;
// BeautifierOptions and BraceStyle also need a using directive for the jsbeautifier library's namespace

string html = File.ReadAllText("example.html");
var tables = HtmlParser.ParseTablesWithHtmlAgilityPack(html);
string pretty = HtmlFormatter.FormatHtml(html);

// Customize JavaScript formatting
string script = File.ReadAllText("script.js");
BeautifierOptions jsOpts = new BeautifierOptions { IndentSize = 2, BraceStyle = BraceStyle.Expand };
string prettyJs = HtmlFormatter.FormatJavaScript(script, jsOpts);

# PowerShell using the same API
$html = Get-Content '.\example.html' -Raw
[PSParseHTML.HtmlFormatter]::FormatHtml($html)

Both approaches yield identical results, so you can choose the most convenient tool for your workflow.
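A quick way to check that claim on your own input is to run both forms side by side. The sketch below uses only Format-HTML -Content and the static FormatHtml call shown above; the sample markup is made up, and whether the cmdlet's defaults match the method's defaults in your installed version is an assumption worth verifying.

```powershell
# Assumes PSParseHTML is installed.
Import-Module PSParseHTML

# Hypothetical sample markup
$html = '<html><body><div><p>Hello</p></div></body></html>'

# Cmdlet form
$viaCmdlet = Format-HTML -Content $html

# Static method form from the C# API
$viaApi = [PSParseHTML.HtmlFormatter]::FormatHtml($html)

# Normalize both results to a single trimmed string before comparing
($viaCmdlet | Out-String).Trim() -eq ($viaApi | Out-String).Trim()
```

The cmdlet form is usually more convenient in pipelines, while the static method avoids parameter binding overhead when called in a tight loop.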