Parsing HTML with PowerShell without a browser object

In a recent post I wrote about using PowerShell to execute actions against a browser object. That is a great option if you are working with HTML and need to interact with the page. By creating a browser object you can get HTML elements easily and, more importantly, work with them interactively. In a more recent scenario I had to migrate HTML content to Office 365. This content consisted of several hundred pages. All those pages were listed in a single menu and shared a simple HTML structure. Instead of buying a tool, we decided to write a simple script that could strip the pages we needed, after which we could create them in Office 365. We did it ourselves because the HTML structure was so simple that it only took a few hours to write the script.

To strip the pages we started by creating a new browser object, as before. However, when we ran the script on a client machine we found that creating a browser object required administrator privileges, which might not always be an option. Without those privileges we could not execute the following statements.

# Create an Internet Explorer COM object, show the window and navigate to $url
$ie = New-Object -ComObject InternetExplorer.Application
$ie.Visible = $true
$ie.Navigate2($url)

We did, however, need the getElementById method to extract our HTML. Luckily, the Invoke-WebRequest cmdlet offers similar logic to the browser object. Invoke-WebRequest returns a web response with a ParsedHtml property, and you can call getElementById on that property.

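# Request the page with the current user's credentials; the response exposes the parsed document
$response = Invoke-WebRequest $url -UseDefaultCredentials
$menu = $response.ParsedHtml.getElementById('menu')
$content = $response.ParsedHtml.getElementById('bodyContent').innerHTML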
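
When you run this against several hundred pages, it helps to wrap the lookup in a small function that copes with pages where the element is missing. The following is only a sketch of how that could look, not the script we actually used. The function name Get-ElementHtml and its parameters are made up for illustration. It assumes Windows PowerShell, since the ParsedHtml property is not available in PowerShell 6 and later.

```powershell
# Hypothetical helper (name and parameters are illustrative, not from the original script).
# Assumes Windows PowerShell, where Invoke-WebRequest returns a ParsedHtml property.
function Get-ElementHtml {
    [CmdletBinding()]
    param(
        [Parameter(Mandatory)] [string] $Url,
        [Parameter(Mandatory)] [string] $ElementId
    )

    # Request the page as the current user, as in the snippet above.
    $response = Invoke-WebRequest -Uri $Url -UseDefaultCredentials

    # ParsedHtml is missing on PowerShell 6+ or when basic parsing was used.
    if ($null -eq $response.ParsedHtml) {
        throw "No ParsedHtml available for $Url."
    }

    $element = $response.ParsedHtml.getElementById($ElementId)

    # A missing element may come back as $null or as DBNull from the COM layer.
    if ($null -eq $element -or $element -is [System.DBNull]) {
        Write-Warning "Element '$ElementId' not found on $Url."
        return $null
    }

    return $element.innerHTML
}

# Example call (the URL is whatever page you are stripping):
# $content = Get-ElementHtml -Url $url -ElementId 'bodyContent'
```

Returning $null with a warning, instead of throwing, lets a loop over many pages continue and leaves it to you to review the pages that were skipped.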

So if you are stripping or reading HTML and cannot use the browser COM object, you can resort to Invoke-WebRequest and use its ParsedHtml property to retrieve the HTML.
