PowerShell 5.0 Tutorial: Example-Driven Parsing using ConvertFrom-String

Not to be confused with ConvertFrom-StringData, a cmdlet available in previous versions of PowerShell, ConvertFrom-String in PowerShell 5.0 provides an easy way to parse complex text files using machine learning.

The ability to parse complex text files is one of PowerShell’s many strengths, but using the Substring() method and regular expressions can involve writing a lot of code to get the desired results. PowerShell 5.0’s ConvertFrom-String cmdlet has two modes that can be used to parse text: Basic Delimited Parsing and Auto-Generated Example-Driven Parsing.

Delimited parsing uses a character, such as a space or semicolon, to determine where data stops and starts:

Name,Email,Office
Russell Smith,[email protected],London
David Jones,[email protected],Manchester
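
As a rough sketch of delimited mode, the snippet below splits comma-separated rows into objects with named properties. The rows are a simplified, hypothetical version of the sample above (header and e-mail column left out), and the variable names are assumptions for illustration only.

```powershell
# Hypothetical sample rows: Name and Office only, no header line
$rows = @(
    'Russell Smith,London'
    'David Jones,Manchester'
)

# Split each row on the comma and name the resulting properties
$people = $rows | ConvertFrom-String -Delimiter ',' -PropertyNames Name, Office

# Show the parsed objects as a table
$people | Format-Table Name, Office
```

Without -PropertyNames, ConvertFrom-String falls back to generic property names such as P1 and P2.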

For more information on parsing strings using regular expressions, see PowerShell Problem Solver: PowerShell String Parsing with Regular Expressions on the Petri IT Knowledgebase. To learn how to use ConvertFrom-String Basic Delimited Parsing, take a look at Basic Delimited Parsing using ConvertFrom-String in PowerShell 5.0.

Auto-Generated Example-Driven Parsing makes it easy to parse more complex text files by supplying PowerShell with a template of how the data usually looks. For instance, the output of a ping command can be parsed by giving an example, in the form of a template, of how the output looks. From the template, PowerShell learns how to parse the ping command’s output using FlashExtract, a tool developed by Microsoft Research that automatically creates extraction programs from samples of highlighted data. It comes from the same line of research as Excel’s Flash Fill feature.

Auto-Generated Example-Driven Parsing

Let’s attempt to parse the output of a typical ping command. The -n parameter in the ping command below specifies the number of echo requests to send.

The output of the ping command (Image Credit: Russell Smith)
Now we'll create a template, from which ConvertFrom-String will learn how to parse the output:
$template = @'
Pinging www.google.com [74.125.232.51] with 32 bytes of data: 
Reply from {IP*:74.125.232.51}: bytes={Bytes:32} time={Time:2}ms TTL={Ttl:58}

Ping statistics for 74.125.232.51:     
  Packets: Sent = 4, Received = 4, Lost = 0 (0% loss), 
Approximate round trip times in milli-seconds:     
  Minimum = 2ms, Maximum = 20ms, Average = 6ms

'@

In the template above, I’ve used an asterisk to indicate the start of a new data sequence. The data that I want to extract from the text is wrapped in curly brackets. I’ve also added the variable names Bytes, Time, and Ttl to the template, although it’s not strictly necessary to add these names. They must be included, however, if I want a field’s variable name to differ from the label used in the ping command’s output.

If I wanted to use Latency instead of Time as a variable name, then the Reply line of the template would look like this:

Reply from {IP*:74.125.232.51}: bytes={Bytes:32} time={Latency:2}ms TTL={Ttl:58}

Although the output of the ping command consists of 10 echoes, I've only included one in the template because I'm hoping this is enough for PowerShell to learn how to parse the output. It's important that the template includes any spaces or other delimiting symbols that exist in the output you want to parse.

Let's try out the template and see if PowerShell is successfully able to parse the output:
Passing a template to ConvertFrom-String (Image Credit: Russell Smith)
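
As a rough sketch of this step, the snippet below hands ping output to ConvertFrom-String through its -TemplateContent parameter. The captured output in $pingOutput is a made-up stand-in (on a live Windows system it could come from something like ping www.google.com -n 10), and the variable names are assumptions rather than the ones used in the screenshot.

```powershell
# Hypothetical captured ping output in Windows format (values are invented)
$pingOutput = @'
Pinging www.google.com [74.125.232.51] with 32 bytes of data:
Reply from 74.125.232.51: bytes=32 time=2ms TTL=58
Reply from 74.125.232.51: bytes=32 time=15ms TTL=58
Reply from 74.125.232.51: bytes=32 time=3ms TTL=58
Reply from 74.125.232.51: bytes=32 time=20ms TTL=58

Ping statistics for 74.125.232.51:
  Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
  Minimum = 2ms, Maximum = 20ms, Average = 10ms
'@

# Template with a single example echo; {Name:value} marks a field and
# the asterisk marks the field that starts each new record
$template = @'
Pinging www.google.com [74.125.232.51] with 32 bytes of data:
Reply from {IP*:74.125.232.51}: bytes={Bytes:32} time={Time:2}ms TTL={Ttl:58}

Ping statistics for 74.125.232.51:
  Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
Approximate round trip times in milli-seconds:
  Minimum = 2ms, Maximum = 20ms, Average = 6ms
'@

# Let ConvertFrom-String learn from the template and extract the fields
$pingOutput |
    ConvertFrom-String -TemplateContent $template |
    Format-Table IP, Bytes, Time, Ttl
```

What comes back depends on what FlashExtract infers from the examples, so the result should be checked against the raw text rather than assumed to be complete.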
From the output of ConvertFrom-String, you can see that PowerShell hasn't quite managed to parse all the data sequences correctly, because sometimes network latency is higher than 2ms, and the Time field can go into double digits. To solve this problem, all we need to do is add more variation to the template so that ConvertFrom-String can learn better. We'll do that by adding another echo to the template but with milliseconds in double digits:
$template = @'
Pinging www.google.com [74.125.232.51] with 32 bytes of data: 
Reply from {IP*:74.125.232.51}: bytes={Bytes:32} time={Time:2}ms TTL={Ttl:58} 
Reply from {IP*:74.125.232.51}: bytes={Bytes:32} time={Time:15}ms TTL={Ttl:58}

Ping statistics for 74.125.232.51:     
  Packets: Sent = 4, Received = 4, Lost = 0 (0% loss), 
Approximate round trip times in milli-seconds:     
  Minimum = 2ms, Maximum = 20ms, Average = 6ms

'@

Now if we run the ping command again and pipe the output to ConvertFrom-String, hopefully we’ve provided a good enough example in the template that PowerShell will be able to parse the output correctly:

Passing a modified template to ConvertFrom-String (Image Credit: Russell Smith)
As you can see in the screenshot above, PowerShell was successfully able to parse the output of the ping command. Let's run ConvertFrom-String again, but this time we'll write the results to a variable ($ping) so that they can be further processed using PowerShell:
Processing the results of ConvertFrom-String (Image Credit: Russell Smith)
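
The kind of follow-up processing meant here can be sketched as below. To keep the snippet self-contained, $ping is filled with hand-built stand-in objects that mimic the IP, Bytes, Time, and Ttl fields of the template; the values and the 10 ms threshold are invented for illustration.

```powershell
# Stand-in for the records ConvertFrom-String would return (hypothetical values)
$ping = @(
    [pscustomobject]@{ IP = '74.125.232.51'; Bytes = 32; Time = 2;  Ttl = 58 }
    [pscustomobject]@{ IP = '74.125.232.51'; Bytes = 32; Time = 15; Ttl = 58 }
    [pscustomobject]@{ IP = '74.125.232.51'; Bytes = 32; Time = 20; Ttl = 58 }
)

# Summarise latency across all echoes
$stats = $ping | Measure-Object -Property Time -Average -Maximum -Minimum

# Keep only the slower replies (the 10 ms threshold is arbitrary)
$slow = $ping | Where-Object { [int]$_.Time -gt 10 }

$stats | Format-List Minimum, Maximum, Average
$slow  | Format-Table IP, Time
```

Because the parsed fields arrive as object properties, the usual pipeline cmdlets such as Where-Object, Sort-Object, and Measure-Object apply without further text handling.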

Parsing HTML

Here's another example of how to use auto-generated example-driven parsing. This time, we'll provide ConvertFrom-String with a data file (data.html) in HTML format:
Now create a template to parse the above HTML file. I'm going to start by providing just one table row and hope that it's enough for PowerShell to learn from.
And now we can use ConvertFrom-String to parse the file:
Importing a file as a string, creating a template and parsing the file using ConvertFrom-String (Image Credit: Russell Smith)
Again, as you can see from the screenshot above, PowerShell hasn't been able to parse the HTML file correctly. The template needs more variation, so I'll add another data sequence example:
Now if I run the ConvertFrom-String cmdlet again, maybe the template provides enough variation for PowerShell to parse the HTML file correctly:
Adding variation to the template (Image Credit: Russell Smith)
In the screenshot above, you can see that the template still doesn't have enough variation to parse the file correctly. To solve the problem, I'll replace the data sequence for David Jones with the one for John Cameron in the template, because his office details contain a space, whereas the other data sequences don’t.
Finally, running the ConvertFrom-String cmdlet again should produce a complete result:
Modifying the template again (Image Credit: Russell Smith)
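
The whole HTML workflow can be sketched roughly as follows. The markup in $html is a hypothetical stand-in for data.html, not the original file: the table layout and the office values are assumptions, chosen so that one office name contains a space as described above. The variable names are also my own.

```powershell
# Hypothetical stand-in for the contents of data.html; with a real file this
# could be loaded as one string using: Get-Content .\data.html -Raw
$html = @'
<table>
<tr><td>Russell Smith</td><td>London</td></tr>
<tr><td>David Jones</td><td>Manchester</td></tr>
<tr><td>John Cameron</td><td>New York</td></tr>
</table>
'@

# Template with two example rows, one of which has a space in the office name
$template = @'
<table>
<tr><td>{Name*:Russell Smith}</td><td>{Office:London}</td></tr>
<tr><td>{Name*:John Cameron}</td><td>{Office:New York}</td></tr>
</table>
'@

# Parse the HTML text using the examples in the template
$staff = $html | ConvertFrom-String -TemplateContent $template
$staff | Format-Table Name, Office
```

The examples are chosen to cover the variation in the data (here, an office name with a space), which is the same adjustment made to the template in the steps above.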
As in the example using the ping command, once the text has been parsed correctly, the results can be processed using PowerShell:
Processing the results of ConvertFrom-String (Image Credit: Russell Smith)