Best Way to Parse an HTML Page

I would like to write some code to parse a set of HTML pages from the internet in order to gather information from each web page.

All of the web pages are generated from a template, so their format is consistent and the information I want to gather is always located in the same logical place within each page.

What is the best way to parse an HTML page in order to gather information from a specific place?

Can XPath be used here? Does anyone have any examples of parsing HTML content?

 

Answers

First, fetch the page with %Net.HttpRequest, convert it to well-formed XHTML using html-tidy, and then use %XML.Reader and %XML.Node to get at the elements, attributes, and content. This can be written in a generic way so that it can be reused for any website. I would also recommend the JOB command to run pieces of this code in parallel and improve performance.

http://docs.intersystems.com/latest/csp/docbook/DocBook.UI.Page.cls?KEY=... 

http://www.html-tidy.org/

http://docs.intersystems.com/latest/csp/docbook/%25CSP.Documatic.cls?PAG...

http://docs.intersystems.com/latest/csp/docbook/%25CSP.Documatic.cls?PAG...
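
On the XPath part of the question: the steps above use %XML.Reader, but once the page is well-formed XHTML, %XML.XPATH.Document is another option. What follows is only a minimal sketch under assumptions. The method name, its arguments, and the XPath expression are made up for illustration, and the html-tidy pass is not shown, so the response body is assumed to already be well-formed XML.

```objectscript
/// Hypothetical helper, not part of any shipped class.
/// Assumes the response body is already well-formed XHTML
/// (for example after a separate html-tidy pass, not shown here).
ClassMethod PrintListLinks(server As %String, path As %String) As %Status
{
    // Fetch the page
    set req=##class(%Net.HttpRequest).%New()
    set req.Server=server
    set tSC=req.Get(path)
    quit:$$$ISERR(tSC) tSC
    quit:req.HttpResponse.StatusCode'=200 $$$ERROR($$$GeneralError,"HTTP status "_req.HttpResponse.StatusCode)

    // Parse the response body as XML
    set tSC=##class(%XML.XPATH.Document).CreateFromStream(req.HttpResponse.Data,.doc)
    quit:$$$ISERR(tSC) tSC

    // local-name() sidesteps the default namespace that XHTML documents declare
    set context="//*[local-name()='li']/*[local-name()='a']"
    set tSC=doc.EvaluateExpression(context,"string(@href)",.results)
    quit:$$$ISERR(tSC) tSC

    // Print every result that carries a plain value
    for i=1:1:results.Count() {
        set result=results.GetAt(i)
        if $classname(result)="%XML.XPATH.ValueResult" { write result.Value,! }
    }
    quit $$$OK
}
```

Because the pages come from one template, a single context expression like this should be reusable across all of them; only the server and path change per page.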

 

Ensemble has a nice utility for this - see the class reference for Ens.Util.HTML.Parser.

I was able to use this class (Ens.Util.HTML.Parser) to successfully parse disease information from the CDC's website. Basically, I created a persistent class to store disease names along with the source URLs from the CDC's A-Z web pages.

The template looks like this:

<div,class=span16><ul>+<li><a,class=noLinking,href={pageurl}>{pagetitle}</a></li>+</ul>

and the Class method that actually does the parsing of all of the pages:

ClassMethod getDiseasesOrCondition(Output tCount As %Integer) As %Status
{
    set tSC=$$$OK
    set tCount=0
    set template="<div,class=span16><ul>+<li><a,class=noLinking,href={pageurl}>{pagetitle}</a></li>+</ul>"
    // One index page per letter: a.html through z.html
    for alpha=1:1:26 {
        kill tOut
        set url="http://www.cdc.gov/DiseasesConditions/az/"_$c(96+alpha)_".html"
        do ##class(Ens.Util.HTML.Parser).test(url, template, .tOut)
        for i=1:1 {
            quit:'$d(tOut("pageurl",i))
            // Keep only links that point at the CDC site
            if tOut("pageurl",i)?1"http://www.cdc.gov/".e {
                set iCDC=##class(iCDC.DiseaseOrCondition).%New()
                set iCDC.title=tOut("pagetitle",i)
                set iCDC.sourceUrl=tOut("pageurl",i)
                set tSC=iCDC.%Save()
                // Stop at the first failed save instead of silently ignoring it
                quit:$$$ISERR(tSC)
                set tCount=$i(tCount)
            }
        }
        quit:$$$ISERR(tSC)
    }
    quit tSC
}

A little checking was needed to verify that the source URLs returned actually pointed to the CDC's website, but other than that, Ens.Util.HTML.Parser worked exactly as I had hoped it would.

Very clean and straightforward for my needs.

Continuing my testing with Ens.Util.HTML.Parser, I am trying to parse information contained within paragraphs.

For example:

<p>This is paragraph one</p>

<p>This is paragraph two with some <i>italics</i>.</p>

<p>This is paragraph three</p>

I want to parse the contents of each of these paragraphs, including the italics content contained within paragraph two.

I set up my template as:

+<p>{paragraph}</p>+

This mostly works, except that in the second paragraph the parser stops when it hits the <i> tag, even though it would eventually reach the </p> tag if it continued. So what ends up being parsed is:

This is paragraph one

This is paragraph two with some

This is paragraph three

What I am expecting is:

This is paragraph one

This is paragraph two with some <i>italics</i>

This is paragraph three

Does this seem to be a limitation of the parser?  Is there any way to get what I'm trying to get from this document using the parser?

 

Hello Kenneth,

Have you figured out how to get the embedded HTML (the "<i>italics</i>") to appear in your second result? I have a feeling that this parser can't return embedded HTML like that because it handles text and tags separately (and only expects to return text).

Please update if you have gotten past this with this or another approach (maybe the approach suggested by John H. below?)

Much appreciated,

Jean
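
One possible way around the inline-markup limitation discussed above, offered as a sketch and not as a feature of Ens.Util.HTML.Parser: because the pages are template-generated, a non-greedy regular expression over the raw HTML can return everything between <p> and </p>, embedded tags included. This swaps in %Regex.Matcher for the parser, and the method name is made up for illustration. It assumes paragraphs are not nested and that every <p> has an explicit closing tag.

```objectscript
/// Hypothetical helper: collects the inner HTML of each <p>...</p> pair
/// in html, in document order, as out(1)..out(n), and returns n.
ClassMethod ExtractParagraphs(html As %String, Output out) As %Integer
{
    kill out
    set count=0
    // (?s) lets . match newlines; .*? stops at the first closing </p>,
    // so inline tags such as <i> stay inside the captured group
    set matcher=##class(%Regex.Matcher).%New("(?s)<p\b[^>]*>(.*?)</p>",html)
    while matcher.Locate() {
        set out($i(count))=matcher.Group(1)
    }
    quit count
}
```

Run against the three sample paragraphs, the second entry should keep its markup ("This is paragraph two with some <i>italics</i>."). A regular expression is brittle on arbitrary HTML, so this is only reasonable when the template guarantees the paragraph structure.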