Thursday, February 19, 2009

HTML parser Java library, Jericho

For an R&D project I needed something to help scrape our existing web pages, specifically pages with forms. I wanted a Java library, preferably, so it would plug easily into ColdFusion. After trying a few (jTidy, Cobra and two HtmlParsers) I stumbled on Jericho: http://sourceforge.net/projects/jerichohtml/ It did everything I needed and then some. It did a great job parsing HTML and getting the values I was after. The project's Javadocs helped a lot.

There is only one .jar file (as of this writing, jericho-html-2.6.jar), so put it somewhere on the CLASSPATH. To check that you put it in the right place, open the ColdFusion 8 Administrator, look in Settings Summary and see if the jar is listed under Java Class Path.
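If you would rather check from code than from the Administrator, a minimal Java sketch like the one below tries to load a class by name and reports whether it was found. The class name `au.id.jericho.lib.html.Source` is my assumption for the Jericho 2.x package layout, so verify it against the Javadocs for your version. `ClasspathCheck` and `isOnClasspath` are made-up names for illustration. Note that this only tells you about the classpath of the JVM you run it in, which is not necessarily the one ColdFusion uses.

```java
// Hypothetical helper (not part of Jericho or ColdFusion): checks whether
// a class can be loaded from the current classpath.
class ClasspathCheck {

    // Assumed entry class for Jericho 2.x; confirm against your jar's Javadocs.
    static final String ASSUMED_JERICHO_CLASS = "au.id.jericho.lib.html.Source";

    // Returns true if the named class can be located, false otherwise.
    // The class is located but not initialized (second argument is false).
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Use the first argument as the class name, else the assumed default.
        String name = args.length > 0 ? args[0] : ASSUMED_JERICHO_CLASS;
        System.out.println(name + (isOnClasspath(name) ? " found" : " NOT found"));
    }
}
```

Run it with the same classpath you expect the jar to be on, and pass a different class name as the first argument if the assumed one does not match your version.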

I didn't go as far as writing a full-on wrapper for it. I was after form fields, so here is the code that gets what I need.


Here is the code for parseFormValues(), where you can see some of the API Jericho provides in action. As I looked through this, I noticed that where I collect lists of values with listAppend(), any value that contains a comma would create a problem, because the comma is also the default list delimiter.
So keep that in mind if you plan to use it.
I am sure there are other improvements that can be made, since this is a first pass.
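To show the comma problem and one way around it, here is a small sketch in plain Java (not the CFML from this post, and not using Jericho). It keeps each field's values in a real list instead of a comma-delimited string, so a value such as "Portland, OR" stays in one piece. `FormValueCollector` and its methods are made-up names for illustration. In CFML the equivalent idea would be to collect into an array with arrayAppend() rather than a list.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical collector: field name -> list of values, in insertion order.
class FormValueCollector {

    private final Map<String, List<String>> values = new LinkedHashMap<>();

    // Adds one value for a field; repeated fields (e.g. checkboxes) accumulate.
    void add(String name, String value) {
        values.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // Returns the values for a field, or an empty list if the field is unknown.
    List<String> get(String name) {
        return values.getOrDefault(name, new ArrayList<>());
    }

    public static void main(String[] args) {
        FormValueCollector c = new FormValueCollector();
        c.add("city", "Portland, OR");
        c.add("city", "Austin, TX");

        // Stored as a list: still two values.
        System.out.println(c.get("city").size()); // prints 2

        // Joined with commas and split again: the two values become four pieces.
        String joined = String.join(",", c.get("city"));
        System.out.println(joined.split(",").length); // prints 4
    }
}
```

Another option is to keep the delimited string but pick a delimiter that cannot appear in the data. A real list or array avoids having to guess what that delimiter is.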

Note on "this" scope usage: the component this code is in extends BaseComponent (thank you, Hal Helms)
with generic (can you say "lazy" :-) ) set and get implemented with onMissingMethod. You have to use "this" for it to work inside the component.
I accidentally found later in the project that using this (ooh, cool pun) technique is slower than actually creating a setter and a getter, which kind of makes sense.

2 comments:

Dad-I-O said...

When I try to run the code, I get the following error message:
Object Instantiation Exception. An exception occurred when instantiating a Java object. The class must not be an interface or an abstract class. Error: ''.

If you have any time or thoughts it would be greatly appreciated... till then I will keep trying.

Dad-I-O said...

Figured out my issue... I unzipped the Jericho zip to my \ColdFusion8\lib folder.. and then moved the JAR file out of the extracted folder and put it in \lib folder. After I deleted the extracted folder and just left the JAR file, then restarted the service and this got it working correctly.