This is a C# tool that I want to share. It extracts a portion of an HTML string without cutting HTML tags in half or leaving tags unclosed. It also lets you measure the length of the extracted part in letters, words, sentences, closed HTML tags, closed P tags, closed DIV tags, or closed P or DIV tags. When counting letters or words, text inside the HTML tags themselves is not counted.

If you have content stored as HTML in a database, or you want to summarize an HTML page residing on a remote server, you need to extract a certain number of words or letters without counting the HTML tags themselves and without leaving unclosed tags. This tool is built for exactly that scenario.

The tool makes a single pass over the string, so it should perform better than similar tools based on regular expressions.
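
To make the single-pass idea concrete, here is a minimal sketch of how such a truncation could work. It is not the tool's actual source: the class name SinglePassSketch and the method TakeLetters are hypothetical, it only handles the letter count, and it assumes reasonably well-formed markup (a void tag written without a trailing slash, such as <br>, would be treated as open and closed again at the end).

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class SinglePassSketch
{
    // Hypothetical helper, not the tool's API: copies html until maxLetters
    // non-whitespace characters outside tags have been copied, then closes
    // any tags that are still open.
    public static string TakeLetters(string html, int maxLetters)
    {
        var output = new StringBuilder();
        var openTags = new Stack<string>();
        int letters = 0;
        int i = 0;

        while (i < html.Length && letters < maxLetters)
        {
            char c = html[i];
            if (c == '<')
            {
                int end = html.IndexOf('>', i);
                if (end < 0) break;                  // unterminated tag: stop before it
                string tag = html.Substring(i, end - i + 1);
                output.Append(tag);                  // tags are copied whole, never split
                TrackTag(tag, openTags);
                i = end + 1;
            }
            else
            {
                output.Append(c);
                if (!char.IsWhiteSpace(c)) letters++; // only text outside tags is counted
                i++;
            }
        }

        // Close whatever is still open, innermost first.
        while (openTags.Count > 0)
            output.Append("</").Append(openTags.Pop()).Append('>');

        return output.ToString();
    }

    private static void TrackTag(string tag, Stack<string> openTags)
    {
        // Ignore comments, doctype declarations and self-closing tags such as <br/>.
        if (tag.StartsWith("<!", StringComparison.Ordinal) ||
            tag.EndsWith("/>", StringComparison.Ordinal)) return;

        bool closing = tag.StartsWith("</", StringComparison.Ordinal);
        int start = closing ? 2 : 1;
        int len = 0;
        while (start + len < tag.Length && char.IsLetterOrDigit(tag[start + len])) len++;
        string name = tag.Substring(start, len).ToLowerInvariant();
        if (name.Length == 0) return;

        if (closing)
        {
            // Pop only when the closing tag matches the innermost open tag.
            if (openTags.Count > 0 && openTags.Peek() == name) openTags.Pop();
        }
        else
        {
            openTags.Push(name);
        }
    }
}
```

With this sketch, SinglePassSketch.TakeLetters("<p>Hello <b>big world</b></p>", 7) returns <p>Hello <b>bi</b></p>: the seven letters are counted outside the tags, and the B and P tags that were still open are closed at the end.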

The full source code is available on CodePlex under the MIT open source license.

Here is a sample use that takes the first 50 non-space letters located outside the HTML tags:

var summarizer = new Summarizer();
var summaryHtmlString = 
      summarizer.GetHtmlSummary(htmlString, 50, PartType.Letter);

Similarly, you can get the summary based on the number of words, sentences, closed HTML tags, closed P tags, closed DIV tags, or closed P or DIV tags. This is the PartType enumeration:

public enum PartType
{
    Letter = 1,
    // ... (the remaining values are cut off here)
}

You can also pass a delegate that determines where the summary starts. For example, if you want to take the beginning of a remote HTML page, you may want to start right after the opening <BODY> tag. In this case you can use it as:

var summarizer = new Summarizer();
var summaryHtmlString = 
            s => s.IndexOf(">", s.IndexOf("<body", StringComparison.CurrentCultureIgnoreCase) + 1) + 1, 
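
The start-position lambda can be tried on its own. The sketch below assumes the delegate has the shape Func<string, int> (full HTML in, start index out). That type, the class name StartIndexSketch and the field name AfterBodyTag are assumptions for illustration, not taken from the tool's source.

```csharp
using System;

public static class StartIndexSketch
{
    // Assumed delegate shape: receives the whole HTML string and returns
    // the index at which the summary should start.
    public static readonly Func<string, int> AfterBodyTag =
        s => s.IndexOf(">",
                 s.IndexOf("<body", StringComparison.CurrentCultureIgnoreCase) + 1) + 1;
}
```

For the input <body><p>Hi</p></body> the delegate returns the index just past the opening BODY tag, so taking the substring from that index gives <p>Hi</p></body>. If the page has no BODY tag, the inner IndexOf returns -1 and the expression falls back to the position right after the first ">" in the string, so you may want an explicit check in that case.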

In the source code you can also find a project with unit tests.

Posted on 9/7/2009 7:36:32 AM

Can you summarize the whole document?

Vladimir Bodurov
Posted on 9/7/2009 7:50:11 AM

This is not HTML, Gina.

If you want to transform PDF into HTML you can use this tool:
