Friday, July 26, 2013

SQL Server - How to remove HTML tags from data


1.     Background

The purpose of this article is to provide a way of cleaning up of HTML tags within the data. When we use various styles or tabular format data in UI using Rich Text Editor/ Rad Grid etc, it will save data in database with HTML tags.

·         Fig. 1 showing data with style using Rich text editor in UI
 

·         Fig 2 showing data saved in table corresponding to fig 1 with HTML tags
           
            

·         If we simply pull the data from the table for SSRS or excel then data will come with HTML tags unless we explicitly remove the HTML tags from the data.

2.     What is HTML Tags?

An HTML element starts with a start tag (<p>) and ends with end tag (<p/>) and everything between
Start tag and End tag is HTML element. e.g.
       <b>Following are the popular databases: <br />  </b>1. SQL Server <br />  2. Oracle <br />  3.    
      Teradata <br />  4. Sybase

3.     Script to remove HTML tags from the data?   

We are using SQL Server CHARINDEX function to get the positions of ‘<’ and ‘>’ and once found replacing the string between <….> with blank using STUFF function. We are using WHILE Loop that will run till presence of ‘<’ & ‘>’ in the string. Below is the UDF script that performs HTML Tags clean up from data.


/****** Object:  UserDefinedFunction [dbo].[usp_ClearHTMLTags]   ******/ 
    SET ANSI_NULLS ON 
    GO 
    SET QUOTED_IDENTIFIER ON 
    GO 
    /**************************************************************************** 
    Name of Author  :   Vishal Jharwade 
    Purpose         :   The Purpose of this function is to clean the html tags from the data. 
    ***************************************************************************************/ 
    CREATE FUNCTION [dbo].[usp_ClearHTMLTags] 
    (@String NVARCHAR(MAX)) 
     
    RETURNS NVARCHAR(MAX) 
    AS 
    BEGIN 
        DECLARE @Start INT, 
                @End INT, 
                @Length INT 
         
        WHILE CHARINDEX('<', @String) > 0 AND CHARINDEX('>', @String, CHARINDEX('<', @String)) > 0 
        BEGIN 
            SELECT  @Start  = CHARINDEX('<', @String),  
                    @End    = CHARINDEX('>', @String, CHARINDEX('<', @String)) 
            SELECT @Length = (@End - @Start) + 1 
             
            IF @Length > 0 
            BEGIN 
                SELECT @String = STUFF(@String, @Start, @Length, '') 
             END 
         END 
         
        RETURN @String 
    END 

4.     Conclusion

By using above UDF, We can clean the HTML tags from the data.

-----------------------------------End of Article----------------------------------------------