Saturday, August 17, 2013

T-SQL script to manage redundant data (Duplicate data) using CHECKSUM function in SQL Server

1.     Background

The purpose of this article is providing a way of managing duplicate string data in such a case when users will enter duplicate data as part of their job. Best example of such kind of case is giving feedback for a product or service. Whenever we give feedback for a product, many people may write “Product Quality is good”, for Service, some people can write for service- “Service is Awesome” or “Great Service” or “Great Help” etc. If we are product or service company and after delivering our product\service, if we send invitation for survey feedback to 1 million people then there is chance to get same answer(string) from many people. Storing the redundant data in tables is not a good practice and redundant data may confuse us, it may reduce performance and can create storage issue.

If we see the same case in Data Warehousing then we need a Feedback Dimension that will have FeedbackID and FeedbackText. Multiple FeedbackIDs for same Feedbacktext will do redundancy in dimension and may create confusion while loading Fact table.

We are going to manage this condition by using CHECKSUM function that returns the checksum value computed on data and a unique clustered index over that checksum value column to restrict redundancy.

2.     Steps to restrict redundancy using CHECKSUM function:

                                i.      Create the table in such a way that ID column should be computed based on Text column using CHECKSUM function. Below is the example showing a table named DimFeedback having FeedbackNote and FeecbackID column whose value computes based on FeedbackNote for the particular row using CHECKSUM FUNCTION :
     CREATE TABLE [dbo].[DimFeedBack (
                 [FeedBackID]  AS (CHECKSUM([FeedBackNote])),
                 [FeedBackNote] [NVARCHAR](MAX) NULL

                              ii.    Now create a unique non clustered index on FeecbackID column to maintain unique FeedbackIDs in the table. (Optional step)
 [dbo].[DimFeedBack] ([FeedBackID] ASC )

                            iii.      Finally insert the data in such a way that system first should check whether checksum value for the input string is avalable in table or not. IF not avalable then insert the data else do not insert the data.

            DECLARE @FeedBack NVARCHAR(500)
            SET @FeedBack = 'Thank you so much for your great service.'
            IF NOT EXIST (SELECT 1 FROM FeedBack (NOLOCK) WHERE FeedBackID                            = CHECKSUM(@FeedBack))
                  INSERT INTO FeedBack
                  SELECT @FeedBack
                  PRINT 'Feedback has been sucessfully submitted.'
                  PRINT 'Feedback string is already available in the                                        system.'



3.     Conclusion

          By using above steps, we can avoid data redundancy in the table using CHECKSUM() function 
          in SQL Server..

No comments:

Post a Comment