SQL Server – How to Combine Table Columns into One Unique Value for HASHBYTES

Objective

I want to combine the four columns of a row into one value, named COMBINED_VALUE.

These strict rules apply:

Two rows with the same column values in the same order must produce the same COMBINED_VALUE
No two different rows may produce the same COMBINED_VALUE
Order of columns makes a difference, trailing spaces make a difference, invisible characters make a difference, any difference makes a difference
All columns are nullable, one columns is defined as INT, three columns are defined as NVARCHAR(100)

Notes

Among the first things I tried was BINARY_CHECKSUM. It fails, resulting in the same value for different rows:

SELECT
    a, b,
    binary_checksum_failing = BINARY_CHECKSUM(a,b)
FROM
    (VALUES ('i01', CONVERT(DECIMAL(9,2), 100)),
            ('i01', CONVERT(DECIMAL(9,2), 10)),
            ('i01', CONVERT(DECIMAL(9,2),  1)) )v(a, b)

HASHBYTES with 'SHA2_256' works better, but it needs one well prepared COMBINED_VALUE as input. If that is ensured, it has a chance to produce "very" unique hashes. I found that concatenating the columns should work but NULLs need to be replaced and columns need to be separated.

Find below two pieces of code showing my current progress and further notes. The first part shows a solution based on conversion to NVARCHAR, the second part uses VARBINARY. The example rows show edge cases and try to provoke failure.

Questions

Can you make the presented scripts fail? Can you create two different rows, that produce the same COMBINED_VALUE while passing validation?
Is there a better way to get COMBINED_VALUE? Is there a faster way? What is best practise in general?

Script #1:

DROP TABLE IF EXISTS #t

CREATE TABLE #t
(
    id           INT IDENTITY(1,1)
    ,DESCRIPTION NVARCHAR(100)
    ,number      INT
    ,word1       NVARCHAR(100)
    ,word2       NVARCHAR(100)
    ,word3       NVARCHAR(100)
)

INSERT INTO #t(DESCRIPTION, number, word1, word2, word3)
VALUES
 ('base row'                          , 1,        'aa',        'aa',     'aa') 
,('trailing space in different places', 1,        'aa',        'aa',    'aa ') 
,('trailing space in different places', 1,        'aa',       'aa ',     'aa') 
,('NULLs in different places'         , 1,       NULL,         'aa',     'aa') 
,('NULLs in different places'         , 1,        'aa',        NULL,     'aa') 
,('different rows, screwery with NULL', 1,        'aa',        'aa',     NULL) 
,('different rows, screwery with NULL', 1,        'aa',        'aa', '*NULL*') 
,('different rows, screwery with sep.', 1, 'aa'       , 'aa/SEP\aa',     'aa') 
,('different rows, screwery with sep.', 1, 'aa/SEP\aa',        'aa',     'aa') 
,('longer values'
  , 1111111111, 'abcdefghijklmn', 'abcdefghijklmn', 'abcdefghijklmn') 


DECLARE 
    @NULL_replacement NVARCHAR(10) = '*NULL*'
    ,@separator       NVARCHAR(10) = '/SEP\'

;WITH cte AS
(
    SELECT
        id, DESCRIPTION, number,
        word1, word2, word3,
        COMBINED_VALUE,
        valid_f
    FROM 
        #t
    OUTER APPLY
        (SELECT valid_f = IIF(COUNT(*) = 0, 1, 0)
         FROM (VALUES  (word1)
                      ,(word2)
                      ,(word3) )v(v)
         JOIN (VALUES  (@NULL_replacement)
                      ,(@separator       ) )w(w)
           ON v LIKE '%' + w + '%' COLLATE Czech_BIN2 
        )validator
    OUTER APPLY 
        (SELECT COMBINED_VALUE = CONCAT_WS(
              @separator
              ,COALESCE(CONVERT(NVARCHAR(100), number), @NULL_replacement) 
              ,COALESCE(CONVERT(NVARCHAR(100), word1 ), @NULL_replacement) 
              ,COALESCE(CONVERT(NVARCHAR(100), word2 ), @NULL_replacement) 
              ,COALESCE(CONVERT(NVARCHAR(100), word3 ), @NULL_replacement)) 
        )combiner
)
SELECT
    id
    ,DESCRIPTION
    ,number
    ,word1_encapsulated          = '>' + word1 + '<'
    ,word2_encapsulated          = '>' + word2 + '<'
    ,word3_encapsulated          = '>' + word3 + '<'
    ,COMBINED_VALUE_encapsulated = '>' + COMBINED_VALUE + '<'
    ,valid_f
    ,repeats
    ,COMBINED_VALUE_dlen = DATALENGTH(COMBINED_VALUE)
FROM
    cte o
OUTER APPLY
    (SELECT repeats = COUNT(*) 
     FROM cte i
     WHERE o.COMBINED_VALUE LIKE i.COMBINED_VALUE COLLATE Czech_BIN2 
    )repeatfinder
ORDER BY 
    id

I consider the example above proof that I need to replace NULLs and use a separator. See rows 4 and 5, if these columns get concatenated with CONCAT, you get the same output. See the column DESCRIPTION for more details.

NULL replacement and separator

Prevent different rows forming the same COMBINED_VALUE
Must be disallowed as part of input. This must be verified, input must be sanitized
Should very unlikely be part of input, should be "rare"

Script #2

After the above I came up with this mostly similar query that uses VARBINARY to make the output smaller, particularly for numeric data. Null replacement and separator are "super rare". The ASCII codes for them are labeled "unused" in the ASCII table.

DECLARE 
    @separator_bin VARBINARY(5) = 
      CONVERT(VARBINARY(5), CHAR(129)+CHAR(141)+CHAR(143)+CHAR(144)+CHAR(157))
    ,@NULL_repl_bin VARBINARY(5) = 
      CONVERT(VARBINARY(5), CHAR(157)+CHAR(144)+CHAR(143)+CHAR(141)+CHAR(129))

DECLARE 
    @separator_vc  VARCHAR(10) = CONVERT(VARCHAR(10), @separator_bin, 2)
    ,@NULL_repl_vc VARCHAR(10) = CONVERT(VARCHAR(10), @NULL_repl_bin, 2)

;WITH cte AS
(
    SELECT
        id, DESCRIPTION, number,
        word1, word2, word3,
        COMBINED_VALUE, 
        valid_f, word1_binary
    FROM 
        #t
    OUTER APPLY
        (SELECT 
             number_binary = CONVERT(VARBINARY(8000), number)
             ,word1_binary = CONVERT(VARBINARY(8000), word1 )
             ,word2_binary = CONVERT(VARBINARY(8000), word2 )
             ,word3_binary = CONVERT(VARBINARY(8000), word3 )
        )binaries
    OUTER APPLY
        (SELECT valid_f = IIF(COUNT(*) = 0, 1, 0)
         FROM (VALUES  (CONVERT(VARCHAR(8000), number_binary, 2))
                      ,(CONVERT(VARCHAR(8000), word1_binary , 2))
                      ,(CONVERT(VARCHAR(8000), word2_binary , 2))
                      ,(CONVERT(VARCHAR(8000), word3_binary , 2)) )v(v)
         JOIN (VALUES  (@separator_vc)
                      ,(@NULL_repl_vc) )w(w) 
           ON CHARINDEX(w, v COLLATE Latin1_General_BIN2) > 0
        )validator
    OUTER APPLY
        (SELECT COMBINED_VALUE = 
                            COALESCE(number_binary, @NULL_repl_bin) 
         + @separator_bin + COALESCE(word1_binary , @NULL_repl_bin) 
         + @separator_bin + COALESCE(word2_binary , @NULL_repl_bin) 
         + @separator_bin + COALESCE(word3_binary , @NULL_repl_bin) 
        )combiner
    )
    SELECT
        id
        ,DESCRIPTION
        ,number
        ,word1_encapsulated = '>' + word1 + '<'
        ,word2_encapsulated = '>' + word2 + '<'
        ,word3_encapsulated = '>' + word3 + '<'
        ,COMBINED_VALUE
        ,valid_f
        ,repeats
        ,COMBINED_VALUE_dlen = DATALENGTH(COMBINED_VALUE)
        ,word1_varchar       = CONVERT(VARCHAR(8000), word1_binary, 2)
        ,separator_varchar   = @separator_varchar
    FROM 
        cte o
    OUTER APPLY
        (SELECT repeats = COUNT(*) 
         FROM cte i
         WHERE o.COMBINED_VALUE = i.COMBINED_VALUE
        )repeatfinder
    ORDER BY 
        id

Best Answer

I could suggest a binary concatenation that includes a combination of following for each column:

A null indicator IIF(col IS NULL THEN 0x01, 0x00),
The data length ISNULL(CAST(DATALENGTH(col) AS VARBINARY(MAX)), 0x), and
The value itself ISNULL(CAST(col AS VARBINARY(MAX)), 0x).

The null indicator and data length parts can be selectively omitted for non-nullable and fixed-length column types, respectively.

This will unambiguously handle any combination of nullable and variable-length columns, and you don't need to worry about delimiters in your data.

The concatenated result can be fed into a good hash function like SHA-256. The chance of a SHA-256 collision is about 1 every 10 billion years, if you add a million unique rows to your database every femtosecond.

SELECT
    t.*,
    --C.CombinedValue,
    H.HashValue
FROM #t t
CROSS APPLY (
    SELECT 
        + IIF(number IS NULL, 0x01, 0x00)
          + ISNULL(CAST(DATALENGTH(number) AS VARBINARY(MAX)), 0x) -- Optional
          + ISNULL(CAST(number AS VARBINARY(MAX)), 0x)
        + IIF(word1 IS NULL, 0x01, 0x00)
          + ISNULL(CAST(DATALENGTH(word1) AS VARBINARY(MAX)), 0x)
          + ISNULL(CAST(word1 AS VARBINARY(MAX)), 0x)
        + IIF(word2 IS NULL, 0x01, 0x00)
          + ISNULL(CAST(DATALENGTH(word2) AS VARBINARY(MAX)), 0x)
          + ISNULL(CAST(word2 AS VARBINARY(MAX)), 0x)
        + IIF(word3 IS NULL, 0x01, 0x00)
          + ISNULL(CAST(DATALENGTH(word3) AS VARBINARY(MAX)), 0x)
          + ISNULL(CAST(word3 AS VARBINARY(MAX)), 0x)
        AS CombinedValue
) C
CROSS APPLY (
    SELECT HASHBYTES('SHA2_256',C.CombinedValue) AS HashValue
) H
ORDER BY T.id

The above can be easily rearranged to use a CTE if desired.

Sample results:

id	DESCRIPTION	number	word1	word2	word3	HashValue
1	base row	1	aa	aa	aa	0xFB9299EC58E4DAC1465BD3025517F735DC140BFE1CB8EE1F627C610514DF2A2D
2	trailing space in different places	1	aa	aa	aa	0x6E462C1EF5F7CD583F74F10594CFF454C6EAD556D42564153C1F9B558ECB3C3A
3	trailing space in different places	1	aa	aa	aa	0x862F08AC15AFEF147455038FF9F3AFFEB95826530BFE3DA2C87EBB8AFDAAA572
4	NULLs in different places	1	null	aa	aa	0xBDF6691C3F2207B854251CBC5A527E9A1DAFF7282F68843F97178E47A2A91900
5	NULLs in different places	1	aa	null	aa	0xB53C88CA610E62159C942AF39D61165ECDB12F52E0B65D62A126BB69257AD316
6	different rows, screwery with NULL	1	aa	aa	null	0x86D4A0F8808C4691E121001E199AD69160283768824BE86723AEE191FC5387A3
7	different rows, screwery with NULL	1	aa	aa	NULL	0x27CDC248AEDB67EA3FEE935FB28FFD6F8BCF144D0B177824DAEEA7D5F2C69D2F
8	different rows, screwery with sep.	1	aa	aa/SEP\aa	aa	0x3CD630032648215DD6AFACE7098AE0A3DCA13D8180490B99046195F1B84CAE84
9	different rows, screwery with sep.	1	aa/SEP\aa	aa	aa	0x507C17CE4F560775AE6A0973A5BB33A8E30C52160927C7D682914446AC7ADC1E
10	longer values	1111111111	abcdefghijklmn	abcdefghijklmn	abcdefghijklmn	0x04DD3801CEBEC074D512AFDA327CDB1F05089E53EE2CD93AC72BC260145A95DC

See this db<>fiddle for a demo.

Best Answer

Related Solutions

Related Question