Objective
I want to combine the four columns of a row into one value, named COMBINED_VALUE
.
These strict rules apply:
- Two rows with the same column values in the same order must produce the same
COMBINED_VALUE
- No two different rows may produce the same
COMBINED_VALUE
- Order of columns makes a difference, trailing spaces make a difference, invisible characters make a difference, any difference makes a difference
- All columns are nullable, one columns is defined as
INT
, three columns are defined asNVARCHAR(100)
Notes
Among the first things I tried was BINARY_CHECKSUM
. It fails, resulting in the same value for different rows:
SELECT
a, b,
binary_checksum_failing = BINARY_CHECKSUM(a,b)
FROM
(VALUES ('i01', CONVERT(DECIMAL(9,2), 100)),
('i01', CONVERT(DECIMAL(9,2), 10)),
('i01', CONVERT(DECIMAL(9,2), 1)) )v(a, b)
HASHBYTES
with 'SHA2_256' works better, but it needs one well prepared COMBINED_VALUE
as input. If that is ensured, it has a chance to produce "very" unique hashes. I found that concatenating the columns should work but NULLs need to be replaced and columns need to be separated.
Find below two pieces of code showing my current progress and further notes. The first part shows a solution based on conversion to NVARCHAR
, the second part uses VARBINARY
. The example rows show edge cases and try to provoke failure.
Questions
- Can you make the presented scripts fail? Can you create two different rows, that produce the same
COMBINED_VALUE
while passing validation? - Is there a better way to get
COMBINED_VALUE
? Is there a faster way? What is best practise in general?
Script #1:
DROP TABLE IF EXISTS #t
CREATE TABLE #t
(
id INT IDENTITY(1,1)
,DESCRIPTION NVARCHAR(100)
,number INT
,word1 NVARCHAR(100)
,word2 NVARCHAR(100)
,word3 NVARCHAR(100)
)
INSERT INTO #t(DESCRIPTION, number, word1, word2, word3)
VALUES
('base row' , 1, 'aa', 'aa', 'aa')
,('trailing space in different places', 1, 'aa', 'aa', 'aa ')
,('trailing space in different places', 1, 'aa', 'aa ', 'aa')
,('NULLs in different places' , 1, NULL, 'aa', 'aa')
,('NULLs in different places' , 1, 'aa', NULL, 'aa')
,('different rows, screwery with NULL', 1, 'aa', 'aa', NULL)
,('different rows, screwery with NULL', 1, 'aa', 'aa', '*NULL*')
,('different rows, screwery with sep.', 1, 'aa' , 'aa/SEP\aa', 'aa')
,('different rows, screwery with sep.', 1, 'aa/SEP\aa', 'aa', 'aa')
,('longer values'
, 1111111111, 'abcdefghijklmn', 'abcdefghijklmn', 'abcdefghijklmn')
DECLARE
@NULL_replacement NVARCHAR(10) = '*NULL*'
,@separator NVARCHAR(10) = '/SEP\'
;WITH cte AS
(
SELECT
id, DESCRIPTION, number,
word1, word2, word3,
COMBINED_VALUE,
valid_f
FROM
#t
OUTER APPLY
(SELECT valid_f = IIF(COUNT(*) = 0, 1, 0)
FROM (VALUES (word1)
,(word2)
,(word3) )v(v)
JOIN (VALUES (@NULL_replacement)
,(@separator ) )w(w)
ON v LIKE '%' + w + '%' COLLATE Czech_BIN2
)validator
OUTER APPLY
(SELECT COMBINED_VALUE = CONCAT_WS(
@separator
,COALESCE(CONVERT(NVARCHAR(100), number), @NULL_replacement)
,COALESCE(CONVERT(NVARCHAR(100), word1 ), @NULL_replacement)
,COALESCE(CONVERT(NVARCHAR(100), word2 ), @NULL_replacement)
,COALESCE(CONVERT(NVARCHAR(100), word3 ), @NULL_replacement))
)combiner
)
SELECT
id
,DESCRIPTION
,number
,word1_encapsulated = '>' + word1 + '<'
,word2_encapsulated = '>' + word2 + '<'
,word3_encapsulated = '>' + word3 + '<'
,COMBINED_VALUE_encapsulated = '>' + COMBINED_VALUE + '<'
,valid_f
,repeats
,COMBINED_VALUE_dlen = DATALENGTH(COMBINED_VALUE)
FROM
cte o
OUTER APPLY
(SELECT repeats = COUNT(*)
FROM cte i
WHERE o.COMBINED_VALUE LIKE i.COMBINED_VALUE COLLATE Czech_BIN2
)repeatfinder
ORDER BY
id
I consider the example above proof that I need to replace NULLs and use a separator. See rows 4 and 5, if these columns get concatenated with CONCAT
, you get the same output. See the column DESCRIPTION
for more details.
NULL replacement and separator
- Prevent different rows forming the same
COMBINED_VALUE
- Must be disallowed as part of input. This must be verified, input must be sanitized
- Should very unlikely be part of input, should be "rare"
Script #2
After the above I came up with this mostly similar query that uses VARBINARY
to make the output smaller, particularly for numeric data. Null replacement and separator are "super rare". The ASCII codes for them are labeled "unused" in the ASCII table.
DECLARE
@separator_bin VARBINARY(5) =
CONVERT(VARBINARY(5), CHAR(129)+CHAR(141)+CHAR(143)+CHAR(144)+CHAR(157))
,@NULL_repl_bin VARBINARY(5) =
CONVERT(VARBINARY(5), CHAR(157)+CHAR(144)+CHAR(143)+CHAR(141)+CHAR(129))
DECLARE
@separator_vc VARCHAR(10) = CONVERT(VARCHAR(10), @separator_bin, 2)
,@NULL_repl_vc VARCHAR(10) = CONVERT(VARCHAR(10), @NULL_repl_bin, 2)
;WITH cte AS
(
SELECT
id, DESCRIPTION, number,
word1, word2, word3,
COMBINED_VALUE,
valid_f, word1_binary
FROM
#t
OUTER APPLY
(SELECT
number_binary = CONVERT(VARBINARY(8000), number)
,word1_binary = CONVERT(VARBINARY(8000), word1 )
,word2_binary = CONVERT(VARBINARY(8000), word2 )
,word3_binary = CONVERT(VARBINARY(8000), word3 )
)binaries
OUTER APPLY
(SELECT valid_f = IIF(COUNT(*) = 0, 1, 0)
FROM (VALUES (CONVERT(VARCHAR(8000), number_binary, 2))
,(CONVERT(VARCHAR(8000), word1_binary , 2))
,(CONVERT(VARCHAR(8000), word2_binary , 2))
,(CONVERT(VARCHAR(8000), word3_binary , 2)) )v(v)
JOIN (VALUES (@separator_vc)
,(@NULL_repl_vc) )w(w)
ON CHARINDEX(w, v COLLATE Latin1_General_BIN2) > 0
)validator
OUTER APPLY
(SELECT COMBINED_VALUE =
COALESCE(number_binary, @NULL_repl_bin)
+ @separator_bin + COALESCE(word1_binary , @NULL_repl_bin)
+ @separator_bin + COALESCE(word2_binary , @NULL_repl_bin)
+ @separator_bin + COALESCE(word3_binary , @NULL_repl_bin)
)combiner
)
SELECT
id
,DESCRIPTION
,number
,word1_encapsulated = '>' + word1 + '<'
,word2_encapsulated = '>' + word2 + '<'
,word3_encapsulated = '>' + word3 + '<'
,COMBINED_VALUE
,valid_f
,repeats
,COMBINED_VALUE_dlen = DATALENGTH(COMBINED_VALUE)
,word1_varchar = CONVERT(VARCHAR(8000), word1_binary, 2)
,separator_varchar = @separator_varchar
FROM
cte o
OUTER APPLY
(SELECT repeats = COUNT(*)
FROM cte i
WHERE o.COMBINED_VALUE = i.COMBINED_VALUE
)repeatfinder
ORDER BY
id
Best Answer
I could suggest a binary concatenation that includes a combination of following for each column:
IIF(col IS NULL THEN 0x01, 0x00)
,ISNULL(CAST(DATALENGTH(col) AS VARBINARY(MAX)), 0x)
, andISNULL(CAST(col AS VARBINARY(MAX)), 0x)
.The null indicator and data length parts can be selectively omitted for non-nullable and fixed-length column types, respectively.
This will unambiguously handle any combination of nullable and variable-length columns, and you don't need to worry about delimiters in your data.
The concatenated result can be fed into a good hash function like SHA-256. The chance of a SHA-256 collision is about 1 every 10 billion years, if you add a million unique rows to your database every femtosecond.
The above can be easily rearranged to use a CTE if desired.
Sample results:
See this db<>fiddle for a demo.