Solving String Similarity in SQL
Steps to solve string similarity in SQL.
Posted in Code on June 8, 2017 – 2 min read
Problem
The main focus today was to identify the similarity between two strings and derive the relevant records from a query.
To better illustrate this, let’s look at a sample table:
id string
mjZThqXnp7 QB1934
odGf0DZuUR JU9910
jemvgDUw6T UJ991
xkVGZbpG05 DF987
MtaNYMMgkV FL9111
jqZCZI0Fdn AM9114
xRNLSPEdAY AM914
yMDC1QNNMc P1N9
YAmabwCWla JU990
Tsoc1lqG6I Q1934B
If we are using JU991
as input, the expected output would be
id string
odGf0DZuUR JU9910
jemvgDUw6T UJ991
YAmabwCWla JU990
When you look at the results closely, you will identify the changes needed to change the input string to match the derived strings:
- Delete character(s) from the beginning/end
- Add characters(s) from the beginning/end
- Switch position for two characters
So, how do we create conditions that can catch the above changes?
Implementation: Condition 1 and 2
For conditions 1 and 2, we can do so by applying the formula:
abs(len(input) - len(record[i]))
to get the absolute value as ‘difference’ between two strings, and then
isSubstringOf(input, record[i])
to determine whether the record is a substring of the input.
Implementation: Condition 3
This is a bit more tricky, because there exist a pretty straightforward solution:
sort(input) == sort(record[i])
However, I was not allowed to do so due to language constraints and there was no available sorting function. Gladly, a hard-coded solution came up:
sum(
positionOf('a', record[i]) == positionOf('a', input)
positionOf('b', record[i]) == positionOf('b', input)
positionOf('c', record[i]) == positionOf('c', input)
...
positionOf('A', record[i]) == positionOf('A', input)
positionOf('B', record[i]) == positionOf('B', input)
positionOf('C', record[i]) == positionOf('C', input)
...
positionOf('0', record[i]) == positionOf('0', input)
positionOf('1', record[i]) == positionOf('1', input)
positionOf('2', record[i]) == positionOf('2', input)
...
)
It is to count the sum of existing alphanumeric characters of the input and record. The same count means they are permutations of each other.
Hope this post can help those out there solving similarity problems!