Fuzzy String Matching in SQL: Using Levenshtein Similarity in CData Virtuality

Forum|Forum|6 months ago
July 18, 2025
0 replies
27 views

+1

Marijan
Employee

Use case

Compare strings that look similar but aren’t exactly the same — such as customer names, addresses, or product titles — in SQL queries.

We are happy to announce that the development of built in functions to calculate string similarity using different algorithms is under way with the expected release by the end of summer 2025. In the meantime, here is an example on how to achieve this by creating your own virtual procedure.

Why Fuzzy Matching?

When integrating data from multiple sources or cleaning up inconsistent records, exact string matching (‘=’ or ‘LIKE’) often falls short. For example:

"John Smith" vs. "Jon Smith"
"Mary Johnson" vs. "Mary Jonson"

Despite being nearly identical, these will not match using standard SQL operators.

This is where Levenshtein similarity becomes powerful — it quantifies how "close" two strings are based on the minimum number of edits (insertions, deletions, substitutions) needed to convert one into the other.

Solution: Levenshtein Similarity as a Virtual Procedure

You can embed a Levenshtein algorithm in JavaScript and use it inside a CData Virtuality Virtual Procedure to return a similarity score between 0 and 1.

Procedure Definition

CREATE OR REPLACE VIRTUAL PROCEDURE views.similarity_levenshtein (

IN string1 string,

IN string2 string

)

RETURNS (

string1 string,

string2 string,

similarity double

)

AS

BEGIN

SELECT string1, string2, similarity

FROM OBJECTTABLE(

LANGUAGE 'javascript'

'function levenshtein(a, b) {

if (a === b) return 1.0;

if (!a || !b) return 0.0;

var m = a.length;

var n = b.length;

var dp = [];

// Initialize the dp array

for (var i = 0; i <= m; i++) {

dp[i] = [];

dp[i][0] = i;

}

for (var j = 1; j <= n; j++) {

dp[0][j] = j;

}

// Fill the dp table

for (var i = 1; i <= m; i++) {

for (var j = 1; j <= n; j++) {

if (a.charAt(i-1) === b.charAt(j-1)) {

dp[i][j] = dp[i-1][j-1];

} else {

dp[i][j] = Math.min(

dp[i-1][j] + 1, // deletion

dp[i][j-1] + 1, // insertion

dp[i-1][j-1] + 1 // substitution

);

}

var distance = dp[m][n];

var maxLen = Math.max(m, n);

var similarity = maxLen === 0 ? 1.0 : 1.0 - (distance / maxLen);

return similarity;

}

// Execute the function

var result = levenshtein(string1, string2);

// Set the dv_row for the output

dv_row = {

string1: string1,

string2: string2,

similarity: result

};'

PASSING string1 AS string1,

string2 AS string2

COLUMNS

string1 string 'dv_row.string1',

string2 string 'dv_row.string2',

similarity double 'dv_row.similarity'

) AS a;

END;;

How to use it in SQL joins

Let’s look at two examples for joining records based on string similarity:

Example 1: Joining Two Customer Tables

WITH

customers_a AS (

SELECT 1 as id, 'John Smith' as name

UNION ALL SELECT 2, 'Mary Johnson'

),

customers_b AS (

SELECT 101 as id, 'Jon Smith' as name

UNION ALL SELECT 102, 'Mary Jonson'

)

SELECT

a.id as id_a,

a.name as name_a,

b.id as id_b,

b.name as name_b,

sim.similarity

FROM

customers_a a, customers_b b

, LATERAL(

SELECT similarity FROM views.similarity_levenshtein(a.name, b.name)

) sim

WHERE

sim.similarity > 0.8;

Result: Returns likely matches such as "John Smith" and "Jon Smith" with a similarity score over 0.8.

Example 2: Using a Scalar Subquery (Alternative Syntax)

WITH

customers_a AS (

SELECT

1 AS id,

'John Smith' AS NAME

UNION ALL

SELECT

2,

'Mary Johnson'

),

customers_b AS (

SELECT

101 AS id,

'Jon Smith' AS NAME

UNION ALL

SELECT

102,

'Mary Jonson'

)

SELECT

a.id AS id_a,

a.name AS name_a,

b.id AS id_b,

b.name AS name_b

FROM

customers_a a,

customers_b b

WHERE

(

SELECT

similarity

FROM

VIEWS.similarity_levenshtein (a.name, b.name)

) > 0.8;

Note: This version works fine but doesn't expose the similarity value in the output — use lateral joins for better traceability.

When to Use This

Use Levenshtein similarity when:

Matching misspelled customer names, product names, addresses
Detecting duplicates across systems
Preprocessing fuzzy joins for human validation
Scoring "likelihood of match" for machine learning pipelines

Tips for Production Use

For performance, cache similarity results in intermediate tables if comparing large datasets.
Consider using a threshold like 0.85+ for high confidence matches.
Pair with soundex/metaphone phonetic algorithms for better results across dialects.

This topic has been closed for replies.

Use case

Why Fuzzy Matching?

Solution: Levenshtein Similarity as a Virtual Procedure

Procedure Definition

How to use it in SQL joins

Example 1: Joining Two Customer Tables

Example 2: Using a Scalar Subquery (Alternative Syntax)

When to Use This

Tips for Production Use

Sign up

CData Community

Scanning file for viruses.

This file cannot be downloaded