C# – How to Find the Longest Dictionary Key Match in a Phrase for Optimized String-Matching

c++dictionarystring-matching

I have a SortedDictionary<string, string>, ordered by key length descending, of the form:

  • red fox – address1
  • weasel – address2
  • foxes – address3
  • fox – address3
    etc.
    and a list of phrases e.g.
    "today the red fox is home"
    "no foxes today"
    etc

I need to find the value of the longest key in the dictionary that matches a substring in a given phrase.
Given that the dictionary has about one thousand entries, is there a better approach than a brute force search?

The examples above should return:
"red fox" – address1
"foxes" – address3
I have the brute force solution of iterating over the Keys, but in view of the number of keys and the need to perform the search in tens of phrases, I'm looking for a cleverer approach.
-1

I did some performance measuring.
With a dictionary of 100 keys, searching for matches in 100 phrases took on the average 65 ms.
If the execution time scales linearly with the dictionary length and with the number of phrases, then the whole process will take about 20 seconds (search time only).
Taking into account the time needed to read the phrases from files and to make use of the matches, it will take probably 3-4 minutes for the whole run.
The user will have to look at some animation for this time, or she could drink a glass of something.
Would this be acceptable?
If yes, the brute force approach is OK.
If not, I am still open to suggestions.

Best Answer

Judging from your requirement the brute-force approach is really not a best way to do it. A much more efficient approach would be to use a "Trie" (prefix tree) type. Using "Trie" can help you efficiently search for the longest matching key in a given phrase. Just create a "Tire" structure, than insert each key from your "SortedDictionary" into the "Trie". Each node in the "Trie" structure will represent a character of a key. At the end of each key, store the corresponding value (address). For the search for longest match you can implement like: For each phrase, iterate through each substring and search for it in the "Trie" structure. Than just track the longest match found for each phrase. I believe it what you need. Take a look here: A Trie implementation in C#, Using Trie Class for Efficient Text Pattern Searching in C#

I just put some code below. I think in general it does what you need:

public sealed class Trie
{
    private readonly TrieNode _root = new();

    public void Insert(string key, string address)
    {
        var node = _root;
        foreach (var ch in key)
        {
            if (!node.Children.ContainsKey(ch))
            {
                node.Children[ch] = new TrieNode();
            }

            node = node.Children[ch];
        }

        node.Address = address;
    }

    public string SearchLongestMatch(string phrase, out string address)
    {
        address = null;
        var maxLength = 0;
        TrieNode longestMatchNode = null;

        for (var i = 0; i < phrase.Length; i++)
        {
            var node = _root;
            var length = 0;
            for (var j = i; j < phrase.Length; j++)
            {
                var ch = phrase[j];
                if (node.Children.TryGetValue(ch, out var child))
                {
                    node = child;
                    length++;
                    if (length > maxLength)
                    {
                        maxLength = length;
                        longestMatchNode = node;
                    }
                }
                else
                {
                    break;
                }
            }
        }

        if (longestMatchNode != null)
        {
            address = longestMatchNode.Address;
            var index = phrase.IndexOf(longestMatchNode.Address, StringComparison.Ordinal);
            if (index != -1)
            {
                return phrase.Substring(index, maxLength);
            }
        }

        return null;
    }
}
public class TrieNode
{
    public Dictionary<char, TrieNode> Children { get; set; } = new();
    public string Address { get; set; }
}

public sealed class DescendingComparer<T> : IComparer<T> where T : IComparable<T>
{
    public int Compare(T? x, T? y)
    {
        return y.CompareTo(x);
    }
}
public class Program
{
    public static void Main()
    {
        var dict = new SortedDictionary<string, string>(new DescendingComparer<string>())
            { { "red fox", "address1" }, { "weasel", "address2" }, { "foxes", "address3" }, { "fox", "address4" } };
        var trie = new Trie();
        foreach (var kvp in dict)
        {
            trie.Insert(kvp.Key, kvp.Value);
        }

        var phrases = new List<string> { "today the red fox is home", "no foxes today" };
        foreach (var phrase in phrases)
        {
            var match = trie.SearchLongestMatch(phrase, out var address);
            Console.WriteLine($"Phrase: \"{phrase}\" - Match: \"{match}\" - Address: \"{address}\"");
        }
    }
}