I have a SortedDictionary<string, string>, ordered by key length descending, of the form:
- red fox – address1
- weasel – address2
- foxes – address3
- fox – address3
etc.
and a list of phrases e.g.
"today the red fox is home"
"no foxes today"
etc
I need to find the value of the longest key in the dictionary that matches a substring in a given phrase.
Given that the dictionary has about one thousand entries, is there a better approach than a brute force search?
The examples above should return:
"red fox" – address1
"foxes" – address3
I have the brute force solution of iterating over the Keys, but in view of the number of keys and the need to perform the search in tens of phrases, I'm looking for a cleverer approach.
-1
I did some performance measuring.
With a dictionary of 100 keys, searching for matches in 100 phrases took on the average 65 ms.
If the execution time scales linearly with the dictionary length and with the number of phrases, then the whole process will take about 20 seconds (search time only).
Taking into account the time needed to read the phrases from files and to make use of the matches, it will take probably 3-4 minutes for the whole run.
The user will have to look at some animation for this time, or she could drink a glass of something.
Would this be acceptable?
If yes, the brute force approach is OK.
If not, I am still open to suggestions.
Best Answer
Judging from your requirement the brute-force approach is really not a best way to do it. A much more efficient approach would be to use a "Trie" (prefix tree) type. Using "Trie" can help you efficiently search for the longest matching key in a given phrase. Just create a "Tire" structure, than insert each key from your "SortedDictionary" into the "Trie". Each node in the "Trie" structure will represent a character of a key. At the end of each key, store the corresponding value (address). For the search for longest match you can implement like: For each phrase, iterate through each substring and search for it in the "Trie" structure. Than just track the longest match found for each phrase. I believe it what you need. Take a look here: A Trie implementation in C#, Using Trie Class for Efficient Text Pattern Searching in C#
I just put some code below. I think in general it does what you need: