According to Google, the search feature follows a pre-determined process : a query is initiated, the web is navigated by following links from one page to another (an operation termed web crawling), applicable pages are sorted using a set of criteria and indexes, and the most suitable results delivered which are relevant to the search. However, the search feature also allows for a suggestion box which gives a ‘did you mean’ option. These are usually suggested search options, allowing you to broaden your search field, or different spelling options either within context, or for an individual word.
Algorithms that could form part of the implementation of the ‘did you mean’ feature
As a spellchecker, the algorithm does not only allow for the correction of incorrectly spelled words, but for misspelled words as well. According to Wikipedia, there are an inordinately large number of words that are commonly misspelt in English, even by English first language speakers.
Words are, generally speaking, misspelt in a search due to a number of factors, and the algorithm gives corrective options according to certain criteria. There are a number of different types of spelling mistakes and the ways in which the algorithm finds the correct version, varies.
There are, however, two ways in which spelling is commonly checked by an algorithm. It would either choose the ‘nearest’ version of the spelling, or where there are two spelling options, use the most popular version.
This is where an algorithm might suggest substituting one word, within context, for another which it believes is similar.
Because people often spell certain keywords wrongly, or use the incorrect form of a word, a typical algorithm has a feature to search through these fuzzy words in order to find the correct variation and spelling. In order to do so, a fuzzy word K-gram index is a great help in calculated keyword similarity. Keywords that have a Jaccard coefficient that happens to be less than a predetermined threshold value are automatically eliminated, while the search results are then ranked according to a weight ranking function.
To illustrate: The hoarder ran – suggest hoard
Columbia George University College has a video that gives a good introduction to K-maps
The Levenshtein Distance
The Levenshtein distance is also referred to as the Edit Distance. It calculates the least number of edits an algorithm would need to do in order to correct a word, taking into consideration the distance between keys.
To illustrate, the misspelled word ‘keus’. The keys nearest the incorrect ‘u’ on a QWERTY keyboard, are 7,8,I,j,h, and y. The obvious choice as a replacement for the incorrect ‘u’ would, in this instance, be ‘y’, i.e. ‘keus’ = ‘keys’.
As explained in a Google blog ‘Now let’s look at the problem of enumerating the possible corrections c of a given word w. It is common to talk of the edit distance between two words: the number of edits it would take to turn one into the other. An edit can be a deletion (remove one letter), a transposition (swap adjacent letters), an alteration (change one letter to another) or an insertion (add a letter).
The literature on spelling correction claims that 80 to 95% of spelling errors are an edit distance of 1 from the target.’
The Weighted Edit Distance
Also known as the Damerau–Levenshtein distance, it takes into consideration that there are a number of letters that are more commonly mistyped in the English language than others. For example, ‘o’ is often mistyped as ‘u’. In this case, the algorithm works out which letters are most commonly mistyped and gives a suggestion as to the correct way of spelling the word, i.e. one may type ‘yuu’, and the algorithm will compute that, although the word is spelt incorrectly, the first ‘u’ would most probably be ‘o’, (given the law of probability) and give the correct spelling as ‘you’.
In this instance, unlike the Edit Distance algorithm, the correct letters are not directly next to the incorrectly typed key, but may be a few keys distant.
Richard Minerich gives a simple explanation of how this algorithm can be implemented.
- A word is either corrected individually, or the correct form or possibilities thereof, given (isolated term).
- A word is corrected in the given context of the search.
If, for example, a typing error has letters that are incorrectly placed, the algorithm would try to find the correct form of the word within the given context: He ran away form home – He ran away from home.
In effect, the algorithm is programmed to find words in different languages that are misspelt by millions of users across the globe. It collects the data, sifts it, and makes educated guesses based on what it deems the norm. This is a process which is commonly known as statistical machine learning. Google’s translation ability makes this process possible.
Where a word is content sensitive, it suggests the correct version: The man and their vehicles > it would suggest man-men and vehicles-vehicle.
As Google’s algorithm crawls through the web, it is able to sift through the most common sentences in which words are used, extracting the most probable, correct version of the word in context.
Peter Norvig, in his article How to Write a Spelling Corrector, discusses the technical details pertaining to probability.
Hidden Markov Model
Simply put, the Hidden Markov Model is based on what the user types, i.e. that which is ‘seen’, and computes it into what is ‘hidden’, in other words ‘meant’. For this algorithm to succeed, the user would have to type the correct key at least 60% of the time.
An example: Dustingus -dustings
Hzppy – happy, hoppy, hippy
Gogle – goggle, google, goggles, googled, googled
The algorithm, in effect, also works on probability, giving a variation or variations of the probable meaning.
Suggestion removal feature
Besides giving suggestion, the ‘did you mean…’ feature also gives the option of removing certain suggestions. This is typically when you have searched a related topic which the algorithm automatically stores in your unique search history. When typing in a relevant search, it would include the previous, related search, but allow the option of removing it from the subsequent search.
Google’s algorithm is able to extract, from the web, the most common sentences in which are used, enabling it to suggest the most probable, correct version of a word in context.
According to Google, ‘Research at Google is at the forefront of innovation in Machine Intelligence, with active research exploring virtually all aspects of machine learning, including deep learning and more classical algorithms. Exploring theory as well as application, much of our work on language, speech, translation, visual processing, ranking and prediction relies on Machine Intelligence.’
The Google ‘did you mean…” algorithm, can thus, be likened to artificial intelligence, according to Bloomberg, as ‘software packages grow increasingly “smarter,” with more predictive capabilities, these products are changing the industry landscape’. Bloomberg continues by saying that ‘new frameworks to examine both structured and unstructured data and new systems that can digest large quantities of data are accelerating the development of artificial intelligence and its commercial use across all industries’.
Although many have speculated about Google’s ‘did you mean…’ algorithm – and there is much information available on the web with regards to this – it is commonly believed that it uses a combination of pre-designed features with new unique programs which are constantly added. It could, in effect, be seen as a hybridized algorithm which is constantly changing, evolving, and expanding.