You are viewing an historical archive of past issues. Please report new issues to the appropriate project issue tracker on GitHub.

Home » Issues » Bug #1598

Bug #1598: Eliminate counter-intuitive results by using recursive "best match" strategy

Kind	bug
Product	Command-T
When	Created 2010-07-09T09:15:11Z, updated 2010-07-11T06:11:13Z
Status	closed
Reporter	Greg Hurrell
Tags	no tags

Description

The problem

Given search string "artcon", and files:

heartbeat_controller.rb
articles_controller.rb

The first file will be scored higher due to the location of the matched characters:

heartbeat_controller.rb
  ^^^     ^^^
articles_controller.rb
^^^ ^     ^^

While the "art" in articles_controller.rb scores very highly, the following c and on do not score all that well. Compare that with the art run in heartbeat_controller.rb, which scores acceptably, followed by the very-well scoring con.

In reality, if we use the full paths to the files the difference becomes even clearer:

app/controllers/heartbeat_controller.rb
^       ^           ^     ^^^
app/controllers/articles_controller.rb
^       ^         ^ ^     ^^

Looking at the entire path, heartbeat_controller.rb has a high-scoring run of art, whereas the second path only has on as a run, and it is not in a particularly high-scoring location.

This is counter-intuitive. A user typing "artcon" should have a reasonable expectation that articles_controller.rb should be among the highest scoring results.

The solution

The above scores are determined by the left-to-right nature of the search. The first-found character to the left always "wins" and alternatives are not considered.

If we could instead look for alternative matches, we might find higher scoring ones which would change the ordering of results.

In the cited examples, the highest scoring matches are probably the following:

app/controllers/articles_controller.rb
                ^^^      ^^^
         and

app/controllers/heartbeat_controller.rb
^                  ^^     ^^^
         or

app/controllers/heartbeat_controller.rb
                  ^^^     ^^^

And given files like these:

art.txt
app/models/article.rb
app/controllers/articles_controller.rb

The highest scoring matches for the search string "art" would be:

art.txt
^^^
app/models/article.rb
           ^^^
app/controllers/articles_controller.rb
                ^^^

ie. we preserve the existing behavior or giving more weight to words at the beginning of a string or immediately after special characters like path separators, and of giving more weight to matches which occupy a proportionally larger part of the overall string, while searching for multiple different ways of matching, and finally choosing the one that gives us the highest score.

At least my instincts suggest that this will produce the most intuitive results. The only case I am not sure about is this one:

Which of these should score highest for the search string "avid"?:

app/views/issues/demo.html.haml
^   ^     ^      ^
vendor/libraries/avid-setup.txt
                 ^^^^

ie. same overall string length, and all matching characters in "high-value" positions. I guess we fall back to alphabetical sorting in that case.

Implementation

The basic algorithm is the following:

Call match() function and only if the string matches, start looking for a better, alternative match, starting from the next character after the first matched character.

This is important because we don't want to engage in a letter-by-letter recursive search until we are sure that we have a matching string. (Imagine a folder with 10,000 files; for a string like "artcon" only a handful may match, so any time spent recursing over the non-matching files — the majority; is a waste.)

We look for this alternative match by recursively calling match(), but with a new starting index that gives us a substring.

The match() function doesn't just tell us if something matches or not; it returns a score between 0 and 1 (ranging from no match at all to perfect match).

In this way during recursion we can detect whether to accept the returned value (if it is better than our own), give up on recursion (because no match was possible) and/or keep our own result (if it is better).

Note that the existing implementation has Match and Matcher classes. The key methods are:

Matcher#sorted_matches_for
Matcher#matches_for
Match#matches?
Match#score

sorted_matches_for itself just calls matches_for, and then sorts the results by calling the Match#score method on each. It then returns the sorted results.

matches_for interrogates its associated Scanner object for its paths, then iterates over the paths calling Match#matches? on each in turn. Note that the actual determination of whether or not there is a match takes place as soon as we call Match#new.

So, really, to implement this new algorithm, we want to change the output of the score method. We want it to return the best possible score instead of just the first score found for the left-to-right scan.

In practice, it may make sense to alter the new method, so that it ends up calculating the best possible score as part of its natural initialization process. The score method then just becomes a dumb accessor instead of the method that calculates the score.

For the 10,000 file case, given a search string like "a" we might still have 4,000 matching files. We do not want to necessarily start recursing on so many strings. It is possible that most of the the strings will end up getting culled with the next letter in the search query ("r"), but if it turns out to be a problem we might make it that we only start performing this recursive searching when the search query exceeds a certain length (which may be as little as 3 or 4 characters).

Example

Search string: "art"
         Path: "articles_controller.rb"

* match "a":
    articles_controller.rb
    ^
* calculate score so far: let's say 0.1
  * recurse, trying to find better match for "a" in "rticles_controller.rb": fail, so return score 0
* recursion returned 0, so proceed to next letter in search string ("r")
* match "r":
    articles_controller.rb
    ^^
* calculate score so far: let's say 0.2
  * recurse, trying to find better match for "r" in "ticles_controller.rb"
  * match "r":
      articles_controller.rb
      ^            ^
  * calculate score so far (over entire string, not just substring): let's say 0.15
    * recurse, trying to find better match for "r" in "oller.rb"
    * match "r":
        articles_controller.rb
        ^                 ^
    * calculate score so far: 0.13
      * recurse, trying to find better match for "r" in ".rb"
      * match "r":
          articles_controller.rb
          ^                   ^
      * calculate score so far: 0.14
        * recurse, trying to find better match for "r" in "b": fail, so return score 0
      * recursion returned 0, so proceed to next letter in search string ("t")
      * match "t" and fail, so return 0
    * recursion returned 0, so proceed to next letter in search string ("t")
    * match "t" and fail, so return 0
  * recursion returned 0, so proceed to next letter in search string ("t")
  * match "t" and fail, so return 0
* recursion returned 0, so proceed to next letter in search string ("t")
* match "t":
    articles_controller.rb
    ^^^
* calculate score so far: 0.3
  * recurse, trying to find better match for "t" in "icles_controller.rb"
  * match "t":
      articles_controller.rb
      ^^          ^
  * calculate score so far: 0.25
    * recurse, trying to find better match for "t" in "roller.rb": fail, so return score 0
  * recursion returned 0, so proceed to next letter in search string ("c")
  * match "c" and fail, so return 0
* recursion returned 0, so proceed to next letter in search string ("c")
* match "c":
    articles_controller.rb
    ^^^ ^
* calculate score so far: 0.35
  * recurse, trying to find better match for "c" in "les_controller.rb"
  * match "c":
      articles_controller.rb
      ^^^      ^
  * calculate score so far: 0.4
    * recurse, trying to find better match for "c" in "ontroller.rb": fail, so return score 0
  * recursion returned 0, so proceed to next letter in search string ("o")
  * match "o":
      articles_controller.rb
      ^^^      ^^
  * calculate score so far: 0.5
    * recurse, trying to find better match for "o" in "ntroller.rb"
    * match "o":
        articles_controller.rb
        ^^^      ^    ^
    * calculate score so far: 0.45
      * recurse, trying to find better match for "o" in "ller.rb": fail, so return score 0
    * recursion returned 0, so proceed to next letter in search string ("n")
    * match "n" and fail, so return 0
  * recursion returned 0, so proceed to next letter in search string ("n")
  * match "n":
      articles_controller.rb
      ^^^      ^^^
  * calculate score so far: 0.6
    * recurse, trying to find better match for "n" in "troller.rb": fail, so return score 0
  * recursion returned 0, so proceed to next letter in search string (no letters left)
  * return score 0.6
* recursion return 0.6, record and continue with next letter in search string ("o")
* match "o":
    articles_controller.rb
    ^^^ ^     ^
* calculate score so far: 0.45
  * recurse, trying to find better match for "o" in "ntroller.rb"
  * match "o":
      articles_controller.rb
      ^^^ ^         ^
  * calculate score so far: 0.43
    * recurse, trying to find better match for "o" in "ller.rb": fail, so return score 0
  * recursion returned 0, so proceed to next letter in search string ("n")
  * match "n" and fail, so return 0
* recursion returned 0, so proceed to next letter in search string ("n")
* match "n":
    articles_controller.rb
    ^^^ ^     ^^
* calculate score so far: 0.5
  * recurse, trying to find better match for "n" in "troller.rb": fail, so return score 0
* recursion returned 0, so proceed to next letter in search string (no letters left)
* compare our score (0.5) with best score returned via recursion (0.6)
* return the winner (0.6)

As you can see, the recursive nature of the search increases the amount of computation that must be done exponentially. Will have to test to see if this is going to be prohibitively expensive or not. It may be necessary to look for optimization opportunities such as short-circuits, to reduce the amount of work done.

Comments

Greg Hurrell 2010-07-09T16:12:08Z
Ok, I've whipped up a quick test implementation of this and the results so far look good:
```
>> CommandT::Match.new('app/controllers/articles_controller.rb', 'artcon').score
=> 0.95
>> CommandT::Match.new('app/controllers/heartbeat_controller.rb', 'artcon').score
=> 0.809259259259259
>> CommandT::Match.new('app/controllers/heartbeat_controller.rb', 'art').score
=> 0.685185185185185
>> CommandT::Match.new('app/controllers/articles_controller.rb', 'art').score
=> 0.966666666666667
>> CommandT::Match.new('app/models/article.rb', 'art').score
=> 0.966666666666667
```
Still tweaking to do, and lots of testing, but it's at least a working demo of the "recurse to find best score" approach.

The score of the last two needs to be differentiated, I think, seeing as in the latter case you've entered a higher proportion of the string being searched, so it should be worth more I think.

Greg Hurrell 2010-07-09T16:16:42Z

Perhaps this:

>> CommandT::Match.new('app/controllers/articles_controller.rb', 'art').score
=> 0.0763157894736842
>> CommandT::Match.new('app/models/article.rb', 'art').score
=> 0.138095238095238
>> CommandT::Match.new('app/models/article.rb', 'a').score
=> 0.0476190476190476
>> CommandT::Match.new('app/controllers/articles_controller.rb', 'a').score
=> 0.0263157894736842
>> CommandT::Match.new('app/controllers/articles_controller.rb', 'artcon').score
=> 0.15
>> CommandT::Match.new('app/controllers/heartbeat_controller.rb', 'artcon').score
=> 0.124501424501424
>> CommandT::Match.new('app/controllers/articles_controller.rb', 'articlescontroller').score
=> 0.46578947368421
>> CommandT::Match.new('app/controllers/articles_controller.rb', 'aca').score
=> 0.0736842105263158
>> CommandT::Match.new('app/controllers/heartbeat_controller.rb', 'aca').score
=> 0.0505494505494506

Greg Hurrell 2010-07-11T06:11:06Z

Ok, this is pretty much done now. Performance degradation is noticeable on large trees (eg. my home directory; half a million files, but with match ceiling set to 10,000) but is still acceptable for smaller projects.

So optimization is definitely going to be required seeing as one of the selling points of the add-on is its performance, but, the quality of the ranking seems so much better in the hands-on testing that I've done that it seems like continuing down this path is an absolute no-brainer.

So going to close this ticket and put some optimization ideas in a separate one.
Greg Hurrell 2010-07-11T06:11:13Z
Status changed:
- From: new
- To: closed

Add a comment

Comments are now closed for this issue.

Bug #1598: Eliminate counter-intuitive results by using recursive "best match" strategy

Description

The problem

The solution

Implementation

Example

Comments

Add a comment

Menu