≡

wincent.dev

  • Products
  • Blog
  • Wiki
  • Issues
You are viewing an historical archive of past issues. Please report new issues to the appropriate project issue tracker on GitHub.
Home » Issues » Bug #1838

Bug #1838: Self-referential symlink causes Command-T to hit file limit before scanning all files in a directory

Kind bug
Product Command-T
When Created 2011-07-06T10:13:36Z, updated 2012-06-30T19:26:07Z
Status closed
Reporter anonymous
Tags no tags

Description

I find that the symlink private, below, seems to prevent Command-T working. It seems that Command-T scans the directories up to the symlink, and then something goes wrong, and the directories project and public don't get scanned. If I remove the symlink it works fine. This is on Mac OS/X Snow Leopard.

drwxrwxr-x   5 charles.woodcock  staff   170 Jun 29 14:15 bank/
drwxrwxr-x   3 charles.woodcock  staff   102 Jun 29 15:23 bin/
drwxr-xr-x   8 charles.woodcock  staff   272 Jul  5 12:06 lib/
lrwxr-xr-x   1 charles.woodcock  staff     1 Jul  6 12:06 private@ -> .
drwxrwxr-x   8 charles.woodcock  staff   272 Jun 29 14:04 project/
drwxrwxr-x   7 charles.woodcock  staff   238 Jul  5 16:35 public/

Comments

  1. Greg Hurrell 2011-07-07T07:32:36Z

    Thanks for the report. As you've probably guessed, the symlink back to . creates a circular reference, which is effectively an "infinite loop", causing Command-T to hit the file limit (10,000 by default) and then stop scanning, meaning that it never gets to your project and public directories.

    In other words, it will wind up scanning:

    • bank/*, bin/* and lib/*; then:
    • bank/private/*; then:
    • bank/private/private/*; then:
    • bank/private/private/private/*; and so on, until it hits the limit.

    I can see two or three workarounds here:

    One is to teach Command-T about symlinks (currently, it only knows about paths) and have it not traverse into them at all. This is pretty much a no-starter idea as there are legitimate uses for symlinks in project organization.

    Another: again teach it about symlinks, and have it explicitly check for them and see if it can detect circular references. Not sure if this will slow down the directory scanning much, but I do fear it will only work properly on UNIX-y platforms, seeing as Windows is symlink-challenged, as far as I know.

    The other is to to replace what is effectively a depth-first traversal of the directory tree with a breadth-first one. It's written as a depth-first traversal because that was the most natural way to do so (see the implementation in ruby/command-t/scanner/file_scanner.rb, where we use Ruby's Dir.foreach method, which naturally lends itself to depth-first traversal).

    In order to switch to breadth-first traversal we'd need a FIFO queue. Note that this would not make the circular reference go away, so the traversal would still eventually hit the file limit, but at least it would increase the likelihood of exploring some of the other directories before hitting the limit. (Still, depending on the number of files in each directory, it is possible that the circular reference could prevent full scans of the non-cyclic parts of the graph from completing.) As this is not necessarily a full mitigation of the problem, I don't think it is worth exploring. Additionally, you'd still wind up with effectively useless "garbage" paths in your search space (like bank/private/private/private/private/private/foo), which are only going to slow down the matching algorithm as it will have to try matching against 10,000 paths all the time, even if your project has well under 10,000 files in it.

    So, I don't think this is actually a "bug" as such in Command-T, and I actually think having a circular-reference in your directory hierarchy is more of a "bug" in your project layout, but I am not averse to the idea of implementing the second workaround to be more robust in the face of this kind of "pathological" input. (This is now the second time I've seen somebody puzzled by the way Command-T behaves because they had such a "bug" in their project layout. The fix in that case was for them to remove the circular reference which they hadn't even intentionally added in the first place, but I'd rather not have people puzzled by the behavior of the tool, even if it is their own "fault").

    My only concern is making sure it works across all platforms (quite a few people use Command-T on Windows), and making sure that it doesn't have too much of an adverse performance impact. It may well do so, as for each path we traverse, we'll have to expand it to get its true location on the file system, record that path, and then check against that set of traversed paths for every single other path we look at during traversal. This immediately doubles our space requirements (because we have store not only the apparent path, but the actual path as well), and while checking for set membership should be relatively quick, I am not sure whether getting the real path will be an expensive operation that will turn into a noticeable delay when you have to do it for large projects.

  2. Greg Hurrell 2011-07-07T07:37:08Z

    Summary changed:

    • From: symlink prevents Command-T from scanning for files
    • To: Self-referential symlink causes Command-T to hit file limit before scanning all files in a directory
  3. anonymous 2011-07-07T08:22:27Z

    I admire your attention to detail! I agree that the directory structure is pathological input, but I think Command-T should be able to not fail in the face of it (even if it doesn't have to work well).

    To detect circular references, wouldn't it be sufficient to only expand suspicious directories, i.e. with the same name as another directory? e.g. for foo/bar/foo/, you would check that it is not a symlink to foo/ (or vice versa). Maybe you could also raise the bar of suspiciousness to 2 repeated directory names, e.g. foo/bar/x/y/z/foo/bar/.

    Another suggestion: is it possible to notify the user that the 10,000 file limit has been reached?

  4. Greg Hurrell 2011-07-07T17:10:37Z

    I don't think that will work, as symlinks need not be named like their targets. ie. in your example, private links to ., so the scanning algorithm would happily enter ./private/bank without any clues that it was entering bank for the second time, and neither . nor bank are symlinks.

    As for notifying the user, I split that off into a separate ticket, issue #1839.

  5. anonymous 2011-07-08T08:00:01Z

    You are right, a symlink could be named differently to its target, but you could be suspicious about any subdirectories, e.g. you could be suspicious that private/bank/ contains a circular reference somewhere, and is actually identical to bank/.

    Or you could raise suspicion on only sub-subdirectories, e.g. you could be suspicious that private/bank/foo/ is identical to bank/foo/.

    I don't mean to argue that I really think this is the right thing to do, I am just throwing an idea into your hat.

  6. anonymous 2011-07-08T08:16:47Z

    Or why not just be supicious about any repeating name in the path, eg.

    foo/foo/ (foo looks like symlink to .)

    foo/bar/foo/ (bar looks like symlink to .)

  7. Greg Hurrell 2012-06-27T07:37:08Z

    This pull request may handle some (or most?) of the simple cases, without incurring much cost or complexity. Not sure yet if there are any edge cases which would trip it up. I'm going to try it out on the "next" branch for a while.

  8. Greg Hurrell 2012-06-30T19:26:07Z

    Status changed:

    • From: new
    • To: closed
Add a comment

Comments are now closed for this issue.

  • contact
  • legal

Menu

  • Blog
  • Wiki
  • Issues
  • Snippets