
While using dtSearch to create a search index for a secure part of an extranet site, we ran into some strange behaviour: the dtSearch spider had no access to the secure section.

We thought we had this covered by creating a custom login page that the spider could use to log in and that would redirect it to the secure part. This custom page did its job well, but the spider would still crawl the rest of the secure extranet as the Anonymous user.
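The custom login page amounted to little more than this sketch. The page name, the shared-secret check, and the redirect target below are hypothetical placeholders, not our actual implementation:

```csharp
// SpiderLogin.aspx.cs -- hypothetical sketch of the custom login page.
// The query-string secret and all names are illustrative placeholders.
using System;
using System.Web.Security;

public partial class SpiderLogin : System.Web.UI.Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
        // Only let the indexing spider in, e.g. via a shared secret in the URL.
        if (Request.QueryString["key"] == "spider-secret")
        {
            // Issue the forms-authentication cookie for a dedicated spider account...
            FormsAuthentication.SetAuthCookie("dtSearchSpider", false);

            // ...then send the spider into the secure part of the extranet.
            Response.Redirect("~/secure/Default.aspx");
        }
    }
}
```

This fragment only runs inside an ASP.NET site configured for forms authentication; it is meant to show the shape of the approach, not a drop-in page.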

After finding out that the next page the crawler visited was the logoff page, which ended its authenticated session and explained the anonymous crawling, we knew we had to prevent the crawler from going there. dtSearch has the following to say about that:
Excluding sections of web sites

In the dtSearch Indexer, you can use filename filters and exclude filters to limit indexing by filename or folder name. For example, you could use a filter of */OnlyThisFolder/* to limit indexing to documents in a folder named OnlyThisFolder, or you could use an exclude filter of */NotThisFolder/* to prevent anything in the folder named NotThisFolder (or subfolders) from being indexed. For more information on filename filters, see: How to exclude folders from an index.
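Applied to our situation, an exclude filter along these lines would keep the spider away from the logoff page (the exact path and page name are hypothetical; adjust them to your site):

```
*/logoff.aspx*
```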

Additionally, the dtSearch Spider checks for robots.txt and robots META tags in web pages, so you can use a robots.txt file or embedded tags in web pages to specify whether they should be indexed, and whether the Spider should check them for links when indexing the site. For more information on robots.txt and the robots META tag standard, see:

http://www.robotstxt.org/wc/meta-user.html

http://www.robotstxt.org/wc/exclusion.html
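For our problem, a robots.txt rule was the more direct fix, because it stops the spider from ever requesting the logoff page and thereby ending its own session. A sketch, assuming the logoff page lives at /extranet/logoff.aspx (the path is hypothetical):

```
# robots.txt -- keep crawlers away from the logoff page
User-agent: *
Disallow: /extranet/logoff.aspx
```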

 © Evident Interactive BV