Scansoft lmspider banned
I’ve just added “lmspider firstname.lastname@example.org” and “lmspider (email@example.com)” with a “Disallow: /” to my robots.txt file. I contacted them quite some time ago and never received a response to my question about the purpose of this spider. Until firstname.lastname@example.org responds to legitimate requests, I suggest you do the same on your server.
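If you want to check that such a rule really shuts the spider out before relying on it, here is a minimal sketch using Python's standard urllib.robotparser. The short "lmspider" token, the example.org URL, and the helper name is_allowed are my assumptions for illustration; the entries I actually added use the full user-agent strings quoted above, and a real crawler may match rules differently from Python's parser.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content: the short token "lmspider" stands in for
# the full user-agent strings mentioned in the post.
ROBOTS_TXT = """\
User-agent: lmspider
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given robots.txt text."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical URL, used only to exercise the rule.
    page = "https://example.org/some/page.html"
    print("lmspider allowed:", is_allowed(ROBOTS_TXT, "lmspider", page))
    print("other bot allowed:", is_allowed(ROBOTS_TXT, "SomeOtherBot", page))
```

Keep in mind that robots.txt is only a request: it works as long as the spider chooses to honor it.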
Update 05/03/2004: I received a response to my query about the purpose of email@example.com. It took almost two months, but at least they replied. As lmspider does not seem to be documented anywhere else, I am posting the response here on my server.
I leave it up to you whether you decide to support them or not.
From: LMSPIDER <LMSPIDER-at-scansoft.com>
Subject: RE: lmsipder?
To: Tobias Hoellrich, LMSPIDER <LMSPIDER-at-scansoft.com>
The lmspider user agent is a bot that collects text from the web. This is part of a research project here at Scansoft where we are trying to use web documents to improve the linguistic models we use in our speech recognition engine.
Our idea is that instead of training linguistic models on text like newspaper articles or journal papers as is traditionally done, we should focus more on what people actually write in the real world. The recent explosion of weblogs has resulted in a huge amount of text that is representative of what people really want to write about and we are searching for ways of using this information to improve the state of the art in speech recognition.
The text we collect is not sold or shared with outside parties; it is only used for internal research tasks. The contents of the web pages that we collect are never published in any product Scansoft sells, although we do hope to use this information to keep our lexicon up to date and focused on the words people tend to use more frequently in writing real world documents.
By allowing the lmspider to visit your site, you are in effect helping to influence speech recognition technology to be able to more accurately transcribe the kinds of documents you and your contributors want to create. We hope you will agree that this is a good reason to crawl the web and allow our spider to continue to visit.