Mar 8 2004

Scansoft lmspider banned

I’ve just added the user agents “lmspider lmspider@scansoft.com” and “lmspider (lmspider@scansoft.com)” with a “Disallow: /” rule to my robots.txt file. I contacted them quite some time ago and never received a response to my question about the purpose of this spider. Until lmspider@scansoft.com responds to legitimate requests, I suggest you do the same on your server.
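
For anyone who wants to do the same, here is a minimal sketch of what such a robots.txt record might look like, together with a quick offline check using Python's standard urllib.robotparser module. The record layout (both user-agent strings sharing one Disallow rule) and the agent name "SomeOtherBot" are assumptions for illustration only. The check shows how a standards-following parser reads the file; it cannot confirm that lmspider itself honors either string.

```python
from urllib.robotparser import RobotFileParser

# Assumed layout: one record listing both user-agent strings from the post,
# followed by a single rule that blocks the whole site.
ROBOTS_TXT = """\
User-agent: lmspider lmspider@scansoft.com
User-agent: lmspider (lmspider@scansoft.com)
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Check both lmspider strings against a sample path.
for agent in ("lmspider lmspider@scansoft.com",
              "lmspider (lmspider@scansoft.com)"):
    print(agent, "->", parser.can_fetch(agent, "/index.html"))

# "SomeOtherBot" is a made-up agent that no record names.
print("SomeOtherBot ->", parser.can_fetch("SomeOtherBot", "/index.html"))
```

A crawler that matches user agents differently from this parser may still ignore the record, so it is worth watching the server logs after making the change.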

Update 05/03/2004: I received a response to my query about the purpose of lmspider@scansoft.com. The reply took almost two months, but at least they sent one. As lmspider does not seem to be documented anywhere else, I am posting the response here on my server.
I leave it up to you to decide whether to support them or not.


From: LMSPIDER <LMSPIDER-at-scansoft.com>
Subject: RE: lmsipder?
To: Tobias Hoellrich, LMSPIDER <LMSPIDER-at-scansoft.com>

The lmspider user agent is a bot that collects text from the web. This is part of a research project here at Scansoft where we are trying to use web documents to improve the linguistic models we use in our speech recognition engine.

Our idea is that instead of training linguistic models on text like newspaper articles or journal papers as is traditionally done, we should focus more on what people actually write in the real world. The recent explosion of weblogs has resulted in a huge amount of text that is representative of what people really want to write about and we are searching for ways of using this information to improve the state of the art in speech recognition.

The text we collect is not sold or shared with outside parties, it is only used for internal research tasks. The contents of the web pages that we collect are never published in any product Scansoft sells although we do hope to use this information to keep our lexicon up to date and focused on the words the people tend to use more frequently in writing real world documents.

By allowing the lmspider to visit your site, you are in effect helping to influence speech recognition technology to be able to more accurately transcribe the kinds of documents you and your contributors want to create. We hope you will agree that this is a good reason to crawl the web and allow our spider to continue to visit.

5 Responses to “Scansoft lmspider banned”

  • Johan Svensson Says:

    They’re banned here as well. I find no purpose for them on my site.

  • Scott McGerik Says:

    Simson Garfinkel recently had an article, titled The Paper Killer (http://www.technologyreview.com/articles/garfinkel0504.asp), in Technology Review that mentioned Scansoft using the Web to build better OCR software.

  • Nicolas Says:

    I found this post because I was trying to figure out what it was since it scanned my site today. It’s been going for months and clearly hasn’t been shut down yet.

    Personally, I don’t mind it scanning my site, and all it picked up was the robots.txt and the XML of my current articles, so it’s hardly as persistent as MSNbot or Slurp from Yahoo (which is still pinging deleted pages several months after they were deleted).

  • Arden Wiebe Says:

    I just don’t need the extra load on my server here. If they want to pay, then fine. I’ve blocked their bot on 192.133.61.88 at the IP level. In case that doesn’t do it, I’ve also denied it content through a PHP script.

  • Manton Reece Says:

    Thanks for posting this info.

    I don’t mind them crawling text for research purposes. Unfortunately it looks like they aren’t very smart about what links to follow, since they just downloaded a 10MB MP3 file from my site.
