Originally posted on: Monday, 22 Aug 2005 20:00
Is the best search engine the one with the most documents in its index? Is 20 Billion a magic number? Is 20 Billion a real number?
One of the first things we realized when we embarked upon our quest at MSN to build the world’s best search engine was that figuring out how to measure search engine quality was vital to our hopes of success. We knew that our goal was only as good as our ability to measure it.
Recently I was at the Search Engine Strategies conference in San Jose, where I participated in the Executive Roundtable panel on Wednesday morning. Danny Sullivan asked me and representatives of the other major search engines about Yahoo’s recent claims on index size, and about our reactions to developing a standard way to measure relevance.
I’ll leave the detailed guessing about the “true” index sizes of Yahoo, MSN and Google to others who I see doing a good job of digging in at various places around the web (Danny Sullivan, John Battelle, NCSA).
Having the biggest index of the documents people care about is job #1 of a search engine. What’s the magic number we want at MSN? I’ll put it this way: we won’t be happy until every one of our customers finds every answer they are looking for. So we are constantly looking to improve both the quality and the size of our index.
Over the past months, thanks to some changes in our internal MSNRank calculations, we have added a huge number of good docs to our index without actually growing its size, and without removing any docs that users care about. The total number of docs didn’t grow, but the effective size grew dramatically, and our customers noticed it quickly. (If you know of a document we don’t have, please report it through the “help us improve” link on our search results page.)
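To make “effective size” a bit more concrete (this is only an illustration, not a description of our actual index pipeline): if an index has a fixed budget of slots and documents are admitted by a quality score, then better scoring swaps duplicates and junk out in favor of pages people actually want, so the useful portion of the index grows while the raw count stays flat. A minimal sketch, with a hypothetical quality_score and budget:

```python
import heapq

# Hypothetical fixed number of index slots -- purely illustrative.
INDEX_BUDGET = 1000000

def quality_score(doc):
    """Hypothetical per-document quality signal (a stand-in, not MSNRank)."""
    return doc["rank_signal"]

def select_index(candidate_docs):
    """Fill a fixed number of index slots with the highest-quality candidates.

    The total document count never exceeds the budget, but if scoring gets
    better at preferring good pages over duplicates and junk, more of the
    slots hold documents users care about: the "effective size" grows even
    though the raw count stays flat.
    """
    return heapq.nlargest(INDEX_BUDGET, candidate_docs, key=quality_score)
```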
Everyone seems to be focused on the number of docs, which is good, but I want to point out a few topics of equal importance that are being overlooked in this recent conversation:
- Paid Inclusion – At MSN, we believe that the web results we return should be ranked 100% on relevance. There is no way a site can get a boost to its ranking by paying us. We turned off paid inclusion over a year ago, and we got a big “thank you” from our customers through their feedback and their increased usage of our system. We are happy with that decision.
- Freshness – Any time a page is published on the web, we want to have it in our index as quickly as possible. Our crawler is designed very carefully to discover changes on the web fast: today a page on an important site will likely get discovered in less than 24 hours, and the same goes for an existing page whose contents have been updated. We have recently started exploring a real-time extension to our crawl that will drastically shrink that time delta in the future. We don’t want site owners to pay us to crawl them more frequently; we think it’s our job to crawl all important content as quickly as possible, because our customers expect it. (A rough sketch of this kind of recrawl prioritization follows this list.)
- Spam – Spam tries to trick our customers into clicking on misleading links. Search is a big business; therefore spam is a big business. We aggressively fight spam in many ways. If we doubled our reported index size but filled the new space with spam, that wouldn’t be right.
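As promised above, here is a rough sketch of the recrawl-prioritization idea behind the freshness point. It illustrates the general technique only; the RecrawlScheduler class, the change-rate estimate, and the interval formula are assumptions made for the example, not how our crawler actually works.

```python
import heapq
import time

class RecrawlScheduler:
    """Toy recrawl scheduler: pages that appear to change often get shorter
    revisit intervals, so their updates are discovered sooner."""

    def __init__(self):
        self._queue = []  # heap of (next_crawl_time, url)

    def schedule(self, url, changes_per_day):
        # Volatile pages get revisited within hours, stable ones within days.
        # The formula is illustrative only.
        interval_hours = min(24.0 * 7, max(1.0, 24.0 / max(changes_per_day, 0.01)))
        heapq.heappush(self._queue, (time.time() + interval_hours * 3600, url))

    def due_pages(self):
        """Return every URL whose scheduled recrawl time has arrived."""
        due, now = [], time.time()
        while self._queue and self._queue[0][0] <= now:
            due.append(heapq.heappop(self._queue)[1])
        return due

# Usage: a busy news front page vs. a rarely changing page.
scheduler = RecrawlScheduler()
scheduler.schedule("http://example.com/news", changes_per_day=24)    # ~hourly
scheduler.schedule("http://example.com/about", changes_per_day=0.1)  # ~weekly
```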
This discussion should turn to overall search engine quality, which includes the above and more. It needs to be a measure of how often our customers get their answers. We need to develop measures that are apples to apples, so people can understand how the various engines are progressing.
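One common apples-to-apples approach (offered purely as an illustration, not a claim about what any engine uses today) is to take a shared set of queries with human relevance judgments and score every engine’s top results with the same formula, such as discounted cumulative gain. A minimal sketch, assuming judgments on a 0–3 scale:

```python
import math

def dcg_at_k(judged_relevance, k=10):
    """Discounted cumulative gain over an engine's top-k results.

    judged_relevance: human judgments (e.g. 0-3) for the results, in the
    order the engine returned them. Relevant results near the top count
    more, and every engine is scored with the exact same formula.
    """
    return sum(rel / math.log2(rank + 2)        # rank 0 discounts by log2(2)
               for rank, rel in enumerate(judged_relevance[:k]))

def ndcg_at_k(judged_relevance, k=10):
    """Normalize against the best possible ordering so scores land in [0, 1].

    (A full evaluation would normalize against all judged documents for the
    query, not just the ones this engine returned; this is a simplification.)
    """
    ideal = dcg_at_k(sorted(judged_relevance, reverse=True), k)
    return dcg_at_k(judged_relevance, k) / ideal if ideal else 0.0

# Judgments for the same query from two hypothetical engines.
print(ndcg_at_k([3, 2, 0, 1, 0]))  # engine A put the best pages first
print(ndcg_at_k([1, 0, 3, 2, 0]))  # engine B buried them
```

Averaging a number like this over a large, shared query sample would give people a common footing for comparing how the various engines are progressing.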
Until we reach that point, please give MSN Search a try and tell us what you think. We’ve heard some really good comments in recent months about our improving quality. We are very excited about our progress — and more excited about the work in our labs.
Ken Moss
General Manager, MSN Web Search