MarineCorps
Explosion!
stormbind said: You would start by downloading the contents of DMOZ.org, which is publicly available (whole, or in part) as an XML file. This is what Google and most other search engines do.
What is done after that varies from one search engine to another. There are some XML-based RDBMSs available, but you would probably parse it into some proprietary format.
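A minimal sketch of that parsing step, assuming a DMOZ-style RDF/XML dump with `ExternalPage` entries (the element and attribute names here are assumptions based on the directory's published format, and the tiny embedded sample stands in for the real multi-hundred-MB file):

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for a DMOZ-style dump; the real file is far too large
# to load whole, which is why we stream-parse it below.
SAMPLE = """<RDF>
  <ExternalPage about="http://example.com/">
    <Title>Example Site</Title>
    <Topic>Computers/Internet</Topic>
  </ExternalPage>
</RDF>"""

def parse_dump(stream):
    """Stream-parse the XML dump into plain dicts (our own 'proprietary' format)."""
    records = []
    for _, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "ExternalPage":
            records.append({
                "url": elem.get("about"),
                "title": elem.findtext("Title"),
                "topic": elem.findtext("Topic"),
            })
            elem.clear()  # free the element so memory stays flat on a huge dump
    return records

records = parse_dump(io.StringIO(SAMPLE))
```

Using `iterparse` rather than parsing the whole tree is what makes a multi-gigabyte dump feasible on an ordinary machine; from the list of dicts you could then build whatever index structure suits your engine.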
You could write a program to crawl the sites referenced, supplementing the content. If your net connection is fast enough, that could even be done in real time during searches, to get genuinely current results (none of that two-weeks-out-of-date nonsense that Google returns).
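A rough sketch of that query-time fetch, using only the Python standard library (the `crawl_live` helper name is hypothetical, and the offline demo at the bottom exercises just the text-extraction step so it runs without a network connection):

```python
from html.parser import HTMLParser
from urllib.request import urlopen

class TextExtractor(HTMLParser):
    """Collect the visible text of a page, so results reflect it as it is right now."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

def crawl_live(url, timeout=5):
    """Fetch a directory-listed site at search time (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as resp:
        return extract_text(resp.read().decode("utf-8", "replace"))

# Offline demonstration of the extraction step:
text = extract_text("<html><body><h1>Hi</h1><p>Fresh content</p></body></html>")
```

In practice you would cap the timeout aggressively and fetch the candidate pages concurrently, since a search that blocks on dozens of live HTTP requests is only tolerable if each one is fast.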
Current computers do not have the spare capacity to make light work of this (mine certainly does not), but at the rate HDDs and CPUs are improving, I doubt it will be long until they do.
DMOZ only has 5,112,742 entries, while Google has 8,168,684,336 pages indexed and Yahoo has 20 billion (or so they say).