At Doubletwist Inc. we worked with 40 four-CPU Sun Ultra machines, each with 4 GB of RAM, to annotate the human genome. We were first, ahead of Celera and the HGP.
At that time (2000-2001) it was possibly the largest massively scaled Java technology deployment. The human genome annotation run took about 1.5 months the first time. After several revisions it still took about a month, even with all that hardware and an additional Sun UltraSPARC box.

Today I was reading about Become.com’s web crawler deployment. It may be somewhat bigger in the amount of data it handles, and it is an interesting example of a massively scaled deployment.

Java Deployment

Become.com’s decision to deploy Java technology followed the experience of the company’s CTO, chairman, and cofounder, Yeogirl Yun, at Wisenut.com, where Wisenut spent a year creating a C++ web crawler that had significant memory and threading problems.

“We needed to do it faster this time,” observes Yun. “So we made the radical decision to implement a crawler using Java technology. No one believed it was possible, but we were able to build the prototype crawler in three months using two developers, which was a major achievement. The built-in network library, multithreading framework, and RMI [remote method invocation] saved a lot of development time.”

Become.com’s crawlers build a web index, a searchable database, roughly every two weeks. The crawl covers shopping-related information only. The fetcher, which itself stores no information, classifies each page it locates by running several checks: it detects page type and language, filters out duplicates and spam, and identifies links, buying guides, expert reviews, forums, articles, and other relevant materials. It then sends this information back to the crawl controller, which guides the crawl. Once the process is finished, the crawler forms a database of all pages visited, ordered by URL. Although searches are currently limited to English, the crawler is constructed so that it can scale easily to other languages.
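To make the fetcher's per-page checks concrete, here is a minimal sketch of two of them: duplicate filtering and link extraction. This is not Become.com's code. The class name `FetcherSketch`, the `Report` type, the SHA-256 fingerprint, and the regex-based link matcher are all assumptions for illustration. The set of seen fingerprints is passed in by the caller (standing in for the crawl controller), so the fetcher itself keeps no state, in line with the description above.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FetcherSketch {
    // Hypothetical href matcher; a production crawler would use a real HTML parser.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    /** What the fetcher reports back to the controller for one page. */
    public static final class Report {
        public final String url;
        public final boolean duplicate;
        public final List<String> links;

        Report(String url, boolean duplicate, List<String> links) {
            this.url = url;
            this.duplicate = duplicate;
            this.links = links;
        }
    }

    /**
     * Runs the checks on one page. The fingerprint set is owned by the caller,
     * so this method holds no state of its own.
     */
    public static Report check(String url, String html, Set<String> seenFingerprints) {
        // Set.add returns false when the fingerprint was already present.
        boolean duplicate = !seenFingerprints.add(fingerprint(html));
        return new Report(url, duplicate, extractLinks(html));
    }

    /** Collects the targets of all double-quoted href attributes, in document order. */
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    /** Hex-encoded SHA-256 of the page body; catches exact duplicates only. */
    static String fingerprint(String body) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(body.getBytes(StandardCharsets.UTF_8))) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is required on every Java platform", e);
        }
    }
}
```

An exact hash only catches byte-identical pages. Near-duplicate detection, spam filtering, and page-type or language classification would need additional checks that this sketch does not attempt.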

The gathered information then goes into an “inverted” index, currently of 3.2 billion web pages, ordered not by URL but by keyword. The index is then fine-tuned using both expert feedback from the Become.com research team and page-value connectivity analysis, which notes how frequently other pages on the same topic link to a page. The crawler takes about a week to complete its task. Finally, all of this information feeds into the next crawl.
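For readers unfamiliar with the term, an inverted index maps each keyword to the pages that contain it, rather than each page to its text. The following is a minimal in-memory sketch of that idea only. The class name `InvertedIndexSketch` and the simple tokenizer (lowercase, split on anything that is not an ASCII letter or digit) are assumptions; an index of billions of pages would be distributed and disk-backed, and nothing here reflects Become.com's actual design.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

public class InvertedIndexSketch {
    // keyword -> sorted set of URLs whose text contains that keyword
    private final Map<String, SortedSet<String>> postings = new HashMap<>();

    /** Tokenizes the page text and records the URL under every keyword found. */
    public void addPage(String url, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue; // split() yields an empty first token if text starts with a separator
            }
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(url);
        }
    }

    /** Returns the URLs containing the keyword, or an empty set if there are none. */
    public SortedSet<String> lookup(String keyword) {
        SortedSet<String> urls = postings.get(keyword.toLowerCase(Locale.ROOT));
        return urls == null
                ? Collections.<String>emptySortedSet()
                : Collections.unmodifiableSortedSet(urls);
    }
}
```

A lookup is then a single map access, which is why search engines organize the index by keyword instead of scanning the URL-ordered page database.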

Details

In developing Crawler B, Bart Niechwiej tried out the java.nio library (NIO) and got better performance than with a multithreaded version. Unfortunately, some classes, such as URL, did not support NIO, so he implemented a URL connection.
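The general NIO technique can be sketched as follows: one thread uses a `Selector` to drive many non-blocking `SocketChannel` connections, and because `java.net.URL` cannot be used with a selector, the HTTP request is written by hand. This is an illustration under stated assumptions, not Niechwiej's implementation. The class `NioFetchSketch` and its methods are hypothetical; it handles plain `http` only, sends a bare HTTP/1.0 GET so the server closes the connection after responding, and returns the raw response (headers included) without parsing it.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.URI;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

public class NioFetchSketch {
    /** Per-connection state, attached to the channel's SelectionKey. */
    private static final class Conn {
        final URI url;
        final ByteBuffer request;
        final StringBuilder response = new StringBuilder();

        Conn(URI url) {
            this.url = url;
            this.request = ByteBuffer.wrap(buildRequest(url).getBytes(StandardCharsets.ISO_8859_1));
        }
    }

    /** Builds a minimal HTTP/1.0 GET request for the URL's path and query. */
    static String buildRequest(URI url) {
        String path = url.getRawPath();
        if (path == null || path.isEmpty()) path = "/";
        if (url.getRawQuery() != null) path += "?" + url.getRawQuery();
        return "GET " + path + " HTTP/1.0\r\nHost: " + url.getHost() + "\r\n\r\n";
    }

    /** Returns the explicit port, or 80 when the URL does not specify one. */
    static int portOf(URI url) {
        return url.getPort() == -1 ? 80 : url.getPort();
    }

    /** Fetches all URLs on the calling thread; URLs that fail with an I/O error are omitted. */
    public static Map<URI, String> fetchAll(List<URI> urls) throws IOException {
        Map<URI, String> results = new HashMap<>();
        ByteBuffer readBuf = ByteBuffer.allocate(8192);
        Selector selector = Selector.open();
        try {
            for (URI url : urls) {
                SocketChannel ch = SocketChannel.open();
                ch.configureBlocking(false);
                // Caveat: resolving the host name here is still a blocking DNS lookup.
                boolean connected = ch.connect(new InetSocketAddress(url.getHost(), portOf(url)));
                ch.register(selector, connected ? SelectionKey.OP_WRITE : SelectionKey.OP_CONNECT, new Conn(url));
            }
            int open = urls.size();
            while (open > 0) {
                selector.select(); // blocks until at least one channel is ready
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = it.next();
                    it.remove();
                    SocketChannel ch = (SocketChannel) key.channel();
                    Conn conn = (Conn) key.attachment();
                    try {
                        if (key.isConnectable()) {
                            if (ch.finishConnect()) key.interestOps(SelectionKey.OP_WRITE);
                        } else if (key.isWritable()) {
                            ch.write(conn.request);
                            if (!conn.request.hasRemaining()) key.interestOps(SelectionKey.OP_READ);
                        } else if (key.isReadable()) {
                            readBuf.clear();
                            if (ch.read(readBuf) < 0) { // end of stream: response complete
                                results.put(conn.url, conn.response.toString());
                                ch.close();
                                open--;
                            } else {
                                readBuf.flip();
                                // ISO-8859-1 maps bytes to chars one-to-one, so nothing is lost.
                                conn.response.append(StandardCharsets.ISO_8859_1.decode(readBuf));
                            }
                        }
                    } catch (IOException e) {
                        ch.close(); // drop this URL and keep serving the others
                        open--;
                    }
                }
            }
        } finally {
            for (SelectionKey key : new ArrayList<>(selector.keys())) key.channel().close();
            selector.close();
        }
        return results;
    }
}
```

The appeal of this design is that a single thread can keep many sockets in flight without one stack per connection. The sketch omits timeouts, redirects, HTTPS, and politeness limits, all of which a real crawler needs.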

He used Tomcat for his statistics server and required 20 GB of memory for the fetchers, which ran on 10 separate 32-bit machines with 2 GB each.