At Doubletwist Inc. we worked with 40 four-CPU Sun Ultra machines, each with 4 GB of RAM, to carry out annotation of the human genome. We were first, ahead of Celera and the HGP.
At that time (2000-2001) it was possibly the largest massively scaled Java technology deployment. The human genome annotation run took about 1.5 months the first time. After several revisions it still took about a month, even with all that hardware and an additional Sun Ultra SPARC box.

Today I was reading about Become.com’s web crawler deployment. It may be somewhat bigger in the data it handles, and it is an interesting example of a massively scaled deployment.

Java Deployment

Become.com’s decision to deploy Java technology followed the experience of the company’s CTO, chairman, and cofounder, Yeogirl Yun, at Wisenut.com, where Wisenut spent a year creating a C++ web crawler that had significant memory and threading problems.

“We needed to do it faster this time,” observes Yun. “So we made the radical decision to implement a crawler using Java technology. No one believed it was possible, but we were able to build the prototype crawler in three months using two developers, which was a major achievement. The built-in network library, multithreading framework, and RMI [remote method invocation] saved a lot of development time.”

Become.com’s crawlers build a web index, a searchable database, roughly every two weeks. The crawler searches for shopping-related information only. The fetcher, which itself stores no information, classifies information by running several checks on every page it locates. It looks for page type and language and filters out duplicates and spam. It identifies links, buying guides, expert reviews, forums, articles, and other relevant materials. Then it sends the information back to the crawl controller, which guides the crawl. Once the process is finished, the result is a database of all pages visited, ordered by URL. Although searches are currently limited to English, the crawler is constructed so that it can scale easily to other languages.
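One of the checks mentioned above, filtering out duplicates, can be sketched in a few lines of Java. This is only an illustration and not Become.com's code: the class name `DuplicateFilter`, the use of SHA-256, and the decision to catch exact duplicates only are all my assumptions. Since the fetcher itself stores nothing, a set like this would presumably live with the crawl controller or some shared service.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical exact-duplicate filter: remembers a digest of every page body seen. */
public class DuplicateFilter {
    // Digests of page bodies seen so far (assumed to fit in memory for this sketch).
    private final Set<String> seenDigests = new HashSet<>();

    /** Returns true the first time a page body is seen, false for an exact repeat. */
    public boolean isNew(String pageBody) {
        // Set.add returns false when the digest was already present.
        return seenDigests.add(sha256Hex(pageBody));
    }

    /** Hex-encoded SHA-256 digest of the given text. */
    static String sha256Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }
}
```

A real crawler would more likely want near-duplicate detection, since two pages that differ by a timestamp or an ad block produce different digests here.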

The gathered information then goes to an “inverted” index, currently of 3.2 billion web pages, ordered not by URL but by keyword. The index is then fine-tuned using both expert feedback from the Become.com research team and page-value connectivity analysis, which notes the frequency with which other pages on the same topic link to a page. The crawler takes about a week to complete its task. Finally, all of this information feeds into the next crawl.
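For readers who have not met the term, an inverted index maps each keyword to the pages that contain it, instead of mapping each page to its words. Here is a minimal in-memory sketch in Java. The class and method names are mine, the tokenizer is a naive split on non-word characters, and nothing here reflects how Become.com actually stores 3.2 billion pages.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical toy inverted index: keyword -> sorted set of URLs. */
public class InvertedIndexSketch {
    private final Map<String, Set<String>> postings = new TreeMap<>();

    /** Records every word of the page text as pointing at the page's URL. */
    public void addPage(String url, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (token.isEmpty()) {
                continue; // split yields an empty first token if text starts with punctuation
            }
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(url);
        }
    }

    /** Returns the URLs containing the keyword, in sorted order (empty if none). */
    public List<String> lookup(String keyword) {
        Set<String> urls = postings.get(keyword.toLowerCase(Locale.ROOT));
        return urls == null ? new ArrayList<>() : new ArrayList<>(urls);
    }
}
```

After `addPage("http://a.example/buy", "Camera buying guide")`, a call to `lookup("camera")` returns a list containing that URL. Ranking signals such as the connectivity analysis described above would sit on top of a structure like this; the sketch does not attempt them.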

Details

In developing Crawler B, Bart Niechwiej tried out the java.nio library (NIO) and got better performance than with a multithreaded version. Unfortunately, some classes, such as URL, did not support NIO, so he implemented his own URL connection.
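To give a feel for what fetching over NIO looks like, here is a rough sketch of one thread driving many connections through a single `Selector`. It is not Niechwiej's code. The class `NioFetchSketch`, the `fetchAll` method, and the choice to send the same raw request text to every target are assumptions made for illustration. Because `java.net.URL` cannot be used with NIO channels, the request is written by hand to a `SocketChannel`, and the response is read until the server closes the connection.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Hypothetical single-threaded, selector-based fetcher. */
public class NioFetchSketch {
    /** Per-connection state, attached to each selection key. */
    private static final class Pending {
        final InetSocketAddress target;
        final ByteBuffer request;
        final ByteArrayOutputStream response = new ByteArrayOutputStream();

        Pending(InetSocketAddress target, String requestText) {
            this.target = target;
            this.request = ByteBuffer.wrap(requestText.getBytes(StandardCharsets.US_ASCII));
        }
    }

    /** Sends requestText to every target; failed connections are left out of the result. */
    public static Map<InetSocketAddress, String> fetchAll(
            List<InetSocketAddress> targets, String requestText) throws IOException {
        Map<InetSocketAddress, String> results = new HashMap<>();
        try (Selector selector = Selector.open()) {
            for (InetSocketAddress target : targets) {
                SocketChannel ch = SocketChannel.open();
                ch.configureBlocking(false);
                // connect() may complete immediately (e.g. on loopback); then skip OP_CONNECT.
                boolean connected = ch.connect(target);
                ch.register(selector,
                        connected ? SelectionKey.OP_WRITE : SelectionKey.OP_CONNECT,
                        new Pending(target, requestText));
            }
            ByteBuffer buf = ByteBuffer.allocate(8192);
            int open = targets.size();
            while (open > 0) {
                selector.select(); // no timeout in this sketch: a silent server blocks forever
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = it.next();
                    it.remove();
                    SocketChannel ch = (SocketChannel) key.channel();
                    Pending p = (Pending) key.attachment();
                    try {
                        if (key.isConnectable()) {
                            if (ch.finishConnect()) {
                                key.interestOps(SelectionKey.OP_WRITE);
                            }
                        } else if (key.isWritable()) {
                            ch.write(p.request);
                            if (!p.request.hasRemaining()) {
                                key.interestOps(SelectionKey.OP_READ);
                            }
                        } else if (key.isReadable()) {
                            buf.clear();
                            int n = ch.read(buf);
                            if (n < 0) { // server closed the connection: response complete
                                results.put(p.target, new String(
                                        p.response.toByteArray(), StandardCharsets.US_ASCII));
                                ch.close();
                                open--;
                            } else {
                                p.response.write(buf.array(), 0, n);
                            }
                        }
                    } catch (IOException e) {
                        // Connection refused or reset: give up on this target only.
                        key.cancel();
                        ch.close();
                        open--;
                    }
                }
            }
        }
        return results;
    }
}
```

A call such as `fetchAll(targets, "GET / HTTP/1.0\r\n\r\n")` returns the raw responses keyed by address. The sketch does no HTTP parsing, no redirects, and no timeouts, all of which a real crawler needs.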

He used Tomcat for his statistics server and required 20 GB of memory for fetchers, which ran on 10 separate 32-bit machines with 2 GB each.