Nutch Plugin Mechanism and Examples

Source: 优易学, 2011-1-18 12:59:47

　　<name>plugin.includes</name>
　　<value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|recommended</value>
　　<description>Regular expression naming plugin directory names to
　　include. Any plugin not matching this expression is excluded.
　　In any case you need at least include the nutch-extensionpoints plugin. By
　　default Nutch includes crawling just HTML and plain text via HTTP,
　　and basic indexing and search plugins.
　　</description>
　　</property>
　　Compiling your plugin with Ant
　　Before compiling, edit the file src/plugin/build.xml, which holds the settings for building and deploying plugins. You will see many lines of the following form:
　　<ant dir="[plugin-name]" target="deploy" />
　　Add a new line before the closing </target> tag:
　　<ant dir="recommended" target="deploy" />
　　Running 'ant' in the root of your checkout directory should get everything compiled and jarred up. The next time you run a crawl, your parser and index filter should get used.
　　You'll need to run 'ant war' to compile a new ROOT.war file. Once you've deployed that, your query filter should get used when searches are performed.
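　　Because plugin.includes is a regular expression, it can help to check which plugin directory names it selects before running a crawl. The following is a minimal sketch, not Nutch's own code: the class PluginIncludesCheck and its isIncluded method are hypothetical names, and it assumes the expression has to match the whole directory name.

```java
import java.util.regex.Pattern;

// Sketch only: mimics the documented meaning of plugin.includes
// (a regular expression naming plugin directory names to include).
// PluginIncludesCheck is a hypothetical helper, not a Nutch class.
class PluginIncludesCheck {

    // Value copied from the <value> element of the plugin.includes property.
    static final String PLUGIN_INCLUDES =
        "nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)"
        + "|index-basic|query-(basic|site|url)|recommended";

    private static final Pattern INCLUDES = Pattern.compile(PLUGIN_INCLUDES);

    // Assumption: the whole directory name must match the expression.
    static boolean isIncluded(String pluginDir) {
        return INCLUDES.matcher(pluginDir).matches();
    }

    public static void main(String[] args) {
        String[] candidates = {"parse-html", "parse-pdf", "recommended", "reccomended"};
        for (String dir : candidates) {
            System.out.println(dir + " -> " + isIncluded(dir));
        }
    }
}
```

　　Under that assumption, parse-html and recommended are selected, while parse-pdf and a misspelled directory name such as reccomended are not, so the name used in build.xml and the plugin directory has to agree with the expression.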
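　　Plugins and class loading in Nutch: common questions
　　The best thing for plugin developers is freedom: you can ignore what the developers of other plugins are doing, and you are free to use third-party jar libraries.
　　How does Nutch solve the class-loading problem?
　　Nutch takes a very simple approach: every plugin has its own class loader, and this class loader is initialized before the plugin is started.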
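　　The idea of one class loader per plugin can be sketched with the standard java.net.URLClassLoader. This is only an illustration under stated assumptions, not the way Nutch implements it: the PluginLoaders class, its loaderFor method, and the choice of the application class loader as parent are all hypothetical.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.HashMap;
import java.util.Map;

// Sketch only: a hypothetical registry that gives each plugin its own
// class loader, created before the plugin is started.
class PluginLoaders {

    private final Map<String, URLClassLoader> loaders = new HashMap<>();

    // Returns the loader for a plugin, creating it on first use.
    // pluginJars would point at the plugin's own jar files, so two plugins
    // can bundle different versions of the same third-party library.
    URLClassLoader loaderFor(String pluginId, URL[] pluginJars) {
        return loaders.computeIfAbsent(
            pluginId,
            id -> new URLClassLoader(pluginJars, PluginLoaders.class.getClassLoader()));
    }

    public static void main(String[] args) {
        PluginLoaders registry = new PluginLoaders();
        URL[] noJars = new URL[0]; // empty on purpose; real plugins list their jars

        ClassLoader html = registry.loaderFor("parse-html", noJars);
        ClassLoader pdf = registry.loaderFor("parse-pdf", noJars);

        // Each plugin gets a distinct loader; asking again returns the same one.
        System.out.println("distinct loaders: " + (html != pdf));
        System.out.println("loader reused: "
            + (html == registry.loaderFor("parse-html", noJars)));
    }
}
```

　　Because classes found in a plugin's own jars are defined by that plugin's loader, a library bundled with one plugin is not visible to another, which is what gives plugin developers the freedom described above.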
　　Writing plugins - by Stefan
　　Plugins in Nutch 0.7
　　To use any of these plugins in Nutch, you only need to edit conf/nutch-site.xml and add the names of the plugins you want to the plugin.includes list.
　　clustering-carrot2 - Online Search Results Clustering using Carrot2's Lingo component.
　　creativecommons - Support for crawling and searching Creative-Commons licensed content.
　　index-basic - Adds url, content and anchor fields to the index.
　　index-more - Adds date, content-length, contentType, primaryType and subType fields to the index.
　　languageidentifier - Adds a lang field to the index and allows you to query against it.
　　ontology - Helps refine queries based on owl files.
　　parse-ext - A wrapper that invokes external command to do real parsing job.
　　parse-html - Parses HTML documents
　　parse-js - Parses JavaScript
　　parse-mp3 - Parses MP3s
　　parse-msword - Parses MS Word documents
　　parse-pdf - Parses PDFs
　　parse-rss - Parses RSS feeds
　　parse-rtf - Parses RTF files
　　parse-text - Parses text documents
　　protocol-file - Retrieves documents from the filesystem
　　protocol-ftp - Retrieves documents through ftp
　　protocol-http - Retrieves documents through http
　　protocol-httpclient - Retrieves documents through http and https
　　query-basic - Runs queries against content, url and anchor fields
　　query-more - Runs queries against date, content-length, contentType, primaryType and subType fields.
　　query-site - Runs queries against site field
　　query-url - Runs queries against url field.
　　urlfilter-prefix
　　urlfilter-regex
　　Additional Plugins in Dev Branch (0.8)
　　analysis-de
　　analysis-fr
　　lib-commons-httpclient
　　lib-http
　　lib-jakarta-poi
　　lib-log4j
　　lib-lucene-analyzers
　　lib-nekohtml
　　lib-parsems
　　parse-msexcel - Parses MS Excel documents
　　parse-mspowerpoint - Parses MS PowerPoint documents
　　parse-oo - Parses Open Office and Star Office documents (Extensions: ODT, OTT, ODH, ODM, ODS, OTS, ODP, OTP, SXW, STW, SXC, STC, SXI, STI)
　　parse-swf - Parses Flash SWF files
　　microformats-reltag - Adds rel-tag fields to the index and runs queries against them.
　　parse-zip