<name>plugin.includes</name>
<value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|recommended</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need to include at least the nutch-extensionpoints plugin. By
default Nutch includes plugins for crawling just HTML and plain text via HTTP,
and basic indexing and search plugins.
</description>
</property>
Compiling your plugin with Ant
Before compiling, edit src/plugin/build.xml, which controls how the plugins are built and deployed. You will see many lines of the following form:
<ant dir="[plugin-name]" target="deploy" />
Add a new line before the closing </target> tag:
<ant dir="recommended" target="deploy" />
Running 'ant' in the root of your checkout directory should get everything compiled and jarred up. The next time you run a crawl, your parser and index filter should get used.
You'll need to run 'ant war' to compile a new ROOT.war file. Once you've deployed that, your query filter should get used when searches are performed.
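Plugins and class loading in Nutch: common questions
The best thing for a plugin developer is freedom: you do not have to care what other plugin developers are doing, and you are free to use third-party jar libraries.
How does Nutch solve the class-loading problem?
Nutch takes a very simple approach: every plugin has its own class loader, and that class loader is initialized before the plugin starts.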
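The one-class-loader-per-plugin idea can be illustrated with a minimal sketch built on the standard java.net.URLClassLoader. This is not Nutch's actual implementation; the class name PluginLoaderSketch, the method loaderFor, and the plugins/... directory paths are hypothetical and exist only for this example.

```java
import java.io.File;
import java.io.UncheckedIOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: one class loader per plugin directory.
public class PluginLoaderSketch {

    // Builds a class loader that sees only the jars inside one plugin
    // directory, plus whatever the shared parent loader already provides.
    public static URLClassLoader loaderFor(File pluginDir) {
        List<URL> urls = new ArrayList<>();
        File[] files = pluginDir.listFiles(); // null if the directory is missing
        if (files != null) {
            for (File f : files) {
                if (f.getName().endsWith(".jar")) {
                    try {
                        urls.add(f.toURI().toURL());
                    } catch (MalformedURLException e) {
                        throw new UncheckedIOException(e);
                    }
                }
            }
        }
        return new URLClassLoader(urls.toArray(new URL[0]),
                PluginLoaderSketch.class.getClassLoader());
    }

    public static void main(String[] args) {
        // Assumed directory layout, for illustration only.
        URLClassLoader html = loaderFor(new File("plugins/parse-html"));
        URLClassLoader pdf = loaderFor(new File("plugins/parse-pdf"));

        // Each plugin gets its own loader instance, so third-party jars
        // bundled with one plugin are not visible to the other.
        System.out.println("distinct loaders: " + (html != pdf));
        System.out.println("shared parent: " + (html.getParent() == pdf.getParent()));
    }
}
```

Because each loader only searches its own plugin directory before delegating to the shared parent, two plugins could in principle ship different versions of the same third-party library without conflict.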
Writing plugins, by Stefan
Plugins in Nutch 0.7
To use any of these plugins in Nutch, just edit conf/nutch-site.xml and add the names of the plugins you want to the plugin.includes list.
clustering-carrot2 - Online Search Results Clustering using Carrot2’s Lingo component.
creativecommons - Support for crawling and searching Creative-Commons licensed content.
index-basic - Adds url, content and anchor fields to the index.
index-more - Adds date, content-length, contentType, primaryType and subtype fields to the index.
languageidentifier - Adds a lang field to the index and allows you to query against it.
ontology - Helps refine queries based on owl files.
parse-ext - A wrapper that invokes external command to do real parsing job.
parse-html - Parses HTML documents
parse-js - Parses JavaScript
parse-mp3 - Parses MP3s
parse-msword - Parses MS Word documents
parse-pdf - Parses PDFs
parse-rss - Parses RSS feeds
parse-rtf - Parses RTF files
parse-text - Parses text documents
protocol-file - Retrieves documents from the filesystem
protocol-ftp - Retrieves documents through ftp
protocol-http - Retrieves documents through http
protocol-httpclient - Retrieves documents through http and https
query-basic - Runs queries against content, url and anchor fields
query-more - Runs queries against date, content-length, contentType, primaryType and subType fields.
query-site - Runs queries against site field
query-url - Runs queries against url field.
urlfilter-prefix
urlfilter-regex
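Since plugin.includes is a regular expression over plugin directory names, it can help to check which names from the list above a given expression would select. The following is a minimal sketch using java.util.regex. It assumes that the whole directory name must match the expression, which is an assumption about Nutch's behavior rather than something stated here. The class name PluginIncludesSketch and the method select are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: shows which plugin directory names an
// includes expression would select under full-match semantics.
public class PluginIncludesSketch {

    public static List<String> select(String includesRegex, List<String> pluginDirs) {
        Pattern includes = Pattern.compile(includesRegex);
        List<String> selected = new ArrayList<>();
        for (String dir : pluginDirs) {
            // Assumption: the entire name must match, not just a prefix.
            if (includes.matcher(dir).matches()) {
                selected.add(dir);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        String includes = "nutch-extensionpoints|protocol-http|urlfilter-regex"
                + "|parse-(text|html)|index-basic|query-(basic|site|url)|recommended";
        List<String> dirs = Arrays.asList(
                "parse-html", "parse-pdf", "protocol-http",
                "protocol-httpclient", "recommended");
        // prints [parse-html, protocol-http, recommended]
        System.out.println(select(includes, dirs));
    }
}
```

Under this assumption, protocol-httpclient is not selected by the protocol-http alternative, so enabling it would mean adding its name to the expression explicitly, for example as protocol-(http|httpclient).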
Additional Plugins in Dev Branch (0.8)
analysis-de
analysis-fr
lib-commons-httpclient
lib-http
lib-jakarta-poi
lib-log4j
lib-lucene-analyzers
lib-nekohtml
lib-parsems
parse-msexcel - Parses MS Excel documents
parse-mspowerpoint - Parses MS Powerpoint documents
parse-oo - Parses Open Office and Star Office documents (Extensions: ODT, OTT, ODH, ODM, ODS, OTS, ODP, OTP, SXW, STW, SXC, STC, SXI, STI)
parse-swf - Parses Flash SWF files
microformats-reltag - Adds rel-tag fields to the index and runs queries against them.
parse-zip
Editor in charge: 小草