Nutch Plugin Mechanism and Examples
Source: 优易学, 2011-1-18 12:59:47

  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|recommended</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
  </property>
  Compile your plugin with Ant. Before building, edit src/plugin/build.xml, which holds the compile and deploy settings for the plugins. You will see many lines of the following form:
  <ant dir="[plugin-name]" target="deploy" />
  Add a new line before the closing </target> tag:
  <ant dir="recommended" target="deploy" />
  Running 'ant' in the root of your checkout directory should get everything compiled and jarred up. The next time you run a crawl, your parser and index filter should get used.
  You'll need to run 'ant war' to compile a new ROOT.war file. Once you've deployed that, your query filter should get used when searches are performed.
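  Plugins and class loading in Nutch: common questions
  The best thing for a plugin developer is freedom: you can ignore what other plugin developers are doing, and you are free to use third-party jar libraries.
  How does Nutch solve the class-loading problem?
  Nutch uses a very simple approach: every plugin has its own class loader, and that class loader is initialized before the plugin is started.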
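  The one-loader-per-plugin idea can be sketched with the standard java.net.URLClassLoader. This is only an illustration under assumptions, not Nutch's actual loader code: the class name PluginLoaderSketch, the helper createLoaders, and the one-folder-per-plugin layout are all hypothetical.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: one class loader per plugin directory.
final class PluginLoaderSketch {

    /** Creates a separate URLClassLoader for every plugin directory, all sharing one parent. */
    static Map<String, URLClassLoader> createLoaders(Map<String, Path> pluginDirs,
                                                     ClassLoader parent) throws IOException {
        Map<String, URLClassLoader> loaders = new LinkedHashMap<>();
        for (Map.Entry<String, Path> entry : pluginDirs.entrySet()) {
            // Each loader sees only its own plugin folder (assumed layout).
            URL[] urls = { entry.getValue().toUri().toURL() };
            loaders.put(entry.getKey(), new URLClassLoader(urls, parent));
        }
        return loaders;
    }

    public static void main(String[] args) throws Exception {
        // Temporary folders stand in for real plugin directories.
        Path root = Files.createTempDirectory("plugins");
        Map<String, Path> dirs = new LinkedHashMap<>();
        dirs.put("parse-html", Files.createDirectory(root.resolve("parse-html")));
        dirs.put("parse-pdf", Files.createDirectory(root.resolve("parse-pdf")));

        // Loaders are created up front, before any plugin code would run.
        ClassLoader parent = PluginLoaderSketch.class.getClassLoader();
        Map<String, URLClassLoader> loaders = createLoaders(dirs, parent);
        for (Map.Entry<String, URLClassLoader> entry : loaders.entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue().getURLs()[0]);
            entry.getValue().close();
        }
    }
}
```

  Because each plugin resolves its own classes through its own loader, two plugins can bundle different versions of the same third-party jar without seeing each other's copy, while classes from the shared parent loader stay common to both.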
  Writing plugins, by Stefan
  Plugins in Nutch 0.7
  To use any of these plugins in Nutch, you only need to edit conf/nutch-site.xml and add the names of the plugins you want to the plugin.includes list.
  clustering-carrot2 - Online Search Results Clustering using Carrot2’s Lingo component.
  creativecommons - Support for crawling and searching Creative-Commons licensed content.
  index-basic - Adds url, content and anchor fields to the index.
  index-more - Adds date, content-length, contentType, primaryType and subtype fields to the index.
  languageidentifier - Adds a lang field to the index and allows you to query against it.
  ontology - Helps refine queries based on owl files.
  parse-ext - A wrapper that invokes external command to do real parsing job.
  parse-html - Parses HTML documents
  parse-js - Parses JavaScript
  parse-mp3 - Parses MP3s
  parse-msword - Parses MS Word documents
  parse-pdf - Parses PDFs
  parse-rss - Parses RSS feeds
  parse-rtf - Parses RTF files
  parse-text - Parses text documents
  protocol-file - Retrieves documents from the filesystem
  protocol-ftp - Retrieves documents through ftp
  protocol-http - Retrieves documents through http
  protocol-httpclient - Retrieves documents through http and https
  query-basic - Runs queries against content, url and anchor fields
  query-more - Runs queries against date, content-length, contentType, primaryType and subType fields.
  query-site - Runs queries against site field
  query-url - Runs queries against url field.
  urlfilter-prefix
  urlfilter-regex
  Additional Plugins in Dev Branch (0.8)
  analysis-de
  analysis-fr
  lib-commons-httpclient
  lib-http
  lib-jakarta-poi
  lib-log4j
  lib-lucene-analyzers
  lib-nekohtml
  lib-parsems
  parse-msexcel - Parses MS Excel documents
  parse-mspowerpoint - Parses MS Powerpoint documents
  parse-oo - Parses Open Office and Star Office documents (Extensions: ODT, OTT, ODH, ODM, ODS, OTS, ODP, OTP, SXW, STW, SXC, STC, SXI, STI)
  parse-swf - Parses Flash SWF files
  microformats-reltag - Adds rel-tag fields to the index and runs queries against them.
  parse-zip
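
  To see how a plugin.includes expression picks plugins out of names like those listed above, here is a small sketch. It assumes each directory name must match the whole regular expression (full-match semantics); the class PluginIncludesSketch and the method selectPlugins are hypothetical names for illustration and are not part of Nutch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch: filter plugin directory names with a plugin.includes-style regex.
final class PluginIncludesSketch {

    /** Returns the directory names that match the whole includes expression. */
    static List<String> selectPlugins(String includesRegex, List<String> dirNames) {
        Pattern includes = Pattern.compile(includesRegex);
        List<String> selected = new ArrayList<>();
        for (String name : dirNames) {
            // Assumption: the entire name must match, not just a substring.
            if (includes.matcher(name).matches()) {
                selected.add(name);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // The plugin.includes value from the configuration fragment in this article.
        String includes = "nutch-extensionpoints|protocol-http|urlfilter-regex|"
                + "parse-(text|html)|index-basic|query-(basic|site|url)|recommended";
        List<String> dirs = Arrays.asList(
                "parse-html", "parse-pdf", "query-site", "query-more",
                "recommended", "protocol-ftp");
        // Prints: [parse-html, query-site, recommended]
        System.out.println(selectPlugins(includes, dirs));
    }
}
```

  Under this assumption, enabling another plugin such as parse-pdf means extending the expression, for example changing parse-(text|html) to parse-(text|html|pdf).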


Editor: 小草
