Table of Contents
Apache Nutch is a web crawler that is used in conjunction with Solr to index web pages. If you have the Solr server installed prior to installing Nutch, you can immediately pass the Nutch results to Solr.
          Use your package manager to install Solr or follow the instructions
          given in Installing Solr on Red Hat-type Systems. Install Solr with
          the sample application file and don't worry about configuring Solr.
          Nutch comes with a Solr schema.xml file
          that works out of the box. Some changes may need to be made to the
          solrconfig.xml file but this is dealt
          with in Section 3, "Configuring Solr".
        
In this article Jetty is used as the servlet container for Solr but Tomcat will work equally well. This article describes using Nutch in the following configuration:
Apache Solr 3.6.1
CentOS release 6.3 (Final)
Nutch 1.5.1
In this document Nutch and Solr are running on the same machine.
          Download the Nutch tar file and decompress it to a location of your
          choice for example, /opt/apache-nutch-1.5.1. In this case the Nutch
          home directory is the /opt/apache-nutch-1.5.1 directory. You'll find the
          following files and directories below this directory:
        
bin CHANGES.txt conf docs lib LICENSE.txt logs NOTICE.txt plugins » README.txt
          The nutch command found
          in the bin directory is typically
          invoked from the Nutch home directory. If you wish you can set an
          environment variable, NUTCH_HOME, but
          this is not a requirement.
        
          Navigate to the Nutch home directory and create a directory named
          urls containing a file named
          seed.txt. Add a single line to this
          file to identify the domain that you wish to index, http://objectorientedphp.com/, for example.
        
          In the regex-urlfilter.xml file replace
          the last line +. with a regular expression
          identifying the domain that you wish to crawl, for example:
          +^http://([a-z0-9]*\.)*objectorientedphp.com/.
        
          You must also set the value of the http.agent.name property of the nutch-site.xml file before you run Nutch. If you
          like you can overwrite the nutch-site.xml file with the contents of
          nutch-default.xml file and then set
          http.agent.name-that's what nutch-default.xml is there for.
        
          You must also set the JAVA_HOME
          environment variable if it is not already set. If which java returns /usr/bin/java on a Red Hat-type system you would add
          the line export JAVA_HOME=/usr to the
          .bash_profile file in your home
          directory.
        
          To configure the Solr server to work with Nutch copy the schema.xml file found in the Nutch home
          conf directory to the solr/conf directory below the Solr home directory.
        
            If the schema name version number is set to 1.5.1 on the line, <schema name="nutch" version="1.5.1">,
            change this to 1.5. The Solr server
            will not start up until this change is made.
          
          When initiating queries you may also need to make changes to the
          solrconfig.xml file. This is discussed
          in Section 5, "Checking Your Index".
        
          It is good practice to test Nutch by first searching at a reduced
          depth. You can do this by navigating to the Nutch home directory and
          issuing the command: bin/nutch crawl
          urls -dir crawl -solr http://localhost:8983/solr/ -depth 3 -topN
          20. This command will crawl the domain defined in the
          urls/seed.txt file and at the same time
          create a searchable Solr index.
        
When Nutch has finished you should see output such as the following:
... LinkDb: internal links will be ignored. LinkDb: adding segment: file:/opt/apache-nutch-1.5.1/crawl-20121008104351/» segments/20121008105515 LinkDb: adding segment: file:/opt/apache-nutch-1.5.1/crawl-20121008104351/» segments/20121008104606 LinkDb: adding segment: file:/opt/apache-nutch-1.5.1/crawl-20121008104351/» segments/20121008104454 LinkDb: adding segment: file:/opt/apache-nutch-1.5.1/crawl-20121008104351/» segments/20121008104414 LinkDb: adding segment: file:/opt/apache-nutch-1.5.1/crawl-20121008104351/» segments/20121008105427 LinkDb: finished at 2012-10-08 10:56:06, elapsed: 00:00:16 SolrIndexer: starting at 2012-10-08 10:56:06 Indexing 114 documents SolrIndexer: finished at 2012-10-08 10:57:17, elapsed: 00:01:11 SolrDeleteDuplicates: starting at 2012-10-08 10:57:17 SolrDeleteDuplicates: Solr url: http://localhost:8983/solr/ SolrDeleteDuplicates: deleting 1 duplicates SolrDeleteDuplicates: finished at 2012-10-08 10:57:23, elapsed: 00:00:06 crawl finished: crawl-20121008104351
The next section verifies that web pages have been indexed.
          You can check the files that you have indexed by pointing your
          browser at http://.
          You should see something similar to the following:
        solr_server:8983/solr/admin/
          If you are using the default solrconfig.xml and you initiate a search such as
          *:* you may see the following error:
        
problem:
Problem accessing /solr/select/. Reason:
    undefined field text
        
          To remedy this, search the solr/solrconfig.xml file found below the Solr home
          directory for references to a field named text and replace these references with content. For example, the select request handler identifies the default field
          as text
        
  <requestHandler name="/select" class="solr.SearchHandler">
    <!-- default values for query parameters can be specified, these
         will be overridden by parameters in the request
      -->
     <lst name="defaults">
       <str name="echoParams">explicit</str>
       <int name="rows">10</int>
       <str name="df">text</str>
       ...
        
          Change text to content. After making this configuration change you
          will have to restart Jetty. On Red Hat-type systems issue the command
          service restart
          jetty.
        
You may find that you want to remove the Solr index and start again from scratch especially if you have performed a test crawl. From the command line of the machine hosting the Solr server, use the following commands to remove an existing Solr index:
shell> curl http://localhost:8983/solr/update -H "Content-Type: text/xml" \ --data-binary '<delete><query>*:*</query></delete>' shell> curl http://localhost:8983/solr/update -H \ "Content-Type: text/xml" --data-binary '<commit/>'
          You should also remove all the Nutch files found in the crawl directory below the Nutch home directory.
        
Once all configuration changes have been made, create the Solr index by navigating to the Nutch home directory and issuing the following command:
shell> bin/nutch crawl urls -dir crawl -solr http://localhost:8983/solr/ \ -depth 20 -topN 200
          The sites.txt file in the urls/ directory tells nutch which domain to crawl
          and the resulting files are stored in the crawl directory. These are used to build the Solr
          search index.
        
A complete description of the nutch crawl command is found at nutch crawl and reproduced below:
This class performs a complete crawl given a set of root urls.
Usage:
bin/nutch crawl <urlDir> [-solr <solrURL>] [-dir d] [-threads n]
[-depth i] [-topN N] <urlDir>: Contains text files with URL lists.
This must be an existing directory. Example would be ${NUTCH_HOME}/urls
[-solr <solrURL>]: Enables us to pass our Solr instance as an indexing
parameter to simplify the process of indexing with Solr.
[-dir d]: This parameter enables you to choose the directory Nutch
should use when crawling.
[-threads n]: This parameter enables you to choose how many threads
Nutch should use when crawling.
[-depth i]: You can tell Nutch how deep it should crawl. If you don't
tell Nutch a value, it takes 5 as his standard parameter. For example
if you pass -depth 1 as the parameter, Nutch will only index the first
level. If you say -depth 2 (or more) Nutch will follow this number
of outlinks.
[-topN N]: The maximum number of outlinks Nutch will obtain from
any one page.
      
          You can easily enhance your configuration by making changes to files
          such as the regex-urlfilter.txt file.
          For example, if there are file types that you do not wish to index,
          add them to this filter.
        
          If parser.skip.truncated in the
          nutch-site.xml file is set to
          true and you are using the default value
          for http.content.limit then no files
          larger than 65536 Kilobytes will be
          indexed. With this setting most PDF files will be ignored. The
          default setting is shown below:
        
<property> <name>http.content.limit</name> <value>65536</value> <description>The length limit for downloaded content using the http:// protocol, in bytes. If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, no truncation at all. Do not confuse this setting with the file.content.limit setting. </description> </property>
          Change the value of http.content.limit to
          -1 to accept files of any size.
        
You will probably also want to write some code to present search results in an easily usable fashion.
Apache Nutch Wiki - the official Nutch Wiki
"Apache 3.1 Solr Cookbook" by Rafal Kuc, Packt Publishing - this book provides a recipe for getting started with Nutch
Integrating Nutch - an excerpt from LucidWorks documentation
Peter Lavin is a technical writer who has been published in a number of print and online magazines. He is the author of Object Oriented PHP, published by No Starch Press and a contributor to PHP Hacks by O'Reilly Media.
Please do not reproduce this article in whole or part, in any form, without obtaining written permission.