Apache Nutch is a web crawler that is used in conjunction with Solr to index web pages. If you have the Solr server installed prior to installing Nutch, you can immediately pass the Nutch results to Solr.
Use your package manager to install Solr or follow the instructions given in Installing Solr on Red Hat-type Systems. Install Solr with the sample application file and don't worry about configuring Solr. Nutch comes with a Solr schema.xml file that works out of the box. Some changes may need to be made to the solrconfig.xml file, but this is dealt with in Section 3, "Configuring Solr".
In this article Jetty is used as the servlet container for Solr, but Tomcat will work equally well. This article describes using Nutch in the following configuration:
Apache Solr 3.6.1
CentOS release 6.3 (Final)
Nutch 1.5.1
In this document Nutch and Solr are running on the same machine.
Download the Nutch tar file and decompress it to a location of your choice, for example, /opt/apache-nutch-1.5.1. In this case the Nutch home directory is the /opt/apache-nutch-1.5.1 directory. You'll find the following files and directories below this directory:

bin  CHANGES.txt  conf  docs  lib  LICENSE.txt  logs  NOTICE.txt  plugins  README.txt
The nutch command found in the bin directory is typically invoked from the Nutch home directory. If you wish you can set an environment variable, NUTCH_HOME, but this is not a requirement.
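If you do choose to set it, a minimal sketch, assuming the install location used above, is:

shell> export NUTCH_HOME=/opt/apache-nutch-1.5.1
shell> cd $NUTCH_HOME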
Navigate to the Nutch home directory and create a directory named urls containing a file named seed.txt. Add a single line to this file to identify the domain that you wish to index, http://objectorientedphp.com/, for example.
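For instance, assuming the install location used above, you could create the directory and file as follows:

shell> cd /opt/apache-nutch-1.5.1
shell> mkdir urls
shell> echo 'http://objectorientedphp.com/' > urls/seed.txt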
In the regex-urlfilter.txt file, replace the last line, +., with a regular expression identifying the domain that you wish to crawl, for example: +^http://([a-z0-9]*\.)*objectorientedphp.com/.
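After the edit, the tail of the conf/regex-urlfilter.txt file should look something like this (the preceding comment line is shown for context and may differ slightly in your copy):

# accept anything else
+^http://([a-z0-9]*\.)*objectorientedphp.com/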
You must also set the value of the http.agent.name property in the nutch-site.xml file before you run Nutch. If you like, you can overwrite the nutch-site.xml file with the contents of the nutch-default.xml file and then set http.agent.name; that's what nutch-default.xml is there for.
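A minimal nutch-site.xml might look like the following sketch; the agent name shown is only a placeholder:

<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <!-- placeholder value; identify your own crawler here -->
    <value>MyTestCrawler</value>
  </property>
</configuration>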
You must also set the JAVA_HOME environment variable if it is not already set. If which java returns /usr/bin/java on a Red Hat-type system, you would add the line export JAVA_HOME=/usr to the .bash_profile file in your home directory.
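For example:

shell> which java
/usr/bin/java
shell> echo 'export JAVA_HOME=/usr' >> ~/.bash_profile
shell> source ~/.bash_profile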
To configure the Solr server to work with Nutch, copy the schema.xml file found in the Nutch home conf directory to the solr/conf directory below the Solr home directory.
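For example, if Solr was installed with the example application and its home directory is /opt/solr/example (an assumed location; substitute your own Solr home), the copy might be:

shell> cp /opt/apache-nutch-1.5.1/conf/schema.xml /opt/solr/example/solr/conf/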
If the schema name version number is set to 1.5.1 on the line <schema name="nutch" version="1.5.1">, change this to 1.5. The Solr server will not start up until this change is made.
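After the edit, the opening schema element should read:

<schema name="nutch" version="1.5">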
When initiating queries you may also need to make changes to the solrconfig.xml file. This is discussed in Section 5, "Checking Your Index".
It is good practice to test Nutch by first crawling at a reduced depth. You can do this by navigating to the Nutch home directory and issuing the following command:

shell> bin/nutch crawl urls -dir crawl -solr http://localhost:8983/solr/ \
 -depth 3 -topN 20

This command will crawl the domain defined in the urls/seed.txt file and at the same time create a searchable Solr index.
When Nutch has finished you should see output such as the following:
...
LinkDb: internal links will be ignored.
LinkDb: adding segment: file:/opt/apache-nutch-1.5.1/crawl-20121008104351/segments/20121008105515
LinkDb: adding segment: file:/opt/apache-nutch-1.5.1/crawl-20121008104351/segments/20121008104606
LinkDb: adding segment: file:/opt/apache-nutch-1.5.1/crawl-20121008104351/segments/20121008104454
LinkDb: adding segment: file:/opt/apache-nutch-1.5.1/crawl-20121008104351/segments/20121008104414
LinkDb: adding segment: file:/opt/apache-nutch-1.5.1/crawl-20121008104351/segments/20121008105427
LinkDb: finished at 2012-10-08 10:56:06, elapsed: 00:00:16
SolrIndexer: starting at 2012-10-08 10:56:06
Indexing 114 documents
SolrIndexer: finished at 2012-10-08 10:57:17, elapsed: 00:01:11
SolrDeleteDuplicates: starting at 2012-10-08 10:57:17
SolrDeleteDuplicates: Solr url: http://localhost:8983/solr/
SolrDeleteDuplicates: deleting 1 duplicates
SolrDeleteDuplicates: finished at 2012-10-08 10:57:23, elapsed: 00:00:06
crawl finished: crawl-20121008104351
The next section verifies that web pages have been indexed.
You can check the files that you have indexed by pointing your browser at http://solr_server:8983/solr/admin/. You should see the Solr admin interface.
If you are using the default solrconfig.xml and you initiate a search such as *:*, you may see the following error:

problem: Problem accessing /solr/select/. Reason: undefined field text
To remedy this, search the solr/conf/solrconfig.xml file found below the Solr home directory for references to a field named text and replace these references with content. For example, the select request handler identifies the default field as text:

<requestHandler name="/select" class="solr.SearchHandler">
  <!-- default values for query parameters can be specified, these
       will be overridden by parameters in the request -->
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="df">text</str>
    ...
Change text to content. After making this configuration change you will have to restart Jetty. On Red Hat-type systems issue the command service jetty restart.
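You can then confirm the fix with a simple query from the command line; the field name content is the one set above, and the search term php is only illustrative:

shell> curl 'http://localhost:8983/solr/select?q=content:php&rows=5'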
You may find that you want to remove the Solr index and start again from scratch, especially if you have performed a test crawl. From the command line of the machine hosting the Solr server, use the following commands to remove an existing Solr index:
shell> curl http://localhost:8983/solr/update -H "Content-Type: text/xml" \
 --data-binary '<delete><query>*:*</query></delete>'
shell> curl http://localhost:8983/solr/update -H "Content-Type: text/xml" \
 --data-binary '<commit/>'
You should also remove all the Nutch files found in the crawl directory below the Nutch home directory.
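Assuming the test crawl was run with -dir crawl as above, something like the following will remove them (adjust the path if your crawl directory is elsewhere, and include any timestamped crawl-* directories left over from earlier runs):

shell> rm -rf /opt/apache-nutch-1.5.1/crawl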
Once all configuration changes have been made, create the Solr index by navigating to the Nutch home directory and issuing the following command:
shell> bin/nutch crawl urls -dir crawl -solr http://localhost:8983/solr/ \ -depth 20 -topN 200
The seed.txt file in the urls directory tells Nutch which domain to crawl, and the resulting files are stored in the crawl directory. These are used to build the Solr search index.
A complete description of the nutch crawl command is given in the Nutch documentation and reproduced below:
This class performs a complete crawl given a set of root urls.
Usage: bin/nutch crawl <urlDir> [-solr <solrURL>] [-dir d] [-threads n] [-depth i] [-topN N]
  <urlDir>: Contains text files with URL lists. This must be an existing
    directory. Example would be ${NUTCH_HOME}/urls
  [-solr <solrURL>]: Enables us to pass our Solr instance as an indexing
    parameter to simplify the process of indexing with Solr.
  [-dir d]: This parameter enables you to choose the directory Nutch should
    use when crawling.
  [-threads n]: This parameter enables you to choose how many threads Nutch
    should use when crawling.
  [-depth i]: You can tell Nutch how deep it should crawl. If you don't
    specify a value, Nutch uses 5 as the default. For example, if you pass
    -depth 1, Nutch will only index the first level. If you say -depth 2
    (or more), Nutch will follow outlinks to that depth.
  [-topN N]: The maximum number of outlinks Nutch will obtain from any one page.
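For example, a full crawl using ten fetcher threads (the thread count here is purely illustrative) could be invoked as:

shell> bin/nutch crawl urls -dir crawl -solr http://localhost:8983/solr/ \
 -threads 10 -depth 20 -topN 200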
You can easily enhance your configuration by making changes to files such as the regex-urlfilter.txt file. For example, if there are file types that you do not wish to index, add them to this filter, as shown below.
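The stock filter skips a number of binary suffixes with a single pattern; to exclude a further type, ZIP archives for instance, you might extend that pattern along these lines (the exact default suffix list varies between Nutch versions):

# skip URLs with slow or undesirable file suffixes
-\.(gif|GIF|jpg|JPG|png|PNG|css|CSS|zip|ZIP)$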
If parser.skip.truncated in the nutch-site.xml file is set to true and you are using the default value for http.content.limit, then no files larger than 65536 bytes will be indexed. With this setting most PDF files will be ignored. The default setting is shown below:
<property>
  <name>http.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content using the http://
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>
Change the value of http.content.limit to -1 to accept files of any size.
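A sketch of the corresponding override, placed in the nutch-site.xml file:

<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>Accept downloaded content of any length.</description>
</property>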
You will probably also want to write some code to present search results in an easily usable fashion.
Apache Nutch Wiki - the official Nutch Wiki
"Apache 3.1 Solr Cookbook" by Rafal Kuc, Packt Publishing - this book provides a recipe for getting started with Nutch
Integrating Nutch - an excerpt from LucidWorks documentation
Peter Lavin is a technical writer who has been published in a number of print and online magazines. He is the author of Object-Oriented PHP, published by No Starch Press, and a contributor to PHP Hacks by O'Reilly Media.
Please do not reproduce this article in whole or part, in any form, without obtaining written permission.