Fixes/nutch webapp #1

Open
wants to merge 6 commits into
base: master
3 changes: 1 addition & 2 deletions .gitignore
@@ -1,2 +1 @@
.idea
atlassian-ide-plugin.xml
.DS_Store
202 changes: 0 additions & 202 deletions LICENSE

This file was deleted.

74 changes: 19 additions & 55 deletions README.md
@@ -1,79 +1,43 @@
####Apache Nutch 2.x with Cassandra on Docker
=======================
# POC for Apache Nutch 2.3 with Cassandra on Docker

This project is 3 Docker containers running Apache Nutch 2.x configured with Cassandra storage.

Due to the lack of integration information between Nutch 2.x / Cassandra, I have created this docker containers with configuration and integration between them.

This is project is fully operational but its still experimental, any feedback, suggestions or contribution will be highly appreciated!

Current Nutch version is 2.3 ( There is a branch for 2.2.1 and it has ElasticSearch integrated since 2.3 missing elastic search indexerJob ).

###Usage notes:

1. Clone the repository.
2. Build the images and start the containers " NOTE: for Mac OS running boot2docker, Please read the Notes section Below ".
Make sure you have docker installed on your system.

### Build the images
```

# Build the images ( this will build the application )
./bin/build.sh
```

# Start all containers with data folders from scripts
### Start all containers
```
./bin/start.sh
```

# stop all containers
### Stop all containers
```
./bin/stop.sh
```

# restart containers
./bin/stop.sh

```
3. Start Crawling with Nutch 2.2.
### Sample Crawling


Run the command:
```
# Run the crawler, You can use docker exec command, or you can docker attach to the container and run the commands there, or use docker-enter if you are using Mac OS

docker exec NUTCH01 /opt/nutch/bin/crawl /opt/nutch/testUrls test_crawl 3
# OR
```
or

docker-enter NUTCH01
```
docker exec -it NUTCH01 /bin/bash
root@9ec43c388769:/# cd opt/nutch
root@9ec43c388769:/opt/nutch# ./bin/crawl
Usage: crawl <seedDir> <crawlID> [<solrUrl>] <numberOfRounds>
root@9ec43c388769:/opt/nutch# ./bin/crawl testUrls test_crawl 3


```
###NOTES:

Nutch 2.x Container name : NUTCH01

Cassandra Container name : CASS01

Cassandra installed with OpsCenter


###MAC OSx notes
- you need to mount data folders to your VirtualMachine to be able to get persistent data every time you run this application.
- You might need to install docker-enter for easier access to the containers

```
mkdir ~/docker-data
mkdir ~/docker-data/cassandra
mkdir ~/docker-data/nutch

chmod -R 777 ~/docker-data/
### NOTES:

VBoxManage sharedfolder add boot2docker-vm -name home -hostpath ~/
Nutch 2.3 Container name : NUTCH01

boot2docker up
boot2docker ssh

#mkdir /data
#mount -t vboxsf -o uid=1000,gid=50 data /data
#vi /etc/fstab
#data /data vboxsf rw,nodev,relatime 0 0
#docker-enter
```
Cassandra Container name : CASS01
5 changes: 2 additions & 3 deletions bin/start.sh
@@ -2,9 +2,8 @@

B_DIR=$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )
DOCKER_DATA_FOLDER=$B_DIR/docker-data

chmod -R 777 $DOCKER_DATA_FOLDER
mkdir -p $DOCKER_DATA_FOLDER
chmod -R 777 $DOCKER_DATA_FOLDER
source "$B_DIR/nodes.sh"
# source "$B_DIR/stop.sh"

@@ -16,4 +15,4 @@ cassandraIP=$("$B_DIR"/ipof.sh $cassandraId)
echo "Starting Nutch container.."
echo "cassandraNodeName: $cassandraNodeName"
echo "nutchNodeName: $nutchNodeName"
docker run -d -p 8899:8899 -P -e CASSANDRA_NODE_NAME=$cassandraNodeName -it --link $cassandraNodeName:$cassandraNodeName --net bridge -v $DOCKER_DATA_FOLDER:/data:rw --name $nutchNodeName meabed/nutch:2.3
docker run -d -P -e CASSANDRA_NODE_NAME=$cassandraNodeName -it --link $cassandraNodeName:$cassandraNodeName --net bridge -v $DOCKER_DATA_FOLDER:/data:rw --name $nutchNodeName meabed/nutch:2.3
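
Note on the `bin/start.sh` change: the hunk appears to move `chmod -R 777` so that it runs after `mkdir -p`, which matters on a first run when `docker-data` does not exist yet. A minimal sketch of that ordering, assuming a throwaway temporary directory as a stand-in for `$B_DIR` (the temporary directory is an assumption for illustration, not part of the script):

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical stand-in for $B_DIR so the sketch does not touch a real checkout.
B_DIR=$(mktemp -d)
DOCKER_DATA_FOLDER="$B_DIR/docker-data"

# Create the folder first; -p makes this a no-op when it already exists.
mkdir -p "$DOCKER_DATA_FOLDER"

# Only after the folder exists can its permissions be changed without error.
chmod -R 777 "$DOCKER_DATA_FOLDER"

echo "data folder ready: $DOCKER_DATA_FOLDER"
```

Running `chmod` before `mkdir -p` would fail on a fresh clone because the target path is missing, which is presumably why the order was swapped.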
9 changes: 1 addition & 8 deletions nutch/Dockerfile
@@ -63,13 +63,6 @@ RUN sed -i '/field name="date" type.*/ s/.*/&\n\n <field name="rawconten
RUN rm $NUTCH_HOME/conf/nutch-site.xml
ADD config/nutch-site.xml $NUTCH_HOME/conf/nutch-site.xml

# Port that nutchserver will use
ENV NUTCHSERVER_PORT 8899

#RUN cd $NUTCH_HOME && ls -al

#RUN mkdir -p /opt/nutch/urls && cd /opt/crawl

ADD bootstrap.sh /etc/bootstrap.sh
RUN chown root:root /etc/bootstrap.sh
RUN chmod 700 /etc/bootstrap.sh
@@ -78,4 +71,4 @@ VOLUME ["/data"]

CMD ["/etc/bootstrap.sh", "-d"]

EXPOSE 8899
EXPOSE 8080 8081
7 changes: 4 additions & 3 deletions nutch/bootstrap.sh
@@ -14,9 +14,10 @@ sed -i "/^gora\.cassandrastore\.servers.*/ s/.*//" $NUTCH_HOME/conf/gora.prope
echo gora.cassandrastore.servers=$cassandra_ip:9160 >> $NUTCH_HOME/conf/gora.properties
vim -c '%s/localhost/'$cassandra_ip'/' -c 'x' $NUTCH_HOME/conf/gora-cassandra-mapping.xml

nutchserver_port=$(printenv NUTCHSERVER_PORT)

$NUTCH_HOME/bin/nutch nutchserver $nutchserver_port
echo "Starting Nutch server.."
$NUTCH_HOME/bin/nutch nutchserver > /dev/null &
echo "Starting Nutch Web UI server.."
$NUTCH_HOME/bin/nutch webapp > /dev/null &

echo "export PATH=$PATH" >> /etc/env_profile
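
Note on the `nutch/bootstrap.sh` change: both `nutchserver` and `webapp` are now started in the background with `&`, and only stdout is redirected to `/dev/null`. The sketch below shows one way to capture the PIDs of two background jobs and check that they are still alive. The `sleep` commands are hypothetical stand-ins for the two Nutch commands, and the variable names are assumptions, not part of the script:

```shell
#!/usr/bin/env bash

# Stand-in for "$NUTCH_HOME/bin/nutch nutchserver" (assumption for illustration).
sleep 30 > /dev/null 2>&1 &
server_pid=$!

# Stand-in for "$NUTCH_HOME/bin/nutch webapp" (assumption for illustration).
sleep 30 > /dev/null 2>&1 &
webapp_pid=$!

# kill -0 sends no signal; it only reports whether the process still exists.
for pid in "$server_pid" "$webapp_pid"; do
  if kill -0 "$pid" 2>/dev/null; then
    echo "process $pid is running"
  else
    echo "process $pid exited early" >&2
  fi
done

# Clean up the stand-in jobs so the sketch leaves nothing behind.
kill "$server_pid" "$webapp_pid"
wait "$server_pid" "$webapp_pid" 2>/dev/null || true
```

Adding `2>&1` after `> /dev/null`, as in the sketch, would also silence stderr; the lines in the diff redirect stdout only.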

3 changes: 2 additions & 1 deletion nutch/testUrls/seed.txt
@@ -1 +1,2 @@
http://www.google.com
https://in.yahoo.com/?p=us
http://www.shekharsingh.com