Jordi


Introduction

A few weeks ago I decided to start building an experimental home-sized "Big data" system based on Apache Spark. The first step is to create a distributed filesystem that Apache Spark will read everything from and write everything to.

HDFS is the Hadoop Distributed File System. It provides features such as fault detection and recovery, support for huge datasets, and computation close to where the data is stored. Although it is a piece of the Hadoop ecosystem, it works nicely as the distributed filesystem for Apache Spark.

HDFS will be installed on 8 of the 10 Raspberry Pi 4 boards in my cluster. To get a reasonable amount of storage and decent speed, I bought 8 SSDs and 8 USB 3 adaptors.

ssd disks with usb 3 adaptors

10 raspberry pi 4 cluster

Installing HDFS

First of all, we are going to install HDFS on a single node to check in a simple way that everything works. I already have Ubuntu 18.04 server for Raspberry Pi 4 installed, but I guess these steps won't differ much if you have Ubuntu 20.04 server.

In my case pi3cluster is the host name of the master RPI, and pi4cluster, pi5cluster, pi6cluster, pi7cluster, pi8cluster, pi9cluster and pi10cluster are the host names of the nodes. All my Raspberry Pis have an /etc/hosts file listing every name/IP pair; you must do the same if you haven't already:

192.168.1.32        pi3cluster
192.168.1.33        pi4cluster
192.168.1.34        pi5cluster
192.168.1.35        pi6cluster
192.168.1.36        pi7cluster
192.168.1.37        pi8cluster
192.168.1.38        pi9cluster
192.168.1.39        pi10cluster 
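
Before going further, it can be worth confirming that every node name really is present in /etc/hosts on each Pi. The following is only a sketch and not part of the original setup: check_hosts is a hypothetical helper name, and it does nothing more than look for each name as a field on a line that is not commented out.

```shell
# check_hosts FILE NAME... : report every NAME that has no entry in FILE.
# Hypothetical helper for illustration. Field 1 of each line is taken to be
# the IP address, so names are searched from field 2 onwards.
check_hosts() {
    hosts_file="$1"
    shift
    missing=0
    for name in "$@"; do
        if ! awk -v n="$name" '
                /^[ \t]*#/ { next }
                { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
                END { exit found ? 0 : 1 }' "$hosts_file"; then
            echo "missing from $hosts_file: $name"
            missing=1
        fi
    done
    return "$missing"
}
```

On a node it could be called as: check_hosts /etc/hosts pi3cluster pi4cluster pi5cluster pi6cluster pi7cluster pi8cluster pi9cluster pi10cluster. It prints nothing and returns 0 when every name is found.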


1. Let's prepare our 8 new SSDs. After plugging one into a USB 3 port of each Raspberry Pi 4, it becomes available on that Pi as the /dev/sda device. Let's partition and format all of them. Repeat these commands on all of your RPIs:

sudo fdisk /dev/sda

    #(select option n, new partition)
    #(select option p, primary type)
    #(select option w, write and exit)
    # like this:

partition and format ssd


2. Format the new /dev/sda1 partition and create a directory to mount it on (do it on all of your RPIs):

sudo mkfs.ext4 /dev/sda1
sudo mkdir /mnt/hdfs


3. Edit /etc/fstab, the list of filesystems Linux mounts at boot, to add the new one (do it on all of your RPIs):

sudo nano /etc/fstab


4. Once you have nano displaying it, add this line at the end of the file (do it in all of your RPIs):

      /dev/sda1 /mnt/hdfs ext4 defaults 0 0


5. Refresh your mount points so the new entry gets mounted:

sudo mount -av
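
Device names such as /dev/sda1 are assigned in detection order, so a common alternative is to key the fstab line on the partition UUID instead. This is a sketch of that variation, not what the steps above use: fstab_entry is a hypothetical helper, and the remaining fields are copied from the line in step 4.

```shell
# fstab_entry UUID MOUNTPOINT : print an /etc/fstab line that mounts the
# ext4 partition with that UUID on MOUNTPOINT. Hypothetical helper; the
# "ext4 defaults 0 0" fields are the same ones used in step 4.
fstab_entry() {
    printf 'UUID=%s %s ext4 defaults 0 0\n' "$1" "$2"
}

# On a Pi, the UUID of the new partition can be read with blkid:
#   fstab_entry "$(sudo blkid -s UUID -o value /dev/sda1)" /mnt/hdfs
```

The printed line would replace the /dev/sda1 line in /etc/fstab, not be added next to it.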


6. Install the Java and wget packages. The Hadoop Java Versions documentation tells you which Java version is recommended (do it on all of your RPIs):

sudo apt install openjdk-8-jdk-headless wget


7. Add a new system user for your hadoop hdfs (do it in all of your RPIs)

sudo adduser hadoop


8. Download the Hadoop package (3.1.2 in this example) and install it only on your master RPI:

cd ~
wget https://www-eu.apache.org/dist/hadoop/common/hadoop-3.1.2/hadoop-3.1.2.tar.gz
tar xzf hadoop-3.1.2.tar.gz
rm hadoop-3.1.2.tar.gz
sudo mkdir -p /opt
sudo mv hadoop-3.1.2 /opt/hadoop
sudo chown hadoop:hadoop -R /opt/hadoop
    


9. Create the Hadoop HDFS data folders (do it on all of your RPIs):

sudo mkdir -p /mnt/hdfs/datanode
sudo mkdir -p /mnt/hdfs/namenode
sudo chown hadoop:hadoop -R /mnt/hdfs


10. Log in as the hadoop user and create the ssh key (do it on all of your RPIs):

sudo -i
su hadoop
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys


11. From your master node, logged in as the hadoop user, authorise the master on all the other nodes by adding its ssh key to their authorised keys. Repeat this command for every node, replacing <ip> with the node's IP:

cat ~/.ssh/id_rsa.pub | ssh hadoop@<ip> 'cat >> .ssh/authorized_keys'
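
With seven nodes, repeating that command by hand gets tedious. A minimal sketch of a loop that does the same thing for a list of nodes follows; authorize_nodes is a hypothetical helper name, and it assumes the key lives in ~/.ssh/id_rsa.pub as created in step 10.

```shell
# authorize_nodes NODE... : append the master's public key to the hadoop
# user's authorized_keys on every NODE, stopping at the first failure.
# Hypothetical helper; ssh will ask for the hadoop password of each node.
authorize_nodes() {
    for node in "$@"; do
        echo "Authorising master key on $node"
        ssh "hadoop@$node" 'cat >> .ssh/authorized_keys' \
            < "$HOME/.ssh/id_rsa.pub" || return 1
    done
}
```

From the master it could be run as: authorize_nodes pi4cluster pi5cluster pi6cluster pi7cluster pi8cluster pi9cluster pi10cluster.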


12. Add all environment variables needed by hadoop in this file /home/hadoop/.bashrc (do it in all of your RPIs)

export HADOOP_HOME=/opt/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
    

after adding those lines, reload your variables

source ~/.bashrc

 

13. Add Hadoop to your PATH, edit /home/hadoop/.profile. Add this line at the end of the file (do it in all of your RPIs)

PATH=/opt/hadoop/bin:/opt/hadoop/sbin:$PATH


14. Set JAVA_HOME in the hadoop environment config file

nano $HADOOP_HOME/etc/hadoop/hadoop-env.sh

Add export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-arm64 (the JDK directory, not the path to the java binary), like:

hdfs hadoop javahome config raspberry pi


15. Let's start changing the hadoop cluster configuration files. You can find all the text to copy and paste in hdfs-config.txt. Add the configuration section to the core-site hadoop file:

nano $HADOOP_HOME/etc/hadoop/core-site.xml

hdfs hadoop javahome config raspberry pi


16. Add the configuration section to hdfs-site hadoop file

nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml

hadoop hdfs site config raspberry pi
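
As a rough idea of what a minimal single-node core-site.xml and hdfs-site.xml might contain, here is a sketch. It is not a copy of the screenshots or of hdfs-config.txt: write_hdfs_conf is a hypothetical helper, port 9000 and the replication factor of 1 are assumptions, and only the folder paths come from step 9.

```shell
# write_hdfs_conf CONF_DIR MASTER : write a minimal core-site.xml and
# hdfs-site.xml into CONF_DIR, overwriting any existing files.
# Hypothetical helper: port 9000 and dfs.replication=1 are assumptions.
write_hdfs_conf() {
    conf_dir="$1"
    master="$2"

    cat > "$conf_dir/core-site.xml" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <!-- Address clients use to reach the NameNode (port is an assumption) -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://${master}:9000</value>
  </property>
</configuration>
EOF

    cat > "$conf_dir/hdfs-site.xml" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <!-- Folders created in step 9 -->
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/mnt/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/mnt/hdfs/datanode</value>
  </property>
  <!-- One copy of each block is enough while there is a single node -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF
}
```

It could be called as: write_hdfs_conf "$HADOOP_HOME/etc/hadoop" pi3cluster. Since it overwrites both files, keep a backup of the originals first.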


17. Add the configuration section to the mapred-site hadoop file

nano $HADOOP_HOME/etc/hadoop/mapred-site.xml

hadoop hdfs mapred config raspberry pi


18. Add the configuration section to yarn-site hadoop file

nano $HADOOP_HOME/etc/hadoop/yarn-site.xml

hadoop hdfs mapred config raspberry pi
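
For the two YARN-related files, a minimal configuration might look like the sketch below. Again this is not a copy of the screenshots: write_yarn_conf is a hypothetical helper, and the three properties are simply the ones a basic MapReduce-on-YARN setup usually needs.

```shell
# write_yarn_conf CONF_DIR MASTER : write a minimal mapred-site.xml and
# yarn-site.xml into CONF_DIR, overwriting any existing files.
# Hypothetical helper; the values are common choices, not the author's.
write_yarn_conf() {
    conf_dir="$1"
    master="$2"

    cat > "$conf_dir/mapred-site.xml" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <!-- Run MapReduce jobs on YARN instead of the local runner -->
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
EOF

    cat > "$conf_dir/yarn-site.xml" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <!-- Host that runs the ResourceManager -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>${master}</value>
  </property>
  <!-- Shuffle service that MapReduce jobs need on each NodeManager -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
EOF
}
```

It could be called as: write_yarn_conf "$HADOOP_HOME/etc/hadoop" pi3cluster.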


19. After everything is configured and the SSD is partitioned and mounted, we can format HDFS (only on the master RPI):

/opt/hadoop/bin/hdfs namenode -format -force


20. Let's start HDFS filesystem and do some operations to check everything works fine

cd $HADOOP_HOME/sbin/
./start-dfs.sh
./start-yarn.sh
hadoop fs -mkdir /test_folder
hadoop fs -ls /
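
Creating and listing a folder only exercises the NameNode. To check that data can actually be written and read back, a small round-trip test along these lines may help. It is a sketch: hdfs_smoke_test is a hypothetical helper and smoke.txt is an arbitrary file name.

```shell
# hdfs_smoke_test [HDFS_DIR] : upload a small file, read it back, compare
# the content, and clean up. Returns 0 only if the round trip succeeds.
# Hypothetical helper; HDFS_DIR defaults to the /test_folder created above.
hdfs_smoke_test() {
    hdfs_dir="${1:-/test_folder}"
    local_file="$(mktemp)"
    echo "hdfs smoke test $$" > "$local_file"

    status=1
    if hadoop fs -put -f "$local_file" "$hdfs_dir/smoke.txt"; then
        # The test passes only if the content comes back unchanged.
        if [ "$(hadoop fs -cat "$hdfs_dir/smoke.txt")" = "$(cat "$local_file")" ]; then
            status=0
        fi
        hadoop fs -rm "$hdfs_dir/smoke.txt" > /dev/null
    fi

    rm -f "$local_file"
    return "$status"
}
```

Running hdfs_smoke_test && echo "HDFS round trip OK" as the hadoop user prints the message only when the file came back intact.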


21. Check you can see all Hadoop info websites: (A) services web, (B) cluster info web, (C) hadoop node web, and (D) check the "test_folder" you created is there too

http://<your server ip>:9870/
http://<your server ip>:8042/
http://<your server ip>:9864/
http://<your server ip>:9870/explorer.html

(A)

hadoop-web.PNG

(B)

hadoop-web2.PNG

(C)
hadoop-web3.PNG

(D)

hadoop-web4.PNG

If you want to restart the server later, use these commands:

cd $HADOOP_HOME/sbin/
stop-all.sh
start-all.sh

 

From single node to HDFS cluster

After checking that HDFS works fine as a single node, it's time to extend it to all our capable nodes and create an HDFS cluster.

22. Let's change the hadoop cluster configuration files again so they look like the ones shown in the pictures. You can find all the text to copy and paste in hdfs-cluster-config.txt. Remember to replace my RPI host names, like pi3cluster, with your own.

nano $HADOOP_HOME/etc/hadoop/core-site.xml

1-core-site.xml-clister.PNG

nano $HADOOP_HOME/etc/hadoop/hdfs-site.xml

2-hdfs-site.xml-cluster.PNG

nano $HADOOP_HOME/etc/hadoop/mapred-site.xml

3-mapred-site.xml-cluster.PNG

nano $HADOOP_HOME/etc/hadoop/yarn-site.xml

4-yarn-site.xml-cluster.PNG


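23. Make sure all services are stopped in your cluster, then delete the HDFS data left over from the single-node test on every node. clustercmd is not defined in this guide; it stands for any helper that runs the given command on all nodes over ssh:

cd $HADOOP_HOME/sbin/
stop-all.sh
clustercmd rm -rf /mnt/hdfs/datanode/*
clustercmd rm -rf /mnt/hdfs/namenode/*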


24. Logged in as hadoop user, create master and workers files inside hadoop configuration.

nano /opt/hadoop/etc/hadoop/master
nano /opt/hadoop/etc/hadoop/workers

In master file, add only one line, your master RPI host name, in my case it looks like:

pi3cluster

In workers file, add as many lines as you need listing all the host names of your RPI nodes, in my case it looks like:

pi4cluster
pi5cluster
pi6cluster
pi7cluster
pi8cluster
pi9cluster
pi10cluster
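
If you prefer not to type the two files by hand, they can be generated from a list of host names. The helper below is a sketch with a hypothetical name, write_cluster_files; it writes exactly the one-name-per-line layout described above.

```shell
# write_cluster_files CONF_DIR MASTER WORKER... : write the "master" file
# with the master host name and the "workers" file with one worker host
# name per line. Hypothetical helper; existing files are overwritten.
write_cluster_files() {
    if [ "$#" -lt 3 ]; then
        echo "usage: write_cluster_files CONF_DIR MASTER WORKER..." >&2
        return 1
    fi
    conf_dir="$1"
    master="$2"
    shift 2
    printf '%s\n' "$master" > "$conf_dir/master"
    # printf repeats its format once per remaining argument.
    printf '%s\n' "$@" > "$conf_dir/workers"
}
```

For the cluster in this article the call would be: write_cluster_files /opt/hadoop/etc/hadoop pi3cluster pi4cluster pi5cluster pi6cluster pi7cluster pi8cluster pi9cluster pi10cluster.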


25. In your master node, and ONLY there, format your name node. IMPORTANT: backup your hdfs files folder if you have important files there from before, they will be deleted.

hdfs namenode -format -force


26. From your single master node, tar your hadoop home to a shared network folder or a pen drive

sudo tar -czvf /mnt/shared-network-folder/hadoop.tar.gz /opt/hadoop


27. On all your new Raspberry Pi cluster nodes, extract the tar file and force the ownership to the hadoop user. tar stored the files under the relative path opt/hadoop, so extract from / to end up with /opt/hadoop:

sudo tar -xzvf /mnt/shared-network-folder/hadoop.tar.gz -C /
sudo chown hadoop:hadoop -R /opt/hadoop


28. From your master node, logged in as hadoop user start the HDFS system

start-dfs.sh && start-yarn.sh

 

29. You should be able to see your multinode hadoop HDFS distributed filesystem, browse this URL: http://<your-master-ip-or-node-name>:9870/dfshealth.html#tab-datanode

hadoop-hdfs-multi-node-ubuntu-raspberry-pi.PNG


Introduction

Recently, I've been thinking about upgrading my Kubernetes (K8s) Raspberry Pi cluster to something more resilient than a single-master K8s cluster, which can fail easily and ruin months of work.
I have installed on-premises single-master K8s clusters based on the Red Hat ecosystem (CentOS/Fedora) several times. Doing this manually is far from easy if it's your first time, and manually installing something closer to an on-premises HA K8s cluster becomes quite an uncomfortable task.
So the challenge was to find a simple way to install a multi-master K8s cluster on my new project of 10 Raspberry Pi 4 boards (4 GB RAM).
This article shares my experience doing this task as a quick and easy guide for anybody who wants the same results without losing hours on it. The hardware environment consists of a simple network like this one:

raspberry-pi-4-cluster-multi-master.jpg

OS/Software stack

The keys to reaching this goal were:
- Kubespray K8s deployment tool.
- Ubuntu 18.04 for Arm64
Kubespray is a very good tool to ease the pain of deploying highly available K8s clusters. It uses Ansible playbooks and works with several major Linux distributions, hardware platforms and cloud providers.
During pre-installation you can configure your desired cluster structure: the number of masters and of replicated etcd databases.
I tried several Linux distributions and almost all of them failed, because it's difficult to find a distro that suits both Kubespray's requirements and the new Raspberry Pi 4 (4 GB RAM) hardware.
At the moment Raspbian is compiled for Arm32/armhf, and Kubespray will fail installing packages like calico; etcd has only experimental support for arm32, so it's better to use arm64. On the other hand, Raspbian Arm64 is experimental and not officially supported by Kubespray, so it's better not to use either of these distros yet and avoid unexpected issues.

The current latest Ubuntu, 19.10 (Eoan Ermine), has a preinstalled server image for the new Raspberry Pi 4 compiled for arm64, but it's not officially supported by Kubespray and the installation can fail. Maybe this distro will be a good candidate in the coming months.

I wanted a stable long-term-support distro to avoid collateral issues, so I thought about Ubuntu 18.04 LTS, which will be supported for 5 years, until April 2023. However, the current official Ubuntu 18.04 LTS doesn't have a preinstalled server image for the new Raspberry Pi 4; maybe they'll release one later, but for now only the Raspberry Pi 2/3 are supported.
Until Ubuntu releases 18.04 LTS for the Raspberry Pi 4 (Arm64), I found an unofficial preinstalled server image for the Raspberry Pi 4. It works like a charm thanks to James Chambers' work.
With this unofficial distro you'll still be able to update Ubuntu packages from the official repo. The only thing you have to take care of is the kernel packages; how to hold them is explained below.

Setting up Raspberry Pi 4 nodes

First let's install Ubuntu 18.04 LTS (Arm64) on all your Raspberry Pi 4 boards. In my case I installed it on ten of them. For a multi-master installation the recommended minimum is four Raspberry Pis (as shown in the picture above): two masters and two workers, with three etcd instances (two on master nodes and one on a worker node).
I'm going to assume that your network range is 192.168.0.0/24, your Raspberry Pi IPs are 192.168.0.10-13, and your gateway is 192.168.0.1. Change them to your own values in the scripts below.
1. Flash ubuntu 18.04 from https://github.com/TheRemote/Ubuntu-Server-raspi4-unofficial/releases/ using balena Etcher or any other similar "flash to sd" app to all your pi nodes.
2. Boot your raspberry pi node with ubuntu 18.04 for the first time and change the default  ubuntu/ubuntu user admin password.
3. If you are in the United Kingdom, change your keyboard layout to gb:
sudo vi /etc/default/keyboard
change XKBLAYOUT line to: XKBLAYOUT="gb"
4. Update all packages
sudo apt-get update -y && sudo apt-get --with-new-pkgs upgrade -y
5. Change your raspberry pi ip to static editing /etc/netplan/50-cloud-init.yaml , it should look like this one (changing ip address and gateway to your specific ones):
# This file is generated from information provided by
# the datasource.  Changes to it will not persist across an instance.
# To disable cloud-init's network configuration capabilities, write a file
# /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg with the following:
# network: {config: disabled}
network:
    ethernets:
        eth0:
            dhcp4: false
            match:
                macaddress: dd:a6:3a:1e:50:b9
            set-name: eth0
            addresses:
                - 192.168.0.10/24
            gateway4: 192.168.0.1
            nameservers:
                addresses: [1.1.1.1, 1.0.0.1]
    version: 2                                       
6. Hold the kernel packages so they are not upgraded from the official Ubuntu distribution:
sudo apt-mark hold flash-kernel linux-raspi2 linux-image-raspi2 linux-headers-raspi2 linux-firmware-raspi2 linux-firmware
 
7. Reboot and check all changes work ok.

8. Repeat steps 1 to 7 for all raspberry pis.
 

Setting up a controller node

The first thing needed is a controller machine from which you're going to install K8s on your Raspberry Pi cluster. I used my Windows 10 machine with the Ubuntu 18.04 subsystem installed.
Once you have a controller node ready you'll need to prepare it and install Kubespray on it. Open a linux terminal and follow the next steps.
1. Generate a ssh key in your controller:
ssh-keygen -t rsa
It will be saved in /home/your-username/.ssh/id_rsa
2. Copy your ssh key to all raspberry pi 4 nodes, repeat this command for all raspberry pi ips:
ssh-copy-id 192.168.0.10
You'll get:
Are you sure you want to continue connecting (yes/no)? yes
(answer your password)
3. Edit /etc/sudoers in all pi nodes and add a line after the %sudo one to avoid password questions when using ssh to pi nodes from the controller node, replace "your-controller-username" by your real controller user name:
# Allow members of group sudo to execute any command
%sudo   ALL=(ALL:ALL) ALL
your-controller-username ALL=(ALL:ALL) NOPASSWD:ALL
4. Install python3, pip and git in your controller node:
sudo apt install python3-pip git
sudo pip3 install --upgrade pip
pip --version

5. Add a function in your .bashrc script to be able to run commands to all raspberry pis at the same time:

cd
nano .bashrc

Add this function at the end of the .bashrc file:

function picmd {
   echo "pi1"
   ssh 192.168.0.10 "$@"
   echo "pi2"
   ssh 192.168.0.11 "$@"
   echo "pi3"
   ssh 192.168.0.12 "$@"
   echo "pi4"
   ssh 192.168.0.13 "$@"
}

Apply .bashrc changes

source .bashrc

Try it from your controller node writing: picmd date, you should see all pi node responses, something like this:

ubuntu@CONTROLLER-NODE:~$ picmd date
pi1
Sat Jan 25 14:50:31 UTC 2020
pi2
Sat Jan 25 14:50:33 UTC 2020
pi3
Sat Jan 25 14:50:35 UTC 2020
pi4

You can now use picmd command to execute commands (like apt update) in all your raspberry pi nodes without logging in all of them one by one.
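
With more than a handful of nodes, listing each ssh call by hand gets long. A loop-based variant of the same idea could look like the sketch below. PI_NODES is a made-up variable name, and the addresses are the example ones assumed in this article.

```shell
# Space-separated list of node addresses. PI_NODES is a hypothetical
# variable name; edit the list to match your own cluster.
PI_NODES="192.168.0.10 192.168.0.11 192.168.0.12 192.168.0.13"

# picmd COMMAND... : run COMMAND on every address in PI_NODES, one after
# another, printing a header line before each node's output.
picmd() {
    for ip in $PI_NODES; do
        echo "== $ip"
        # A failure on one node does not stop the loop.
        ssh "$ip" "$@"
    done
}
```

Usage is the same as before, for example picmd date. Adding a node only means adding its address to PI_NODES.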

 

Installing Kubespray and the Kubernetes cluster

1. In your controller node home directory clone kubespray repo:
git clone https://github.com/kubernetes-sigs/kubespray.git

2. Enter the kubespray directory and check which Kubespray version you are going to install:

cd kubespray/ && git describe --tags 
It's important to know which Kubespray version you are installing, because Kubespray doesn't support upgrading from an older release straight to the latest one; you have to upgrade release by release without skipping any.
4. Install kubespray requirements:
sudo pip install -r requirements.txt
5. Create a new inventory from the sample folder:
cp -rfp inventory/sample inventory/mycluster  
6. Declare an IPS variable with all your Raspberry Pi cluster IPs, for instance:
declare -a IPS=(192.168.0.10 192.168.0.11 192.168.0.12 192.168.0.13)
7. Generate the Kubespray configuration based on your node IPs:
CONFIG_FILE=inventory/mycluster/hosts.yml python3 contrib/inventory_builder/inventory.py ${IPS[@]}
8. Review the generated hosts.yml file:
nano inventory/mycluster/hosts.yml
In the 'hosts:' section of this file, be aware that node1, node2, etc. will become the host names of your Raspberry Pis, matched to the IPs you provided: Kubespray will rename them. If you prefer different host names for your Raspberry Pis, change them in this file. Initially, the hosts.yml file should look like:
all:
  hosts:
    node1:
      ansible_host: 192.168.0.10
      ip: 192.168.0.10
      access_ip: 192.168.0.10
    node2:
      ansible_host: 192.168.0.11
      ip: 192.168.0.11
      access_ip: 192.168.0.11
    node3:
      ansible_host: 192.168.0.12
      ip: 192.168.0.12
      access_ip: 192.168.0.12
    node4:
      ansible_host: 192.168.0.13
      ip: 192.168.0.13
      access_ip: 192.168.0.13
  children:
    kube-master:
      hosts:
        node1:
        node2:
    kube-node:
      hosts:
        node1:
        node2:
        node3:
        node4:
    etcd:
      hosts:
        node1:
        node2:
        node3:
    k8s-cluster:
      children:
        kube-master:
        kube-node:
    calico-rr:
      hosts: {}
9. Optionally you can review other generated configuration files:
nano inventory/mycluster/group_vars/all/all.yml
nano inventory/mycluster/group_vars/k8s-cluster/k8s-cluster.yml
10. Install your multi master Kubernetes cluster using kubespray:
ansible-playbook -i inventory/mycluster/hosts.yml --become --become-user=root cluster.yml
The Kubespray installation can take anywhere from several minutes to more than half an hour, depending on your hardware and the configuration you set. Relax and enjoy a cup of tea.
 

Manage your new cluster with Kubectl

Let's install kubectl in your controller node to start playing with your new cluster. Download the latest kubectl version:
curl -LO https://storage.googleapis.com/kubernetes-release/release/$(curl -s https://storage.googleapis.com/kubernetes-release/release/stable.txt)/bin/linux/amd64/kubectl
Make it an executable and move it to your bin directory:
sudo chmod +x kubectl
sudo mv kubectl /usr/local/bin/

Copy the admin.conf file from one of the master nodes to provide the kubectl configuration. In the following bash commands, replace the IP 192.168.0.10 with the IP of one of your master nodes. The remote commands are quoted so that ~ expands to your home directory on the master node:

cd
ssh 192.168.0.10 'sudo cp /etc/kubernetes/admin.conf ~/config'
ssh 192.168.0.10 'sudo chmod +r ~/config'
scp 192.168.0.10:~/config .
mkdir -p .kube
mv config .kube/
ssh 192.168.0.10 'sudo rm ~/config'

Now you should be able to list all your nodes and kubernetes pods:

kubectl get nodes -o wide 
kubectl get pods -o wide --all-namespaces
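
Right after the installation some nodes may still be starting up. A small helper that counts the nodes that are not yet ready can be handy for polling. This is a sketch: not_ready_nodes is a hypothetical name, and it assumes STATUS is the second column of the kubectl output.

```shell
# not_ready_nodes : print how many nodes report a STATUS other than exactly
# "Ready". Hypothetical helper; it assumes STATUS is the second column of
# "kubectl get nodes --no-headers". A cordoned node, whose status reads
# "Ready,SchedulingDisabled", is counted as not ready.
not_ready_nodes() {
    kubectl get nodes --no-headers | awk '$2 != "Ready" { n++ } END { print n + 0 }'
}
```

It prints 0 when every node is ready, so something like [ "$(not_ready_nodes)" -eq 0 ] && echo "cluster ready" works as a quick check.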

DONE!

 

Classic error, number of masters and etcds databases

If you are getting an error similar to this one:
[node3 -> 192.168.0.12]: FAILED! => 
{"changed": true, 
"cmd": ["bash", "-x", 
"/usr/local/bin/etcd-scripts/make-ssl-etcd.sh", 
"-f", "/etc/ssl/etcd/openssl.conf", "-d", "/etc/ssl/etcd/ssl"], "delta": "0:00:00.007623", 
"end": "2019-12-21 13:10:13.140942", "msg": "non-zero return code", "rc": 127, "start": "2019-12-21 13:10:13.133319", 
"stderr": "bash: /usr/local/bin/etcd-scripts/make-ssl-etcd.sh: No such file or directory", 
"stderr_lines": ["bash: /usr/local/bin/etcd-scripts/make-ssl-etcd.sh: No such file or directory"], 
"stdout": "", 
"stdout_lines": []}
It's because the number of masters and etcd database nodes in your hosts.yml configuration is wrong. Adjust them and launch the Kubespray installer again.
 

Skip Kubespray packages upgrades

On all your Raspberry Pi nodes, hold the packages installed by Kubespray so they are not upgraded. When you upgrade your Kubespray Kubernetes cluster to a newer version, Kubespray will take care of updating them too. Run this command on all your Raspberry Pi nodes:
sudo apt-mark hold apt-transport-https aufs-tools cgroupfs-mount containerd.io docker-ce docker-ce-cli ipset ipvsadm libipset3 libltdl7 libpython-stdlib libpython2-stdlib libpython2.7-minimal libpython2.7-stdlib pigz python python-apt python-minimal python2 python2-minimal python2.7 python2.7-minimal socat