- Hosts:
- How to
Typical Galaxy location: /srv/executor. In older scripts or deployments, it may be /srv/galaxy.
Logs: cat /var/log/syslog|grep galaxy.
Start, stop, restart or check the status of the service with systemd:
$ sudo service galaxy <start|stop|restart|status>
or start manually:
$ sudo /srv/executor/run.sh
If something doesn't work, stop the service and start it manually with the run.sh script, which will check whether any code has been modified or database migrations need to be applied. It may take a while to load, so wait until either the service is up or an error is produced.
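For example, a minimal sequence for a manual restart with log watching, assuming the default /srv/executor location, could look like this:
```bash
# Stop the systemd-managed service first, then run Galaxy in the foreground
sudo service galaxy stop
sudo /srv/executor/run.sh

# In another terminal, follow the syslog entries for galaxy
tail -f /var/log/syslog | grep galaxy
```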
To connect to the UI, open a browser at https://<host fqdn>.
Logs: cat /var/log/syslog|grep nfs.
Start, stop, restart or check the status of the service with systemd:
$ sudo service nfs-server <start|stop|restart|status>
The NFS server shares two folders:
- /srv/executor/database with the cluster nodes
- /srv/executor/tools with the editor
To check what directories you share with whom:
$ sudo showmount -e localhost
The exports should be configured in /etc/exports, e.g., to share /srv/executor/database with 123.45.67.89:
/srv/executor/database 123.45.67.89(rw,sync,no_subtree_check,no_root_squash)
If you make changes to this file, restart nfs-server afterwards.
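A minimal sketch of adding an export and applying it without a full restart (the editor IP below is a placeholder):
```bash
# Append an export for the editor host and re-read /etc/exports
echo "/srv/executor/tools 123.45.67.90(rw,sync,no_subtree_check,no_root_squash)" | sudo tee -a /etc/exports
sudo exportfs -ra              # apply the changes without restarting nfs-server
sudo showmount -e localhost    # verify the new export is listed
```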
Configuration: /etc/apache2/sites-enabled/*
Modules enabled: rewrite, proxy, proxy_http, ssl.
Logs: /var/log/apache2/
Start, stop, restart or check the status of the service with systemd:
$ sudo service apache2 <start|stop|restart|status>
To handle apache modules:
$ sudo a2query -m <module>
$ sudo a2enmod <module>
$ sudo a2dismod <module>
Always restart or reload apache2 afterwards.
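For example, to make sure the proxy_http module (from the list above) is enabled and active:
```bash
# Check whether the module is enabled; enable it if not, then reload Apache
sudo a2query -m proxy_http || sudo a2enmod proxy_http
sudo service apache2 reload
```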
On this host, Apache2 acts as a reverse proxy, redirecting everything to localhost:8080.
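A quick sanity check of the proxying, run on the host itself (assumes curl is installed; -k skips certificate verification for self-signed certificates):
```bash
curl -kI https://localhost/        # through Apache (should return an HTTP status line)
curl -I  http://localhost:8080/    # directly against Galaxy on port 8080
```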
Typical Galaxy location: /srv/editor. In older scripts or deployments, it may be located at /srv/galaxy.
Start, stop, restart or check the status of the service with systemd:
$ sudo service galaxy <start|stop|restart|status>
or start manually:
$ sudo /srv/editor/run.sh
If something doesn't work, stop the service and start it manually with the run.sh script. This script checks whether any code has been modified or database migrations need to be applied. If that's the case, it may take a while to load, and you need to wait until either the service is up or an error is produced.
To connect to UI: https://<host fqdn>/galaxy
Note: you should be denied access to the system. If you need UI access, e.g. for debugging, go to /srv/editor/config/galaxy.ini and turn "use_remote_user" to "False". Don't forget to turn it back to "True" when you are finished.
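A hypothetical one-liner for the toggle, assuming the galaxy.ini path above and that the option is written as `use_remote_user = True` (adjust the path to wherever your galaxy.ini lives); restart Galaxy so the change takes effect:
```bash
# Temporarily disable remote-user authentication for debugging
sudo sed -i 's/^use_remote_user *= *True/use_remote_user = False/' /srv/editor/config/galaxy.ini
sudo service galaxy restart
# ... debug ...
# Turn it back on when you are done
sudo sed -i 's/^use_remote_user *= *False/use_remote_user = True/' /srv/editor/config/galaxy.ini
sudo service galaxy restart
```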
Logs: cat /var/log/syslog|grep nfs.
Make sure the `/srv/editor/tools` directory is mounted as an NFS share.
One way to check this:
```bash
$ df -h
Filesystem Size Used Avail Use% Mounted on
[...]
<executor IP>:/srv/executor/tools 59G 7.5G 49G 14% /srv/editor/tools
```
If this doesn't work, make sure nfs-client is installed and /etc/fstab
contains a line like this:
<executor IP>:/srv/executor/tools /srv/editor/tools nfs defaults 0 0
Check that all elements (executor IP, NFS server directory, NFS client directory) exist and are correct and mount again:
$ mount /srv/editor/tools
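If the mount still fails, a few quick checks (assuming a Debian-style host; replace <executor IP> as usual):
```bash
dpkg -l | grep nfs-common          # is the NFS client package installed?
showmount -e <executor IP>         # what does the executor actually export?
sudo mount -v /srv/editor/tools    # verbose mount using the /etc/fstab entry
```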
Same as Apache2 for Executor.
- Zookeeper for service discovery
- Chronos for scheduling
- Mesos master for cluster management
- Troubleshooting
Logs: cat /var/log/syslog|grep zookeeper
Listening to port: 2181
Start, stop, restart or check the status of the service with systemd:
$ sudo service zookeeper <start|stop|restart|status>
How does Mesos know where to find zookeeper:
$ cat /etc/mesos/zk
zk://localhost:2181/mesos
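A quick liveness check, assuming ZooKeeper's "four letter word" commands are enabled (they are by default on older versions; newer versions may whitelist them):
```bash
# Should print "imok" if ZooKeeper is healthy
echo ruok | nc localhost 2181
```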
Logs: cat /var/log/syslog|grep chronos
Configuration: /etc/chronos
Listening to port: 8080
Start, stop, restart or check the status of the service with systemd:
$ sudo service chronos <start|stop|restart|status>
In order to connect to Chronos, Galaxy Executor needs credentials. The credentials are stored in /etc/chronos/conf/http_credentials.
In order for Chronos to connect to Mesos, it needs some other credentials: a principal and a secret. The principal is stored in /etc/chronos/mesos_authentication_principal. The secret is stored in a file indicated by /etc/chronos/mesos_authentication_secret_file (usually /etc/chronos/store/mesos_authentication_secret). The principal and secret must match *exactly one* of the credentials in /etc/mesos-master/store/credentials.json.
Make sure all mesos files have NO new line at the end.
If you change any credentials, restart Chronos.
To check if Chronos is running, open http://<cluster manager IP or fqdn>:8080 in a browser. You can use the http_credentials to log in. In order for the cluster to function properly, it is not enough for Chronos to be up; it should also be connected to the Mesos master (see next paragraph).
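You can also poke the Chronos API from the cluster manager itself. This sketch assumes the http_credentials file contains a single `user:password` line, which is the format Chronos expects for its --http_credentials flag:
```bash
# List the scheduled jobs; an empty JSON list ("[]") is fine on a fresh install
curl -s -u "$(cat /etc/chronos/conf/http_credentials)" http://localhost:8080/scheduler/jobs
```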
Logs: /var/log/mesos/
Configuration: /etc/mesos-master
Listening to port: 5050
Start, stop, restart or check the status of the service with systemd:
$ sudo service mesos-master <start|stop|restart|status>
Credentials are stored in a file indicated by /etc/mesos-master/credentials, which is typically /etc/mesos-master/store/credentials.json. This file holds "principal"/"secret" pairs in JSON format. It stores the credentials for Chronos as well as for each node.
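For reference, a hypothetical credentials.json with placeholder values (the actual principals and secrets come from group_vars/all):
```bash
$ cat /etc/mesos-master/store/credentials.json
{
  "credentials": [
    { "principal": "chronos.omtd", "secret": "chronos.secret" },
    { "principal": "node1.omtd",   "secret": "node1.secret" }
  ]
}
```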
To check if the Mesos master is up, visit http://<cluster manager IP or fqdn>:5050 in a browser. You can use any of the credentials in the credentials file to log in. You must be able to see Chronos connected as a framework (click "Frameworks" on the top menu), and each of the nodes connected as an agent (click "Agents" on the top menu). If you don't see them there, something is wrong.
If you must troubleshoot the cluster manager, you need luck, patience and a salary raise. If you succeed, you must demand a cookie as a reward.
Most errors are caused by electing the wrong leader. This can happen if Mesos masters installed on the cluster nodes are enabled by mistake and compete to become leaders. To make sure your system doesn't suffer from this, log on to each cluster node and check if a mesos-master is running (service mesos-master status). If that's the case, stop it (service mesos-master stop). Make sure only the cluster manager's mesos-master is running. Give it a minute and your issues may get resolved.
In general, before you go deeper into troubleshooting, try shutting down Chronos, the Mesos master and the Mesos slaves on the cluster manager and every cluster node. Make sure there aren't any Mesos masters running on slaves or Mesos slaves running on the cluster manager. Then restart them in this order (order matters; a minimal sketch follows the list):
- Mesos master on cluster manager
- Chronos on cluster manager
- Mesos slaves on every node
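A minimal sketch of that order (the slave service may be called mesos-slave or mesos-agent depending on the Mesos version):
```bash
# On the cluster manager
sudo service mesos-master start
sudo service chronos start

# On every cluster node
sudo service mesos-slave start
```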
Also, check the credentials on the Mesos master. Do they match the credentials on the nodes and Chronos?
Another tip: make sure all one-line configuration files in /etc/mesos-master and /etc/chronos have no new line at the end. I kid you not!
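A quick way to check for a trailing newline, shown here for the two Chronos authentication files named above; the command substitution strips a final newline, so an empty result means the file ends with one:
```bash
for f in /etc/chronos/mesos_authentication_principal /etc/chronos/store/mesos_authentication_secret; do
  if [ -n "$(tail -c1 "$f")" ]; then
    echo "$f: OK, no trailing newline"
  else
    echo "$f: ends with a newline, fix it"
  fi
done
```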
Have fun debugging Chronos/Mesos.
- Docker engine
- Docker registry
- Apache2 as a reverse proxy and password security
- Test if the service works
Start, stop, restart or check the status of the service with systemd:
$ sudo service docker <start|stop|restart|status>
Runs as a docker container.
Logs: $ docker logs registry
Start, stop, restart or check the status of the service:
$ docker <start|stop|restart|inspect> registry
or find "registry" when running:
$ docker ps
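To avoid scanning the whole `docker ps` output, you can filter by container name:
```bash
docker ps --filter "name=registry"
```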
Configuration: /etc/apache2/sites-enabled/
Modules enabled:
- authn_file
- authn_core
- authz_groupfile
- authz_user
- authz_core
- auth_basic
- access_compat
- headers
- ssl
- proxy
- proxy_http
Logs: /var/log/apache2/
Start, stop, restart or check the status of the service with systemd:
$ sudo service apache2 <start|stop|restart|status>
To handle apache modules:
$ sudo a2query -m <module>
$ sudo a2enmod <module>
$ sudo a2dismod <module>
Always restart or reload apache2 afterwards.
On this host, Apache2 acts as a reverse proxy, redirecting /v2 to localhost:5000/v2. It also guards the /v2 location with a password.
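A quick check that the password guard is in place, run on the registry host (assumes curl; -k for a self-signed certificate):
```bash
curl -kI https://localhost/v2/                        # expect HTTP 401 without credentials
curl -kI -u <user>:<password> https://localhost/v2/   # expect HTTP 200 with the registry credentials
```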
From an external machine (e.g., your workstation):
$ docker pull hello-world
$ docker login <docker_registry fqdn>
username: <registry_username from group_vars/all>
password: <registry_password from group_vars/all>
$ docker tag hello-world <docker_registry fqdn>/hello-world
$ docker push <docker_registry fqdn>/hello-world
$ docker pull <docker_registry fqdn>/hello-world
Typically, the username and password of the registry are defined in group_vars/all (ansible) as "registry_username" and "registry_password".
If all these commands complete without an error, the registry works.
Configuration: /etc/prometheus/prometheus.yml
Logs: /var/log/prometheus/prometheus.log
Start, stop, restart or check the status of the service with systemd:
$ sudo service prometheus <start|stop|restart|status>
Configuration: /etc/grafana/grafana.ini
Logs: /var/log/grafana/grafana.log
Start, stop, restart or check the status of the service with systemd:
$ sudo service grafana-server <start|stop|restart|status>
Similar to the Apache2 setup for the Executor, with the following details:
- Grafana: / --> http://127.0.0.1:3000/
- Prometheus: /prometheus/ --> http://127.0.0.1:9090/prometheus/
- Docker engine pulling images from our "docker registry" service
- Prometheus node exporter
- CAdvisor as a docker container
- NFS client
Start, stop, restart or check the status of the service with systemd:
$ sudo service docker <start|stop|restart|status>
Test if the engine can pull from the docker registry by running the test described in the "Test if the service works" section.
Logs: cat /var/log/syslog|grep prometheus.
Start, stop, restart or check the status of the service with systemd:
$ sudo service prometheus-node-exporter <start|stop|restart|status>
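The exporter serves its metrics over HTTP on port 9100 by default, so a quick local check is:
```bash
curl -s http://localhost:9100/metrics | head
```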
Logs: $ docker logs cadvisor
Start, stop, restart or check the status of the service:
$ docker <start|stop|restart|inspect> cadvisor
Logs: cat /var/log/syslog|grep nfs.
Make sure the `/srv/executor/database` directory is mounted as an NFS share.
One way to check this:
```bash
$ df -h
Filesystem Size Used Avail Use% Mounted on
[...]
<executor IP>:/srv/executor/database 59G 7.5G 49G 14% /srv/executor/database
```
If this doesn't work, make sure nfs-client is installed and /etc/fstab
contains a line like this:
<executor IP>:/srv/executor/database /srv/executor/database nfs defaults 0 0
Check that all elements (executor IP, NFS server directory, NFS client directory) exist and are correct and mount again:
$ mount /srv/executor/database
How to add a node
=================
1. Provision a new virtual machine with Debian Jessie (tested) or similar (untested).
2. Make sure you have SSH access to this VM from the host
running ansible.
3. Add the VM IP in your hosts (or hosts.production, or whatever
you use) file inside the ansible script, under the
`cluster_nodes` section.
4. Add this to the `cluster_nodes` dict, which you can find in `group_vars/all`:
```json
"123.45.67.89": {
principal: "nodeX.omtd",
secret: "nodeX.secret"
}
```
(replace 123.45.67.89 with the actual VM IP, change the principal to something unique - if it is the 6th node, use `node6.omtd` - and change the secret to something harder to guess).
5. Install with ansible:
$ ansible-playbook -i hosts -l cluster_master,cluster_nodes
How to remove a node
====================
1. In hosts (or hosts.production or whatever) remove the IP of the node
from the `cluster_nodes` list
2. In `group_vars/all` remove the related section (IP, principal and secret) from the `cluster_nodes` dict variable
3. Reset the cluster node with ansible:
```bash
$ ansible-playbook -i hosts -l cluster_master
```
The node is no longer connected to the cluster; you can destroy it or use it for something else. If you don't destroy the node, make sure its prometheus-node-exporter is no longer scraped by the Prometheus server (e.g., stop the exporter service).
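For example, on the node you just removed:
```bash
# Stop the exporter so the Prometheus server has nothing to scrape on this host
sudo service prometheus-node-exporter stop
```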
How to mount a volume
=====================
General purpose instructions, after you create a volume on your cloud and attach it to a VM:
```bash
$ fdisk -l
$ fdisk /dev/vdb
<type:> n
<type:> p
<type:> w
$ mke2fs -t ext4 /dev/vdb1   # note down the Filesystem UUID, e.g. f090a4e2-ab78-49a6-9be3-8c07161c4f0b
$ mkdir -p /volumes/nfs1
$ echo "UUID=f090a4e2-ab78-49a6-9be3-8c07161c4f0b /volumes/nfs1 ext4 errors=remount-ro 0 1" >> /etc/fstab
```
The main storage space is an LVM (Logical Volume Manager) disk mounted on the executor VM, which is shared via NFS with every cluster node. Practically, an LVM volume group is a group of physical volumes that behaves as a single disk.
Let's see what's inside our nfsstore LVM:
$ vgs
VG #PV #LV #SN Attr VSize VFree
nfsstore 2 1 0 wz--n- 699.99g 0
$ pvs -o+pv_used
PV VG Fmt Attr PSize PFree Used
/dev/vdb nfsstore lvm2 a-- 380.00g 0 380.00g
/dev/vdc nfsstore lvm2 a-- 320.00g 0 320.00g
- Create a new volume and attach it to the executor VM, using your cloud tools, e.g., in ~okeanos:
$ kamaki volume create --size 200 --name omtd_3 --server-id <executor VM id> --project-id <openinted project id>
- Log in to the executor and find the new, unformatted volume:
$ fdisk -l
...
Disk /dev/vdd: 200 GiB, 203097383680 bytes, 521088640 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
...
- Prepare the volume for LVM:
$ pvcreate /dev/vdd
- Extend the volume group and the logical partition:
$ vgextend nfsstore /dev/vdd
$ lvextend /dev/nfsstore/nfs_logical /dev/vdd
$ resize2fs /dev/nfsstore/nfs_logical
Done!
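To verify the extension worked (this assumes nfs_logical is the logical volume backing the NFS store):
```bash
$ vgs nfsstore        # VSize should have grown by ~200G
$ lvs nfsstore        # nfs_logical should show the new size
$ df -h | grep nfs    # filesystem size after resize2fs
```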
Make sure you have a backup of everything in /srv/executor/database and then proceed.
List all the physical volumes in LVM:
$ pvs -o+pv_used
PV VG Fmt Attr PSize PFree Used
/dev/vdb nfsstore lvm2 a-- 380.00g 0 380.00g
/dev/vdc nfsstore lvm2 a-- 320.00g 0 320.00g
/dev/vdd nfsstore lvm2 a-- 200.00g 0 200.00g
For demonstration purposes, we assume you want to remove the 200G volume, thus reducing the LVM storage by 200G. This volume is /dev/vdd.
So, let's remove /dev/vdd:
$ pvmove /dev/vdd
$ vgreduce nfsstore /dev/vdd
And we are done! You can now use /dev/vdd for some other purpose, or destroy it if you don't need it anymore.
Explanation: pvmove moves all data stored in /dev/vdd to the other physical volumes, while vgreduce removes the volume from the LVM volume group.
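To double-check before detaching or destroying the disk:
```bash
$ pvs                     # /dev/vdd should no longer appear under the nfsstore VG
$ pvremove /dev/vdd       # optional: wipe the LVM label from the disk
```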