DAOS-16220 test: Include username in default common test directory #14937

Draft
wants to merge 63 commits into
base: master
63 commits
995b03b
DAOS-16220 test: Include username in default common test directory
phender Aug 15, 2024
3fc3b4f
Merge branch 'master' into pahender/DAOS-16220
phender Aug 15, 2024
f0b01e9
Fix pylint.
phender Aug 16, 2024
7907b85
Merge branch 'master' into pahender/DAOS-16220
phender Aug 28, 2024
5b031ec
Fix shellcheck and apply feedback.
phender Aug 28, 2024
bb345cb
Fix main.sh
phender Aug 28, 2024
bb3ef4c
Fix main.sh again
phender Aug 28, 2024
c521baa
Test custom_a mode
phender Aug 29, 2024
09fbcfb
Fix systemctl --user usage.
phender Aug 29, 2024
681b8cc
Enable user systemctl on all test nodes.
phender Aug 29, 2024
8d54966
Merge branch 'master' into pahender/DAOS-16220
phender Aug 30, 2024
2ac57c5
Fixing running agent as user.
phender Aug 30, 2024
bfeb3b3
Cleanup verify socket dir.
phender Aug 30, 2024
810c456
Cleanup verify socket dir.
phender Aug 30, 2024
1908088
Remove missing config temp error.
phender Aug 30, 2024
3bb43b6
Merge branch 'pahender/DAOS-16220' of https://github.com/daos-stack/d…
phender Aug 30, 2024
4565846
Merge branch 'master' into pahender/DAOS-16220
phender Aug 30, 2024
f89984a
Fix running systemctl as user.
phender Aug 30, 2024
1df9183
Adding systemctl status for failed start debug.
phender Aug 31, 2024
71db462
Run systemctl status first on error.
phender Sep 3, 2024
fc6f498
Fix systemctl user setup.
phender Sep 3, 2024
e1bd482
Adding jenkins user to the systemd-journal group.
phender Sep 3, 2024
5dacb90
Merge branch 'master' into pahender/DAOS-16220
phender Sep 3, 2024
9a53052
Adding debug for failed systemctl start command.
phender Sep 3, 2024
2b78e78
Adding config file debug.
phender Sep 3, 2024
23723ca
Fix debug.
phender Sep 3, 2024
9568e99
Improvements
phender Sep 4, 2024
6977312
Debug
phender Sep 4, 2024
55cee23
Fix debug
phender Sep 4, 2024
d04b1bc
Fix mod of user daos_agent service file
phender Sep 4, 2024
9a0566c
Attempting to allow user access to journalctl output
phender Sep 4, 2024
48c08e4
Re-adding jenkins user to systemd-journal group.
phender Sep 4, 2024
ddde157
Merge branch 'master' into pahender/DAOS-16220
phender Sep 4, 2024
2fee48a
Adding restarting systemd-journald
phender Sep 4, 2024
d225fd3
Attempting w/o adding jenkins to systemd-journal
phender Sep 4, 2024
c96f498
Confirm normal CI operation.
phender Sep 4, 2024
0f11d81
Test fixes.
phender Sep 5, 2024
24040c5
Apply feedback.
phender Sep 5, 2024
cdf12b4
Merge branch 'master' into pahender/DAOS-16220
phender Sep 17, 2024
f780e92
Fix creatre_directory() owner/user swap.
phender Sep 17, 2024
5727b4c
Merge branch 'master' into pahender/DAOS-16220
phender Sep 18, 2024
15a4599
Additional updates for running the agent as the user.
phender Sep 18, 2024
46b0765
Fix typo.
phender Sep 19, 2024
e404fc0
Merge branch 'master' into pahender/DAOS-16220
phender Sep 20, 2024
34568ba
Remove the control metadata path with sudo.
phender Sep 20, 2024
c3e4126
Merge branch 'master' into pahender/DAOS-16220
phender Sep 24, 2024
d6331dd
Fix empty server list.
phender Sep 24, 2024
8ddd6ad
Merge branch 'master' into pahender/DAOS-16220
phender Sep 25, 2024
43f0d28
Remove the control metadata path contents as part of the server cleanup.
phender Sep 25, 2024
84d6b7c
Support running cart_ctl with user run daos_agent.
phender Sep 25, 2024
c3fad19
Code cleanup.
phender Sep 25, 2024
282d39c
Merge branch 'master' into pahender/DAOS-16220
phender Oct 3, 2024
48fcdb1
Add user to stop_processes.
phender Oct 3, 2024
50d5422
Merge branch 'master' into pahender/DAOS-16220
phender Oct 4, 2024
0a2cf6b
Apply feedback.
phender Oct 4, 2024
5109544
Fix handling daos_user home dirs.
phender Oct 7, 2024
3776f23
Merge branch 'master' into pahender/DAOS-16220
phender Oct 8, 2024
75d8677
Fix typo.
phender Oct 8, 2024
b28ae35
Merge branch 'master' into pahender/DAOS-16220
phender Oct 14, 2024
12d0c0e
Merge branch 'master' into pahender/DAOS-16220
phender Oct 17, 2024
2141693
Merge branch 'master' into pahender/DAOS-16220
phender Oct 22, 2024
2507648
Setting DAOS_AGENT_DRPC_DIR for remote commands
phender Oct 22, 2024
a4ff69c
Fix pylint issue.
phender Oct 22, 2024
4 changes: 1 addition & 3 deletions src/tests/ftest/control/config_generate_run.py
@@ -4,8 +4,6 @@
SPDX-License-Identifier: BSD-2-Clause-Patent
'''

-import os
-
import yaml
from apricot import TestWithServers
from server_utils import ServerFailed
@@ -49,7 +47,7 @@ def test_config_generate_run(self):
# path needs to be set in that case.
control_metadata = None
if use_tmpfs_scm:
-control_metadata = os.path.join(self.test_env.log_dir, 'control_metadata')
+control_metadata = self.test_env.control_metadata

# Call dmg config generate. AP is always the first server host.
server_host = self.hostlist_servers[0]
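
The hunk above swaps an ad hoc `os.path.join(self.test_env.log_dir, 'control_metadata')` for a single `control_metadata` attribute on the test environment. The real attribute is defined elsewhere in the PR and is not shown here. As a rough sketch of the idea only, assuming a hypothetical `SketchTestEnvironment` class and a hypothetical `/var/tmp/daos_testing` base path (neither is taken from the DAOS source), with the username folded into the default directory as the PR title describes:

```python
import getpass
import os


class SketchTestEnvironment:
    """Hypothetical stand-in for the test environment object; not the DAOS class."""

    def __init__(self, base_dir="/var/tmp/daos_testing", user=None):
        # Assumed default: one directory per user, so two accounts sharing a
        # node do not write into the same log directory.
        self.user = user or getpass.getuser()
        self.log_dir = os.path.join(base_dir, self.user)

    @property
    def control_metadata(self):
        """Control metadata path, derived in one place from the log directory."""
        return os.path.join(self.log_dir, "control_metadata")


env = SketchTestEnvironment(user="alice")
print(env.log_dir)
print(env.control_metadata)
```

Keeping the path behind one attribute means a test no longer needs to know how the directory is composed, which is what lets the `import os` in this file go away.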
6 changes: 5 additions & 1 deletion src/tests/ftest/control/daos_agent_config.py
@@ -42,11 +42,15 @@ def test_daos_agent_config_basic(self):
self.agent_managers[-1],
include_local_host(self.hostlist_clients),
self.hostfile_clients_slots)
-self.agent_managers[-1].verify_socket_dir = False

# Get the input to verify
c_val = self.params.get("config_val", "/run/agent_config_val/*/")

+# Do not create the agent runtime directory if running as root or the test is attempting
+# to test with an invalid runtime directory value.
+if self.test_env.agent_user is None or (c_val[0] == "runtime_dir" and c_val[2] == "FAIL"):
+self.agent_managers[-1].verify_socket_dir = False
+
# Identify the attribute and modify its value to test value
self.assertTrue(
self.agent_managers[-1].set_config_value(c_val[0], c_val[1]),
14 changes: 6 additions & 8 deletions src/tests/ftest/control/log_entry.py
@@ -1,5 +1,5 @@
"""
-(C) Copyright 2023 Intel Corporation.
+(C) Copyright 2023-2024 Intel Corporation.

SPDX-License-Identifier: BSD-2-Clause-Patent
"""
@@ -8,7 +8,7 @@

from apricot import TestWithServers
from ClusterShell.NodeSet import NodeSet
-from general_utils import get_journalctl, journalctl_time, wait_for_result
+from general_utils import journalctl_time, wait_for_result
from run_utils import run_remote


@@ -36,16 +36,14 @@ def _verify_journalctl(self, since, expected_messages):
since (str): start time for journalctl
expected_messages (list): list of regular expressions to look for
"""
-self.log_step('Verify journalctl output since {}'.format(since))
+self.log_step(f'Verify journalctl output since {since}')

not_found = set(expected_messages)
journalctl_per_hosts = []

def _search():
"""Look for each message in any host's journalctl."""
-journalctl_results = get_journalctl(
-hosts=self.hostlist_servers, since=since, until=journalctl_time(),
-journalctl_type="daos_server")
+journalctl_results = self.server_managers[0].get_journalctl(since, journalctl_time())

# Convert the journalctl to a dict of hosts : output
journalctl_per_hosts.append({})
@@ -76,7 +74,7 @@ def _search():

# Fail if any message was not found
if not_found:
-fail_msg = '{} messages not found in journalctl'.format(len(not_found))
+fail_msg = f'{len(not_found)} messages not found in journalctl'
self.log.error(fail_msg)
for message in not_found:
self.log.error(' %s', message)
@@ -157,7 +155,7 @@ def test_control_log_entry(self):
self.log_step('Restart server')
expected = [r'Starting I/O Engine instance', r'Listening on']
with self.verify_journalctl(expected):
-self.server_managers[0].restart(list(kill_host), wait=True)
+self.server_managers[0].restart(kill_host, wait=True)

self.log_step('Reintegrate all ranks and wait for rebuild')
expected = [fr'rank {rank}.*start reintegration' for rank in kill_ranks] \
6 changes: 4 additions & 2 deletions src/tests/ftest/control/ms_resilience.py
@@ -1,5 +1,5 @@
"""
-(C) Copyright 2021-2023 Intel Corporation.
+(C) Copyright 2021-2024 Intel Corporation.

SPDX-License-Identifier: BSD-2-Clause-Patent
"""
@@ -206,7 +206,9 @@ def kill_servers(self, leader, replicas, num_hosts):
kill_list.remove(kill_list[-1])
kill_list.add(leader)
self.log.info("*** stopping leader (%s) + %d others: %s", leader, num_hosts - 1, kill_list)
-stop_processes(self.log, kill_list, self.server_managers[0].manager.job.command_regex)
+stop_processes(
+self.log, kill_list, self.server_managers[0].manager.job.command_regex,
+user=self.server_managers[0].manager.job.run_user)

kill_ranks = self.server_managers[0].get_host_ranks(kill_list)
self.assertGreaterEqual(len(kill_ranks), len(kill_list),
48 changes: 20 additions & 28 deletions src/tests/ftest/deployment/agent_failure.py
@@ -9,7 +9,7 @@

from ClusterShell.NodeSet import NodeSet
from command_utils_base import CommandFailure
-from general_utils import get_journalctl, journalctl_time, report_errors
+from general_utils import journalctl_time, report_errors
from ior_test_base import IorTestBase
from ior_utils import IorCommand
from job_manager_utils import get_job_manager
@@ -55,7 +55,7 @@ def run_ior_collect_error(self, results, job_num, file_name, clients, namespace)
# We'll verify the error message.
results[job_num].append(ior_output.stderr_text)
except CommandFailure as error:
-results[job_num] = [False, "IOR failed: {}".format(error)]
+results[job_num] = [False, f"IOR failed: {error}"]

def test_agent_failure(self):
"""Jira ID: DAOS-9385.
@@ -121,14 +121,10 @@ def test_agent_failure(self):
errors.append("IOR worked when agent is killed!")

# 5. Verify journalctl shows the log that the agent is stopped.
-results = get_journalctl(
-hosts=self.hostlist_clients, since=since, until=until,
-journalctl_type="daos_agent")
+results = self.agent_managers[0].get_journalctl(since, until)
self.log.info("journalctl results = %s", results)
if "shutting down" not in results[0]["data"]:
-msg = "Agent shut down message not found in journalctl! Output = {}".format(
-results)
-errors.append(msg)
+errors.append(f"Agent shut down message not found in journalctl! Output = {results}")

# 6. Restart agent.
self.log.info("Restart agent")
@@ -146,7 +142,7 @@ def test_agent_failure(self):
self.log.info(ior_results[job_num])
if not ior_results[job_num][0]:
ior_error = ior_results[job_num][-1]
-errors.append("IOR with restarted agent failed! Error: {}".format(ior_error))
+errors.append(f"IOR with restarted agent failed! Error: {ior_error}")

self.log.info("########## Errors ##########")
report_errors(test=self, errors=errors)
@@ -211,13 +207,13 @@ def test_agent_failure_isolation(self):
since = journalctl_time()
self.log.info("Stopping agent on %s", agent_host_kill)
pattern = self.agent_managers[0].manager.job.command_regex
-detected, running = stop_processes(self.log, hosts=agent_host_kill, pattern=pattern)
+detected, running = stop_processes(
+self.log, hosts=agent_host_kill, pattern=pattern,
+user=self.agent_managers[0].manager.job.run_user)
if not detected:
-msg = "No daos_agent process killed on {}!".format(agent_host_kill)
-errors.append(msg)
+errors.append(f"No daos_agent process killed on {agent_host_kill}!")
elif running:
-msg = "Unable to kill daos_agent processes on {}!".format(running)
-errors.append(msg)
+errors.append(f"Unable to kill daos_agent processes on {running}!")
else:
self.log.info("daos_agent processes on %s killed", detected)
until = journalctl_time()
@@ -236,29 +232,25 @@ def test_agent_failure_isolation(self):
self.log.info(ior_results[job_num_keep])
if not ior_results[job_num_keep][0]:
ior_error = ior_results[job_num_keep][-1]
-errors.append("Error found in IOR on keep client! {}".format(ior_error))
+errors.append(f"Error found in IOR on keep client! {ior_error}")

# 6. On the killed client, verify journalctl shows the log that the agent is
# stopped.
-results = get_journalctl(
-hosts=[agent_host_kill], since=since, until=until,
-journalctl_type="daos_agent")
+results = self.agent_managers[0].get_journalctl(since, until, agent_host_kill)
self.log.info("journalctl results (kill) = %s", results)
if "shutting down" not in results[0]["data"]:
-msg = ("Agent shut down message not found in journalctl on killed client! "
-"Output = {}".format(results))
-errors.append(msg)
+errors.append(
+"Agent shut down message not found in journalctl on killed client! "
+f"Output = {results}")

# 7. On the other client where agent is still running, verify that the journalctl
# in the previous step doesn't show that the agent is stopped.
-results = get_journalctl(
-hosts=[agent_host_keep], since=since, until=until,
-journalctl_type="daos_agent")
+results = self.agent_managers[0].get_journalctl(since, until, agent_host_keep)
self.log.info("journalctl results (keep) = %s", results)
if "shutting down" in results[0]["data"]:
-msg = ("Agent shut down message found in journalctl on keep client! "
-"Output = {}".format(results))
-errors.append(msg)
+errors.append(
+"Agent shut down message found in journalctl on keep client! "
+f"Output = {results}")

# 8. Restart both daos_agent. (Currently, there's no clean way to restart one.)
self.start_agent_managers()
@@ -274,7 +266,7 @@ def test_agent_failure_isolation(self):
self.log.info(ior_results[job_num_keep])
if not ior_results[job_num_keep][0]:
ior_error = ior_results[job_num_keep][-1]
-errors.append("Error found in second IOR run! {}".format(ior_error))
+errors.append(f"Error found in second IOR run! {ior_error}")

self.log.info("########## Errors ##########")
report_errors(test=self, errors=errors)
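
Several hunks in this PR pass a `user=` argument to `stop_processes`, so that only processes owned by the account running the agent or server are targeted. The body of `stop_processes` is not part of this diff, so the following is only a sketch of the idea. It uses a hypothetical `build_pkill_command` helper built on the standard `pkill -u` option, and the `jenkins` account name is just an example:

```python
import shlex


def build_pkill_command(pattern, user=None, signal="KILL"):
    """Hypothetical helper: build a pkill command line, optionally limited to one user.

    This is not the DAOS stop_processes() implementation.
    """
    command = ["pkill", "--signal", signal]
    if user:
        # -u restricts matching to processes whose effective user is `user`.
        command.extend(["-u", user])
    command.append(pattern)
    # Quote each part so a regex pattern survives a remote shell unchanged.
    return " ".join(shlex.quote(part) for part in command)


print(build_pkill_command("daos_agent"))
print(build_pkill_command("daos_agent", user="jenkins"))
```

Filtering by user matters once the agent can run under a non-root account: a pattern-only match could otherwise hit another user's `daos_agent` on a shared node.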
16 changes: 7 additions & 9 deletions src/tests/ftest/deployment/critical_integration.py
@@ -10,7 +10,7 @@
from apricot import TestWithoutServers, TestWithServers
from ClusterShell.NodeSet import NodeSet
from exception_utils import CommandFailure
-from general_utils import DaosTestError, get_journalctl, journalctl_time, run_command
+from general_utils import DaosTestError, journalctl_time, run_command
from run_utils import run_remote

# pylint: disable-next=fixme
@@ -167,22 +167,20 @@ def test_ras(self):
dmg.system_start(ranks=ranks_to_stop)
check_started_ranks = self.server_managers[0].check_rank_state(sub_list, ["joined"], 5)
if check_started_ranks:
-self.fail("Following Ranks {} failed to restart".format(check_started_ranks))
+self.fail(f"Following Ranks {check_started_ranks} failed to restart")

until = journalctl_time()

# gather journalctl logs for each server host, verify system stop event was sent to logs
-results = get_journalctl(hosts=self.hostlist_servers, since=since,
-until=until, journalctl_type="daos_server")
+results = self.server_managers[0].get_journalctl(since, until)
str_to_match = "daos_engine exited: process exited with 0"
for count, host in enumerate(self.hostlist_servers):
occurrence = results[count]["data"].count(str_to_match)
if occurrence != 2:
-self.log.info("Occurrence %s for rank stop not as expected for host %s",
-occurrence, host)
-msg = "Rank shut down message not found in journalctl! Output = {}".format(
-results[count]["data"])
-self.fail(msg)
+self.log.error(
+"Occurrence %s for rank stop not as expected for host %s", occurrence, host)
+self.log.debug("Journalctl output: %s", results[count]["data"])
+self.fail("Rank shut down message not found in journalctl!")

dmg.storage_scan()
dmg.network_scan()
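
The `test_ras` check above counts how often the stop message appears in each host's journal output and fails unless the count is exactly 2. A small standalone sketch of that check follows. The `hosts_with_unexpected_count` name and the sample data are made up for illustration, and `results` is assumed to be a list of dicts with a `"data"` string in the same order as the hosts, which is the shape the test indexes into:

```python
def hosts_with_unexpected_count(results, hosts, text, expected):
    """Return {host: count} for hosts whose journal data does not contain
    `text` exactly `expected` times.

    Hypothetical helper; assumes results[i] belongs to hosts[i].
    """
    unexpected = {}
    for host, result in zip(hosts, results):
        count = result["data"].count(text)
        if count != expected:
            unexpected[host] = count
    return unexpected


STOP_MESSAGE = "daos_engine exited: process exited with 0"
sample_hosts = ["server-1", "server-2"]
sample_results = [
    {"data": f"x {STOP_MESSAGE}\ny {STOP_MESSAGE}\n"},  # two engines stopped
    {"data": f"x {STOP_MESSAGE}\n"},                    # only one message logged
]
print(hosts_with_unexpected_count(sample_results, sample_hosts, STOP_MESSAGE, 2))
```

Collecting every mismatching host before failing is a design variation on the test above, which fails on the first mismatch; either way, the full journal text is better logged at debug level than embedded in the failure message, as the updated hunk does.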
4 changes: 3 additions & 1 deletion src/tests/ftest/deployment/server_rank_failure.py
@@ -92,7 +92,9 @@ def kill_engine(self, engine_kill_host):
engine_kill_host (str): Hostname to kill engine.
"""
pattern = self.server_managers[0].manager.job.command_regex
-detected, running = stop_processes(self.log, NodeSet(engine_kill_host), pattern)
+detected, running = stop_processes(
+self.log, NodeSet(engine_kill_host), pattern,
+user=self.server_managers[0].manager.job.run_user)
if not detected:
self.log.info("No daos_engine process killed on %s!", engine_kill_host)
elif running:
62 changes: 52 additions & 10 deletions src/tests/ftest/launch.py
@@ -282,7 +282,8 @@ def _run(self, args):
else:
set_test_environment(
logger, test_env, args.test_servers, args.test_clients, args.provider,
-args.insecure_mode, self.details)
+args.insecure_mode, self.details, args.agent_user, args.test_log_dir,
+args.systemd_path, args.systemd_lib_path)
except TestEnvironmentException as error:
message = f"Error setting up test environment: {str(error)}"
return self.get_exit_status(1, message, "Setup", sys.exc_info())
@@ -320,12 +321,13 @@ def _run(self, args):
return self.get_exit_status(0, "Listing tests complete")

# Setup the fuse configuration
-try:
-setup_fuse_config(logger, args.test_servers | args.test_clients)
-except LaunchException:
-# Warn but don't fail
-message = "Issue detected setting up the fuse configuration"
-setup_result.warn_test(logger, "Setup", message, sys.exc_info())
+if args.fuse_setup:
+try:
+setup_fuse_config(logger, args.test_servers | args.test_clients)
+except LaunchException:
+# Warn but don't fail
+message = "Issue detected setting up the fuse configuration"
+setup_result.warn_test(logger, "Setup", message, sys.exc_info())

# Setup override systemctl files
try:
@@ -358,8 +360,8 @@ def _run(self, args):
group.update_test_yaml(
logger, args.scm_size, args.scm_mount, args.extra_yaml,
args.timeout_multiplier, args.override, args.verbose, args.include_localhost)
-except (RunException, YamlException) as e:
-message = "Error modifying the test yaml files: {}".format(e)
+except (RunException, YamlException) as error:
+message = f"Error modifying the test yaml files: {error}"
status |= self.get_exit_status(1, message, "Setup", sys.exc_info())
except StorageException:
message = "Error detecting storage information for test yaml files"
@@ -540,6 +542,12 @@ def main():
"-a", "--archive",
action="store_true",
help="archive host log files in the avocado job-results directory")
+parser.add_argument(
+"-au", "--agent_user",
+action="store",
+default=None,
+type=str,
+help="user account to use when running the daos_agent")
parser.add_argument(
"-c", "--clear_mounts",
action="append",
@@ -562,6 +570,10 @@ def main():
"--failfast",
action="store_true",
help="stop the test suite after the first failure")
+parser.add_argument(
+"-fs", "--fuse_setup",
+action="store_true",
+help="enable setting up fuse configuration files")
parser.add_argument(
"-i", "--include_localhost",
action="store_true",
@@ -584,7 +596,7 @@ def main():
help="modify the test yaml files but do not run the tests")
parser.add_argument(
"-mo", "--mode",
-choices=['normal', 'manual', 'ci'],
+choices=['normal', 'manual', 'ci', 'custom_a'],
default='normal',
help="provide the mode of test to be run under. Default is normal, "
"in which the final return code of launch.py is still zero if "
@@ -649,6 +661,18 @@ def main():
"-si", "--slurm_install",
action="store_true",
help="enable installing slurm RPMs if required by the tests")
+parser.add_argument(
+"-sl", "--systemd_lib_path",
+action="store",
+default=None,
+type=str,
+help="the daos_server and daos_agent systemd LD_LIBRARY_PATH to define in the config")
+parser.add_argument(
+"-sp", "--systemd_path",
+action="store",
+default=None,
+type=str,
+help="the daos_server and daos_agent systemd PATH to define in the config")
parser.add_argument(
"--scm_mount",
action="store",
@@ -681,6 +705,12 @@ def main():
default=NodeSet(),
help="comma-separated list of hosts to use as replacement values for "
"client placeholders in each test's yaml file")
+parser.add_argument(
+"-tld", "--test_log_dir",
+action="store",
+default=None,
+type=str,
+help="test log directory base path")
parser.add_argument(
"-th", "--logs_threshold",
action="store",
@@ -744,10 +774,22 @@ def main():
args.slurm_install = True
args.slurm_setup = True
args.user_create = True
+args.fuse_setup = True
args.clear_mounts.append("/mnt/daos")
args.clear_mounts.append("/mnt/daos0")
args.clear_mounts.append("/mnt/daos1")

+elif args.mode == "custom_a":
+if args.agent_user is None:
+# Run the agent with the current user by default
+args.agent_user = getpass.getuser()
+args.process_cores = False
+args.logs_threshold = None
+args.slurm_install = False
+args.slurm_setup = False
+args.user_create = False
+args.fuse_setup = False
+
# Setup the Launch object
launch = Launch(args.name, args.mode, args.slurm_install, args.slurm_setup)
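
The `launch.py` hunks add `--agent_user` and `--fuse_setup` and let the mode override them after parsing: `ci` turns fuse setup on, while `custom_a` turns it off and falls back to the current user for the agent. The sketch below reproduces only that part of the flow in a reduced parser. The `parse_args` wrapper and its `current_user` parameter are assumptions added so the default can be exercised without depending on the real login name:

```python
import argparse
import getpass


def parse_args(argv, current_user=getpass.getuser):
    """Reduced sketch of the launch.py option handling; not the full parser."""
    parser = argparse.ArgumentParser()
    parser.add_argument("-au", "--agent_user", action="store", default=None, type=str)
    parser.add_argument("-fs", "--fuse_setup", action="store_true")
    parser.add_argument(
        "-mo", "--mode", choices=["normal", "manual", "ci", "custom_a"], default="normal")
    args = parser.parse_args(argv)

    if args.mode == "ci":
        # CI nodes are expected to need the fuse configuration.
        args.fuse_setup = True
    elif args.mode == "custom_a":
        if args.agent_user is None:
            # Run the agent as the current user by default.
            args.agent_user = current_user()
        args.fuse_setup = False
    return args


args = parse_args(["--mode", "custom_a"], current_user=lambda: "tester")
print(args.agent_user, args.fuse_setup)
```

Because the mode block runs after parsing, an explicit `-fs` on the command line is overridden in `custom_a` mode, while an explicit `-au` is respected.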
