
Error Edge-case #47

Closed
jgrancell opened this issue Oct 4, 2019 · 11 comments

Comments

@jgrancell
Contributor

Background: We are currently running PyTerraBackTyl with the git backend, using Atlantis as a front end.

There appears to be an edge-case where we are hitting your error message here:

if lock_state:
    return _json_string(lock_state), C.LOCK_STATES[C.LOCK_STATE_LOCKED]
return 'I don\'t know, man. Something is really fucked up.',\
       C.LOCK_STATES[_backends[_env][C.TYL_KEYWORD_BACKEND]._lock_state_]

My testing shows that usually what happens is that PyTerraBackTyl has the specific project LOCKED in its own internal memory, but the git repo that PyTerra saves state into has the project unlocked. I'm not sure yet what causes this mismatch, but a service restart is usually enough to correct the issue (hence my thinking that the state is saved in memory incorrectly).

It also seems that when PyTerra is trying to write the git commit message, it can fail on the defaultdict portion of this line:

self.lock_commit_msg = self.C.GIT_COMMIT_MESSAGE_FORMAT.format(**defaultdict(str, state_obj))

It raises a KeyError for Who, with our GIT_COMMIT_MESSAGE_FORMAT set to "{Who}, {Operation} - {ID}". I'm not sure if this is something that's also saved in memory and perhaps also getting messed up, or if it's entirely unrelated to the LOCKED/UNLOCKED issue.
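For reference, here's my reading of why that line can raise (a sketch, not confirmed against the codebase): the ** unpacking copies only the keys that already exist in state_obj into keyword arguments, so the defaultdict's str default never fires for a missing Who. Using format_map() would keep the defaultdict behavior:

from collections import defaultdict

state_obj = {'Operation': 'plan', 'ID': 'abc123'}  # hypothetical payload missing 'Who'

# ** unpacking builds plain keyword arguments from the existing keys only,
# so the default factory never runs and format() raises KeyError: 'Who'
# "{Who}, {Operation} - {ID}".format(**defaultdict(str, state_obj))

# format_map() hands the mapping itself to the formatter, so missing keys
# fall back to the str default ('')
print("{Who}, {Operation} - {ID}".format_map(defaultdict(str, state_obj)))  # ", plan - abc123"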

I will shortly submit a PR that makes the error referenced first in this issue a bit more descriptive.

@jgrancell
Contributor Author

Actually, one of our projects is in this exact error state right now.

Terraform attempts to call PyTerraBackTyl via a LOCK request to /lock?env=repo-name_project-name, PyTerra returns an HTTP 423 with your "Something is really fucked up" response, and Terraform informs the user that it is already locked.

Looking at the git backend's committed code, it's not locked. Looking on-server in /opt/pyterrabacktyl/data/repo-name_project-name, it's not LOCKED.

Running curl -X LOCK http://localhost:2442/lock?env=repo-name_project-name gives me the error I don't know, man. Something is really fucked up. If I run an UNLOCK against the unlock endpoint for the same env, it gives the same issue. So at this point I can neither lock nor unlock via curl.
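For anyone reproducing this, here are the same checks scripted with the requests library (a sketch; the /lock path is as above, and I'm assuming the unlock endpoint is /unlock):

import requests

env = 'repo-name_project-name'
base = 'http://localhost:2442'

# Terraform's HTTP state locking uses the LOCK/UNLOCK verbs; requests
# passes custom methods through unchanged
for method, path in (('LOCK', '/lock'), ('UNLOCK', '/unlock')):
    r = requests.request(method, base + path, params={'env': env})
    print(method, r.status_code, r.text)  # both print the "Something is really fucked up." response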

Restarting the service is confirmed to fix the issue. As soon as it's restarted I can run terraform and it correctly determines it to be unlocked.

@dev-dull
Owner

dev-dull commented Oct 8, 2019

Hey Josh, I've got a guess on what's happening here and I believe this is related to Issue #36. What is GIT_WORKING_PATH set to in config.yaml? If it is set to the default of null (None) or the setting is commented out (again, defaulting the value to None), then PyTerraBackTYL is using an OS temporary directory. I've run into cases where OSes are good at leaving things untouched there, but I've also run into cases where the OS will "clean up" files that git is expecting to be present, causing commits to fail.
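To illustrate the fallback (a simplified sketch, not the exact code):

import tempfile

GIT_WORKING_PATH = None  # the config.yaml default (null, or the setting commented out)

# With no explicit path, a directory under the OS temp location gets used;
# tmpwatch/systemd-tmpfiles style cleanup can later delete files that git
# expects to be present, and commits start failing
working_path = GIT_WORKING_PATH or tempfile.mkdtemp(prefix='pyterrabacktyl_')
print(working_path)  # e.g. /tmp/pyterrabacktyl_k2jd81 -- not safe long-term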

Next time you run into this issue, navigate to or curl the /state endpoint on your PyTerraBackTYL instance. In there, you should be able to find the temporary directory being used for the repo/branch, where you can perform a git status and discover whether any files that should be present are missing.
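Something like this against the path reported by /state will show it (GitPython, which the git backend uses; the temp path here is just an example):

from git import Repo

repo = Repo('/tmp/pyterrabacktyl_k2jd81')  # the directory reported by /state
print(repo.git.status())  # any deleted/missing files will show up here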

The work-arounds are to do scheduled restarts of the service, or to set GIT_WORKING_PATH to a specific path (I have been using /opt/pyterrabacktyl/data or /opt/pyterrabacktyl/PyTerraBackTyl/data on my instances).

Side note: I've got a mild cold at the moment and I'm notorious for misreading things when I'm not 100%, so let me know if I've wildly misunderstood the issue here.

@jgrancell
Contributor Author

We're using /opt/pyterrabacktyl/data for the GIT_WORKING_PATH, with the application residing in /opt/pyterrabacktyl/PyTerraBackTyl.

@dev-dull
Owner

dev-dull commented Oct 8, 2019

Hm. Very strange. Do you happen to know whether you're getting a lock state of INIT or a lock state of LOCKED?

What version of terraform are you using? My company refuses to budge from 0.11.x, so if you're using 0.12, we might be running into an unexpected compatibility issue.

Also, just a quick note to mention that I just did a PR to correct a 'c' to a 'C' on line 61 of pyterrabacktyl.py.

@jgrancell
Contributor Author

I haven't been able to capture the lock state yet; I'm waiting for the next time it happens to see what gets output in the debug logging.

We're currently running 0.12.8, though we've also had this issue on 0.12.5 I believe.

@dev-dull
Owner

dev-dull commented Oct 8, 2019

While we wait for the bad state to show up again, could you double check the following for me? Given the symptoms, I don't expect any will be the case, but I thought of a couple of scenarios that might cause one or more of the below.

  • Take a look at the branch names and double check that they all have the expected names (no typo'd dupes).
  • grep the logs for warnings that look like:
    • 'Failed to release lock for ENV %s, already unlocked'
    • 'Failed to obtain lock for ENV %s, already locked: %s'
  • Take a look at the directory names in /opt/pyterrabacktyl/data and double check that there's nothing strange going on there (e.g. a directory with a name that starts or ends with a space)
  • Double check that the backend_service.run() call (last line of pyterrabacktyl.py) doesn't have threading/tornado enabled (I think I had accidentally enabled this several months ago when I was doing work on adding a simple UI); see the sketch after this list for why this matters.
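On that last point, here's why it matters (a simplified sketch, not the project's actual structure): the Flask-style run() handles requests serially by default, so module-level lock state is only touched by one request at a time; with threading enabled, overlapping LOCK/UNLOCK calls can interleave and leave the in-memory state out of sync with the git backend.

from flask import Flask

backend_service = Flask(__name__)
_lock_state = {}  # stand-in for module-level state like _backends

@backend_service.route('/lock', methods=['LOCK'])
def lock():
    # read-modify-write on shared in-memory state; only safe while the
    # dev server handles requests one at a time (the default)
    _lock_state['env'] = 'LOCKED'
    return 'LOCKED', 200

if __name__ == '__main__':
    backend_service.run(host='0.0.0.0', port=2442)  # serial: safe
    # backend_service.run(host='0.0.0.0', port=2442, threaded=True)  # interleaving risk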

@jgrancell
Contributor Author

Sorry for the late response, I was out of the office and away from my computer for almost 2 weeks.

To respond to your questions:

  • There are no obvious branch name issues, but we have more than 400 branches so it's difficult to check every character of every branch.
  • We output all logs to journalctl, but I can find no instance of the string "Failed to" in the logs.
  • None that I can see. Names are all generated programmatically as {repo-name}_{atlantis project name}, so that shouldn't be an issue.
  • The call looks like this in our running version of PyTerra: backend_service.run(host=C.BACKEND_SERVICE_IP, port=C.BACKEND_SERVICE_PORT, ssl_context=ssl_context)

@jgrancell
Contributor Author

We're running into this issue again. It looks like sometime yesterday a build didn't quite finish properly (not a PyTerraBackTyl issue). So within the actual git repo for the branch (we'll call it dc-abc_user-project) the LOCKED file still exists even though the terraform job is long done.

The local git repo at /opt/pyterrabacktyl/data/dc-abc_user-project/ shows the LOCKED file is present, and that the local repo has not diverged from the remote repo in Bitbucket.

Interestingly, when I attempt to manually unlock the file (via a curl command), debugging code shows:

Oct 23 11:58:20 us01vltrrfrm003 pyterrabacktyl[1322]: 2019-10-23 11:58:20,547 - root - INFO - Memory state: INIT
Oct 23 11:58:20 us01vltrrfrm003 pyterrabacktyl[1322]: 2019-10-23 11:58:20,547 - root - INFO - Accepted states: ['LOCKED', 'INIT']
Oct 23 11:58:20 us01vltrrfrm003 pyterrabacktyl[1322]: 2019-10-23 11:58:20,547 - root - INFO - Old State: INIT
Oct 23 11:58:20 us01vltrrfrm003 pyterrabacktyl[1322]: 2019-10-23 11:58:20,547 - root - INFO - New State: UNLOCKED
...
Oct 23 11:58:20 us01vltrrfrm003 pyterrabacktyl[1322]: 2019-10-23 11:58:20,914 - root - DEBUG - Lock state set to UNLOCKED

So I'm not sure why PyTerraBackTyl thought that the repo was INIT, since by all accounts it was locked. The curl call to UNLOCK did unlock it, and it's functional again.
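If I'm reading those log lines right, the transition check is something like this (a sketch of my interpretation, not the actual code), which would explain why the UNLOCK still succeeded from INIT:

ACCEPTED_PRIOR_STATES = ('LOCKED', 'INIT')  # per the 'Accepted states' log line

def try_unlock(memory_state):
    # UNLOCK is allowed from INIT as well as LOCKED, so the curl call
    # succeeded even though memory had lost the LOCKED state
    if memory_state not in ACCEPTED_PRIOR_STATES:
        raise RuntimeError('cannot unlock from state ' + memory_state)
    return 'UNLOCKED'

print(try_unlock('INIT'))  # UNLOCKED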

@dev-dull
Owner

So after giving this quite a bit of thought, I think some additional logging might shed more light on this problem. I'll try to make some additions to the code sometime this week.

  • Log what happens with request.data.decode() or '{}' (line 46 of _set_lock_state), which could be failing and leaving us stuck in an INIT lock state (although I would expect such an unhandled exception to be logged automatically).
  • json.loads(lock_text) (line 47 of _set_lock_state) could be failing and leaving us in an INIT lock state (this would indicate a bug in Terraform, and I would expect this error to be logged automatically). A sketch of the extra logging for these first two points follows this list.
  • Better logging around what the GitPython library is seeing (exit code and any output messages); the library depends on the local installation of git on the host, and if git exits with a status code of 0 even though it encountered an error, the library may be telling us that it successfully unlocked when nothing actually happened.
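Roughly what I have in mind for the first two points (a sketch, not the committed change; names and line numbers per the current code):

import json
import logging

from flask import request  # available inside the request handler

# inside _set_lock_state(), around lines 46-47:
lock_text = request.data.decode() or '{}'
logging.debug('Raw lock payload: %r', lock_text)
try:
    state_obj = json.loads(lock_text)
except json.JSONDecodeError:
    logging.exception('Could not parse lock payload from Terraform')
    raise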

@dev-dull
Owner

dev-dull commented Nov 1, 2019

Hi Josh,
I just committed some changes to the 'beta' branch that should help us pin down the exact cause here. Just a heads up that this also includes changes that should fix some of the output issues with the PyTerraBackTYL /state endpoint, as well as the Slack post-processor.

FYI, I was unable to get the .decode() function (as outlined in the first bullet point of my previous comment) to fail on a binary string, so I skipped adding any additional error handling/logging there.

@dev-dull
Owner

dev-dull commented Mar 2, 2023

Closing due to age

@dev-dull dev-dull closed this as completed Mar 2, 2023