panic: runtime error: invalid memory address or nil pointer dereference #12502

Closed
girgen opened this issue Jan 12, 2023 · 10 comments · Fixed by #13707
Labels
bug unexpected problem or unintended behavior

Comments


girgen commented Jan 12, 2023

### Relevant telegraf.conf

# Configuration for telegraf agent

# Global tags can be specified here in key="value" format.
[global_tags]
  dc = "custom" # will tag all metrics with dc=custom
  prod = "true"

# add this to rc.conf and put relevant files there
# telegraf_flags="-config-directory /usr/local/etc/telegraf.d"

[agent]
  ## Default data collection interval for all inputs
  interval = "10s"
  ## Rounds collection interval to 'interval'
  ## ie, if interval="10s" then always collect on :00, :10, :20, etc.
  round_interval = true

  ## Telegraf will cache metric_buffer_limit metrics for each output, and will
  ## flush this buffer on a successful write.
  metric_buffer_limit = 1000
  ## Flush the buffer whenever full, regardless of flush_interval.
  flush_buffer_when_full = true

  ## Collection jitter is used to jitter the collection by a random amount.
  ## Each plugin will sleep for a random time within jitter before collecting.
  ## This can be used to avoid many plugins querying things like sysfs at the
  ## same time, which can have a measurable effect on the system.
  collection_jitter = "1s"

  ## Default flushing interval for all outputs. You shouldn't set this below
  ## interval. Maximum flush_interval will be flush_interval + flush_jitter
  flush_interval = "10s"
  ## Jitter the flush interval by a random amount. This is primarily to avoid
  ## large write spikes for users running a large number of telegraf instances.
  ## ie, a jitter of 5s and interval 10s means flushes will happen every 10-15s
  flush_jitter = "5s"

  ## Run telegraf in debug mode
  debug = false
  ## Run telegraf in quiet mode
  quiet = true
  ## Override default hostname, if empty use os.Hostname()
  hostname = "xxx-hostname.yyy.zzz"

  ## Log target controls the destination for logs and can be one of "file",
  ## "stderr" or, on Windows, "eventlog".  When set to "file", the output file
  ## is determined by the "logfile" setting.
  logtarget = "file"

  ## Name of the file to be logged to when using the "file" logtarget.  If set to
  ## the empty string then logs are written to stderr.
  logfile = "/var/log/telegraf/telegraf.log"

  ## The logfile will be rotated after the time interval specified.  When set
  ## to 0 no time based rotation is performed.  Logs are rotated only when
  ## written to, if there is no log activity rotation may be delayed.
  logfile_rotation_interval = "0h"

  ## The logfile will be rotated when it becomes larger than the specified
  ## size.  When set to 0 no size based rotation is performed.
  logfile_rotation_max_size = "1MB"

  ## Maximum number of rotated archives to keep, any older logs are deleted.
  ## If set to -1, no archives are removed.
  logfile_rotation_max_archives = 5


# Configuration for influxdb server to send metrics to
[[outputs.influxdb]]
  ## The full HTTP or UDP endpoint URL for your InfluxDB instance.
  ## Multiple urls can be specified as part of the same cluster,
  ## this means that only ONE of the urls will be written to each interval.
  # urls = ["udp://192.168.1.244:8090"] # UDP endpoint example
  urls = ["https://influx.cxx.zzz:8086"] # required
  ## The target database for metrics (telegraf will create it if not exists).
  database = "pp_prod" # required
  ## Retention policy to write to.
  retention_policy = "default"
  ## Precision of writes, valid values are "ns", "us" (or "µs"), "ms", "s", "m", "h".
  ## note: using "s" precision greatly improves InfluxDB compression.
  precision = "s"

  ## Write timeout (for the InfluxDB client), formatted as a string.
  ## If not provided, will default to 5s. 0s means no timeout (not recommended).
  timeout = "5s"

  username = "xxxxx"
  password = "sEcReT"
  ## Set the user agent for HTTP POSTs (can be useful for log differentiation)
  # user_agent = "telegraf"
  ## Set UDP payload size, defaults to InfluxDB UDP Client default (512 bytes)
  # udp_payload = 512


# Statsd Server
[[inputs.statsd]]
  ## Address and port to host UDP listener on
  service_address = "localhost:8125"
  ## Delete gauges every interval (default=false)
  delete_gauges = true
  ## Delete counters every interval (default=false)
  delete_counters = true
  ## Delete sets every interval (default=false)
  delete_sets = true
  ## Delete timings & histograms every interval (default=true)
  delete_timings = true
  ## Percentiles to calculate for timing & histogram stats
  percentiles = [70.0, 90.0]

  ## separator to use between elements of a statsd metric
  metric_separator = "_"

  ## Number of UDP messages allowed to queue up, once filled,
  ## the statsd server will start dropping packets
  allowed_pending_messages = 10000

  ## Number of timing/histogram values to track per-measurement in the
  ## calculation of percentiles. Raising this limit increases the accuracy
  ## of percentiles but also increases the memory usage and cpu time.
  percentile_limit = 1000

  ## UDP packet size for the server to listen for. This will depend on the size
  ## of the packets that the client is sending, which is usually 1500 bytes.
  udp_packet_size = 1500


### Logs from Telegraf

```text
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x98b91f]

goroutine 1 [running]:
github.com/awnumar/memguard/core.Purge.func1(0xc000155930)
	github.com/awnumar/[email protected]/core/exit.go:23 +0x3f
github.com/awnumar/memguard/core.Purge()
	github.com/awnumar/[email protected]/core/exit.go:51 +0x25
github.com/awnumar/memguard/core.Panic({0x5093f80, 0xc0001119a0})
	github.com/awnumar/[email protected]/core/exit.go:85 +0x25
github.com/awnumar/memguard/core.NewBuffer(0x20)
	github.com/awnumar/[email protected]/core/buffer.go:73 +0x2d5
github.com/awnumar/memguard/core.NewCoffer()
	github.com/awnumar/[email protected]/core/coffer.go:30 +0x34
github.com/awnumar/memguard/core.init.0()
	github.com/awnumar/[email protected]/core/enclave.go:15 +0x2e
```

### System info

Telegraf 1.25.0, FreeBSD-13.1

### Docker

_No response_

### Steps to reproduce

1. Install the latest Telegraf 1.25.0 from ports or package
2. Start it
3. See it crash and restart in a loop (due to the start script's daemon restart)



### Expected behavior

not panic :)

### Actual behavior

it panics due to segmentation fault

### Additional info

1.24.x works fine. The problem was introduced with telegraf 1.25.0

I am the packager for FreeBSD, btw.
girgen added the bug label Jan 12, 2023
Contributor

powersj commented Jan 12, 2023

This sounds very similar to #12403, which is due to awnumar/memguard#144

Contributor

powersj commented Jan 12, 2023

@girgen - was this run in a jail or some other type of container? Or was this on a vanilla FreeBSD system?

powersj added the waiting for response label Jan 12, 2023
Author

girgen commented Jan 12, 2023

Ah, yes, in a jail. Sorry, forgot to mention that. I run most stuff in jails.

telegraf-tiger bot removed the waiting for response label Jan 12, 2023
Author

girgen commented Jan 13, 2023

Is there a simple, or at least not too hard, way to opt out of that module at build time? That would mean opting out of the secret store feature, but as a short-term solution that would be preferred.

My other alternative is to let the start script fail and tell the user to reconfigure the jail if it does not have the allow.sysvipc=1 jail parameter.

A third alternative would be to downgrade the port until the problem is fixed.

The first alternative is preferred. Can we create a patch for the source code that opts out of that module?

Best regards,
Palle

Contributor

powersj commented Jan 13, 2023

> Is there a simple, or at least not too hard, way to opt out of that module at build time?

The panic occurs during an init() that runs even with a blank import of memguard. As a result, the fix looks like it needs to be in memguard.

Using golang.org/x/sys/unix and checking the error from the following may be enough:

err := unix.Mlockall(unix.MCL_FUTURE | unix.MCL_CURRENT)

But I'm not sure how that behaves on non-Linux/Unix systems; I need to experiment with this further.
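As a rough illustration of "check the error instead of panicking", here is a minimal sketch. It makes several assumptions: it uses the standard library `syscall` package rather than `golang.org/x/sys/unix`, it probes a single anonymous page with `Mlock` rather than calling `Mlockall`, and `canLockMemory` is a hypothetical helper name, not something from memguard or Telegraf. It only builds on Unix-like systems.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// canLockMemory is a hypothetical helper (not part of memguard or Telegraf).
// It maps one anonymous page, tries to lock it into RAM, and reports any
// failure as an error instead of panicking.
func canLockMemory() error {
	size := os.Getpagesize()
	page, err := syscall.Mmap(-1, 0, size,
		syscall.PROT_READ|syscall.PROT_WRITE,
		syscall.MAP_ANON|syscall.MAP_PRIVATE)
	if err != nil {
		return fmt.Errorf("mmap probe page: %w", err)
	}
	// Release the probe page on every return path.
	defer syscall.Munmap(page)

	// This is the call that a restricted jail or container may refuse.
	if err := syscall.Mlock(page); err != nil {
		return fmt.Errorf("mlock not permitted: %w", err)
	}
	return syscall.Munlock(page)
}

func main() {
	if err := canLockMemory(); err != nil {
		fmt.Println("memory locking unavailable:", err)
		return
	}
	fmt.Println("memory locking available")
}
```

Whether a probe like this matches what memguard actually needs at startup is untested here; it only shows that the failure can be surfaced as an ordinary error value.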

Author

girgen commented Jan 13, 2023

Mm, yeah, something like

package core

import (
        "golang.org/x/sys/unix"
        "github.com/awnumar/memcall"
)

func init() {
        err := unix.Mlockall(unix.MCL_FUTURE | unix.MCL_CURRENT)
        if (err != nil) {

but then what? :) How can I opt out in that case? 🤔

Contributor

powersj commented Jan 18, 2023

> but then what? :) How can I opt out in that case? 🤔

My first goal was to see how that library could avoid panicking, which would let us keep using it. The opt-out would then be unnecessary: we could safely import the library and would only need to return an error if someone tried to use the secret-store features when the jail, container, etc. did not have the correct privileges.

For now, I believe you should document the minimum privileges required. I think that is the allow.mlock parameter. Please let me know whether that is indeed the minimum required or whether you need additional parameters.


owlcall commented Jan 22, 2023

Adding allow.mlock = 1; in the jail config resolved the panic. Thank you. No other changes needed to be made other than the jail config.

Details from my issue (for posterity/relevance):

Running FreeBSD 13.1 RELEASE (telegraf-1.25), problems started suddenly around a month ago. Configs have been untouched for a very long time, but telegraf updates are automated.

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x98b91f]

goroutine 1 [running]:
github.com/awnumar/memguard/core.Purge.func1(0xc000143930)
        github.com/awnumar/[email protected]/core/exit.go:23 +0x3f
github.com/awnumar/memguard/core.Purge()
        github.com/awnumar/[email protected]/core/exit.go:51 +0x25
github.com/awnumar/memguard/core.Panic({0x5093f80, 0xc0000f59b0})
        github.com/awnumar/[email protected]/core/exit.go:85 +0x25
github.com/awnumar/memguard/core.NewBuffer(0x20)
        github.com/awnumar/[email protected]/core/buffer.go:73 +0x2d5
github.com/awnumar/memguard/core.NewCoffer()
        github.com/awnumar/[email protected]/core/coffer.go:30 +0x34
github.com/awnumar/memguard/core.init.0()
        github.com/awnumar/[email protected]/core/enclave.go:15 +0x2e

Errors above are identical to those seen in issue #12403.

Author

girgen commented Mar 1, 2023

Hi,

While this is a workaround, I would still like to pursue the idea of actually changing the code to avoid using mlock. Is mlock really necessary?

Contributor

powersj commented Mar 1, 2023

> I would still like to pursue the idea of actually changing the code to avoid using mlock. Is mlock really necessary?

The memguard library is used by Telegraf in the secret store functionality, and that is not a feature we are going to remove. If you have an idea for how to work around importing the library when it is not necessary, please do put up a PR.
