UNSOLVED reComputer Xavier+327S stress testing - dma_alloc_coherent failed
-
We've been doing some stress testing with timelapse captures and inference. So far the Xavier+327S has been a great combo (just ordered 2 more 327S cams too).
We're running into an issue that could be resource related, but I wanted to see if anyone else has run into it.
We ran a stress test capturing still images in a loop with a 250ms delay. Each image is fed to an inference pipeline (TensorRT). Everything works great and our services appear free of memory leaks. After about 3 hours (approximately 21K captures), the video device becomes unusable and our service freezes up. It requires a reboot for the device to work again.
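For context, the test loop is roughly equivalent to the Go sketch below. Note that captureStill and runInference are placeholders for our actual capture and TensorRT wrapper calls, not real APIs.

```go
package main

import (
	"log"
	"time"
)

// captureStill and runInference are stand-ins for our real capture and
// TensorRT inference calls; they are not part of any public API.
func captureStill(n int) ([]byte, error) { /* grab one frame from the video device */ return nil, nil }
func runInference(frame []byte) error    { /* feed the frame to the TensorRT pipeline */ return nil }

func main() {
	for i := 0; ; i++ {
		frame, err := captureStill(i)
		if err != nil {
			log.Fatalf("capture %d failed: %v", i, err)
		}
		if err := runInference(frame); err != nil {
			log.Printf("inference %d failed: %v", i, err)
		}
		time.Sleep(250 * time.Millisecond) // 250 ms delay between captures
	}
}
```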
dmesg reports:
[Fri Apr 5 01:23:41 2024] tegra194-vi5 15c10000.vi: dma_alloc_coherent failed
[Fri Apr 5 01:23:41 2024] Unable to handle kernel NULL pointer dereference at virtual address 00000000
[Fri Apr 5 01:23:41 2024] Mem abort info:
[Fri Apr 5 01:23:41 2024] ESR = 0x96000046
[Fri Apr 5 01:23:41 2024] Exception class = DABT (current EL), IL = 32 bits
[Fri Apr 5 01:23:41 2024] SET = 0, FnV = 0
[Fri Apr 5 01:23:41 2024] EA = 0, S1PTW = 0
[Fri Apr 5 01:23:41 2024] Data abort info:
[Fri Apr 5 01:23:41 2024] ISV = 0, ISS = 0x00000046
[Fri Apr 5 01:23:41 2024] CM = 0, WnR = 1
[Fri Apr 5 01:23:41 2024] user pgtable: 4k pages, 39-bit VAs, pgd = 00000000ca31e3b8
[Fri Apr 5 01:23:41 2024] [0000000000000000] *pgd=000000025b1d1003, *pud=000000025b1d1003, *pmd=0000000000000000
[Fri Apr 5 01:23:41 2024] Internal error: Oops: 96000046 [#1] PREEMPT SMP
[Fri Apr 5 01:23:41 2024] Modules linked in: fuse xt_conntrack ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xt_addrtype iptable_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack br_netfilter zram cdc_acm overlay userspace_alert binfmt_misc veyecam nvgpu ip_tables x_tables
[Fri Apr 5 01:23:41 2024] CPU: 3 PID: 8996 Comm: vi-output, veye Not tainted 4.9.337-tegra #1
[Fri Apr 5 01:23:41 2024] Hardware name: NVIDIA Jetson Xavier NX Developer Kit (DT)
[Fri Apr 5 01:23:41 2024] task: 000000006596177f task.stack: 000000004c535bec
[Fri Apr 5 01:23:41 2024] PC is at __memcpy+0x110/0x180
[Fri Apr 5 01:23:41 2024] LR is at tegra_channel_kthread_capture_enqueue+0x14c/0x498
[Fri Apr 5 01:23:41 2024] pc : [<ffffff800845fed0>] lr : [<ffffff8008b3c6a4>] pstate: 20c00045
[Fri Apr 5 01:23:41 2024] sp : ffffffc0bb0abd40
[Fri Apr 5 01:23:41 2024] x29: ffffffc0bb0abd40 x28: ffffffc1f4db6018
[Fri Apr 5 01:23:41 2024] x27: 0000000000000000 x26: 0000000000000f00
[Fri Apr 5 01:23:41 2024] x25: ffffff80301f0000 x24: ffffffc0bb0abdf8
[Fri Apr 5 01:23:41 2024] x23: ffffffc1f4db6934 x22: ffffffc1b5af6000
[Fri Apr 5 01:23:41 2024] x21: 0000000000000001 x20: ffffffc1f4db69f8
[Fri Apr 5 01:23:41 2024] x19: 0000000000000000 x18: 0000000000000400
[Fri Apr 5 01:23:41 2024] x17: 0000007f9382d4b0 x16: 0000000000000001
[Fri Apr 5 01:23:41 2024] x15: 0000000000000219 x14: 0000000000000000
[Fri Apr 5 01:23:41 2024] x13: 0000000000000000 x12: 0000000000000000
[Fri Apr 5 01:23:41 2024] x11: 0000000000000000 x10: 0000000000000000
[Fri Apr 5 01:23:41 2024] x9 : 0000000000000000 x8 : 0000000000000000
[Fri Apr 5 01:23:41 2024] x7 : 0000000300000000 x6 : 0000000000000000
[Fri Apr 5 01:23:41 2024] x5 : 00000004bfc00000 x4 : 0000000000000000
[Fri Apr 5 01:23:41 2024] x3 : 0000000000000000 x2 : 0000000000000240
[Fri Apr 5 01:23:41 2024] x1 : ffffff8009092980 x0 : 0000000000000000
[Fri Apr 5 01:23:41 2024] Process vi-output, veye (pid: 8996, stack limit = 0x000000004c535bec)
[Fri Apr 5 01:23:41 2024] Call trace:
[Fri Apr 5 01:23:41 2024] [<000000004dcbfb33>] __memcpy+0x110/0x180
[Fri Apr 5 01:23:41 2024] [<00000000cc923fd9>] kthread+0xec/0xf0
[Fri Apr 5 01:23:41 2024] [<000000002de8d5bf>] ret_from_fork+0x10/0x30
[Fri Apr 5 01:23:41 2024] ---[ end trace 6389bf17dc0e47e6 ]---
Could this be an issue with the VEYE driver?
Thank you!
-
@newleaf After some high-level research, this may require an update to JetPack 5. Seeing some issues reported about vi-output misbehaving.
-
@newleaf Adding an update here. Upon repeated testing, we found that the issue described above happens after about 16348 still image captures (individual camera capture invocations). At that point the camera becomes unresponsive until the system is rebooted.
Being that this number is close to 2^14 (16384), is there a register or counter that is overflowing/wrapping? Perhaps a bug in the driver or camera firmware?
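Just to illustrate the hunch (purely speculative, we have no visibility into the driver or firmware internals): a 14-bit counter would wrap back to zero right around that capture count, e.g.:

```go
package main

import "fmt"

func main() {
	// Hypothetical 14-bit counter: valid values 0..16383, wraps after that.
	const mask = 1<<14 - 1 // 0x3FFF
	for _, n := range []uint32{16382, 16383, 16384, 16385} {
		fmt.Printf("capture %5d -> counter %5d\n", n, n&mask)
	}
}
```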
After speaking with one of our engineers, he noted that the NULL pointer dereference at virtual address 00000000 could simply be a side effect of an internal issue. We will be exploring an upgrade to JetPack 5 and continue testing.
Thank you!
-
@newleaf Hi, have you made any further progress on this issue?
-
@veye_xumm We haven’t had time to try JP 5. It is definitely related to exactly 16348 camera invocations, that is, starting and stopping streaming. We actually took a different approach for the 1-2 second rapid timelapse images and utilized the dropped-frame and capture modes via I2C and Go. It effectively delays the issue a bit, as we only need to invoke streaming when a motion event is detected via a radar sensor. Regardless, it’s still a mystery and reproducible on JP 4.6. I was going to try another camera, like a Pi HQ, to see if it’s firmware or OS related. Any suggestions or thoughts?
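For what it's worth, the I2C side of that workaround was straightforward to do from Go against /dev/i2c-N directly. A rough sketch of the idea follows; the bus number, camera address, and register/value below are placeholders, the real ones come from the VEYE register documentation.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

const i2cSlave = 0x0703 // I2C_SLAVE ioctl from linux/i2c-dev.h

// writeReg writes a single 8-bit value to a 16-bit register address over
// /dev/i2c-<bus>. Bus, address, and register values used in main are
// placeholders; consult the VEYE register map for the real ones.
func writeReg(bus int, addr uint8, reg uint16, val uint8) error {
	f, err := os.OpenFile(fmt.Sprintf("/dev/i2c-%d", bus), os.O_RDWR, 0)
	if err != nil {
		return err
	}
	defer f.Close()

	// Point the bus at the camera's 7-bit slave address.
	if _, _, errno := syscall.Syscall(syscall.SYS_IOCTL, f.Fd(), i2cSlave, uintptr(addr)); errno != 0 {
		return errno
	}
	// 16-bit register address (big endian) followed by the value byte.
	_, err = f.Write([]byte{byte(reg >> 8), byte(reg), val})
	return err
}

func main() {
	// Placeholder values: bus 9, address 0x3b, register 0x0000, value 0x01.
	if err := writeReg(9, 0x3b, 0x0000, 0x01); err != nil {
		fmt.Fprintln(os.Stderr, "i2c write failed:", err)
		os.Exit(1)
	}
}
```

We wrapped this kind of write with the mode/trigger registers from the VEYE docs, so streaming only has to be started when the radar reports motion.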
-
@newleaf BTW, the I2C features are nice and work quite well. Was able to port the code over to Go.