We've been doing some stress testing with timelapse captures and inference. So far the Xavier+327S has been a great combo (just ordered 2 more 327S cams too).
We're running into an issue that could be resource related, and we'd like to know whether anyone else has seen it.
We ran a stress test capturing still images in a loop with a 250ms delay between captures. Each image is fed to an inference pipeline (TensorRT). Everything works well at first, and our services show no sign of memory leaks. After about 3 hours (approximately 21K captures), the video device becomes unusable and our service freezes. The device only works again after a reboot.
dmesg reports:
[Fri Apr 5 01:23:41 2024] tegra194-vi5 15c10000.vi: dma_alloc_coherent failed
[Fri Apr 5 01:23:41 2024] Unable to handle kernel NULL pointer dereference at virtual address 00000000
[Fri Apr 5 01:23:41 2024] Mem abort info:
[Fri Apr 5 01:23:41 2024] ESR = 0x96000046
[Fri Apr 5 01:23:41 2024] Exception class = DABT (current EL), IL = 32 bits
[Fri Apr 5 01:23:41 2024] SET = 0, FnV = 0
[Fri Apr 5 01:23:41 2024] EA = 0, S1PTW = 0
[Fri Apr 5 01:23:41 2024] Data abort info:
[Fri Apr 5 01:23:41 2024] ISV = 0, ISS = 0x00000046
[Fri Apr 5 01:23:41 2024] CM = 0, WnR = 1
[Fri Apr 5 01:23:41 2024] user pgtable: 4k pages, 39-bit VAs, pgd = 00000000ca31e3b8
[Fri Apr 5 01:23:41 2024] [0000000000000000] *pgd=000000025b1d1003, *pud=000000025b1d1003, *pmd=0000000000000000
[Fri Apr 5 01:23:41 2024] Internal error: Oops: 96000046 [#1] PREEMPT SMP
[Fri Apr 5 01:23:41 2024] Modules linked in: fuse xt_conntrack ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xt_addrtype iptable_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack br_netfilter zram cdc_acm overlay userspace_alert binfmt_misc veyecam nvgpu ip_tables x_tables
[Fri Apr 5 01:23:41 2024] CPU: 3 PID: 8996 Comm: vi-output, veye Not tainted 4.9.337-tegra #1
[Fri Apr 5 01:23:41 2024] Hardware name: NVIDIA Jetson Xavier NX Developer Kit (DT)
[Fri Apr 5 01:23:41 2024] task: 000000006596177f task.stack: 000000004c535bec
[Fri Apr 5 01:23:41 2024] PC is at __memcpy+0x110/0x180
[Fri Apr 5 01:23:41 2024] LR is at tegra_channel_kthread_capture_enqueue+0x14c/0x498
[Fri Apr 5 01:23:41 2024] pc : [<ffffff800845fed0>] lr : [<ffffff8008b3c6a4>] pstate: 20c00045
[Fri Apr 5 01:23:41 2024] sp : ffffffc0bb0abd40
[Fri Apr 5 01:23:41 2024] x29: ffffffc0bb0abd40 x28: ffffffc1f4db6018
[Fri Apr 5 01:23:41 2024] x27: 0000000000000000 x26: 0000000000000f00
[Fri Apr 5 01:23:41 2024] x25: ffffff80301f0000 x24: ffffffc0bb0abdf8
[Fri Apr 5 01:23:41 2024] x23: ffffffc1f4db6934 x22: ffffffc1b5af6000
[Fri Apr 5 01:23:41 2024] x21: 0000000000000001 x20: ffffffc1f4db69f8
[Fri Apr 5 01:23:41 2024] x19: 0000000000000000 x18: 0000000000000400
[Fri Apr 5 01:23:41 2024] x17: 0000007f9382d4b0 x16: 0000000000000001
[Fri Apr 5 01:23:41 2024] x15: 0000000000000219 x14: 0000000000000000
[Fri Apr 5 01:23:41 2024] x13: 0000000000000000 x12: 0000000000000000
[Fri Apr 5 01:23:41 2024] x11: 0000000000000000 x10: 0000000000000000
[Fri Apr 5 01:23:41 2024] x9 : 0000000000000000 x8 : 0000000000000000
[Fri Apr 5 01:23:41 2024] x7 : 0000000300000000 x6 : 0000000000000000
[Fri Apr 5 01:23:41 2024] x5 : 00000004bfc00000 x4 : 0000000000000000
[Fri Apr 5 01:23:41 2024] x3 : 0000000000000000 x2 : 0000000000000240
[Fri Apr 5 01:23:41 2024] x1 : ffffff8009092980 x0 : 0000000000000000
[Fri Apr 5 01:23:41 2024] Process vi-output, veye (pid: 8996, stack limit = 0x000000004c535bec)
[Fri Apr 5 01:23:41 2024] Call trace:
[Fri Apr 5 01:23:41 2024] [<000000004dcbfb33>] __memcpy+0x110/0x180
[Fri Apr 5 01:23:41 2024] [<00000000cc923fd9>] kthread+0xec/0xf0
[Fri Apr 5 01:23:41 2024] [<000000002de8d5bf>] ret_from_fork+0x10/0x30
[Fri Apr 5 01:23:41 2024] ---[ end trace 6389bf17dc0e47e6 ]---
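Since the first line of the trace is dma_alloc_coherent failed, one thing that might help narrow this down is logging memory headroom alongside the capture loop, to see whether anything trends toward zero before the crash. Below is a minimal sketch, not something from our actual test harness. It assumes the kernel exposes CmaTotal and CmaFree in /proc/meminfo (those fields are only present when CMA is enabled), and the function names parse_meminfo, cma_snapshot, and log_cma are made up for illustration.

```python
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: number} (kB for most fields)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip anything that is not "Name: value"
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def cma_snapshot(path="/proc/meminfo"):
    """Return the fields of interest; a value is None if the kernel does not report it."""
    with open(path) as f:
        info = parse_meminfo(f.read())
    return {key: info.get(key) for key in ("MemAvailable", "CmaTotal", "CmaFree")}


def log_cma(interval_s=60.0, samples=None):
    """Print a timestamped snapshot every interval_s seconds.

    samples=None means run until interrupted; an integer limits the number of lines.
    """
    count = 0
    while samples is None or count < samples:
        print(time.strftime("%Y-%m-%d %H:%M:%S"), cma_snapshot(), flush=True)
        count += 1
        if samples is None or count < samples:
            time.sleep(interval_s)
```

Running log_cma() in a separate terminal (or a background thread) during the 3 hour test would give a time series to compare against the moment the dmesg error appears. If the numbers stay flat right up to the failure, that would point away from simple memory exhaustion.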
Could this be an issue with the VEYE driver?
Thank you!