UDP connection loss after migration from EHS6 to ELS61 | Thales IoT Developer Community
October 13, 2021 - 7:43pm, 4418 views
Hello again,
We recently migrated our data logging Java app from the EHSx to the ELS61 platform. Not many changes were needed; we can still build the app from the same codebase, just switching the WTK.
We use the UDP protocol for data transfer from module to server, and we have been using it very successfully for several years without issues on the EHSx platform (even on TC65i). But now, moving to ELS61, we are noticing connection failures occurring every few hours. On previous platforms, this event was so rare that we just let the watchdog kick in and reset the device an hour after the last data exchange.
On ELS61, this happens several times daily, and such frequent restarts look bad on the stability statistics charts.
Most of the time, our app is waiting in DatagramConnection.receive method. Every few hours, there are no received datagrams for 5 minutes (should come at least every minute) and then the method throws an InterruptedIOException. On retry, the exception is thrown instantly.
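To make the two cases described above easier to tell apart in logs (a real timeout after the full wait versus an exception thrown instantly on retry), the receive call can be timed. The following is only a minimal sketch that runs on a desktop JVM: it uses java.net.DatagramSocket as a stand-in for the module's DatagramConnection, and the class name ReceiveGuard, the constants and the "less than half the timeout" threshold are assumptions made for illustration, not part of our app or of the module API.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.net.DatagramPacket;
import java.net.DatagramSocket;

// Hypothetical helper: names and threshold are illustrative assumptions.
public class ReceiveGuard {
    static final int RECEIVED = 0;    // a datagram arrived
    static final int TIMED_OUT = 1;   // exception after (roughly) the full timeout
    static final int FAILED_FAST = 2; // exception came back almost immediately

    // Classifies an InterruptedIOException by how long receive() blocked.
    // Assumption: anything shorter than half the timeout counts as "instant".
    static int classify(long elapsedMs, long timeoutMs) {
        return elapsedMs < timeoutMs / 2 ? FAILED_FAST : TIMED_OUT;
    }

    // One receive attempt. SocketTimeoutException is a subclass of
    // InterruptedIOException, so the catch below covers the Java SE timeout.
    static int receiveOnce(DatagramSocket socket, int timeoutMs) throws IOException {
        socket.setSoTimeout(timeoutMs);
        byte[] buffer = new byte[512];
        long start = System.currentTimeMillis();
        try {
            socket.receive(new DatagramPacket(buffer, buffer.length));
            return RECEIVED;
        } catch (InterruptedIOException e) {
            return classify(System.currentTimeMillis() - start, timeoutMs);
        }
    }
}
```

A FAILED_FAST result would be the signal to close and reopen the connection (or escalate to a network reset) instead of retrying receive in a tight loop.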
So, is there a setting, maybe a timer, that would not get triggered by UDP-only data transfer? We are seeing this behavior on 50+ devices, so it's not just one piece gone rogue.
Next issue is trying to work around it by resetting the network on InterruptedIOException and not triggering the watchdog.
We tried with AT+COPS=2, then AT+COPS=0, but more often than not, the module freezes at AT+COPS=2. As we don't use ATCommandListener yet, that essentially freezes our comm thread, and again, we wait for the watchdog.
We tried with AT+CFUN=4 followed by AT+CFUN=1; that put the module permanently in Airplane mode, and even a restart didn't help. It was as if AT+CFUN=1 was never executed. We don't want to go that way. If this happens in the field, we're doomed. The watchdog is no help here...
Next is to try only with AT+CGACT=0/AT+CGACT=1, maybe that will be enough...
So, what say you. What is different in ELS61 UDP or what else can I try to make it just work all the time?
Thank you in advance for any comments or suggestions!
Jure
Hello,
The signal quality isn't too good or sometimes it's really poor.
The unchanging SMONI output is really suspicious.
You can force 3G with AT^SXRAT=2
You may also try AT+CEER command when this happens to possibly see any error information (send AT+CEER=0 before the test to clear the record). Or set some more indicators:
AT^SIND="psinfo",1
AT^SIND="lsta",1
AT^SIND="service",1
AT^SIND="ceer",1
AT+CREG=2
AT+CEREG=2
AT+CGREG=2
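The indicator setup above could be applied from the MIDlet at startup, roughly as sketched below. This is an illustration under assumptions: AtSender is a hypothetical one-method interface standing in for whatever wraps the module's AT command channel (for example ATCommand.send), the class name DiagnosticSetup is made up, and the pause between commands is a parameter chosen by the caller. Only the command strings come from the list above.

```java
// Hypothetical sketch: AtSender and DiagnosticSetup are illustrative names.
public class DiagnosticSetup {
    // Stand-in for the AT channel; on the module this would wrap the real AT interface.
    interface AtSender {
        String send(String command);
    }

    // The indicator and registration URC commands listed above.
    static final String[] COMMANDS = {
        "AT^SIND=\"psinfo\",1",
        "AT^SIND=\"lsta\",1",
        "AT^SIND=\"service\",1",
        "AT^SIND=\"ceer\",1",
        "AT+CREG=2",
        "AT+CEREG=2",
        "AT+CGREG=2"
    };

    // Sends every command, pausing between them, and returns how many answered OK.
    static int enableAll(AtSender at, long pauseMs) {
        int accepted = 0;
        for (int i = 0; i < COMMANDS.length; i++) {
            String response = at.send(COMMANDS[i] + "\r");
            if (response != null && response.indexOf("OK") >= 0) {
                accepted++;
            }
            try {
                Thread.sleep(pauseMs); // leave a gap before the next command
            } catch (InterruptedException e) {
                break; // stop early if the thread is interrupted
            }
        }
        return accepted;
    }
}
```

A return value lower than the number of commands would mean at least one indicator was rejected, which is worth logging before the test run starts.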
If it turns out that this happens in LTE only you could then experiment with bands restricting with AT^SCFG="Radio/Band/4G",
Anyway the reliable diagnosis of this issue may be impossible without the access to module traces.
Regards,
Bartłomiej
Hello,
AT^SXRAT=2 is not documented? Or is my AT Command Spec too old?
My tests show the AT^SXRAT=1 selects 3G?
I ran the module for a day with AT^SXRAT=0 (2G) and indeed it was better (no silent disconnects).
Now, since our devices will probably often be located at remote uninhabited places (overhead power lines), I don't really expect perfect conditions. So having the watchdog enabled is a solid bottom line, but every module restart breaks the continuity of measurements. So I thought, OK, when I reach 5 min without a comm reply, I shut down the UDP connection and reconnect to the data service; the measurement thread is not interrupted, all is well.
But! If this situation occurs (5 min without data transfer), I verified that SMS receive/send is not working, the BearerControl listener is silent, and if I issue either AT+COPS=2 or AT+CFUN=4 in a terminal, I never regain the command prompt: the terminal is stuck. When I let the program do it automatically, the calling thread is stalled, pending a watchdog trigger again. In the case of AT+CFUN=4 this is critical, because the command is evidently executed: on reboot, Airplane mode is in effect, so remote access is permanently disabled.
What can I do to solve this? If I could reset the network connection with AT+COPS=2/AT+COPS=0 and if that could reenable the data transfer, I'd be all set...
I'll try a 3G only run today...
Hello,
AT^SXRAT=2 must be documented, I believe. Please share the document ID if not.
Setting 1 is GSM/UMTS dual mode and 2 is UMTS. You might have the AT spec for release 1, but then there would only be 2G and 4G available.
I'm really surprised that you experience such regular problems with these commands. Normally I can send these commands and everything works. It must have something to do with the particular problem that you are facing. I'm really curious about the results in 3G. Maybe it could be a solution. In desert areas there may be no reliable 4G anyway.
Please also check if it is possible to execute commands by another ATCommand instance when this happens. Maybe it could help to reboot the module earlier. Or you could shorten the watchdog timer before executing these rescue commands or use another Watchdog2 instance.
Best regards,
Bartłomiej
Hello again,
I've checked my AT Command Set document ID (ELS61-E2_ATC_V01.000) and it is indeed old. I've found a newer spec on https://files.c-wm.net/index.php/s/GRPgoz5m7a73c54 under ELS61 for rev2, but the PDF seems corrupted! Is there another link for it?
I've looked at the PLS62 spec, which should be similar, and there the SXRAT values are as you explained.
Yes, with watchdog it works, but I would like to avoid the data loss caused by restarts. I think I'll try switching SXRAT states (3G to 4G and back) when data transfer stops, maybe that will reestablish the connection.
I'll report with new findings..
Regards,
Jure
As for the AT spec I checked and you are right - the file is corrupted. But you can also download it from here: https://www.thalesgroup.com/en/markets/digital-identity-and-security/iot...
Hello,
It's time to update this issue a bit. I've actually implemented automatic RAT switching with SXRAT, albeit without much improvement in the system's operation. I still see several terminals in the field losing data connectivity, which is only restored by a watchdog reset after a configured timeout.
I had no success using the AT+COPS method to disconnect and reconnect the network after a sudden loss of communication, because the AT+COPS=2 disconnect command frequently freezes the calling thread in that state. I have no clue why it happens.
Testing these scenarios is challenging because most critical situations only occur frequently on devices in the field, which are practically not accessible, whereas in lab environment, things just work.
Would setting a fixed access technology (2G, 3G, 4G) with AT+COPS have better results? I'm a bit wary of doing it, for fear of setting a technology that is not available at the specific location and thus locking myself out.
Thank you for any further insights on this issue...
Jure
Hello,
It seems that we have two different issues here. One with the network connectivity and the other one with AT commands. Or maybe these two are related but for now it's hard to say.
As for the AT commands: maybe it will not help, but please make sure that you don't use cwmlib_1.0.jar in your project. This lib for AT commands was only necessary for older firmware revisions. Now it's a part of the module's software. The module would use the built-in implementation (which is the desired behavior) if the MIDlet is not obfuscated. Additionally, please make sure that there is a break of at least 100 ms between AT commands executed on a single instance or interface (there is such a restriction/recommendation somewhere in the AT commands spec document).
As we don't know why the command execution hangs in your app, maybe for test/debug purposes, or just to exclude as many potential issues as possible, you could open a separate ATCommand instance for the purpose of applying the workaround.
As for the connectivity issue, it seems to be somehow dependent on the network, as you reproduce it much more easily in the field than in the office. But as I have probably written before, module tracing would be necessary to investigate the issue. So you would have to contact your supplier to report it as a potential defect to Thales. Of course, any fixed RAT restriction would decrease the product's flexibility, at least until you develop some kind of intelligent solution.
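One way to keep a hanging rescue command from stalling the comm thread is to issue it from a helper thread and wait for the answer only up to a deadline. The sketch below is an assumption-laden illustration, not a tested module implementation: AtSender is a hypothetical interface standing in for the separate ATCommand instance, TimedAt and sendWithTimeout are made-up names, and only plain Thread and wait/notify are used so that it does not depend on java.util.concurrent.

```java
// Hypothetical sketch: AtSender and TimedAt are illustrative names.
public class TimedAt {
    // Stand-in for a dedicated AT channel used only for rescue commands.
    interface AtSender {
        String send(String command);
    }

    // Returns the response, or null if none arrived within timeoutMs.
    static String sendWithTimeout(final AtSender at, final String command, long timeoutMs) {
        final String[] result = new String[1];
        final Object lock = new Object();

        Thread worker = new Thread(new Runnable() {
            public void run() {
                String response = at.send(command); // may block for a long time
                synchronized (lock) {
                    result[0] = response;
                    lock.notifyAll();
                }
            }
        });
        worker.start();

        long deadline = System.currentTimeMillis() + timeoutMs;
        synchronized (lock) {
            while (result[0] == null) {
                long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0) {
                    break; // give up; the worker may still be stuck
                }
                try {
                    lock.wait(remaining);
                } catch (InterruptedException e) {
                    break;
                }
            }
            return result[0];
        }
    }
}
```

A null result means the command did not answer in time. The worker thread (and the channel it is using) may remain blocked in that case, so the caller would still need a fallback such as shortening the watchdog timer or rebooting earlier.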
BR,
Bartłomiej