Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GPS Data Intermittently Corrupted #68

Open
kb1lqc opened this issue Feb 26, 2017 · 11 comments
Open

GPS Data Intermittently Corrupted #68

kb1lqc opened this issue Feb 26, 2017 · 11 comments
Assignees
Labels

Comments

@kb1lqc
Copy link
Member

kb1lqc commented Feb 26, 2017

H1. Summary
@jbemenheiser and I have now both independently observed this error where Telemetry stops saving data to the database. Using https://github.com/kb1lqc/Faraday-Software/tree/issue113 branch of the software which provides better telemetry parsing and APRS operation I was able to get much better diagnostic information and reliable APRS operation overnight.

When an error occurs in Telemetry.py in the function sqlInsert() the following exception is caught:

    except conn.ProgrammingError as e:
        logger.error(e)
        logger.error(telem)
conn.rollback()

This allows us to obtain the following error diagnostic information:

2017-02-26 09:57:11,665 - telemetry - ERROR - You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.
2017-02-26 09:57:11,667 - telemetry - ERROR - [None, 'KB1LQC', 2, 'KB1LQC', 1, 3, 57, 17, 26, 0, 2, 57607, '3359.7027', 'N', '11826.9555', 'W', '84.50500', 'M', '1.43\xac', '2', '1.15', 192, 0, 2376, 2369, 2349, 2309, 2161, 2140, 1143, 21, 2846, 0, 0, 7200, 0, 1488131831.665]

Clearly the GPS HDOP value of '1.43\xac' is causing our problem.

The timeline is sporadic. @jbemenheiser was able to see this error regularly while it took my test 9 hours, 40 minutes and 26 seconds to manifest! (Epoch 1488095638.345 to 1488098960.392)

@kb1lqc
Copy link
Member Author

kb1lqc commented Feb 26, 2017

Confirmed this is a firmware bug since restarting Proxy and Telemetry applications without restarting the Faraday radio SN 7 with GPS still resulted in:

2017-02-26 10:17:28,632 - telemetry - ERROR - [None, 'KB1LQC', 2, 'KB1LQC', 1, 24, 17, 18, 26, 0, 2, 57607, '3359.6972', 'N', '11826.9382', 'W', '19.50500', 'M', '1.47\xac', '2', '0.93', 192, 0, 2400, 2368, 2334, 2224, 2233, 2214, 1143, 21, 2856, 0, 0, 7200, 0, 1488133048.632]

Next up is restarting Faraday:

Yup once GPS lock was achieve this is the first telemetry packet received from Faraday after a reboot:

"6531" "6531" "KB1LQC" "2" "KB1LQC" "1" "9" "0" "0" "26" "0" "2" "57607" "3359.7626" "N" "11826.8578" "W" "183.9" "M" "0.23" "1" "8.39" "192" "0" "1693" "1550" "1721" "1555" "1514" "1527" "1141" "20" "2849" "0" "0" "7200" "0" "1488133298.287"

The HDOP value was "8.39" which is a valid float. This is most definitely a firmware bug @kb1lqd !

@kb1lqc kb1lqc added the bug label Feb 26, 2017
@kb1lqc kb1lqc added this to the Initial Release milestone Feb 26, 2017
@jbemenheiser
Copy link

jbemenheiser commented Feb 26, 2017

I have seen other issues with values other than GPS HDOP.

The longitude below for example.
2017-02-26 10:36:24,357 - telemetry - ERROR - [None, 'KB3SVQ', 7, 'KB3SVQ', 1, 22, 36, 18, 26, 0, 2, 57607, '3358.4664', 'N', '11822.7\x90\x9c\x99', '\xab', '\xb28.52000', 'M', '0.020', '2', '1.01', 192, 0, 2493, 2260, 2354, 2278, 2279, 2230, 996, 30, 2847, 0, 0, 7200, 0, 1488134184.355]

Or altitude:
2017-02-26 10:37:42,339 - telemetry - ERROR - [None, 'KB3SVQ', 7, 'KB3SVQ', 1, 40, 37, 18, 26, 0, 2, 57607, '3358.4665', 'N', '11822.7083', 'W', '28\x9e\x9a\x98\xb000', 'M', '0.020', '2', '1.66', 192, 0, 2493, 2278, 2422, 2300, 2292, 2248, 995, 29, 2856, 0, 0, 7200, 0, 1488134262.337]

@kb1lqc
Copy link
Member Author

kb1lqc commented Feb 26, 2017

@jbemenheiser sounds like classic buffer overflow to me

@kb1lqc
Copy link
Member Author

kb1lqc commented Feb 26, 2017

@jbemenheiser just noticed that you had several items blown out there: '11822.7\x90\x9c\x99', '\xab', '\xb28.52000',

That's looks like a continuous portion of memory just got overwritten. Spanning several telemetry items.

Can you confirm that this is persistent as in the values do not change until reboot?

@jbemenheiser
Copy link

This is not a persistent error. Using @kb1lqc 's conn.rollback() code above, good packets get logged after the exception is thrown for bad packets. All of the following 'bad' packets were logged in one telemetry session over the course of two hours without a power cycle. (I removed the error messages after the first one)

2017-02-26 10:29:51,434 - telemetry - INFO - Starting telemetry server
2017-02-26 10:29:51,479 - telemetry - INFO - Starting telemetry_worker thread
2017-02-26 10:29:51,497 - werkzeug - INFO -  * Running on http://127.0.0.1:8001/ (Press CTRL+C to quit)
2017-02-26 10:36:24,355 - telemetry - ERROR - You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.
2017-02-26 10:36:24,357 - telemetry - ERROR - [None, 'KB3SVQ', 7, 'KB3SVQ', 1, 22, 36, 18, 26, 0, 2, 57607, '3358.4664', 'N', '11822.7\x90\x9c\x99', '\xab', '\xb28.52000', 'M', '0.020', '2', '1.01', 192, 0, 2493, 2260, 2354, 2278, 2279, 2230, 996, 30, 2847, 0, 0, 7200, 0, 1488134184.355]
2017-02-26 10:37:42,339 - telemetry - ERROR - [None, 'KB3SVQ', 7, 'KB3SVQ', 1, 40, 37, 18, 26, 0, 2, 57607, '3358.4665', 'N', '11822.7083', 'W', '28\x9e\x9a\x98\xb000', 'M', '0.020', '2', '1.66', 192, 0, 2493, 2278, 2422, 2300, 2292, 2248, 995, 29, 2856, 0, 0, 7200, 0, 1488134262.337]
2017-02-26 11:25:06,982 - telemetry - ERROR - [None, 'KB3SVQ', 1, 'KB3SVQ', 1, 1, 0, 0, 0, 2, 0, 53255, '\x93\x9b\x9b\x98.4650', 'N', '11822.7050', 'W', '000030.0', 'M', '00000', '0', '0000', 192, 0, 2328, 2116, 2219, 2147, 2216, 2179, 0, 28, 2823, 0, 0, 7200, 0, 1488137106.981]
2017-02-26 11:28:18,884 - telemetry - ERROR - [None, 'KB3SVQ', 7, 'KB3SVQ', 1, 16, 28, 19, 26, 0, 2, 57607, '3358.4575', 'N', '11822.7050', 'W', '26.12000', 'M', '0.\x90\x9b\x98', '\xb2', '1.07', 192, 0, 2489, 2277, 2427, 2271, 2229, 2162, 984, 31, 2861, 0, 0, 7200, 0, 1488137298.883]
2017-02-26 12:15:36,437 - telemetry - ERROR - [None, 'KB3SVQ', 7, 'KB3SVQ', 1, 34, 15, 20, 26, 128, 130, 61571, '\xb3358.4600', 'N', '11822.7044', 'W', '32.62000', 'M', '0.030', '2', '1.21', 192, 0, 2433, 2295, 2481, 2281, 2208, 2157, 968, 30, 2848, 0, 0, 7200, 0, 1488140136.435]
2017-02-26 12:20:36,842 - telemetry - ERROR - [None, 'KB3SVQ', 7, 'KB3SV\x91\xff\x80\x80', 1, 34, 20, 20, 26, 0, 2, 57607, '3358.4605', 'N', '11822.7052', 'W', '34.02000', 'M', '0.010', '2', '1.21', 192, 0, 2435, 2220, 2196, 2264, 2261, 2190, 961, 30, 2858, 0, 0, 7200, 0, 1488140436.84]
2017-02-26 12:30:19,019 - telemetry - ERROR - [None, 'KB3SVQ', 7, 'KB3SVQ', 1, 16, 30, 20, 26, 0, 2, 57607, '3358.4629', 'N', '11822.7054', 'W', '35.3200\x90', '\xa5', '\x98\x96000', '2', '1.20', 192, 0, 2441, 2199, 2126, 2231, 2256, 2281, 960, 29, 2848, 0, 0, 7200, 0, 1488141019.018]
2017-02-26 12:32:18,858 - telemetry - ERROR - [None, 'KB3SVQI"\xd8', 131, '\xa5\xc23SVQ', 1, 16, 32, 20, 26, 0, 2, 57607, '3358.4629', 'N', '11822.7055', 'W', '35.32000', 'M', '0.010', '2', '0.92', 192, 0, 2435, 2180, 2040, 2194, 2218, 2259, 959, 27, 2847, 0, 0, 7200, 0, 1488141138.857]

@jbemenheiser
Copy link

Looking at the data that was logged to the database during the session I mentioned above, I also found some values that were obviously corrupt, by did not cause exceptions.

Board temp of 32795 at EPOCH 1488143280.656
image

@kb1lqc
Copy link
Member Author

kb1lqc commented Feb 26, 2017 via email

@jbemenheiser
Copy link

jbemenheiser commented Feb 26, 2017

Only for a second...

I did notice that in this line (third from the bottom in the session above) it is the destination callsign that is corrupt.
2017-02-26 12:20:36,842 - telemetry - ERROR - [None, 'KB3SVQ', 7, 'KB3SV\x91\xff\x80\x80', 1, 34, 20, 20, 26, 0, 2, 57607, '3358.4605', 'N', '11822.7052', 'W', '34.02000', 'M', '0.010', '2', '1.21', 192, 0, 2435, 2220, 2196, 2264, 2261, 2190, 961, 30, 2858, 0, 0, 7200, 0, 1488140436.84]

@kb1lqc
Copy link
Member Author

kb1lqc commented Feb 26, 2017

Oh weird... now that is not indicative of the RF unit being corrupt but of the local unit. Which makes sense, they are running the same firmware and it's more likely to corrupt a GPS position since there's simply more bytes in memory available to corrupt in those packets. Good catch @jbemenheiser!

@kb1lqd
Copy link
Contributor

kb1lqd commented Mar 2, 2017

Sweet, ok so two things:

  • I don't check the NMEA checksum/CRC/whatever bytes at the end of the string to verify good NMEA receptions
  • Unlikely but possible for telemetry packet itself to be corrupt
    • I check checksum on each packet into Proxy in the UART protocol stack

Judging that you are seeing board temps and other corruptions this indicates a wider fault than just GPS. I'm thinking either general corruption or a packet parsing offset error... I will look into this.

@kb1lqc
Copy link
Member Author

kb1lqc commented Mar 2, 2017 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants