In this scenario the provided download measurement data is being ingested into the SamKnows big data platform.
Your task as a data engineer is to write a basic data integrity application to ensure we are consistently importing valid and accurate data.
The end result of this test should be a report outlining anomalies found in the data to be investigated.
Please provide us with a public github (or similar) link to your solution.
We recommend that you spend no longer than 2 hours on this test
You can find a gzipped CSV file containing the data here
Column | Type | Description |
---|---|---|
dtime | datetime | The datetime the test was run |
device_identifier | string | UUID representing the device |
bytes_total | int | The total number of bytes transferred during the test |
bytes_per_second | int | The bytes per second calculated for the test |
successes | int | Number of successful downloads (in this context 1 or 0) |
failures | int | number of failed attempts (in this context 1 or 0) |
We would like for you to use ONE of the following approaches:
- Use a high level programming language of your choice (e.g. PHP, Python, Go) to produce an application to identify anomalies
- Import the data set into a databasing technology and use SQL to identify anomalies.
- A well defined solution: what assumptions did you make, why did you choose this approach and how do we run it?
- A clear report outlining the anomalies found in the data
- The effieciency of the approach to finding the anomalies
- Given more time, what would you do to improve your solution?
Good luck!