Each load into CarbonData is written into a separate folder called Segment.Segments is a powerful concept which helps to maintain consistency of data and easy transaction management.CarbonData provides DML (Data Manipulation Language) commands to maintain the segments.
This command is used to list the segments of CarbonData table.
SHOW [HISTORY] SEGMENTS
[FOR TABLE | ON] [db_name.]table_name [INCLUDE STAGE] [LIMIT number_of_segments]
[AS (select query from table_name_segments)]
By default, SHOW SEGMENT command will return following fields:
-
Segment ID
-
Segment Status
-
Load Start Time
-
Load Time Taken
-
Partition
-
Data Size
-
Index Size
Example: Show visible segments
SHOW SEGMENTS ON CarbonDatabase.CarbonTable
Show 10 visible segments with the largest segmentid
SHOW SEGMENTS ON CarbonDatabase.CarbonTable LIMIT 10
Show all segments, include invisible segments
SHOW HISTORY SEGMENTS ON CarbonDatabase.CarbonTable
Show all segments, include stages Stage status is 'Unload' or 'Loading', and stage ID, stage load time taken is null.
SHOW SEGMENTS ON CarbonDatabase.CarbonTable INCLUDE STAGE
When more detail of the segment is required, user can issue SHOW SEGMENT by query.
The query should be against table name with '_segments' appended and select from following fields:
-
id: String, the id of the segment
-
status: String, status of the segment
-
loadStartTime: String, loading start time
-
loadEndTime: String, loading end time
-
timeTakenMs: Long, time spent in loading of the segment in milliseconds
-
partitions: String array, partition key and values
-
dataSize: Long, data size in bytes
-
indexSize: Long, index size in bytes
-
mergedToId: String, the target segment that this segment has been compacted
-
format: String, data format of the segment
-
path: String, in case of external segment this will be the path of the segment, otherwise it is null
-
segmentFileName: String, name of the segment file
Example:
SHOW SEGMENTS ON CarbonTable AS SELECT * FROM CarbonTable_segments SHOW SEGMENTS ON CarbonTable LIMIT 10 AS SELECT * FROM CarbonTable_segments SHOW SEGMENTS ON CarbonTable AS SELECT id, dataSize FROM CarbonTable_segments WHERE status='Success' ORDER BY dataSize SHOW SEGMENTS ON CarbonTable AS SELECT avg(timeTakenMs) FROM CarbonTable_segments SHOW SEGMENTS ON CarbonTable INCLUDE STAGE AS SELECT avg(timeTakenMs) FROM CarbonTable_segments
This command is used to delete segment by using the segment ID. Each segment has a unique segment ID associated with it. Using this segment ID, you can remove the segment.
The following command will get the segmentID.
SHOW SEGMENTS FOR TABLE [db_name.]table_name LIMIT number_of_segments
After you retrieve the segment ID of the segment that you want to delete, execute the following command to delete the selected segment.
DELETE FROM TABLE [db_name.]table_name WHERE SEGMENT.ID IN (segment_id1, segments_id2, ...)
Example:
DELETE FROM TABLE CarbonDatabase.CarbonTable WHERE SEGMENT.ID IN (0)
DELETE FROM TABLE CarbonDatabase.CarbonTable WHERE SEGMENT.ID IN (0,5,8)
This command will allow to delete the CarbonData segment(s) from the store based on the date provided by the user in the DML command. The segment created before the particular date will be removed from the specific stores.
DELETE FROM TABLE [db_name.]table_name WHERE SEGMENT.STARTTIME BEFORE DATE_VALUE
Example:
DELETE FROM TABLE CarbonDatabase.CarbonTable WHERE SEGMENT.STARTTIME BEFORE '2017-06-01 12:05:06'
This command is used to read data from specified segments during CarbonScan.
Get the Segment ID:
SHOW SEGMENTS FOR TABLE [db_name.]table_name LIMIT number_of_segments
Set the segment IDs for table
SET carbon.input.segments.<database_name>.<table_name> = <list of segment IDs>
NOTE: carbon.input.segments: Specifies the segment IDs to be queried. This property allows you to query specified segments of the specified table. The CarbonScan will read data from specified segments only.
If user wants to query with segments reading in multi-threading mode, then CarbonSession.threadSet can be used instead of SET query.
CarbonSession.threadSet ("carbon.input.segments.<database_name>.<table_name>","<list of segment IDs>");
Reset the segment IDs
SET carbon.input.segments.<database_name>.<table_name> = *;
If user wants to query with segments reading in multi-threading mode, then CarbonSession.threadSet can be used instead of SET query.
CarbonSession.threadSet ("carbon.input.segments.<database_name>.<table_name>","*");
Examples:
- Example to show the list of segment IDs, segment status, and other required details and then specify the list of segments to be read.
SHOW SEGMENTS FOR carbontable1;
SET carbon.input.segments.db.carbontable1 = 1,3,9;
- Example to query with segments reading in multi-threading mode:
CarbonSession.threadSet ("carbon.input.segments.db.carbontable_Multi_Thread","1,3");
- Example for threadset in multi-thread environment (following shows how it is used in Scala code):
def main(args: Array[String]) {
Future {
CarbonSession.threadSet ("carbon.input.segments.db.carbontable_Multi_Thread","1")
spark.sql("select count(empno) from carbon.input.segments.db.carbontable_Multi_Thread").show();
}
}