This project allows profiles to be executed using Apache Spark. This is a port of the Profiler to Spark that allows you to backfill profiles using archived telemetry.
Using the Streaming Profiler in Apache Storm allows you to create profiles based on the stream of telemetry being captured, enriched, triaged, and indexed by Metron. This does not allow you to create a profile based on telemetry that was captured in the past.
There are many cases where you might want to produce a profile from telemetry in the past. This is referred to as profile seeding or backfilling.
- As a Security Data Scientist, I want to understand the historical behaviors and trends of a profile so that I can determine if the profile has predictive value for model building.
- As a Security Platform Engineer, I want to generate a profile using archived telemetry when I deploy a new model to production so that models depending on that profile can function on day 1.
The Batch Profiler running in Apache Spark allows you to seed a profile using archived telemetry.
The portion of a profile produced by the Batch Profiler should be indistinguishable from the portion created by the Streaming Profiler. Consumers of the profile should not care how the profile was generated. Using the Streaming Profiler together with the Batch Profiler allows you to create a complete profile over a wide range of time.
For an introduction to the Profiler, see the Profiler README.
- If a profile file does not already exist, you can create a profile definition by editing `$METRON_HOME/config/zookeeper/profiler.json` as follows.

  ```
  cat $METRON_HOME/config/zookeeper/profiler.json
  {
    "profiles": [
      {
        "profile": "hello-world",
        "foreach": "'global'",
        "init":   { "count": "0" },
        "update": { "count": "count + 1" },
        "result": "count"
      }
    ],
    "timestampField": "timestamp"
  }
  ```
See Specifying profiles for information on how to load profile definitions from zookeeper.
- Ensure that you have archived telemetry available for the Batch Profiler to consume. By default, Metron will store this in HDFS at `/apps/metron/indexing/indexed/*/*`.

  ```
  hdfs dfs -cat /apps/metron/indexing/indexed/*/* | wc -l
  ```
- Copy the `hbase-site.xml` file from `/etc/hbase/conf` to `/etc/spark2/conf`. Creating a symlink is advised, both to avoid duplicating the file and to keep the two copies consistent as the configuration is updated.

  ```
  ln -s /etc/hbase/conf/hbase-site.xml /etc/spark2/conf/hbase-site.xml
  ```
- Review the Batch Profiler's properties located at `$METRON_HOME/config/batch-profiler.properties`. See Configuring the Profiler for more information on these properties.
- You may want to edit the log4j properties file in the config directory under `${SPARK_HOME}`, or create one if it does not exist. It may be helpful to turn on `DEBUG` logging for the Profiler by adding the following line.

  ```
  log4j.logger.org.apache.metron.profiler.spark=DEBUG
  ```
- Run the Batch Profiler.

  ```
  source /etc/default/metron
  cd $METRON_HOME
  $METRON_HOME/bin/start_batch_profiler.sh
  ```
- Query for the profile data using the Profiler Client.
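For example, the `hello-world` profile created above could be read with a `PROFILE_GET` call from the Stellar REPL. This is only a sketch; the four-hour lookback window is arbitrary and the REPL prompt may differ by version.

```
# start the Stellar REPL connected to zookeeper
$METRON_HOME/bin/stellar -z $ZOOKEEPER

# at the Stellar prompt, fetch the profile values written over the last 4 hours
[Stellar]>>> PROFILE_GET('hello-world', 'global', PROFILE_FIXED(4, 'HOURS'))
```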
The profile to use for batch processing can be specified either as a JSON file on disk or as a profile already loaded into zookeeper for use by the Streaming Profiler.
- If a profile file does not already exist, you can create a profile definition by editing `$METRON_HOME/config/zookeeper/profiler.json` as follows.

  ```
  cat $METRON_HOME/config/zookeeper/profiler.json
  {
    "profiles": [
      {
        "profile": "hello-world",
        "foreach": "'global'",
        "init":   { "count": "0" },
        "update": { "count": "count + 1" },
        "result": "count"
      }
    ],
    "timestampField": "timestamp"
  }
  ```
- When launching the Batch Profiler directly, use the `--profiles <path to profiler.json>` option. If the wrapper script is used to launch the Batch Profiler, it will automatically add the argument `--profiles $METRON_HOME/config/zookeeper/profiler.json` to the launch command if `$SPARK_PROFILER_USE_ZOOKEEPER` is not defined.
- To use profiles already loaded into zookeeper (e.g. for use by the Streaming Profiler), set the environment variable `$SPARK_PROFILER_USE_ZOOKEEPER`. This causes the wrapper script to add `--zookeeper $ZOOKEEPER` to the launch command, which makes the Spark profiler read its profiles from the zookeeper quorum located at `$ZOOKEEPER`.
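As a sketch, the two ways of specifying profiles look like the following when using the wrapper script; the value assigned to `SPARK_PROFILER_USE_ZOOKEEPER` is arbitrary, since the script only checks whether the variable is defined.

```
# read profile definitions from $METRON_HOME/config/zookeeper/profiler.json (the default)
unset SPARK_PROFILER_USE_ZOOKEEPER
$METRON_HOME/bin/start_batch_profiler.sh

# read profile definitions from the zookeeper quorum at $ZOOKEEPER
export SPARK_PROFILER_USE_ZOOKEEPER=true
$METRON_HOME/bin/start_batch_profiler.sh
```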
The Batch Profiler package is installed automatically when installing Metron using the Ambari MPack. See the following notes when installing the Batch Profiler without the Ambari MPack.
The Batch Profiler requires Spark version 2.3.0+.
- Build Metron.

  ```
  mvn clean package -DskipTests -T2C
  ```
- Build the RPMs.

  ```
  cd metron-deployment/
  mvn clean package -Pbuild-rpms
  ```
- Retrieve the package.

  ```
  find ./ -name "metron-profiler-spark*.rpm"
  ```
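As a sketch, the retrieved package could then be installed on the target host; this assumes an RPM-based OS, and the exact file name will vary by build.

```
# install the Batch Profiler package (file name varies by build)
sudo rpm -ivh metron-profiler-spark-*.rpm
```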
- Build Metron.

  ```
  mvn clean package -DskipTests -T2C
  ```
- Build the DEBs.

  ```
  cd metron-deployment/
  mvn clean package -Pbuild-debs
  ```
- Retrieve the package.

  ```
  find ./ -name "metron-profiler-spark*.deb"
  ```
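As a sketch, the retrieved package could then be installed on the target host; this assumes a Debian-based OS, and the exact file name will vary by build.

```
# install the Batch Profiler package (file name varies by build)
sudo dpkg -i metron-profiler-spark-*.deb
```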
A script located at `$METRON_HOME/bin/start_batch_profiler.sh` has been provided to simplify running the Batch Profiler. This script makes the following assumptions.
- The script either
  - builds the profiles defined in `$METRON_HOME/config/zookeeper/profiler.json`, or
  - uses the profiles already loaded into the zookeeper quorum at `$ZOOKEEPER`, if the environment variable `$SPARK_PROFILER_USE_ZOOKEEPER` is set.
- The properties defined in `$METRON_HOME/config/batch-profiler.properties` are passed to both the Profiler and Spark. You can define both Spark and Profiler properties in this same file.
- The script will also configure the event time field to use, if the field name is stored in the `${SPARK_PROFILER_EVENT_TIMESTAMP_FIELD}` environment variable.
- The script assumes that Spark is installed at `/usr/hdp/current/spark2-client`. This can be overridden by defining an environment variable called `SPARK_HOME` prior to executing the script.
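For example, each of these assumptions can be adjusted through environment variables before invoking the script. The values below are purely illustrative.

```
# point the script at a non-default Spark install
export SPARK_HOME=/opt/spark

# override the event time field used by the Profiler (field name is illustrative)
export SPARK_PROFILER_EVENT_TIMESTAMP_FIELD=timestamp

# read profile definitions from zookeeper instead of profiler.json
export SPARK_PROFILER_USE_ZOOKEEPER=true

$METRON_HOME/bin/start_batch_profiler.sh
```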
The Batch Profiler may also be started using `spark-submit` as follows. See the Spark documentation for more information about `spark-submit`.

```
${SPARK_HOME}/bin/spark-submit \
    --class org.apache.metron.profiler.spark.cli.BatchProfilerCLI \
    --properties-file ${SPARK_PROPS_FILE} \
    ${METRON_HOME}/lib/metron-profiler-spark-*.jar \
    --config ${PROFILER_PROPS_FILE} \
    --profiles ${PROFILES_FILE}
```
The Batch Profiler accepts the following arguments when run from the command line as shown above. All arguments following the Profiler jar are passed to the Profiler. All arguments preceding the Profiler jar are passed to Spark.
| Argument | Description |
|---|---|
| `-p`, `--profiles` | Path to the profile definitions. |
| `-z`, `--zookeeper` | Zookeeper quorum to read profile definitions from. |
| `-t`, `--timestampfield` | Which data field to use for event time. |
| `-c`, `--config` | Path to the profiler properties file. |
| `-g`, `--globals` | Path to the Stellar global config file. |
| `-r`, `--reader` | Path to properties for the DataFrameReader. |
| `-h`, `--help` | Print the help text. |
The path to a file containing the profile definition in JSON. Only one of `--zookeeper` or `--profiles` should be used.
Read profile definitions from the zookeeper quorum at this address. Only one of `--zookeeper` or `--profiles` should be used.
Specifies which data field to use for event time. The field to use for event time is usually stored as part of the profile; it can be overridden with this setting.
The path to a file containing key-value properties for the Profiler. This file would contain the properties described under Configuring the Profiler.
The path to a file containing key-value properties that define the global properties. This can be used to customize how certain Stellar functions behave during execution.
The path to a file containing key-value properties that are passed to the DataFrameReader when reading the input telemetry. This allows additional customization for how the input telemetry is read.
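For example, a sketch of a `spark-submit` invocation that combines several of these arguments; the `*_PROPS_FILE` and `GLOBALS_FILE` variables are placeholders, and `timestamp` is just an illustrative field name.

```
${SPARK_HOME}/bin/spark-submit \
    --class org.apache.metron.profiler.spark.cli.BatchProfilerCLI \
    --properties-file ${SPARK_PROPS_FILE} \
    ${METRON_HOME}/lib/metron-profiler-spark-*.jar \
    --config ${PROFILER_PROPS_FILE} \
    --globals ${GLOBALS_FILE} \
    --reader ${READER_PROPS_FILE} \
    --timestampfield timestamp \
    --zookeeper ${ZOOKEEPER}
```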
Spark supports a number of different cluster managers. The underlying cluster manager is transparent to the Profiler. To run the Profiler on a particular cluster manager, it is just a matter of setting the appropriate options as defined in the Spark documentation.
By default, the Batch Profiler instructs Spark to run in local mode. This will run all of the Spark execution components within a single JVM. This mode is only useful for testing with a limited set of data.
`$METRON_HOME/config/batch-profiler.properties`

```
spark.master=local
```
To run the Profiler using Spark on YARN, at a minimum edit the value of `spark.master` as shown. In many cases it also makes sense to set the YARN deploy mode to `cluster`.

`$METRON_HOME/config/batch-profiler.properties`

```
spark.master=yarn
spark.submit.deployMode=cluster
```
See the Spark documentation for information on how to further control the execution of Spark on YARN. Any of these properties can be added to the Profiler properties file.
The following command can be useful to review the logs generated when the Profiler is executed on YARN.

```
yarn logs -applicationId <application-id>
```
See the Spark documentation for information on running the Batch Profiler in a secure, kerberized cluster.
The Profiler can consume archived telemetry stored in a variety of input formats. By default, it is configured to consume the text/json that Metron archives in HDFS. This is often not the best format for archiving telemetry. If you choose a different format, you should be able to configure the Profiler to consume it by doing the following.
- Edit `profiler.batch.input.format` and `profiler.batch.input.path` as needed. For example, to read ORC you might do the following.

  `$METRON_HOME/config/batch-profiler.properties`

  ```
  profiler.batch.input.format=org.apache.spark.sql.execution.datasources.orc
  profiler.batch.input.path=hdfs://localhost:9000/apps/metron/indexing/orc/\*/\*
  ```
- If additional options are required for your input format, then use the `--reader` command-line argument when launching the Batch Profiler as described here; a sketch follows below.
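For example, a sketch of supplying extra options to the underlying DataFrameReader; the file name is hypothetical and `multiLine` is a standard Spark JSON reader option used here only for illustration.

```
# reader.properties (hypothetical) - key-value options passed to Spark's DataFrameReader
multiLine=true
```

The file would then be passed with `--reader reader.properties` when launching the Batch Profiler.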
The following examples highlight the configuration values needed to read telemetry stored in common formats. These values should be defined in the Profiler properties (see `--config`).

JSON:

```
profiler.batch.input.reader=json
profiler.batch.input.path=/path/to/json/
```

ORC:

```
profiler.batch.input.reader=orc
profiler.batch.input.path=/path/to/orc/
```

Parquet:

```
profiler.batch.input.reader=parquet
profiler.batch.input.path=/path/to/parquet/
```
By default, the configuration for the Batch Profiler is stored in the local filesystem at `$METRON_HOME/config/batch-profiler.properties`.

You can store settings for both the Profiler and Spark in this same file. Spark will only read settings that start with `spark.`.
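As an illustrative sketch, a single `batch-profiler.properties` might combine the two kinds of settings as follows; the values shown are only examples (most are the defaults described below).

```
# Spark settings; only keys starting with 'spark.' are read by Spark
spark.master=yarn
spark.submit.deployMode=cluster

# Profiler settings
profiler.batch.input.path=hdfs://localhost:9000/apps/metron/indexing/indexed/*/*
profiler.batch.input.reader=json
profiler.period.duration=15
profiler.period.duration.units=MINUTES
profiler.hbase.table=profiler
profiler.hbase.column.family=P
```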
| Setting | Description |
|---|---|
| `profiler.batch.input.path` | The path to the input data read by the Batch Profiler. |
| `profiler.batch.input.reader` | The telemetry reader used to read the input data. |
| `profiler.batch.input.format` | The format of the input data read by the Batch Profiler. |
| `profiler.batch.input.begin` | Only messages with a timestamp after this will be profiled. |
| `profiler.batch.input.end` | Only messages with a timestamp before this will be profiled. |
| `profiler.period.duration` | The duration of each profile period. |
| `profiler.period.duration.units` | The units used to specify the `profiler.period.duration`. |
| `profiler.hbase.salt.divisor` | A salt is prepended to the row key to help prevent hot-spotting. |
| `profiler.hbase.table` | The name of the HBase table that profiles are written to. |
| `profiler.hbase.column.family` | The column family used to store profiles. |
Default: `hdfs://localhost:9000/apps/metron/indexing/indexed/*/*`
The path to the input data read by the Batch Profiler.
Default: `json`

Defines how the input data is treated when read. The value is not case sensitive, so `JSON` and `json` are equivalent.

- `json`: Read text/json formatted telemetry.
- `orc`: Read Apache ORC formatted telemetry.
- `parquet`: Read Apache Parquet formatted telemetry.
- `text`: Consumes input data stored as raw text. Should be defined along with `profiler.batch.input.format`. Only use this if the input format is not directly supported, like `json`.
- `columnar`: Consumes input data stored in columnar formats. Should be defined along with `profiler.batch.input.format`. Only use this if the input format is not directly supported, like `json`.
See Common Formats for further information.
Default: `text`

The format of the input data read by the Batch Profiler. This is optional and not required in most cases. For example, this property is not required when `profiler.batch.input.reader` is `json`, `orc`, or `parquet`.
Default: undefined; no time constraint

Only messages with a timestamp equal to or after this will be profiled. The Profiler only profiles messages with a timestamp in [`profiler.batch.input.begin`, `profiler.batch.input.end`] inclusive.

By default, no time constraint is defined. The value is expected to follow the ISO-8601 instant format; for example, 2011-12-03T10:15:30Z.
Default: undefined; no time constraint

Only messages with a timestamp before or equal to this will be profiled. The Profiler only profiles messages with a timestamp in [`profiler.batch.input.begin`, `profiler.batch.input.end`] inclusive.

By default, no time constraint is defined. The value is expected to follow the ISO-8601 instant format; for example, 2011-12-03T10:15:30Z.
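For example, a sketch that restricts the Batch Profiler to roughly one month of telemetry; the dates are illustrative.

```
# only profile messages with timestamps in this inclusive window
profiler.batch.input.begin=2020-01-01T00:00:00Z
profiler.batch.input.end=2020-01-31T23:59:59Z
```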
Default: 15

The duration of each profile period. This value should be defined along with `profiler.period.duration.units`.

Important: To read a profile using the Profiler Client, the Profiler Client's `profiler.client.period.duration` property must match this value. Otherwise, the Profiler Client will be unable to read the profile data.
Default: MINUTES

The units used to specify the `profiler.period.duration`. This value should be defined along with `profiler.period.duration`.

Important: To read a profile using the Profiler Client, the Profiler Client's `profiler.client.period.duration.units` property must match this value. Otherwise, the Profiler Client will be unable to read the profile data.
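For example, a sketch of keeping the Batch Profiler and the Profiler Client in agreement. Where the client properties are configured depends on your deployment; treating them as a separate client-side configuration is an assumption here.

```
# Batch Profiler ($METRON_HOME/config/batch-profiler.properties)
profiler.period.duration=15
profiler.period.duration.units=MINUTES

# Profiler Client configuration (location depends on your deployment) - must match the values above
profiler.client.period.duration=15
profiler.client.period.duration.units=MINUTES
```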
Default: 1000

A salt is prepended to the row key to help prevent hot-spotting. This constant is used to generate the salt and should be roughly equal to the number of nodes in the HBase cluster to ensure an even distribution of data.
Default: profiler
The name of the HBase table that profile data is written to. The Profiler expects that the table exists and is writable. It will not create the table.
Default: P
The column family used to store profile data in HBase.
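Since the Profiler will not create the table, the following is a minimal sketch of creating it from the HBase shell, assuming the default table name and column family.

```
# create the default profiler table with the 'P' column family
echo "create 'profiler', 'P'" | hbase shell
```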