This project allows profiles to be executed using Apache Spark. This is a port of the Profiler to Spark that allows you to backfill profiles using archived telemetry.
Using the Streaming Profiler in Apache Storm allows you to create profiles based on the stream of telemetry being captured, enriched, triaged, and indexed by Metron. This does not allow you to create a profile based on telemetry that was captured in the past.
There are many cases where you might want to produce a profile from telemetry in the past. This is referred to as profile seeding or backfilling.
- As a Security Data Scientist, I want to understand the historical behaviors and trends of a profile so that I can determine if the profile has predictive value for model building.
- As a Security Platform Engineer, I want to generate a profile using archived telemetry when I deploy a new model to production so that models depending on that profile can function on day 1.
The Batch Profiler running in Apache Spark allows you to seed a profile using archived telemetry.
The portion of a profile produced by the Batch Profiler should be indistinguishable from the portion created by the Streaming Profiler. Consumers of the profile should not care how the profile was generated. Using the Streaming Profiler together with the Batch Profiler allows you to create a complete profile over a wide range of time.
For an introduction to the Profiler, see the Profiler README.
- If a profile file does not already exist, you can create a profile definition by editing `$METRON_HOME/config/zookeeper/profiler.json` as follows.

  ```
  cat $METRON_HOME/config/zookeeper/profiler.json
  {
    "profiles": [
      {
        "profile": "hello-world",
        "foreach": "'global'",
        "init":   { "count": "0" },
        "update": { "count": "count + 1" },
        "result": "count"
      }
    ],
    "timestampField": "timestamp"
  }
  ```
See Specifying profiles for information on how to load profile definitions from zookeeper.
- Ensure that you have archived telemetry available for the Batch Profiler to consume. By default, Metron will store this in HDFS at `/apps/metron/indexing/indexed/*/*`.

  ```
  hdfs dfs -cat /apps/metron/indexing/indexed/*/* | wc -l
  ```
- Copy the `hbase-site.xml` file from `/etc/hbase/conf` to `/etc/spark2/conf`. Creating a symlink is advised, both to avoid duplicating the file and to keep the two copies consistent as the configuration is updated.

  ```
  ln -s /etc/hbase/conf/hbase-site.xml /etc/spark2/conf/hbase-site.xml
  ```
- Review the Batch Profiler's properties located at `$METRON_HOME/config/batch-profiler.properties`. See Configuring the Profiler for more information on these properties.
- You may want to edit the log4j properties file in the config directory under `${SPARK_HOME}`, or create one if it does not exist. It may be helpful to turn on `DEBUG` logging for the Profiler by adding the following line.

  ```
  log4j.logger.org.apache.metron.profiler.spark=DEBUG
  ```
- Run the Batch Profiler.

  ```
  source /etc/default/metron
  cd $METRON_HOME
  $METRON_HOME/bin/start_batch_profiler.sh
  ```
- Query for the profile data using the Profiler Client.
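For example, the `hello-world` profile created above could be read with a `PROFILE_GET` call from the Stellar REPL. This is only a sketch; the four-hour lookback window is arbitrary and the REPL prompt may differ by version.

```
# start the Stellar REPL connected to zookeeper
$METRON_HOME/bin/stellar -z $ZOOKEEPER

# at the Stellar prompt, fetch the profile values written over the last 4 hours
[Stellar]>>> PROFILE_GET('hello-world', 'global', PROFILE_FIXED(4, 'HOURS'))
```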
The profile to use for batch processing can be specified either as a JSON file on disk or as a profile already loaded into zookeeper for use by the Streaming Profiler.
- If a profile file does not already exist, you can create a profile definition by editing `$METRON_HOME/config/zookeeper/profiler.json` as follows.

  ```
  cat $METRON_HOME/config/zookeeper/profiler.json
  {
    "profiles": [
      {
        "profile": "hello-world",
        "foreach": "'global'",
        "init":   { "count": "0" },
        "update": { "count": "count + 1" },
        "result": "count"
      }
    ],
    "timestampField": "timestamp"
  }
  ```
- When launching the Batch Profiler directly, use the `--profiles <path to profiler.json>` option. If the wrapper script is used to launch the Batch Profiler, it will automatically add the argument `--profiles $METRON_HOME/config/zookeeper/profiler.json` to the launch command if `$SPARK_PROFILER_USE_ZOOKEEPER` is not defined.
- To use profiles already loaded into zookeeper (e.g. for use by the Streaming Profiler), set the environment variable `$SPARK_PROFILER_USE_ZOOKEEPER`. This causes the wrapper script to add `--zookeeper $ZOOKEEPER` to the launch command, which makes the Spark profiler read its profiles from the zookeeper quorum located at `$ZOOKEEPER`.
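As a sketch, the two ways of specifying profiles look like the following when using the wrapper script; the value assigned to `SPARK_PROFILER_USE_ZOOKEEPER` is arbitrary, since the script only checks whether the variable is defined.

```
# read profile definitions from $METRON_HOME/config/zookeeper/profiler.json (the default)
unset SPARK_PROFILER_USE_ZOOKEEPER
$METRON_HOME/bin/start_batch_profiler.sh

# read profile definitions from the zookeeper quorum at $ZOOKEEPER
export SPARK_PROFILER_USE_ZOOKEEPER=true
$METRON_HOME/bin/start_batch_profiler.sh
```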
The Batch Profiler package is installed automatically when installing Metron using the Ambari MPack. See the following notes when installing the Batch Profiler without the Ambari MPack.
The Batch Profiler requires Spark version 2.3.0+.
- Build Metron.

  ```
  mvn clean package -DskipTests -T2C
  ```
- Build the RPMs.

  ```
  cd metron-deployment/
  mvn clean package -Pbuild-rpms
  ```
- Retrieve the package.

  ```
  find ./ -name "metron-profiler-spark*.rpm"
  ```
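As a sketch, the retrieved package could then be installed on the target host; this assumes an RPM-based OS, and the exact file name will vary by build.

```
# install the Batch Profiler package (file name varies by build)
sudo rpm -ivh metron-profiler-spark-*.rpm
```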
- Build Metron.

  ```
  mvn clean package -DskipTests -T2C
  ```
- Build the DEBs.

  ```
  cd metron-deployment/
  mvn clean package -Pbuild-debs
  ```
- Retrieve the package.

  ```
  find ./ -name "metron-profiler-spark*.deb"
  ```
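As a sketch, the retrieved package could then be installed on the target host; this assumes a Debian-based OS, and the exact file name will vary by build.

```
# install the Batch Profiler package (file name varies by build)
sudo dpkg -i metron-profiler-spark-*.deb
```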
A script located at `$METRON_HOME/bin/start_batch_profiler.sh` has been provided to simplify running the Batch Profiler. This script makes the following assumptions.
- The script either
  - builds the profiles defined in `$METRON_HOME/config/zookeeper/profiler.json`, or
  - uses the profiles already loaded into the zookeeper quorum at `$ZOOKEEPER`, if the environment variable `$SPARK_PROFILER_USE_ZOOKEEPER` is set.
- The properties defined in `$METRON_HOME/config/batch-profiler.properties` are passed to both the Profiler and Spark. You can define both Spark and Profiler properties in this same file.
- The script will also configure the event time field to use, if the field name is stored in the `${SPARK_PROFILER_EVENT_TIMESTAMP_FIELD}` environment variable.
- The script assumes that Spark is installed at `/usr/hdp/current/spark2-client`. This can be overridden by defining an environment variable called `SPARK_HOME` prior to executing the script.
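For example, each of these assumptions can be adjusted through environment variables before invoking the script. The values below are purely illustrative.

```
# point the script at a non-default Spark install
export SPARK_HOME=/opt/spark

# override the event time field used by the Profiler (field name is illustrative)
export SPARK_PROFILER_EVENT_TIMESTAMP_FIELD=timestamp

# read profile definitions from zookeeper instead of profiler.json
export SPARK_PROFILER_USE_ZOOKEEPER=true

$METRON_HOME/bin/start_batch_profiler.sh
```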
The Batch Profiler may also be started using `spark-submit` as follows. See the Spark documentation for more information about `spark-submit`.

```
${SPARK_HOME}/bin/spark-submit \
    --class org.apache.metron.profiler.spark.cli.BatchProfilerCLI \
    --properties-file ${SPARK_PROPS_FILE} \
    ${METRON_HOME}/lib/metron-profiler-spark-*.jar \
    --config ${PROFILER_PROPS_FILE} \
    --profiles ${PROFILES_FILE}
```
The Batch Profiler accepts the following arguments when run from the command line as shown above. All arguments following the Profiler jar are passed to the Profiler. All arguments preceding the Profiler jar are passed to Spark.
| Argument | Description |
|---|---|
| `-p`, `--profiles` | Path to the profile definitions. |
| `-z`, `--zookeeper` | Zookeeper quorum to read profile definitions from. |
| `-t`, `--timestampfield` | Which data field to use for event time. |
| `-c`, `--config` | Path to the profiler properties file. |
| `-g`, `--globals` | Path to the Stellar global config file. |
| `-r`, `--reader` | Path to properties for the DataFrameReader. |
| `-h`, `--help` | Print the help text. |
The path to a file containing the profile definition in JSON. Only one of `--zookeeper` or `--profiles` should be used.
Read profile definitions from the zookeeper quorum at this address. Only one of `--zookeeper` or `--profiles` should be used.
Specifies which data field to use for event time. The field to use for event time is usually stored as part of the profile; it can be overridden with this setting.
The path to a file containing key-value properties for the Profiler. This file would contain the properties described under Configuring the Profiler.
The path to a file containing key-value properties that define the global properties. This can be used to customize how certain Stellar functions behave during execution.
The path to a file containing key-value properties that are passed to the DataFrameReader when reading the input telemetry. This allows additional customization for how the input telemetry is read.
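For example, a sketch of a `spark-submit` invocation that combines several of these arguments; the `*_PROPS_FILE` and `GLOBALS_FILE` variables are placeholders, and `timestamp` is just an illustrative field name.

```
${SPARK_HOME}/bin/spark-submit \
    --class org.apache.metron.profiler.spark.cli.BatchProfilerCLI \
    --properties-file ${SPARK_PROPS_FILE} \
    ${METRON_HOME}/lib/metron-profiler-spark-*.jar \
    --config ${PROFILER_PROPS_FILE} \
    --globals ${GLOBALS_FILE} \
    --reader ${READER_PROPS_FILE} \
    --timestampfield timestamp \
    --zookeeper ${ZOOKEEPER}
```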
Spark supports a number of different cluster managers. The underlying cluster manager is transparent to the Profiler. To run the Profiler on a particular cluster manager, it is just a matter of setting the appropriate options as defined in the Spark documentation.
By default, the Batch Profiler instructs Spark to run in local mode. This will run all of the Spark execution components within a single JVM. This mode is only useful for testing with a limited set of data.
`$METRON_HOME/config/batch-profiler.properties`

```
spark.master=local
```
To run the Profiler using Spark on YARN, at a minimum edit the value of `spark.master` as shown. In many cases it also makes sense to set the YARN deploy mode to `cluster`.

`$METRON_HOME/config/batch-profiler.properties`

```
spark.master=yarn
spark.submit.deployMode=cluster
```
See the Spark documentation for information on how to further control the execution of Spark on YARN. Any of these properties can be added to the Profiler properties file.
The following command can be useful to review the logs generated when the Profiler is executed on YARN.

```
yarn logs -applicationId <application-id>
```
See the Spark documentation for information on running the Batch Profiler in a secure, kerberized cluster.
The Profiler can consume archived telemetry stored in a variety of input formats. By default, it is configured to consume the text/json that Metron archives in HDFS. This is often not the best format for archiving telemetry. If you choose a different format, you should be able to configure the Profiler to consume it by doing the following.
- Edit `profiler.batch.input.format` and `profiler.batch.input.path` as needed. For example, to read ORC you might do the following.

  `$METRON_HOME/config/batch-profiler.properties`

  ```
  profiler.batch.input.format=org.apache.spark.sql.execution.datasources.orc
  profiler.batch.input.path=hdfs://localhost:9000/apps/metron/indexing/orc/\*/\*
  ```
- If additional options are required for your input format, then use the `--reader` command-line argument when launching the Batch Profiler as described here; a sketch follows below.
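For example, a sketch of supplying extra options to the underlying DataFrameReader; the file name is hypothetical and `multiLine` is a standard Spark JSON reader option used here only for illustration.

```
# reader.properties (hypothetical) - key-value options passed to Spark's DataFrameReader
multiLine=true
```

The file would then be passed with `--reader reader.properties` when launching the Batch Profiler.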
The following examples highlight the configuration values needed to read telemetry stored in common formats. These values should be defined in the Profiler properties (see `--config`).

JSON:

```
profiler.batch.input.reader=json
profiler.batch.input.path=/path/to/json/
```

ORC:

```
profiler.batch.input.reader=orc
profiler.batch.input.path=/path/to/orc/
```

Parquet:

```
profiler.batch.input.reader=parquet
profiler.batch.input.path=/path/to/parquet/
```
By default, the configuration for the Batch Profiler is stored in the local filesystem at `$METRON_HOME/config/batch-profiler.properties`.

You can store settings for both the Profiler and Spark in this same file. Spark will only read settings that start with `spark.`.
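As an illustrative sketch, a single `batch-profiler.properties` might combine the two kinds of settings as follows; the values shown are only examples (most are the defaults described below).

```
# Spark settings; only keys starting with 'spark.' are read by Spark
spark.master=yarn
spark.submit.deployMode=cluster

# Profiler settings
profiler.batch.input.path=hdfs://localhost:9000/apps/metron/indexing/indexed/*/*
profiler.batch.input.reader=json
profiler.period.duration=15
profiler.period.duration.units=MINUTES
profiler.hbase.table=profiler
profiler.hbase.column.family=P
```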
| Setting | Description |
|---|---|
| `profiler.batch.input.path` | The path to the input data read by the Batch Profiler. |
| `profiler.batch.input.reader` | The telemetry reader used to read the input data. |
| `profiler.batch.input.format` | The format of the input data read by the Batch Profiler. |
| `profiler.batch.input.begin` | Only messages with a timestamp after this will be profiled. |
| `profiler.batch.input.end` | Only messages with a timestamp before this will be profiled. |
| `profiler.period.duration` | The duration of each profile period. |
| `profiler.period.duration.units` | The units used to specify the `profiler.period.duration`. |
| `profiler.hbase.salt.divisor` | A salt is prepended to the row key to help prevent hot-spotting. |
| `profiler.hbase.table` | The name of the HBase table that profiles are written to. |
| `profiler.hbase.column.family` | The column family used to store profiles. |
Default: `hdfs://localhost:9000/apps/metron/indexing/indexed/*/*`
The path to the input data read by the Batch Profiler.
Default: `json`

Defines how the input data is treated when read. The value is not case sensitive, so `JSON` and `json` are equivalent.

- `json`: Read text/json formatted telemetry.
- `orc`: Read Apache ORC formatted telemetry.
- `parquet`: Read Apache Parquet formatted telemetry.
- `text`: Consumes input data stored as raw text. Should be defined along with `profiler.batch.input.format`. Only use this if the input format is not directly supported, like `json`.
- `columnar`: Consumes input data stored in columnar formats. Should be defined along with `profiler.batch.input.format`. Only use this if the input format is not directly supported, like `json`.
See Common Formats for further information.
Default: `text`

The format of the input data read by the Batch Profiler. This is optional and not required in most cases. For example, this property is not required when `profiler.batch.input.reader` is `json`, `orc`, or `parquet`.
Default: undefined; no time constraint

Only messages with a timestamp equal to or after this will be profiled. The Profiler only profiles messages with a timestamp in [`profiler.batch.input.begin`, `profiler.batch.input.end`] inclusive.

By default, no time constraint is defined. The value is expected to follow the ISO-8601 instant format; for example, 2011-12-03T10:15:30Z.
Default: undefined; no time constraint

Only messages with a timestamp before or equal to this will be profiled. The Profiler only profiles messages with a timestamp in [`profiler.batch.input.begin`, `profiler.batch.input.end`] inclusive.

By default, no time constraint is defined. The value is expected to follow the ISO-8601 instant format; for example, 2011-12-03T10:15:30Z.
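For example, a sketch that restricts the Batch Profiler to roughly one month of telemetry; the dates are illustrative.

```
# only profile messages with timestamps in this inclusive window
profiler.batch.input.begin=2020-01-01T00:00:00Z
profiler.batch.input.end=2020-01-31T23:59:59Z
```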
Default: 15

The duration of each profile period. This value should be defined along with `profiler.period.duration.units`.

Important: To read a profile using the Profiler Client, the Profiler Client's `profiler.client.period.duration` property must match this value. Otherwise, the Profiler Client will be unable to read the profile data.
Default: MINUTES

The units used to specify the `profiler.period.duration`. This value should be defined along with `profiler.period.duration`.

Important: To read a profile using the Profiler Client, the Profiler Client's `profiler.client.period.duration.units` property must match this value. Otherwise, the Profiler Client will be unable to read the profile data.
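For example, a sketch of keeping the Batch Profiler and the Profiler Client in agreement. Where the client properties are configured depends on your deployment; treating them as a separate client-side configuration is an assumption here.

```
# Batch Profiler ($METRON_HOME/config/batch-profiler.properties)
profiler.period.duration=15
profiler.period.duration.units=MINUTES

# Profiler Client configuration (location depends on your deployment) - must match the values above
profiler.client.period.duration=15
profiler.client.period.duration.units=MINUTES
```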
Default: 1000

A salt is prepended to the row key to help prevent hot-spotting. This constant is used to generate the salt and should be roughly equal to the number of nodes in the HBase cluster to ensure an even distribution of data.
Default: profiler
The name of the HBase table that profile data is written to. The Profiler expects that the table exists and is writable. It will not create the table.
Default: P
The column family used to store profile data in HBase.
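Since the Profiler will not create the table, the following is a minimal sketch of creating it from the HBase shell, assuming the default table name and column family.

```
# create the default profiler table with the 'P' column family
echo "create 'profiler', 'P'" | hbase shell
```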