# RealTime-ECommerce-Analytics (RTEA)

RealTime-ECommerce-Analytics (RTEA) is a high-performance, scalable solution for processing and analyzing e-commerce clickstream data in real time. It handles data from both mobile and web sources, processes up to 50,000 messages per second, and generates valuable metrics for business intelligence.

The project uses Apache Kafka for data ingestion, Apache Flink for stream processing, and Google BigQuery for storing the computed metrics.
## Features

- High-volume data generation simulating realistic e-commerce clickstream events
- Real-time data processing with Apache Flink
- Multi-source data handling (mobile app and website)
- Scalable architecture capable of processing 50,000 messages per second
- Key e-commerce metrics computation:
  - Event counts by type and source
  - Revenue by product category
  - Unique visitor counts
- Seamless integration with Google BigQuery for metric storage and further analysis
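For reference, a single clickstream event might look like the following. This is a hypothetical schema; the field names are illustrative assumptions, not taken from the actual generator:

```python
# One illustrative clickstream event (hypothetical schema; the actual
# fields emitted by ecommerce_data_generator.py may differ):
import json

sample_event = {
    "event_id": "b7f1c2d4-0e9a-4f3b-8c6d-1a2b3c4d5e6f",
    "event_type": "purchase",         # e.g. page_view, add_to_cart, purchase
    "source": "mobile",               # "mobile" or "website"
    "user_id": "user_48213",
    "product_category": "electronics",
    "revenue": 129.99,                # present only on purchase events
    "timestamp": 1718000000000,       # event time in epoch milliseconds
}

print(json.dumps(sample_event, indent=2))
```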
## Prerequisites

- Python 3.7+
- Apache Kafka
- Apache Flink 1.15.0
- Google Cloud account with BigQuery enabled
- Java 8 or 11 (for running Flink)
## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/yourusername/RealTime-ECommerce-Analytics.git
   cd RealTime-ECommerce-Analytics
   ```

2. Install the required Python packages (PyFlink is published on PyPI as `apache-flink`):

   ```bash
   pip install kafka-python apache-flink==1.15.0 google-cloud-bigquery
   ```

3. Download the following JAR files and place them in a `jars` directory in the project root:
   - flink-sql-connector-kafka-1.15.0.jar
   - flink-connector-jdbc-1.15.0.jar
   - google-cloud-bigquery-1.27.0.jar
   - slf4j-api-1.7.30.jar
   - slf4j-log4j12-1.7.30.jar

4. Set up Google Cloud credentials for BigQuery access:

   ```bash
   export GOOGLE_APPLICATION_CREDENTIALS="/path/to/your/service-account-file.json"
   ```
## Configuration

1. Update the Kafka broker address in both `ecommerce_data_generator.py` and `pyflink_ecommerce_analysis.py` if it differs from `localhost:9092`.

2. In `pyflink_ecommerce_analysis.py`, update the BigQuery connection details:
   - Replace `your-project-id` with your Google Cloud project ID
   - Replace `your-dataset.ecommerce_metrics` with your desired BigQuery dataset and table name
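As a sketch, the edited settings might look like this; the variable names are assumptions for illustration, not the actual identifiers used in `pyflink_ecommerce_analysis.py`:

```python
# Illustrative sketch of the BigQuery settings to edit (hypothetical
# variable names; check pyflink_ecommerce_analysis.py for the real ones).
PROJECT_ID = "your-project-id"               # your Google Cloud project ID
BQ_TABLE = "your-dataset.ecommerce_metrics"  # your dataset.table

# Fully qualified table reference used when writing metrics:
FULL_TABLE_ID = f"{PROJECT_ID}.{BQ_TABLE}"
print(FULL_TABLE_ID)
```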
## Usage

1. Start your Kafka cluster and create the required topic:

   ```bash
   bin/kafka-topics.sh --create --topic ecommerce_clickstream --bootstrap-server localhost:9092 --partitions 4 --replication-factor 1
   ```
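You can confirm the topic was created with the standard `--describe` flag of the same tool:

```shell
bin/kafka-topics.sh --describe --topic ecommerce_clickstream --bootstrap-server localhost:9092
```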
2. Run the data generator:

   ```bash
   python ecommerce_data_generator.py
   ```
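A minimal sketch of what the generator's produce loop could look like, assuming `kafka-python` and the illustrative event schema above (the real `ecommerce_data_generator.py` may differ):

```python
# Hypothetical sketch of a clickstream generator loop (not the actual
# ecommerce_data_generator.py). Requires kafka-python and a local broker.
import json
import random
import time

def make_event():
    """Build one simulated clickstream event (illustrative fields)."""
    event_type = random.choice(["page_view", "add_to_cart", "purchase"])
    event = {
        "event_type": event_type,
        "source": random.choice(["mobile", "website"]),
        "user_id": f"user_{random.randint(1, 100_000)}",
        "product_category": random.choice(["electronics", "clothing", "books"]),
        "timestamp": int(time.time() * 1000),
    }
    if event_type == "purchase":
        event["revenue"] = round(random.uniform(5.0, 500.0), 2)
    return event

def run(bootstrap_servers="localhost:9092", topic="ecommerce_clickstream",
        events_per_second=50_000):
    # Imported here so the module can be inspected without a Kafka broker.
    from kafka import KafkaProducer
    producer = KafkaProducer(
        bootstrap_servers=bootstrap_servers,
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    try:
        while True:
            for _ in range(events_per_second):
                producer.send(topic, make_event())
            time.sleep(1)  # crude pacing toward the per-second target
    finally:
        producer.close()
```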
3. In a separate terminal, run the Flink job:

   ```bash
   python pyflink_ecommerce_analysis.py
   ```

The Flink job will start processing the incoming data and writing metrics to BigQuery.
## Project Structure

- `ecommerce_data_generator.py`: Generates simulated e-commerce clickstream data and pushes it to Kafka
- `pyflink_ecommerce_analysis.py`: Defines and executes the Flink job for processing the data and computing metrics
- `jars/`: Directory containing necessary JAR files for Flink connectors
- `README.md`: This file
## Metrics

- Event counts by type and source (e.g., page_view_mobile, add_to_cart_website)
- Revenue by product category
- Unique visitor counts

These metrics are computed over 1-minute tumbling windows and stored in BigQuery for further analysis.
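In Flink SQL, 1-minute tumbling-window aggregations of this kind can be expressed with the `TUMBLE` group-window function. The queries below are a hypothetical sketch assuming a source table named `clickstream` with an event-time attribute `event_time`; the actual queries in `pyflink_ecommerce_analysis.py` may differ:

```python
# Hypothetical Flink SQL for the three metrics (assumed table and column
# names; not copied from pyflink_ecommerce_analysis.py).
EVENT_COUNTS_SQL = """
SELECT
    event_type,
    source,
    TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start,
    COUNT(*) AS event_count
FROM clickstream
GROUP BY event_type, source, TUMBLE(event_time, INTERVAL '1' MINUTE)
"""

REVENUE_SQL = """
SELECT
    product_category,
    TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start,
    SUM(revenue) AS total_revenue
FROM clickstream
WHERE event_type = 'purchase'
GROUP BY product_category, TUMBLE(event_time, INTERVAL '1' MINUTE)
"""

UNIQUE_VISITORS_SQL = """
SELECT
    TUMBLE_START(event_time, INTERVAL '1' MINUTE) AS window_start,
    COUNT(DISTINCT user_id) AS unique_visitors
FROM clickstream
GROUP BY TUMBLE(event_time, INTERVAL '1' MINUTE)
"""
```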
## Scaling

The current configuration is set to handle 50,000 messages per second. To scale the system:

- Increase the number of Kafka partitions
- Adjust the Flink job parallelism in `pyflink_ecommerce_analysis.py`
- Scale your BigQuery resources as needed
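For example, the partition count of an existing topic can be raised with `kafka-topics.sh --alter` (partition counts can only be increased, never decreased):

```shell
# Grow the topic from 4 to 8 partitions to allow higher consumer parallelism
bin/kafka-topics.sh --alter --topic ecommerce_clickstream \
  --bootstrap-server localhost:9092 --partitions 8
```

Flink parallelism should generally not exceed the partition count of the source topic, since extra source subtasks would sit idle.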
## Contributing

Contributions to RealTime-ECommerce-Analytics are welcome! Please feel free to submit a Pull Request.
## License

This project is licensed under the MIT License - see the LICENSE.md file for details.
## Acknowledgments

- Apache Flink community
- Apache Kafka community
- Google Cloud BigQuery team