Mark Marquez

IT & Automation Specialist | Cybersecurity | Data-Engineering Enthusiast

Building an Automated Threat Intelligence ETL Pipeline

In today’s cybersecurity landscape, staying ahead of threats requires timely, actionable intelligence. My threat-intel-etl project demonstrates how automation can transform raw threat data into powerful insights. By integrating AlienVault’s Open Threat Exchange (OTX), PostgreSQL, and Splunk, I created an end-to-end Extract, Transform, Load (ETL) pipeline that collects, processes, and visualizes threat intelligence—streamlining the way organizations monitor and respond to cyber risks. The project is hosted on GitHub.

The Challenge: Turning Raw Data into Actionable Insights

Threat intelligence platforms like AlienVault OTX provide a wealth of data, including indicators of compromise (IoCs) like malicious IPs, URLs, and domains. However, raw data alone isn’t enough. Analysts need structured, accessible, and visually intuitive insights to make informed decisions quickly. Manually collecting and analyzing this data is time-consuming and error-prone, so I set out to build an automated solution that simplifies the process.

My Solution: Threat-Intel-ETL

threat-intel-etl is a Python-based pipeline that automates the entire threat intelligence workflow:

  1. Extract: Pull pulses and their indicators from the AlienVault OTX API.
  2. Transform: Flatten and normalize the nested JSON into relational tables with Pandas.
  3. Load: Write the structured data into a PostgreSQL database.
  4. Visualize: Surface the results in Splunk via DB Connect on the Intel Overview Dashboard.

As of March 2025, the pipeline has processed 6,836 pulses and 378,669 indicators, showcasing its ability to handle large-scale datasets with ease.
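
To make the extraction step concrete, here is a minimal sketch of how pulses can be pulled from OTX. It assumes the public OTX v1 REST API and a placeholder API key; it is an illustration, not the repository's actual extract module:

    # Minimal extraction sketch (illustrative, not the repository's actual code).
    # Assumes the public OTX v1 REST API and a placeholder API key.
    import requests

    OTX_API_KEY = "replace-with-your-otx-api-key"   # the project keeps this in src/config.py
    OTX_URL = "https://otx.alienvault.com/api/v1/pulses/subscribed"

    def fetch_pulses(max_pages=5):
        """Page through subscribed OTX pulses and return them as raw JSON dicts."""
        headers = {"X-OTX-API-KEY": OTX_API_KEY}
        pulses, url = [], OTX_URL
        for _ in range(max_pages):
            resp = requests.get(url, headers=headers, timeout=30)
            resp.raise_for_status()
            payload = resp.json()
            pulses.extend(payload.get("results", []))
            url = payload.get("next")   # OTX paginates via a "next" link
            if not url:
                break
        return pulses

    if __name__ == "__main__":
        print(f"Fetched {len(fetch_pulses())} pulses")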

Splunk Dashboard Verification

Below is a screenshot of the Splunk dashboard, “Intel Overview Dashboard,” visualizing the processed threat intelligence:

Splunk Dashboard Screenshot

PyCharm IDE Verification

Below is a screenshot of the PyCharm IDE showing the ETL scripts in development:

PyCharm IDE Screenshot

Key Features

The project’s centerpiece is a Splunk dashboard, “Intel Overview Dashboard,” which brings the data to life through five interactive visualizations:

  1. Indicator Type Breakdown: A pie chart showing the distribution of IoC types (e.g., IPv4, URL, domain), helping analysts prioritize threats by type.
  2. Expired vs. Active Indicators: A pie chart tracking the freshness of IoCs, ensuring focus on current risks.
  3. Top Pulses by Indicator Count: A bar chart highlighting the most prolific threat campaigns, based on the number of associated IoCs.
  4. Targeted Countries: A bar chart mapping the geographic focus of threats, revealing global attack patterns.
  5. Top Cybersecurity Tags: A bar chart identifying common threat themes (e.g., “phishing,” “malware”), guiding deeper investigations.

A dynamic filter lets users drill down into data by Traffic Light Protocol (TLP) levels, enhancing usability for analysts with specific access permissions.
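
For a sense of the logic behind these panels, the sketch below reproduces two of the aggregations in Pandas rather than Splunk's search language. The column names (type, tlp, expiration) are assumptions about the indicators table, not the project's actual schema:

    # Pandas sketch of two panel aggregations; column names (type, tlp, expiration)
    # are assumed, not taken from the project's actual schema.
    from typing import Optional
    import pandas as pd

    def indicator_type_breakdown(indicators: pd.DataFrame, tlp: Optional[str] = None) -> pd.Series:
        """Count indicators per IoC type, optionally filtered by TLP level."""
        df = indicators if tlp is None else indicators[indicators["tlp"] == tlp]
        return df["type"].value_counts()

    def expired_vs_active(indicators: pd.DataFrame) -> pd.Series:
        """Split indicators into expired vs. active based on an expiration timestamp."""
        now = pd.Timestamp.now(tz="UTC")
        expired = pd.to_datetime(indicators["expiration"], errors="coerce", utc=True) < now
        # Missing expirations parse to NaT, compare as False, and count as active.
        return expired.map({True: "expired", False: "active"}).value_counts()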

Technical Highlights

Building this pipeline required integrating multiple technologies and tackling real-world challenges.

One of the trickiest parts was optimizing the transformation step to handle OTX’s nested JSON data. Using Pandas, I flattened and normalized the data into relational tables, balancing performance with accuracy. Setting up Splunk DB Connect also required careful configuration—ensuring the Java runtime and database credentials aligned perfectly to avoid connectivity issues.
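
The sketch below shows the general shape of that flattening with pandas.json_normalize and a straightforward load into PostgreSQL via SQLAlchemy. The field names, table names, and connection string are illustrative assumptions rather than the project's exact code:

    # Transformation and load sketch; field names, table names, and the DSN are
    # illustrative assumptions, not the project's exact code.
    import pandas as pd
    from sqlalchemy import create_engine

    def pulses_to_frames(pulses):
        """Split raw OTX pulses into a pulses table and an indicators table."""
        pulses_df = pd.json_normalize(pulses)[["id", "name", "created", "tlp"]]
        indicators_df = pd.json_normalize(
            pulses,
            record_path="indicators",   # one row per nested indicator
            meta=["id"],                # keep the parent pulse id as a foreign key
            meta_prefix="pulse_",
        )
        return pulses_df, indicators_df

    def load(pulses_df, indicators_df,
             dsn="postgresql://postgres:postgres@localhost:5432/threat_intel"):
        """Append both DataFrames to their PostgreSQL tables."""
        engine = create_engine(dsn)
        pulses_df.to_sql("pulses", engine, if_exists="append", index=False)
        indicators_df.to_sql("indicators", engine, if_exists="append", index=False)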

Why It Matters

threat-intel-etl isn’t just a technical exercise—it’s a practical tool for cybersecurity teams. By automating data collection and presenting insights in an intuitive format, it saves analysts hours of manual work and helps them focus on responding to threats. The project also demonstrates my ability to bridge data engineering and cybersecurity, combining ETL pipelines with visualization to deliver real-world value.

For organizations, this means faster detection of malicious activity, better prioritization of threats, and a clearer understanding of the global threat landscape. Whether it’s identifying a spike in phishing campaigns or mapping attacks targeting specific regions, the pipeline empowers informed decision-making.

Lessons Learned

This project deepened my expertise in several areas: REST API integration with OTX, data wrangling with Pandas, relational schema design in PostgreSQL, and Splunk dashboarding through DB Connect.

It also reinforced the importance of modularity. By structuring the codebase into separate modules for extraction, transformation, and loading, I made it easier to maintain and extend—a principle I carry into all my projects.
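
In practice, that modularity keeps the entry point tiny. The sketch below shows the idea, with module and function names assumed rather than copied from the repository:

    # Illustrative entry point; module and function names are assumptions about
    # the layout, not copied from the repository.
    from src.extract import fetch_pulses
    from src.transform import pulses_to_frames
    from src.load import load

    def run_pipeline():
        pulses = fetch_pulses()                               # Extract: pull pulses from OTX
        pulses_df, indicators_df = pulses_to_frames(pulses)   # Transform: flatten nested JSON
        load(pulses_df, indicators_df)                        # Load: write to PostgreSQL

    if __name__ == "__main__":
        run_pipeline()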

Usage and Configuration

To use the pipeline, follow these steps:

  1. Clone the Repository:
    git clone https://github.com/marky224/Threat-Intel-ETL.git
    cd Threat-Intel-ETL
  2. Ensure Prerequisites Are Met:
    • Python 3.8+ with dependencies (requirements.txt).
    • PostgreSQL 17 (local instance).
    • Splunk Enterprise with DB Connect app.
    • Java JRE 11 (e.g., OpenJDK from Adoptium).
    • AlienVault OTX API key.
  3. Run the Pipeline:

    Configure credentials in src/config.py (see the sketch of this file after these steps), set up the database with setup_db.py, and execute:

    python main.py
  4. View the Dashboard:

    Access Splunk at localhost:8000 and navigate to Dashboards > Intel Overview Dashboard.
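
For reference, src/config.py usually only needs the OTX API key and the PostgreSQL connection details. The sketch below is illustrative, and the actual variable names may differ from the repository's:

    # Illustrative src/config.py; variable names are assumptions and may differ
    # from the repository's actual configuration keys.
    import os

    # AlienVault OTX API key
    OTX_API_KEY = os.environ.get("OTX_API_KEY", "replace-with-your-otx-api-key")

    # Local PostgreSQL instance
    POSTGRES_DSN = os.environ.get(
        "POSTGRES_DSN",
        "postgresql://postgres:postgres@localhost:5432/threat_intel",
    )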

Try It Out

The threat-intel-etl project is open-source and available on GitHub. The repository includes detailed setup instructions, from cloning the project to configuring Splunk DB Connect. I welcome contributions—whether it’s adding new visualizations or optimizing the pipeline’s performance.

What’s Next?

I’m excited to expand threat-intel-etl by integrating additional threat feeds, such as VirusTotal or CrowdStrike, to enrich the dataset. I’m also exploring ways to incorporate machine learning for anomaly detection, helping analysts spot emerging threats faster. On the visualization side, I plan to add time-series charts to track threat trends over time.

This project is a testament to my passion for building tools that make complex data accessible and actionable. By blending cybersecurity, data engineering, and visualization, I’m committed to creating solutions that empower teams to stay one step ahead of threats.
