Mark Marquez

IT & Automation Specialist | Cybersecurity | Data-Engineering Enthusiast

Building an Automated Threat Intelligence ETL Pipeline

In today’s cybersecurity landscape, staying ahead of threats requires timely, actionable intelligence. My threat-intel-etl project demonstrates how automation can transform raw threat data into powerful insights. By integrating AlienVault’s Open Threat Exchange (OTX), PostgreSQL, and Splunk, I created an end-to-end Extract, Transform, Load (ETL) pipeline that collects, processes, and visualizes threat intelligence—streamlining the way organizations monitor and respond to cyber risks. The project is hosted on GitHub.

The Challenge: Turning Raw Data into Actionable Insights

Threat intelligence platforms like AlienVault OTX provide a wealth of data, including indicators of compromise (IoCs) like malicious IPs, URLs, and domains. However, raw data alone isn’t enough. Analysts need structured, accessible, and visually intuitive insights to make informed decisions quickly. Manually collecting and analyzing this data is time-consuming and error-prone, so I set out to build an automated solution that simplifies the process.

My Solution: Threat-Intel-ETL

threat-intel-etl is a Python-based pipeline that automates the entire threat intelligence workflow: extracting pulses and their indicators from OTX, transforming the nested JSON with Pandas, loading the results into PostgreSQL, and surfacing them in Splunk for analysis.

As of March 2025, the pipeline has processed 6,836 pulses and 378,669 indicators, showcasing its ability to handle large-scale datasets with ease.
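The extraction stage can be sketched with a small paging loop against the OTX subscribed-pulses feed. This is an illustrative version, not the repository's actual code; the endpoint and `X-OTX-API-KEY` header follow OTX's public API, while the function names and the `summarize` helper are my own:

```python
import requests

OTX_URL = "https://otx.alienvault.com/api/v1/pulses/subscribed"

def fetch_pulses(api_key, limit=50):
    """Page through the OTX subscribed-pulses feed and return raw pulse dicts."""
    pulses, url, params = [], OTX_URL, {"limit": limit}
    headers = {"X-OTX-API-KEY": api_key}
    while url:
        resp = requests.get(url, headers=headers, params=params, timeout=30)
        resp.raise_for_status()
        data = resp.json()
        pulses.extend(data.get("results", []))
        url, params = data.get("next"), None  # "next" carries its own query string
    return pulses

def summarize(pulses):
    """Count pulses and the indicators they contain (the stats reported above)."""
    return len(pulses), sum(len(p.get("indicators", [])) for p in pulses)
```

Counting pulses and indicators this way is how totals like "6,836 pulses and 378,669 indicators" would be produced after a full fetch.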

Splunk Dashboard Verification

Below is a screenshot of the Splunk dashboard, “Intel Overview Dashboard,” visualizing the processed threat intelligence:

Splunk Dashboard Screenshot

PyCharm IDE Verification

Below is a screenshot of the PyCharm IDE showing the ETL scripts in development:

PyCharm IDE Screenshot

Key Features

The project’s centerpiece is a Splunk dashboard, “Intel Overview Dashboard,” which brings the data to life through five interactive visualizations:

  1. Indicator Type Breakdown: A pie chart showing the distribution of IoC types (e.g., IPv4, URL, domain), helping analysts prioritize threats by type.
  2. Expired vs. Active Indicators: A pie chart tracking the freshness of IoCs, ensuring focus on current risks.
  3. Top Pulses by Indicator Count: A bar chart highlighting the most prolific threat campaigns, based on the number of associated IoCs.
  4. Targeted Countries: A bar chart mapping the geographic focus of threats, revealing global attack patterns.
  5. Top Cybersecurity Tags: A bar chart identifying common threat themes (e.g., “phishing,” “malware”), guiding deeper investigations.

A dynamic filter lets users drill down into data by Traffic Light Protocol (TLP) levels, enhancing usability for analysts with specific access permissions.
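The first two panels and the TLP filter reduce to simple aggregations over the indicator table. A minimal Pandas sketch, assuming columns named `type` and `tlp` (the repository's actual column names may differ):

```python
import pandas as pd

def indicator_type_breakdown(indicators: pd.DataFrame) -> pd.Series:
    """Share of each IoC type, mirroring the dashboard's pie chart."""
    return indicators["type"].value_counts(normalize=True)

def filter_by_tlp(indicators: pd.DataFrame, tlp: str) -> pd.DataFrame:
    """Drill down by Traffic Light Protocol level, like the dashboard filter."""
    return indicators[indicators["tlp"] == tlp]
```

In the dashboard itself these aggregations are expressed as Splunk searches, but the logic is the same.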

Technical Highlights

Building this pipeline required integrating multiple technologies and tackling real-world challenges.

One of the trickiest parts was optimizing the transformation step to handle OTX’s nested JSON data. Using Pandas, I flattened and normalized the data into relational tables, balancing performance with accuracy. Setting up Splunk DB Connect also required careful configuration—ensuring the Java runtime and database credentials aligned perfectly to avoid connectivity issues.
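The flattening step maps naturally onto `pandas.json_normalize`. A sketch under the assumption that each pulse dict carries `id`, `name`, and a nested `indicators` list (field names here are illustrative):

```python
import pandas as pd

def flatten_pulses(pulses):
    """Normalize nested OTX pulse JSON into relational pulse and indicator tables."""
    # One row per pulse; drop the nested indicators column by selecting fields.
    pulses_df = pd.json_normalize(pulses)[["id", "name"]]
    # One row per indicator, carrying its parent pulse id as a foreign key.
    indicators_df = pd.json_normalize(
        pulses, record_path="indicators", meta=["id"], meta_prefix="pulse_"
    )
    return pulses_df, indicators_df
```

The `meta`/`meta_prefix` arguments are what turn the nesting into a join key (`pulse_id`), which is what the SQL queries later rely on.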

SQL Queries for Threat Analysis

A key component of the pipeline is the use of SQL queries to extract meaningful metrics from the PostgreSQL database. These queries, defined in src/sql_queries.py, provide a detailed breakdown of the threat intelligence data.

These queries were executed on May 15, 2025, producing results that were then used for further analysis.

Leveraging LLMs for Deeper Insights

To generate actionable insights from the query results, I used two large language models (LLMs), Grok (from xAI) and Claude (from Anthropic), to analyze the data independently. The process, implemented in src/send_to_llms.py, formats the query results into a structured prompt and sends it to each LLM for interpretation. Both models receive the same data but offer their own perspectives, which deepens the analysis.
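The shape of that script can be sketched in two parts: a pure prompt-formatting function, and a per-model dispatch. The Anthropic call below follows the public Messages API; the model name, prompt wording, and function names are illustrative assumptions, not the repository's actual code (the Grok call would look similar against xAI's OpenAI-compatible `api.x.ai` endpoint):

```python
import json
import requests

def format_prompt(query_results):
    """Render SQL query results into one structured prompt shared by both models."""
    body = json.dumps(query_results, indent=2, default=str)
    return ("You are a threat-intelligence analyst. Summarize the key risks "
            "in the following metrics:\n" + body)

def ask_claude(prompt, api_key):
    """Illustrative Anthropic Messages API call (model name is an assumption)."""
    resp = requests.post(
        "https://api.anthropic.com/v1/messages",
        headers={"x-api-key": api_key, "anthropic-version": "2023-06-01"},
        json={"model": "claude-3-5-sonnet-20241022", "max_tokens": 1024,
              "messages": [{"role": "user", "content": prompt}]},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["content"][0]["text"]
```

Sending the identical prompt to each model is the design choice that makes the two analyses directly comparable.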

Results and Insights

The pipeline processed 7,128 pulses and 412,985 indicators as of May 15, 2025. Key findings emerged from both the SQL queries and the insights generated by Grok and Claude.

Both LLMs highlighted the need for enhanced protection against phishing and advanced persistent threats (APTs), especially for critical infrastructure and government systems.

LLM Summary Python Script

These insights were generated based on the latest pipeline run on May 15, 2025.

Why It Matters

threat-intel-etl isn’t just a technical exercise—it’s a practical tool for cybersecurity teams. By automating data collection and presenting insights in an intuitive format, it saves analysts hours of manual work and helps them focus on responding to threats. The project also demonstrates my ability to bridge data engineering and cybersecurity, combining ETL pipelines with visualization to deliver real-world value.

For organizations, this means faster detection of malicious activity, better prioritization of threats, and a clearer understanding of the global threat landscape. Whether it’s identifying a spike in phishing campaigns or mapping attacks targeting specific regions, the pipeline empowers informed decision-making.

Lessons Learned

This project deepened my expertise in several areas, from OTX API integration and Pandas-based data transformation to PostgreSQL, Splunk DB Connect, and LLM-assisted analysis.

It also reinforced the importance of modularity. By structuring the codebase into separate modules for extraction, transformation, loading, querying, and LLM analysis, I made it easier to maintain and extend—a principle I carry into all my projects.

Usage and Configuration

To use the pipeline, follow these steps:

  1. Clone the Repository:
    git clone https://github.com/marky224/Threat-Intel-ETL.git
    cd Threat-Intel-ETL
  2. Ensure Prerequisites Are Met:
    • Python 3.8+ with dependencies (requirements.txt).
    • PostgreSQL 17 (local instance).
    • Splunk Enterprise with DB Connect app.
    • Java JRE 11 (e.g., OpenJDK from Adoptium).
    • AlienVault OTX API key: Sign up at otx.alienvault.com, navigate to your account settings, and generate an API key.
    • Grok API key: Visit console.x.ai, sign up or log in, and navigate to the API section to create a key. You may need to be a team owner or have a developer role to generate the key.
    • Claude API key: Sign up at the Anthropic Console, navigate to the API section, and generate a key. You may need to purchase credits or join a waitlist, as direct access can be limited.
  3. Run the Pipeline:

    Configure credentials in src/config.py, set up the database with setup_db.py, and execute:

    python main.py
  4. View the Dashboard:

    Access Splunk at localhost:8000, navigate to Dashboards > Intel Overview Dashboard.
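For step 3, src/config.py centralizes the credentials the other modules read. A hypothetical layout, with names chosen for illustration rather than taken from the repository (reading secrets from environment variables keeps them out of version control):

```python
import os

# Illustrative layout of src/config.py; names are assumptions, not the repo's.
OTX_API_KEY = os.environ.get("OTX_API_KEY", "")
GROK_API_KEY = os.environ.get("GROK_API_KEY", "")
CLAUDE_API_KEY = os.environ.get("CLAUDE_API_KEY", "")

DATABASE = {
    "host": "localhost",
    "port": 5432,          # PostgreSQL default port
    "dbname": "threat_intel",
    "user": "postgres",
    "password": os.environ.get("PGPASSWORD", ""),
}
```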

Try It Out

The threat-intel-etl project is open-source and available on GitHub. The repository includes detailed setup instructions, from cloning the project to configuring Splunk DB Connect. I welcome contributions—whether it’s adding new visualizations or optimizing the pipeline’s performance.

Mark’s Project Portfolio