Malware Snippets - Beta Release

Malware Snippets

A tool for automated TTP extraction from live malware samples.

Check it out here:

Initial Idea

The idea for Malware Snippets came from the desire to become a better malware developer and to find new techniques for my C2’s WhisperNet Agent. In my experience, most online resources are adequate for learning traditional malware techniques; the effective, niche, and new ones, however, are hiding in the live samples.

Initially, I started pulling samples from MalwareBazaar, decompiling and analyzing them by hand. As you might guess, that was incredibly tedious, especially with a (seemingly endless) stack of homework to do. I wondered whether there was a good way to automate this, and since I couldn’t find any examples online, I decided to build my own.

Technical

The entire project is written in Python and has three main components:

1. Parser

A LangChain-powered pipeline that:

  • Splits large decompiled C# files into coherent chunks
  • Processes each chunk in parallel with a fixed-output prompt (via a Map-Reduce chain on the o1-mini model).
  • Merges all chunk responses into one clean JSON list of TTP entries.
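The three steps above can be sketched roughly as follows. Note that `split_into_chunks` and `extract_ttps` are hypothetical stand-ins for the real LangChain splitter and the o1-mini call, not the project's actual code:

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_chunks(source: str, max_len: int = 2000) -> list[str]:
    # Stand-in for the real splitter: naive fixed-size character chunks.
    return [source[i:i + max_len] for i in range(0, len(source), max_len)]

def extract_ttps(chunk: str) -> list[dict]:
    # Stand-in for the LLM call: the real pipeline sends the chunk to
    # o1-mini with a fixed-output prompt and parses the JSON array back.
    return [{"ttp": "T1027.001", "snippet": chunk[:40]}] if chunk else []

def analyze_file(source: str) -> list[dict]:
    chunks = split_into_chunks(source)
    # "Map": process each chunk in parallel.
    with ThreadPoolExecutor(max_workers=8) as pool:
        per_chunk = list(pool.map(extract_ttps, chunks))
    # "Reduce": merge all chunk responses into one flat list of TTP entries.
    return [entry for chunk_result in per_chunk for entry in chunk_result]
```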

2. Storage

A SQLite database that holds the analyzed data:

  • Schema: a single ttps table captures every snippet’s core metadata:

    - id

    - sample_hash (SHA-256)

    - ttp (technique identifier + name)

    - snippet (original C# code)

    - snippet_c (C translation)

    - snippet_python (Python translation)

    - description (A short snippet summary)

    - source_file (file path)

    - star_count (number of votes)
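Assuming the columns listed above, the ttps table could be created like this; the column types are my guesses, not necessarily the project's actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path in practice
conn.execute("""
CREATE TABLE IF NOT EXISTS ttps (
    id             INTEGER PRIMARY KEY AUTOINCREMENT,
    sample_hash    TEXT NOT NULL,        -- SHA-256 of the sample
    ttp            TEXT NOT NULL,        -- technique identifier + name
    snippet        TEXT,                 -- original C# code
    snippet_c      TEXT,                 -- C translation
    snippet_python TEXT,                 -- Python translation
    description    TEXT,                 -- short snippet summary
    source_file    TEXT,                 -- file path
    star_count     INTEGER DEFAULT 0     -- number of votes
)
""")
conn.commit()
```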

3. Web

The frontend is built with NiceGUI, a great framework I’ve recently adopted for web projects. It includes many built-in tools such as CodeMirror and AgGrid, making the implementation of advanced features incredibly simple.

Visual Overview:

![[process_overview.svg]]

Issues/Challenges

I ran into a few challenges, mostly related to interacting with LLMs.

Token size & LLM Run Times


By far, the biggest issue I ran into was token limitations. My first idea was to shove the entire decompiled C# file into one giant prompt:

````text
You are a malware analyst reviewing a C# code snippet for malicious behavior.

Your goal is to identify every TTP by name, find niche techniques, and include the code lines that implement it (inside csharp fences), a *thorough*, decent length description, and high-level steps that the code does.

Additionally, add comments to the code where necessary. Mark with `//ANALYZER:`. A good place to add is when the code is not explicitly clear, Ex, what a certain value may mean, or what certain arguments may mean, or do, in a function

'''csharp

{code_here}

'''

Here is a bad example:

    `Anti_Analysis.RunAntiAnalysis();`

        This is bad, as it only shows the method name. A good example would show the `RunAntiAnalysis()` function code.

Output JSON array where each entry has:

- "ttp": string
- "snippet": string  (the exact code)
- "description": string
- "snippet_python": string (the exact code/TTP, translated into Python)
- "snippet_c": string (the exact code/TTP, translated into c)

If no TTPs are found, return an empty JSON array `[]`.

DO NOT include extra text. ONLY include the formatted JSON data. NO ```json```.
````

I quickly found out many of these files run into thousands of lines, which easily blows past an LLM’s context window. The solution I came up with was a chunk-based strategy using LangChain:

  1. Custom .NET splitter  

  I wrote get_dotnet_splitter() based on RecursiveCharacterTextSplitter, which splits the data based on logical boundaries, such as double newlines, single newlines, .NET keywords (public, private, class), then spaces, falling back to single characters. Each chunk is capped at around 7,000 tokens with a 200-token overlap to preserve context across splits.
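A minimal pure-Python sketch of that recursive splitting idea follows. This is not the actual get_dotnet_splitter() (which wraps LangChain's RecursiveCharacterTextSplitter), and it measures chunk size in characters rather than tokens:

```python
def recursive_split(text: str, separators: list[str], max_len: int) -> list[str]:
    # Try each separator in priority order until the pieces fit.
    if len(text) <= max_len:
        return [text]
    if not separators:
        # Last resort: hard-split into max_len-character slices.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, rest = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, rest, max_len)
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) > max_len and current:
            # Flush the accumulated chunk, refining it with lower-priority
            # separators if it is still too large.
            chunks.extend(recursive_split(current, rest, max_len))
            current = piece
        else:
            current = candidate
    if current:
        chunks.extend(recursive_split(current, rest, max_len))
    return chunks

# Logical boundaries for decompiled .NET source, highest priority first.
DOTNET_SEPARATORS = ["\n\n", "\n", "public ", "private ", "class ", " "]
```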

  2. Map-Reduce chain  

  Using MapReduceChain.from_params(...), I was able to “map” my TTP-extraction prompt (using o1-mini) over each chunk in parallel, then “reduce” by merging the individual JSON arrays into one consolidated list.

  3. Speeding it up

  Incrementally querying OpenAI’s API turned out to be a bit slow, but luckily, LangChain has concurrency options, which sped everything right up.

  With RunnableConfig(max_concurrency=8), the parser can fire off up to eight queries in parallel, dramatically speeding up the full-file analysis. Occasionally, the parser gets rate-limited, but LangChain’s built-in retry logic seems to handle the re-queries fairly well.
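The effect of max_concurrency=8 can be approximated without LangChain by a bounded thread pool plus a simple retry loop for rate limits. Here, `call_llm` and `RateLimitError` are hypothetical stand-ins for the real OpenAI call and its rate-limit exception:

```python
import time
from concurrent.futures import ThreadPoolExecutor

class RateLimitError(Exception):
    """Stand-in for the API's rate-limit exception."""

def call_llm(chunk: str) -> str:
    # Stand-in for the real OpenAI API call.
    return f"analyzed:{chunk}"

def query_with_retry(chunk: str, retries: int = 3, backoff: float = 1.0) -> str:
    for attempt in range(retries):
        try:
            return call_llm(chunk)
        except RateLimitError:
            # Exponential backoff before re-querying, roughly what
            # LangChain's built-in retry logic does.
            time.sleep(backoff * 2 ** attempt)
    raise RuntimeError("gave up after repeated rate limits")

def run_all(chunks: list[str], max_concurrency: int = 8) -> list[str]:
    # At most `max_concurrency` queries in flight at once, mirroring
    # RunnableConfig(max_concurrency=8).
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return list(pool.map(query_with_retry, chunks))
```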

Example

Altogether, the process looks like this:

Example C# file:

```csharp
// — Chunk 1 —

public void ObfuscateFiles() {
    // some obfuscation logic
    // T1027.001 - Obfuscated Files or Information
}

// — Chunk 2 —

public void RunShell() {
    // spawns a shell
    // T1059.003 - Command and Scripting Interpreter
}

// — Chunk 3 —

public void MasqueradeProcess() {
    // renames process to legit exe
    // T1036.005 - Masquerading
}
```

Submit each chunk to the LLM

Chunk 1

```text
[SNIPPET]

Additionally, add comments to the code where necessary. Mark with `//ANALYZER:`. A good place to add is when the code is not explicitly clear, Ex, what a certain value may mean, or what certain arguments may mean, or do, in a function

'''csharp

public void ObfuscateFiles() {
    // some obfuscation logic
    // T1027.001 - Obfuscated Files or Information
}

'''

Here is a bad example:

[END SNIPPET]
```

Chunk 2

```text
[SNIPPET]

Additionally, add comments to the code where necessary. Mark with `//ANALYZER:`. A good place to add is when the code is not explicitly clear, Ex, what a certain value may mean, or what certain arguments may mean, or do, in a function

'''csharp

public void RunShell() {
    // spawns a shell
    // T1059.003 - Command and Scripting Interpreter
}

'''

Here is a bad example:

[END SNIPPET]
```

Chunk 3

```text
[SNIPPET]

Additionally, add comments to the code where necessary. Mark with `//ANALYZER:`. A good place to add is when the code is not explicitly clear, Ex, what a certain value may mean, or what certain arguments may mean, or do, in a function

'''csharp

public void MasqueradeProcess() {
    // renames process to legit exe
    // T1036.005 - Masquerading
}

'''

Here is a bad example:

[END SNIPPET]
```

Putting it together

Once complete, LangChain merges those three small arrays into one big list, which is parsed and inserted into the database.

```json
[
  {
    "ttp": "T1027.001 - Obfuscated Files or Information",
    "snippet": "public void ObfuscateFiles() { /* … */ }",
    "description": "Obfuscates file contents by applying simple transformations to evade static analysis.",
    "snippet_python": "def obfuscate_files():\n    # …",
    "snippet_c": "void obfuscate_files() { /* … */ }"
  },
  {
    "ttp": "T1059.003 - Command and Scripting Interpreter",
    "snippet": "public void RunShell() { /* … */ }",
    "description": "Launches a system shell to execute arbitrary commands.",
    "snippet_python": "def run_shell():\n    # …",
    "snippet_c": "void run_shell() { /* … */ }"
  },
  {
    "ttp": "T1036.005 - Masquerading",
    "snippet": "public void MasqueradeProcess() { /* … */ }",
    "description": "Renames the running process to mimic a legitimate executable.",
    "snippet_python": "def masquerade_process():\n    # …",
    "snippet_c": "void masquerade_process() { /* … */ }"
  }
]
```
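The merge-and-store step might look like the sketch below. `merge_and_store` is my name for an assumed helper, not the project's actual function, and it targets the ttps schema described earlier:

```python
import json
import sqlite3

def merge_and_store(chunk_outputs: list[str], sample_hash: str,
                    source_file: str, conn: sqlite3.Connection) -> int:
    # Each chunk output is a JSON array of TTP entries; flatten them all.
    entries = []
    for raw in chunk_outputs:
        entries.extend(json.loads(raw))
    # Insert every merged entry into the ttps table.
    for e in entries:
        conn.execute(
            "INSERT INTO ttps (sample_hash, ttp, snippet, snippet_c, "
            "snippet_python, description, source_file) VALUES (?,?,?,?,?,?,?)",
            (sample_hash, e["ttp"], e.get("snippet"), e.get("snippet_c"),
             e.get("snippet_python"), e.get("description"), source_file),
        )
    conn.commit()
    return len(entries)
```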

Model selection


Choosing the right model was critical for reliable output and accurate TTP identification. Many of the conversational models were less than ideal (e.g., 4o), whereas the more logic-focused ones (e.g., o1-mini) did a great job at producing consistent structured output and identifying TTPs.

Currently, o1-mini is the model used by the parser, but exploring others, such as o3 (if I can get approved for it), will likely yield even better results.

Going Forward & Improvements

As you might have noticed, the tool is currently set up for C#/.NET. I’m planning to extend it to handle Java, Python, and any other languages that can be decompiled, so a user will be able to extract TTPs from a much broader range of samples - but I need to get through finals week before I explore those options.

Additionally, I plan to add support for local LLM deployments to reduce reliance on the OpenAI API, as it can get pricey, especially with larger samples and newer models. I’m currently using a key provided to us by the AIxCC challenge, which expires at the end of the month, so I’ll need to implement that fairly soon to avoid breaking the bank.

The database setup could use a rework as well. I started this project using SQLite, which has some issues related to multiple concurrent writes and is fairly limited in general. I’d like to switch to a more production-ready database with time, most likely PostgreSQL.

Last but not least, real-time updates are a priority. Ideally, the parser would run every 6 to 12 hours, batch process any new samples, and push them to the database. This would ensure the latest submitted samples get analyzed and keep the site fresh. Currently, the site only has 100 analyzed samples.

Final Thoughts

To wrap up this article: I’m really excited about where this project is headed and can’t wait to see what comes next. Expect lots of adjustments and changes in the coming months!

This post is licensed under CC BY 4.0 by the author.