CODEX

The research record.

The artifacts behind ARGUS's decisions: model comparisons, notebooks, and the daemon itself as a research platform.

hybrid-static-malware-detection

research

Research notebooks comparing symbolic rules, XGBoost, and a byte CNN on EMBER 2018 v2.

  • XGBoost on 2381 EMBER features is the current validation baseline.
  • CNN and fusion remain in progress while the held-out test set stays locked.
  • Research outputs are phrased as validation numbers, never deployment claims.
Repository ->

GATE-2 CLASSIFIER · EMBER 2018 V2 · 600K PE FILES

Three approaches, one locked test set

Symbolic rules against named features, XGBoost on the 2381 EMBER features, and a 1D CNN reading raw bytes — compared to pick the classifier behind ARGUS's Gate 2.

Validation-set results per model
Model              Val accuracy  Precision  Recall   F1
XGBoost            97.79%        0.98       0.98     0.98
1D CNN             pending       pending    pending  pending
Three-tier fusion  pending       pending    pending  pending

Results above are validation-set numbers; the held-out test set stays locked until all three approaches are complete.
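
All four columns in the table come from the same confusion-matrix counts. A minimal sketch of that relationship follows; the counts are made up for illustration and are not ARGUS results, and the helper name `binary_metrics` is hypothetical.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Accuracy, precision, recall and F1 for the positive (malware) class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Illustrative counts only: 90 malware caught, 5 missed,
# 10 benign files flagged by mistake, 95 benign files passed.
metrics = binary_metrics(tp=90, fp=10, fn=5, tn=95)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

On a balanced validation set, accuracy and F1 tend to sit close together, which is consistent with the XGBoost row reporting about 0.98 in every column.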

Notebooks ->

NOTEBOOK · PYTHON

Hybrid Static Malware Detection

Rendered statically from the notebook's last run — input and output as committed, not re-executed.

import pandas as pd


df = pd.read_parquet(r"datasets\train_ember_2018_v2_features.parquet")
print(df.shape)
print(df['Label'].value_counts())
(799912, 2382)
Label
 0.0    299991
 1.0    299929
-1.0    199992
Name: count, dtype: int64
df = df[df['Label'] != -1]
print(df.shape)
print(df['Label'].value_counts())
(599920, 2382)
Label
0.0    299991
1.0    299929
Name: count, dtype: int64
X = df.drop(columns=["Label"])
y = df["Label"]
print(X.shape)
print(y.shape)
(599920, 2381)
(599920,)
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.15, random_state=42, stratify=y
)

print(f'X_train={X_train.shape}\nX_val={X_val.shape}\ny_train={y_train.shape}\ny_val={y_val.shape}')
X_train=(509932, 2381)
X_val=(89988, 2381)
y_train=(509932,)
y_val=(89988,)
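
As an aside to the committed run, the split sizes can be checked by hand. The sketch below assumes the stratified split rounds the validation size up and then shares it across classes in proportion, handing leftover rows to the largest fractional remainders. That allocation rule is an assumption about the splitter, and `stratified_val_counts` is a hypothetical helper, not part of the notebook.

```python
from math import ceil


def stratified_val_counts(
    class_counts: dict[str, int], test_size: float
) -> dict[str, int]:
    """Share a validation set across classes in proportion to class size."""
    total = sum(class_counts.values())
    n_val = ceil(test_size * total)
    # Integer arithmetic: floor and remainder of count * n_val / total.
    alloc = {k: (c * n_val) // total for k, c in class_counts.items()}
    remainder = {k: (c * n_val) % total for k, c in class_counts.items()}
    leftover = n_val - sum(alloc.values())
    # Give the leftover rows to the classes with the largest remainders.
    for k in sorted(remainder, key=remainder.get, reverse=True)[:leftover]:
        alloc[k] += 1
    return alloc


# Class counts printed by the notebook after dropping the -1 labels.
counts = {"benign": 299991, "malware": 299929}
val = stratified_val_counts(counts, test_size=0.15)
train = {k: counts[k] - val[k] for k in counts}

print("val:", val, "total", sum(val.values()))
print("train:", train, "total", sum(train.values()))
```

Under these assumptions the totals match the shapes printed above (509932 training rows, 89988 validation rows), and the per-class validation counts match the support column of the classification report later in the notebook.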
import ember
print(dir(ember))
['GridSearchCV', 'PEFeatureExtractor', 'TimeSeriesSplit', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', 'create_metadata', 'create_vectorized_features', 'features', 'json', 'lgb', 'make_scorer', 'multiprocessing', 'np', 'optimize_model', 'os', 'pd', 'predict_sample', 'raw_feature_iterator', 'read_metadata', 'read_metadata_record', 'read_vectorized_features', 'roc_auc_score', 'tqdm', 'train_model', 'vectorize', 'vectorize_subset', 'vectorize_unpack']
import xgboost as xgb

model_gpu = xgb.XGBClassifier(
    n_estimators=300,
    max_depth=8,
    learning_rate=0.1,
    subsample=0.8,
    colsample_bytree=0.8,
    use_label_encoder=False,
    eval_metric='logloss',
    random_state=42,
    n_jobs=-1,
    device='cuda'
)

model_gpu.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    verbose=50
)
c:\Users\mdzee\AppData\Local\Programs\Python\Python312\Lib\site-packages\xgboost\training.py:200: UserWarning: [04:48:16] WARNING: C:\actions-runner\_work\xgboost\xgboost\src\learner.cc:782: 
Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)
[0]	validation_0-logloss:0.63206
[50]	validation_0-logloss:0.13795
[100]	validation_0-logloss:0.10284
[150]	validation_0-logloss:0.08670
[200]	validation_0-logloss:0.07623
[250]	validation_0-logloss:0.07003
[299]	validation_0-logloss:0.06540
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=0.8, device='cuda', early_stopping_rounds=None,
              enable_categorical=False, eval_metric='logloss',
              feature_types=None, feature_weights=None, gamma=None,
              grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=0.1, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=8, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=300, n_jobs=-1,
              num_parallel_tree=None, ...)
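
As an aside to the committed run, the `validation_0-logloss` column can be read as follows. The sketch assumes the logged metric is the standard mean binary cross-entropy with natural logarithms; `binary_log_loss` is a hypothetical helper and the inputs are toy values, not model outputs.

```python
import math


def binary_log_loss(y_true, p_malware, eps=1e-15):
    """Mean negative log-likelihood of the true labels (natural log)."""
    total = 0.0
    for y, p in zip(y_true, p_malware):
        # Clip so that log(0) can never occur.
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


labels = [1, 0, 1, 0]

# A model that always answers 0.5 scores ln(2), about 0.693.
uninformed = binary_log_loss(labels, [0.5, 0.5, 0.5, 0.5])

# Confident and correct predictions push the loss toward zero.
confident = binary_log_loss(labels, [0.95, 0.05, 0.9, 0.1])

print(f"uninformed: {uninformed:.5f}")
print(f"confident:  {confident:.5f}")
```

Read this way, the first logged value of 0.63206 is only slightly better than the uninformed baseline of about 0.693, and the final 0.06540 corresponds to predictions that are mostly confident and correct.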
from sklearn.metrics import accuracy_score, classification_report

y_pred = model_gpu.predict(X_val)
y_prob = model_gpu.predict_proba(X_val)[:, 1]

print(accuracy_score(y_val, y_pred))
print(classification_report(y_val, y_pred, target_names=['benign', 'malware']))
c:\Users\mdzee\AppData\Local\Programs\Python\Python312\Lib\site-packages\xgboost\core.py:751: UserWarning: [04:50:39] WARNING: C:\actions-runner\_work\xgboost\xgboost\src\common\error_msg.cc:62: Falling back to prediction using DMatrix due to mismatched devices. This might lead to higher memory usage and slower performance. XGBoost is running on: cuda:0, while the input data is on: cpu.
Potential solutions:
- Use a data structure that matches the device ordinal in the booster.
- Set the device for booster before call to inplace_predict.

This warning will only be shown once.

  return func(**kwargs)
0.9778637151620216
              precision    recall  f1-score   support

      benign       0.98      0.98      0.98     44999
     malware       0.98      0.98      0.98     44989

    accuracy                           0.98     89988
   macro avg       0.98      0.98      0.98     89988
weighted avg       0.98      0.98      0.98     89988

import joblib
joblib.dump(model_gpu, 'xgboost_ember_gpu_model.pkl')
['xgboost_ember_gpu_model.pkl']
Source ->

ARGUS · RESEARCH PLATFORM

The daemon is also the lab

ARGUS's own feature extractor produces the named values, such as section entropy and signature presence, that the symbolic rules run against. The anonymized EMBER dataframe cannot provide these, so the daemon and the research feed each other.
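
To make "rules against named values" concrete, here is a minimal sketch. Everything in it is hypothetical: the field names, the 7.2 entropy threshold, and the rule labels are illustrative choices, not ARGUS's actual extractor output or rule set.

```python
import math
from collections import Counter
from dataclasses import dataclass


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte, from 0.0 (constant) to 8.0 (uniform)."""
    if not data:
        return 0.0
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())


@dataclass(frozen=True)
class NamedFeatures:
    """Hypothetical subset of the named values an extractor might emit."""
    max_section_entropy: float
    has_signature: bool


def rule_hits(f: NamedFeatures, entropy_threshold: float = 7.2) -> list[str]:
    """Return the labels of every illustrative rule that fires."""
    hits = []
    # High entropy is a common hint of packed or encrypted sections.
    if f.max_section_entropy >= entropy_threshold:
        hits.append("high_section_entropy")
    if not f.has_signature:
        hits.append("unsigned")
    return hits


# Toy section contents: every byte value once (maximally mixed) vs. all zeros.
packed_like = NamedFeatures(shannon_entropy(bytes(range(256))), has_signature=False)
plain_signed = NamedFeatures(shannon_entropy(b"\x00" * 64), has_signature=True)

print(rule_hits(packed_like))
print(rule_hits(plain_signed))
```

Rules of this kind need columns with names and meanings, which is why they run on the daemon's extractor output rather than on the EMBER feature matrix.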

Repository ->