SR-BH 2020 multi-label dataset

  1. Sureda Riera, Tomás (1)
  2. Bermejo Higuera, Juan Ramón (2)
  3. Bermejo Higuera, Javier (2)
  4. Sicilia Montalvo, Juan Antonio (2)
  5. Martínez Herráiz, José Javier (1)

  (1) Universidad de Alcalá
  (2) Universidad Internacional de La Rioja, Logroño, Spain


Publisher: Harvard Dataverse

Year of publication: 2022

Type: Dataset


The dataset consists of web requests collected over 12 days in July 2020 by a web server (WordPress) installed on a virtual machine and exposed to the Internet. ModSecurity version 2.9.2 for Apache, with Core Rule Set (CRS) version 3.3.0, was installed on this server in "Detection only" mode, so that all requests (legitimate and malicious) were recorded in the ModSecurity log without being blocked. Each day, the logs generated by ModSecurity were collected and the virtual machine was restored to a clean state. Once the exposure period was over, the collected logs were processed manually and semi-automatically to review the request tagging performed by ModSecurity, correcting the normal/attack assignment of individual web requests where necessary and ensuring that each request received an appropriate CAPEC classification. The final result is a multi-label dataset aimed primarily at web attack detection. It comprises 907,814 requests, of which 525,195 are normal and 382,619 are anomalous, and each record has 24 features and a set of 13 labels.
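
As a quick sanity check on the figures above, the following minimal Python sketch recomputes the class balance from the quoted counts and shows one possible way to split a flat record into its feature and label parts. The column order (24 features followed by 13 binary label columns) and the helper names `class_shares` and `split_record` are assumptions made for illustration only; they are not taken from the dataset documentation and should be checked against the published files.

```python
# Counts quoted in the dataset description.
TOTAL_REQUESTS = 907_814
NORMAL_REQUESTS = 525_195
ANOMALOUS_REQUESTS = 382_619
N_FEATURES = 24
N_LABELS = 13


def class_shares(normal, anomalous):
    """Return the fraction of normal and anomalous requests."""
    total = normal + anomalous
    return {"normal": normal / total, "anomalous": anomalous / total}


def split_record(row, n_features=N_FEATURES, n_labels=N_LABELS):
    """Split one flat record into (features, labels).

    ASSUMPTION (hypothetical layout): the feature columns come first and
    are followed by the label columns encoded as "0"/"1" strings. The real
    file layout may differ.
    """
    if len(row) != n_features + n_labels:
        raise ValueError(
            f"expected {n_features + n_labels} columns, got {len(row)}"
        )
    features = row[:n_features]
    labels = [int(value) for value in row[n_features:]]
    return features, labels


# The two class counts should add up to the stated total.
assert NORMAL_REQUESTS + ANOMALOUS_REQUESTS == TOTAL_REQUESTS

shares = class_shares(NORMAL_REQUESTS, ANOMALOUS_REQUESTS)
print(f"normal: {shares['normal']:.1%}, anomalous: {shares['anomalous']:.1%}")
```

With the counts above this reports roughly 57.9% normal and 42.1% anomalous requests, so the two classes are moderately imbalanced rather than heavily skewed.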