The Stanford Internet Research Data Repository is a public archive of research datasets that describe the hosts, services, and websites on the Internet. While the repository is hosted by the Stanford Empirical Security Research Group, we are also happy to host data for other researchers. The data on the site is restricted to non-commercial use. A JSON interface is available. Contact us with any questions.

LZR: Identifying Unexpected Internet Services
Abstract: Internet-wide scanning is a commonly used research technique that has helped uncover real-world attacks, find cryptographic weaknesses, and understand both operator and miscreant behavior. Studies that employ scanning have largely assumed that services are hosted on their IANA-assigned ports, overlooking the study of services on unusual ports. In this work, we investigate where Internet services are deployed in practice and evaluate the security posture of services on unexpected ports. We show protocol deployment is more diffuse than previously believed and that protocols run on many additional ports beyond their primary IANA-assigned port. For example, only 3% of HTTP and 6% of TLS services run on ports 80 and 443, respectively. Services on non-standard ports are more likely to be insecure, which results in studies dramatically underestimating the security posture of Internet hosts. Building on our observations, we introduce LZR (Laser), a system that identifies 99% of identifiable unexpected services in five handshakes and dramatically reduces the time needed to perform application-layer scans on ports with few responsive expected services (e.g., 5500% speedup on 27017/MongoDB). We conclude with recommendations for future studies.

On the Origin of Scanning: The Impact of Location on Internet-Wide Scans
Abstract: Fast IPv4 scanning has enabled researchers to answer a wealth of security and networking questions. Yet, despite widespread use, there has been little validation of the methodology’s accuracy, including whether a single scan provides sufficient coverage. In this paper, we analyze how scan origin affects the results of Internet-wide scans by completing three HTTP, HTTPS, and SSH scans from seven geographically and topologically diverse networks. We find that individual origins miss an average 1.6–8.4% of HTTP, 1.5–4.6% of HTTPS, and 8.3–18.2% of SSH hosts. We analyze why origins see different hosts, and show how permanent and temporary blocking, packet loss, geographic biases, and transient outages affect scan results. We discuss the implications for scanning and provide recommendations for future studies.

Designing Toxic Content Classification for a Diversity of Perspectives
Abstract: Despite many efforts to automatically identify toxic comments online (including sexual harassment, threats, and identity attacks), modern systems fail to generalize to the diverse concerns of Internet users. This dataset consists of 107,620 social media comments annotated by 17,280 unique participants, and was collected to understand how user expectations for what constitutes toxic content differ across demographics, beliefs, and personal experiences. The dataset is encrypted – please contact Deepak Kumar for the password.

