Stanford ESRG Data Repository

The Stanford Internet Research Data Repository is a public archive of research datasets that describe the hosts, services, and websites on the Internet. While the repository is hosted by Stanford Empirical Security Research Group, we are also happy to host data for other researchers as well. The data on the site is restricted to non-commercial use. A JSON interface is available. Contact support@esrg.stanford.edu with any questions.

Cloud Watching: Understanding Attacks Against Cloud-Hosted Services

Paper Artifact(s) from Stanford University

Abstract: Cloud computing has dramatically changed service deployment patterns. In this work, we analyze how attackers identify and target cloud services in contrast to traditional enterprise networks and network telescopes. Using a diverse set of cloud honeypots in 5~providers and 23~countries as well as 2~educational networks and 1~network telescope, we analyze how IP address assignment, geography, network, and service-port selection, influence what services are targeted in the cloud. We find that scanners that target cloud compute are selective: they avoid scanning networks without legitimate services and they discriminate between geographic regions. Further, attackers mine Internet-service search engines to find exploitable services and, in some cases, they avoid targeting IANA-assigned protocols, causing researchers to misclassify at least 15% of traffic on select ports. Based on our results, we derive recommendations for researchers and operators.

LZR: Identifying Unexpected Internet Services

Paper Artifact(s) from Stanford University

Abstract: Internet-wide scanning is a commonly used research technique that has helped uncover real-world attacks, find cryptographic weaknesses, and understand both operator and miscreant behavior. Studies that employ scanning have largely assumed that services are hosted on their IANA-assigned ports, overlooking the study of services on unusual ports. In this work, we investigate where Internet services are deployed in practice and evaluate the security posture of services on unexpected ports. We show protocol deployment is more diffuse than previously believed and that protocols run on many additional ports beyond their primary IANA-assigned port. For example, only 3% of HTTP and 6% of TLS services run on ports 80 and 443, respectively. Services on non-standard ports are more likely to be insecure, which results in studies dramatically underestimating the security posture of Internet hosts. Building on our observations, we introduce LZR (Laser), a system that identifies 99% of identifiable unexpected services in five handshakes and dramatically reduces the time needed to perform application-layer scans on ports with few responsive expected services (e.g., 5500% speedup on 27017/MongoDB). We conclude with recommendations for future studies.

On the Origin of Scanning: The Impact of Location on Internet-Wide Scans

Paper Artifact(s) from Stanford University

Abstract: Fast IPv4 scanning has enabled researchers to answer a wealth of security and networking questions. Yet, despite widespread use, there has been little validation of the methodology’s accuracy, including whether a single scan provides sufficient coverage. In this paper, we analyze how scan origin affects the results of Internet-wide scans by completing three HTTP, HTTPS, and SSH scans from seven geographically and topologically diverse networks. We find that individual origins miss an average 1.6–8.4% of HTTP, 1.5–4.6% of HTTPS, and 8.3–18.2% of SSH hosts. We analyze why origins see different hosts, and show how permanent and temporary blocking, packet loss, geographic biases, and transient outages affect scan results. We discuss the implications for scanning and provide recommendations for future studies.

Designing Toxic Content Classification for a Diversity of Perspectives

Paper Artifact(s) from Stanford University

Abstract: Despite many efforts to automatically identify toxic comments online (including sexual harassment, threats, and identity attacks), modern systems fail to generalize to the diverse concerns of Internet users. This dataset consists of 107,620 social media comments annotated by 17,280 unique participants, and was collected to understand how user expectations for what constitutes toxic content differ across demographics, beliefs, and personal experiences. The dataset is encrypted – please contact Deepak Kumar for the password.