Short Text Classification Approach to Identify Child Sexual Exploitation Material

Bibliographic Details
Title: Short Text Classification Approach to Identify Child Sexual Exploitation Material
Authors: Al-Nabki, Mhd Wesam; Fidalgo, Eduardo; Alegre, Enrique; Alaiz-Rodríguez, Rocío
Publication Year: 2020
Collection: Computer Science
Subject Terms: Computer Science - Information Retrieval, Computer Science - Machine Learning
More Details: Producing or sharing Child Sexual Exploitation Material (CSEM) is a serious crime that Law Enforcement Agencies (LEAs) fight vigorously. When an LEA seizes a computer from a suspected producer or consumer of CSEM, it must analyze the files on the suspect's hard disk for evidence. However, manually inspecting file content for CSEM is time-consuming and, in most cases, infeasible within the time available to the Spanish police under a search warrant. An alternative that speeds up the process is to identify CSEM by analyzing file names and their absolute paths instead of file content. The main challenge of this task is dealing with short text that the owners of this material deliberately distort using obfuscated words and user-defined naming patterns. This paper presents and compares two approaches based on short text classification to identify CSEM files. The first employs two independent supervised classifiers, one for the file name and one for the path, whose outputs are later fused into a single score. The second uses only the file name classifier, iterating it over the file's absolute path. Both approaches operate at the character n-gram level, binary and orthographic features enrich the file name representation, and a binary Logistic Regression model performs the classification. The presented file classifier achieved an average class recall of 0.98. This solution could be integrated into forensic tools and services to help Law Enforcement Agencies identify CSEM without processing every file's visual content, which is far more computationally demanding.
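The pipeline described in the abstract can be illustrated with a minimal, standard-library-only sketch on neutral example data. It is not the authors' implementation. Everything below is an assumption made for illustration: the function names, the particular orthographic features, the weighted-average fusion rule, the use of the maximum when iterating over path components, and `placeholder_scorer`, which stands in for the probability a trained Logistic Regression model would return.

```python
from collections import Counter
from pathlib import PurePosixPath
from typing import Callable, Dict


def char_ngrams(text: str, n_min: int = 1, n_max: int = 3) -> Counter:
    """Count all character n-grams of length n_min..n_max in the lowercased text."""
    text = text.lower()
    counts: Counter = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def orthographic_features(name: str) -> Dict[str, float]:
    """Illustrative surface features; the abstract does not list the real feature set."""
    length = max(len(name), 1)  # avoid division by zero on empty names
    return {
        "length": float(len(name)),
        "digit_ratio": sum(c.isdigit() for c in name) / length,
        "upper_ratio": sum(c.isupper() for c in name) / length,
        "non_alnum_ratio": sum(not c.isalnum() for c in name) / length,
    }


def fuse_scores(name_score: float, path_score: float, weight: float = 0.5) -> float:
    """First approach (assumed fusion rule): weighted average of the two classifier scores."""
    return weight * name_score + (1.0 - weight) * path_score


def score_by_path_iteration(path: str, name_scorer: Callable[[str], float]) -> float:
    """Second approach (assumed aggregation): score each path component, keep the maximum."""
    parts = [p for p in PurePosixPath(path).parts if p != "/"]
    return max((name_scorer(p) for p in parts), default=0.0)


def placeholder_scorer(text: str) -> float:
    """Hypothetical stand-in for a trained model's probability: just the digit ratio."""
    return orthographic_features(text)["digit_ratio"]


if __name__ == "__main__":
    example_path = "/home/user/docs/report_2020.pdf"  # neutral example path
    print(char_ngrams("report_2020.pdf").most_common(5))
    print(orthographic_features("report_2020.pdf"))
    print(fuse_scores(0.8, 0.4, weight=0.75))
    print(score_by_path_iteration(example_path, placeholder_scorer))
```

In a real system the n-gram counts and orthographic features would be combined into one feature vector and passed to a trained binary classifier. The placeholder scorer is there only to make the two path-handling strategies runnable end to end.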
Document Type: Working Paper
Access URL: http://arxiv.org/abs/2011.01113
Accession Number: edsarx.2011.01113
Database: arXiv
RecordInfo BibRecord:
  BibRelationships:
    IsPartOfRelationships:
      – BibEntity:
          Dates:
            – D: 29
              M: 10
              Type: published
              Y: 2020