Short Text Classification Approach to Identify Child Sexual Exploitation Material
Title: | Short Text Classification Approach to Identify Child Sexual Exploitation Material |
---|---|
Authors: | Al-Nabki, Mhd Wesam, Fidalgo, Eduardo, Alegre, Enrique, Alaiz-Rodríguez, Rocío |
Publication Year: | 2020 |
Collection: | Computer Science |
Subject Terms: | Computer Science - Information Retrieval, Computer Science - Machine Learning |
More Details: | Producing or sharing Child Sexual Exploitation Material (CSEM) is a serious crime fought vigorously by Law Enforcement Agencies (LEAs). When an LEA seizes a computer from a potential producer or consumer of CSEM, it needs to analyze the files on the suspect's hard disk for evidence. However, manually inspecting file content for CSEM is time-consuming, and in most cases it is infeasible within the time available to the Spanish police under a search warrant. Instead of analyzing file content, another approach that can speed up the process is to identify CSEM by analyzing the file names and their absolute paths. The main challenge of this task lies in dealing with short text that the owners of this material deliberately distort using obfuscated words and user-defined naming patterns. This paper presents and compares two approaches based on short text classification to identify CSEM files. The first employs two independent supervised classifiers, one for the file name and the other for the path, whose outputs are later fused into a single score. The second, in contrast, uses only the file name classifier and iterates it over the file's absolute path. Both approaches operate at the character n-gram level, binary and orthographic features enrich the file name representation, and a binary Logistic Regression model is used for classification. The presented file classifier achieved an average class recall of 0.98. This solution could be integrated into forensic tools and services to help Law Enforcement Agencies identify CSEM without processing every file's visual content, which is computationally far more demanding. |
Document Type: | Working Paper |
Access URL: | http://arxiv.org/abs/2011.01113 |
Accession Number: | edsarx.2011.01113 |
Database: | arXiv |
Publication Date: | 2020-10-29 |
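The two approaches described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the n-gram range, the specific orthographic features, the weighted-average fusion, and the max aggregation over path segments are all assumptions chosen for illustration, and every function name here is hypothetical. The classifier itself is left as a pluggable `name_scorer` callable; the paper uses a binary Logistic Regression model at that point.

```python
import re
from typing import Callable, Dict, List


def char_ngrams(text: str, n_min: int = 2, n_max: int = 3) -> List[str]:
    """Return all character n-grams of lengths n_min..n_max, lowercased."""
    text = text.lower()
    return [
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    ]


def orthographic_features(name: str) -> Dict[str, float]:
    """Illustrative orthographic/binary features (assumed, not from the paper)."""
    length = max(len(name), 1)  # avoid division by zero on empty names
    return {
        "digit_ratio": sum(c.isdigit() for c in name) / length,
        "upper_ratio": sum(c.isupper() for c in name) / length,
        "symbol_ratio": sum(not c.isalnum() for c in name) / length,
        "has_digit": float(any(c.isdigit() for c in name)),
    }


def path_segments(path: str) -> List[str]:
    """Split a path on / or \\ into its non-empty components."""
    return [part for part in re.split(r"[\\/]+", path) if part]


def fuse_scores(name_score: float, path_score: float, weight: float = 0.5) -> float:
    """Approach 1 (assumed fusion rule): weighted average of the two scores."""
    return weight * name_score + (1.0 - weight) * path_score


def score_by_iterating(path: str, name_scorer: Callable[[str], float]) -> float:
    """Approach 2 (assumed aggregation): score every path segment with the
    file name scorer and keep the highest score."""
    return max((name_scorer(seg) for seg in path_segments(path)), default=0.0)
```

In practice, `name_scorer` would wrap a trained model that maps the n-gram and orthographic features of a string to a probability; any callable returning a float in [0, 1] fits the sketch.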