Classifying spam emails using agglomerative hierarchical clustering and a topic-based approach

Bibliographic Details
Title: Classifying spam emails using agglomerative hierarchical clustering and a topic-based approach
Authors: Janez-Martino, F., Alaiz-Rodriguez, R., Gonzalez-Castro, V., Fidalgo, E., Alegre, E.
Publication Year: 2024
Collection: Computer Science
Subject Terms: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
More Details: Spam emails are unsolicited, annoying and sometimes harmful messages which may contain malware, phishing or hoaxes. Unlike most studies that address the design of efficient anti-spam filters, we approach the spam email problem from a different and novel perspective. Focusing on the needs of cybersecurity units, we follow a topic-based approach for addressing the classification of spam email into multiple categories. We propose SPEMC-15K-E and SPEMC-15K-S, two novel datasets with approximately 15K emails each in English and Spanish, respectively, and we label them using agglomerative hierarchical clustering into 11 classes. We evaluate 16 pipelines, combining four text representation techniques, namely Term Frequency-Inverse Document Frequency (TF-IDF), Bag of Words, Word2Vec and BERT, and four classifiers: Support Vector Machine (SVM), Naïve Bayes (NB), Random Forest (RF) and Logistic Regression (LR). Experimental results show that the highest performance is achieved with TF-IDF and LR for the English dataset, with an F1 score of 0.953 and an accuracy of 94.6%, while for the Spanish dataset, TF-IDF with NB yields an F1 score of 0.945 and 98.5% accuracy. Regarding the processing time, TF-IDF with LR leads to the fastest classification, processing an English and Spanish spam email in and on average, respectively.
Document Type: Working Paper
Access URL: http://arxiv.org/abs/2402.05296
Accession Number: edsarx.2402.05296
Database: arXiv
FullText Text:
  Availability: 0
CustomLinks:
  – Url: http://arxiv.org/abs/2402.05296
    Name: EDS - Arxiv
    Category: fullText
    Text: View this record from Arxiv
    MouseOverText: View this record from Arxiv
Header DbId: edsarx
DbLabel: arXiv
An: edsarx.2402.05296
RelevancyScore: 1085
AccessLevel: 3
PubType: Report
PubTypeId: report
PreciseRelevancyScore: 1085.39135742188
IllustrationInfo
Items – Name: Title
  Label: Title
  Group: Ti
  Data: Classifying spam emails using agglomerative hierarchical clustering and a topic-based approach
– Name: Author
  Label: Authors
  Group: Au
  Data: Janez-Martino, F., Alaiz-Rodriguez, R., Gonzalez-Castro, V., Fidalgo, E., Alegre, E.
– Name: DatePubCY
  Label: Publication Year
  Group: Date
  Data: 2024
– Name: Subset
  Label: Collection
  Group: HoldingsInfo
  Data: Computer Science
– Name: Subject
  Label: Subject Terms
  Group: Su
  Data: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
– Name: Abstract
  Label: Description
  Group: Ab
  Data: Spam emails are unsolicited, annoying and sometimes harmful messages which may contain malware, phishing or hoaxes. Unlike most studies that address the design of efficient anti-spam filters, we approach the spam email problem from a different and novel perspective. Focusing on the needs of cybersecurity units, we follow a topic-based approach for addressing the classification of spam email into multiple categories. We propose SPEMC-15K-E and SPEMC-15K-S, two novel datasets with approximately 15K emails each in English and Spanish, respectively, and we label them using agglomerative hierarchical clustering into 11 classes. We evaluate 16 pipelines, combining four text representation techniques, namely Term Frequency-Inverse Document Frequency (TF-IDF), Bag of Words, Word2Vec and BERT, and four classifiers: Support Vector Machine (SVM), Naïve Bayes (NB), Random Forest (RF) and Logistic Regression (LR). Experimental results show that the highest performance is achieved with TF-IDF and LR for the English dataset, with an F1 score of 0.953 and an accuracy of 94.6%, while for the Spanish dataset, TF-IDF with NB yields an F1 score of 0.945 and 98.5% accuracy. Regarding the processing time, TF-IDF with LR leads to the fastest classification, processing an English and Spanish spam email in and on average, respectively.
– Name: TypeDocument
  Label: Document Type
  Group: TypDoc
  Data: Working Paper
– Name: URL
  Label: Access URL
  Group: URL
  Data: http://arxiv.org/abs/2402.05296
– Name: AN
  Label: Accession Number
  Group: ID
  Data: edsarx.2402.05296
RecordInfo BibRecord:
  BibEntity:
    Subjects:
      – SubjectFull: Computer Science - Machine Learning
        Type: general
      – SubjectFull: Computer Science - Artificial Intelligence
        Type: general
    Titles:
      – TitleFull: Classifying spam emails using agglomerative hierarchical clustering and a topic-based approach
        Type: main
  BibRelationships:
    HasContributorRelationships:
      – PersonEntity:
          Name:
            NameFull: Janez-Martino, F.
      – PersonEntity:
          Name:
            NameFull: Alaiz-Rodriguez, R.
      – PersonEntity:
          Name:
            NameFull: Gonzalez-Castro, V.
      – PersonEntity:
          Name:
            NameFull: Fidalgo, E.
      – PersonEntity:
          Name:
            NameFull: Alegre, E.
    IsPartOfRelationships:
      – BibEntity:
          Dates:
            – D: 07
              M: 02
              Type: published
              Y: 2024
ResultId 1