Detecting spam in unlabeled SMS data.
Look at these five messages from the SMS data:

- "XXXMobileMovieClub: To use your credit, click the WAP link in the next txt message or click here>> http://wap. xxxmobilemovieclub.com?n=QJKGIGHJJGCBL"
- "Oh k...i'm watching here:)"
- "Eh u remember how 2 spell his name... Yes i did. He v naughty make until i v wet."
- "Fine if that's the way u feel. That's the way its gota b"
- "England v Macedonia - dont miss the goals/team news. Txt ur national team to 87077 eg ENGLAND to 87077 Try:WALES, SCOTLAND 4txt/£1.20 POBOXox36504W45WQ 16+"

You can see for yourself that the first and the last messages look like spam. But can math do this well enough? An ordinary expert system could probably do this job very well. But they say that math can retrieve information from source data that you would notice neither at first glance nor after a week of studying it thoroughly. So, this is the first and simplest dataset to start with.
This is about mining (extracting) information from data without a classifying model trained on labeled data. It's called "unsupervised machine learning". If you look at Unsupervised learning (see the references), you will find that it is not easy: it assumes you know linear algebra, probability and statistics, and so on.
Anyway, let's try clustering, for example Clustering text documents using k-means (see the references).
But first take a look at Preprocessing data: the text entries must be converted into numeric arrays (vectorization), and before that the messages themselves should be prepared: lower-cased, reduced to words only, and so on. See the Spam classifier reference about data preparation. The preparation is imperfect: for example, the glued phrase "morefrmmob" stays unseparated by every method, and the sentence "ye gauti sehwag odi seri" becomes "yes gauti sehwag odi series" under lemmatization (probably a wrong fix), although the lemmatizers did not fix "u know" to "you know".
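As an illustration of that preparation, here is a minimal sketch using NLTK's PorterStemmer and scikit-learn's CountVectorizer; the regular expression, stop-word handling and max_features value are illustrative choices, not the exact code of the referenced classifier:

```python
import re

import pandas as pd
from nltk.corpus import stopwords      # run nltk.download('stopwords') once
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

data = pd.read_csv("SMSSpamCollection", sep="\t", header=None,
                   names=["label", "message"])
stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def prepare(message):
    # lower-case and keep only alphabetic words, as described above
    words = re.sub("[^a-zA-Z]", " ", message).lower().split()
    # drop stop words and reduce the rest to stems ("claims" -> "claim")
    return " ".join(stemmer.stem(w) for w in words if w not in stop_words)

corpus = [prepare(m) for m in data["message"]]

# vectorization: each prepared message becomes a row of word counts
X = CountVectorizer(max_features=2500).fit_transform(corpus)
print(X.shape)  # (5572, 2500)
```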
So, the first attempt is in src/python/spam-detect/spam-clustering.py. But the results are different every time, and the same happens with the scikit-learn example https://scikit-learn.org/stable/_downloads/ba68199eea858ec04949b2c6c65147e0/plot_document_clustering.py. First invocation of the scikit-learn example:
Homogeneity: 0.480
Completeness: 0.506
V-measure: 0.493
Adjusted Rand-Index: 0.486
Silhouette Coefficient: 0.005
Top terms per cluster:
Cluster 0: space cleveland cs com polygon sci higgins university freenet book
Cluster 1: graphics image thanks university file 3d files program format images
Cluster 2: god com people sandvik don jesus say article think christian
Cluster 3: com space access nasa digex posting article nntp host pat

Second invocation of the scikit-learn example:
Homogeneity: 0.514
Completeness: 0.547
V-measure: 0.530
Adjusted Rand-Index: 0.478
Silhouette Coefficient: 0.008
Top terms per cluster:
Cluster 0: graphics image university thanks com ac 3d file files posting
Cluster 1: sandvik keith com sgi livesey kent apple morality caltech objective
Cluster 2: space nasa access digex henry gov pat shuttle toronto alaska
Cluster 3: god com people don just article jesus think like know

So, this is definitely more complex than the previous article (Support Vector Machines), which gives the same result on every invocation. In the first invocation the "space" category seems to be divided between clusters #0 and #3, and "Atheism" and "Religion" seem to be merged into cluster #2. But why do the results differ? Maybe a word changing its position among the top terms just means that neighboring words have the same frequency?
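All of the quoted scores come from sklearn.metrics, computed against the known categories, and the four label-based ones are invariant to permuting cluster ids, which is why they stay comparable even when the clusters come out in a different order. A minimal sketch with toy labels (the values are made up just to keep it runnable):

```python
from sklearn import metrics

# true_labels: the known categories; pred: cluster ids from a fitted KMeans
true_labels = [0, 0, 0, 1, 1, 1]
pred        = [1, 1, 0, 0, 0, 0]

print("Homogeneity: %.3f" % metrics.homogeneity_score(true_labels, pred))
print("Completeness: %.3f" % metrics.completeness_score(true_labels, pred))
print("V-measure: %.3f" % metrics.v_measure_score(true_labels, pred))
print("Adjusted Rand-Index: %.3f" % metrics.adjusted_rand_score(true_labels, pred))
# The Silhouette Coefficient needs the feature matrix, not the true labels:
# metrics.silhouette_score(X, pred, sample_size=1000)
```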
The API says:
In practice, the k-means algorithm is very fast (one of the fastest clustering algorithms available), but it falls in local minima. That’s why it can be useful to restart it several times.
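In scikit-learn those restarts are controlled by the n_init parameter of KMeans: the algorithm is run n_init times from different random centroid seeds and the run with the lowest inertia is kept, while max_iter bounds each single run. A minimal sketch on stand-in texts (the sample messages are invented):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["free prize claim call now", "ok see you at home tonight",
         "urgent call now to claim cash", "i am watching tv here"]
X = TfidfVectorizer().fit_transform(texts)  # stand-in for the SMS matrix

# n_init: run k-means that many times from different random centroid seeds
# and keep the run with the lowest inertia; max_iter bounds each single run.
km = KMeans(n_clusters=2, init="k-means++", n_init=10, max_iter=300).fit(X)
print(km.labels_)   # a cluster id per message
print(km.inertia_)  # sum of squared distances to the closest centroids
```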
Using the default settings (n_init, default=10, and max_iter, default=300) gives more stable results for the SMS task, but it is slower:

Homogeneity: 0.000
Completeness: 0.000
V-measure: 0.000
Adjusted Rand-Index: 0.003
Silhouette Coefficient: 0.005
Top 20 terms per cluster:
Cluster 0: to the is in ok it my me and now for call your of on not ur that no at
Cluster 1: you to are and me the have do call in your that how when can for what my it know
First 100 labels (source - cluster): 0 - 0; 0 - 0; 1 - 0; 0 - 0; 0 - 0; 1 - 0; 0 - 0; 0 - 0; 1 - 1; 1 - 0; 0 - 0; 1 - 0; 1 - 1; 0 - 1; 0 - 0; 1 - 0; 0 - 0; 0 - 0; 0 - 0; 1 - 0; 0 - 1; 0 - 0; 0 - 0; 0 - 0; 0 - 1; 0 - 0; 0 - 0; 0 - 1; 0 - 1; 0 - 0; 0 - 1; 0 - 0; 0 - 1; 0 - 1; 1 - 1; 0 - 0; 0 - 1; 0 - 0; 0 - 0; 0 - 1; 0 - 0; 0 - 1; 1 - 1; 0 - 1; 0 - 1; 0 - 0; 0 - 1; 0 - 0; 0 - 0; 0 - 0; 0 - 1; 0 - 0; 0 - 0; 0 - 1; 1 - 0; 0 - 1; 1 - 0; 0 - 0; 0 - 1; 0 - 0; 0 - 0; 0 - 0; 0 - 0; 0 - 0; 0 - 0; 1 - 1; 0 - 0; 1 - 0; 1 - 1; 0 - 0; 0 - 0; 0 - 1; 0 - 0; 0 - 1; 0 - 0; 0 - 1; 0 - 1; 0 - 0; 0 - 0; 0 - 0; 0 - 0; 0 - 1; 0 - 0; 0 - 1; 0 - 0; 0 - 1; 0 - 0; 0 - 0; 0 - 0; 0 - 0; 0 - 1; 0 - 0; 0 - 0; 1 - 1; 0 - 0; 1 - 0; 0 - 0; 0 - 0; 0 - 0; 0 - 0;
Accuracy = 68.12634601579325%

Sometimes the cluster labels come out inverted, so 32% really means 68%.
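Because k-means assigns arbitrary cluster ids, cluster 0 may mean "ham" in one run and "spam" in the next. For two clusters a simple fix is to score both mappings and keep the better one; a small illustrative helper (clustering_accuracy is hypothetical, not taken from the script):

```python
import numpy as np

def clustering_accuracy(true_labels, cluster_ids):
    """Accuracy of a 2-cluster result against binary labels,
    trying both possible cluster-id -> label mappings."""
    acc = np.mean(np.asarray(true_labels) == np.asarray(cluster_ids))
    return max(acc, 1.0 - acc)  # an "inverted" 32% is really 68%

print(clustering_accuracy([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0
```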
The Spam classifier gives 98.47% accuracy for a linear SVM with 4457 training samples (out of 5572 total) and 1115 test samples (145 of them spam, 14 of those misclassified, i.e. accuracy on spam = 90.35%), using PorterStemmer and sklearn.feature_extraction.text.CountVectorizer. If an SVM trained on the labeled data can separate these samples so well, then there should be a mathematical method that separates the unlabeled ones well enough too (70% isn't good). Indeed, with those same preprocessors the clustering result is better (src/python/spam-detect/spam-clustering1.py):
Homogeneity: 0.169
Completeness: 0.193
V-measure: 0.180
Adjusted Rand-Index: 0.367
Silhouette Coefficient: 0.090
Top 20 terms per cluster:
Cluster 0: go get ur gt lt come ok day know love like got good time want send text need free txt
Cluster 1: call free mobil claim min prize pleas contact urgent later award phone sorri ppm text custom number servic cash guarante
First 100 labels (source - cluster): 0 - 0; 0 - 0; 1 - 0; 0 - 0; 0 - 0; 1 - 0; 0 - 0; 0 - 0; 1 - 1; 1 - 1; 0 - 0; 1 - 0; 1 - 0; 0 - 0; 0 - 0; 1 - 0; 0 - 0; 0 - 0; 0 - 0; 1 - 0; 0 - 0; 0 - 0; 0 - 0; 0 - 0; 0 - 0; 0 - 0; 0 - 0; 0 - 0; 0 - 0; 0 - 0; 0 - 0; 0 - 0; 0 - 0; 0 - 0; 1 - 0; 0 - 0; 0 - 0; 0 - 0; 0 - 0; 0 - 0; 0 - 0; 0 - 0; 1 - 1; 0 - 0; 0 - 0; 0 - 1; 0 - 0; 0 - 0; 0 - 0; 0 - 0; 0 - 0; 0 - 0; 0 - 0; 0 - 0; 1 - 0; 0 - 0; 1 - 1; 0 - 1; 0 - 0; 0 - 0; 0 - 0; 0 - 0; 0 - 0; 0 - 0; 0 - 0; 1 - 1; 0 - 0; 1 - 0; 1 - 0; 0 - 0; 0 - 0; 0 - 0; 0 - 0; 0 - 0; 0 - 1; 0 - 1; 0 - 0; 0 - 0; 0 - 0; 0 - 0; 0 - 1; 0 - 1; 0 - 0; 0 - 0; 0 - 0; 0 - 1; 0 - 0; 0 - 0; 0 - 0; 0 - 0; 0 - 0; 0 - 0; 0 - 0; 1 - 1; 0 - 0; 1 - 0; 0 - 0; 0 - 0; 0 - 0; 0 - 0;
Accuracy total = 87.86791098348887%
Spam total = 747
Accuracy spam = 45.64926372155288%

But the accuracy on exactly the spam messages (747 out of 5572 total) is not good.
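The total accuracy hides the class imbalance: with 747 spam messages out of 5572, a clustering that throws almost everything into the "ham" cluster already scores about 86.6% overall, so the spam-only accuracy is the number to watch. A small illustrative sketch of per-class scoring (the helper and its toy data are hypothetical, not taken from spam-clustering1.py):

```python
import numpy as np

def per_class_accuracy(true_labels, cluster_ids):
    """Total and spam-only accuracy of a 2-cluster result (0=ham, 1=spam)."""
    true_labels = np.asarray(true_labels)
    cluster_ids = np.asarray(cluster_ids)
    if np.mean(true_labels == cluster_ids) < 0.5:
        cluster_ids = 1 - cluster_ids  # undo an inverted labeling
    total = np.mean(true_labels == cluster_ids)
    spam = true_labels == 1
    return total, np.mean(cluster_ids[spam] == 1)

# Dumping everything into the "ham" cluster still looks decent overall:
total, spam = per_class_accuracy([0, 0, 0, 0, 1, 1], [0, 0, 0, 0, 0, 0])
print("total = %.1f%%, spam = %.1f%%" % (100 * total, 100 * spam))  # 66.7, 0.0
```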
References:
- https://scikit-learn.org/stable/modules/preprocessing.html#preprocessing
- https://scikit-learn.org/stable/unsupervised_learning.html
- scikit-learn: Clustering text documents using k-means: https://scikit-learn.org/stable/auto_examples/text/plot_document_clustering.html
- https://github.com/ksdkamesh99/Spam-Classifier by Kota Sai Durga Kamesh (MIT License)
- SMS Spam Dataset created by UCI Machine Learning