CLUSTER-BASED MEASUREMENT OF DOCUMENT SIMILARITY

Ibnu Santoso, Lya Hulliyyatus Suadaa

Abstract


Document similarity can be measured and used to find similar documents in a document collection (corpus). In a small corpus, measuring document similarity is not a problem. In a larger corpus, comparing a document against every other document can be time-consuming. Clustering can reduce the number of documents that must be compared with a given document, which saves time. This research aims to determine the effect of clustering on document similarity measurement and to evaluate its performance. The corpus consisted of 2,049 undergraduate theses written by Politeknik Statistika STIS students from 2007 to 2016. The documents were represented with a bag-of-words model and clustered using the k-means method. Similarity was measured with cosine similarity. In the simulation, clustering into 3 clusters needed a longer preparation time (17.32%) but resulted in faster query processing (77.88%), with an accuracy of 0.98. Clustering into 5 clusters needed a longer preparation time (31.10%) but resulted in faster query processing (83.79%), with an accuracy of 0.86. Clustering into 7 clusters needed a longer preparation time (45.10%) but resulted in faster query processing (85.30%), with an accuracy of 0.98.
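The pipeline summarized in the abstract (bag-of-words vectors, k-means clustering, then cosine similarity restricted to one cluster) can be illustrated with a minimal sketch. This is not the authors' implementation. The toy documents, the function names, the whitespace tokenizer, the Euclidean assignment step, and the choice of the first k documents as initial centroids are all assumptions made to keep the example small and deterministic.

```python
import math
from collections import Counter


def bag_of_words(text):
    # Assumed tokenizer: lowercase and split on whitespace.
    return Counter(text.lower().split())


def cosine(a, b):
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(v * b.get(t, 0) for t, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def sq_dist(a, b):
    # Squared Euclidean distance, used for the k-means assignment step.
    return sum((a.get(t, 0) - b.get(t, 0)) ** 2 for t in set(a) | set(b))


def centroid(vectors):
    # Mean term-count vector of a group of documents.
    total = Counter()
    for v in vectors:
        total.update(v)
    return {t: c / len(vectors) for t, c in total.items()}


def kmeans(vectors, k, iterations=10):
    # Assumption: the first k documents seed the centroids (deterministic).
    centroids = [dict(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c]))
                  for v in vectors]
        for c in range(k):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = centroid(members)
    return labels, centroids


def most_similar(query, vectors, labels, centroids):
    # Pick the closest cluster first, then compare only within that cluster.
    q = bag_of_words(query)
    best = max(range(len(centroids)), key=lambda c: cosine(q, centroids[c]))
    candidates = [i for i, l in enumerate(labels) if l == best]
    return max(candidates, key=lambda i: cosine(q, vectors[i]))


# Hypothetical toy corpus, not taken from the study.
docs = [
    "regression model regression estimate",
    "network router network packet",
    "regression estimate variance",
    "router packet latency",
]
vectors = [bag_of_words(d) for d in docs]
labels, centroids = kmeans(vectors, k=2)
match = most_similar("packet latency router", vectors, labels, centroids)
print(labels, match)
```

In this sketch the clustering step is the extra preparation cost, and the query step compares against the documents of a single cluster only. A similar document that falls in a different cluster is missed, which is one way accuracy can drop below 1.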







DOI: http://dx.doi.org/10.20527/klik.v6i1.181

Copyright (c) 2019 KLIK - KUMPULAN JURNAL ILMU KOMPUTER

Creative Commons License
This work is licensed under a Creative Commons Attribution 4.0 International License.
