Measuring similarity between Karel programs using character and word n-grams

G. Sidorov; M. Ibarra Romero; I. Markov; R. Guzman-Cabrera; L. Chanona-Hernández; F. Velásquez

doi:10.1134/S0361768817010066

Authors: Sidorov G.¹, Ibarra Romero M.¹, Markov I.¹, Guzman-Cabrera R.², Chanona-Hernández L.³, Velásquez F.⁴
Affiliations:
1. Instituto Politécnico Nacional (IPN)
2. Engineering Division
3. Instituto Politécnico Nacional
4. Polytechnic University of Queretaro
Issue: Vol 43, No 1 (2017)
Pages: 47-50
Section: Article
URL: https://journal-vniispk.ru/0361-7688/article/view/176478
DOI: https://doi.org/10.1134/S0361768817010066
ID: 176478

Abstract

We present a method for measuring similarity between source code programs. We approach the task from a machine learning perspective, using character and word n-grams as features and comparing several machine learning algorithms. We also examine the contribution of latent semantic analysis (LSA) to this task. To evaluate the approach, we developed a corpus of around 10,000 programs written in the Karel programming language that solve 100 different tasks. The results show that the highest classification accuracy is achieved with a Support Vector Machine (SVM) classifier, with LSA applied and word trigrams used as features.
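
To make the feature types concrete, the following is a minimal sketch of character and word n-gram extraction, with a cosine score computed over the raw n-gram counts. It is an illustration under stated assumptions, not the authors' pipeline: the abstract describes classification with machine learning algorithms (with and without LSA), whereas this sketch only compares count vectors directly. The tokenizer, the function names, and the Karel-like sample programs are all hypothetical.

```python
import math
import re
from collections import Counter


def char_ngrams(text, n):
    """Count overlapping character n-grams in a program's source text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def word_ngrams(text, n):
    """Count overlapping word (token) n-grams in a program's source text."""
    # Assumption: a token is an identifier, a number, or one punctuation mark.
    tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_-]*|\d+|[^\sA-Za-z0-9_]", text)
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def cosine_similarity(a, b):
    """Cosine of the angle between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical Karel-like snippets, used only to exercise the functions.
prog_a = "move(); move(); turnleft(); move();"
prog_b = "move(); move(); turnleft(); turnleft();"
prog_c = "while (frontIsClear()) { pickbeeper(); }"

sim_ab = cosine_similarity(word_ngrams(prog_a, 3), word_ngrams(prog_b, 3))
sim_ac = cosine_similarity(word_ngrams(prog_a, 3), word_ngrams(prog_c, 3))
```

Here `sim_ab` comes out higher than `sim_ac`, because the first two snippets share most of their word trigrams. In a setup closer to the one the abstract describes, the same n-gram counts would instead serve as feature vectors for a classifier such as an SVM, optionally after a dimensionality reduction step such as LSA.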

Keywords

machine learning, similarity, Karel programming language, character n-grams, word n-grams, SVM, LSA
