{"id":49768,"date":"2019-07-02T09:51:00","date_gmt":"2019-07-02T16:51:00","guid":{"rendered":"https:\/\/github.blog\/?p=49768"},"modified":"2022-02-16T17:07:00","modified_gmt":"2022-02-17T01:07:00","slug":"c-or-java-typescript-or-javascript-machine-learning-based-classification-of-programming-languages","status":"publish","type":"post","link":"https:\/\/github.blog\/ai-and-ml\/machine-learning\/c-or-java-typescript-or-javascript-machine-learning-based-classification-of-programming-languages\/","title":{"rendered":"C# or Java? TypeScript or JavaScript? Machine learning based classification of programming languages"},"content":{"rendered":"<p><span style=\"font-weight: 400;\">GitHub hosts over 300 programming languages\u2014from commonly used languages such as Python, Java, and JavaScript to esoteric languages such as <\/span><a href=\"https:\/\/en.wikipedia.org\/wiki\/Esoteric_programming_language#Befunge\"><span style=\"font-weight: 400;\">Befunge<\/span><\/a><span style=\"font-weight: 400;\">, only known to very small communities.<\/span><\/p>\n<h6 style=\"text-align: center;\"><img data-recalc-dims=\"1\" decoding=\"async\" loading=\"lazy\" class=\"alignnone\" src=\"https:\/\/i0.wp.com\/user-images.githubusercontent.com\/7935808\/60456030-506b1a80-9bf5-11e9-9f18-89f1681d0821.png?resize=1812%2C818&#038;ssl=1\" alt=\"JavaScript is the top programming language on GitHub, followed by Java and HTML\" width=\"1812\" height=\"818\" \/><br \/>\n<span style=\"font-weight: 400;\">Figure 1: Top 10 programming languages hosted by GitHub by repository count <\/span><span style=\"font-weight: 400;\"><br \/>\n<\/span><\/h6>\n<p><span style=\"font-weight: 400;\">One of the necessary challenges that GitHub faces is to be able to recognize these different languages. 
When some code is pushed to a repository, it\u2019s important to recognize the type of code that was added for the purposes of search, security vulnerability alerting, and syntax highlighting\u2014and to show the repository\u2019s content distribution to users.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Despite appearances, language recognition isn\u2019t a trivial task. File names and extensions, while providing a good indication of what the coding language is likely to be, do not offer the full picture. In fact, many extensions are associated with the same language (e.g., &#8220;.pl&#8221;, &#8220;.pm&#8221;, &#8220;.t&#8221;, &#8220;.pod&#8221; are all associated with Perl), while others are ambiguous and used almost interchangeably across languages (e.g., \u201c.h\u201d is commonly used to indicate many languages of the \u201cC\u201d family, including C, C++, and Objective-C). In other cases, files are simply provided with no extension (especially for executable scripts) or with the incorrect extension (either on purpose or accidentally).<\/span><\/p>\n<p><a href=\"https:\/\/github.com\/github\/linguist\"><span style=\"font-weight: 400;\">Linguist<\/span><\/a><span style=\"font-weight: 400;\"> is the tool we currently use to detect coding languages at GitHub. Linguist is a Ruby-based application that uses various strategies for language detection, leveraging naming conventions and file extensions and also taking into account Vim or Emacs modelines, as well as the content at the top of the file (shebang). Linguist handles language disambiguation via heuristics and, failing that, via a Naive Bayes classifier trained on a small sample of data.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Although Linguist does a good job making file-level language predictions (84% accuracy), its performance declines considerably when files use unexpected naming conventions and, crucially, when a file extension is not provided. 
This renders Linguist unsuitable for content such as GitHub Gists or code snippets within READMEs, issues, and pull requests.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">In order to make language detection more robust and maintainable in the long run, we developed a machine learning classifier named OctoLingua based on an Artificial Neural Network (ANN) architecture which can handle language predictions in tricky scenarios. The current version of the model is able to make predictions for the top 50 languages hosted by GitHub and surpasses Linguist in accuracy and performance.\u00a0<\/span><\/p>\n<h2 id=\"the-nuts-and-bolts-behind-octolingua\"><a class=\"heading-link\" href=\"#the-nuts-and-bolts-behind-octolingua\"><span style=\"font-weight: 400;\">The Nuts and Bolts Behind OctoLingua<\/span><span class=\"heading-hash pl-2 text-italic text-bold\" aria-hidden=\"true\"><\/span><\/a><\/h2>\n<p><span style=\"font-weight: 400;\">OctoLingua was built from scratch using Python and Keras with a TensorFlow backend, and is designed to be accurate, robust, and easy to maintain. In this section, we describe our data sources, model architecture, and performance benchmark for OctoLingua. We also describe what it takes to add support for a new language.\u00a0<\/span><\/p>\n<h3 id=\"data-sources\"><a class=\"heading-link\" href=\"#data-sources\">Data sources<span class=\"heading-hash pl-2 text-italic text-bold\" aria-hidden=\"true\"><\/span><\/a><\/h3>\n<p><span style=\"font-weight: 400;\">The current version of OctoLingua was trained on files retrieved from <\/span><a href=\"http:\/\/www.rosettacode.org\/wiki\/Rosetta_Code\"><span style=\"font-weight: 400;\">Rosetta Code<\/span><\/a><span style=\"font-weight: 400;\"> and from a set of quality repositories crowdsourced internally. 
We limited our language set to the top 50 languages hosted on GitHub.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Rosetta Code was an excellent starter dataset as it contained source code for the same task expressed in different programming languages. For example, the task of generating a <\/span><a href=\"https:\/\/github.com\/acmeism\/RosettaCodeData\/tree\/master\/Task\/Fibonacci-sequence\/\"><span style=\"font-weight: 400;\">Fibonacci sequence<\/span><\/a><span style=\"font-weight: 400;\"> is expressed in C, C++, CoffeeScript, D, Java, Julia, and more. However, the coverage across languages was not uniform: some languages had only a handful of files, and some files were too sparsely populated. Augmenting our training set with some additional sources was therefore necessary and substantially improved language coverage and performance.<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Our process for adding a new language is now fully automated. We programmatically collect source code from public repositories on GitHub. We choose repositories that meet minimum qualifying criteria such as having a minimum number of forks, covering the target language, and covering specific file extensions. For this stage of data collection, we determine the primary language of a repository using the classification from Linguist.\u00a0<\/span><\/p>\n<h3 id=\"features-leveraging-prior-knowledge\"><a class=\"heading-link\" href=\"#features-leveraging-prior-knowledge\">Features: leveraging prior knowledge<span class=\"heading-hash pl-2 text-italic text-bold\" aria-hidden=\"true\"><\/span><\/a><\/h3>\n<p><span style=\"font-weight: 400;\">Traditionally, for text classification problems with Neural Networks, memory-based architectures such as Recurrent Neural Networks (RNN) and Long Short Term Memory Networks (LSTM) are often employed. 
However, given that programming languages differ in vocabulary, commenting style, file extensions, structure, library import style, and other minor details, we opted for a simpler approach that leverages all this information by extracting some relevant features in tabular form to be fed to our classifier. The features currently extracted are as follows:<\/span><\/p>\n<ol>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Top five special characters per file<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Top 20 tokens per file<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">File extension<\/span><\/li>\n<li style=\"font-weight: 400;\"><span style=\"font-weight: 400;\">Presence of certain special characters commonly used in source code files such as colons, curly braces, and semicolons<\/span><\/li>\n<\/ol>\n<h3 id=\"the-artificial-neural-network-ann-model\"><a class=\"heading-link\" href=\"#the-artificial-neural-network-ann-model\">The Artificial Neural Network (ANN) model<span class=\"heading-hash pl-2 text-italic text-bold\" aria-hidden=\"true\"><\/span><\/a><\/h3>\n<p><span style=\"font-weight: 400;\">We use the above features as input to a two-layer Artificial Neural Network built using Keras with a TensorFlow backend.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The diagram below shows that the feature extraction step produces an n-dimensional tabular input for our classifier. 
As the information moves along the layers of our network, it is regularized by dropout and ultimately produces a 51-dimensional output which represents the predicted probability that the given code is written in each of the top 50 GitHub languages plus the probability that it is not written in any of those.<\/span><\/p>\n<h6 style=\"text-align: center;\"><img data-recalc-dims=\"1\" decoding=\"async\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/user-images.githubusercontent.com\/7935808\/60456057-6a0c6200-9bf5-11e9-9f87-fdc4f8cc145b.png?ssl=1\" alt=\"image\" \/><br \/>\n<span style=\"font-weight: 400;\">Figure 2: The ANN Structure of our initial model (50 languages + 1 for \u201cother\u201d)<\/span><\/h6>\n<p><span style=\"font-weight: 400;\">We used 90% of our dataset for training over approximately eight epochs. Additionally, we removed a percentage of file extensions from our training data at the training step, to encourage the model to learn from the vocabulary of the files, and not overfit on the file extension feature, which is highly predictive.<\/span><\/p>\n<h3 id=\"performance-benchmark\"><a class=\"heading-link\" href=\"#performance-benchmark\">Performance benchmark<span class=\"heading-hash pl-2 text-italic text-bold\" aria-hidden=\"true\"><\/span><\/a><\/h3>\n<h4 id=\"octolingua-vs-linguist\"><a class=\"heading-link\" href=\"#octolingua-vs-linguist\"><strong>OctoLingua vs. Linguist<\/strong><span class=\"heading-hash pl-2 text-italic text-bold\" aria-hidden=\"true\"><\/span><\/a><\/h4>\n<p><span style=\"font-weight: 400;\">In Figure 3, we show the <\/span><a href=\"https:\/\/en.wikipedia.org\/wiki\/Precision_and_recall\"><span style=\"font-weight: 400;\">F1 Score<\/span><\/a><span style=\"font-weight: 400;\"> (harmonic mean between precision and recall) of OctoLingua and Linguist calculated on the same test set (10% from our initial data source).\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">Here we show three tests. 
The first test uses the test set unmodified. The second test uses the same test files with file extension information removed, and the third uses the same files with file extensions scrambled so as to confuse the classifiers (e.g., a Java file may have a \u201c.txt\u201d extension and a Python file may have a \u201c.java\u201d extension).\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The intuition behind scrambling or removing the file extensions in our test set is to assess the robustness of OctoLingua in classifying files when a key feature is removed or is misleading. A classifier that does not rely heavily on extension would be extremely useful to classify gists and snippets, since in those cases it is common for people not to provide accurate extension information (e.g., many code-related gists have a .txt extension).<\/span><\/p>\n<p><span style=\"font-weight: 400;\">The table below shows how OctoLingua maintains good performance under various conditions, suggesting that the model learns primarily from the vocabulary of the code, rather than from meta information (i.e., file extension), whereas Linguist fails as soon as the information on file extensions is altered.<\/span><\/p>\n<h6 style=\"text-align: center;\"><img data-recalc-dims=\"1\" decoding=\"async\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/user-images.githubusercontent.com\/7935808\/60456070-7abcd800-9bf5-11e9-8bd8-18eed09b3612.png?ssl=1\" alt=\"image\" \/><br \/>\n<span style=\"font-weight: 400;\">Figure 3: Performance of OctoLingua vs. 
Linguist on the same test set<\/span><\/h6>\n<p>&nbsp;<\/p>\n<h4 id=\"effect-of-removing-file-extension-during-training-time\"><a class=\"heading-link\" href=\"#effect-of-removing-file-extension-during-training-time\"><strong>Effect of removing file extension during training time<\/strong><span class=\"heading-hash pl-2 text-italic text-bold\" aria-hidden=\"true\"><\/span><\/a><\/h4>\n<p><span style=\"font-weight: 400;\">As mentioned earlier, during training time we removed a percentage of file extensions from our training data to encourage the model to learn from the vocabulary of the files. The table <\/span><span style=\"font-weight: 400;\">below shows the performance of our model with different fractions of file extensions removed during training time.\u00a0<\/span><\/p>\n<h6 style=\"text-align: center;\"><img data-recalc-dims=\"1\" decoding=\"async\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/user-images.githubusercontent.com\/7935808\/60456079-87d9c700-9bf5-11e9-842a-7c6044d0c587.png?ssl=1\" alt=\"image\" \/><br \/>\n<span style=\"font-weight: 400;\">Figure 4: Performance of OctoLingua with different percentage of file extensions removed on our three test variations<\/span><\/h6>\n<p><span style=\"font-weight: 400;\">Notice that with no file extension removed during training time, the performance of OctoLingua on test files with no extensions and randomized extensions decreases considerably from that on the regular test data. On the other hand, when the model is trained on a dataset where some file extensions are removed, the model performance does not decline much on the modified test set. This confirms that removing the file extension from a fraction of files at training time induces our classifier to learn more from the vocabulary. 
It also shows that the file extension feature, while highly predictive, tended to dominate and prevented more weight from being assigned to the content features.\u00a0<\/span><\/p>\n<h3 id=\"supporting-a-new-language\"><a class=\"heading-link\" href=\"#supporting-a-new-language\">Supporting a new language<span class=\"heading-hash pl-2 text-italic text-bold\" aria-hidden=\"true\"><\/span><\/a><\/h3>\n<p><span style=\"font-weight: 400;\">Adding a new language to OctoLingua is fairly straightforward. It starts with obtaining a large set of files in the new language (we can do this programmatically as described in the Data sources section). These files are split into a training and a test set and then run through our preprocessor and feature extractor. The new training and test sets are added to our existing pool of training and testing data. The new testing set allows us to verify that the accuracy of our model remains acceptable.<\/span><\/p>\n<h6 style=\"text-align: center;\"><img data-recalc-dims=\"1\" decoding=\"async\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/user-images.githubusercontent.com\/7935808\/60456097-988a3d00-9bf5-11e9-9ce9-005fc154982e.png?ssl=1\" alt=\"image\" \/><br \/>\n<span style=\"font-weight: 400;\">Figure 5: Adding a new language with OctoLingua<\/span><\/h6>\n<h2 id=\"our-plans\"><a class=\"heading-link\" href=\"#our-plans\">Our plans<span class=\"heading-hash pl-2 text-italic text-bold\" aria-hidden=\"true\"><\/span><\/a><\/h2>\n<p><span style=\"font-weight: 400;\">As of now, OctoLingua is at the \u201cadvanced prototyping stage\u201d. Our language classification engine is already robust and reliable, but does not yet support all coding languages on our platform. Aside from broadening language support\u2014which would be rather straightforward\u2014we aim to enable language detection at various levels of granularity. Our current implementation already allows us, with a small modification to our machine learning engine, to classify code snippets. 
It wouldn\u2019t be too far-fetched to take the model to the stage where it can reliably detect and classify embedded languages.\u00a0<\/span><\/p>\n<p><span style=\"font-weight: 400;\">We are also contemplating the possibility of open sourcing our model and would love to hear from the community if you\u2019re interested.<\/span><\/p>\n<h2 id=\"summary\"><a class=\"heading-link\" href=\"#summary\">Summary<span class=\"heading-hash pl-2 text-italic text-bold\" aria-hidden=\"true\"><\/span><\/a><\/h2>\n<p><span style=\"font-weight: 400;\">With OctoLingua, our goal is to provide a service that enables robust and reliable source code language detection at multiple levels of granularity, from file level or snippet level to potentially line-level language detection and classification. Eventually, this service can support, among others, code searchability, code sharing, language highlighting, and diff rendering\u2014all of this aimed at supporting developers in their day-to-day development work in addition to helping them write quality code.\u00a0 If you are interested in leveraging or contributing to our work, please feel free to get in touch on Twitter <a href=\"https:\/\/twitter.com\/github\">@github<\/a>!<\/span><\/p>\n<h2 id=\"topics-data-team\" id=\"topics-data-team\" ><a class=\"heading-link\" href=\"#topics-data-team\">Authors<span class=\"heading-hash pl-2 text-italic text-bold\" aria-hidden=\"true\"><\/span><\/a><\/h2>\n<ul>\n<li><a href=\"http:\/\/www.kavita-ganesan.com\/\">Kavita Ganesan<\/a>,\u00a0<a href=\"https:\/\/github.com\/kavgan\/\">@kavgan<\/a>, Machine Learning Engineer<\/li>\n<li>Romano Foti, <a href=\"https:\/\/github.com\/romanofoti\/\">@romanofoti<\/a>, Senior Machine Learning Engineer<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>To make language detection more robust and maintainable in the long run, we developed a machine learning classifier named OctoLingua based on an Artificial Neural Network (ANN) architecture which can 
handle language predictions in tricky scenarios.<\/p>\n","protected":false},"author":1331,"featured_media":49776,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_gh_post_show_toc":"no","_gh_post_is_no_robots":"","_gh_post_is_featured":"no","_gh_post_is_excluded":"no","_gh_post_is_unlisted":"","_gh_post_related_link_1":"","_gh_post_related_link_2":"","_gh_post_related_link_3":"","_gh_post_sq_img":"","_gh_post_sq_img_id":"","_gh_post_cta_title":"","_gh_post_cta_text":"","_gh_post_cta_link":"","_gh_post_cta_button":"","_gh_post_recirc_hide":"","_gh_post_recirc_col_1":"","_gh_post_recirc_col_2":"","_gh_post_recirc_col_3":"","_gh_post_recirc_col_4":"","_featured_video":"","_gh_post_additional_query_params":"","_jetpack_memberships_contains_paid_content":false,"footnotes":"","jetpack_publicize_message":"","jetpack_publicize_feature_enabled":true,"jetpack_social_post_already_shared":true,"jetpack_social_options":{"image_generator_settings":{"template":"highway","default_image_id":0,"font":"","enabled":false},"version":2},"_wpas_customize_per_network":false,"_links_to":"","_links_to_target":""},"categories":[3293,3297],"tags":[2510],"coauthors":[2511],"class_list":["post-49768","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-and-ml","category-machine-learning","tag-machine-learning"],"yoast_head":"<!-- This site is optimized with the Yoast SEO Premium plugin v27.3 (Yoast SEO v27.3) - https:\/\/yoast.com\/product\/yoast-seo-premium-wordpress\/ -->\n<title>C# or Java? TypeScript or JavaScript? 
Machine learning based classification of programming languages - The GitHub Blog<\/title>\n<meta name=\"description\" content=\"To make language detection more robust and maintainable in the long run, we developed a machine learning classifier named OctoLingua.\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/github.blog\/ai-and-ml\/machine-learning\/c-or-java-typescript-or-javascript-machine-learning-based-classification-of-programming-languages\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"C# or Java? TypeScript or JavaScript? Machine learning based classification of programming languages\" \/>\n<meta property=\"og:description\" content=\"To make language detection more robust and maintainable in the long run, we developed a machine learning classifier named OctoLingua.\" \/>\n<meta property=\"og:url\" content=\"https:\/\/github.blog\/ai-and-ml\/machine-learning\/c-or-java-typescript-or-javascript-machine-learning-based-classification-of-programming-languages\/\" \/>\n<meta property=\"og:site_name\" content=\"The GitHub Blog\" \/>\n<meta property=\"article:published_time\" content=\"2019-07-02T16:51:00+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2022-02-17T01:07:00+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/github.blog\/wp-content\/uploads\/2019\/07\/BlogHeaders_final_INSIGHTS_1200x630.png?fit=1201%2C630\" \/>\n\t<meta property=\"og:image:width\" content=\"1201\" \/>\n\t<meta property=\"og:image:height\" content=\"630\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"Kavita Ganesan\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:title\" content=\"C# or Java? TypeScript or JavaScript? 
Machine learning based classification of programming languages\" \/>\n<meta name=\"twitter:description\" content=\"To make language detection more robust and maintainable in the long run, we developed a machine learning classifier named OctoLingua.\" \/>\n<meta name=\"twitter:image\" content=\"https:\/\/github.blog\/wp-content\/uploads\/2019\/07\/BlogHeaders_final_INSIGHTS_1200x630.png?fit=1201%2C630\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"Kavita Ganesan\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"8 minutes\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\\\/\\\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\\\/\\\/github.blog\\\/ai-and-ml\\\/machine-learning\\\/c-or-java-typescript-or-javascript-machine-learning-based-classification-of-programming-languages\\\/#article\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/github.blog\\\/ai-and-ml\\\/machine-learning\\\/c-or-java-typescript-or-javascript-machine-learning-based-classification-of-programming-languages\\\/\"},\"author\":{\"name\":\"Kavita Ganesan\",\"@id\":\"https:\\\/\\\/github.blog\\\/#\\\/schema\\\/person\\\/1d0a884cd22c0e9dd4f7934d91f81589\"},\"headline\":\"C# or Java? TypeScript or JavaScript? 
Machine learning based classification of programming languages\",\"datePublished\":\"2019-07-02T16:51:00+00:00\",\"dateModified\":\"2022-02-17T01:07:00+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\\\/\\\/github.blog\\\/ai-and-ml\\\/machine-learning\\\/c-or-java-typescript-or-javascript-machine-learning-based-classification-of-programming-languages\\\/\"},\"wordCount\":1660,\"image\":{\"@id\":\"https:\\\/\\\/github.blog\\\/ai-and-ml\\\/machine-learning\\\/c-or-java-typescript-or-javascript-machine-learning-based-classification-of-programming-languages\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/github.blog\\\/wp-content\\\/uploads\\\/2019\\\/07\\\/BlogHeaders_final_INSIGHTS_1200x630.png?fit=1201%2C630\",\"keywords\":[\"machine learning\"],\"articleSection\":[\"AI &amp; ML\",\"Machine learning\"],\"inLanguage\":\"en-US\"},{\"@type\":\"WebPage\",\"@id\":\"https:\\\/\\\/github.blog\\\/ai-and-ml\\\/machine-learning\\\/c-or-java-typescript-or-javascript-machine-learning-based-classification-of-programming-languages\\\/\",\"url\":\"https:\\\/\\\/github.blog\\\/ai-and-ml\\\/machine-learning\\\/c-or-java-typescript-or-javascript-machine-learning-based-classification-of-programming-languages\\\/\",\"name\":\"C# or Java? TypeScript or JavaScript? 
Machine learning based classification of programming languages - The GitHub Blog\",\"isPartOf\":{\"@id\":\"https:\\\/\\\/github.blog\\\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\\\/\\\/github.blog\\\/ai-and-ml\\\/machine-learning\\\/c-or-java-typescript-or-javascript-machine-learning-based-classification-of-programming-languages\\\/#primaryimage\"},\"image\":{\"@id\":\"https:\\\/\\\/github.blog\\\/ai-and-ml\\\/machine-learning\\\/c-or-java-typescript-or-javascript-machine-learning-based-classification-of-programming-languages\\\/#primaryimage\"},\"thumbnailUrl\":\"https:\\\/\\\/github.blog\\\/wp-content\\\/uploads\\\/2019\\\/07\\\/BlogHeaders_final_INSIGHTS_1200x630.png?fit=1201%2C630\",\"datePublished\":\"2019-07-02T16:51:00+00:00\",\"dateModified\":\"2022-02-17T01:07:00+00:00\",\"author\":{\"@id\":\"https:\\\/\\\/github.blog\\\/#\\\/schema\\\/person\\\/1d0a884cd22c0e9dd4f7934d91f81589\"},\"description\":\"To make language detection more robust and maintainable in the long run, we developed a machine learning classifier named 
OctoLingua.\",\"breadcrumb\":{\"@id\":\"https:\\\/\\\/github.blog\\\/ai-and-ml\\\/machine-learning\\\/c-or-java-typescript-or-javascript-machine-learning-based-classification-of-programming-languages\\\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\\\/\\\/github.blog\\\/ai-and-ml\\\/machine-learning\\\/c-or-java-typescript-or-javascript-machine-learning-based-classification-of-programming-languages\\\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/github.blog\\\/ai-and-ml\\\/machine-learning\\\/c-or-java-typescript-or-javascript-machine-learning-based-classification-of-programming-languages\\\/#primaryimage\",\"url\":\"https:\\\/\\\/github.blog\\\/wp-content\\\/uploads\\\/2019\\\/07\\\/BlogHeaders_final_INSIGHTS_1200x630.png?fit=1201%2C630\",\"contentUrl\":\"https:\\\/\\\/github.blog\\\/wp-content\\\/uploads\\\/2019\\\/07\\\/BlogHeaders_final_INSIGHTS_1200x630.png?fit=1201%2C630\",\"width\":1201,\"height\":630,\"caption\":\"null\"},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\\\/\\\/github.blog\\\/ai-and-ml\\\/machine-learning\\\/c-or-java-typescript-or-javascript-machine-learning-based-classification-of-programming-languages\\\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\\\/\\\/github.blog\\\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"AI &amp; ML\",\"item\":\"https:\\\/\\\/github.blog\\\/ai-and-ml\\\/\"},{\"@type\":\"ListItem\",\"position\":3,\"name\":\"Machine learning\",\"item\":\"https:\\\/\\\/github.blog\\\/ai-and-ml\\\/machine-learning\\\/\"},{\"@type\":\"ListItem\",\"position\":4,\"name\":\"C# or Java? TypeScript or JavaScript? 
Machine learning based classification of programming languages\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\\\/\\\/github.blog\\\/#website\",\"url\":\"https:\\\/\\\/github.blog\\\/\",\"name\":\"The GitHub Blog\",\"description\":\"Updates, ideas, and inspiration from GitHub to help developers build and design software.\",\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\\\/\\\/github.blog\\\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":\"Person\",\"@id\":\"https:\\\/\\\/github.blog\\\/#\\\/schema\\\/person\\\/1d0a884cd22c0e9dd4f7934d91f81589\",\"name\":\"Kavita Ganesan\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/37c722061c007bf95bfe953c3e6403140fb00c677382a998d6973c85be76a993?s=96&d=mm&r=g2b0a9702a27ba660cc8c944901d08c22\",\"url\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/37c722061c007bf95bfe953c3e6403140fb00c677382a998d6973c85be76a993?s=96&d=mm&r=g\",\"contentUrl\":\"https:\\\/\\\/secure.gravatar.com\\\/avatar\\\/37c722061c007bf95bfe953c3e6403140fb00c677382a998d6973c85be76a993?s=96&d=mm&r=g\",\"caption\":\"Kavita Ganesan\"},\"sameAs\":[\"http:\\\/\\\/kavita-ganesan.com\"],\"url\":\"https:\\\/\\\/github.blog\\\/author\\\/kavgan\\\/\"}]}<\/script>\n<!-- \/ Yoast SEO Premium plugin. -->","yoast_head_json":{"title":"C# or Java? TypeScript or JavaScript? 
Machine learning based classification of programming languages - The GitHub Blog","description":"To make language detection more robust and maintainable in the long run, we developed a machine learning classifier named OctoLingua.","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/github.blog\/ai-and-ml\/machine-learning\/c-or-java-typescript-or-javascript-machine-learning-based-classification-of-programming-languages\/","og_locale":"en_US","og_type":"article","og_title":"C# or Java? TypeScript or JavaScript? Machine learning based classification of programming languages","og_description":"To make language detection more robust and maintainable in the long run, we developed a machine learning classifier named OctoLingua.","og_url":"https:\/\/github.blog\/ai-and-ml\/machine-learning\/c-or-java-typescript-or-javascript-machine-learning-based-classification-of-programming-languages\/","og_site_name":"The GitHub Blog","article_published_time":"2019-07-02T16:51:00+00:00","article_modified_time":"2022-02-17T01:07:00+00:00","og_image":[{"width":1201,"height":630,"url":"https:\/\/github.blog\/wp-content\/uploads\/2019\/07\/BlogHeaders_final_INSIGHTS_1200x630.png?fit=1201%2C630","type":"image\/png"}],"author":"Kavita Ganesan","twitter_card":"summary_large_image","twitter_title":"C# or Java? TypeScript or JavaScript? Machine learning based classification of programming languages","twitter_description":"To make language detection more robust and maintainable in the long run, we developed a machine learning classifier named OctoLingua.","twitter_image":"https:\/\/github.blog\/wp-content\/uploads\/2019\/07\/BlogHeaders_final_INSIGHTS_1200x630.png?fit=1201%2C630","twitter_misc":{"Written by":"Kavita Ganesan","Est. 
reading time":"8 minutes"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/github.blog\/ai-and-ml\/machine-learning\/c-or-java-typescript-or-javascript-machine-learning-based-classification-of-programming-languages\/#article","isPartOf":{"@id":"https:\/\/github.blog\/ai-and-ml\/machine-learning\/c-or-java-typescript-or-javascript-machine-learning-based-classification-of-programming-languages\/"},"author":{"name":"Kavita Ganesan","@id":"https:\/\/github.blog\/#\/schema\/person\/1d0a884cd22c0e9dd4f7934d91f81589"},"headline":"C# or Java? TypeScript or JavaScript? Machine learning based classification of programming languages","datePublished":"2019-07-02T16:51:00+00:00","dateModified":"2022-02-17T01:07:00+00:00","mainEntityOfPage":{"@id":"https:\/\/github.blog\/ai-and-ml\/machine-learning\/c-or-java-typescript-or-javascript-machine-learning-based-classification-of-programming-languages\/"},"wordCount":1660,"image":{"@id":"https:\/\/github.blog\/ai-and-ml\/machine-learning\/c-or-java-typescript-or-javascript-machine-learning-based-classification-of-programming-languages\/#primaryimage"},"thumbnailUrl":"https:\/\/github.blog\/wp-content\/uploads\/2019\/07\/BlogHeaders_final_INSIGHTS_1200x630.png?fit=1201%2C630","keywords":["machine learning"],"articleSection":["AI &amp; ML","Machine learning"],"inLanguage":"en-US"},{"@type":"WebPage","@id":"https:\/\/github.blog\/ai-and-ml\/machine-learning\/c-or-java-typescript-or-javascript-machine-learning-based-classification-of-programming-languages\/","url":"https:\/\/github.blog\/ai-and-ml\/machine-learning\/c-or-java-typescript-or-javascript-machine-learning-based-classification-of-programming-languages\/","name":"C# or Java? TypeScript or JavaScript? 
Machine learning based classification of programming languages - The GitHub Blog","isPartOf":{"@id":"https:\/\/github.blog\/#website"},"primaryImageOfPage":{"@id":"https:\/\/github.blog\/ai-and-ml\/machine-learning\/c-or-java-typescript-or-javascript-machine-learning-based-classification-of-programming-languages\/#primaryimage"},"image":{"@id":"https:\/\/github.blog\/ai-and-ml\/machine-learning\/c-or-java-typescript-or-javascript-machine-learning-based-classification-of-programming-languages\/#primaryimage"},"thumbnailUrl":"https:\/\/github.blog\/wp-content\/uploads\/2019\/07\/BlogHeaders_final_INSIGHTS_1200x630.png?fit=1201%2C630","datePublished":"2019-07-02T16:51:00+00:00","dateModified":"2022-02-17T01:07:00+00:00","author":{"@id":"https:\/\/github.blog\/#\/schema\/person\/1d0a884cd22c0e9dd4f7934d91f81589"},"description":"To make language detection more robust and maintainable in the long run, we developed a machine learning classifier named OctoLingua.","breadcrumb":{"@id":"https:\/\/github.blog\/ai-and-ml\/machine-learning\/c-or-java-typescript-or-javascript-machine-learning-based-classification-of-programming-languages\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/github.blog\/ai-and-ml\/machine-learning\/c-or-java-typescript-or-javascript-machine-learning-based-classification-of-programming-languages\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/github.blog\/ai-and-ml\/machine-learning\/c-or-java-typescript-or-javascript-machine-learning-based-classification-of-programming-languages\/#primaryimage","url":"https:\/\/github.blog\/wp-content\/uploads\/2019\/07\/BlogHeaders_final_INSIGHTS_1200x630.png?fit=1201%2C630","contentUrl":"https:\/\/github.blog\/wp-content\/uploads\/2019\/07\/BlogHeaders_final_INSIGHTS_1200x630.png?fit=1201%2C630","width":1201,"height":630,"caption":"null"},{"@type":"BreadcrumbList","@id":"https:\/\/github.blog\/ai-and-ml\/machine-learning\/c-or-java-typescript-or-ja
vascript-machine-learning-based-classification-of-programming-languages\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/github.blog\/"},{"@type":"ListItem","position":2,"name":"AI &amp; ML","item":"https:\/\/github.blog\/ai-and-ml\/"},{"@type":"ListItem","position":3,"name":"Machine learning","item":"https:\/\/github.blog\/ai-and-ml\/machine-learning\/"},{"@type":"ListItem","position":4,"name":"C# or Java? TypeScript or JavaScript? Machine learning based classification of programming languages"}]},{"@type":"WebSite","@id":"https:\/\/github.blog\/#website","url":"https:\/\/github.blog\/","name":"The GitHub Blog","description":"Updates, ideas, and inspiration from GitHub to help developers build and design software.","potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/github.blog\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":"Person","@id":"https:\/\/github.blog\/#\/schema\/person\/1d0a884cd22c0e9dd4f7934d91f81589","name":"Kavita Ganesan","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/secure.gravatar.com\/avatar\/37c722061c007bf95bfe953c3e6403140fb00c677382a998d6973c85be76a993?s=96&d=mm&r=g2b0a9702a27ba660cc8c944901d08c22","url":"https:\/\/secure.gravatar.com\/avatar\/37c722061c007bf95bfe953c3e6403140fb00c677382a998d6973c85be76a993?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/37c722061c007bf95bfe953c3e6403140fb00c677382a998d6973c85be76a993?s=96&d=mm&r=g","caption":"Kavita 
Ganesan"},"sameAs":["http:\/\/kavita-ganesan.com"],"url":"https:\/\/github.blog\/author\/kavgan\/"}]}},"jetpack_publicize_connections":[],"jetpack_featured_media_url":"https:\/\/github.blog\/wp-content\/uploads\/2019\/07\/BlogHeaders_final_INSIGHTS_1200x630.png?fit=1201%2C630","jetpack_shortlink":"https:\/\/wp.me\/pamS32-cWI","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/github.blog\/wp-json\/wp\/v2\/posts\/49768","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/github.blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/github.blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/github.blog\/wp-json\/wp\/v2\/users\/1331"}],"replies":[{"embeddable":true,"href":"https:\/\/github.blog\/wp-json\/wp\/v2\/comments?post=49768"}],"version-history":[{"count":17,"href":"https:\/\/github.blog\/wp-json\/wp\/v2\/posts\/49768\/revisions"}],"predecessor-version":[{"id":49789,"href":"https:\/\/github.blog\/wp-json\/wp\/v2\/posts\/49768\/revisions\/49789"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/github.blog\/wp-json\/wp\/v2\/media\/49776"}],"wp:attachment":[{"href":"https:\/\/github.blog\/wp-json\/wp\/v2\/media?parent=49768"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/github.blog\/wp-json\/wp\/v2\/categories?post=49768"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/github.blog\/wp-json\/wp\/v2\/tags?post=49768"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/github.blog\/wp-json\/wp\/v2\/coauthors?post=49768"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}