For this, we will use the following ingredients:
- Data: List of H2020 call topics with codeID, title and objectives. I have chosen the topics related to “connected car”, “autonomous vehicle”, “transportation”, … to see how they relate to each other. Here is the link to the CSV file with the data on the EU Open Data Portal website.
- Tools: Orange3 with the Text Mining add-on. Here is a list of the different widgets available. It is an open source tool from the University of Ljubljana with plenty of documentation (blog, tutorials, …). It is based on Python and builds on common libraries such as gensim. As a visual, no-code tool, it lets you focus on the problem and on the components available to analyze it without the “inconvenience” of programming. That doesn’t mean you shouldn’t learn some Data Science and Machine Learning (check out this book from IBM).
- Recipe: This is my first test with Orange3, so comments are welcome. Here is a link to the workflow so you don’t have to build it from scratch.
As I said, I used the English-language transportation topics with the *ART* and *GV* codes. In total there are 43 topics, each with its codeID, title and objectives. For example:
| codeID | title |
|---|---|
| GV-2-2014 | Optimised and systematic energy management in electric vehicles |
Specific challenge: Range limitation, due to the limited storage capacity of electric batteries, is one of the major drawbacks of electric vehicles. The main challenge will be to achieve a systematic energy management of the vehicle based on the integration of components and sub-systems. The problem is worsened by the need to use part of the storage capacity in order to feed auxiliary equipment such as climate control. In extreme conditions up to 50% of the batteries’ capacity is absorbed by these systems. The systematic management of energy in electric vehicles is a means to gain extended range without sacrificing comfort. The challenge is therefore to extend the range of electric vehicles in all weather conditions […]
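If you would rather peek at the data outside Orange3, here is a minimal sketch of how the topics could be filtered by code prefix. The column names (`codeID`, `title`, `objectives`) and the in-memory sample are my assumptions for illustration; the real CSV header may differ, and the `XX-1-2014` row is a made-up placeholder.

```python
import csv
import io

# Hypothetical column names; check them against the real CSV header.
CODE_COL, TITLE_COL, OBJECTIVES_COL = "codeID", "title", "objectives"


def load_topics(csv_file, prefixes=("ART", "GV")):
    """Return the rows whose code starts with one of the given prefixes."""
    reader = csv.DictReader(csv_file)
    return [row for row in reader if row[CODE_COL].split("-")[0] in prefixes]


# Tiny in-memory stand-in for the downloaded file (second row is a placeholder).
sample = io.StringIO(
    "codeID,title,objectives\n"
    "GV-2-2014,Optimised and systematic energy management in electric vehicles,"
    "Specific challenge: Range limitation [...]\n"
    "XX-1-2014,Placeholder title,Placeholder objectives\n"
)
topics = load_topics(sample)
print([t[CODE_COL] for t in topics])
```

With the real file you would pass `open(path, newline="", encoding="utf-8")` instead of the `StringIO` object.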
As you can see, you have pre-processing widgets with different functions (Transformation, Tokenization, Normalization, Filtering, N-grams and POS tagger), word clouds, corpus viewers and almost everything you need to set up your workflow.
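To give an idea of what those pre-processing steps do, here is a rough plain-Python sketch of Transformation, Tokenization, Filtering and N-grams. It is not how Orange3 implements them; the stopword list is a tiny illustrative one, and Normalization (stemming or lemmatization) and POS tagging are left out.

```python
import re

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "of", "to", "is", "in", "a", "an", "and", "due", "one"}


def preprocess(text, ngram_range=(1, 2)):
    """Lowercase, tokenize, drop stopwords, then build word n-grams."""
    text = text.lower()                                   # Transformation
    tokens = re.findall(r"[a-z]+", text)                  # Tokenization
    tokens = [t for t in tokens
              if t not in STOPWORDS and len(t) > 2]       # Filtering
    lo, hi = ngram_range
    grams = []
    for n in range(lo, hi + 1):                           # N-grams
        grams += [" ".join(tokens[i:i + n])
                  for i in range(len(tokens) - n + 1)]
    return grams


print(preprocess("Range limitation is one of the major drawbacks of electric vehicles."))
```

Bigrams such as "electric vehicles" are what make the later word clouds and topics easier to read than single words alone.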
As a first result, after computing the cosine distance between the documents, these are the 10 clusters obtained from Hierarchical Clustering.
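For the curious, the idea behind this step can be sketched in a few lines of plain Python: turn each document into a bag of words, measure cosine distance between them, and repeatedly merge the two closest clusters. Average linkage is my assumption here (the linkage used in the workflow may be different), and the four toy documents are invented for illustration.

```python
import math
from collections import Counter


def cosine_distance(a, b):
    """1 - cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return 1.0 - dot / norm if norm else 1.0


def hierarchical_clusters(docs, n_clusters):
    """Average-linkage agglomerative clustering; returns lists of doc indices."""
    bows = [Counter(d.lower().split()) for d in docs]
    clusters = [[i] for i in range(len(docs))]

    def linkage(c1, c2):
        total = sum(cosine_distance(bows[i], bows[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge the second into the first.
        _, i, j = min(
            (linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        merged = clusters.pop(j)
        clusters[i].extend(merged)
    return clusters


# Toy documents, invented for illustration only.
docs = [
    "electric vehicle battery range",
    "battery range electric vehicle energy",
    "automated driving road safety",
    "road safety automated vehicles",
]
print(hierarchical_clusters(docs, n_clusters=2))
```

With the 43 real topics you would pass the pre-processed objectives and ask for 10 clusters.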
Results: Topic Modeling
Regarding Topic Modeling, I used the LDA algorithm with 10 topics, and this is where the magic happens. Orange3’s visualization capabilities let us select one of the detected topics and show the words associated with it. If we then select any of those words, we can see the topics associated with it and check how close they are to each other.
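If you want a feeling for what LDA does under the hood, below is a toy sketch using collapsed Gibbs sampling in plain Python. To be clear, this is a swapped-in teaching version, not what Orange3 runs internally; the hyperparameters `alpha` and `beta`, the iteration count and the toy documents are all my own assumptions.

```python
import random


def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized documents.

    Returns (doc_topic, topic_word): per-document topic proportions
    and per-topic word counts.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    n_vocab = len(vocab)
    ndk = [[0] * n_topics for _ in docs]                  # doc-topic counts
    nkw = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]  # topic-word counts
    nk = [0] * n_topics                                   # tokens per topic
    z = []                                                # topic of each token

    # Random initial assignment of every token to a topic.
    for d, doc in enumerate(docs):
        z.append([])
        for w in doc:
            k = rng.randrange(n_topics)
            z[d].append(k)
            ndk[d][k] += 1
            nkw[k][w] += 1
            nk[k] += 1

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token, then resample its topic given all others.
                k = z[d][i]
                ndk[d][k] -= 1
                nkw[k][w] -= 1
                nk[k] -= 1
                weights = [
                    (ndk[d][t] + alpha) * (nkw[t][w] + beta) / (nk[t] + n_vocab * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1
                nkw[k][w] += 1
                nk[k] += 1

    doc_topic = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for doc, counts in zip(docs, ndk)
    ]
    return doc_topic, nkw


# Toy tokenized documents, invented for illustration only.
toy_docs = [["battery", "range", "battery"], ["road", "safety", "road"], ["battery", "road"]]
doc_topic, topic_word = lda_gibbs(toy_docs, n_topics=2)
for k, counts in enumerate(topic_word):
    print(k, sorted(counts, key=counts.get, reverse=True)[:3])
```

The per-topic word lists are what the word view shows for a selected topic, and the rows of `doc_topic` tell you how strongly each document belongs to each topic.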
As an exploratory tool, I have found Orange3 very convenient, as you only have to concentrate on the data and its analysis. And you can always write your own widget in Python.
Obviously, you can’t use it to deploy your models to an infrastructure, but it does the job. A more advanced tool in that sense would be KNIME, also open source but with premium plans for professional environments.
Any suggestions? You can leave them in the comments. Here are some more workflows directly from the Orange3 website.