"Conventional surveillance systems for monitoring infectious diseases, such as influenza, face challenges due to shortage of skilled healthcare professionals, remoteness of communities and absence of communication infrastructures. Internet-based approaches for surveillance are appealing logistically as well as economically. Influenza epidemics possess certain easily identifiable characteristics, which have allowed their identification throughout history. These characteristics include immense attack rates and explosive spread of the disease."
The researchers' goal was "to assess the predictive power of an alternative data source, Instagram. By using 317 weeks of publicly available data from Instagram, we trained several machine learning algorithms to both nowcast and forecast the number of official influenza-like illness incidents in Finland where population-wide official statistics about the weekly incidents are available. In addition to date and hashtag count features of online posts, we were able to utilize also the visual content of the posted images with the help of deep convolutional neural networks."
"We selected 4 reference images that contributed to the definition of 4 image features. The images were collected using Google Images search engine and all of them are released under public domain enabling full permission for usage as is. The images were searched using terms ”boxes of drugs/medicine”, ”boxes of drugs/medicine and pills”, ”mint, ginger and lemons” and ”ginger and lemon tea”, respectively... In order to count the weekly number of images similar to reference images on Instagram, we employed a pretrained deep convolutional neural network (CNN) model, i.e., Inception-ResNet-v2 [59]. The model is 164 layers deep and has been pretrained on the well known ImageNet dataset. We trained 9 different machine learning algorithms for nowcasting the official weekly ILI counts in Finland. These algorithms include linear regression (also known as ordinary least squares), ridge regression, elastic net, LASSO, k-nearest neighbor regression, support vector machine, random forest, AdaBoost and XGBoost. For each algorithm, we used date and count features for modeling."
"317 weeks of publicly available data from Instagram", including images, hashtags, and dates. "For this study, weekly ILI incidents reported by public primary healthcare register in Finland between the dates 30 April 2012 and 27 May 2018 (in total of 317 weeks) were used. The data is publicly available and accessible [57]. 2) Instagram data: We identified 7 keywords in Finnish language to be searched from the hashtags of the Instagram posts, namely cough, fever, flu, influenza, muscle ache, sick, throat ache. These keywords correspond to the most common symptoms of ILI and we hypothesized that they would be often used in social media posts associated with ILI. We collected publicly available Instagram posts containing at least one of these hashtags between the dates 30 April 2012 and 27 May 2018 (in total of 317 weeks). We used weekly data from 30 April 2012 to 22 May 2017 (265 weeks) as the training data, i.e., hyper-parameter optimization and model comparison. In order to report the performance of the trained models, data from one year was used as the test (hold-out) data, i.e., weekly data from 29 May 2017 to 27 May 2018 (52 weeks)".
A mean absolute error of 11.33 incidents per week and a Pearson’s correlation of 0.963 were achieved with the XGBoost algorithm when several modalities of Instagram posts (date, count, image) were used as input for nowcasting the official influenza-like illness counts in Finland. Forecasting models for predicting 1 week and 2 weeks ahead were also statistically significant, reaching correlation coefficients of 0.903 and 0.862, respectively. Overall, we show that Instagram can be considered a significant source of information for Internet-based monitoring and forecasting of influenza epidemics. Furthermore, we show that the visual content of the posted images can be utilized as input features with the help of a deep convolutional neural network, increasing the prediction performance. This study demonstrates how social media, and in particular the digital photographs shared on them, can be a valuable source of information for the field of infodemiology.
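The two reported metrics, mean absolute error and Pearson's correlation, can be computed as in the minimal sketch below. The `shift_for_horizon` helper is a hypothetical illustration of one common way to turn a nowcasting setup into a 1-week or 2-week forecast (pairing the features of week t with the official count of week t + h); the text does not say how the authors constructed their forecasting targets.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average absolute difference between official and predicted counts."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def pearson_correlation(x, y):
    """Pearson's correlation coefficient between two equal-length series."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (spread_x * spread_y)


def shift_for_horizon(features, targets, horizon):
    """Pair features of week t with the target of week t + horizon.

    horizon=0 is nowcasting; horizon=1 or 2 gives the forecasting setups
    (an assumed construction, labeled as such above).
    """
    if horizon == 0:
        return list(features), list(targets)
    return list(features[:-horizon]), list(targets[horizon:])
```

Both metrics would be evaluated on the hold-out weeks only, comparing the model's predictions against the official weekly ILI counts.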