Machine learning for text classification in building management systems
Abstract
In building management systems (BMS), a medium-sized building may have between 200 and 1000 sensor points. Their labels need to be translated into a naming standard so that the BMS platform can recognise them automatically. In current industrial practice, these points are often translated into labels manually (a step known as the tagging process), which takes around 8 hours for every 100 points. We introduce an AI-based multi-stage text classification approach that translates BMS points into formatted BMS labels. After comparing five text classification techniques (logistic regression, random forests, XGBoost, multinomial Naive Bayes and linear support vector classification), we demonstrate that XGBoost is the top performer with 90.29% of true positives, and we use the prediction confidence to filter out false positives. This approach can be applied to sensor networks in various applications where manual pre-processing of free-text data remains cumbersome.
Keywords: free-text classification, building management systems, Haystack data standard, sensor tagging
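The idea of using prediction confidence to filter out false positives can be illustrated with a minimal sketch. It is not the pipeline evaluated in this work: a small hand-written multinomial Naive Bayes over character trigrams stands in for the classifier (XGBoost was the top performer in the comparison), and the point names, the Haystack-style tags, the 0.8 threshold and all function and class names are hypothetical choices made for illustration only.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Split a raw point name into overlapping lowercase character n-grams."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class TinyMultinomialNB:
    """Illustrative multinomial Naive Bayes with Laplace smoothing."""

    def fit(self, texts, labels, alpha=1.0):
        self.alpha = alpha
        self.class_counts = Counter(labels)          # documents per class
        self.feature_counts = defaultdict(Counter)   # n-gram counts per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            grams = char_ngrams(text)
            self.feature_counts[label].update(grams)
            self.vocab.update(grams)
        return self

    def predict_proba(self, text):
        """Return a dict mapping each class to its posterior probability."""
        total_docs = sum(self.class_counts.values())
        log_scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_docs / total_docs)
            for gram in char_ngrams(text):
                if gram in self.vocab:  # n-grams unseen in training are ignored
                    score += math.log((counts[gram] + self.alpha) / denom)
            log_scores[label] = score
        # Normalise in a numerically stable way (subtract the maximum first).
        peak = max(log_scores.values())
        exp = {k: math.exp(v - peak) for k, v in log_scores.items()}
        z = sum(exp.values())
        return {k: v / z for k, v in exp.items()}


def tag_with_confidence(model, point_names, threshold=0.8):
    """Accept predictions at or above the threshold; send the rest to manual review."""
    accepted, needs_review = [], []
    for name in point_names:
        proba = model.predict_proba(name)
        label, confidence = max(proba.items(), key=lambda kv: kv[1])
        bucket = accepted if confidence >= threshold else needs_review
        bucket.append((name, label, confidence))
    return accepted, needs_review


# Hypothetical training data: raw BMS point names and Haystack-style tags.
TRAIN_NAMES = ["AHU1_SupplyAirTemp", "AHU2_SAT", "AHU1_ReturnAirTemp",
               "AHU2_RAT", "AHU1_SupplyFanCmd", "AHU2_SF_Cmd"]
TRAIN_TAGS = ["discharge air temp sensor", "discharge air temp sensor",
              "return air temp sensor", "return air temp sensor",
              "discharge fan cmd", "discharge fan cmd"]
NEW_POINTS = ["AHU3_SupplyAirTemp", "AHU3_RAT", "Chiller1_kW"]

if __name__ == "__main__":
    model = TinyMultinomialNB().fit(TRAIN_NAMES, TRAIN_TAGS)
    accepted, needs_review = tag_with_confidence(model, NEW_POINTS)
    for name, label, conf in accepted:
        print(f"auto-tagged  {name!r} -> {label!r} ({conf:.2f})")
    for name, label, conf in needs_review:
        print(f"needs review {name!r} (best guess {label!r}, {conf:.2f})")
```

Raising the threshold trades coverage for precision: fewer points are tagged automatically, but those that are carry a lower risk of being false positives, and the remainder fall back to manual tagging.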
This work is licensed under a Creative Commons Attribution 4.0 International License.