Hong-Jie Dai, Mira Anne C. dela Rosa, Ting-You Zhang, Chung-Lin Chen, Chen-Kai Wang
Abstract
With the emergence of new experimental techniques, there has been a remarkable increase in the amount of available biomedical data. Processing and mining large volumes of chemical data has become a challenging problem. To address this challenge, we developed SCHEMA (Spark-based CHEMicAl entity recognizer), a robust and efficient chemical entity recognition system built on top of Apache Spark. SCHEMA follows the asynchronous queue design pattern, which has been employed in service-oriented architectures to provide scalable and resilient services. SCHEMA can retrieve patents in the form of unstructured free text from different websites and recognize the chemical named entities described in them. A RESTful Web application programming interface is provided for interacting with SCHEMA programmatically. Using the custom request tests of the BeCalm (Biomedical annotation meta-server) platform, the results show that SCHEMA can process 5,000 patents within 5 minutes, an average of only 0.06 seconds per patent, including data fetch and analysis time.
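The asynchronous queue pattern mentioned in the abstract can be illustrated with a minimal sketch. This is not SCHEMA's implementation: it uses Python's standard library rather than Apache Spark, and the names `recognize`, `worker`, `annotate_all`, and the tiny hard-coded vocabulary are all hypothetical, chosen only to show how requests are enqueued by a producer and drained independently by workers.

```python
import queue
import re
import threading

# Hypothetical stand-in for a chemical entity recognizer: it matches a tiny
# hard-coded vocabulary and is NOT the model used by SCHEMA.
CHEMICAL_PATTERN = re.compile(r"\b(aspirin|ethanol|benzene)\b", re.IGNORECASE)


def recognize(text):
    """Return (start, end, mention) tuples for each vocabulary match."""
    return [(m.start(), m.end(), m.group(0))
            for m in CHEMICAL_PATTERN.finditer(text)]


def worker(requests, results):
    """Consume queued documents until a None sentinel arrives."""
    while True:
        item = requests.get()
        if item is None:
            break
        doc_id, text = item
        results[doc_id] = recognize(text)


def annotate_all(documents, num_workers=2):
    """Enqueue every document, let the workers drain the queue, return results."""
    requests = queue.Queue()
    results = {}
    threads = [threading.Thread(target=worker, args=(requests, results))
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    for doc_id, text in documents.items():
        requests.put((doc_id, text))  # the producer does not wait for processing
    for _ in threads:
        requests.put(None)  # one sentinel per worker signals shutdown
    for t in threads:
        t.join()
    return results
```

Calling `annotate_all({"p1": "A mixture of ethanol and benzene."})` returns the matched mentions for `p1` together with their character offsets. Because the producer only enqueues work, the number of workers can be changed without touching the code that submits requests, which is the property that makes the pattern attractive for scalable services.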