Authors:
            Luiz Olmes Carvalho; Lucio F. D. Santos; Willian D. Oliveira; Agma J. M. Traina and Caetano Traina Jr.
        
        
            Affiliation:
            University of São Paulo, Brazil

             Keyword(s):
            Similarity Search, Similarity Join, Query Operators, Wide-join, Near-duplicate Detection.
        
        
            
            Related Ontology Subjects/Areas/Topics:
            Databases and Information Systems Integration; Enterprise Information Systems; Query Languages and Query Processing
            
        
        
            
                Abstract: 
                Crowdsourced information is increasingly employed to improve and support decision making in emergency situations. However, the gathered records quickly become too similar to one another, and handling many similar reports adds no valuable knowledge for the support personnel at the control center in their decision-making tasks. The usual approaches to detecting and handling such near-duplicate data rely on costly twofold processing. To reduce this cost and improve duplicate detection, we developed a framework based on the similarity wide-join database operator. We extended the wide-join definition, lifting its restrictions so that it can also perform near-duplicate detection. In this paper, we also provide an efficient pivot-based algorithm that speeds up the entire process and retrieves the most similar elements in a single pass. Experiments using real datasets show that our framework is up to three orders of magnitude faster than competing techniques in the literature, while also improving the quality of the result by about 35 percent.