Selecting Adequate Samples for Approximate Decision Support Queries

Amit Rudra, Raj P. Gopalan, N. R. Achuthan

Abstract

For highly selective queries, a simple random sample of records drawn from a large data warehouse may not contain a sufficient number of records that satisfy the query conditions. Efficient sampling schemes for such queries require techniques that can access the records relevant to each specific query. When drawing the sample, it is advantageous to know what sample size would be adequate for a given query. This paper proposes methods for picking adequate samples that ensure approximate query results with a desired level of accuracy. A special index based on a structure known as the k-MDI Tree is used to draw samples. An unbiased estimator based on inverse simple random sampling without replacement is adapted to estimate adequate sample sizes for queries. The methods are evaluated experimentally on a large real-life data set. The results of the evaluation show that adequate sample sizes can be determined such that errors in the outputs of most queries are within the acceptable limit of 5%.
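To make the idea of inverse simple random sampling without replacement concrete, the following is a minimal sketch of the textbook scheme, not the adapted method or the k-MDI Tree index proposed in the paper. Records are drawn without replacement until a preset number m of them satisfy the query condition; if n records were drawn, (m - 1) / (n - 1) is an unbiased estimate of the fraction of qualifying records. The function name `inverse_srswor`, the record layout, and the predicate are assumptions made purely for illustration.

```python
import random
from typing import Callable, Sequence, Tuple


def inverse_srswor(
    population: Sequence[dict],
    predicate: Callable[[dict], bool],
    m: int,
    rng: random.Random,
) -> Tuple[int, float]:
    """Draw records without replacement until m of them satisfy predicate.

    Returns (n, p_hat), where n is the number of records drawn and
    p_hat = (m - 1) / (n - 1) estimates the fraction of qualifying records.
    Hypothetical helper for illustration; not taken from the paper.
    """
    if m < 2:
        raise ValueError("m must be at least 2 for the estimator to be defined")
    order = list(range(len(population)))
    # Reading a random permutation in order is equivalent to
    # drawing records one at a time without replacement.
    rng.shuffle(order)
    hits = 0
    for n, idx in enumerate(order, start=1):
        if predicate(population[idx]):
            hits += 1
            if hits == m:
                return n, (m - 1) / (n - 1)
    raise ValueError("population has fewer than m qualifying records")


# Usage on synthetic data (assumed): a selective condition matching 2% of records.
records = [{"amount": i} for i in range(10_000)]
n, p_hat = inverse_srswor(records, lambda r: r["amount"] < 200, m=30, rng=random.Random(0))
print(f"drew {n} records, estimated selectivity {p_hat:.4f}")
```

The number of records drawn, n, adapts to the selectivity of the query: a highly selective condition forces more draws before m qualifying records are seen, which is the property that makes the scheme a natural basis for judging sample adequacy.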

Paper Citation


in Harvard Style

Rudra, A., Gopalan, R. P. and Achuthan, N. R. (2013). Selecting Adequate Samples for Approximate Decision Support Queries. In Proceedings of the 15th International Conference on Enterprise Information Systems - Volume 1: ICEIS, ISBN 978-989-8565-59-4, pages 46-55. DOI: 10.5220/0004444200460055


in Bibtex Style

@conference{iceis13,
author={Amit Rudra and Raj P. Gopalan and N. R. Achuthan},
title={Selecting Adequate Samples for Approximate Decision Support Queries},
booktitle={Proceedings of the 15th International Conference on Enterprise Information Systems - Volume 1: ICEIS},
year={2013},
pages={46-55},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0004444200460055},
isbn={978-989-8565-59-4},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 15th International Conference on Enterprise Information Systems - Volume 1: ICEIS
TI - Selecting Adequate Samples for Approximate Decision Support Queries
SN - 978-989-8565-59-4
AU - Rudra, A.
AU - Gopalan, R. P.
AU - Achuthan, N. R.
PY - 2013
SP - 46
EP - 55
DO - 10.5220/0004444200460055