An Efficient Sampling Scheme for Approximate Processing of Decision Support Queries

Amit Rudra, Raj Gopalan, Narasimaha Achuthan

Abstract

Decision support queries usually access enormous amounts of data and therefore require significant retrieval time. Faster retrieval of query results can often save precious time for the decision maker. Pre-computation of materialised views and sampling are two ways of achieving a significant speed-up. However, drawing random samples for queries on range-restricted attributes has two problems: small random samples may miss relevant records, and drawing larger samples from disk can be inefficient because of the large number of disk accesses required. In this paper, we propose an efficient indexing scheme for quickly drawing relevant samples for data warehouse queries and introduce the concepts of database and sample relevancy ratios. We describe a method for estimating query results for range-restricted queries using this index and experimentally evaluate the scheme using a relatively large real dataset. Further, we compute confidence intervals for the estimates to investigate whether the results can be guaranteed to be within the desired level of confidence. Our experiments on data from a retail data warehouse show promising results. We also report the levels of accuracy achieved for various types of aggregate queries and relate them to the database relevancy ratios of the queries.
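To make the sampling problem concrete, the following is a minimal sketch of a textbook domain estimator for a range-restricted SUM computed from a simple random sample, with a normal-approximation confidence interval. It is not the indexing scheme proposed in the paper. The function name `estimate_range_sum`, the `(key, value)` row layout, and the returned in-range fraction are assumptions made for illustration, and the in-range fraction is not necessarily the paper's sample relevancy ratio.

```python
import math


def estimate_range_sum(sample, population_size, low, high, z=1.96):
    """Estimate SUM(value) over rows with low <= key <= high.

    `sample` is a list of (key, value) pairs assumed to be drawn uniformly
    at random, without replacement, from a table of `population_size` rows.
    Returns (estimate, half_width, in_range_fraction).
    """
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two sampled rows")

    # Rows outside the range contribute zero to the total.
    y = [value if low <= key <= high else 0.0 for key, value in sample]

    mean = sum(y) / n
    variance = sum((v - mean) ** 2 for v in y) / (n - 1)

    # Finite population correction for sampling without replacement.
    fpc = 1.0 - n / population_size
    std_err = population_size * math.sqrt(fpc * variance / n)

    in_range = sum(1 for key, _ in sample if low <= key <= high) / n
    return population_size * mean, z * std_err, in_range
```

If none of the sampled rows falls inside the queried range, the function returns an estimate of zero with a zero-width interval. This illustrates the first problem named in the abstract: a small sample that misses the relevant records gives a misleadingly confident answer.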


Paper Citation


in Harvard Style

Rudra A., Gopalan R. and Achuthan N. (2012). An Efficient Sampling Scheme for Approximate Processing of Decision Support Queries. In Proceedings of the 14th International Conference on Enterprise Information Systems - Volume 1: ICEIS, ISBN 978-989-8565-10-5, pages 16-26. DOI: 10.5220/0003995100160026


in Bibtex Style

@conference{iceis12,
author={Amit Rudra and Raj Gopalan and Narasimaha Achuthan},
title={An Efficient Sampling Scheme for Approximate Processing of Decision Support Queries},
booktitle={Proceedings of the 14th International Conference on Enterprise Information Systems - Volume 1: ICEIS},
year={2012},
pages={16-26},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0003995100160026},
isbn={978-989-8565-10-5},
}


in EndNote Style

TY - CONF
JO - Proceedings of the 14th International Conference on Enterprise Information Systems - Volume 1: ICEIS
TI - An Efficient Sampling Scheme for Approximate Processing of Decision Support Queries
SN - 978-989-8565-10-5
AU - Rudra A.
AU - Gopalan R.
AU - Achuthan N.
PY - 2012
SP - 16
EP - 26
DO - 10.5220/0003995100160026