does not require explicitly defining legal issues and
constructing queries. Second, our labeling algorithm
provides meaningful, hierarchically structured labels for
legal issues, which we show let users identify interesting
and useful topics effectively. Third, the system is highly
scalable and flexible: it has been applied to on the order
of 100 million associations across different document types.
Based on our studies, users, especially legal researchers,
often prefer to drill down and focus on the key issues
common within a document set rather than get a high-level
overview of a document collection. The chief characteristics
of this issue-based recommendation system are its attention
to fine-grained legal issues, its robustness, its scalability,
and the resulting high-quality document clusters, which are
topically homogeneous but heterogeneous in content type.
It represents a powerful research tool for the legal community.
8 FUTURE WORK
Our future work will focus on improving the existing
topic segmentation algorithm for documents that contain
little metadata. We have been experimenting with topic
modeling algorithms, such as Latent Dirichlet Allocation
(LDA) (Blei et al., 2003) and Non-negative Matrix
Factorization (NMF) (Lee and Seung, 1999), in other related
projects, and have seen very promising results. Producing
labels of human quality remains a challenge, since until now
substantial manual review by human experts has been
required to ensure quality. We are pursuing this subject as
another future research direction.
ACKNOWLEDGEMENTS
We thank John Duprey, Helen Hsu, Debanjan Ghosh, and
Dave Seaman for their help in developing software for this
work. We are also grateful to Julie Gleason and her team of
legal experts for their detailed quality assessments and
invaluable feedback. Finally, we thank Khalid Al-Kofahi,
Bill Keenan, and Peter Jackson for their ongoing feedback
and support.
REFERENCES
Adomavicius, G. and Tuzhilin, A. (2005). Toward the
next generation of recommender systems: A survey
of the state-of-the-art and possible extensions. IEEE
Transactions on Knowledge and Data Engineering,
17(6):734–749.
Aggarwal, C. C. and Yu, P. S. (2006). A framework for
clustering massive text and categorical data streams.
In Proceedings of the Sixth SIAM International Con-
ference on Data Mining (SDM 2006).
Al-Kofahi, K. et al. (2007). A document recommendation
system blending retrieval and categorization technologies.
In Proceedings of AAAI Workshop on Recommender
Systems in e-Commerce, pages 9–16.
Al-Kofahi, K., Tyrrell, A., Vachher, A., Travers, T., and
Jackson, P. (2001). Combining multiple classifiers for
text categorization. In Proceedings of the 10th Inter-
national Conference on Information and Knowledge
Management (CIKM01), pages 97–104. ACM Press.
Beeferman, D., Berger, A. L., and Lafferty, J. D. (1997). A
model of lexical attraction and repulsion. In Proceedings
of the 35th Annual Meeting of the Association for
Computational Linguistics (ACL97), pages 373–380.
Bennett, J. and Lanning, S. (2007). The Netflix prize. In
Proceedings of KDD Cup and Workshop.
Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent
Dirichlet allocation. Journal of Machine Learning
Research, 3:993–1022.
Bun, K. K. and Ishizuka, M. (2002). Topic extraction from
news archive using TF*PDF algorithm. In Proceedings
of the Third International Conference on Web
Information Systems Engineering (WISE02), pages 73–82.
Chen, K.-Y., Luesukprasert, L., and Chou, S.-C. T.
(2007). Hot topic extraction based on timeline analysis
and multidimensional sentence modeling. IEEE
Transactions on Knowledge and Data Engineering,
19(8):1016–1025.
Choi, F. Y. (2000). Advances in domain independent lin-
ear text segmentation. In Proceedings of the Applied
Natural Language Processing Conference (ANLP00),
pages 26–33.
Choi, F. Y., Wiemer-Hastings, P. M., and Moore, J.
(2001). Latent semantic analysis for text segmen-
tation. In Proceedings of the 2001 Conference on
Empirical Methods in Natural Language Processing
(EMNLP01), pages 109–117.
Cohen, M. L. and Olson, K. C. (2007). Legal Research in a
Nutshell. Thomson West, Saint Paul, MN, 9th edition.
Fukumoto, F. and Suzuki, Y. (2011). Cluster labeling based
on concepts in a machine-readable dictionary. In Pro-
ceedings of the 5th International Joint Conference
on Natural Language Processing, pages 1371–1375.
AFNLP.
Glover, E. J., Tsioutsiouliklis, K., Lawrence, S., Pennock,
D. M., and Flake, G. W. (2002a). Using web structure for
classifying and describing web pages. In Proceedings of the
World Wide Web Conference, pages 562–569. ACM Press.
Glover, E. J., Pennock, D. M., Lawrence, S., and Krovetz,
R. (2002b). Inferring hierarchical descriptions. In
Proceedings of the 11th International Conference on
Information and Knowledge Management (CIKM02),
pages 507–514. ACM Press.
Hearst, M. (1997). TextTiling: Segmenting text into
multi-paragraph subtopic passages. Computational
Linguistics, 23:33–64.
Bringing Order to Legal Documents - An Issue-based Recommendation System Via Cluster Association