pyvideo.org: Videos of Alejandro Weinstein
http://pyvideo.org/speaker/592/alejandro-weinstein/rss
Wed, 18 Jul 2012 00:00:00 -0500

A tale of four libraries
http://pyvideo.org/video/1211/a-tale-of-four-libraries

Description

In addition to bringing efficient array computing and standard mathematical
tools to Python, the NumPy/SciPy libraries provide an ecosystem where multiple
libraries can coexist and interact. This talk describes a success story where
we integrate several libraries, developed by different groups, to solve our
research problems. A brief description of our research and how we use these
components follows.

Our research focuses on using Reinforcement Learning (RL) to gather
information in domains described by an underlying linked dataset. For
instance, we are interested in problems such as the following: given a
Wikipedia article as a seed, finding other articles that are interesting
relative to the starting point. We are particularly interested in articles
that are more than one click away from the seed, since these are generally
harder for a human to find.

In addition to the staples of scientific Python computing (NumPy, SciPy,
Matplotlib, and IPython), we use the libraries RL-Glue/RL-Library, NetworkX,
Gensim, and scikit-learn.

Reinforcement Learning considers the interaction between a given environment
and an agent. The objective is to design an agent able to learn a policy that
allows it to maximize its total expected reward. We use the RL-Glue/RL-Library
libraries for our RL experiments. These libraries provide the infrastructure to
connect an environment and an agent, each one described by an independent
Python program.
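
The agent/environment split described above can be sketched in plain Python. This is only an illustration under stated assumptions: the class names, method names, toy link graph, and the choice of tabular Q-learning are all hypothetical, and the code does not use or imitate the RL-Glue API.

```python
import random

# Hypothetical toy link graph: page -> pages it links to.
LINKS = {
    "seed": ["a", "b"],
    "a": ["seed", "target"],
    "b": ["seed"],
    "target": ["target"],
}


class LinkEnvironment:
    """Environment: the agent follows links; reaching 'target' pays 1."""

    def start(self):
        self.state = "seed"
        return self.state

    def step(self, action):
        # An action is an index into the outgoing links of the current page.
        self.state = LINKS[self.state][action]
        done = self.state == "target"
        reward = 1.0 if done else 0.0
        return reward, self.state, done


class QLearningAgent:
    """Agent: tabular Q-learning with epsilon-greedy exploration."""

    def __init__(self, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.rng = random.Random(seed)
        self.q = {s: [0.0] * len(out) for s, out in LINKS.items()}

    def act(self, state):
        values = self.q[state]
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(values))  # explore
        return values.index(max(values))  # exploit

    def learn(self, state, action, reward, next_state, done):
        future = 0.0 if done else self.gamma * max(self.q[next_state])
        target = reward + future
        self.q[state][action] += self.alpha * (target - self.q[state][action])


def run_episode(env, agent, max_steps=20):
    """Connect one environment and one agent for a single episode."""
    state = env.start()
    total = 0.0
    for _ in range(max_steps):
        action = agent.act(state)
        reward, next_state, done = env.step(action)
        agent.learn(state, action, reward, next_state, done)
        total += reward
        state = next_state
        if done:
            break
    return total


env, agent = LinkEnvironment(), QLearningAgent()
for _ in range(200):
    run_episode(env, agent)

# Greedy policy after training: preferred link from each non-terminal page.
policy = {
    s: LINKS[s][q.index(max(q))] for s, q in agent.q.items() if s != "target"
}
print(policy)
```

The environment and the agent only communicate through states, actions, and rewards, which is what makes it possible to keep them in independent programs.
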
We represent the linked datasets we work with as graphs. For this we use
NetworkX, which provides data structures to efficiently represent graphs
together with implementations of many classic graph algorithms. We use
NetworkX graphs to describe the environments implemented in
RL-Glue/RL-Library. We also use NetworkX to create, analyze, and visualize
graphs built from unstructured data.
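
As a minimal sketch of the graph representation, the example below builds a small directed graph and lists the articles that are more than one click away from a seed. The article names and links are made up for illustration; only the NetworkX calls are real.

```python
import networkx as nx

# Hypothetical link structure; an edge u -> v means "article u links to v".
G = nx.DiGraph()
G.add_edges_from([
    ("Seed", "A"),
    ("Seed", "B"),
    ("A", "C"),
    ("B", "C"),
    ("C", "D"),
])

# Breadth-first distances (in clicks) from the seed to every reachable article.
clicks = nx.single_source_shortest_path_length(G, "Seed")

# Articles more than one click away from the seed.
far = sorted(node for node, dist in clicks.items() if dist > 1)
print(far)  # ['C', 'D']
```
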
One of the contributions of our research is the idea of representing the items
in the datasets as vectors belonging to a linear space. To this end, we build
a Latent Semantic Analysis (LSA) model to project documents onto a vector
space. Besides letting us compute similarities between documents, this allows
us to leverage a variety of RL techniques that require a vector
representation. We use the Gensim library, which provides all the machinery
to build LSA models, among other options. One place where Gensim shines is in
its ability to handle big data sets, like the entire Wikipedia, that do not
fit in memory. We also attach the vector representation of each item as a
property of the corresponding NetworkX node.
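
To make the idea of projecting documents onto a vector space concrete, here is a small sketch of LSA as a truncated SVD of a term-document matrix. It deliberately uses NumPy directly instead of Gensim, so it illustrates the underlying computation rather than the library used in the talk, and it would not scale to a corpus that does not fit in memory. The toy corpus and all variable names are assumptions.

```python
import numpy as np

# Hypothetical toy corpus; each document is a list of tokens.
docs = [
    ["graph", "node", "edge"],
    ["graph", "edge", "path"],
    ["agent", "reward", "policy"],
    ["agent", "policy", "environment"],
]

vocab = sorted({token for doc in docs for token in doc})
index = {token: i for i, token in enumerate(vocab)}

# Term-document count matrix: one row per term, one column per document.
counts = np.zeros((len(vocab), len(docs)))
for j, doc in enumerate(docs):
    for token in doc:
        counts[index[token], j] += 1

# Truncated SVD: keep only the k largest singular values.
k = 2
U, s, Vt = np.linalg.svd(counts, full_matrices=False)
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T  # shape: (n_docs, k)


def cosine(u, v):
    """Cosine similarity between two document vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


same_topic = cosine(doc_vectors[0], doc_vectors[1])
other_topic = cosine(doc_vectors[0], doc_vectors[2])
print(round(same_topic, 3), round(other_topic, 3))
```

In this toy corpus the two graph-related documents end up close to each other and far from the RL-related ones, which is the kind of similarity structure the vector representation is meant to expose.
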
Finally, we use the manifold learning capabilities of scikit-learn, like
the ISOMAP algorithm, to perform some exploratory data analysis. By reducing
the dimensionality of the LSA vectors obtained using Gensim from 400 to 3, we
are able to visualize the relative position of the vectors together with their
connections.
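
The dimensionality reduction step might look roughly like the sketch below. The input is synthetic data standing in for the 400-dimensional LSA vectors, and the number of points and the `n_neighbors` value are arbitrary choices for illustration, not values from the talk.

```python
import numpy as np
from sklearn.manifold import Isomap

# Stand-in for 400-dimensional LSA vectors: synthetic points that lie on a
# smooth 3-dimensional surface embedded in 400 dimensions.
rng = np.random.default_rng(0)
latent = rng.uniform(-1.0, 1.0, size=(100, 3))
mixing = rng.normal(size=(3, 400))
vectors = np.tanh(latent @ mixing)

# Reduce from 400 to 3 dimensions so the points can be plotted.
embedding = Isomap(n_neighbors=10, n_components=3).fit_transform(vectors)
print(embedding.shape)  # (100, 3)
```

The rows of `embedding` can then be passed to a 3D scatter plot, with line segments drawn between points whose items are linked in the graph.
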
In summary, by combining a variety of libraries to solve our research
problems, this talk shows that the NumPy/SciPy ecosystem has become the
lingua franca of scientific Python computing.

Alejandro Weinstein, Michael Wakin
Wed, 18 Jul 2012 00:00:00 -0500