Correlations

Anže Sršen edited this page Dec 12, 2017 · 1 revision

An interesting type of data analysis is identifying dependencies between variables. For two continuous variables we can, for example, compute their correlation as an estimate of how strongly they depend on each other. Each concept in Event Registry can be seen as a time series in which the value for a given day is the number of collected articles that mention the concept. Given any input time series, we can then compute which concepts correlate the most with it.
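To make the idea concrete, here is a minimal sketch of a correlation between two daily count series. It uses the Pearson coefficient, which is an assumption made for illustration; this page does not state which correlation measure Event Registry computes internally. The function name `pearson` and the sample counts are hypothetical.

```javascript
// Pearson correlation coefficient between two equally long numeric series.
// Assumption: a generic illustration, not Event Registry's internal computation.
function pearson(xs, ys) {
    if (xs.length !== ys.length || xs.length < 2) {
        throw new Error("series must have the same length (at least 2)");
    }
    const n = xs.length;
    const meanX = xs.reduce((a, b) => a + b, 0) / n;
    const meanY = ys.reduce((a, b) => a + b, 0) / n;
    let cov = 0, varX = 0, varY = 0;
    for (let i = 0; i < n; i++) {
        const dx = xs[i] - meanX;
        const dy = ys[i] - meanY;
        cov += dx * dy;
        varX += dx * dx;
        varY += dy * dy;
    }
    // The coefficient is undefined when either series is constant.
    if (varX === 0 || varY === 0) return NaN;
    return cov / Math.sqrt(varX * varY);
}

// Hypothetical daily article counts for three concepts over five days.
const conceptA = [10, 20, 30, 40, 50];
const conceptB = [12, 24, 33, 41, 55];
const conceptC = [50, 40, 30, 20, 10];

console.log(pearson(conceptA, conceptB)); // close to 1: the series rise together
console.log(pearson(conceptA, conceptC)); // -1: the series move in opposite directions
```

A value near 1 means the two concepts tend to be mentioned more on the same days, while a value near -1 means one is mentioned more when the other is mentioned less.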

To compute which things in Event Registry correlate the most with a time series, we can use the GetTopCorrelations class.

import { EventRegistry, GetCounts, GetTopCorrelations, QueryArticles } from "eventregistry";
const er = new EventRegistry({apiKey: "YOUR_API_KEY"});
const corr = new GetTopCorrelations(er);

Step 1: Providing input data

Depending on what you want to use as the input time series, you have three options - (a) loading a time series of a concept/category, (b) loading a time series based on an article query, or (c) providing your own data.

Input time series based on a concept/category from Event Registry

To load a time series of a concept or a category, we can simply use the GetCounts() class.

er.getConceptUri("Obama").then((conceptUri) => {
    // Request the mention counts for the resolved concept URI
    const counts = new GetCounts(conceptUri);
    corr.loadInputDataWithCounts(counts);
});

Input time series based on an article query

You can also form an article query using a different set of conditions. The resulting set of articles also forms a time series that can be used as the input time series. In the example below, we find all articles that mention the keyword "iphone" and use the resulting time series as the input data.

const query = new QueryArticles({ keywords: "iphone" });
corr.loadInputDataWithQuery(query);

Input time series based on user-provided data

You can also provide your own input data by calling the setCustomInputData() method. The argument is expected to be an array of [date, count] pairs, where each pair is a two-element array.

corr.setCustomInputData([["2015-01-01", 213], ["2015-01-02", 13], ["2015-01-03", 423], /* ... */]);
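When the raw data is a list of timestamps rather than ready-made counts, it first has to be aggregated per day. The sketch below shows one way to do that in plain JavaScript. `toDailyCounts` is a hypothetical helper, not part of the eventregistry package, and it assumes that the timestamps are ISO 8601 strings beginning with YYYY-MM-DD and that the result should be date and count pairs as described above.

```javascript
// Aggregate ISO 8601 timestamps into [date, count] pairs sorted by date.
// Hypothetical helper: not part of the eventregistry package.
function toDailyCounts(timestamps) {
    const counts = new Map();
    for (const ts of timestamps) {
        const day = ts.slice(0, 10); // keep only the YYYY-MM-DD prefix
        counts.set(day, (counts.get(day) || 0) + 1);
    }
    // YYYY-MM-DD strings sort chronologically when compared as plain strings.
    return [...counts.entries()].sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
}

// Made-up timestamps used only for illustration.
const dailyCounts = toDailyCounts([
    "2015-01-02T10:00:00Z",
    "2015-01-01T09:30:00Z",
    "2015-01-02T18:45:00Z",
]);
console.log(dailyCounts);
```

The resulting array has the shape that could then be passed to corr.setCustomInputData(dailyCounts).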

Step 2: Computing top correlations

Once the input data has been provided in one of the ways above, we can compute the things that correlate the most with it. Depending on your interests, you can compute correlations with either concepts or categories.
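Conceptually, finding the top correlations means scoring every candidate time series against the input and keeping the best matches. The following is a rough local sketch of that idea, not a description of how the Event Registry service computes its results. The names `correlation`, `topCorrelations`, and the candidate data are hypothetical, Pearson correlation is assumed, and days missing from a candidate are treated as zero counts.

```javascript
// Pearson correlation between two equally long numeric arrays.
// Assumption: a constant series is given a correlation of 0.
function correlation(xs, ys) {
    const n = xs.length;
    const mean = (v) => v.reduce((a, b) => a + b, 0) / n;
    const mx = mean(xs), my = mean(ys);
    let cov = 0, vx = 0, vy = 0;
    for (let i = 0; i < n; i++) {
        cov += (xs[i] - mx) * (ys[i] - my);
        vx += (xs[i] - mx) ** 2;
        vy += (ys[i] - my) ** 2;
    }
    return vx === 0 || vy === 0 ? 0 : cov / Math.sqrt(vx * vy);
}

// Rank candidate series against the input and keep the `count` best matches.
// `input` is an array of [date, count] pairs; each candidate maps date -> count.
function topCorrelations(input, candidates, count) {
    const inputValues = input.map(([, value]) => value);
    return Object.entries(candidates)
        .map(([name, series]) => {
            // Align on the input's dates; days missing from a candidate count as 0.
            const aligned = input.map(([date]) => series[date] || 0);
            return { name, corr: correlation(inputValues, aligned) };
        })
        .sort((a, b) => b.corr - a.corr)
        .slice(0, count);
}

// Made-up data used only for illustration.
const input = [["2015-01-01", 5], ["2015-01-02", 10], ["2015-01-03", 15]];
const candidates = {
    rising: { "2015-01-01": 1, "2015-01-02": 2, "2015-01-03": 3 },
    falling: { "2015-01-01": 9, "2015-01-02": 6, "2015-01-03": 3 },
    flat: { "2015-01-01": 4, "2015-01-02": 4, "2015-01-03": 4 },
};
const top = topCorrelations(input, candidates, 2);
console.log(top.map((c) => c.name)); // [ 'rising', 'flat' ]
```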

To compute the top correlations with concepts, call the getTopConceptCorrelations() method:

const conceptInfo = corr.getTopConceptCorrelations({
    conceptType: ["person", "org"],
    exactCount: 10,
    approxCount: 100,
});

The method arguments are as follows:

  • candidateConceptsQuery: optional. An instance of QueryArticles that can be used to limit the space of concept candidates
  • candidatesPerType: If candidateConceptsQuery is provided, then this number of concepts for each valid type will be return as candidates
  • conceptType: optional. A string or an array containing the concept types that are valid candidates on which to compute top correlations. Valid values are person, org, loc and/or wiki
  • exactCount: the number of returned concepts for which the exact value of the correlation is computed
  • approxCount: the number of returned concepts for which only an approximate value of the correlation is computed
  • returnInfo: specifies the details about the concepts that should be returned in the output result

Alternatively, you can compute the list of categories that correlate the most with the input data. For this purpose, call the getTopCategoryCorrelations() method:

const categoryInfo = corr.getTopCategoryCorrelations({
    exactCount: 10,
    approxCount: 100,
});

The method arguments are as follows:

  • exactCount: the number of returned categories for which the exact value of the correlation is computed
  • approxCount: the number of returned categories for which only an approximate value of the correlation is computed
  • returnInfo: specifies the details about the categories that should be returned in the output result