The goal of The Encyclopedia Of DNA Elements (ENCODE) Project is to identify all of the elements in the human and mouse genomes and make this information available as a resource to the biomedical community. The ENCODE Project is a collaboration of research groups funded by the National Human Genome Research Institute (NHGRI) that was planned as a follow-up to the Human Genome Project after it's conclusion in 2003. In February 2017, ENCODE began it's fourth funding phase. A large project that has been around for 15 years, the ENCODE Project has produced a lot of data and hundreds of publications. NHGRI is interested in curating the Consortium's publication information in-house to track the Consortium's progress.
It is important for both researchers and the public to be able to access the information from databases like PubMed, GEO, etc. Here, we developed a method to access and extract publication information from specified PMIDs in PubMed utilzing the Entrez package within the Biopython module. We then assessed trends such as the number of ENCODE publications per year, the number of ENCODE publications over time and the journal most published in.
Using Entrez, define a function to extract PMID's publication information from PubMed. More information on Biopython's Entrez.eftech can be found here.
#Import modules
import pandas as pd
import numpy as np
from Bio import Entrez
import plotly
import plotly.plotly as py
import plotly.graph_objs as go
from collections import Counter
#Register with Entrez
Entrez.email = "[email protected]"
#Define a function to grab the desired attributes of the PMIDs from PubMed
def get_pubmed_data(pmids, attrb_list):
"""Returns full PubMed data records for desired PMIDs in XML format. PMIDs can be found online
in PubMed and can be accepted individually or as a list. Desired data from PMIDs ('attrb_list')
can be viewed here https://github.com/michael-pagan/BIOF-309-Project/blob/master/PubMed.txt"""
pubs_list = []
pmid_number = len(pmids.index)
iteration_number = 1
while True:
if (iteration_number*200 > pmid_number):
upper_limit = pmid_number
else:
upper_limit = iteration_number*200
#Convert the CSV file into a list
pmids_list = pmids["PMID"][iteration_number*200-200:upper_limit].to_csv(index=False)
xml_doc = Entrez.efetch(db='pubmed', id=pmids_list, retmode='xml', rettype='docsum')
for pub in Entrez.parse(xml_doc):
pub_attribs = []
for attrib in attrb_list:
pub_attribs.append(pub[attrib])
pubs_list.append(pub_attribs)
if (iteration_number*200 > pmid_number):
break
else:
iteration_number += 1
pd_docs = pd.DataFrame(pubs_list, columns=attrb_list)
#Show the DataFrame
display(pd_docs.head(10))
print(pd_docs.shape)
#Export to .csv
pd_docs.to_csv("pd_docs.csv")
return pd_docs
#Import the csv with the pmids as a DataFrame
pmids = pd.read_csv("publications.csv")
#List desired PMID attributes to get from PubMed
attrb_list = ["Id", "FullJournalName", "LastAuthor", "PubDate"]
#Call the function to grab the data
pd_docs = get_pubmed_data(pmids, attrb_list)
PMID | Full Journal Name | Last Author | Publication Date | |
---|---|---|---|---|
0 | 18665130 | Nature genetics | Gingeras TR | 2008 |
1 | 17568007 | Genome research | Stamatoyannopoulos JA | 2007 |
2 | 17166863 | Nucleic acids research | Kent WJ | 2007 |
3 | 17567993 | Genome research | Gerstein MB | 2007 |
4 | 17567995 | Genome research | Sidow A | 2007 |
5 | 18258921 | Genome research | Liu XS | 2008 |
6 | 17568011 | Genome research | Baxevanis AD | 2007 |
7 | 19425134 | Genome informatics. International Confer | Tullius TD | 2008 |
8 | 21439813 | Current opinion in structural biology | Tullius TD | 2011 |
9 | 19286520 | Science (New York, N.Y.) | Margulies EH | 2009 |
(691, 4)
#Clean the data
pd_docs = pd_docs.rename(index=str, columns={"Id": "PMID", "FullJournalName": "Full Journal Name",
"LastAuthor": "Last Author", "PubDate": "Publication Date"})
pd_docs['Publication Date'] = pd_docs['Publication Date'].apply(lambda x: str(x)[:4])
pd_docs['Full Journal Name'] = pd_docs['Full Journal Name'].apply(lambda x: str(x)[:40])
pd_docs["Author: Journal"] = pd_docs["Last Author"] + ": " + pd_docs["Full Journal Name"]
display(pd_docs.head(10))
print(pd_docs.shape)
PMID | Full Journal Name | Last Author | Publication Date | Author: Journal | |
---|---|---|---|---|---|
0 | 18665130 | Nature genetics | Gingeras TR | 2008 | Gingeras TR: Nature genetics |
1 | 17568007 | Genome research | Stamatoyannopoulos JA | 2007 | Stamatoyannopoulos JA: Genome research |
2 | 17166863 | Nucleic acids research | Kent WJ | 2007 | Kent WJ: Nucleic acids research |
3 | 17567993 | Genome research | Gerstein MB | 2007 | Gerstein MB: Genome research |
4 | 17567995 | Genome research | Sidow A | 2007 | Sidow A: Genome research |
5 | 18258921 | Genome research | Liu XS | 2008 | Liu XS: Genome research |
6 | 17568011 | Genome research | Baxevanis AD | 2007 | Baxevanis AD: Genome research |
7 | 19425134 | Genome informatics. International Confer | Tullius TD | 2008 | Tullius TD: Genome informatics. International ... |
8 | 21439813 | Current opinion in structural biology | Tullius TD | 2011 | Tullius TD: Current opinion in structural biology |
9 | 19286520 | Science (New York, N.Y.) | Margulies EH | 2009 | Margulies EH: Science (New York, N.Y.) |
(691,5)
#Define a function to sort desired data
def get_variable(dataFrame, variable):
"""Gets desired column data from DataFrame and sorts from most to least frequent.
Data are assigned to 'variable_results', data frequency is assigned to 'counts'."""
#Create an empty list
frequency_list = []
#Fill the list with the frequency of the data
for i in dataFrame[variable]:
frequency_list.append(i)
#Use Counter to count how many times a journal appears
journal_count = Counter(frequency_list)
#Sort the frequency data by most to least common
sorted_journal_count = journal_count.most_common()
#Create 2 new lists for the variables that you want
global variable_results
variable_results = []
global counts
counts = []
#Fill variable_results and counts
for i in sorted_journal_count:
key, count = i
variable_results.append(key)
counts.append(count)
return variable_results, counts
#Authenticate plotly
plotly.tools.set_credentials_file(username='mpagan2', api_key='9oJfHnTtef1NWTokn4lI')
def hbargraph(xaxis, yaxis, title):
"""Using plotly, develops a horizontal bargraph with hover-over value labels and a custom title"""
data = [go.Bar(
x= xaxis,
y= yaxis,
orientation = 'h',
marker = dict(
color = 'rgba(0, 158, 28, 0.6)',
line = dict(
color = 'rgba(0, 158, 28, 1.0)',
width = 3))
)
]
layout = go.Layout(
title = title,
autosize=False,
width=1000,
height=500,
margin=go.Margin(
l=300,
r=50,
b=100,
t=100,
pad=4
),
)
fig = go.Figure(data=data, layout=layout)
graph = py.iplot(fig, filename=title)
return graph
#Call ‘get_variable’ to obtain the desired data to fill the horizontal bar graph with
get_variable(pd_docs, "Full Journal Name")
#Call ‘hbargraph’ to create the plot
hbargraph(counts[:10], variable_results[:10], "Top 10 Journals ENCODE Authors Publish In")
View in plotly
get_variable(pd_docs, "Author: Journal")
hbargraph(counts[:10], variable_results[:10], "Top 10 Journals Published In By Single ENCODE Author")
View in plotly
get_variable(pd_docs, "Last Author")
hbargraph(counts[:10], variable_results[:10], "Top 10 ENCODE Authors")
View in plotly
#Define function to make bar graph
def bargraph(xaxis, yaxis, bar_labels, title):
"""Using plotly, develops a bar graph with custom hover-over value label descriptions and a custom title"""
trace0 = go.Bar(
x = xaxis,
y = yaxis,
text = bar_labels,
marker=dict(
color='rgb(153, 153, 255)',
line=dict(
color='rgb(8,48,107)',
width=1.5,
)
),
opacity=0.6
)
data = [trace0]
layout = go.Layout(
title= title,
xaxis=dict(
autotick=False,
ticks='outside',
tick0=0,
dtick=1,
ticklen=8,
tickwidth=4,
tickcolor='#000'
),
)
fig = go.Figure(data=data, layout=layout)
graph = py.iplot(fig, filename=title)
return graph
#Grab unique years from "PubDate" column of pd_docs, sort the years, and create "'Year' Publications" labels
get_year_labels = pd_docs["Publication Date"].unique()
sorted_year_labels = sorted(get_year_labels)
labels = [i + " Publications" for i in sorted_year_labels]
#Get x-axis years and sort chronologically
years = pd_docs["Publication Date"].unique().tolist()
sorted_years = sorted(years)
#Get y-axis values
pubs_by_year = pd_docs.groupby('Publication Date')['PMID'].nunique().tolist()
#Call the function
bargraph(sorted_years, pubs_by_year, labels, "ENCODE Publications By Year")
View in plotly
def linegraph(xaxis, yaxis, line_labels, title):
"""Using plotly, develops a line graph with custom hover-over value label descriptions and a custom title"""
trace = go.Scatter(
x = xaxis,
y = yaxis,
text = line_labels,
marker=dict(
color='rgb(153, 153, 255)'
),
)
line_layout = go.Layout(
title= title,
xaxis=dict(
autotick=False,
ticks='outside',
tick0=0,
dtick=1,
ticklen=8,
tickwidth=4,
tickcolor='#000'
),
)
line = [trace]
fig = go.Figure(data=line, layout=line_layout)
graph = py.iplot(fig, filename='Publications Over Time')
return graph
#Sum the publications sequentially by year
sum_pubs = np.cumsum(pubs_by_year)
#Grab unique years from "PubDate" column of pd_docs, sort the years,
#and create "Total Number of Publications in 'Year'" labels
get_total_year_labels = pd_docs["Publication Date"].unique()
sorted_total_year_labels = sorted(get_total_year_labels)
total_labels = ["Total Number of Publications in " + i for i in sorted_total_year_labels]
#Call the line graph function
linegraph(sorted_years, sum_pubs, total_labels, "ENCODE Publications Over Time")
View in plotly
Thanks to Martin, Ben, and Michael for a great semester as we took our first stab at Python! We are grateful for this experience and look forward to becoming even better programmers with the foundation we gained here!