Mining PubMed via Neo4J Graph Database–Getting the data
I have a blog dealing with various complex autoimmune diseases and spend a lot of time walking links at PubMed.com. Often readers send me an article that I missed.
I thought that a series of posts on how I do this might help other people (including MDs, grad students and citizen scientists) research medical issues more effectively.
Getting the data from PubMed
I implemented some simple logic to obtain a collection of relevant articles:
- Query for 10,000 articles on a subject or keyword.
- Retrieve each of these articles and any articles they referenced (i.e. the knowledge graph).
- Keep repeating until you have enough articles or you run out of them!!
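The steps above amount to a breadth-first walk over the citation graph. Here is a minimal sketch of that idea, kept separate from the real downloader. The `getReferences` delegate is a hypothetical stand-in for "fetch an article and read the IDs it cites", and `maxArticles` is a hypothetical cap for "enough articles".

```csharp
using System;
using System.Collections.Generic;

public static class CrawlSketch
{
    // Breadth-first walk: start from the seed IDs, then follow references
    // until the queue is empty or maxArticles have been visited.
    // getReferences is a hypothetical stand-in for the PubMed lookup.
    public static List<string> Crawl(
        IEnumerable<string> seeds,
        Func<string, IEnumerable<string>> getReferences,
        int maxArticles)
    {
        var visited = new HashSet<string>();
        var order = new List<string>();
        var todo = new Queue<string>(seeds);

        while (todo.Count > 0 && order.Count < maxArticles)
        {
            var id = todo.Dequeue();
            if (!visited.Add(id)) continue; // already fetched, skip it

            order.Add(id);
            foreach (var child in getReferences(id))
            {
                if (!visited.Contains(child)) todo.Enqueue(child);
            }
        }
        return order;
    }
}
```

The visited set is what keeps the walk from looping when two articles cite each other.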
Getting the bootstrapping list of articles
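A console application reads the search term from the command-line arguments and retrieves the initial list. For example,
downloader.exe Crohn’s Disease
builds an esearch URI from the term (see TermSearch in the Downloader class below).
The server responds with an XML document listing the matching article IDs (the listing here is truncated):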
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD esearch 20060628//EN" "http://eutils.ncbi.nlm.nih.gov/eutils/dtd/20060628/esearch.dtd">
<eSearchResult><Count>44880</Count><RetMax>10000</RetMax><RetStart>0</RetStart><QueryKey>1</QueryKey><WebEnv>NCID_1_84230330_130.14.22.215_9001_1462926138_46088356_0MetA0_S_MegaStore_F_1</WebEnv><IdList>
<Id>27159423</Id>
<Id>27158773</Id>
<Id>27158547</Id>
<Id>27158537</Id>
<Id>27158536</Id>
<Id>27158345</Id>
<Id>27158125</Id>
<Id>27157449</Id>
<Id>27156530</Id>
<Id>27154890</Id>
<Id>27154001</Id>
<Id>27153721</Id>
<Id>27152873</Id>
<Id>27152872</Id>
<Id>27152547</Id>
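A response like the one above can be read with LINQ to XML. This is a minimal sketch, assuming the response has already been downloaded into a string. The class and method names are illustrative only.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class ESearchSketch
{
    // Returns the article IDs listed under <IdList> in an esearch response.
    public static List<string> ReadIds(string xml)
    {
        return XDocument.Parse(xml)
                        .Descendants("IdList")
                        .Elements("Id")
                        .Select(e => e.Value.Trim())
                        .ToList();
    }

    // <Count> is the total number of matches; <RetMax> is how many IDs
    // were returned in this particular response.
    public static int ReadCount(string xml)
    {
        return int.Parse(XDocument.Parse(xml).Root.Element("Count").Value);
    }
}
```

In the sample response Count is 44880 while RetMax is 10000, so a single request returns only the first 10,000 matching IDs.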
So let us look at the code
using PubMed;

class Program
{
    static Downloader downloader = new Downloader();

    static void Main(string[] args)
    {
        if (args.Length > 0)
        {
            // Join the command-line words into a single search term.
            var search = string.Join(" ", args);
            downloader.TermSearch(search);
            downloader.ProcessAll();
        }
        downloader.Save();
    }
}
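To make the step from command-line words to a search URI concrete, here is a small sketch. It assumes the esearch endpoint and query parameters used in TermSearch. `BuildSearchUrl` is a hypothetical helper written for this illustration, and escaping the term with `Uri.EscapeDataString` is one reasonable way to keep spaces and punctuation from breaking the query string.

```csharp
using System;

public static class SearchUrlSketch
{
    // Same E-utilities base address that TermSearch calls.
    const string Server = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/";

    // Joins the command-line words into one term and escapes it
    // so it can be placed safely in the query string.
    public static string BuildSearchUrl(string[] args, int retMax)
    {
        var term = string.Join(" ", args);
        return string.Format(
            "{0}esearch.fcgi?db=pubmed&retmax={1}&usehistory=y&term={2}",
            Server, retMax, Uri.EscapeDataString(term));
    }
}
```

Calling `SearchUrlSketch.BuildSearchUrl(args, 10000)` from Main would give the address to request for the first 10,000 IDs.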
The Downloader class tracks which articles have already been downloaded and which are still queued. It downloads each article summary and saves it to an XML file named after the article's unique ID. I wanted to keep the summaries on disk to speed up reprocessing if my Neo4J model changes.
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Net;
using System.Threading.Tasks;
using System.Xml;

namespace PubMed
{
    public class Downloader
    {
        // Entrez E-utilities at the US National Center for Biotechnology Information:
        static readonly string server = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/";
        string dataFolder = "C:\\PubMed";
        string logFile;

        // IDs of articles already downloaded.
        public ConcurrentBag<string> index = new ConcurrentBag<string>();
        // IDs of articles waiting to be downloaded.
        public ConcurrentQueue<string> todo = new ConcurrentQueue<string>();

        public Downloader()
        {
            // Make sure the data folder exists before reading or writing files in it.
            Directory.CreateDirectory(dataFolder);
            logFile = Path.Combine(dataFolder, "article.log");
            if (File.Exists(logFile))
            {
                foreach (var line in File.ReadAllLines(logFile))
                {
                    if (!string.IsNullOrWhiteSpace(line))
                        index.Add(line);
                }
            }
        }

        // Persist the list of downloaded IDs so a later run can resume.
        public void Save()
        {
            File.WriteAllLines(logFile, index.ToArray());
        }

        public void ProcessAll()
        {
            while (todo.Count > 0)
            {
                if (todo.Count > 12)
                {
                    // Download a batch of up to 10 articles in parallel.
                    var tasks = new List<Task>();
                    for (int t = 0; t < 10; t++)
                    {
                        string nextId;
                        if (todo.TryDequeue(out nextId))
                        {
                            // Copy to a local so each task captures its own ID
                            // rather than a variable shared across iterations.
                            var id = nextId;
                            tasks.Add(Task.Factory.StartNew(() => NcbiPubmedArticle(id)));
                        }
                    }
                    Task.WaitAll(tasks.ToArray());
                    Save();
                }
                else
                {
                    string nextId;
                    if (todo.TryDequeue(out nextId))
                    {
                        NcbiPubmedArticle(nextId);
                    }
                }
            }
        }

        // Search PubMed for a term and queue every article ID returned.
        public void TermSearch(string term)
        {
            var search = string.Format("{0}esearch.fcgi?db=pubmed&retmax=10000&usehistory=y&term={1}", server, Uri.EscapeDataString(term.Trim()));
            var tempFile = "temp.log";
            using (var client = new WebClient())
            {
                client.DownloadFile(new Uri(search), tempFile);
            }
            var xml = new XmlDocument();
            xml.Load(tempFile);
            foreach (XmlNode node in xml.DocumentElement.SelectNodes("//Id"))
            {
                var id = node.InnerText;
                if (!index.Contains(id) && !todo.Contains(id))
                {
                    todo.Enqueue(id);
                }
            }
        }

        // Download one article summary to <dataFolder>\<id>.xml and queue the articles it mentions.
        public void NcbiPubmedArticle(string term)
        {
            if (!index.Contains(term))
            {
                try
                {
                    var fileLocation = Path.Combine(dataFolder, string.Format("{0}.xml", term));
                    if (File.Exists(fileLocation)) return;
                    var search = string.Format("{0}efetch.fcgi?db=pubmed&id={1}&retmode=xml", server, term);
                    using (var client = new WebClient())
                    {
                        client.DownloadFile(new Uri(search), fileLocation);
                    }
                    index.Add(term);
                    GetChildren(fileLocation);
                    Console.WriteLine(term);
                }
                catch (Exception exc)
                {
                    // Report the failure instead of silently dropping the article.
                    Console.WriteLine("{0}: {1}", term, exc.Message);
                }
            }
        }

        // Queue every PMID that appears in a downloaded article file.
        private void GetChildren(string fileName)
        {
            try
            {
                var dom = new XmlDocument();
                dom.Load(fileName);
                foreach (XmlNode node in dom.DocumentElement.SelectNodes("//PMID"))
                {
                    var id = node.InnerText;
                    if (!index.Contains(id) && !todo.Contains(id))
                    {
                        todo.Enqueue(id);
                    }
                }
            }
            catch (Exception exc)
            {
                Console.WriteLine(exc.Message);
            }
        }
    }
}
Next: Importing into Neo4J
An example of the structured data to load is shown below. Try defining your own model while you wait for the next post.
<?xml version="1.0"?>
<!DOCTYPE PubmedArticleSet PUBLIC "-//NLM//DTD PubMedArticle, 1st January 2016//EN" "http://www.ncbi.nlm.nih.gov/corehtml/query/DTD/pubmed_160101.dtd">
<PubmedArticleSet>
<PubmedArticle>
<MedlineCitation Owner="NLM" Status="MEDLINE">
<PMID Version="1">10022306</PMID>
<DateCreated>
<Year>1999</Year>
<Month>02</Month>
<Day>25</Day>
</DateCreated>
<DateCompleted>
<Year>1999</Year>
<Month>02</Month>
<Day>25</Day>
</DateCompleted>
<DateRevised>
<Year>2006</Year>
<Month>11</Month>
<Day>15</Day>
</DateRevised>
<Article PubModel="Print">
<Journal>
<ISSN IssnType="Print">0378-4274</ISSN>
<JournalIssue CitedMedium="Print">
<Volume>102-103</Volume>
<PubDate>
<Year>1998</Year>
<Month>Dec</Month>
<Day>28</Day>
</PubDate>
</JournalIssue>
<Title>Toxicology letters</Title>
<ISOAbbreviation>Toxicol. Lett.</ISOAbbreviation>
</Journal>
<ArticleTitle>Epidemiological association in US veterans between Gulf War illness and exposures to anticholinesterases.</ArticleTitle>
<Pagination>
<MedlinePgn>523-6</MedlinePgn>
</Pagination>
<Abstract>
<AbstractText>To investigate complaints of Gulf War veterans, epidemiologic, case-control and animal modeling studies were performed. Looking for OPIDP variants, our epidemiologic project studied 249 Naval Reserve construction battalion (CB24) men. Extensive surveys were drawn for symptoms and exposures. An existing test (PAI) was used for neuropsychologic. Using FACTOR, LOGISTIC and FREQ in 6.07 SAS, symptom clusters were sought with high eigenvalues from orthogonally rotated two-stage factor analysis. After factor loadings and Kaiser measure for sampling adequacy (0.82), three major and three minor symptom clusters were identified. Internally consistent by Cronbach's coefficient, these were labeled syndromes: (1) impaired cognition; (2) confusion-ataxia; (3) arthro-myo-neuropathy; (4) phobia-apraxia; (5) fever-adenopathy; and (6) weakness-incontinence. Syndrome variants identified 63 patients (63/249, 25%) with 91 syndromes. With pyridostigmine bromide as the drug in these drug-chemical exposures, syndrome chemicals were: (1) pesticide-containing flea and tick collars (P < 0.001); (2) alarms from chemical weapons attacks (P < 0.001), being in a sector later found to have nerve agent exposure (P < 0.04); and (3) insect repellent (DEET) (P < 0.001). From CB24, 23 cases, 10 deployed and 10 non-deployed controls were studied. Auditory evoked potentials showed dysfunction (P < 0.02), nystagmic velocity on rotation testing, asymmetry on saccadic velocity (P < 0.04), somatosensory evoked potentials both sides (right P < 0.03, left P < 0.005) and synstagmic velocity after caloric stimulation bilaterally (P-range, 0.02-0.04). Brain dysfunction was shown on the Halstead Impairment Index (P < 0.01), General Neuropsychological Deficit Scale (P < 0.03) and Trail Making part B (P < 0.03). Butylcholinesterase phenotypes did not trend for inherent abnormalities. Parallel hen studies at Duke University established similar drug-chemical delayed neurotoxicity. 
These investigations lend credibility that sublethal exposures to drug-chemical combinations caused delayed-onset neurotoxic variants.</AbstractText>
</Abstract>
<AuthorList CompleteYN="Y">
<Author ValidYN="Y">
<LastName>Kurt</LastName>
<ForeName>T L</ForeName>
<Initials>TL</Initials>
<AffiliationInfo>
<Affiliation>Department of Internal Medicine, University of Texas Southwestern Medical School, Dallas 75235, USA.</Affiliation>
</AffiliationInfo>
</Author>
</AuthorList>
<Language>eng</Language>
<PublicationTypeList>
<PublicationType UI="D016428">Journal Article</PublicationType>
<PublicationType UI="D013485">Research Support, Non-U.S. Gov't</PublicationType>
</PublicationTypeList>
</Article>
<MedlineJournalInfo>
<Country>NETHERLANDS</Country>
<MedlineTA>Toxicol Lett</MedlineTA>
<NlmUniqueID>7709027</NlmUniqueID>
<ISSNLinking>0378-4274</ISSNLinking>
</MedlineJournalInfo>
<ChemicalList>
<Chemical>
<RegistryNumber>0</RegistryNumber>
<NameOfSubstance UI="D002800">Cholinesterase Inhibitors</NameOfSubstance>
</Chemical>
</ChemicalList>
<CitationSubset>IM</CitationSubset>
<MeshHeadingList>
<MeshHeading>
<DescriptorName MajorTopicYN="N" UI="D016022">Case-Control Studies</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N" UI="D002800">Cholinesterase Inhibitors</DescriptorName>
<QualifierName MajorTopicYN="Y" UI="Q000633">toxicity</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N" UI="D006801">Humans</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N" UI="D008297">Male</DescriptorName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="N" UI="D018923">Persian Gulf Syndrome</DescriptorName>
<QualifierName MajorTopicYN="Y" UI="Q000209">etiology</QualifierName>
</MeshHeading>
<MeshHeading>
<DescriptorName MajorTopicYN="Y" UI="D014728">Veterans</DescriptorName>
</MeshHeading>
</MeshHeadingList>
</MedlineCitation>
<PubmedData>
<History>
<PubMedPubDate PubStatus="pubmed">
<Year>1999</Year>
<Month>2</Month>
<Day>18</Day>
</PubMedPubDate>
<PubMedPubDate PubStatus="medline">
<Year>1999</Year>
<Month>2</Month>
<Day>18</Day>
<Hour>0</Hour>
<Minute>1</Minute>
</PubMedPubDate>
<PubMedPubDate PubStatus="entrez">
<Year>1999</Year>
<Month>2</Month>
<Day>18</Day>
<Hour>0</Hour>
<Minute>0</Minute>
</PubMedPubDate>
</History>
<PublicationStatus>ppublish</PublicationStatus>
<ArticleIdList>
<ArticleId IdType="pubmed">10022306</ArticleId>
</ArticleIdList>
</PubmedData>
</PubmedArticle>
</PubmedArticleSet>
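As one possible starting point for a model, the sketch below pulls a few fields out of an article like the one above: the PMID, title, journal and MeSH descriptors. It assumes the XML is passed in as a string. The `ArticleSummary` class and its property names are placeholders for illustration, not a finished Neo4J model.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical container for the fields a graph node might carry.
public class ArticleSummary
{
    public string Pmid { get; set; }
    public string Title { get; set; }
    public string Journal { get; set; }
    public List<string> MeshDescriptors { get; set; }
}

public static class ArticleSketch
{
    public static ArticleSummary Read(string xml)
    {
        var citation = XDocument.Parse(xml).Descendants("MedlineCitation").First();
        var article = citation.Element("Article");
        return new ArticleSummary
        {
            // The PMID directly under MedlineCitation is the article's own ID.
            Pmid = (string)citation.Element("PMID"),
            Title = (string)article?.Element("ArticleTitle"),
            Journal = (string)article?.Element("Journal")?.Element("Title"),
            // An article with no MeSH headings simply yields an empty list.
            MeshDescriptors = citation.Descendants("DescriptorName")
                                      .Select(d => d.Value)
                                      .ToList()
        };
    }
}
```

One option is to make each MeSH descriptor its own node and link articles to it, so that articles sharing a topic end up connected in the graph.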