
CiteSeerX DataSet

 

Steps for downloading the full dataset from CiteSeerX:

Update:
Gregor has kindly provided a CiteSeerX fetcher that works much better than the old approach.  Thanks :)

 

P.S.  I am in the process of setting up a new web server at the lab, so I will try to make a whole dump available one of these days.  If you need it sooner, email me.

 

Old

Many thanks to this blog post.

I have added slightly more detail and made some minor changes that seemed to work better in my case.

  1. Download and extract the "Demo" from http://purl.oclc.org/NET/OPENSRC/downloads/oaiharvester/jars/oaiharvesterdemo.tar
     
  2. Download Xerces and place xerces.jar in the same directory as Demo:
    http://www.apache.org/dist/xerces/j/Xerces-J-bin.2.9.1.zip
     
  3. Go to the Demo directory and type the following command (all on one line) to download the full CiteSeerX dataset to the file "citeseerx_alldata.xml":
    java -classpath .:oaiharvester.jar:xerces.jar org.acme.oai.OAIReaderRawDump http://citeseerx.ist.psu.edu/oai2 -o citeseerx_alldata.xml

    Note: you will most likely not see any output right away, so check the file citeseerx_alldata.xml to make sure that data is being added to it.  The data is about 520 MB, so the download may take a while.
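
One way to check on the harvest while it runs is a small shell script like the sketch below. It prints the current size of the output file and a rough count of records written so far. The helper names file_size and count_records are made up for this example, and the record count assumes the raw dump contains plain <record> opening tags, as standard OAI-PMH responses do; if the tags in your file look different, adjust the pattern.

```shell
#!/bin/sh
# Sketch: check that the harvest from step 3 is still writing to its output file.
# OUT is the output file name used in step 3.
OUT="citeseerx_alldata.xml"

# Print the size of a file in bytes (0 if the file does not exist yet).
file_size() {
    if [ -f "$1" ]; then
        wc -c < "$1" | tr -d ' '
    else
        echo 0
    fi
}

# Count "<record>" opening tags as a rough proxy for records harvested so far.
# ASSUMPTION: the dump contains unprefixed <record> elements; this is not
# confirmed for the file the harvester actually writes.
count_records() {
    grep -o '<record>' "$1" 2>/dev/null | wc -l | tr -d ' '
}

echo "size: $(file_size "$OUT") bytes, records so far: $(count_records "$OUT")"
```

Run it from the Demo directory twice, a minute or so apart. If both numbers go up between runs, the download is still making progress.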

For more information, you may also want to look at the original website for the OAI harvester.

keywords: citeseer citeseerx data set dataset dump download database db citation analysis link analysis