Archived: Soundcloud Scraper

I like music. When I find a song I like, I want to listen to it until it becomes dead to me. During the summer, I would listen to random songs here and there (courtesy of r/electrohouse) while working and often find good songs. I emailed myself a list of links to download when I got home. The list had 3 types of links:

If I wanted to download these links, I would need to do it one by one. There must be an easier way!

The first problem I encountered, even before coding, was that not all SC links were downloadable. Of course, I couldn’t care less why SC limits their downloads, but it bothers me that they do. One notion I see myself mention a lot (and that should be the 11th law of security) is: "If it displays on my screen or plays through my headphones, I can download it". However, this coding project was simply a test to familiarize myself with the ways of web scraping. Actually running this program would breach the Terms of Use of SC:

(i) You must not copy, rip or capture, or attempt to copy, rip or capture, any audio Content from the Platform or any part of the Platform, other than by means of download in circumstances where the relevant Uploader has enabled the download functionality with respect to the relevant item of Content.

Of course, I find it kind of odd that by accessing their website you automatically agree to their Terms of Use. That seems impossible to me, since you can’t agree to something you haven’t read yet. On a similar note, you just lost The Game. Also, by explaining how to do this, I’m violating yet another clause:

(vii) You must not, and must not permit any third party to, copy or adapt the object code of the Website or any of the Services, or reverse engineer, reverse assemble, decompile, modify or attempt to discover any source or object code of any part of the Platform, or circumvent or attempt to circumvent or copy any copy protection mechanism or access any rights management information pertaining to Content other than Your Content.

I’ve never heard of something so ballsy as to say that I can’t view the HTML code that you gave to me!

If any party disagrees with me posting this, please notify me via email (dlachanc@ualberta.ca) and I will swiftly remove this page from my website. Although, it might take a while for me to get the email since I’m behind seven proxies.

Alright, here we go. First I found this awesome JS script by Captain Frech, and aye, I must say it sure helped. What his/her code is based on is the fact that SC hides the download link unencrypted, in plain sight:

Those sly devils! Note: this is for the older version of SC, the newer version is completely locked down…. oh wait what’s this. I took the code and transferred it to a PhantomJS script so that I didn’t need to do it myself. PhantomJS is a headless WebKit browser that you script in javascript from the command line, much like NodeJS runs javascript on the server. I seem to be the only one who finds that hilarious, seeing as javascript is looser than a….oh nvm.

Here’s my code snippet (full code here):

var script_content = $("a.pl-button.favorite").parents("div.actionbar:first").nextAll("script:first").text();
var match = script_content.match(/"streamUrl":"([^"]+)"/);
var link = match ? match[1] : "";								//empty if no stream url was found
var title = $.trim($('h1.with-artwork em').html());
if(title == ""){												//alternate title
	title = $.trim($('div.info-header.large h1 em').html());
}
console.log(title);											 	//title
console.log($.trim($('span.user.tiny a.user-name').html())); 	//artist
console.log($.trim($('div.actionbar a span.genre').html())); 	//genre
console.log($.trim($('div#zoomed-artwork img').attr("src")));	//image link
console.log(link);												//stream link
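To see the link extraction on its own, here is a minimal sketch that runs outside the browser. The helper name `extractStreamUrl` and the sample string are made up for illustration; the real page's script block is not reproduced here, and the only thing assumed about it is the `"streamUrl":"..."` pair that the snippet above matches.

```javascript
// Hypothetical helper: pull the value of "streamUrl" out of inline script text.
// Returns null when the text contains no such pair.
function extractStreamUrl(scriptText) {
  var match = scriptText.match(/"streamUrl":"([^"]+)"/);
  return match ? match[1] : null;
}

// Made-up sample input, only shaped like the pair the regex looks for.
var sample = 'tracks.push({"id":1,"streamUrl":"http://example.com/stream/abc","title":"x"});';

console.log(extractStreamUrl(sample));
console.log(extractStreamUrl("no url in here"));
```

Using a capture group avoids splitting the match on quote characters, and the null return makes the "nothing found" case explicit.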

The script uses console.log() to print out the title, artist, genre, image link and stream link, one per line. Since PhantomJS is run from the command line, I wrote a Perl script that runs it with the correct arguments for each link in my playlist text file. If you look at the full code, I first initialize an array of steps, then run through each one, waiting with a polling loop for each step to finish. I originally got the idea from this.
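The "array of steps plus polling loop" idea can be sketched roughly as follows. This is not the code from the full script: the names `makeStepRunner` and `fakeStep`, the `{ run, isDone }` step shape, and the timings are all assumptions, and the fake steps stand in for real page loads so the sketch runs in any plain javascript runtime.

```javascript
// Build a tick() function that advances through the steps one poll at a time.
// Each step is { run: function, isDone: function } (hypothetical shape).
function makeStepRunner(steps) {
  var index = 0;
  var started = false;
  return function tick() {
    if (index >= steps.length) {
      return true;                 // nothing left to do
    }
    var step = steps[index];
    if (!started) {
      step.run();                  // kick off the current step once
      started = true;
    } else if (step.isDone()) {
      index += 1;                  // current step finished, move to the next
      started = false;
    }
    return index >= steps.length;  // true once every step has finished
  };
}

// Demo step that "finishes" after a short delay, standing in for a page load.
function fakeStep(name, delayMs) {
  var done = false;
  return {
    run: function () {
      console.log("starting " + name);
      setTimeout(function () { done = true; }, delayMs);
    },
    isDone: function () { return done; }
  };
}

// The polling loop: check on the current step every 25 ms.
var tick = makeStepRunner([fakeStep("open page", 100), fakeStep("scrape", 50)]);
var timer = setInterval(function () {
  if (tick()) {
    clearInterval(timer);
    console.log("all steps finished");
  }
}, 25);
```

Keeping the state machine in `tick()` and the timer outside it means the step logic can be exercised without waiting on real timers.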

Downloading

My Perl script reads the playlist text file, determines the type of link, and downloads it. YouTube links were not implemented. For downloading a single SC song link, it does the following:

if($track_info[4] ne ""){
	#download mp3 (list-form system() skips the shell, so titles and URLs need no quoting)
	system("curl", "-s", "-L", "-o", "$track_info[0].mp3", $track_info[4]);
	if($track_info[3] ne ""){
		#download image
		my $image = "$track_info[0] - Temp Image.jpg";
		system("curl", "-s", "-L", "-o", $image, $track_info[3]);
		#system() returns only after curl exits, so no wait() is needed
		#add image to mp3 file
		system("perl", "mp3art.pl", $image, "$track_info[0].mp3");
		unlink($image);
		#add metadata
		my $mp3 = MP3::Tag->new("$track_info[0].mp3");
		if($mp3){
			my $id3v1 = $mp3->new_tag("ID3v1");
			$id3v1->all("$track_info[0]","$track_info[1]","","",
						"Generated by script",1,"$track_info[2]");
			$id3v1->write_tag();
		}
	}else{
		print "# could not find image URL\n";
	}
}else{
	print "# could not find mp3 URL\n";
}

Downloading an entire artist’s worth of songs is just as simple since my PhantomJS script does all the heavy lifting. The Perl script basically downloads whatever comes its way, while the PhantomJS script navigates from page to page extracting all the songs it finds.

I make use of this in line 10 of the Perl snippet above to add an image to the downloaded mp3 file. Beware, this requires MP3::Tag and ImageMagick. This whole setup was such a pain on Windows that I just ran it on my Mac.
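As for determining the type of link, a rough sketch of that decision is below. It is written in javascript rather than Perl and is not the script's actual logic. The post only mentions YouTube links, single SC songs and artist pages; the URL shapes used to tell them apart (one path segment for an artist, two for a track) and the function name `classifyLink` are assumptions.

```javascript
// Hypothetical classifier for one line of the playlist file.
// Assumed URL shapes: soundcloud.com/<artist> and soundcloud.com/<artist>/<track>.
function classifyLink(link) {
  var m = link.match(/^https?:\/\/(?:www\.)?([^\/]+)\/?(.*)$/);
  if (!m) {
    return "unknown";                       // not an http(s) link at all
  }
  var host = m[1].toLowerCase();
  // Drop any query string or fragment, then count the non-empty path segments.
  var segments = m[2].split(/[?#]/)[0].split("/").filter(function (s) {
    return s !== "";
  });
  if (host === "youtube.com" || host === "youtu.be") {
    return "youtube";
  }
  if (host === "soundcloud.com") {
    if (segments.length === 1) { return "soundcloud-artist"; }
    if (segments.length === 2) { return "soundcloud-track"; }
  }
  return "unknown";
}

console.log(classifyLink("http://soundcloud.com/someartist/some-song"));
console.log(classifyLink("http://soundcloud.com/someartist"));
console.log(classifyLink("https://www.youtube.com/watch?v=abc"));
```

Anything that does not fit one of the assumed shapes falls through to "unknown", so the caller can skip it instead of guessing.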

Photo Evidence

Ha! You almost got me. Of course I wouldn’t actually run this after having paradoxically read and agreed to the Terms of Use the instant I opened SoundCloud.

END OF POST

Archived from my old website. And yes, I do realize that all that soundcloud needs to do is correlate webserver access times to this post but idgaf