And throwing data away (ignoring everything except the first field) is easier than adding data that wasn't there in the first place.
From my experience with large noisy corpora, you really want to throw away the low-frequency stuff -- it is mostly garbage. The signal-to-noise ratio there is very bad.
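To make the two points above concrete, here is a rough sketch of keeping only the first field and dropping low-frequency entries. It assumes a layout I have not described here: tab-separated lines with the sentence in the first field and an integer count in the second. The function name, the separator, and the threshold of 5 are placeholders for illustration, not the real format or a recommended cutoff.

```python
from typing import Iterable, Iterator


def filter_low_frequency(
    lines: Iterable[str], min_count: int = 5, sep: str = "\t"
) -> Iterator[str]:
    """Yield the first field of each line whose count is at least min_count.

    Assumed (hypothetical) layout: sentence <sep> count <sep> other fields...
    """
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        if len(fields) < 2:
            continue  # malformed line: no count field
        try:
            count = int(fields[1])
        except ValueError:
            continue  # count is not an integer: treat the line as noise
        if count >= min_count:
            yield fields[0]  # everything after the first field is ignored
```

Because it is a generator, it can be run over a file handle without loading the corpus into memory, for example `for sentence in filter_low_frequency(open(path, encoding="utf-8")): ...`.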
Ah, I forgot to mention, it's 10B unique sentences.
As for getting the data, it would be OK for you to download it with a rate limit (2 MB/s is OK). It may take a couple of days, but it should not be that bad.
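If you want to check the timing against whatever the final size turns out to be, a back-of-the-envelope helper like the one below is enough. The function name is made up, it assumes decimal units (1 GB = 1000 MB), and the size is something you would plug in yourself.

```python
def transfer_days(size_gb: float, rate_mb_per_s: float = 2.0) -> float:
    """Estimate download time in days at a fixed rate (decimal units assumed)."""
    seconds = size_gb * 1000 / rate_mb_per_s  # GB -> MB, then divide by MB/s
    return seconds / 86400  # 86400 seconds per day
```

At 2 MB/s one day moves 172.8 GB, so two days corresponds to roughly 345.6 GB.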
Arseny